Building a Serverless Cloud Data Lake
Agenda
Introduction
Next Gen Analytics Use Cases
Architecting for Intelligent Cloud Data Lakes
Hands-On Lab
Best Practices for implementing a data lake
Final Q & A, Wrap-up
Technology Transformation
Storage
Analytics
Applications
Databases
Messaging
Compute
Technology Transformation
First-generation cloud services: Azure Data Lake Store, Azure Blob Storage, Google Compute Engine, Azure HDInsight, Amazon Web Services EC2 & EMR
Categories: Storage, Analytics, Applications, Databases, Messaging, Compute
Technology Transformation
First-generation cloud services: Azure Data Lake Store, Azure Blob Storage, Google Compute Engine, Azure HDInsight, Amazon Web Services EC2 & EMR
Serverless successors: ADLS Gen 2, Google Cloud Storage, AWS S3, Google Dataproc, Altus Data Warehouse, Azure Cosmos DB, AWS Neptune, AI/ML Workspace, Data Science Workbench, Azure ML
Categories: Storage, Analytics, Applications, Databases, Messaging, Compute
How to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics
Technology Transformation
Data
Latency
Users
Regulations
Cloud
Spark Code

package main.scala

import org.apache.spark.sql.DataFrame
import org.apache.spark.SparkContext
import org.apache.spark.sql.functions.sum
import org.apache.spark.sql.functions.udf

/**
 * TPC-H Query 3
 */
class Q03 extends TpchQuery {

  override def execute(sc: SparkContext, schemaProvider: TpchSchemaProvider): DataFrame = {
    // this is used to implicitly convert an RDD to a DataFrame.
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._
    import schemaProvider._

    val decrease = udf { (x: Double, y: Double) => x * (1 - y) }

    val fcust = customer.filter($"c_mktsegment" === "BUILDING")
    val forders = order.filter($"o_orderdate" < "1995-03-15")
    val flineitems = lineitem.filter($"l_shipdate" > "1995-03-15")

    fcust.join(forders, $"c_custkey" === forders("o_custkey"))
      .select($"o_orderkey", $"o_orderdate", $"o_shippriority")
      .join(flineitems, $"o_orderkey" === flineitems("l_orderkey"))
      .select($"l_orderkey",
        decrease($"l_extendedprice", $"l_discount").as("volume"),
        $"o_orderdate", $"o_shippriority")
      .groupBy($"l_orderkey", $"o_orderdate", $"o_shippriority")
      .agg(sum($"volume").as("revenue"))
      .sort($"revenue".desc, $"o_orderdate")
      .limit(10)
  }
}
SQL Query

select l_orderkey,
       sum(l_extendedprice * (1 - l_discount)) as revenue,
       o_orderdate, o_shippriority
from CUSTOMER, ORDERS, LINEITEM
where c_mktsegment = 'AUTOMOBILE'
  and c_custkey = o_custkey
  and l_orderkey = o_orderkey
  and o_orderdate < date '1995-03-13'
  and l_shipdate > date '1995-03-13'
group by l_orderkey, o_orderdate, o_shippriority
order by revenue desc, o_orderdate
limit 10;
Disruptors
Data
Latency
Users
Regulations
Cloud
Data Management Accelerators
Data
Latency
Users
Regulations
Cloud
Modern Data Integration Patterns
Metadata
AI/ML
Governance
Hybrid
Fragmented Approaches Create More Technical Debt And Complexity
Can your team afford to be on call to support and maintain these manual, one-off,
time-consuming, and complex approaches?
Integrate, Stream, Catalog, Prepare, Protect, Match, Enrich: each step stitched to the next with hand-coding
End-to-End Data Management for Next-Gen Analytics
ANY DATA, ANY LATENCY, ANY USER, ANY REGULATION, ANY CLOUD / ANY TECHNOLOGY
Functions: INGEST, STREAM, INTEGRATE, CLEANSE, PREPARE, DEFINE, CATALOG, RELATE, PROTECT, ENRICH, DELIVER
Foundation: METADATA, GOVERNANCE, HYBRID, MODERN DATA INTEGRATION PATTERNS
Cloud-Ready Big Data Management
The Ever-Evolving Big Data Technology

Big Data 1.0 (on-premises, manual deployment): HDFS, MapReduce, NoSQL
Big Data 2.0 (hosted, manual deployment): HDFS, Hive, MapR FS, YARN, SQOOP, …
Big Data 3.0 (fully automated cloud deployment, managed serverless deployment): elastic compute, compute as a service, storage as a service (S3, Redshift, Blob, ADLS, …)
Big Data 2.0: On-Premises Big Data Deployment

Set up hardware → Select Hadoop distro → Configure cluster → Design data flows → Monitor cluster → Manually scale cluster

• Requires Big Data skills
• Requires Hadoop admins
• High operational cost
Big Data 3.0: Serverless Deployment with Integration at Scale

Design data flows → Auto-scale cluster

(Setting up hardware, selecting a Hadoop distro, and configuring and monitoring the cluster are no longer manual steps.)
Cloud is a Huge Leap Forward for Data Lakes

On-premises Data Lakes (Agility Inhibitors):
1. Tightly coupled data compute and storage
2. Expansions are slow to provision
3. Contention for compute capacity with highly variable compute profiles

Cloud Data Lakes (Agility Enablers):
1. Separation of data compute and storage
2. Low-cost, infinite scale-out data persistence
3. Fit-for-purpose, scale-on-demand compute engines
Enhanced Cloud Ecosystem Support
Choose your cloud confidently for processing Big Data workloads

Google Cloud: BigQuery, Cloud Storage, Dataproc, Bigtable, Cloud Datastore, Cloud SQL
Microsoft Azure: HDInsight, Blob Storage, Cosmos DB, ADLS, SQL DW, Event Hubs, Azure Databricks
AWS: S3, EMR, Redshift, Kinesis, Kinesis Firehose
Where Does Informatica Fit?

ANALYTICS
DATA MANAGEMENT: Ingest, Integrate, Cleanse, Secure, Catalog, Prepare, Govern
INFRASTRUCTURE & STORAGE
Cloud Ready Reference Architecture: Amazon AWS

Sources (RELATIONAL, DEVICE DATA, WEBLOGS) → S3 input bucket → EMR / Qubole → S3 output bucket → Redshift

Pipeline: ACQUIRE, INGEST, PREPARE, CATALOG, SECURE, GOVERN, ACCESS, CONSUME
Catalog services: CATALOG, SEARCH, LINEAGE, RECOMMENDATIONS, PARSE, MATCH
Cloud Ready Reference Architecture: Microsoft Azure

Sources (RELATIONAL, DEVICE DATA, WEBLOGS) → ADLS / Blob → HDInsight / Azure Databricks → ADLS / Blob → SQL Data Warehouse

Pipeline: ACQUIRE, INGEST, PREPARE, CATALOG, SECURE, GOVERN, ACCESS, CONSUME
Catalog services: CATALOG, SEARCH, LINEAGE, RECOMMENDATIONS, PARSE, MATCH
Cloud Ready Reference Architecture: Google Cloud

Sources (RELATIONAL, DEVICE DATA, WEBLOGS) → Cloud Storage → Dataproc → Cloud Storage → BigQuery

Pipeline: ACQUIRE, INGEST, PREPARE, CATALOG, SECURE, GOVERN, ACCESS, CONSUME
Catalog services: CATALOG, SEARCH, LINEAGE, RECOMMENDATIONS, PARSE, MATCH
Steps to Build a Data Lake

1. Catalog, Govern and Secure
2. Data Ingestion
3. Data Preparation
4. Data Curation
5. Stream Processing and Analytics
6. Data Delivery
Designing A Data Lake – High Level Architecture

Flow: Sources → LANDING → RAW → CURATION (Curated, Structured, Cleansed) → ENTERPRISE ZONE (Enriched, Formatted, Consumption Ready) → Data Products
DISCOVERY ZONE: Self-Service Analytics
ADVANCED ANALYTICS: Models
STREAM PROCESSING & ANALYTICS runs alongside the batch flow
Supporting layers: DATA CATALOG, DATA QUALITY & GOVERNANCE, DATA PRIVACY AND PROTECTION, DATA INFRASTRUCTURE
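To make the zone flow concrete, here is a minimal Spark sketch (in the deck's own Scala) of promoting a dataset from the landing zone through raw into the enterprise zone. The s3a:// paths, dataset, and column names are illustrative assumptions, not part of the blueprint.

// Illustrative sketch of promoting data through lake zones; paths and
// column names are hypothetical, not part of the reference architecture.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, trim, upper}

object ZonePromotion {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ZonePromotion").getOrCreate()

    // LANDING: files arrive as-delivered from source systems.
    val landed = spark.read.option("header", "true").csv("s3a://lake/landing/orders/")

    // RAW: persist unmodified data in a columnar format for cheap re-processing.
    landed.write.mode("overwrite").parquet("s3a://lake/raw/orders/")

    // CURATION: cleanse and standardize before exposing to the enterprise zone.
    val curated = spark.read.parquet("s3a://lake/raw/orders/")
      .filter(col("order_id").isNotNull)
      .withColumn("country", upper(trim(col("country"))))
    curated.write.mode("overwrite").parquet("s3a://lake/enterprise/orders/")
  }
}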
Intelligent Data Lake Blueprint

Batch sources (Batch Lane): Apps, Data Warehouse, Databases, Application Servers, Documents, Mainframe, Cloud
Streaming sources (Fast Lane): Machine Data, Cloud, Mobile, Social, Log files

1. Metadata Intelligence Foundation (Data Steward)
• Data Catalog: Discover, Classify, Relationship, Lineage, Data Statistics
• Data Governance & Quality: Standards, Policies, Procedures, Quality, Integrity, Stewardship
• Data Privacy & Protection: Analyze, Identify, Classify, Detect, Risk Score, User Behavior Analysis
2. Ingest: Batch/Mass Ingest, Change Data Capture, Edge Data Collect/Stream
3. Data Preparation (Data Analyst): Discover, Explore, Prepare, Recommend, Collaborate, Publish, Operationalize, Monitor
4. Data Curation (Data Engineer): Profile, Integrate, Transform, Parse, Cleanse, Mask, Match, Monitor
5. Stream Processing & Analytics (Data Engineer): Process, Analyze, Standardize, Cleanse, Deliver

Deliver (Publish/Subscribe) to: Data Warehouse, Master Data Management, Advanced Analytics, Historical Analysis, Machine Learning, Real-Time Visualization, Alerts, Business Process Automation
#1 Supporting Layer of a Data Lake
Supporting Layer of a Data Lake (1)

Data Catalog
• Analysts spend countless hours finding the right data assets for their analysis.
• A properly governed catalog (that is, a catalog augmented with business definitions and quality information) greatly reduces this time to discover.

Data Governance & Quality
• You can only trust your reports and analysis if you trust the underlying data.
• Governance provides that trust by showing where data came from and how it has been transformed, and by adding quality rules and the associated scores to the catalog.

Data Privacy & Protection
• Data governance provides the necessary controls on who can access which data in the data lake.
• It provides input for data security mechanisms (such as authorization or masking) to prevent misuse of data.
Automate Data Discovery, Cataloging, and Linking
Leverage ML and AI to find
critical data across structured
and unstructured sources
Onboard discovered data
automatically with oversight
and control
Automatically tag data with
business context to help users
assess relevance
Automate Lineage Discovery and Process Mapping
Automatically produce data
lineage from scanned and
discovered data movement
Onboard lineage and process
mapping automatically, with
oversight and control
Understand where data comes
from, how and where it’s used,
and who’s responsible for it
Automate Fit for Purpose Data Quality
Automate rule generation
from policy definitions and
metadata
Automate rule enforcement
in systems and business
processes
Monitor quality improvement
across systems and
processes over time
Automate Data Subject Discovery, Proliferation,
Consent and Protection
Define policies to protect sensitive
data, assign ownership and
accountability
Automate discovery and classification
of sensitive data, across structured
and unstructured sources
Automate identification of data
subjects, and assessment
of risk exposure
Demonstration
Enterprise Data Catalog
#2 Data Ingestion
(The Intelligent Data Lake Blueprint diagram repeats here, with step 2, Data Ingestion, highlighted.)
Data Ingestion

Data ingestion is the process of collecting data from source systems or source locations, either in batch mode or in real time (streaming), and loading this data into the data lake.
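As a concrete illustration of batch ingestion, here is a minimal Spark sketch that snapshots a relational table into the raw zone. The JDBC URL, credentials, table, and paths are placeholders, and this generic job only stands in for the ingestion services described below.

// Minimal batch-ingestion sketch: pull a relational table into the lake.
// The JDBC URL, credentials, table, and paths are placeholders.
import org.apache.spark.sql.SparkSession

object BatchIngest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("BatchIngest").getOrCreate()

    val customers = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://src-db:5432/sales")
      .option("dbtable", "public.customers")
      .option("user", "ingest")
      .option("password", sys.env("SRC_DB_PASSWORD"))
      .load()

    // Land the snapshot in the raw zone, partitioned for downstream pruning
    // (assumes a country column exists on the hypothetical table).
    customers.write
      .mode("overwrite")
      .partitionBy("country")
      .parquet("s3a://lake/raw/customers/")
  }
}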
Ingesting Data Into The Data Lake

Batch Data Ingestion
• Ingest or replicate large amounts of data in batch mode into RDBMSs, files, cloud data warehouses, or Hadoop.
• Either build ingest mappings manually (a specialized ingestion process) or use Dynamic Mapping patterns to automate generation of large numbers of ingestion mappings.
• Or use the Mass Ingestion Service for initial and delta loads.

Change Data Capture
• Use change data capture to capture only changed data from a data source.
• A wide array of sources is available, ranging from mainframe to midrange to the most popular RDBMSs that support change capture.
• Ingest directly into the target database (using PowerCenter) or into a Kafka queue for further streaming processing (using PowerExchange for CDC with the Kafka Publisher).

Edge Data Streaming
• Collect data from streaming data sources such as HTTP streams, sensors, and log files, and push the data directly into the data lake or onto a Kafka queue (see the streaming sketch below).
• For data pushed onto Kafka, use Big Data Streaming to augment, enrich, validate, and cleanse the data and use it for streaming analytics.
• Generate events to trigger actions based on stream processing (e.g., next best offer in retail, or controlling industrial processes with predictive maintenance).
• Embed machine learning models using Python or Java.
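A minimal sketch of the streaming-ingestion path, assuming a Kafka topic as the source; brokers, topic, and paths are placeholders, and this generic Spark Structured Streaming job stands in for the Edge Data Streaming and Big Data Streaming products.

// Sketch of streaming ingestion: consume a Kafka topic and append it to the
// lake continuously. Broker address, topic, and paths are placeholders.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object StreamIngest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("StreamIngest").getOrCreate()

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "device-events")
      .load()
      .select(col("key").cast("string"), col("value").cast("string"), col("timestamp"))

    // Append micro-batches to the raw zone; the checkpoint gives exactly-once
    // file output across restarts.
    events.writeStream
      .format("parquet")
      .option("path", "s3a://lake/raw/device_events/")
      .option("checkpointLocation", "s3a://lake/_checkpoints/device_events/")
      .start()
      .awaitTermination()
  }
}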
High-Speed Mass Ingestion
Rely on an easy-to-use, fast, and scalable approach with no hand-coding

• Ingest data from various source systems, including relational tables, into cloud and big data stores
• Uses high-performance connectivity, mass ingestion, and dynamic mappings
• Self-service UI to perform mass ingestion of initial as well as incremental loads
Change Data Capture

Step 1, Bulk Movement: bulk movement of data from one or more sources to targets (Source Data → Target Database).
Step 2: incrementally take source changes and apply them to the target.
Step 3: continuously take source changes and apply them to the target.

A generic sketch of the apply-changes idea follows.
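// Generic sketch of Step 2 ("incrementally apply source changes to the
// target") in plain Spark: union the current target with a batch of captured
// changes and keep the newest row per key. This only illustrates the idea;
// it is NOT how PowerExchange CDC applies changes. It assumes both datasets
// share one schema with hypothetical columns account_id, change_ts (when the
// row version was produced), and op ("I"/"U"/"D").
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object ApplyChanges {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ApplyChanges").getOrCreate()

    val target  = spark.read.parquet("s3a://lake/enterprise/accounts/")
    val changes = spark.read.parquet("s3a://lake/raw/accounts_cdc/")

    // Newest version of each key wins; deletes drop the key entirely.
    val newestFirst = Window.partitionBy("account_id").orderBy(col("change_ts").desc)
    val applied = target.unionByName(changes)
      .withColumn("rn", row_number().over(newestFirst))
      .filter(col("rn") === 1 && col("op") =!= "D")
      .drop("rn")

    applied.write.mode("overwrite").parquet("s3a://lake/enterprise/accounts_v2/")
  }
}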
Enterprise Data Streaming

Capture and Ingest (Sense): Relational Systems feed changes through PowerExchange CDC Publisher; Machine Data/IoT, Sensor Data, WebLogs, and Social Media feed through Edge Data Streaming into a Message Hub (e.g., Kafka or Azure Event Hub).
Enrich, Process & Analyze (Reason): Big Data Streaming applies Filter, Transform, Aggregate, and Enrich steps (the streaming counterpart of Extract, Transform, Load).
Act: persist to the data lake, drive real-time dashboards and real-time offer alerts, and trigger business processes.
#3 Data Preparation
(The Intelligent Data Lake Blueprint diagram repeats here, with step 3, Data Preparation, highlighted.)
Data Preparation

Data preparation is the collaborative, self-service process by which data analysts discover and prepare data, rapidly turning raw data into insights with quality and governance.
Why Do We Need A Data Preparation Solution?

Business/Data Analysts
• Difficulty finding trusted data
• Limited access to the data
• Frustrated by slow response from IT
• Constrained by disparate tools and manual steps
• No way to collaborate, share, and update curated datasets, or reuse knowledge

IT/Data Engineers
• Can't cope with growing demand for data from the business
• No visibility into what the business is doing with the data
• Struggling to deliver value to the business
• Losing the ability to govern and manage data as an asset
Enterprise Data Preparation

Discover and Understand
• Discover data by using Enterprise Data Catalog to search for relevant data assets
• Understand the context and meaning of the data and see lineage to gain more trust in the data

Intelligently Prepare Data
• Prepare the data with an Excel-style visual data preparation module
• Supported by AI to make smarter decisions and improve productivity
• Visually validate the results directly in the preparation module
• Data preparation is a fully governed process, so newly added datasets are immediately added to the catalog with full lineage details available

Operationalize at Scale
• Operationalize the data preparation recipe for operation at scale
• Allow for visual inspection of data using integration with Zeppelin
• Save the recipe as a data pipeline that can be maintained and scheduled by the IT organization responsible for the data lake
Enterprise Data Preparation
Collaborative, self-service data discovery and preparation at scale

• Discover, search, and explore data assets using the AI-driven Enterprise Data Catalog
• Use an Excel-like interface for advanced data preparation to blend, transform, cleanse, enrich, and shape data, with hundreds of pre-built DQ rules
• Operationalize with self-service scheduling and reusable workflows with Spark support
• Visualize with Apache Zeppelin
Enterprise Data Preparation Steps for Data Prep

Search and Discover → Prepare → Publish → Operationalize → Schedule → Visualize
(plus Upload, Download, Import, Export, and Collaborate along the way)

Built on: Enterprise Data Catalog; Big Data Management, Big Data Quality, Big Data Masking; the Data Lake
Enterprise Data Preparation – Differentiators

Holistic Self-service
• Enterprise Data Catalog
• Advanced Data Wrangling
• Data Visualization Integration
• Operationalization for Business and IT Collaboration
• Spark support with Autoscaling
• Dynamic Data Masking support

CLAIRE
• Discovery, identification, and assessment for best data
• Next-best-action and data set recommendations
• Smart chart recommendations for data visualization

Advanced Data Prep
• Excel-like data preparation
• Extensibility with rules built in other tools
• Hundreds of pre-built data quality rules for validation, parsing, standardization, matching, and consolidation
Hands-on Lab
Data preparation with EDP
Leverage Enterprise Data Preparation to quickly discover, prepare, and deliver data for analytics.
#4 Data Curation
(The Intelligent Data Lake Blueprint diagram repeats here, with step 4, Data Curation, highlighted.)
Data Curation

Data curation is the process of ensuring data has the right structure and quality and is stored in the right format, so consumers can trust that the data is correct.
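A minimal curation sketch in Spark: profile completeness, standardize formats, and deduplicate before publishing as trusted data. The dataset and column names are illustrative assumptions, not the product's curation mappings.

// Minimal curation sketch: profile, standardize, and deduplicate a raw
// dataset before publishing it as trusted. Columns are illustrative.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, lit, lower, regexp_replace, trim}

object CurateCustomers {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CurateCustomers").getOrCreate()
    val raw = spark.read.parquet("s3a://lake/raw/customers/")

    // Profile: how complete is a key field?
    raw.select(count(lit(1)).as("rows"), count(col("email")).as("emails_present")).show()

    // Standardize and deduplicate.
    val curated = raw
      .withColumn("email", lower(trim(col("email"))))
      .withColumn("phone", regexp_replace(col("phone"), "[^0-9+]", ""))
      .dropDuplicates("customer_id")

    curated.write.mode("overwrite").parquet("s3a://lake/enterprise/customers/")
  }
}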
Informatica Big Data Management

Simple: graphical, no code; zero config, install, and footprint; best-of-breed data management
Robust: latest Spark enhancements, data science integration, better operations
Agnostic: agnostic to all end-to-end hybrid cloud big data ecosystems, with no regression
Leverage the Power of No-Code Interface

(The hand-coded SQL query and Spark code shown earlier are contrasted here with the equivalent visual BDM Mapping.)

Future-proof your investments: design once and run on a best-of-breed engine
Advanced Spark Support
Take advantage of the latest innovation, performance, and scaling benefits
Spark Structured Streaming Support
Handle streaming data based on event time
instead of processing time
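A minimal sketch of what event-time handling looks like in Spark Structured Streaming: a windowed aggregate keyed on the timestamp embedded in each event, with a watermark bounding how late data may arrive. The topic, brokers, and JSON schema are assumptions.

// Event-time aggregation sketch: group by when the event occurred, not when
// it arrived; the watermark bounds how long late events are accepted.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col, from_json, window}
import org.apache.spark.sql.types._

object EventTimeAgg {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("EventTimeAgg").getOrCreate()
    val schema = new StructType()
      .add("device_id", StringType).add("reading", DoubleType).add("event_ts", TimestampType)

    val readings = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "sensor-readings")
      .load()
      .select(from_json(col("value").cast("string"), schema).as("e"))
      .select("e.*")

    readings
      .withWatermark("event_ts", "10 minutes")            // tolerate 10 min of lateness
      .groupBy(window(col("event_ts"), "5 minutes"), col("device_id"))
      .agg(avg("reading").as("avg_reading"))
      .writeStream
      .outputMode("append")
      .format("console")
      .start()
      .awaitTermination()
  }
}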
Azure Databricks Support
Leverage the compute power of Databricks on
Azure for big data processing
Integrated Data Science Support
Operationalize machine learning models with
Python transformations
Schema Drift Handling
Handle complex structures and their changes for both batch and streaming data (see the sketch below)
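One common schema-drift tactic in plain Spark, shown as a hedged sketch: merge Parquet schemas across files so columns added by the source appear automatically, with nulls for older rows. This illustrates the problem space rather than the product's drift-handling mechanism; the path is a placeholder.

// Sketch: read Parquet written both before and after the source added a
// column; the unified schema is the union of all file schemas.
import org.apache.spark.sql.SparkSession

object SchemaDriftRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SchemaDriftRead").getOrCreate()

    val df = spark.read
      .option("mergeSchema", "true")
      .parquet("s3a://lake/raw/device_events/")

    df.printSchema() // older rows simply carry nulls in the new columns
  }
}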
Operational Insights
Deliver predictive operational insights about your
big data environments
Demo
Processing data at scale with
Big Data Management
#5 Stream Processing & Analytics
(The Intelligent Data Lake Blueprint diagram repeats here, with step 5, Stream Processing & Analytics, highlighted.)
Streaming Data Management

Streaming enables customers to design data flows that continuously capture, prepare, and process streams of unbounded data. It provides a single platform for customers to discover insights and build models that can then be operationalized to run in near real time, capturing and realizing the value of high-velocity data.
Streaming Data Management

Streaming Data Ingestion: collect streaming data from various streaming and IoT endpoints, with multi-latency ingestion into the lake or a messaging hub
Streaming Data Enrichment: enrich and distribute streaming data in real time for business-user consumption
Real-time Actions: operationalize actions based on insights from streaming data
Stream Processing and Analytics Use Cases (5)

Predictive Maintenance & Smart Factory: identify stress signals coming from devices and act on them before it's too late
Real-time offer management: combine web searches and camera feeds to identify the customer and roll out real-time offers
Clinical Research Optimization: collect and process bedside monitor data so clinical researchers can more effectively understand and detect disease
Real-time customer KPI generation: use KPI metrics to retain churning customers and make offers
Enterprise Streaming Data Management
Solution overview

Capture and Ingest (Sense): Relational Systems feed changes through PowerExchange CDC Publisher; Machine Data/IoT, Sensor Data, WebLogs, and Social Media feed through Edge Data Streaming and IICS Streaming Ingestion (New!) into a Message Hub (e.g., Azure Event Hub).
Enrich, Process and Analyze (Reason): Big Data Streaming applies Filter, Transform, Aggregate, and Enrich steps.
Act: persist to the data lake, drive real-time dashboards, send real-time offer alerts and SMS, and trigger business processes.
Edge Data Streaming
Highly scalable stream data collection and ingestion

• Distributed and broker-less, with lightweight agents for high-performance event ingestion
• Flexible and wide connectivity, with out-of-the-box compression, encryption, and transformations
• Easy administration, configuration, monitoring, and auto-deployment

Sources: flat files, JMS, Syslog, TCP/UDP, HTTP, WebSocket, Ultra Messaging, MQTT, OPC-DA
Targets: Kafka (also Kerberized), HDFS (CDH, HDP, MapR), Cassandra NoSQL, WebSocket(S), Amazon Kinesis, Azure Event Hub, JMS, MQTT
Transformations: RegEx filtering, timestamp, insert string (plus custom, via the SDK)
Informatica Big Data Streaming
Continuous event processing of unbounded big data

Real-time, data-centric sources: machine and device data, cloud, documents and emails, relational and mainframe, social media, web logs. Message-centric sources: PWX CDC, EDS, Kafka, JMS, MapR, cloud.

BDS provides stream handling and event processing, delivering to RDBMS, HDFS, HBase, cloud, Kafka, JMS, MapR, EDS, and EMR targets, on the Informatica Intelligent Data Platform alongside batch data integration, lineage, governance, and security.

Outcomes: improve asset utilization, increase operational efficiency, reduce security and safety risk.

• Zero code: streaming ingestion and integration with Apache Kafka and Spark Streaming
• Flexible and agnostic: supports all on-premises, hybrid, or full-cloud Hadoop distributions
• Integrated: sliding and tumbling windows for moving averages, Python for machine learning, and more
Informatica Streaming – Key Differentiators

• Metadata-based design and real-time enrichments
• Engine abstraction
• Stream data ingestion and processing in cloud
• Out-of-the-box connectivity to on-prem and cloud endpoints
• ML model operationalization
• Parsing of complex unstructured data
Best Practices
Best Practices for Cloud Data Lake Management

• Ensure you apply data governance and security policies to protect sensitive data
• Leverage AI/machine learning to enhance the productivity of all users of the platform
• Empower collaboration so the data lake is "everyone's lake"
• Catalog your data; prevent the data lake from becoming a swamp
• Integrate data pipeline development into your CI/CD/DevOps flow
• Curate and cleanse data for consumption to increase trust
Parting thoughts

"Do the difficult things while they are easy and do the great things while they are small. A journey of a thousand miles must begin with a single step."
- Lao Tzu

More Related Content

PPTX
Customer-Centric Data Management for Better Customer Experiences
PPTX
Azure Security Center- Zero to Hero
PDF
Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021
PPTX
Data Leakage Prevention
PPTX
Microsoft Azure
PPTX
ExpertsLive NL 2022 - Microsoft Purview - What's in it for my organization?
PDF
Data Quality Best Practices
PPTX
Disaster Recovery Planning
Customer-Centric Data Management for Better Customer Experiences
Azure Security Center- Zero to Hero
Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021
Data Leakage Prevention
Microsoft Azure
ExpertsLive NL 2022 - Microsoft Purview - What's in it for my organization?
Data Quality Best Practices
Disaster Recovery Planning

What's hot (20)

PPTX
Chapter 4: Data Architecture Management
PDF
Data Architecture - The Foundation for Enterprise Architecture and Governance
PDF
Ebook - The Guide to Master Data Management
PDF
Artifacts to Enable Data Goverance
PPTX
Azure Fundamentals Part 1
 
PPTX
Azure Site Recovery
PPTX
Introduction to Azure monitor
PDF
Choosing the Right Cloud Provider
PDF
Mobile Backend as a Service(MBaaS)
PPTX
You Need a Data Catalog. Do You Know Why?
PDF
Microsoft Azure Security Overview
PPTX
Cloud computing and migration strategies to cloud
PPTX
Understanding cloud with Google Cloud Platform
PPTX
Aws storage
PDF
Microsoft Defender and Azure Sentinel
PDF
2 08 client-server architecture
PPTX
Deep dive into Microsoft Purview Data Loss Prevention
PPTX
CCS336-Cloud-Services-Management-Lecture-Notes-1.pptx
PDF
[금융고객을 위한 Resiliency in the Cloud] 최근 대규모 장애 사태 여파에 따른 DR 도...
PDF
Introduction to Google Cloud Platform
Chapter 4: Data Architecture Management
Data Architecture - The Foundation for Enterprise Architecture and Governance
Ebook - The Guide to Master Data Management
Artifacts to Enable Data Goverance
Azure Fundamentals Part 1
 
Azure Site Recovery
Introduction to Azure monitor
Choosing the Right Cloud Provider
Mobile Backend as a Service(MBaaS)
You Need a Data Catalog. Do You Know Why?
Microsoft Azure Security Overview
Cloud computing and migration strategies to cloud
Understanding cloud with Google Cloud Platform
Aws storage
Microsoft Defender and Azure Sentinel
2 08 client-server architecture
Deep dive into Microsoft Purview Data Loss Prevention
CCS336-Cloud-Services-Management-Lecture-Notes-1.pptx
[금융고객을 위한 Resiliency in the Cloud] 최근 대규모 장애 사태 여파에 따른 DR 도...
Introduction to Google Cloud Platform
Ad

Similar to How to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics (20)

PPTX
Deploying a Governed Data Lake
PPTX
From raw data to business insights. A modern data lake
PDF
IBM Cloud Day January 2021 - A well architected data lake
PDF
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
PPTX
Top Trends in Building Data Lakes for Machine Learning and AI
PDF
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...
PPTX
Sören Eickhoff, Informatica GmbH, "Informatica Intelligent Data Lake – Self S...
PDF
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
PDF
Big Data, Ingeniería de datos, y Data Lakes en AWS
PDF
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
PPTX
Hadoop and Your Data Warehouse
PDF
When and How Data Lakes Fit into a Modern Data Architecture
PPTX
Data modeling trends for analytics
PDF
The Marriage of the Data Lake and the Data Warehouse and Why You Need Both
PDF
Building a Modern Data Platform in the Cloud. AWS Initiate Portugal
PDF
Big Data & Analytics - Innovating at the Speed of Light
PDF
Building a modern data platform on AWS. Utrecht AWS Dev Day
PDF
Big data and Analytics on AWS
PDF
PDF
From ingest to insights with AWS
Deploying a Governed Data Lake
From raw data to business insights. A modern data lake
IBM Cloud Day January 2021 - A well architected data lake
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
Top Trends in Building Data Lakes for Machine Learning and AI
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...
Sören Eickhoff, Informatica GmbH, "Informatica Intelligent Data Lake – Self S...
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWS
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
Hadoop and Your Data Warehouse
When and How Data Lakes Fit into a Modern Data Architecture
Data modeling trends for analytics
The Marriage of the Data Lake and the Data Warehouse and Why You Need Both
Building a Modern Data Platform in the Cloud. AWS Initiate Portugal
Big Data & Analytics - Innovating at the Speed of Light
Building a modern data platform on AWS. Utrecht AWS Dev Day
Big data and Analytics on AWS
From ingest to insights with AWS
Ad

Recently uploaded (20)

PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PDF
CIFDAQ's Market Insight: SEC Turns Pro Crypto
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PPTX
Cloud computing and distributed systems.
PDF
Unlocking AI with Model Context Protocol (MCP)
PPTX
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
PDF
KodekX | Application Modernization Development
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PPT
“AI and Expert System Decision Support & Business Intelligence Systems”
PDF
Network Security Unit 5.pdf for BCA BBA.
PPT
Teaching material agriculture food technology
PDF
Machine learning based COVID-19 study performance prediction
PDF
NewMind AI Weekly Chronicles - August'25 Week I
20250228 LYD VKU AI Blended-Learning.pptx
Spectral efficient network and resource selection model in 5G networks
Chapter 3 Spatial Domain Image Processing.pdf
CIFDAQ's Market Insight: SEC Turns Pro Crypto
Reach Out and Touch Someone: Haptics and Empathic Computing
Digital-Transformation-Roadmap-for-Companies.pptx
Advanced methodologies resolving dimensionality complications for autism neur...
Cloud computing and distributed systems.
Unlocking AI with Model Context Protocol (MCP)
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
KodekX | Application Modernization Development
The Rise and Fall of 3GPP – Time for a Sabbatical?
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
Diabetes mellitus diagnosis method based random forest with bat algorithm
“AI and Expert System Decision Support & Business Intelligence Systems”
Network Security Unit 5.pdf for BCA BBA.
Teaching material agriculture food technology
Machine learning based COVID-19 study performance prediction
NewMind AI Weekly Chronicles - August'25 Week I

How to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics

  • 2. 2 © Informatica. Proprietary and Confidential.2 Agenda Introduction Next Gen Analytics Use Cases Architecting for Intelligent Cloud Data Lakes Hands-On Lab Best Practices for implementing a data lake Final Q & A, Wrap-up
  • 4. Technology Transformation TABLES Azure Data Lake Store Azure Blob Storage Google Compute Engine Azure HDInsight Amazon Web Services EC2 & EMP Storage Analytics Applications Databases Messaging Compute
  • 5. Technology Transformation TABLES Azure Data Lake Store Azure Blob Storage Google Compute Engine Azure HDInsight Amazon Web Services EC2 & EMP ADLS Gen 2 Google Cloud Storage AWS S3 Serverless Google Dataproc Altus Data Warehouse Azure Cosmos DB AWS Neptune Storage Analytics Applications Databases Messaging Compute AI/ML Workspace Data Science Workbench Azure ML
  • 7. 7 © Informatica. Proprietary and Confidential.7 Technology Transformation Data Latency Users Regulations Cloud Spark Code package main.scala import org.apache.spark.sql.DataFrame import org.apache.spark.SparkContext import org.apache.spark.sql.functions.sum import org.apache.spark.sql.functions.udf /** * TPC-H Query 3 * */ class Q03 extends TpchQuery { override def execute(sc: SparkContext, schemaProvider: TpchSchemaProvider): DataFrame = { // this is used to implicitly convert an RDD to a DataFrame. val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.implicits._ import schemaProvider._ val decrease = udf { (x: Double, y: Double) => x * (1 - y) } val fcust = customer.filter($"c_mktsegment" === "BUILDING") val forders = order.filter($"o_orderdate" < "1995-03-15") val flineitems = lineitem.filter($"l_shipdate" > "1995-03-15") fcust.join(forders, $"c_custkey" === forders("o_custkey")) .select($"o_orderkey", $"o_orderdate", $"o_shippriority") .join(flineitems, $"o_orderkey" === flineitems("l_orderkey")) .select($"l_orderkey", decrease($"l_extendedprice", $"l_discount").as("volume"), $"o_orderdate", $"o_shippriority") .groupBy($"l_orderkey", $"o_orderdate", $"o_shippriority") .agg(sum($"volume").as("revenue")) .sort($"revenue".desc, $"o_orderdate") .limit(10) } } SQL Query select l_orderkey, sum(l_extendedprice * (1 - l_discount)) as revenue, o_orderdate, o_shippriority from CUSTOMER, ORDERS, LINEITEM where c_mktsegment = 'AUTOMOBILE' and c_custkey = o_custkey and l_orderkey = o_orderkey and o_orderdate < date '1995-03-13' and l_shipdate > date '1995-03-13' group by l_orderkey, o_orderdate, o_shippriority order by revenue desc, o_orderdate limit 10;
  • 8. 8 © Informatica. Proprietary and Confidential.8 Disruptors Data Latency Users Regulations Cloud
  • 9. 9 © Informatica. Proprietary and Confidential.9 Data Management Accelerators Data Latency Users Regulations Cloud ModernDataIntegrationPatterns Metadata AI/ML Governance Hybrid
  • 10. 10 © Informatica. Proprietary and Confidential. Fragmented Approaches Create More Technical Debt And Complexity Can your team afford to be on call to support and maintain these manual, one-off, time-consuming, and complex approaches? Integrate Hand-coding Hand-coding Protect PrepareCatalog Stream Hand-coding Match Hand-coding Enrich Hand-coding
  • 11. End-to-End Data Management for Next-Gen Analytics ANY DATA ANY REGULATION ANY USER ANY CLOUD / ANY TECHNOLOGY ANY LATENCY METADATA GOVERNANCE INGEST STREAM INTEGRATE CLEANSE PREPARE DEFINE CATALOG RELATE PROTECT DELIVERENRICH HYBRID MODERN DATA INTEGRATION PATTERNS
  • 13. 13 © Informatica. Proprietary and Confidential. Elastic compute S3, Redshift, … Blob, ADLS, … Fully Automated Cloud Deployment Managed Serverless Deployment Compute as a Service Storage as a Service The Ever-Evolving Big Data Technology Big Data 3.0 On-premises Manual Deployment NoSQL HDFS Map Reduce Big Data 1.0 Hosted Manual Deployment YARN, SQOOP, … HDFS Hive MapR FS Big Data 2.0
  • 14. 14 © Informatica. Proprietary and Confidential. Big Data 2.0: On-Premises Big Data Deployment Select Hadoop distro Configure cluster Monitor cluster Manually scale cluster • Requires Big Data skills • Requires Hadoop admins • High operational cost Set up hardware Design data flows
  • 15. 15 © Informatica. Proprietary and Confidential. Big Data 3.0: Serverless Deployment with Integration at Scale Configure cluster Monitor cluster Set up hardware Select Hadoop distro Design data flows Auto scale cluster
  • 16. 16 © Informatica. Proprietary and Confidential.16 Cloud is a Huge Leap Forward for Data Lakes 1. Separation of data compute and storage 2. Low-cost, infinite-scale out data persistence 3. Fit-for-purpose, scale-on-demand compute engines 1. Tightly coupled data compute and storage 2. Expansions are slow to provision 3. Contention for compute capacity with highly variable compute profiles Agility Inhibitors Agility Enablers On-premises Data Lakes Cloud Data Lakes
  • 17. 17 © Informatica. Proprietary and Confidential.17 Enhanced Cloud Ecosystem Support Choose your cloud confidently for processing Big Data workloads BigQueryCloud Storage Dataproc Bigtable Cloud Datastore Cloud SQL HDInsightBlob StorageCosmoDBADLS SQL DW Event Hubs Azure Databricks EMRS3 Redshift Kinesis Firehose Kinesis
  • 18. © Informatica. Proprietary and Confidential.1818 ANALYTICS DATA MANAGEMENT INFRASTRUCTURE & STORAGE Where Does Informatica Fit? Ingest Cleanse Secure Catalog PrepareIntegrate Govern
  • 19. © Informatica. Proprietary and Confidential.1919 Redshift S3 Input bucket EMR / Qubole S3 Output bucket RELATIONAL DEVICE DATA WEBLOGS Cloud Ready Reference Architecture Amazon AWS CATALOG SEARCH LINEAGE RECOMMENDATIONSPARSE MATCH ACQUIRE INGEST PREPARE CATALOG SECURE GOVERN ACCESS CONSUME
  • 20. © Informatica. Proprietary and Confidential.2020 RELATIONAL DEVICE DATA WEBLOGS Cloud Ready Reference Architecture Microsoft Azure CATALOG SEARCH LINEAGE RECOMMENDATIONSPARSE MATCH ACQUIRE INGEST PREPARE CATALOG SECURE GOVERN ACCESS CONSUME SQL Data Warehouse ADLS / BLOB HDInsight / Azure Databricks ADLS / BLOB
  • 21. © Informatica. Proprietary and Confidential.2121 RELATIONAL DEVICE DATA WEBLOGS Cloud Ready Reference Architecture Google Cloud CATALOG SEARCH LINEAGE RECOMMENDATIONSPARSE MATCH ACQUIRE INGEST PREPARE CATALOG SECURE GOVERN ACCESS CONSUME Big Query Cloud Storage Dataproc Cloud Storage
  • 22. © Informatica. Proprietary and Confidential.2222 Steps to Build a Data Lake Catalog, Govern and Secure Data Ingestion Data Preparation Data Curation Stream Processing and Analytics 1 2 3 4 5 Data Delivery6
  • 23. 23 © Informatica. Proprietary and Confidential.23 Designing A Data Lake – High Level Architecture Sources LANDING RAW CURATION Curated Structured Cleansed ADVANCED ANALYTICS Models DATA CATALOG DATA QUALITY & GOVERNANCE DATA PRIVACY AND PROTECTION DATA INFRASTRUCTURE Data Products DISCOVERY ZONE Self-Service Analytics ENTERPRISE ZONE Enriched Formatted Consumption Ready STREAM PROCESSING & ANALYTICS
  • 24. Intelligent Data Lake Blueprint FastLane Machine Data Cloud MobileSocialLog files Apps Data Warehouse Databases Application Servers Documents Mainframe Cloud Batch/Mass Ingest ChangeData Capture EdgeData Collect/Stream Data Curation Integrate ParseTransform MaskCleanse MatchProfile Monitor Recommend Collaborate OperationalizeDiscover Prepare Publish Monitor Data Preparation Explore BatchLane Batch Sources Streaming Sources Data Engineer Data Analyst Stream Processing & Analytics StandardizeAnalyze CleanseProcess Deliver Data Engineer Ingest Publish/ Subscribe Deliver Ingest Data Warehouse Master Data Management Advanced Analytics Historical Analysis Machine Learning Real-Time Visualization Alerts Business Process Automation Metadata Intelligence Foundation Data Steward Data Catalog Data Governance & Quality Data Privacy & Protection Discover Classify Relationship Lineage Data Statistics Standards Policies Procedures Quality Integrity Stewardship Analyze Identify Classify Detect Risk Score User Behavior Analysis 1 2 3 4 5
  • 25. #1 Supporting Layer of a Data Lake
  • 26. 26 © Informatica. Proprietary and Confidential.26 © Informatica. Proprietary and Confidential. Supporting Layer of A Data Lake1 • Analysts spend countless hours finding the right data assets for their analysis. • A properly governed catalog (that is a catalog that’s been augmented by business definitions and quality information) greatly reduces this time to discover Data Catalog • You can only trust your reports and analysis if you trust the underlying data • Governance provides the trust in the data by showing where data came from and how it’s been transformed • And by adding quality rules and the associated scores to the catalog Data Governance & Quality • Data governance provides the necessary controls on who can access which data in the data lake. • It provides input for data security mechanisms (such as authorization or masking) to prevent mis-use of data. Data Privacy & Protection
  • 27. Automate Data Discovery, Cataloging, and Linking Leverage ML and AI to find critical data across structured and unstructured sources Onboard discovered data automatically with oversight and control Automatically tag data with business context to help users assess relevance
  • 28. Automate Lineage Discovery and Process Mapping Automatically produce data lineage from scanned and discovered data movement Onboard lineage and process mapping automatically, with oversight and control Understand where data comes from, how and where it’s used, and who’s responsible for it
  • 29. Automate Fit for Purpose Data Quality Automate rule generation from policy definitions and metadata Automate rule enforcement in systems and business processes Monitor quality improvement across systems and processes over time
  • 30. Automate Data Subject Discovery, Proliferation, Consent and Protection Define policies to protect sensitive data, assign ownership and accountability Automate discovery and classification of sensitive data, across structured and unstructured sources Automate identification of data subjects, and assessment of risk exposure
  • 31. © Informatica. Proprietary and Confidential.3131 Demonstration Enterprise Data Catalog
  • 33. Intelligent Data Lake Blueprint FastLane Machine Data Cloud MobileSocialLog files Apps Data Warehouse Databases Application Servers Documents Mainframe Cloud MassIngest ChangeData Capture EdgeData Collect Data Curation Integrate ParseTransform MaskCleanse MatchProfile Monitor Recommend Collaborate OperationalizeDiscover Prepare Publish Monitor Data Preparation Explore BatchLane Batch Sources Streaming Sources Data Engineer Data Analyst Stream Processing & Analytics StandardizeAnalyze CleanseProcess Deliver Data Engineer Ingest Publish/ Subscribe Deliver Ingest Data Warehouse Master Data Management Advanced Analytics Historical Analysis Machine Learning Real-Time Visualization Alerts Business Process Automation Metadata Intelligence Foundation Data Steward Data Catalog Data Governance & Quality Data Privacy & Protection Discover Classify Relationship Lineage Data Statistics Standards Policies Procedures Quality Integrity Stewardship Analyze Identify Classify Detect Risk Score User Behavior Analysis 1 4 3 5 2
  • 34. 34 © Informatica. Proprietary and Confidential. Data Ingestion Data ingestion is the process of collecting data from source systems or source locations, either in a batch mode or in a real-time/streaming and load this data into the data lake.
  • 35. 35 © Informatica. Proprietary and Confidential.35 © Informatica. Proprietary and Confidential. Ingesting Data Into The Data Lake • Ingest or replicate large amounts of data in batch mode into RDBMS/files/cloud DW/Hadoop. • Either manual build of ingest mapping (specialized ingestion process) or using Dynamic Mapping patterns to automate generation of large amounts of ingestion mappings. • Or use Mass Ingestion Service for initial- and delta loads Batch Data Ingestion • Use change data capture to capture only changed data from a data source • Wide array of sources available ranging from mainframe to midrange to most popular RDBMS’s that support change capture. • Ingest directly into target database (using PowerCenter) or into Kafka queue for further streaming processing (using Power Exchange for CDC for Kafka Publisher) • Collect data from streaming data sources like HTTP streams, sensors, log files etc and push data directly into Data Lake or onto Kafka queue • For data pushed onto Kafka use Big Data Streaming to augment/enrich/validate/cleanse the data and use it for streaming analytics. • Generate events to trigger actions based on stream processing. (e.g. next best offer in retail, control industry processes with predictive maintenance) • Embed Machine Learning models using Python or Java, Edge Data Streaming 2 Change Data Capture
  • 36. 36 © Informatica. Proprietary and Confidential.36 © Informatica. Proprietary and Confidential. High-Speed Mass Ingestion Rely on easy to use, fast, and scalable approach – no hand-coding Ingest data from various source systems including relational tables into Cloud and Big Data Uses high-performance connectivity, mass ingestion, and dynamic mappings Self-service UI to perform mass ingestion of initial as well as incremental loads
  • 37. 37 © Informatica. Proprietary and Confidential.37 Change Data Capture Bulk Movement Bulk movement of data from one or more sources to targets Source Data Target Database Incrementally take source changes and apply them to target Source Data Target Database Continuously take source changes and apply them to target Source Data Target Database Step 1 Step 2 Step 3
  • 38. 38 © Informatica. Proprietary and Confidential.38 Enterprise Data Streaming Big Data Streaming Real time offer Alert Capture and Ingest Enrich, Process & Analyze Relational Systems Edge Data Streaming Real time dashboard Machine Data / IOT Sensor Data WebLogs Social Media Power Exchange CDC Publisher Message Hub Persist /Data Lake Trigger Business processes Changes Extract Transform LoadSense Reason Act Azure Event Hub Filter Transform Aggregate Enrich
  • 40. Intelligent Data Lake Blueprint FastLane Machine Data Cloud MobileSocialLog files Apps Data Warehouse Databases Application Servers Documents Mainframe Cloud MassIngest ChangeData Capture EdgeData Collect Data Curation Integrate ParseTransform MaskCleanse MatchProfile Monitor Recommend Collaborate OperationalizeDiscover Prepare Publish Monitor Data Preparation Explore BatchLane Batch Sources Streaming Sources Data Engineer Data Analyst Stream Processing & Analytics StandardizeAnalyze CleanseProcess Deliver Data Engineer Ingest Publish/ Subscribe Deliver Ingest Data Warehouse Master Data Management Advanced Analytics Historical Analysis Machine Learning Real-Time Visualization Alerts Business Process Automation Metadata Intelligence Foundation Data Steward Data Catalog Data Governance & Quality Data Privacy & Protection Discover Classify Relationship Lineage Data Statistics Standards Policies Procedures Quality Integrity Stewardship Analyze Identify Classify Detect Risk Score User Behavior Analysis 1 4 4 5 2 Recommend Collaborate OperationalizeDiscover Prepare Publish Monitor Data Preparation Explore Data Analyst 3
  • 41. 41 © Informatica. Proprietary and Confidential. Data Preparation Data preparation is the collaborative self-service process of discovering and preparing data by data analysts to rapidly turn raw data into insights with quality and governance
  • 42. 42 © Informatica. Proprietary and Confidential.42 © Informatica. Proprietary and Confidential. Why Do We Need A Data Preparation Solution? • Difficulty finding trusted data • Limited access to the data • Frustrated by slow response from IT • Constrained by disparate tools, manual steps • No way to collaborate, share, and update curated datasets, reuse knowledge • Can’t cope with growing demand for data from the business • No visibility into what the business is doing with the data • Struggling to deliver value to the business • Losing the ability to govern and manage data as an asset Business/Data Analysts IT/Data Engineers
  • 43. 43 © Informatica. Proprietary and Confidential.43 © Informatica. Proprietary and Confidential. Enterprise Data Preparation • Discover data by using Enterprise Data Catalog to search for relevant data assets • Understand the context and meaning of the data an see lineage to gain more trust in the data Discover and Understand • Prepare the data with an Excel style visual data preparation module • Supported by AI to make smarter decisions and improve productivity • Visually validate the results directly in the preparation module • Data preparation process is a fully governed process so newly added datasets are immediately added to the catalog and have full lineage details available. • Operationalize that data preparation recipe for operation at scale. • Allow for visual inspection of data using integration with Zeppelin. • Save the recipe as a data pipeline that can be maintained and scheduled by the IT organization responsible for the data lake. Operationalize at scale 4 Intelligently Prepare Data
  • 44. 44 © Informatica. Proprietary and Confidential. Enterprise Data Preparation Collaborative Self-service data discovery and preparation at scale Discover, search, and explore data assets using AI driven Enterprise Data Catalog Use Excel-like interface for Advanced data preparation to blend, transform, cleanse, enrich, shape and use 100s of pre-built DQ rules Operationalize with Self-service scheduling and re-usable workflows with Spark support Visualize with Apache Zeppelin
  • 45. 45 © Informatica. Proprietary and Confidential.45 © Informatica. Proprietary and Confidential. Enterprise Data Preparation Steps for Data Prep Search and Discover Prepare Publish Visualize Operational -ize Schedule Upload Download Import Export Collaborate Enterprise Data Catalog Big Data Management, Big Data Quality, Big Data Masking Data Lake
  • 46. © Informatica. Proprietary and Confidential.4646 Enterprise Data Preparation – Differentiators • Enterprise Data Catalog • Advanced Data Wrangling • Data Visualization Integration • Operationalization for Business and IT Collaboration • Spark support with Autoscaling • Dynamic Data Masking support Holistic Self-service • Discovery, Identification and assessment for best data • Next-Best-Action & data set Recommendations • Smart Chart Recommendations for Data Visualization CLAIRE • Excel-like data preparation • Extensibility with Rules built in other tools • 100s of Pre-built Data Quality Rules for validation, parsing, standardization, matching and consolidation Advanced Data Prep
  • 47. © Informatica. Proprietary and Confidential.4747 Hands-on Lab Data preparation with EDP Leverage Enterprise data preparation to quickly discover, prepare and deliver data for analytics.
  • 49. Intelligent Data Lake Blueprint FastLane Machine Data Cloud MobileSocialLog files Apps Data Warehouse Databases Application Servers Documents Mainframe Cloud MassIngest ChangeData Capture EdgeData Collect Data Curation Integrate ParseTransform MaskCleanse MatchProfile Monitor Recommend Collaborate OperationalizeDiscover Prepare Publish Monitor Data Preparation Explore BatchLane Batch Sources Streaming Sources Data Engineer Data Analyst Stream Processing & Analytics StandardizeAnalyze CleanseProcess Deliver Data Engineer Ingest Publish/ Subscribe Deliver Ingest Data Warehouse Master Data Management Advanced Analytics Historical Analysis Machine Learning Real-Time Visualization Alerts Business Process Automation Metadata Intelligence Foundation Data Steward Data Catalog Data Governance & Quality Data Privacy & Protection Discover Classify Relationship Lineage Data Statistics Standards Policies Procedures Quality Integrity Stewardship Analyze Identify Classify Detect Risk Score User Behavior Analysis 1 3 3 5 2 Data Curation Integrate ParseTransform MaskCleanse MatchProfile Monitor 4 Data Engineer
  • 50. 50 © Informatica. Proprietary and Confidential. Data Curation Data curation is the process of ensuring data has the right structure, quality and is stored in the right format to provide trust so consumers can be assured the data is correct.
  • 51. © Informatica. Proprietary and Confidential.5151 Informatica Big Data Management Graphical, no code, Zero config, install & footprint Best of breed data management Simple Latest Spark enhancements, Data Science integration, Better Operations Robust Agnostic to all end-to-end Hybrid Cloud Big Data eco- systems with no regression Agnostic
  • 52. 52 © Informatica. Proprietary and Confidential.52 select l_orderkey, sum(l_extendedprice * (1 - l_discount)) as revenue, o_orderdate, o_shippriority from CUSTOMER, ORDERS, LINEITEM where c_mktsegment = 'AUTOMOBILE' and c_custkey = o_custkey and l_orderkey = o_orderkey and o_orderdate < date '1995-03-13' and l_shipdate > date '1995-03-13' group by l_orderkey, o_orderdate, o_shippriority order by revenue desc, o_orderdate limit 10; SQL Query Leverage the Power of No-Code Interface Spark Code package main.scala import org.apache.spark.sql.DataFrame import org.apache.spark.SparkContext import org.apache.spark.sql.functions.sum import org.apache.spark.sql.functions.udf /** * TPC-H Query 3 * */ class Q03 extends TpchQuery { override def execute(sc: SparkContext, schemaProvider: TpchSchemaProvider): DataFrame = { // this is used to implicitly convert an RDD to a DataFrame. val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.implicits._ import schemaProvider._ val decrease = udf { (x: Double, y: Double) => x * (1 - y) } val fcust = customer.filter($"c_mktsegment" === "BUILDING") val forders = order.filter($"o_orderdate" < "1995-03-15") val flineitems = lineitem.filter($"l_shipdate" > "1995-03-15") fcust.join(forders, $"c_custkey" === forders("o_custkey")) .select($"o_orderkey", $"o_orderdate", $"o_shippriority") .join(flineitems, $"o_orderkey" === flineitems("l_orderkey")) .select($"l_orderkey", decrease($"l_extendedprice", $"l_discount").as("volume"), $"o_orderdate", $"o_shippriority") .groupBy($"l_orderkey", $"o_orderdate", $"o_shippriority") .agg(sum($"volume").as("revenue")) .sort($"revenue".desc, $"o_orderdate") .limit(10) } } BDM Mapping Future proof your investments, design once and run on best-of-breed engine
  • 53. 53 © Informatica. Proprietary and Confidential. Advanced Spark Support
Take advantage of the latest innovation, performance, and scaling benefits.
  • 54. 54 © Informatica. Proprietary and Confidential. Spark Structured Streaming Support
Handle streaming data based on event time instead of processing time; a minimal sketch follows.
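A minimal sketch of what event-time handling means in Spark Structured Streaming: readings are grouped by when they occurred rather than when they arrived, and a watermark bounds how long late data is accepted. This is generic Spark code under assumed names (broker address, topic, columns), not the product's internal implementation; it requires the spark-sql-kafka connector on the classpath.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, window}

val spark = SparkSession.builder().appName("event-time").getOrCreate()

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "sensor-events")
  .load()
  // Illustrative: a real pipeline would usually parse the event time out of the payload.
  .selectExpr("CAST(value AS STRING) AS payload", "timestamp AS event_time")

val counts = events
  .withWatermark("event_time", "10 minutes")       // accept data up to 10 minutes late
  .groupBy(window(col("event_time"), "5 minutes")) // tumbling 5-minute event-time windows
  .count()

counts.writeStream
  .outputMode("update")
  .format("console")
  .start()
  .awaitTermination()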
  • 55. 55 © Informatica. Proprietary and Confidential. Azure Databricks Support
Leverage the compute power of Databricks on Azure for big data processing.
  • 56. 56 © Informatica. Proprietary and Confidential. Integrated Data Science Support
Operationalize machine learning models with Python transformations; a sketch of the general scoring pattern follows.
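The product slide refers to Python transformations (e.g., embedding a scikit-learn model in a mapping). To keep this deck's code in one language, here is a hedged Scala analogue of the same idea: a pre-trained model is loaded and applied inline as one step of a data pipeline. The model path and column names are illustrative assumptions.

import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("scoring").getOrCreate()

// Load a model trained offline by a data scientist (illustrative path).
val model = PipelineModel.load("/models/churn_pipeline")

val customers = spark.read.parquet("/lake/curated/customers")

// Append the model's prediction column and publish scores for downstream use.
model.transform(customers)
  .write.mode("overwrite")
  .parquet("/lake/delivered/churn_scores")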
  • 57. 57 © Informatica. Proprietary and Confidential. Schema Drift Handling
Handle complex structures and their changes for both batch and streaming data; a minimal sketch follows.
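One common way schema drift is tolerated in plain Spark, shown as a minimal sketch: when Parquet files in a folder were written with evolving schemas, merging them lets a column added later simply surface as null in older records. The path is an illustrative assumption, and the product handles drift declaratively rather than through this option.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("schema-drift").getOrCreate()

val events = spark.read
  .option("mergeSchema", "true") // union the schemas of all files being read
  .parquet("/lake/raw/events")

events.printSchema() // fields added by newer producers appear without code changes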
  • 58. 58 © Informatica. Proprietary and Confidential. Operational Insights
Deliver predictive operational insights about your big data environments.
  • 59. © Informatica. Proprietary and Confidential. 59 Demo: Processing data at scale with Big Data Management
  • 61. Intelligent Data Lake Blueprint (repeated from slide 49, now highlighting lane 5): streaming sources are ingested by a data engineer into Stream Processing & Analytics, which analyzes, standardizes, cleanses, processes, and delivers the data.
  • 62. 62 © Informatica. Proprietary and Confidential. Streaming Data Management
Streaming data management lets customers design data flows that continuously capture, prepare, and process streams of unbounded data. It provides a single platform for discovering insights and building models, which can then be operationalized to run in near real time, capturing and realizing the value of high-velocity data.
  • 63. © Informatica. Proprietary and Confidential. 63 Streaming Data Management
Streaming Data Ingestion: collect streaming data from various streaming and IoT endpoints, with multi-latency ingestion into the lake or a messaging hub.
Streaming Data Enrichment: enrich and distribute streaming data in real time for business-user consumption.
Real-Time Actions: operationalize actions based on insights from streaming data.
A minimal end-to-end sketch of these three capabilities follows.
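A minimal sketch of those three capabilities expressed in open-source Spark Structured Streaming terms: ingest readings from a messaging hub, enrich them against curated reference data, and publish actionable alerts back to a topic. Broker address, topics, schema, and the alert threshold are all illustrative assumptions.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{DoubleType, StringType, StructType, TimestampType}

val spark = SparkSession.builder().appName("stream-enrich").getOrCreate()

val schema = new StructType()
  .add("device_id", StringType)
  .add("temperature", DoubleType)
  .add("event_time", TimestampType)

// Ingest: raw sensor readings arriving on a message hub.
val readings = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "sensor-readings")
  .load()
  .select(from_json(col("value").cast("string"), schema).as("r"))
  .select("r.*")

// Enrich: join the stream against a curated device reference table.
val devices = spark.read.parquet("/lake/curated/devices")
val enriched = readings.join(devices, Seq("device_id"))

// Act: publish overheating alerts so downstream processes can react in real time.
enriched.filter(col("temperature") > 90.0)
  .selectExpr("device_id AS key", "to_json(struct(*)) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "alerts")
  .option("checkpointLocation", "/tmp/checkpoints/alerts")
  .start()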
  • 64. Stream Processing and Analytics Use Cases (blueprint lane 5)
Predictive Maintenance & Smart Factory: identify stress signals coming from devices and act on them before it is too late.
Real-Time Offer Management: combine web searches and camera feeds to identify the customer and roll out real-time offers.
Clinical Research Optimization: collect and process bedside-monitor data so clinical researchers can understand and detect disease more effectively.
Real-Time Customer KPI Generation: KPI metrics used to retain churning customers and make offers.
  • 65. 65 © Informatica. Proprietary and Confidential. Enterprise Streaming Data Management: solution overview
Sense (capture and ingest): Edge Data Streaming and IICS Streaming Ingestion (new) collect machine data, IoT sensor data, web logs, and social media; PowerExchange CDC Publisher captures changes from relational systems into the message hub (e.g., Azure Event Hub).
Reason (enrich, process, and analyze): Big Data Streaming filters, transforms, aggregates, and enriches the event streams.
Act: persist to the data lake, feed real-time dashboards, send real-time offer alerts over SMS, and trigger business processes.
  • 66. 66 © Informatica. Proprietary and Confidential. Edge Data Streaming: highly scalable stream data collection and ingestion
Distributed and broker-less, with lightweight agents for high-performance event ingestion.
Flexible, wide connectivity with out-of-the-box compression, encryption, and transformations.
Easy administration, configuration, monitoring, and auto-deployment.
Sources: flat files, JMS, Syslog, TCP/UDP, HTTP, WebSocket, Ultra Messaging, MQTT, OPC-DA.
Targets: Kafka (including Kerberized), HDFS (CDH, HDP, MapR), Cassandra NoSQL, WebSocket(S), Amazon Kinesis, Azure Event Hub, JMS, MQTT.
Transformations: regex filtering, timestamp, insert string (plus custom, via the SDK). A generic sketch of the agent pattern follows.
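Edge Data Streaming agents are configured rather than hand-coded; purely to make the pattern they replace concrete, here is a generic hand-rolled sketch of a minimal agent that reads a log source, applies a regex filter, and forwards matches to a Kafka target. The broker, topic, file path, and pattern are illustrative assumptions, and none of this is the product's API.

import java.util.Properties
import scala.io.Source
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object MiniEdgeAgent {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    // Collect: read the source line by line (a real agent would tail it continuously).
    for (line <- Source.fromFile("/var/log/app.log").getLines()
         if line.matches(".*ERROR.*")) { // regex filtering: only ship error events
      producer.send(new ProducerRecord[String, String]("app-errors", line))
    }
    producer.close()
  }
}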
  • 67. 67 © Informatica. Proprietary and Confidential. Informatica Big Data Streaming: continuous event processing of unbounded big data
Zero code: streaming ingestion and integration with Apache Kafka and Spark Streaming.
Flexible and agnostic: supports on-premises, hybrid, and full-cloud Hadoop distributions.
Integrated: sliding and tumbling windows for moving averages, Python for machine learning.
Sources span message-centric feeds (PWX CDC, EDS, Kafka, JMS, MapR Streams, cloud) and data-centric feeds (machine and device data, documents and emails, relational and mainframe, social media and web logs); targets include RDBMS, HDFS, HBase, Kafka, JMS, EDS, and cloud stores. Built on the Informatica Intelligent Data Platform with integration, lineage, governance, and security, it helps improve asset utilization, increase operational efficiency, and reduce security and safety risk. A minimal windowing sketch follows.
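A minimal sketch of the sliding and tumbling windows the slide mentions, expressed in Spark Structured Streaming: a 10-minute moving average recomputed every 5 minutes. It assumes the readings streaming DataFrame from the earlier ingestion sketch (device_id, temperature, event_time); setting the slide interval equal to the window length turns the same call into tumbling windows.

import org.apache.spark.sql.functions.{avg, col, window}

val movingAvg = readings
  .withWatermark("event_time", "15 minutes") // bound state kept for late data
  .groupBy(
    window(col("event_time"), "10 minutes", "5 minutes"), // sliding window: 10 min long, advancing every 5
    col("device_id"))
  .agg(avg(col("temperature")).as("avg_temperature"))
// movingAvg is itself a streaming DataFrame; attach a writeStream sink to run it.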
  • 68. 68 © Informatica. Proprietary and Confidential. Informatica Streaming: key differentiators
Metadata-based design and real-time enrichments.
Engine abstraction.
Stream data ingestion and processing in the cloud.
Out-of-the-box connectivity to on-premises and cloud endpoints.
ML model operationalization.
Parsing of complex unstructured data.
  • 70. 70 © Informatica. Proprietary and Confidential. Best Practices for Cloud Data Lake Management
Catalog your data; prevent the data lake from becoming a swamp.
Curate and cleanse data for consumption to increase trust.
Apply data governance and security policies to protect sensitive data.
Leverage AI/machine learning to enhance the productivity of all users of the platform.
Empower collaboration so the data lake is "everyone's lake."
Integrate data pipeline development into your CI/CD/DevOps flow.
  • 71. 71 © Informatica. Proprietary and Confidential. Parting thoughts
"Do the difficult things while they are easy and do the great things while they are small. A journey of a thousand miles must begin with a single step." - Lao Tzu