Building a Serverless Cloud Data Lake
Agenda
Introduction
Next Gen Analytics Use Cases
Architecting for Intelligent Cloud Data Lakes
Hands-On Lab
Best Practices for implementing a data lake
Final Q & A, Wrap-up
Technology Transformation
Storage
Analytics
Applications
Databases
Messaging
Compute
Technology Transformation
First-generation cloud services: Azure Data Lake Store, Azure Blob Storage, Google Compute Engine, Azure HDInsight, Amazon Web Services EC2 & EMR
Categories: Storage, Analytics, Applications, Databases, Messaging, Compute
Technology Transformation
First-generation cloud services: Azure Data Lake Store, Azure Blob Storage, Google Compute Engine, Azure HDInsight, Amazon Web Services EC2 & EMR
Serverless successors: ADLS Gen 2, Google Cloud Storage, AWS S3, Google Dataproc, Altus Data Warehouse, Azure Cosmos DB, AWS Neptune, AI/ML Workspace, Data Science Workbench, Azure ML
Categories: Storage, Analytics, Applications, Databases, Messaging, Compute
How to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics
Technology Transformation
Data
Latency
Users
Regulations
Cloud
Spark Code

package main.scala

import org.apache.spark.sql.DataFrame
import org.apache.spark.SparkContext
import org.apache.spark.sql.functions.sum
import org.apache.spark.sql.functions.udf

/**
 * TPC-H Query 3
 */
class Q03 extends TpchQuery {

  override def execute(sc: SparkContext, schemaProvider: TpchSchemaProvider): DataFrame = {
    // this is used to implicitly convert an RDD to a DataFrame.
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._
    import schemaProvider._

    val decrease = udf { (x: Double, y: Double) => x * (1 - y) }

    val fcust = customer.filter($"c_mktsegment" === "BUILDING")
    val forders = order.filter($"o_orderdate" < "1995-03-15")
    val flineitems = lineitem.filter($"l_shipdate" > "1995-03-15")

    fcust.join(forders, $"c_custkey" === forders("o_custkey"))
      .select($"o_orderkey", $"o_orderdate", $"o_shippriority")
      .join(flineitems, $"o_orderkey" === flineitems("l_orderkey"))
      .select($"l_orderkey",
        decrease($"l_extendedprice", $"l_discount").as("volume"),
        $"o_orderdate", $"o_shippriority")
      .groupBy($"l_orderkey", $"o_orderdate", $"o_shippriority")
      .agg(sum($"volume").as("revenue"))
      .sort($"revenue".desc, $"o_orderdate")
      .limit(10)
  }
}
SQL Query

select l_orderkey,
       sum(l_extendedprice * (1 - l_discount)) as revenue,
       o_orderdate, o_shippriority
from CUSTOMER, ORDERS, LINEITEM
where c_mktsegment = 'AUTOMOBILE'
  and c_custkey = o_custkey
  and l_orderkey = o_orderkey
  and o_orderdate < date '1995-03-13'
  and l_shipdate > date '1995-03-13'
group by l_orderkey, o_orderdate, o_shippriority
order by revenue desc, o_orderdate
limit 10;
Disruptors
Data
Latency
Users
Regulations
Cloud
Data Management Accelerators
Data
Latency
Users
Regulations
Cloud
Modern Data Integration Patterns
Metadata
AI/ML
Governance
Hybrid
Fragmented Approaches Create More Technical Debt And Complexity
Can your team afford to be on call to support and maintain these manual, one-off,
time-consuming, and complex approaches?
Integrate, Stream, Catalog, Prepare, Protect, Match, Enrich: each step stitched to the next with hand-coding
End-to-End Data Management for Next-Gen Analytics
ANY DATA, ANY LATENCY, ANY USER, ANY REGULATION, ANY CLOUD / ANY TECHNOLOGY
Functions: INGEST, STREAM, INTEGRATE, CLEANSE, PREPARE, DEFINE, CATALOG, RELATE, PROTECT, ENRICH, DELIVER
Foundation: METADATA, GOVERNANCE, HYBRID, MODERN DATA INTEGRATION PATTERNS
Cloud-Ready Big Data Management
The Ever-Evolving Big Data Technology

Big Data 1.0 (on-premises, manual deployment): HDFS, MapReduce, NoSQL
Big Data 2.0 (hosted, manual deployment): HDFS, Hive, MapR FS, YARN, SQOOP, …
Big Data 3.0 (fully automated cloud deployment, managed serverless deployment): elastic compute, compute as a service, storage as a service (S3, Redshift, Blob, ADLS, …)
Big Data 2.0: On-Premises Big Data Deployment

Set up hardware → Select Hadoop distro → Configure cluster → Design data flows → Monitor cluster → Manually scale cluster

• Requires Big Data skills
• Requires Hadoop admins
• High operational cost
Big Data 3.0: Serverless Deployment with Integration at Scale

Design data flows → Auto-scale cluster

(Setting up hardware, selecting a Hadoop distro, and configuring and monitoring the cluster are no longer manual steps.)
Cloud is a Huge Leap Forward for Data Lakes

On-premises Data Lakes (Agility Inhibitors):
1. Tightly coupled data compute and storage
2. Expansions are slow to provision
3. Contention for compute capacity with highly variable compute profiles

Cloud Data Lakes (Agility Enablers):
1. Separation of data compute and storage
2. Low-cost, infinite scale-out data persistence
3. Fit-for-purpose, scale-on-demand compute engines
Enhanced Cloud Ecosystem Support
Choose your cloud confidently for processing Big Data workloads

Google Cloud: BigQuery, Cloud Storage, Dataproc, Bigtable, Cloud Datastore, Cloud SQL
Microsoft Azure: HDInsight, Blob Storage, Cosmos DB, ADLS, SQL DW, Event Hubs, Azure Databricks
AWS: S3, EMR, Redshift, Kinesis, Kinesis Firehose
Where Does Informatica Fit?

ANALYTICS
DATA MANAGEMENT: Ingest, Integrate, Cleanse, Secure, Catalog, Prepare, Govern
INFRASTRUCTURE & STORAGE
Cloud Ready Reference Architecture: Amazon AWS

Sources (RELATIONAL, DEVICE DATA, WEBLOGS) → S3 input bucket → EMR / Qubole → S3 output bucket → Redshift

Pipeline: ACQUIRE, INGEST, PREPARE, CATALOG, SECURE, GOVERN, ACCESS, CONSUME
Catalog services: CATALOG, SEARCH, LINEAGE, RECOMMENDATIONS, PARSE, MATCH
Cloud Ready Reference Architecture: Microsoft Azure

Sources (RELATIONAL, DEVICE DATA, WEBLOGS) → ADLS / Blob → HDInsight / Azure Databricks → ADLS / Blob → SQL Data Warehouse

Pipeline: ACQUIRE, INGEST, PREPARE, CATALOG, SECURE, GOVERN, ACCESS, CONSUME
Catalog services: CATALOG, SEARCH, LINEAGE, RECOMMENDATIONS, PARSE, MATCH
Cloud Ready Reference Architecture: Google Cloud

Sources (RELATIONAL, DEVICE DATA, WEBLOGS) → Cloud Storage → Dataproc → Cloud Storage → BigQuery

Pipeline: ACQUIRE, INGEST, PREPARE, CATALOG, SECURE, GOVERN, ACCESS, CONSUME
Catalog services: CATALOG, SEARCH, LINEAGE, RECOMMENDATIONS, PARSE, MATCH
Steps to Build a Data Lake

1. Catalog, Govern and Secure
2. Data Ingestion
3. Data Preparation
4. Data Curation
5. Stream Processing and Analytics
6. Data Delivery
Designing A Data Lake – High Level Architecture

Flow: Sources → LANDING → RAW → CURATION (Curated, Structured, Cleansed) → ENTERPRISE ZONE (Enriched, Formatted, Consumption Ready) → Data Products
DISCOVERY ZONE: Self-Service Analytics
ADVANCED ANALYTICS: Models
STREAM PROCESSING & ANALYTICS runs alongside the batch flow
Supporting layers: DATA CATALOG, DATA QUALITY & GOVERNANCE, DATA PRIVACY AND PROTECTION, DATA INFRASTRUCTURE
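To make the zone flow concrete, here is a minimal Spark sketch (in the deck's own Scala) of promoting a dataset from the landing zone through raw into the enterprise zone. The s3a:// paths, dataset, and column names are illustrative assumptions, not part of the blueprint.

// Illustrative sketch of promoting data through lake zones; paths and
// column names are hypothetical, not part of the reference architecture.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, trim, upper}

object ZonePromotion {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ZonePromotion").getOrCreate()

    // LANDING: files arrive as-delivered from source systems.
    val landed = spark.read.option("header", "true").csv("s3a://lake/landing/orders/")

    // RAW: persist unmodified data in a columnar format for cheap re-processing.
    landed.write.mode("overwrite").parquet("s3a://lake/raw/orders/")

    // CURATION: cleanse and standardize before exposing to the enterprise zone.
    val curated = spark.read.parquet("s3a://lake/raw/orders/")
      .filter(col("order_id").isNotNull)
      .withColumn("country", upper(trim(col("country"))))
    curated.write.mode("overwrite").parquet("s3a://lake/enterprise/orders/")
  }
}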
Intelligent Data Lake Blueprint

Batch sources (Batch Lane): Apps, Data Warehouse, Databases, Application Servers, Documents, Mainframe, Cloud
Streaming sources (Fast Lane): Machine Data, Cloud, Mobile, Social, Log files

1. Metadata Intelligence Foundation (Data Steward)
• Data Catalog: Discover, Classify, Relationship, Lineage, Data Statistics
• Data Governance & Quality: Standards, Policies, Procedures, Quality, Integrity, Stewardship
• Data Privacy & Protection: Analyze, Identify, Classify, Detect, Risk Score, User Behavior Analysis
2. Ingest: Batch/Mass Ingest, Change Data Capture, Edge Data Collect/Stream
3. Data Preparation (Data Analyst): Discover, Explore, Prepare, Recommend, Collaborate, Publish, Operationalize, Monitor
4. Data Curation (Data Engineer): Profile, Integrate, Transform, Parse, Cleanse, Mask, Match, Monitor
5. Stream Processing & Analytics (Data Engineer): Process, Analyze, Standardize, Cleanse, Deliver

Deliver (Publish/Subscribe) to: Data Warehouse, Master Data Management, Advanced Analytics, Historical Analysis, Machine Learning, Real-Time Visualization, Alerts, Business Process Automation
#1 Supporting Layer of a Data Lake
Supporting Layer of a Data Lake (1)

Data Catalog
• Analysts spend countless hours finding the right data assets for their analysis.
• A properly governed catalog (that is, a catalog augmented with business definitions and quality information) greatly reduces this time to discover.

Data Governance & Quality
• You can only trust your reports and analysis if you trust the underlying data.
• Governance provides that trust by showing where data came from and how it has been transformed, and by adding quality rules and the associated scores to the catalog.

Data Privacy & Protection
• Data governance provides the necessary controls on who can access which data in the data lake.
• It provides input for data security mechanisms (such as authorization or masking) to prevent misuse of data.
Automate Data Discovery, Cataloging, and Linking
Leverage ML and AI to find
critical data across structured
and unstructured sources
Onboard discovered data
automatically with oversight
and control
Automatically tag data with
business context to help users
assess relevance
Automate Lineage Discovery and Process Mapping
Automatically produce data
lineage from scanned and
discovered data movement
Onboard lineage and process
mapping automatically, with
oversight and control
Understand where data comes
from, how and where it’s used,
and who’s responsible for it
Automate Fit for Purpose Data Quality
Automate rule generation
from policy definitions and
metadata
Automate rule enforcement
in systems and business
processes
Monitor quality improvement
across systems and
processes over time
Automate Data Subject Discovery, Proliferation,
Consent and Protection
Define policies to protect sensitive
data, assign ownership and
accountability
Automate discovery and classification
of sensitive data, across structured
and unstructured sources
Automate identification of data
subjects, and assessment
of risk exposure
Demonstration
Enterprise Data Catalog
#2 Data Ingestion
(The Intelligent Data Lake Blueprint diagram repeats here, with step 2, Data Ingestion, highlighted.)
Data Ingestion

Data ingestion is the process of collecting data from source systems or source locations, either in batch mode or in real time (streaming), and loading this data into the data lake.
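As a concrete illustration of batch ingestion, here is a minimal Spark sketch that snapshots a relational table into the raw zone. The JDBC URL, credentials, table, and paths are placeholders, and this generic job only stands in for the ingestion services described below.

// Minimal batch-ingestion sketch: pull a relational table into the lake.
// The JDBC URL, credentials, table, and paths are placeholders.
import org.apache.spark.sql.SparkSession

object BatchIngest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("BatchIngest").getOrCreate()

    val customers = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://src-db:5432/sales")
      .option("dbtable", "public.customers")
      .option("user", "ingest")
      .option("password", sys.env("SRC_DB_PASSWORD"))
      .load()

    // Land the snapshot in the raw zone, partitioned for downstream pruning
    // (assumes a country column exists on the hypothetical table).
    customers.write
      .mode("overwrite")
      .partitionBy("country")
      .parquet("s3a://lake/raw/customers/")
  }
}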
Ingesting Data Into The Data Lake

Batch Data Ingestion
• Ingest or replicate large amounts of data in batch mode into RDBMSs, files, cloud data warehouses, or Hadoop.
• Either build ingest mappings manually (a specialized ingestion process) or use Dynamic Mapping patterns to automate generation of large numbers of ingestion mappings.
• Or use the Mass Ingestion Service for initial and delta loads.

Change Data Capture
• Use change data capture to capture only changed data from a data source.
• A wide array of sources is available, ranging from mainframe to midrange to the most popular RDBMSs that support change capture.
• Ingest directly into the target database (using PowerCenter) or into a Kafka queue for further streaming processing (using PowerExchange for CDC with the Kafka Publisher).

Edge Data Streaming
• Collect data from streaming data sources such as HTTP streams, sensors, and log files, and push the data directly into the data lake or onto a Kafka queue (see the streaming sketch below).
• For data pushed onto Kafka, use Big Data Streaming to augment, enrich, validate, and cleanse the data and use it for streaming analytics.
• Generate events to trigger actions based on stream processing (e.g., next best offer in retail, or controlling industrial processes with predictive maintenance).
• Embed machine learning models using Python or Java.
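A minimal sketch of the streaming-ingestion path, assuming a Kafka topic as the source; brokers, topic, and paths are placeholders, and this generic Spark Structured Streaming job stands in for the Edge Data Streaming and Big Data Streaming products.

// Sketch of streaming ingestion: consume a Kafka topic and append it to the
// lake continuously. Broker address, topic, and paths are placeholders.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object StreamIngest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("StreamIngest").getOrCreate()

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "device-events")
      .load()
      .select(col("key").cast("string"), col("value").cast("string"), col("timestamp"))

    // Append micro-batches to the raw zone; the checkpoint gives exactly-once
    // file output across restarts.
    events.writeStream
      .format("parquet")
      .option("path", "s3a://lake/raw/device_events/")
      .option("checkpointLocation", "s3a://lake/_checkpoints/device_events/")
      .start()
      .awaitTermination()
  }
}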
High-Speed Mass Ingestion
Rely on an easy-to-use, fast, and scalable approach with no hand-coding

• Ingest data from various source systems, including relational tables, into cloud and big data stores
• Uses high-performance connectivity, mass ingestion, and dynamic mappings
• Self-service UI to perform mass ingestion of initial as well as incremental loads
Change Data Capture

Step 1, Bulk Movement: bulk movement of data from one or more sources to targets (Source Data → Target Database).
Step 2: incrementally take source changes and apply them to the target.
Step 3: continuously take source changes and apply them to the target.

A generic sketch of the apply-changes idea follows.
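// Generic sketch of Step 2 ("incrementally apply source changes to the
// target") in plain Spark: union the current target with a batch of captured
// changes and keep the newest row per key. This only illustrates the idea;
// it is NOT how PowerExchange CDC applies changes. It assumes both datasets
// share one schema with hypothetical columns account_id, change_ts (when the
// row version was produced), and op ("I"/"U"/"D").
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object ApplyChanges {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ApplyChanges").getOrCreate()

    val target  = spark.read.parquet("s3a://lake/enterprise/accounts/")
    val changes = spark.read.parquet("s3a://lake/raw/accounts_cdc/")

    // Newest version of each key wins; deletes drop the key entirely.
    val newestFirst = Window.partitionBy("account_id").orderBy(col("change_ts").desc)
    val applied = target.unionByName(changes)
      .withColumn("rn", row_number().over(newestFirst))
      .filter(col("rn") === 1 && col("op") =!= "D")
      .drop("rn")

    applied.write.mode("overwrite").parquet("s3a://lake/enterprise/accounts_v2/")
  }
}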
Enterprise Data Streaming

Capture and Ingest (Sense): Relational Systems feed changes through PowerExchange CDC Publisher; Machine Data/IoT, Sensor Data, WebLogs, and Social Media feed through Edge Data Streaming into a Message Hub (e.g., Kafka or Azure Event Hub).
Enrich, Process & Analyze (Reason): Big Data Streaming applies Filter, Transform, Aggregate, and Enrich steps (the streaming counterpart of Extract, Transform, Load).
Act: persist to the data lake, drive real-time dashboards and real-time offer alerts, and trigger business processes.
#3 Data Preparation
(The Intelligent Data Lake Blueprint diagram repeats here, with step 3, Data Preparation, highlighted.)
Data Preparation

Data preparation is the collaborative, self-service process by which data analysts discover and prepare data, rapidly turning raw data into insights with quality and governance.
Why Do We Need A Data Preparation Solution?

Business/Data Analysts
• Difficulty finding trusted data
• Limited access to the data
• Frustrated by slow response from IT
• Constrained by disparate tools and manual steps
• No way to collaborate, share, and update curated datasets, or reuse knowledge

IT/Data Engineers
• Can't cope with growing demand for data from the business
• No visibility into what the business is doing with the data
• Struggling to deliver value to the business
• Losing the ability to govern and manage data as an asset
Enterprise Data Preparation

Discover and Understand
• Discover data by using Enterprise Data Catalog to search for relevant data assets
• Understand the context and meaning of the data and see lineage to gain more trust in the data

Intelligently Prepare Data
• Prepare the data with an Excel-style visual data preparation module
• Supported by AI to make smarter decisions and improve productivity
• Visually validate the results directly in the preparation module
• Data preparation is a fully governed process, so newly added datasets are immediately added to the catalog with full lineage details available

Operationalize at Scale
• Operationalize the data preparation recipe for operation at scale
• Allow for visual inspection of data using integration with Zeppelin
• Save the recipe as a data pipeline that can be maintained and scheduled by the IT organization responsible for the data lake
Enterprise Data Preparation
Collaborative, self-service data discovery and preparation at scale

• Discover, search, and explore data assets using the AI-driven Enterprise Data Catalog
• Use an Excel-like interface for advanced data preparation to blend, transform, cleanse, enrich, and shape data, with hundreds of pre-built DQ rules
• Operationalize with self-service scheduling and reusable workflows with Spark support
• Visualize with Apache Zeppelin
Enterprise Data Preparation Steps for Data Prep

Search and Discover → Prepare → Publish → Operationalize → Schedule → Visualize
(plus Upload, Download, Import, Export, and Collaborate along the way)

Built on: Enterprise Data Catalog; Big Data Management, Big Data Quality, Big Data Masking; the Data Lake
Enterprise Data Preparation – Differentiators

Holistic Self-service
• Enterprise Data Catalog
• Advanced Data Wrangling
• Data Visualization Integration
• Operationalization for Business and IT Collaboration
• Spark support with Autoscaling
• Dynamic Data Masking support

CLAIRE
• Discovery, identification, and assessment for best data
• Next-best-action and data set recommendations
• Smart chart recommendations for data visualization

Advanced Data Prep
• Excel-like data preparation
• Extensibility with rules built in other tools
• Hundreds of pre-built data quality rules for validation, parsing, standardization, matching, and consolidation
Hands-on Lab
Data preparation with EDP
Leverage Enterprise Data Preparation to quickly discover, prepare, and deliver data for analytics.
#4 Data Curation
(The Intelligent Data Lake Blueprint diagram repeats here, with step 4, Data Curation, highlighted.)
Data Curation

Data curation is the process of ensuring data has the right structure and quality and is stored in the right format, so consumers can trust that the data is correct.
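A minimal curation sketch in Spark: profile completeness, standardize formats, and deduplicate before publishing as trusted data. The dataset and column names are illustrative assumptions, not the product's curation mappings.

// Minimal curation sketch: profile, standardize, and deduplicate a raw
// dataset before publishing it as trusted. Columns are illustrative.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, lit, lower, regexp_replace, trim}

object CurateCustomers {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CurateCustomers").getOrCreate()
    val raw = spark.read.parquet("s3a://lake/raw/customers/")

    // Profile: how complete is a key field?
    raw.select(count(lit(1)).as("rows"), count(col("email")).as("emails_present")).show()

    // Standardize and deduplicate.
    val curated = raw
      .withColumn("email", lower(trim(col("email"))))
      .withColumn("phone", regexp_replace(col("phone"), "[^0-9+]", ""))
      .dropDuplicates("customer_id")

    curated.write.mode("overwrite").parquet("s3a://lake/enterprise/customers/")
  }
}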
Informatica Big Data Management

Simple: graphical, no code; zero config, install, and footprint; best-of-breed data management
Robust: latest Spark enhancements, data science integration, better operations
Agnostic: agnostic to all end-to-end hybrid cloud big data ecosystems, with no regression
Leverage the Power of No-Code Interface

(The hand-coded SQL query and Spark code shown earlier are contrasted here with the equivalent visual BDM Mapping.)

Future-proof your investments: design once and run on a best-of-breed engine
Advanced Spark Support
Take advantage of the latest innovation, performance, and scaling benefits
Spark Structured Streaming Support
Handle streaming data based on event time
instead of processing time
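A minimal sketch of what event-time handling looks like in Spark Structured Streaming: a windowed aggregate keyed on the timestamp embedded in each event, with a watermark bounding how late data may arrive. The topic, brokers, and JSON schema are assumptions.

// Event-time aggregation sketch: group by when the event occurred, not when
// it arrived; the watermark bounds how long late events are accepted.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col, from_json, window}
import org.apache.spark.sql.types._

object EventTimeAgg {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("EventTimeAgg").getOrCreate()
    val schema = new StructType()
      .add("device_id", StringType).add("reading", DoubleType).add("event_ts", TimestampType)

    val readings = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "sensor-readings")
      .load()
      .select(from_json(col("value").cast("string"), schema).as("e"))
      .select("e.*")

    readings
      .withWatermark("event_ts", "10 minutes")            // tolerate 10 min of lateness
      .groupBy(window(col("event_ts"), "5 minutes"), col("device_id"))
      .agg(avg("reading").as("avg_reading"))
      .writeStream
      .outputMode("append")
      .format("console")
      .start()
      .awaitTermination()
  }
}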
Azure Databricks Support
Leverage the compute power of Databricks on
Azure for big data processing
Integrated Data Science Support
Operationalize machine learning models with
Python transformations
Schema Drift Handling
Handle complex structures and their changes for both batch and streaming data (see the sketch below)
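One common schema-drift tactic in plain Spark, shown as a hedged sketch: merge Parquet schemas across files so columns added by the source appear automatically, with nulls for older rows. This illustrates the problem space rather than the product's drift-handling mechanism; the path is a placeholder.

// Sketch: read Parquet written both before and after the source added a
// column; the unified schema is the union of all file schemas.
import org.apache.spark.sql.SparkSession

object SchemaDriftRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SchemaDriftRead").getOrCreate()

    val df = spark.read
      .option("mergeSchema", "true")
      .parquet("s3a://lake/raw/device_events/")

    df.printSchema() // older rows simply carry nulls in the new columns
  }
}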
Operational Insights
Deliver predictive operational insights about your
big data environments
Demo
Processing data at scale with
Big Data Management
#5 Stream Processing & Analytics
(The Intelligent Data Lake Blueprint diagram repeats here, with step 5, Stream Processing & Analytics, highlighted.)
Streaming Data Management

Streaming enables customers to design data flows that continuously capture, prepare, and process streams of unbounded data. It provides a single platform for customers to discover insights and build models that can then be operationalized to run in near real time, capturing and realizing the value of high-velocity data.
Streaming Data Management

Streaming Data Ingestion: collect streaming data from various streaming and IoT endpoints, with multi-latency ingestion into the lake or a messaging hub
Streaming Data Enrichment: enrich and distribute streaming data in real time for business-user consumption
Real-time Actions: operationalize actions based on insights from streaming data
Stream Processing and Analytics Use Cases (5)

Predictive Maintenance & Smart Factory: identify stress signals coming from devices and act on them before it's too late
Real-time offer management: combine web searches and camera feeds to identify the customer and roll out real-time offers
Clinical Research Optimization: collect and process bedside monitor data so clinical researchers can more effectively understand and detect disease
Real-time customer KPI generation: use KPI metrics to retain churning customers and make offers
Enterprise Streaming Data Management
Solution overview

Capture and Ingest (Sense): Relational Systems feed changes through PowerExchange CDC Publisher; Machine Data/IoT, Sensor Data, WebLogs, and Social Media feed through Edge Data Streaming and IICS Streaming Ingestion (New!) into a Message Hub (e.g., Azure Event Hub).
Enrich, Process and Analyze (Reason): Big Data Streaming applies Filter, Transform, Aggregate, and Enrich steps.
Act: persist to the data lake, drive real-time dashboards, send real-time offer alerts and SMS, and trigger business processes.
Edge Data Streaming
Highly scalable stream data collection and ingestion

• Distributed and broker-less, with lightweight agents for high-performance event ingestion
• Flexible and wide connectivity, with out-of-the-box compression, encryption, and transformations
• Easy administration, configuration, monitoring, and auto-deployment

Sources: flat files, JMS, Syslog, TCP/UDP, HTTP, WebSocket, Ultra Messaging, MQTT, OPC-DA
Targets: Kafka (also Kerberized), HDFS (CDH, HDP, MapR), Cassandra NoSQL, WebSocket(S), Amazon Kinesis, Azure Event Hub, JMS, MQTT
Transformations: RegEx filtering, timestamp, insert string (plus custom, via the SDK)
Informatica Big Data Streaming
Continuous event processing of unbounded big data

Real-time, data-centric sources: machine and device data, cloud, documents and emails, relational and mainframe, social media, web logs. Message-centric sources: PWX CDC, EDS, Kafka, JMS, MapR, cloud.

BDS provides stream handling and event processing, delivering to RDBMS, HDFS, HBase, cloud, Kafka, JMS, MapR, EDS, and EMR targets, on the Informatica Intelligent Data Platform alongside batch data integration, lineage, governance, and security.

Outcomes: improve asset utilization, increase operational efficiency, reduce security and safety risk.

• Zero code: streaming ingestion and integration with Apache Kafka and Spark Streaming
• Flexible and agnostic: supports all on-premises, hybrid, or full-cloud Hadoop distributions
• Integrated: sliding and tumbling windows for moving averages, Python for machine learning, and more
Informatica Streaming – Key Differentiators

• Metadata-based design and real-time enrichments
• Engine abstraction
• Stream data ingestion and processing in cloud
• Out-of-the-box connectivity to on-prem and cloud endpoints
• ML model operationalization
• Parsing of complex unstructured data
Best Practices
Best Practices for Cloud Data Lake Management

• Ensure you apply data governance and security policies to protect sensitive data
• Leverage AI/machine learning to enhance the productivity of all users of the platform
• Empower collaboration so the data lake is "everyone's lake"
• Catalog your data; prevent the data lake from becoming a swamp
• Integrate data pipeline development into your CI/CD/DevOps flow
• Curate and cleanse data for consumption to increase trust
Parting thoughts

"Do the difficult things while they are easy and do the great things while they are small. A journey of a thousand miles must begin with a single step."
- Lao Tzu

More Related Content

PPTX
Customer-Centric Data Management for Better Customer Experiences
PPTX
Azure Security Center- Zero to Hero
PDF
Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021
PPTX
Data Leakage Prevention
PPTX
Microsoft Azure
PPTX
ExpertsLive NL 2022 - Microsoft Purview - What's in it for my organization?
PDF
Data Quality Best Practices
PPTX
Disaster Recovery Planning
Customer-Centric Data Management for Better Customer Experiences
Azure Security Center- Zero to Hero
Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021
Data Leakage Prevention
Microsoft Azure
ExpertsLive NL 2022 - Microsoft Purview - What's in it for my organization?
Data Quality Best Practices
Disaster Recovery Planning

What's hot (20)

PPTX
Chapter 4: Data Architecture Management
PDF
Data Architecture - The Foundation for Enterprise Architecture and Governance
PDF
Ebook - The Guide to Master Data Management
PDF
Artifacts to Enable Data Goverance
PPTX
Azure Fundamentals Part 1
 
PPTX
Azure Site Recovery
PPTX
Introduction to Azure monitor
PDF
Choosing the Right Cloud Provider
PDF
Mobile Backend as a Service(MBaaS)
PPTX
You Need a Data Catalog. Do You Know Why?
PDF
Microsoft Azure Security Overview
PPTX
Cloud computing and migration strategies to cloud
PPTX
Understanding cloud with Google Cloud Platform
PPTX
Aws storage
PDF
Microsoft Defender and Azure Sentinel
PDF
2 08 client-server architecture
PPTX
Deep dive into Microsoft Purview Data Loss Prevention
PPTX
CCS336-Cloud-Services-Management-Lecture-Notes-1.pptx
PDF
[금융고객을 위한 Resiliency in the Cloud] 최근 대규모 장애 사태 여파에 따른 DR 도...
PDF
Introduction to Google Cloud Platform
Chapter 4: Data Architecture Management
Data Architecture - The Foundation for Enterprise Architecture and Governance
Ebook - The Guide to Master Data Management
Artifacts to Enable Data Goverance
Azure Fundamentals Part 1
 
Azure Site Recovery
Introduction to Azure monitor
Choosing the Right Cloud Provider
Mobile Backend as a Service(MBaaS)
You Need a Data Catalog. Do You Know Why?
Microsoft Azure Security Overview
Cloud computing and migration strategies to cloud
Understanding cloud with Google Cloud Platform
Aws storage
Microsoft Defender and Azure Sentinel
2 08 client-server architecture
Deep dive into Microsoft Purview Data Loss Prevention
CCS336-Cloud-Services-Management-Lecture-Notes-1.pptx
[금융고객을 위한 Resiliency in the Cloud] 최근 대규모 장애 사태 여파에 따른 DR 도...
Introduction to Google Cloud Platform
Ad

Similar to How to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics (20)

PPTX
Deploying a Governed Data Lake
PPTX
From raw data to business insights. A modern data lake
PDF
IBM Cloud Day January 2021 - A well architected data lake
PDF
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
PPTX
Top Trends in Building Data Lakes for Machine Learning and AI
PDF
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...
PPTX
Sören Eickhoff, Informatica GmbH, "Informatica Intelligent Data Lake – Self S...
PDF
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
PDF
Big Data, Ingeniería de datos, y Data Lakes en AWS
PDF
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
PPTX
Hadoop and Your Data Warehouse
PDF
When and How Data Lakes Fit into a Modern Data Architecture
PPTX
Data modeling trends for analytics
PDF
The Marriage of the Data Lake and the Data Warehouse and Why You Need Both
PDF
Building a Modern Data Platform in the Cloud. AWS Initiate Portugal
PDF
Big Data & Analytics - Innovating at the Speed of Light
PDF
Building a modern data platform on AWS. Utrecht AWS Dev Day
PDF
Big data and Analytics on AWS
PDF
PDF
From ingest to insights with AWS
Deploying a Governed Data Lake
From raw data to business insights. A modern data lake
IBM Cloud Day January 2021 - A well architected data lake
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
Top Trends in Building Data Lakes for Machine Learning and AI
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...
Sören Eickhoff, Informatica GmbH, "Informatica Intelligent Data Lake – Self S...
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWS
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
Hadoop and Your Data Warehouse
When and How Data Lakes Fit into a Modern Data Architecture
Data modeling trends for analytics
The Marriage of the Data Lake and the Data Warehouse and Why You Need Both
Building a Modern Data Platform in the Cloud. AWS Initiate Portugal
Big Data & Analytics - Innovating at the Speed of Light
Building a modern data platform on AWS. Utrecht AWS Dev Day
Big data and Analytics on AWS
From ingest to insights with AWS
Ad

Recently uploaded (20)

PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PDF
CIFDAQ's Market Insight: SEC Turns Pro Crypto
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PPTX
Cloud computing and distributed systems.
PDF
Unlocking AI with Model Context Protocol (MCP)
PPTX
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
PDF
KodekX | Application Modernization Development
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PPT
“AI and Expert System Decision Support & Business Intelligence Systems”
PDF
Network Security Unit 5.pdf for BCA BBA.
PPT
Teaching material agriculture food technology
PDF
Machine learning based COVID-19 study performance prediction
PDF
NewMind AI Weekly Chronicles - August'25 Week I
20250228 LYD VKU AI Blended-Learning.pptx
Spectral efficient network and resource selection model in 5G networks
Chapter 3 Spatial Domain Image Processing.pdf
CIFDAQ's Market Insight: SEC Turns Pro Crypto
Reach Out and Touch Someone: Haptics and Empathic Computing
Digital-Transformation-Roadmap-for-Companies.pptx
Advanced methodologies resolving dimensionality complications for autism neur...
Cloud computing and distributed systems.
Unlocking AI with Model Context Protocol (MCP)
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
KodekX | Application Modernization Development
The Rise and Fall of 3GPP – Time for a Sabbatical?
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
Diabetes mellitus diagnosis method based random forest with bat algorithm
“AI and Expert System Decision Support & Business Intelligence Systems”
Network Security Unit 5.pdf for BCA BBA.
Teaching material agriculture food technology
Machine learning based COVID-19 study performance prediction
NewMind AI Weekly Chronicles - August'25 Week I

How to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics

  • 2. 2 © Informatica. Proprietary and Confidential.2 Agenda Introduction Next Gen Analytics Use Cases Architecting for Intelligent Cloud Data Lakes Hands-On Lab Best Practices for implementing a data lake Final Q & A, Wrap-up
  • 4. Technology Transformation TABLES Azure Data Lake Store Azure Blob Storage Google Compute Engine Azure HDInsight Amazon Web Services EC2 & EMP Storage Analytics Applications Databases Messaging Compute
  • 5. Technology Transformation TABLES Azure Data Lake Store Azure Blob Storage Google Compute Engine Azure HDInsight Amazon Web Services EC2 & EMP ADLS Gen 2 Google Cloud Storage AWS S3 Serverless Google Dataproc Altus Data Warehouse Azure Cosmos DB AWS Neptune Storage Analytics Applications Databases Messaging Compute AI/ML Workspace Data Science Workbench Azure ML
  • 7. 7 © Informatica. Proprietary and Confidential.7 Technology Transformation Data Latency Users Regulations Cloud Spark Code package main.scala import org.apache.spark.sql.DataFrame import org.apache.spark.SparkContext import org.apache.spark.sql.functions.sum import org.apache.spark.sql.functions.udf /** * TPC-H Query 3 * */ class Q03 extends TpchQuery { override def execute(sc: SparkContext, schemaProvider: TpchSchemaProvider): DataFrame = { // this is used to implicitly convert an RDD to a DataFrame. val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.implicits._ import schemaProvider._ val decrease = udf { (x: Double, y: Double) => x * (1 - y) } val fcust = customer.filter($"c_mktsegment" === "BUILDING") val forders = order.filter($"o_orderdate" < "1995-03-15") val flineitems = lineitem.filter($"l_shipdate" > "1995-03-15") fcust.join(forders, $"c_custkey" === forders("o_custkey")) .select($"o_orderkey", $"o_orderdate", $"o_shippriority") .join(flineitems, $"o_orderkey" === flineitems("l_orderkey")) .select($"l_orderkey", decrease($"l_extendedprice", $"l_discount").as("volume"), $"o_orderdate", $"o_shippriority") .groupBy($"l_orderkey", $"o_orderdate", $"o_shippriority") .agg(sum($"volume").as("revenue")) .sort($"revenue".desc, $"o_orderdate") .limit(10) } } SQL Query select l_orderkey, sum(l_extendedprice * (1 - l_discount)) as revenue, o_orderdate, o_shippriority from CUSTOMER, ORDERS, LINEITEM where c_mktsegment = 'AUTOMOBILE' and c_custkey = o_custkey and l_orderkey = o_orderkey and o_orderdate < date '1995-03-13' and l_shipdate > date '1995-03-13' group by l_orderkey, o_orderdate, o_shippriority order by revenue desc, o_orderdate limit 10;
  • 8. 8 © Informatica. Proprietary and Confidential.8 Disruptors Data Latency Users Regulations Cloud
  • 9. 9 © Informatica. Proprietary and Confidential.9 Data Management Accelerators Data Latency Users Regulations Cloud ModernDataIntegrationPatterns Metadata AI/ML Governance Hybrid
  • 10. 10 © Informatica. Proprietary and Confidential. Fragmented Approaches Create More Technical Debt And Complexity Can your team afford to be on call to support and maintain these manual, one-off, time-consuming, and complex approaches? Integrate Hand-coding Hand-coding Protect PrepareCatalog Stream Hand-coding Match Hand-coding Enrich Hand-coding
  • 11. End-to-End Data Management for Next-Gen Analytics ANY DATA ANY REGULATION ANY USER ANY CLOUD / ANY TECHNOLOGY ANY LATENCY METADATA GOVERNANCE INGEST STREAM INTEGRATE CLEANSE PREPARE DEFINE CATALOG RELATE PROTECT DELIVERENRICH HYBRID MODERN DATA INTEGRATION PATTERNS
  • 13. 13 © Informatica. Proprietary and Confidential. Elastic compute S3, Redshift, … Blob, ADLS, … Fully Automated Cloud Deployment Managed Serverless Deployment Compute as a Service Storage as a Service The Ever-Evolving Big Data Technology Big Data 3.0 On-premises Manual Deployment NoSQL HDFS Map Reduce Big Data 1.0 Hosted Manual Deployment YARN, SQOOP, … HDFS Hive MapR FS Big Data 2.0
  • 14. 14 © Informatica. Proprietary and Confidential. Big Data 2.0: On-Premises Big Data Deployment Select Hadoop distro Configure cluster Monitor cluster Manually scale cluster • Requires Big Data skills • Requires Hadoop admins • High operational cost Set up hardware Design data flows
  • 15. 15 © Informatica. Proprietary and Confidential. Big Data 3.0: Serverless Deployment with Integration at Scale Configure cluster Monitor cluster Set up hardware Select Hadoop distro Design data flows Auto scale cluster
  • 16. 16 © Informatica. Proprietary and Confidential.16 Cloud is a Huge Leap Forward for Data Lakes 1. Separation of data compute and storage 2. Low-cost, infinite-scale out data persistence 3. Fit-for-purpose, scale-on-demand compute engines 1. Tightly coupled data compute and storage 2. Expansions are slow to provision 3. Contention for compute capacity with highly variable compute profiles Agility Inhibitors Agility Enablers On-premises Data Lakes Cloud Data Lakes
  • 17. 17 © Informatica. Proprietary and Confidential.17 Enhanced Cloud Ecosystem Support Choose your cloud confidently for processing Big Data workloads BigQueryCloud Storage Dataproc Bigtable Cloud Datastore Cloud SQL HDInsightBlob StorageCosmoDBADLS SQL DW Event Hubs Azure Databricks EMRS3 Redshift Kinesis Firehose Kinesis
  • 18. © Informatica. Proprietary and Confidential.1818 ANALYTICS DATA MANAGEMENT INFRASTRUCTURE & STORAGE Where Does Informatica Fit? Ingest Cleanse Secure Catalog PrepareIntegrate Govern
  • 19. © Informatica. Proprietary and Confidential.1919 Redshift S3 Input bucket EMR / Qubole S3 Output bucket RELATIONAL DEVICE DATA WEBLOGS Cloud Ready Reference Architecture Amazon AWS CATALOG SEARCH LINEAGE RECOMMENDATIONSPARSE MATCH ACQUIRE INGEST PREPARE CATALOG SECURE GOVERN ACCESS CONSUME
  • 20. © Informatica. Proprietary and Confidential.2020 RELATIONAL DEVICE DATA WEBLOGS Cloud Ready Reference Architecture Microsoft Azure CATALOG SEARCH LINEAGE RECOMMENDATIONSPARSE MATCH ACQUIRE INGEST PREPARE CATALOG SECURE GOVERN ACCESS CONSUME SQL Data Warehouse ADLS / BLOB HDInsight / Azure Databricks ADLS / BLOB
  • 21. © Informatica. Proprietary and Confidential.2121 RELATIONAL DEVICE DATA WEBLOGS Cloud Ready Reference Architecture Google Cloud CATALOG SEARCH LINEAGE RECOMMENDATIONSPARSE MATCH ACQUIRE INGEST PREPARE CATALOG SECURE GOVERN ACCESS CONSUME Big Query Cloud Storage Dataproc Cloud Storage
  • 22. © Informatica. Proprietary and Confidential.2222 Steps to Build a Data Lake Catalog, Govern and Secure Data Ingestion Data Preparation Data Curation Stream Processing and Analytics 1 2 3 4 5 Data Delivery6
  • 23. 23 © Informatica. Proprietary and Confidential.23 Designing A Data Lake – High Level Architecture Sources LANDING RAW CURATION Curated Structured Cleansed ADVANCED ANALYTICS Models DATA CATALOG DATA QUALITY & GOVERNANCE DATA PRIVACY AND PROTECTION DATA INFRASTRUCTURE Data Products DISCOVERY ZONE Self-Service Analytics ENTERPRISE ZONE Enriched Formatted Consumption Ready STREAM PROCESSING & ANALYTICS
  • 24. Intelligent Data Lake Blueprint FastLane Machine Data Cloud MobileSocialLog files Apps Data Warehouse Databases Application Servers Documents Mainframe Cloud Batch/Mass Ingest ChangeData Capture EdgeData Collect/Stream Data Curation Integrate ParseTransform MaskCleanse MatchProfile Monitor Recommend Collaborate OperationalizeDiscover Prepare Publish Monitor Data Preparation Explore BatchLane Batch Sources Streaming Sources Data Engineer Data Analyst Stream Processing & Analytics StandardizeAnalyze CleanseProcess Deliver Data Engineer Ingest Publish/ Subscribe Deliver Ingest Data Warehouse Master Data Management Advanced Analytics Historical Analysis Machine Learning Real-Time Visualization Alerts Business Process Automation Metadata Intelligence Foundation Data Steward Data Catalog Data Governance & Quality Data Privacy & Protection Discover Classify Relationship Lineage Data Statistics Standards Policies Procedures Quality Integrity Stewardship Analyze Identify Classify Detect Risk Score User Behavior Analysis 1 2 3 4 5
  • 25. #1 Supporting Layer of a Data Lake
  • 26. 26 © Informatica. Proprietary and Confidential.26 © Informatica. Proprietary and Confidential. Supporting Layer of A Data Lake1 • Analysts spend countless hours finding the right data assets for their analysis. • A properly governed catalog (that is a catalog that’s been augmented by business definitions and quality information) greatly reduces this time to discover Data Catalog • You can only trust your reports and analysis if you trust the underlying data • Governance provides the trust in the data by showing where data came from and how it’s been transformed • And by adding quality rules and the associated scores to the catalog Data Governance & Quality • Data governance provides the necessary controls on who can access which data in the data lake. • It provides input for data security mechanisms (such as authorization or masking) to prevent mis-use of data. Data Privacy & Protection
  • 27. Automate Data Discovery, Cataloging, and Linking Leverage ML and AI to find critical data across structured and unstructured sources Onboard discovered data automatically with oversight and control Automatically tag data with business context to help users assess relevance
  • 28. Automate Lineage Discovery and Process Mapping Automatically produce data lineage from scanned and discovered data movement Onboard lineage and process mapping automatically, with oversight and control Understand where data comes from, how and where it’s used, and who’s responsible for it
  • 29. Automate Fit for Purpose Data Quality Automate rule generation from policy definitions and metadata Automate rule enforcement in systems and business processes Monitor quality improvement across systems and processes over time
  • 30. Automate Data Subject Discovery, Proliferation, Consent and Protection Define policies to protect sensitive data, assign ownership and accountability Automate discovery and classification of sensitive data, across structured and unstructured sources Automate identification of data subjects, and assessment of risk exposure
  • 31. © Informatica. Proprietary and Confidential.3131 Demonstration Enterprise Data Catalog
  • 33. Intelligent Data Lake Blueprint FastLane Machine Data Cloud MobileSocialLog files Apps Data Warehouse Databases Application Servers Documents Mainframe Cloud MassIngest ChangeData Capture EdgeData Collect Data Curation Integrate ParseTransform MaskCleanse MatchProfile Monitor Recommend Collaborate OperationalizeDiscover Prepare Publish Monitor Data Preparation Explore BatchLane Batch Sources Streaming Sources Data Engineer Data Analyst Stream Processing & Analytics StandardizeAnalyze CleanseProcess Deliver Data Engineer Ingest Publish/ Subscribe Deliver Ingest Data Warehouse Master Data Management Advanced Analytics Historical Analysis Machine Learning Real-Time Visualization Alerts Business Process Automation Metadata Intelligence Foundation Data Steward Data Catalog Data Governance & Quality Data Privacy & Protection Discover Classify Relationship Lineage Data Statistics Standards Policies Procedures Quality Integrity Stewardship Analyze Identify Classify Detect Risk Score User Behavior Analysis 1 4 3 5 2
  • 34. 34 © Informatica. Proprietary and Confidential. Data Ingestion Data ingestion is the process of collecting data from source systems or source locations, either in a batch mode or in a real-time/streaming and load this data into the data lake.
  • 35. 35 © Informatica. Proprietary and Confidential.35 © Informatica. Proprietary and Confidential. Ingesting Data Into The Data Lake • Ingest or replicate large amounts of data in batch mode into RDBMS/files/cloud DW/Hadoop. • Either manual build of ingest mapping (specialized ingestion process) or using Dynamic Mapping patterns to automate generation of large amounts of ingestion mappings. • Or use Mass Ingestion Service for initial- and delta loads Batch Data Ingestion • Use change data capture to capture only changed data from a data source • Wide array of sources available ranging from mainframe to midrange to most popular RDBMS’s that support change capture. • Ingest directly into target database (using PowerCenter) or into Kafka queue for further streaming processing (using Power Exchange for CDC for Kafka Publisher) • Collect data from streaming data sources like HTTP streams, sensors, log files etc and push data directly into Data Lake or onto Kafka queue • For data pushed onto Kafka use Big Data Streaming to augment/enrich/validate/cleanse the data and use it for streaming analytics. • Generate events to trigger actions based on stream processing. (e.g. next best offer in retail, control industry processes with predictive maintenance) • Embed Machine Learning models using Python or Java, Edge Data Streaming 2 Change Data Capture
  • 36. 36 © Informatica. Proprietary and Confidential.36 © Informatica. Proprietary and Confidential. High-Speed Mass Ingestion Rely on easy to use, fast, and scalable approach – no hand-coding Ingest data from various source systems including relational tables into Cloud and Big Data Uses high-performance connectivity, mass ingestion, and dynamic mappings Self-service UI to perform mass ingestion of initial as well as incremental loads
  • 37. 37 © Informatica. Proprietary and Confidential.37 Change Data Capture Bulk Movement Bulk movement of data from one or more sources to targets Source Data Target Database Incrementally take source changes and apply them to target Source Data Target Database Continuously take source changes and apply them to target Source Data Target Database Step 1 Step 2 Step 3
  • 38. 38 © Informatica. Proprietary and Confidential.38 Enterprise Data Streaming Big Data Streaming Real time offer Alert Capture and Ingest Enrich, Process & Analyze Relational Systems Edge Data Streaming Real time dashboard Machine Data / IOT Sensor Data WebLogs Social Media Power Exchange CDC Publisher Message Hub Persist /Data Lake Trigger Business processes Changes Extract Transform LoadSense Reason Act Azure Event Hub Filter Transform Aggregate Enrich
  • 40. Intelligent Data Lake Blueprint FastLane Machine Data Cloud MobileSocialLog files Apps Data Warehouse Databases Application Servers Documents Mainframe Cloud MassIngest ChangeData Capture EdgeData Collect Data Curation Integrate ParseTransform MaskCleanse MatchProfile Monitor Recommend Collaborate OperationalizeDiscover Prepare Publish Monitor Data Preparation Explore BatchLane Batch Sources Streaming Sources Data Engineer Data Analyst Stream Processing & Analytics StandardizeAnalyze CleanseProcess Deliver Data Engineer Ingest Publish/ Subscribe Deliver Ingest Data Warehouse Master Data Management Advanced Analytics Historical Analysis Machine Learning Real-Time Visualization Alerts Business Process Automation Metadata Intelligence Foundation Data Steward Data Catalog Data Governance & Quality Data Privacy & Protection Discover Classify Relationship Lineage Data Statistics Standards Policies Procedures Quality Integrity Stewardship Analyze Identify Classify Detect Risk Score User Behavior Analysis 1 4 4 5 2 Recommend Collaborate OperationalizeDiscover Prepare Publish Monitor Data Preparation Explore Data Analyst 3
  • 41. 41 © Informatica. Proprietary and Confidential. Data Preparation Data preparation is the collaborative self-service process of discovering and preparing data by data analysts to rapidly turn raw data into insights with quality and governance
  • 42. 42 © Informatica. Proprietary and Confidential.42 © Informatica. Proprietary and Confidential. Why Do We Need A Data Preparation Solution? • Difficulty finding trusted data • Limited access to the data • Frustrated by slow response from IT • Constrained by disparate tools, manual steps • No way to collaborate, share, and update curated datasets, reuse knowledge • Can’t cope with growing demand for data from the business • No visibility into what the business is doing with the data • Struggling to deliver value to the business • Losing the ability to govern and manage data as an asset Business/Data Analysts IT/Data Engineers
  • 43. 43 © Informatica. Proprietary and Confidential.43 © Informatica. Proprietary and Confidential. Enterprise Data Preparation • Discover data by using Enterprise Data Catalog to search for relevant data assets • Understand the context and meaning of the data an see lineage to gain more trust in the data Discover and Understand • Prepare the data with an Excel style visual data preparation module • Supported by AI to make smarter decisions and improve productivity • Visually validate the results directly in the preparation module • Data preparation process is a fully governed process so newly added datasets are immediately added to the catalog and have full lineage details available. • Operationalize that data preparation recipe for operation at scale. • Allow for visual inspection of data using integration with Zeppelin. • Save the recipe as a data pipeline that can be maintained and scheduled by the IT organization responsible for the data lake. Operationalize at scale 4 Intelligently Prepare Data
  • 44. 44 © Informatica. Proprietary and Confidential. Enterprise Data Preparation Collaborative Self-service data discovery and preparation at scale Discover, search, and explore data assets using AI driven Enterprise Data Catalog Use Excel-like interface for Advanced data preparation to blend, transform, cleanse, enrich, shape and use 100s of pre-built DQ rules Operationalize with Self-service scheduling and re-usable workflows with Spark support Visualize with Apache Zeppelin
  • 45. 45 © Informatica. Proprietary and Confidential.45 © Informatica. Proprietary and Confidential. Enterprise Data Preparation Steps for Data Prep Search and Discover Prepare Publish Visualize Operational -ize Schedule Upload Download Import Export Collaborate Enterprise Data Catalog Big Data Management, Big Data Quality, Big Data Masking Data Lake
  • 46. © Informatica. Proprietary and Confidential.4646 Enterprise Data Preparation – Differentiators • Enterprise Data Catalog • Advanced Data Wrangling • Data Visualization Integration • Operationalization for Business and IT Collaboration • Spark support with Autoscaling • Dynamic Data Masking support Holistic Self-service • Discovery, Identification and assessment for best data • Next-Best-Action & data set Recommendations • Smart Chart Recommendations for Data Visualization CLAIRE • Excel-like data preparation • Extensibility with Rules built in other tools • 100s of Pre-built Data Quality Rules for validation, parsing, standardization, matching and consolidation Advanced Data Prep
  • 47. © Informatica. Proprietary and Confidential.4747 Hands-on Lab Data preparation with EDP Leverage Enterprise data preparation to quickly discover, prepare and deliver data for analytics.
  • 49. Intelligent Data Lake Blueprint FastLane Machine Data Cloud MobileSocialLog files Apps Data Warehouse Databases Application Servers Documents Mainframe Cloud MassIngest ChangeData Capture EdgeData Collect Data Curation Integrate ParseTransform MaskCleanse MatchProfile Monitor Recommend Collaborate OperationalizeDiscover Prepare Publish Monitor Data Preparation Explore BatchLane Batch Sources Streaming Sources Data Engineer Data Analyst Stream Processing & Analytics StandardizeAnalyze CleanseProcess Deliver Data Engineer Ingest Publish/ Subscribe Deliver Ingest Data Warehouse Master Data Management Advanced Analytics Historical Analysis Machine Learning Real-Time Visualization Alerts Business Process Automation Metadata Intelligence Foundation Data Steward Data Catalog Data Governance & Quality Data Privacy & Protection Discover Classify Relationship Lineage Data Statistics Standards Policies Procedures Quality Integrity Stewardship Analyze Identify Classify Detect Risk Score User Behavior Analysis 1 3 3 5 2 Data Curation Integrate ParseTransform MaskCleanse MatchProfile Monitor 4 Data Engineer
  • 50. 50 © Informatica. Proprietary and Confidential. Data Curation Data curation is the process of ensuring data has the right structure, quality and is stored in the right format to provide trust so consumers can be assured the data is correct.
  • 51. © Informatica. Proprietary and Confidential.5151 Informatica Big Data Management Graphical, no code, Zero config, install & footprint Best of breed data management Simple Latest Spark enhancements, Data Science integration, Better Operations Robust Agnostic to all end-to-end Hybrid Cloud Big Data eco- systems with no regression Agnostic
  • 52. 52 © Informatica. Proprietary and Confidential.52 select l_orderkey, sum(l_extendedprice * (1 - l_discount)) as revenue, o_orderdate, o_shippriority from CUSTOMER, ORDERS, LINEITEM where c_mktsegment = 'AUTOMOBILE' and c_custkey = o_custkey and l_orderkey = o_orderkey and o_orderdate < date '1995-03-13' and l_shipdate > date '1995-03-13' group by l_orderkey, o_orderdate, o_shippriority order by revenue desc, o_orderdate limit 10; SQL Query Leverage the Power of No-Code Interface Spark Code package main.scala import org.apache.spark.sql.DataFrame import org.apache.spark.SparkContext import org.apache.spark.sql.functions.sum import org.apache.spark.sql.functions.udf /** * TPC-H Query 3 * */ class Q03 extends TpchQuery { override def execute(sc: SparkContext, schemaProvider: TpchSchemaProvider): DataFrame = { // this is used to implicitly convert an RDD to a DataFrame. val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.implicits._ import schemaProvider._ val decrease = udf { (x: Double, y: Double) => x * (1 - y) } val fcust = customer.filter($"c_mktsegment" === "BUILDING") val forders = order.filter($"o_orderdate" < "1995-03-15") val flineitems = lineitem.filter($"l_shipdate" > "1995-03-15") fcust.join(forders, $"c_custkey" === forders("o_custkey")) .select($"o_orderkey", $"o_orderdate", $"o_shippriority") .join(flineitems, $"o_orderkey" === flineitems("l_orderkey")) .select($"l_orderkey", decrease($"l_extendedprice", $"l_discount").as("volume"), $"o_orderdate", $"o_shippriority") .groupBy($"l_orderkey", $"o_orderdate", $"o_shippriority") .agg(sum($"volume").as("revenue")) .sort($"revenue".desc, $"o_orderdate") .limit(10) } } BDM Mapping Future proof your investments, design once and run on best-of-breed engine
  • 53. 53 © Informatica. Proprietary and Confidential. Advanced Spark Support
Take advantage of the latest innovation, performance, and scaling benefits.
  • 54. 54 © Informatica. Proprietary and Confidential. Spark Structured Streaming Support
Handle streaming data based on event time instead of processing time; a minimal sketch follows.
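A minimal sketch of what event-time handling means in Spark Structured Streaming: readings are grouped by when they occurred rather than when they arrived, and a watermark bounds how long late data is accepted. This is generic Spark code under assumed names (broker address, topic, columns), not the product's internal implementation; it requires the spark-sql-kafka connector on the classpath.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, window}

val spark = SparkSession.builder().appName("event-time").getOrCreate()

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "sensor-events")
  .load()
  // Illustrative: a real pipeline would usually parse the event time out of the payload.
  .selectExpr("CAST(value AS STRING) AS payload", "timestamp AS event_time")

val counts = events
  .withWatermark("event_time", "10 minutes")       // accept data up to 10 minutes late
  .groupBy(window(col("event_time"), "5 minutes")) // tumbling 5-minute event-time windows
  .count()

counts.writeStream
  .outputMode("update")
  .format("console")
  .start()
  .awaitTermination()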
  • 55. 55 © Informatica. Proprietary and Confidential. Azure Databricks Support
Leverage the compute power of Databricks on Azure for big data processing.
  • 56. 56 © Informatica. Proprietary and Confidential. Integrated Data Science Support
Operationalize machine learning models with Python transformations; a sketch of the general scoring pattern follows.
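The product slide refers to Python transformations (e.g., embedding a scikit-learn model in a mapping). To keep this deck's code in one language, here is a hedged Scala analogue of the same idea: a pre-trained model is loaded and applied inline as one step of a data pipeline. The model path and column names are illustrative assumptions.

import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("scoring").getOrCreate()

// Load a model trained offline by a data scientist (illustrative path).
val model = PipelineModel.load("/models/churn_pipeline")

val customers = spark.read.parquet("/lake/curated/customers")

// Append the model's prediction column and publish scores for downstream use.
model.transform(customers)
  .write.mode("overwrite")
  .parquet("/lake/delivered/churn_scores")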
  • 57. 57 © Informatica. Proprietary and Confidential. Schema Drift Handling
Handle complex structures and their changes for both batch and streaming data; a minimal sketch follows.
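One common way schema drift is tolerated in plain Spark, shown as a minimal sketch: when Parquet files in a folder were written with evolving schemas, merging them lets a column added later simply surface as null in older records. The path is an illustrative assumption, and the product handles drift declaratively rather than through this option.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("schema-drift").getOrCreate()

val events = spark.read
  .option("mergeSchema", "true") // union the schemas of all files being read
  .parquet("/lake/raw/events")

events.printSchema() // fields added by newer producers appear without code changes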
  • 58. 58 © Informatica. Proprietary and Confidential. Operational Insights
Deliver predictive operational insights about your big data environments.
  • 59. © Informatica. Proprietary and Confidential. 59 Demo: Processing data at scale with Big Data Management
  • 61. Intelligent Data Lake Blueprint (repeated from slide 49, now highlighting lane 5): streaming sources are ingested by a data engineer into Stream Processing & Analytics, which analyzes, standardizes, cleanses, processes, and delivers the data.
  • 62. 62 © Informatica. Proprietary and Confidential. Streaming Data Management
Streaming data management lets customers design data flows that continuously capture, prepare, and process streams of unbounded data. It provides a single platform for discovering insights and building models, which can then be operationalized to run in near real time, capturing and realizing the value of high-velocity data.
  • 63. © Informatica. Proprietary and Confidential. 63 Streaming Data Management
Streaming Data Ingestion: collect streaming data from various streaming and IoT endpoints, with multi-latency ingestion into the lake or a messaging hub.
Streaming Data Enrichment: enrich and distribute streaming data in real time for business-user consumption.
Real-Time Actions: operationalize actions based on insights from streaming data.
A minimal end-to-end sketch of these three capabilities follows.
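A minimal sketch of those three capabilities expressed in open-source Spark Structured Streaming terms: ingest readings from a messaging hub, enrich them against curated reference data, and publish actionable alerts back to a topic. Broker address, topics, schema, and the alert threshold are all illustrative assumptions.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{DoubleType, StringType, StructType, TimestampType}

val spark = SparkSession.builder().appName("stream-enrich").getOrCreate()

val schema = new StructType()
  .add("device_id", StringType)
  .add("temperature", DoubleType)
  .add("event_time", TimestampType)

// Ingest: raw sensor readings arriving on a message hub.
val readings = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "sensor-readings")
  .load()
  .select(from_json(col("value").cast("string"), schema).as("r"))
  .select("r.*")

// Enrich: join the stream against a curated device reference table.
val devices = spark.read.parquet("/lake/curated/devices")
val enriched = readings.join(devices, Seq("device_id"))

// Act: publish overheating alerts so downstream processes can react in real time.
enriched.filter(col("temperature") > 90.0)
  .selectExpr("device_id AS key", "to_json(struct(*)) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "alerts")
  .option("checkpointLocation", "/tmp/checkpoints/alerts")
  .start()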
  • 64. Stream Processing and Analytics Use Cases (blueprint lane 5)
Predictive Maintenance & Smart Factory: identify stress signals coming from devices and act on them before it is too late.
Real-Time Offer Management: combine web searches and camera feeds to identify the customer and roll out real-time offers.
Clinical Research Optimization: collect and process bedside-monitor data so clinical researchers can understand and detect disease more effectively.
Real-Time Customer KPI Generation: KPI metrics used to retain churning customers and make offers.
  • 65. 65 © Informatica. Proprietary and Confidential. Enterprise Streaming Data Management: solution overview
Sense (capture and ingest): Edge Data Streaming and IICS Streaming Ingestion (new) collect machine data, IoT sensor data, web logs, and social media; PowerExchange CDC Publisher captures changes from relational systems into the message hub (e.g., Azure Event Hub).
Reason (enrich, process, and analyze): Big Data Streaming filters, transforms, aggregates, and enriches the event streams.
Act: persist to the data lake, feed real-time dashboards, send real-time offer alerts over SMS, and trigger business processes.
  • 66. 66 © Informatica. Proprietary and Confidential. Edge Data Streaming: highly scalable stream data collection and ingestion
Distributed and broker-less, with lightweight agents for high-performance event ingestion.
Flexible, wide connectivity with out-of-the-box compression, encryption, and transformations.
Easy administration, configuration, monitoring, and auto-deployment.
Sources: flat files, JMS, Syslog, TCP/UDP, HTTP, WebSocket, Ultra Messaging, MQTT, OPC-DA.
Targets: Kafka (including Kerberized), HDFS (CDH, HDP, MapR), Cassandra NoSQL, WebSocket(S), Amazon Kinesis, Azure Event Hub, JMS, MQTT.
Transformations: regex filtering, timestamp, insert string (plus custom, via the SDK). A generic sketch of the agent pattern follows.
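Edge Data Streaming agents are configured rather than hand-coded; purely to make the pattern they replace concrete, here is a generic hand-rolled sketch of a minimal agent that reads a log source, applies a regex filter, and forwards matches to a Kafka target. The broker, topic, file path, and pattern are illustrative assumptions, and none of this is the product's API.

import java.util.Properties
import scala.io.Source
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object MiniEdgeAgent {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    // Collect: read the source line by line (a real agent would tail it continuously).
    for (line <- Source.fromFile("/var/log/app.log").getLines()
         if line.matches(".*ERROR.*")) { // regex filtering: only ship error events
      producer.send(new ProducerRecord[String, String]("app-errors", line))
    }
    producer.close()
  }
}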
  • 67. 67 © Informatica. Proprietary and Confidential. Informatica Big Data Streaming: continuous event processing of unbounded big data
Zero code: streaming ingestion and integration with Apache Kafka and Spark Streaming.
Flexible and agnostic: supports on-premises, hybrid, and full-cloud Hadoop distributions.
Integrated: sliding and tumbling windows for moving averages, Python for machine learning.
Sources span message-centric feeds (PWX CDC, EDS, Kafka, JMS, MapR Streams, cloud) and data-centric feeds (machine and device data, documents and emails, relational and mainframe, social media and web logs); targets include RDBMS, HDFS, HBase, Kafka, JMS, EDS, and cloud stores. Built on the Informatica Intelligent Data Platform with integration, lineage, governance, and security, it helps improve asset utilization, increase operational efficiency, and reduce security and safety risk. A minimal windowing sketch follows.
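A minimal sketch of the sliding and tumbling windows the slide mentions, expressed in Spark Structured Streaming: a 10-minute moving average recomputed every 5 minutes. It assumes the readings streaming DataFrame from the earlier ingestion sketch (device_id, temperature, event_time); setting the slide interval equal to the window length turns the same call into tumbling windows.

import org.apache.spark.sql.functions.{avg, col, window}

val movingAvg = readings
  .withWatermark("event_time", "15 minutes") // bound state kept for late data
  .groupBy(
    window(col("event_time"), "10 minutes", "5 minutes"), // sliding window: 10 min long, advancing every 5
    col("device_id"))
  .agg(avg(col("temperature")).as("avg_temperature"))
// movingAvg is itself a streaming DataFrame; attach a writeStream sink to run it.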
  • 68. 68 © Informatica. Proprietary and Confidential. Informatica Streaming: key differentiators
Metadata-based design and real-time enrichments.
Engine abstraction.
Stream data ingestion and processing in the cloud.
Out-of-the-box connectivity to on-premises and cloud endpoints.
ML model operationalization.
Parsing of complex unstructured data.
  • 70. 70 © Informatica. Proprietary and Confidential. Best Practices for Cloud Data Lake Management
Catalog your data; prevent the data lake from becoming a swamp.
Curate and cleanse data for consumption to increase trust.
Apply data governance and security policies to protect sensitive data.
Leverage AI/machine learning to enhance the productivity of all users of the platform.
Empower collaboration so the data lake is "everyone's lake."
Integrate data pipeline development into your CI/CD/DevOps flow.
  • 71. 71 © Informatica. Proprietary and Confidential. Parting thoughts
"Do the difficult things while they are easy and do the great things while they are small. A journey of a thousand miles must begin with a single step." - Lao Tzu