Building a Modern Big Data & Advanced
Analytics Pipeline
(Ideas for building UDAP)
About Us
• Emerging technology firm focused on helping enterprises build breakthrough
software solutions
• Building software solutions powered by disruptive enterprise software trends
-Machine learning and data science
-Cyber-security
-Enterprise IOT
-Powered by Cloud and Mobile
• Bringing innovation from startups and academic institutions to the enterprise
• Award winning agencies: Inc 500, American Business Awards, International
Business Awards
Agenda
• The principles of big data and advanced analytics pipelines
• Some inspiration
• Capabilities
• Building a big data and advanced analytics pipeline
The principles of an enterprise big data
infrastructure
Data Needs
Vision
Solutions
Data Science
Data Infrastructure
Data Access
There are only a few
technology choices….
Data Needs
Some inspiration….
Netflix
Data Access
• Data Fetching: Falcor (https://github.com/Netflix/falcor)
• Data Streaming: Apache Kafka (http://kafka.apache.org/)
• Federated Job Execution Engine: Genie (https://github.com/Netflix/genie)
Data Infrastructure
• Data Lakes: Apache Hadoop (http://hadoop.apache.org/)
• Data Compute: Apache Spark
• SQL Querying: Presto (https://prestodb.io/)
• Data Discovery: Metacat
Data Science
• Multidimensional analysis: Druid (http://druid.io/)
• Data Visualization: Sting
• Machine learning: Scikit-learn (http://scikit-learn.org/stable/)
Tools & Solutions
• Netflix big data portal
• Hadoop Search: Inviso (https://github.com/Netflix/inviso)
• Workflow visualization: Lipstick (https://github.com/Netflix/Lipstick)
Netflix Big Data Portal
Netflix Lipstick
Netflix Data Pipeline
Spotify
Data Access
• Data Fetching: GraphQL (https://facebook.github.io/react/blog/2015/05/01/graphql-introduction.html)
• Data Streaming: Apache Kafka (http://kafka.apache.org/)
Data Infrastructure
• Data Lakes: Apache Hadoop (http://hadoop.apache.org/)
• Data Compute: Apache Spark
• SQL Aggregation: Apache Crunch (https://crunch.apache.org/)
• Fast Data Access: Apache Cassandra (http://cassandra.apache.org/)
• Workflow Manager: Luigi (https://github.com/spotify/luigi)
• Data Transformation: Apache Falcon (http://hortonworks.com/hadoop/falcon/)
Data Science
• Data Visualization: Sting
• Machine learning: Spark MLlib (http://spark.apache.org/docs/latest/mllib-guide.html)
• Data Discovery: Raynor
Tools & Solutions
• Hadoop Search: Inviso (https://github.com/Netflix/inviso)
Raynor
LinkedIn
Data Access
• Data Streaming: Apache Kafka (http://kafka.apache.org/)
• Data Fetching: GraphQL (https://facebook.github.io/react/blog/2015/05/01/graphql-introduction.html)
Data Infrastructure
• Data Lakes: Apache Hadoop (http://hadoop.apache.org/)
• Data Compute: Apache Spark (http://spark.apache.org/)
• Fast Data Access: Voldemort (http://www.project-voldemort.com/voldemort/)
• Stream Analytics: Apache Samza (http://samza.apache.org/)
• Real Time Search: Zoie (http://javasoze.github.io/zoie/)
Data Science
• Multidimensional analysis: Druid (http://druid.io/)
• Data Visualization: Sting
• Machine learning: Scikit-learn (http://scikit-learn.org/stable/)
• Data Discovery: Raynor
Tools & Solutions
• Hadoop Search: Inviso (https://github.com/Netflix/inviso)
LinkedIn Stream Data Processing
LinkedIn Rewinder
Goldman Sachs
Data Access
• Data Fetching: GraphQL (https://facebook.github.io/react/blog/2015/05/01/graphql-introduction.html)
• Data Streaming: Apache Kafka (http://kafka.apache.org/)
Data Infrastructure
• Data Lakes: Apache Hadoop/HBase (http://hadoop.apache.org/)
• Data Compute: Apache Spark
• Data Transformation: Apache Pig (https://pig.apache.org/)
• Stream Analytics: Apache Storm (http://storm.apache.org/)
Data Science
• Multidimensional analysis: Druid (http://druid.io/)
• Data Visualization: Sting
• Machine learning: Spark MLlib (http://spark.apache.org/docs/latest/mllib-guide.html)
• Data Discovery: Custom data catalog
Tools & Solutions
• Secure data exchange: Symphony (http://www.goldmansachs.com/what-we-do/engineering/see-our-work/inside-symphony.html)
Goldman Sachs Data Exchange Architecture
Capabilities of a big data pipeline
Data Access….
Goals
• Provide the foundation for data collection and data ingestion methods at an enterprise scale
• Support different data collection models in a consistent architecture
• Incorporate and remove data sources without impacting the overall infrastructure
Foundational Capabilities
• On-demand data access
• Batch data access
• Stream data access
• Data transformation
On-Demand Data Access
Best Practices
• Enable standard data access protocols for line of business systems
• Empower client applications with data querying capabilities
• Provide data access infrastructure building blocks such as caching across business data sources
Interesting Technologies
• GraphQL (https://facebook.github.io/react/blog/2015/05/01/graphql-introduction.html)
• OData (http://odata.org)
• Falcor (http://netflix.github.io/falcor/)
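The practices above (client-driven querying plus a shared cache in front of line of business sources) can be sketched in a few lines of Python. This is an illustration only and not the GraphQL, OData or Falcor API: `DataGateway`, `register_source`, `query` and the `CRM` sample data are hypothetical names chosen for this sketch.

```python
from typing import Any, Callable, Dict, Iterable, Tuple

# Hypothetical sketch: a gateway that lets clients ask for specific fields
# and caches full records per (source, key). Not a real GraphQL/OData/Falcor API.
Record = Dict[str, Any]
Fetcher = Callable[[str], Record]


class DataGateway:
    def __init__(self) -> None:
        self._sources: Dict[str, Fetcher] = {}
        self._cache: Dict[Tuple[str, str], Record] = {}
        self.backend_calls = 0  # counts cache misses, for illustration

    def register_source(self, name: str, fetcher: Fetcher) -> None:
        self._sources[name] = fetcher

    def query(self, source: str, key: str, fields: Iterable[str]) -> Record:
        cache_key = (source, key)
        if cache_key not in self._cache:
            # Cache miss: call the backing system once and keep the record.
            self.backend_calls += 1
            self._cache[cache_key] = self._sources[source](key)
        record = self._cache[cache_key]
        # Return only the fields the client asked for.
        return {f: record[f] for f in fields if f in record}


# Made-up stand-in for a line of business system.
CRM = {"42": {"name": "Acme", "tier": "gold", "owner": "jdoe"}}

gateway = DataGateway()
gateway.register_source("crm", lambda key: CRM[key])
first = gateway.query("crm", "42", ["name", "tier"])
second = gateway.query("crm", "42", ["owner"])  # served from the cache
```

The second query asks for a different field of the same record, so the backing system is not called again.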
Batch Data Access
Best Practices
• Enable agile ETL models
• Support federated job processing
Interesting Technologies
• Genie (https://github.com/Netflix/genie)
• Luigi (https://github.com/spotify/luigi)
• Apache Pig (https://pig.apache.org/)
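One way to picture an agile batch ETL is as a set of small jobs with declared dependencies that a runner executes in order. The sketch below uses only the Python standard library (`graphlib`, Python 3.9+); `run_jobs` and the three sample jobs are hypothetical, and Genie, Luigi and Pig each expose their own, different APIs.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, List

# Hypothetical sketch of dependency-ordered batch jobs.
# This is not how Genie, Luigi or Pig are actually invoked.


def run_jobs(jobs: Dict[str, Callable[[], None]],
             depends_on: Dict[str, List[str]]) -> List[str]:
    """Run every job after its dependencies; return the execution order."""
    graph = {name: depends_on.get(name, []) for name in jobs}
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        jobs[name]()
    return order


warehouse: Dict[str, list] = {}  # made-up shared store for the sample jobs


def extract() -> None:
    warehouse["raw"] = [" 3", "1 ", "2"]


def transform() -> None:
    warehouse["clean"] = sorted(int(v.strip()) for v in warehouse["raw"])


def load() -> None:
    warehouse["total"] = [sum(warehouse["clean"])]


order = run_jobs(
    {"load": load, "extract": extract, "transform": transform},
    {"transform": ["extract"], "load": ["transform"]},
)
```

Jobs are registered in arbitrary order; the declared dependencies alone decide that extraction runs before transformation and loading.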
Stream Data Access
Best Practices
• Enable streaming data from line of business systems
• Provide the infrastructure to incorporate new data sources such as sensors, web streams, etc.
• Provide a consistent model for data integration between line of business systems
Interesting Technologies
• Apache Kafka (http://kafka.apache.org/)
• RabbitMQ (https://www.rabbitmq.com/)
• ZeroMQ (http://zeromq.org/)
• Many others….
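The consistent integration model described above is often built on an append-only log that several consumers read at their own pace. The following in-memory sketch illustrates that idea only; `TopicLog`, `publish` and `poll` are hypothetical names, and real brokers such as Kafka or RabbitMQ have different APIs and delivery semantics.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

# Hypothetical in-memory sketch of a topic log with per-consumer offsets.
# Real brokers add persistence, partitioning and delivery guarantees.


class TopicLog:
    def __init__(self) -> None:
        self._topics: Dict[str, List[Any]] = defaultdict(list)
        self._offsets: Dict[Tuple[str, str], int] = defaultdict(int)

    def publish(self, topic: str, message: Any) -> int:
        """Append a message and return its offset in the topic."""
        self._topics[topic].append(message)
        return len(self._topics[topic]) - 1

    def poll(self, topic: str, consumer: str) -> List[Any]:
        """Return unread messages for this consumer and advance its offset."""
        start = self._offsets[(topic, consumer)]
        messages = self._topics[topic][start:]
        self._offsets[(topic, consumer)] = start + len(messages)
        return messages


log = TopicLog()
log.publish("sensors", {"id": "s1", "temp": 21.5})
log.publish("sensors", {"id": "s2", "temp": 19.0})
billing_batch = log.poll("sensors", "billing")
log.publish("sensors", {"id": "s1", "temp": 22.0})
billing_next = log.poll("sensors", "billing")
audit_batch = log.poll("sensors", "audit")  # a new consumer reads from the start
```

Because each consumer tracks its own offset, a new consumer can be added without touching the producers or the existing consumers.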
Data Virtualization
Best Practices
• Enable federated aggregation of disparate data sources
• Focus on small data sources
• Enable standard protocols to access the federated data sources
Interesting Technologies
• Denodo (http://www.denodo.com/en)
• JBoss Data Virtualization (http://www.jboss.org/products/datavirt/overview/)
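Federated aggregation means answering a question across sources at query time instead of copying them into one store first. A minimal sketch, assuming one source is a CSV export and the other an in-memory lookup; the data, `orders` and `customer_totals` are made up for illustration and do not reflect how Denodo or JBoss Data Virtualization work internally.

```python
import csv
import io
from typing import Dict, Iterator

# Made-up sources: a CSV export and an in-memory lookup table.
ORDERS_CSV = "customer_id,total\nc1,120\nc2,80\nc1,30\n"
CUSTOMERS = {"c1": {"name": "Acme"}, "c2": {"name": "Globex"}}


def orders() -> Iterator[Dict[str, str]]:
    """Read order rows from the CSV source each time the view is queried."""
    return csv.DictReader(io.StringIO(ORDERS_CSV))


def customer_totals() -> Dict[str, int]:
    """Virtual view: join both sources on customer_id at query time."""
    totals: Dict[str, int] = {}
    for row in orders():
        name = CUSTOMERS[row["customer_id"]]["name"]
        totals[name] = totals.get(name, 0) + int(row["total"])
    return totals


view = customer_totals()
```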
Data Infrastructure….
Goals
• Store heterogeneous business data at scale
• Provide consistent models to aggregate and compose data from different sources
• Manage and curate business data sources
• Discover and consume data available in your organization
Foundational Capabilities
• Data lakes
• Data quality
• Data discovery
• Data transformation
Data Lakes
Best Practices
• Focus on complementing and expanding our data warehouse capabilities
• Optimize the data lake to incorporate heterogeneous data sources
• Support multiple data ingestion models
• Consider a hybrid cloud strategy (pilot vs. production)
Interesting Technologies
• Hadoop (http://hadoop.apache.org/)
• Hive (https://hive.apache.org/)
• HBase (https://hbase.apache.org/)
• Spark (http://spark.apache.org/)
• Greenplum (http://greenplum.org/)
• Many others….
Data Quality
Best Practices
• Avoid traditional data quality methodologies
• Leverage machine learning to streamline data quality rules
• Leverage modern data quality platforms
• Weigh crowdsourced vs. centralized data quality models
Interesting Technologies
• Trifacta (http://trifacta.com)
• Tamr (http://tamr.com)
• Alation (https://alation.com/)
• Paxata (http://www.paxata.com/)
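As a small illustration of deriving quality rules from data rather than writing thresholds by hand, the sketch below learns an acceptable range from historical values using the median absolute deviation. It is a simple statistical stand-in, not a machine learning platform and not how Trifacta, Tamr, Alation or Paxata work; `learn_range`, `flag_outliers`, the factor `k` and the sample values are assumptions made for this example.

```python
from statistics import median
from typing import List, Sequence, Tuple

# Hypothetical sketch: learn a plausible value range from history,
# then flag new values that fall outside it.


def learn_range(history: Sequence[float], k: float = 5.0) -> Tuple[float, float]:
    """Return (low, high) as median +/- k * median absolute deviation."""
    center = median(history)
    mad = median(abs(v - center) for v in history)
    return center - k * mad, center + k * mad


def flag_outliers(values: Sequence[float],
                  bounds: Tuple[float, float]) -> List[float]:
    low, high = bounds
    return [v for v in values if not (low <= v <= high)]


history = [100.0, 102.0, 98.0, 101.0, 99.0, 100.0, 103.0]  # made-up readings
bounds = learn_range(history)
suspect = flag_outliers([101.0, 97.0, 250.0, -4.0], bounds)
```

The rule adapts when the history changes, which is the point of learning it from data instead of hard-coding it.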
Data Discovery
Best Practices
• Master data management solutions don’t work with modern data sources
• Promote crowdsourced over centralized data publishing
• Focus on user experience
• Consider build vs. buy options
Interesting Technologies
• Tamr (http://tamr.com)
• Custom solutions…
• Spotify Raynor
• Netflix big data portal
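A custom discovery solution can start as a catalog that any team may publish to and tag, with a simple search on top. The sketch below is hypothetical throughout (`DatasetEntry`, `Catalog`, the sample datasets) and is not modeled on Raynor, the Netflix portal or Tamr.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

# Hypothetical sketch of a crowdsourced data catalog.


@dataclass
class DatasetEntry:
    name: str
    owner: str
    description: str
    tags: Set[str] = field(default_factory=set)


class Catalog:
    def __init__(self) -> None:
        self._entries: Dict[str, DatasetEntry] = {}

    def publish(self, entry: DatasetEntry) -> None:
        """Any team can publish or update an entry."""
        self._entries[entry.name] = entry

    def tag(self, name: str, tag: str) -> None:
        """Any user can add a tag to an existing entry."""
        self._entries[name].tags.add(tag.lower())

    def search(self, term: str) -> List[str]:
        """Match the term against names, descriptions and tags."""
        term = term.lower()
        return sorted(
            e.name for e in self._entries.values()
            if term in e.name.lower()
            or term in e.description.lower()
            or term in e.tags
        )


catalog = Catalog()
catalog.publish(DatasetEntry("web_clicks", "marketing", "Raw clickstream events"))
catalog.publish(DatasetEntry("orders", "sales", "Confirmed customer orders"))
catalog.tag("web_clicks", "Customer")  # tag contributed by a consumer of the data
hits = catalog.search("customer")
```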
Data Transformations
Best Practices
• Enable programmable ETLs
• Support data transformations for both batch and real time data sources
• Agility over robustness
Interesting Technologies
• Apache Pig (https://pig.apache.org/)
• StreamSets (https://streamsets.com/)
• Apache Spark (http://spark.apache.org/)
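A programmable ETL can be written once and applied to both a finite batch and a live feed if every step consumes and produces an iterator. The sketch below shows that pattern with plain Python generators; `drop_missing`, `add_field`, `pipeline` and the sample rows are hypothetical and unrelated to the Pig, StreamSets or Spark APIs.

```python
from typing import Any, Callable, Dict, Iterable, Iterator

Row = Dict[str, Any]
Step = Callable[[Iterable[Row]], Iterator[Row]]

# Hypothetical sketch: ETL steps as lazy iterator transformations.


def drop_missing(field_name: str) -> Step:
    def step(rows: Iterable[Row]) -> Iterator[Row]:
        return (r for r in rows if r.get(field_name) is not None)
    return step


def add_field(name: str, fn: Callable[[Row], Any]) -> Step:
    def step(rows: Iterable[Row]) -> Iterator[Row]:
        return ({**r, name: fn(r)} for r in rows)
    return step


def pipeline(*steps: Step) -> Step:
    def run(rows: Iterable[Row]) -> Iterator[Row]:
        for s in steps:
            rows = s(rows)
        return iter(rows)
    return run


etl = pipeline(
    drop_missing("amount"),
    add_field("amount_usd", lambda r: round(r["amount"] * r["fx"], 2)),
)

# Batch source: a finite list of made-up rows.
batch_out = list(etl([{"amount": 10.0, "fx": 1.1}, {"amount": None, "fx": 1.0}]))


# Stream source: a generator standing in for a live feed.
def live_feed() -> Iterator[Row]:
    yield {"amount": 2.0, "fx": 0.5}
    yield {"amount": None, "fx": 0.5}
    yield {"amount": 4.0, "fx": 0.5}


stream_out = list(etl(live_feed()))
```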
Data Science….
Goals
• Discover insights in business data sources
• Integrate machine learning capabilities as part of the enterprise data pipeline
• Provide the foundation for predictive analytic capabilities across the enterprise
• Enable programmatic execution of machine learning models
Foundational Capabilities
• Data visualization & self-service BI
• Predictive analytics
• Stream analytics
• Proactive analytics
Data Visualization and Self-Service BI
Best Practices
• Access business data sources from mainstream data visualization tools like Excel, Tableau, QlikView, Datameer, etc.
• Publish data visualizations so that they can be discovered by other information workers
• Embed visualization as part of existing line of business solutions
Interesting Technologies
• Tableau (http://www.tableau.com/)
• PowerBI (https://powerbi.microsoft.com/en-us/)
• Datameer (http://www.datameer.com/)
• QlikView (http://www.qlik.com/)
• Visualization libraries
• ….
Predictive Analytics
Best Practices
• Implement the tools and frameworks to author machine learning models using business data sources
• Expose predictive models via programmable APIs
• Provide the infrastructure to test, train and evaluate machine learning models
Interesting Technologies
• Spark MLlib (http://spark.apache.org/docs/latest/mllib-guide.html)
• Scikit-Learn (http://scikit-learn.org/)
• Dato (https://dato.com/)
• H2O.ai (http://www.h2o.ai/)
• ….
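The train, evaluate and expose cycle can be shown end to end with a deliberately tiny model. The sketch below fits a one-variable least-squares line in plain Python and wraps it in a callable that other code can use as an API. The function names and the sample data are made up for illustration; a real pipeline would use a library such as Spark MLlib or Scikit-Learn instead of hand-written fitting.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical sketch: train on one split, evaluate on another,
# and expose the model as a plain callable.


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def make_predictor(slope: float, intercept: float) -> Callable[[float], float]:
    return lambda x: slope * x + intercept


def mean_abs_error(predict: Callable[[float], float],
                   xs: Sequence[float], ys: Sequence[float]) -> float:
    return sum(abs(predict(x) - y) for x, y in zip(xs, ys)) / len(xs)


# Made-up data following y = 2x + 1, split into train and test sets.
train_x, train_y = [1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0]
test_x, test_y = [5.0, 6.0], [11.0, 13.0]

predict = make_predictor(*fit_line(train_x, train_y))
error = mean_abs_error(predict, test_x, test_y)
```

Keeping evaluation on data the model has not seen is what makes the error figure meaningful before the predictor is exposed to other systems.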
Stream Analytics
Best Practices
• Aggregate data in real time from diverse data sources
• Model static queries over dynamic streams of data
• Create simulations and replays of real data streams
Interesting Technologies
• Apache Storm (http://storm.apache.org/)
• Spark Streaming (http://spark.apache.org/streaming/)
• Apache Samza (http://samza.apache.org/)
• ….
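A static query over a dynamic stream can be as simple as a moving average that is defined once and evaluated as each event arrives. Because the query below only depends on the sequence of events, the same recorded events can be replayed through it to reproduce the results. `moving_average` and the sample values are hypothetical; Storm, Spark Streaming and Samza provide their own windowing APIs.

```python
from collections import deque
from typing import Deque, Iterable, Iterator

# Hypothetical sketch: a fixed (static) query evaluated over a stream.


def moving_average(stream: Iterable[float], window: int) -> Iterator[float]:
    """Yield the mean of the last `window` values after each new event."""
    recent: Deque[float] = deque(maxlen=window)
    for value in stream:
        recent.append(value)
        yield sum(recent) / len(recent)


recorded = [10.0, 20.0, 30.0, 40.0]  # made-up events captured from a feed

# "Live" run: events arrive one at a time through an iterator.
live = list(moving_average(iter(recorded), window=2))

# Replay: the recorded events are fed through the same query again.
replay = list(moving_average(recorded, window=2))
```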
Proactive Analytics
Best Practices
• Automate actions based on the output of predictive models
• Use programmatic models to script proactive analytics business rules
• Continuously test and validate proactive rules
Interesting Technologies
• Spark MLlib (http://spark.apache.org/docs/latest/mllib-guide.html)
• Scikit-Learn (http://scikit-learn.org/)
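Scripted proactive rules can be kept as plain data: an ordered list of conditions on a model score, each mapped to an action. That keeps them easy to test continuously. In the sketch below the rules, the action names and `churn_score` (a stand-in for a trained model) are all made up for illustration.

```python
from typing import Callable, List, Tuple

Rule = Tuple[Callable[[float], bool], str]

# Hypothetical rules: first matching condition wins.
RULES: List[Rule] = [
    (lambda score: score >= 0.8, "open_retention_case"),
    (lambda score: score >= 0.5, "send_discount_offer"),
]


def decide(score: float, rules: List[Rule] = RULES,
           default: str = "no_action") -> str:
    """Map a model score to the first action whose condition holds."""
    for condition, action in rules:
        if condition(score):
            return action
    return default


def churn_score(days_inactive: int) -> float:
    """Stand-in for a trained predictive model (not a real one)."""
    return min(days_inactive / 60.0, 1.0)


actions = [decide(churn_score(d)) for d in (6, 30, 90)]
```

Since `decide` is a pure function, each rule change can be validated against a fixed set of scores before it is allowed to trigger real actions.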
Solutions….
Enterprise Data Solutions
• Leverage a consistent data pipeline as part of all solutions
• Empower different teams to contribute to different aspects of the big data pipeline
• Keep track of key metrics about the big data pipeline such as time to deliver solutions, data volume over time, data quality metrics, etc.
Some Examples
• Data discovery
• Data quality
• Data testing tools
• …
Other Interesting Capabilities
• Mobile analytics
• Embedded analytics capabilities (e.g., Salesforce Wave, Workday)
• Aggregation with external sources
• Video & image analytics
• Deep learning
• ….
Building a big data and advanced analytics
pipeline
Infrastructure-Driven
Data Access: Real time data access, Batch data access, Stream data access
Data Infrastructure: Data Storage, Data Aggregation, Data Transformation, Data Discovery, Others…
Data Science: Stream Analytics, Predictive Analytics, Proactive Analytics, Others…
Solutions: Solution, Solution, Solution
Domain-Driven
Data Access: Real time data access, Batch data access, Stream data access
Data Infrastructure: Data Storage, Data Aggregation, Data Transformation, Data Discovery, Others…
Data Science: Stream Analytics, Predictive Analytics, Proactive Analytics, Others…
Solutions: Solution, Solution, Solution
Infrastructure-Driven vs. Domain-Driven Approaches
Infrastructure-Driven
• Led by the architecture team
• Military discipline
• Commitment from business stakeholders
Domain-Driven
• Federated data teams
• Rapid releases
• Pervasive communications
Some General Rules
• Establish a vision across all levels of the data pipeline
• You can’t buy everything. It is likely you will build custom data infrastructure building blocks
• Deliver infrastructure and functional capabilities incrementally
• Establish a data innovation group responsible for piloting infrastructure capabilities ahead of production schedules
• Encourage adoption even in early stages
• Iterate
Summary
• Big data and advanced analytics pipelines are based on four fundamental elements: data access, data infrastructure, data science, and data solutions
• A lot of inspiration can be drawn from the big data solutions built by leading internet vendors
• Establish a common vision and mission
• Start small, then iterate
Thanks
http://Tellago.com
Info@Tellago.com
