Building a Big Data Pipeline

Building a Modern Big Data & Advanced Analytics Pipeline
(Ideas for building UDAP)

About Us
• Emerging technology firm focused on helping enterprises build breakthrough software solutions
• Building software solutions powered by disruptive enterprise software trends
  - Machine learning and data science
  - Cyber-security
  - Enterprise IoT
  - Powered by Cloud and Mobile
• Bringing innovation from startups and academic institutions to the enterprise
• Award-winning agencies: Inc 500, American Business Awards, International Business Awards

Agenda
• The principles of big data and advanced analytics pipelines
• Some inspiration
• Capabilities
• Building a big data and advanced analytics pipeline

The principles of an enterprise big data infrastructure

[Diagram: Data Needs, Vision, Solutions, Data Science, Data Infrastructure, Data Access]

There are only a few technology choices…

Netflix
Data Access
• Data Fetching: Falcor (https://github.com/Netflix/falcor)
• Data Streaming: Apache Kafka (http://kafka.apache.org/)
• Federated Job Execution Engine: Genie (https://github.com/Netflix/genie)
Data Infrastructure
• Data Lakes: Apache Hadoop (http://hadoop.apache.org/)
• Data Compute: Apache Spark
• SQL Querying: Presto (https://prestodb.io/)
• Data Discovery: Metacat
Data Science
• Multidimensional analysis: Druid (http://druid.io/)
• Data Visualization: Sting
• Machine learning: Scikit-learn (http://scikit-learn.org/stable/)
Tools & Solutions
• Netflix big data portal
• Hadoop Search: Inviso (https://github.com/Netflix/inviso)
• Workflow visualization: Lipstick (https://github.com/Netflix/Lipstick)

Spotify
Data Access
• Data Fetching: GraphQL (https://facebook.github.io/react/blog/2015/05/01/graphql-introduction.html)
Data Infrastructure
• SQL Aggregation: Apache Crunch (https://crunch.apache.org/)
• Fast Data Access: Apache Cassandra (http://cassandra.apache.org/)
• Workflow Manager: Luigi (https://github.com/spotify/luigi)
• Data Transformation: Apache Falcon (http://hortonworks.com/hadoop/falcon/)
Data Science
• Machine learning: Spark MLlib (http://spark.apache.org/docs/latest/mllib-guide.html)
• Data Discovery: Raynor
Tools & Solutions
Hadoop Search:
flix/inviso )

LinkedIn
Data Access
• Data Fetching:
Data Infrastructure
• Data Compute: Apache Spark (http://spark.apache.org/)
• Fast Data Access: Voldemort (http://www.project-voldemort.com/voldemort/)
• Stream Analytics: Apache Samza (http://samza.apache.org/)
• Real Time Search: Zoie (http://javasoze.github.io/zoie/)
Data Science
• Machine learning: Scikit-learn (http://scikit-learn.org/stable/)
• Data Discovery: Raynor
Tools & Solutions
Hadoop Search:
flix/inviso )

LinkedIn Stream Data Processing

Goldman Sachs
Data Access
• Data Fetching:
Data Infrastructure
• Data Lakes: Apache Hadoop/HBase
• Data Transformation: Apache Pig (https://pig.apache.org/)
• Stream Analytics: Apache Storm (http://storm.apache.org/)
Data Science
• Machine learning: Spark MLlib (http://spark.apache.org/docs/latest/mllib-guide.html)
• Data Discovery: Custom data catalog
Tools & Solutions
• Secure data exchange: Symphony (http://www.goldmansachs.com/what-we-do/engineering/see-our-work/inside-symphony.html)

Goldman Sachs Data Exchange Architecture

Capabilities of a big data pipeline

Goals
• Provide the foundation for data collection and data ingestion methods at enterprise scale
• Support different data collection models in a consistent architecture
• Incorporate and remove data sources without impacting the overall infrastructure

Foundational Capabilities
• On-demand data access
• Batch data access
• Stream data access
• Data transformation

On-Demand Data Access
Best Practices
• Enable standard data access protocols for line of business systems
• Empower client applications with data querying capabilities
• Provide data access infrastructure building blocks such as caching across business data sources
Interesting Technologies
• GraphQL (https://facebook.github.io/react/blog/2015/05/01/graphql-introduction.html)
• OData (http://odata.org)
• Falcor (http://netflix.github.io/falcor/)
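To make the on-demand practices concrete, here is a minimal sketch of client-driven field selection with a cache in front of a data source. It only illustrates the idea behind tools such as GraphQL and Falcor and uses none of their APIs; `CUSTOMERS`, `select`, and `CachedSource` are hypothetical names invented for this example.

```python
from typing import Any, Callable, Dict

# Hypothetical in-memory stand-in for a line of business system.
CUSTOMERS = {
    1: {"id": 1, "name": "Acme", "region": "EMEA",
        "contact": {"email": "ops@acme.example", "phone": "555-0100"}},
}

def select(record: Dict[str, Any], shape: Dict[str, Any]) -> Dict[str, Any]:
    """Return only the fields named in `shape`; nested dicts select sub-fields."""
    result = {}
    for field, sub_shape in shape.items():
        value = record[field]
        result[field] = select(value, sub_shape) if isinstance(sub_shape, dict) else value
    return result

class CachedSource:
    """Wraps a fetch function with a simple cache keyed by record id."""

    def __init__(self, fetch: Callable[[int], Dict[str, Any]]):
        self._fetch = fetch
        self._cache: Dict[int, Dict[str, Any]] = {}
        self.backend_calls = 0  # counts how often the backing system is hit

    def get(self, record_id: int, shape: Dict[str, Any]) -> Dict[str, Any]:
        if record_id not in self._cache:
            self.backend_calls += 1
            self._cache[record_id] = self._fetch(record_id)
        return select(self._cache[record_id], shape)

source = CachedSource(lambda record_id: CUSTOMERS[record_id])
# The client, not the server, decides which fields come back.
summary = source.get(1, {"name": True, "contact": {"email": True}})
again = source.get(1, {"region": True})  # served from the cache
```

The second call asks for a different shape but does not touch the backing system again, which is the kind of building block the caching bullet refers to.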

Batch Data Access
Best Practices
• Enable agile ETL models
• Support federated job processing
Interesting Technologies
• Genie (https://github.com/Netflix/genie)
• Luigi (https://github.com/spotify/luigi)
• Apache Pig (https://pig.apache.org/)
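A batch job in this style is usually a graph of small tasks, each declaring what it depends on. The sketch below shows that idea in plain Python; it is loosely inspired by workflow managers such as Luigi but is not the Luigi API, and `Task` and `execute` are hypothetical names. It assumes the dependency graph has no cycles.

```python
class Task:
    """A named batch step with upstream dependencies (illustrative only)."""

    def __init__(self, name, run, requires=()):
        self.name = name
        self.run = run                  # callable taking the upstream results
        self.requires = tuple(requires)

def execute(task, results=None, order=None):
    """Run upstream tasks first, each at most once, then the task itself."""
    results = {} if results is None else results
    order = [] if order is None else order
    if task.name in results:            # already ran as someone else's dependency
        return results, order
    for upstream in task.requires:
        execute(upstream, results, order)
    inputs = [results[u.name] for u in task.requires]
    results[task.name] = task.run(*inputs)
    order.append(task.name)
    return results, order

extract = Task("extract", lambda: [3, 1, 2])
clean = Task("clean", lambda rows: sorted(rows), requires=[extract])
total = Task("total", lambda rows: sum(rows), requires=[clean])
report = Task("report", lambda rows, s: {"rows": rows, "sum": s},
              requires=[clean, total])

results, order = execute(report)
```

Because each task only names its inputs, a new step can be added without rewriting the existing ones, which is what keeps the ETL model agile.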

Stream Data Access
Best Practices
• Enable streaming data from line of business systems
• Provide the infrastructure to incorporate new data sources such as sensors, web streams, etc.
• Provide a consistent model for data integration between line of business systems
Interesting Technologies
• Apache Kafka (http://kafka.apache.org/)
• RabbitMQ (https://www.rabbitmq.com/)
• ZeroMQ (http://zeromq.org/)
• Many others…
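The stream access model can be pictured as an append-only log per topic, where each consumer keeps its own read position. The following is a minimal in-memory sketch of that model, not a Kafka, RabbitMQ, or ZeroMQ client; `TopicLog` and the topic and consumer names are hypothetical.

```python
from collections import defaultdict

class TopicLog:
    """Append-only in-memory log per topic; consumers track their own offsets."""

    def __init__(self):
        self._topics = defaultdict(list)
        self._offsets = defaultdict(int)   # (consumer, topic) -> next index to read

    def publish(self, topic, message):
        self._topics[topic].append(message)

    def poll(self, consumer, topic, max_messages=10):
        start = self._offsets[(consumer, topic)]
        batch = self._topics[topic][start:start + max_messages]
        self._offsets[(consumer, topic)] = start + len(batch)
        return batch

log = TopicLog()
log.publish("orders", {"id": 1, "amount": 40})
log.publish("orders", {"id": 2, "amount": 15})
billing_first = log.poll("billing", "orders")

# A new source (a sensor feed) is added without touching existing consumers.
log.publish("sensors", {"device": "t-17", "temp_c": 21.5})
log.publish("orders", {"id": 3, "amount": 99})

billing_second = log.poll("billing", "orders")  # only what billing has not seen
audit_all = log.poll("audit", "orders")         # a late consumer reads from the start
```

Producers and consumers only agree on topic names and message shapes, so systems can be integrated or removed independently.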

Data Virtualization
Best Practices
• Enable federated aggregation of disparate data sources
• Focus on small data sources
• Enable standard protocols to access the federated data sources
Interesting Technologies
• Denodo (http://www.denodo.com/en)
• JBoss Data Virtualization (http://www.jboss.org/products/datavirt/overview/)
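As a rough illustration of federated aggregation, the sketch below joins two small sources at query time without copying either into a central store. An in-memory SQLite table and a CSV string stand in for real business systems; all table, column, and function names are assumptions made for this example, and products such as Denodo work very differently under the hood.

```python
import csv
import io
import sqlite3

# Source 1: a relational store (in-memory SQLite stands in for a business database).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])

# Source 2: a small flat file (a CSV string stands in for a spreadsheet export).
ORDERS_CSV = "customer_id,amount\n1,40\n1,60\n2,15\n"

def customers():
    for cid, name in db.execute("SELECT id, name FROM customers ORDER BY id"):
        yield {"id": cid, "name": name}

def orders():
    for row in csv.DictReader(io.StringIO(ORDERS_CSV)):
        yield {"customer_id": int(row["customer_id"]), "amount": float(row["amount"])}

def revenue_by_customer():
    """Virtual view: reads both sources on each call and joins them in memory."""
    totals = {}
    for order in orders():
        key = order["customer_id"]
        totals[key] = totals.get(key, 0.0) + order["amount"]
    return [{"name": c["name"], "revenue": totals.get(c["id"], 0.0)}
            for c in customers()]

view = revenue_by_customer()
```

Joining in memory like this is only reasonable for small sources, which matches the advice to focus on small data sources.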

Goals
• Store heterogeneous business data at scale
• Provide consistent models to aggregate and compose data from different sources
• Manage and curate business data sources
• Discover and consume data available in your organization

• Data lakes
• Data quality
• Data discovery
• Data transformation

Data Lakes
Best Practices
• Focus on complementing and expanding our data warehouse capabilities
• Optimize the data lake to incorporate heterogeneous data sources
• Support multiple data ingestion models
• Consider a hybrid cloud strategy (pilot vs. production)
Interesting Technologies
• Hadoop (http://hadoop.apache.org/)
• Hive (https://hive.apache.org/)
• HBase (https://hbase.apache.org/)
• Spark (http://spark.apache.org/)
• Greenplum (http://greenplum.org/)
• Many others…

Data Quality
Best Practices
• Avoid traditional data quality methodologies
• Leverage machine learning to streamline data quality rules
• Leverage modern data quality platforms
• Crowdsourced vs. centralized data quality models
Interesting Technologies
• Trifacta (http://trifacta.com)
• Tamr (http://tamr.com)
• Alation (https://alation.com/)
• Paxata (http://www.paxata.com/)
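One way to read "learned" data quality rules is that the acceptable range for a field is derived from historical data instead of being hand-written. The sketch below uses a simple robust statistic (median and median absolute deviation) as a stand-in for a real machine learning model; `learn_range`, the tolerance value, and the sample readings are all assumptions for illustration.

```python
import statistics

def learn_range(values, tolerance=5.0):
    """Derive an acceptable range from historical values using median and MAD.

    If all historical values are identical the MAD is zero and the range
    collapses to a single point, so a real rule would need a minimum width.
    """
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    return median - tolerance * mad, median + tolerance * mad

# Historical readings considered trustworthy (made-up numbers).
history = [20.1, 19.8, 20.4, 20.0, 19.9, 20.2, 20.3]
low, high = learn_range(history)

# New records are checked against the learned rule, not a hand-coded one.
incoming = [20.0, 20.5, 35.0, -1.0]
flagged = [v for v in incoming if not low <= v <= high]
```

Re-running `learn_range` as new history accumulates lets the rule adapt without anyone editing thresholds by hand.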

Data Discovery
Best Practices
• Master data management solutions don't work with modern data sources
• Promote crowdsourced rather than centralized data publishing
• Focus on user experience
• Consider build vs. buy options
Interesting Technologies
• Tamr (http://tamr.com)
• Custom solutions…
• Spotify Raynor
• Netflix big data portal
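For the build option, a custom discovery tool can start as little more than a searchable catalog that any team may publish to. The sketch below is a hypothetical minimal version; `Catalog`, its fields, and the dataset names are invented for illustration and are unrelated to how Raynor or the Netflix portal are built.

```python
class Catalog:
    """A tiny dataset catalog: any team can publish, anyone can search."""

    def __init__(self):
        self._entries = {}

    def publish(self, name, owner, description, tags=()):
        # Publishing again under the same name updates the entry.
        self._entries[name] = {
            "name": name,
            "owner": owner,
            "description": description,
            "tags": {t.lower() for t in tags},
        }

    def search(self, term):
        """Return dataset names whose name, description, or tags match `term`."""
        term = term.lower()
        return sorted(
            e["name"] for e in self._entries.values()
            if term in e["name"].lower()
            or term in e["description"].lower()
            or term in e["tags"]
        )

catalog = Catalog()
catalog.publish("orders_daily", "sales-eng", "Daily order totals per region", ["sales"])
catalog.publish("web_clicks", "growth", "Raw clickstream events", ["web", "events"])
catalog.publish("churn_scores", "data-science", "Weekly churn scores derived from orders")

order_related = catalog.search("order")
event_related = catalog.search("events")
```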

Data Transformations
Best Practices
• Enable programmable ETLs
• Support data transformations for both batch and real time data sources
• Agility over robustness
Interesting Technologies
• Apache Pig (https://pig.apache.org/)
• Streamsets (https://streamsets.com/)
• Apache Spark (http://spark.apache.org/)
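A programmable ETL can be expressed as a chain of small per-record functions, and the same chain can then serve a finite batch or an open-ended stream. This is a minimal sketch of that idea in plain Python; `pipeline` and the step functions are hypothetical and do not mirror the Pig, Streamsets, or Spark APIs.

```python
def pipeline(*steps):
    """Compose per-record steps; a step returning None drops the record."""
    def apply(records):
        for record in records:
            for step in steps:
                record = step(record)
                if record is None:
                    break
            else:
                # Reached only when no step dropped the record.
                yield record
    return apply

def parse(line):
    sku, qty = line.split(",")
    return {"sku": sku, "qty": qty}

def to_int(record):
    return {**record, "qty": int(record["qty"])}

def drop_empty(record):
    return record if record["qty"] > 0 else None

etl = pipeline(parse, to_int, drop_empty)

# Batch: a finite list processed in one go.
batch = list(etl(["a,2", "b,0", "c,5"]))

# Real time: the same steps applied lazily to a generator, one record at a time.
def live_feed():
    yield "d,1"
    yield "e,0"

first_live = next(etl(live_feed()))
```

Because `apply` is a generator, nothing is buffered, so the batch and real time cases share one definition of the transformation.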

Goals
• Discover insights in business data sources
• Integrate machine learning capabilities as part of the enterprise data pipeline
• Provide the foundation for predictive analytic capabilities across the enterprise
• Enable programmatic execution of machine learning models

• Data visualization & self-service BI
• Predictive analytics
• Stream analytics
• Proactive analytics

Data Visualization and Self-Service BI
Best Practices
• Access business data sources from mainstream data visualization tools like Excel, Tableau, QlikView, Datameer, etc.
• Publish data visualizations so that they can be discovered by other information workers
• Embed visualization as part of existing line of business solutions
Interesting Technologies
• Tableau (http://www.tableau.com/)
• PowerBI (https://powerbi.microsoft.com/en-us/)
• Datameer (http://www.datameer.com/)
• QlikView (http://www.qlik.com/)
• Visualization libraries
• …

Predictive Analytics
Best Practices
• Implement the tools and frameworks to author machine learning models using business data sources
• Expose predictive models via programmable APIs
• Provide the infrastructure to test, train and evaluate machine learning models
Interesting Technologies
• Spark MLlib (http://spark.apache.org/docs/latest/mllib-guide.html)
• Scikit-Learn (http://scikit-learn.org/)
• Dato (https://dato.com/)
• H2O.ai (http://www.h2o.ai/)
• …
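The train, evaluate, and expose cycle can be shown end to end with a deliberately tiny model. The sketch below fits a one-variable linear model by ordinary least squares in plain Python, scores it on held-out data, and wraps it in a request handler shaped like an API endpoint. The data is synthetic, `handle_request` is a hypothetical name, and a real pipeline would use a library such as Spark MLlib or Scikit-Learn instead.

```python
def train(points):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}

def predict(model, x):
    return model["slope"] * x + model["intercept"]

def evaluate(model, points):
    """Mean absolute error on held-out points."""
    return sum(abs(predict(model, x) - y) for x, y in points) / len(points)

def handle_request(model, payload):
    """Shape of a prediction endpoint: dict in, dict out (no web framework assumed)."""
    return {"prediction": predict(model, float(payload["x"]))}

# Synthetic, noise-free data: y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in range(10)]
train_set, test_set = data[:7], data[7:]

model = train(train_set)
error = evaluate(model, test_set)
response = handle_request(model, {"x": 20})
```

Keeping `train`, `evaluate`, and `handle_request` separate mirrors the three practices above: authoring, testing, and exposing the model are distinct steps.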

Stream Analytics
Best Practices
• Aggregate data in real time from diverse data sources
• Model static queries over dynamic streams of data
• Create simulations and replays of real data streams
Interesting Technologies
• Apache Storm (http://storm.apache.org/)
• Spark Streaming (http://spark.apache.org/streaming/)
• Apache Samza (http://samza.apache.org/)
• …
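A "static query over a dynamic stream" can be as simple as a sliding-window aggregate that emits a fresh answer for every incoming event. The sketch below shows this with a moving average in plain Python, plus a replay of a recorded slice of the stream; it does not use Storm, Spark Streaming, or Samza, and the window size and values are arbitrary.

```python
from collections import deque

def moving_average(events, window=3):
    """Standing query: emit the mean of the last `window` values per event."""
    recent = deque(maxlen=window)   # old values fall out automatically
    for value in events:
        recent.append(value)
        yield sum(recent) / len(recent)

recorded = [10, 20, 30, 40]                  # a captured slice of a stream
live = list(moving_average(iter(recorded)))  # consumed as an iterator, like a live feed
replayed = list(moving_average(recorded))    # replaying the capture
```

Because the query keeps no state outside the window, replaying the same events gives the same results, which is what makes simulation and regression testing of stream logic practical.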

Proactive Analytics
Best Practices
• Automate actions based on the output of predictive models
• Use programmatic models to script proactive analytics business rules
• Continuously test and validate proactive rules
Interesting Technologies
• Spark MLlib (http://spark.apache.org/docs/latest/mllib-guide.html)
• Scikit-Learn (http://scikit-learn.org/)
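Proactive rules can be kept as data: each rule maps a model score to an action, and a small validation harness checks the rules against expected outcomes every time they change. The following sketch assumes a churn-style score between 0 and 1; the thresholds, rule names, and action names are invented for illustration.

```python
RULES = [
    # (name, condition on the model score, action to trigger), checked in order
    ("high_risk", lambda score: score >= 0.8, "open_retention_case"),
    ("medium_risk", lambda score: score >= 0.5, "send_discount_offer"),
]

def decide(score, rules=RULES, default="no_action"):
    """Return the action of the first rule whose condition matches the score."""
    for _name, condition, action in rules:
        if condition(score):
            return action
    return default

def validate(rules, cases):
    """Return the cases where the rules disagree with the expected action."""
    return [
        (score, expected, decide(score, rules))
        for score, expected in cases
        if decide(score, rules) != expected
    ]

# Expected behaviour, written down once and re-checked whenever rules change.
CASES = [
    (0.95, "open_retention_case"),
    (0.6, "send_discount_offer"),
    (0.1, "no_action"),
]

failures = validate(RULES, CASES)
actions = [decide(s) for s in (0.9, 0.55, 0.2)]
```

An empty `failures` list is the signal that a rule change is safe to release; a non-empty one lists exactly which scores now trigger a different action.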

Enterprise Data Solutions
• Leverage a consistent data pipeline as part of all solutions
• Empower different teams to contribute to different aspects of the big data pipeline
• Keep track of key metrics about the big data pipeline such as time to deliver solutions, data volume over time, data quality metrics, etc.

Some Examples
• Data discovery
• Data quality
• Data testing tools
• …

Other Interesting Capabilities
• Mobile analytics
• Embedded analytics capabilities (ex: Salesforce Wave, Workday)
• Aggregation with external sources
• Video & image analytics
• Deep learning
• …

Building a big data and advanced analytics pipeline

Infrastructure-Driven
[Diagram: layered pipeline]
Data Access: Real time data access, Batch data access, Stream data access
Data Infrastructure: Data Storage, Data Aggregation, Data Transformation, Data Discovery, Others…
Data Science: Stream Analytics, Predictive Analytics, Proactive Analytics, Others…
Solutions: Solution, Solution, Solution

Domain-Driven
[Diagram: layered pipeline]
Data Access: Real time data access, Batch data access, Stream data access
Data Infrastructure: Data Storage, Data Aggregation, Data Transformation, Data Discovery, Others…
Data Science: Stream Analytics, Predictive Analytics, Proactive Analytics, Others…
Solutions: Solution, Solution, Solution

Infrastructure-Driven vs. Domain-Driven Approaches
Infrastructure-Driven
• Led by the architecture team
• Military discipline
• Commitment from business stakeholders
Domain-Driven
• Federated data teams
• Rapid releases
• Pervasive communications

Some General Rules
• Establish a vision across all levels of the data pipeline
• You can't buy everything… It is likely you will build custom data infrastructure building blocks
• Deliver infrastructure and functional capabilities incrementally
• Establish a data innovation group responsible for piloting infrastructure capabilities ahead of production schedules
• Encourage adoption even in early stages
• Iterate

Summary
• Big data and advanced analytics pipelines are based on four fundamental elements: data access, data infrastructure, data science, and data solutions
• A lot of inspiration can be drawn from the big data solutions built by leading internet vendors
• Establish a common vision and mission
• Start small… iterate…

Thanks
http://Tellago.com
Info@Tellago.com
