Seshu Edala, Dave Schaefer, Nghia Ngo – IT Architects
November 2015
Gobblin @ Intel
Legal Message
THE INFORMATION PROVIDED IN THIS PRESENTATION IS INTENDED TO BE GENERAL IN NATURE AND IS NOT
SPECIFIC GUIDANCE. RECOMMENDATIONS (INCLUDING POTENTIAL COST SAVINGS) ARE BASED UPON INTEL'S
EXPERIENCE AND ARE ESTIMATES ONLY. INTEL DOES NOT GUARANTEE OR WARRANT OTHERS WILL OBTAIN
SIMILAR RESULTS
This presentation is for informational purposes only. INTEL MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS
SUMMARY.
Software and workloads used in performance tests may have been optimized for performance only on Intel
microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer
systems, components, software, operations and functions. Any change to any of those factors may cause the results
to vary. You should consult other information and performance tests to assist you in fully evaluating your
contemplated purchases, including the performance of that product when combined with other products.
For more complete information about performance and benchmark results, visit www.intel.com/benchmarks
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
* Other names and brands may be claimed as the property of others.
Copyright © 2016, Intel Corporation. All rights reserved.
Outline
• Integrated Analytics Vision
• Data Ingestion Challenges
• Solution
• What we would like to do
• What we did
• Challenges
• Need Help
• Summary
Integrated Analytics Vision & Mission
Our Vision: Customers are empowered to easily make rapid, impactful business decisions and
uncover new revenue channels through connected data & analytics
Our Mission: Provide clean, relatable, integrated data using a consistent approach to deliver
business recommendations and insights through visual and interactive usage
[Figure: Raw Data, Transformed and Connected Data, Advanced Analytics]
As Is – Data Ingestion Architecture
[Figure: as-is ingestion architecture. Recoverable labels:
External Source Systems (via Firewall and Proxy Channels): Sales CRM, Marketing campaign management, Content Tagging, Webinar
Internal Source Systems: Logs, DataMart (two), EDW, RDBMS, Flat/CSV Files
Gateway Node: Camel, SFTP, Vendor utility, Python script, Custom utility, Hadoop Put (shown five times)
IT BI Hadoop Cluster: Hadoop Storage (HDFS, Hive), Transformation
Data Consumption: Visualization tools, Client Tools]
Data Ingestion Challenges
We ingest a variety of internal and external data sources, such as the enterprise data warehouse,
enterprise master data, spreadsheets, social media feeds, marketing data, and retailer data.
This resulted in a variety of challenges, including:
• Individual project teams devising their own methods for ingesting data from various
sources and building their own data pipelines
• Operational complexity in managing the individual pipelines
• No reusability, as each project team created redundant methods/codebases for ingesting
data sources
• High development cost, as each team built its own data ingestion pipelines
• Inconsistent quality across project teams’ data ingestion codebases, impacting data
quality and reliability
• Job failures resulting from data format, quality, schema evolution, and availability issues
• Skillset challenges
No standardized reusable framework for data ingestion
Solution: Data Ingestion Architecture with Gobblin/Kite
[Figure: target ingestion architecture. Recoverable labels:
External Source Systems (via Firewall and Proxy Channels): Sales CRM, Marketing campaign management, Content tagging, Webinar, Retailer, Social media feeds
Internal Source Systems: Logs, DataMart (two), EDW, RDBMS, Flat/CSV Files
Gateway Node, Data Ingestion Reusable Framework: Gobblin Interface, UI, Config Files, Alert, Validation, Kite
Sources and adapters: Kafka, RESTful APIs, SFTP, Vendor APIs, File Adapter, CSV Adapter, RDBMS JDBC Connector, and many more
IT BI Hadoop Cluster: Hadoop Storage (Hive / HDFS / HBase)
Data Consumption: Visualization tool, Client Tools]
What we set out to do
Functionally evaluate Gobblin for ingesting and integrating data.
Prototype a non-out-of-the-box (non-OOB) source to extract data from an “online campaign
automation provider”.
Acceptance Criteria
• Bulk REST API
• Validate the correctness of data
• Data consistency from end to end
• Notification, status, and error logging
• Ability to log kickout records
• Training plan for implementation and adoption
What we did
Data Scope
• 4 objects
• accounts
• contacts
• 9 activities
• 59 custom objects
Parallel load data
• Hive (not using compaction) *
• HDFS (BaseDataPublisher)
Functional UI ready
• Scheduling
• Job History
• Authoring job configurations
Functional backend ready
• Enterprise scheduler
• Gobblin Standalone
• Gobblin Map-Reduce *
Quality checking policies
• Row level
• Task level
Enterprise features
• Alerting
• Monitoring
• Profiling *
• Logging
* Needs more attention
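The row-level and task-level quality checks listed above, together with the kickout-record logging from the acceptance criteria, could look roughly like the sketch below. It is plain Java written for illustration and is not Gobblin's policy API: `RowQualitySketch`, `RowPolicy`, `Outcome`, `checkRows`, `checkTask`, the `email` field, and the 0.5 threshold are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class RowQualitySketch {

    enum Result { PASSED, FAILED }

    /** Hypothetical row-level policy: a named check applied to one record. */
    static final class RowPolicy {
        final String name;
        final Predicate<Map<String, String>> check;

        RowPolicy(String name, Predicate<Map<String, String>> check) {
            this.name = name;
            this.check = check;
        }

        Result execute(Map<String, String> record) {
            return check.test(record) ? Result.PASSED : Result.FAILED;
        }
    }

    /** Accepted rows plus kickout entries (failed policy name and the rejected row). */
    static final class Outcome {
        final List<Map<String, String>> accepted = new ArrayList<>();
        final List<String> kickouts = new ArrayList<>();
    }

    /** Runs each policy on each row; the first failing policy kicks the row out. */
    static Outcome checkRows(List<Map<String, String>> rows, List<RowPolicy> policies) {
        Outcome outcome = new Outcome();
        for (Map<String, String> row : rows) {
            String failedPolicy = null;
            for (RowPolicy policy : policies) {
                if (policy.execute(row) == Result.FAILED) {
                    failedPolicy = policy.name;
                    break;
                }
            }
            if (failedPolicy == null) {
                outcome.accepted.add(row);
            } else {
                outcome.kickouts.add(failedPolicy + ": " + row);
            }
        }
        return outcome;
    }

    /** Task-level check: passes when the accepted share of rows is at least minRatio. */
    static Result checkTask(Outcome outcome, double minRatio) {
        int total = outcome.accepted.size() + outcome.kickouts.size();
        if (total == 0) {
            return Result.PASSED; // empty input passes; an assumption of this sketch
        }
        double ratio = (double) outcome.accepted.size() / total;
        return ratio >= minRatio ? Result.PASSED : Result.FAILED;
    }

    /** Builds a record from alternating key/value strings. */
    static Map<String, String> row(String... keyValues) {
        Map<String, String> record = new HashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            record.put(keyValues[i], keyValues[i + 1]);
        }
        return record;
    }

    public static void main(String[] args) {
        List<Map<String, String>> rows = new ArrayList<>();
        rows.add(row("id", "1", "email", "a@example.com"));
        rows.add(row("id", "2")); // no email, so this row is kicked out
        List<RowPolicy> policies = new ArrayList<>();
        policies.add(new RowPolicy("email-present", r -> r.containsKey("email")));

        Outcome outcome = checkRows(rows, policies);
        System.out.println("accepted=" + outcome.accepted.size()
                + " kickouts=" + outcome.kickouts);
        System.out.println("task=" + checkTask(outcome, 0.5));
    }
}
```

Keeping the kickout list separate from the accepted rows lets the task-level check decide on the batch as a whole while the rejected records remain available for logging.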
Process Flow
Establish connection
• Authentication
• Endpoint indirection
Object Determination
• Get Object Listing
• Get Schema Definition
• Slice Schema
Create Intent
• Create Exports
Establish size boundaries
• Create Syncs
• Poll Syncs
• Slice batches
Download
• Parallel batches
Rebuild data
• Reassemble
• Schema inference
• Data Conversion
Data Publishing
• Hive/Impala load
• View Definition
• Quality enforcement
Parallel download and reassembly of data blocks
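The "slice batches, download in parallel, reassemble" steps can be sketched as below. This is a minimal illustration under stated assumptions, not the prototype's code: `ParallelBatchDownload`, `sliceBatches`, `downloadAll`, and `fakeFetch` are hypothetical names, and `fakeFetch` fabricates records locally in place of a real bulk export endpoint.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BiFunction;

public class ParallelBatchDownload {

    /** Slices [0, totalRecords) into {offset, limit} batches of at most batchSize records. */
    static List<int[]> sliceBatches(int totalRecords, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<int[]> batches = new ArrayList<>();
        for (int offset = 0; offset < totalRecords; offset += batchSize) {
            batches.add(new int[] {offset, Math.min(batchSize, totalRecords - offset)});
        }
        return batches;
    }

    /** Downloads all batches on a thread pool and reassembles them in batch order. */
    static List<String> downloadAll(int totalRecords, int batchSize, int threads,
            BiFunction<Integer, Integer, List<String>> fetchBatch) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> pending = new ArrayList<>();
            for (int[] batch : sliceBatches(totalRecords, batchSize)) {
                final int offset = batch[0];
                final int limit = batch[1];
                Callable<List<String>> task = () -> fetchBatch.apply(offset, limit);
                pending.add(pool.submit(task));
            }
            // Futures are read in submission order, so the result keeps batch order
            // even when the downloads finish out of order.
            List<String> reassembled = new ArrayList<>();
            for (Future<List<String>> future : pending) {
                reassembled.addAll(future.get());
            }
            return reassembled;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("download interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("batch download failed", e.getCause());
        } finally {
            pool.shutdownNow();
        }
    }

    /** Stand-in for the remote export endpoint: fabricates records locally. */
    static List<String> fakeFetch(int offset, int limit) {
        List<String> records = new ArrayList<>();
        for (int i = offset; i < offset + limit; i++) {
            records.add("record-" + i);
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> records = downloadAll(10, 3, 4, ParallelBatchDownload::fakeFetch);
        System.out.println(records.size() + " records, first=" + records.get(0)
                + ", last=" + records.get(records.size() - 1));
    }
}
```

Collecting the futures in submission order is one simple way to make reassembly deterministic; a real source would also need retry and partial-failure handling, which this sketch leaves out.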
Gobblin Challenges
User Interface – Visual Execution and Evaluation
Data Routing – Routing for complex enterprise integration patterns is challenging to
implement
public enum Result {
    PASSED, // The test passed
    FAILED  // The test failed
}
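One way to read the snippet above is that a two-valued result leaves a record with only two possible destinations. The sketch below contrasts that with a content-based router that can send records to more than two places. It is hypothetical throughout: `ContentRouterSketch`, `Route`, `fromResult`, and `route` are names invented for this example, not part of Gobblin.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class ContentRouterSketch {

    /** The two-valued outcome shown on the slide. */
    enum Result { PASSED, FAILED }

    /** Hypothetical multi-way outcome for content-based routing. */
    enum Route { PUBLISH, QUARANTINE, RETRY }

    /** A binary result can only ever select two of the routes. */
    static Route fromResult(Result result) {
        return result == Result.PASSED ? Route.PUBLISH : Route.QUARANTINE;
    }

    /** Groups records by the route the router function picks for each one. */
    static <T> Map<Route, List<T>> route(List<T> records, Function<T, Route> router) {
        Map<Route, List<T>> routed = new EnumMap<>(Route.class);
        for (Route r : Route.values()) {
            routed.put(r, new ArrayList<>());
        }
        for (T record : records) {
            routed.get(router.apply(record)).add(record);
        }
        return routed;
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>();
        records.add("ok-1");
        records.add("");        // empty payload
        records.add("?late-2"); // marked as incomplete by the hypothetical source

        Map<Route, List<String>> routed = route(records, rec -> {
            if (rec.isEmpty()) {
                return Route.QUARANTINE;
            }
            return rec.startsWith("?") ? Route.RETRY : Route.PUBLISH;
        });
        System.out.println(routed);
    }
}
```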
Need Gobblin Community Help
• Address adoption challenges
• Intake process for third-party contributions
– New source: “online campaign automation provider”
– Spark-based ingestion candidates (Parquet, Avro, JSON, JDBC, S3) and runtime
– Kite SDK
• Partnership with key big data vendors (CDH, HDP, MapR) for internalizing Gobblin
capability
– Deployment, management, metrics, and lineage integration
• Implement queuing or pluggable schedulers that do not rely on PID and workdir states;
better integration with enterprise schedulers
• Make Hive publishers native, rather than relying on offline compactions
• Publish documentation for the user community
Summary
• Gobblin is a robust data integration framework that meets the expected scale, quality,
and enterprise-readiness imperatives.
• However, some areas, such as usability, enterprise integration patterns,
scheduling, profiling, lineage, deployment, and documentation, could be improved.
Gobblin for Data Analytics
