Data Integration in 2013:
  A working session
  Adam Muise
  March 26 2013




Note: This deck is purposely sparse. Want value?
Join the conversation in the Toronto Hadoop User
Group:
http://www.meetup.com/TorontoHUG/

  © Hortonworks Inc. 2012
Proposed Agenda
•   Introductions
•   Discuss common Data Integration Patterns
•   Round-table of User Group Member CDC/ETL Use Cases
•   New Data Integration Solutions: A change from the Old Guard:
    –   Hadoop and the Data Lake
    –   Streaming (+ Hadoop)
    –   Data Lake Governance / Management (InfoTrellis)
    –   Databus (LinkedIn)




Introductions
Who let you in?




General Data Integration Patterns
• Enterprise Application Integration*
       – Metadata lookup
       – Validation
       – Extra-app communication

• Enterprise Service Bus (SOA, Message Bus/Hub)*
• Federation*
       – Bridging multiple databases with a query layer
       – Eg: Composite

• Extract Transform Load (ETL)*
       – Collection
       – Aggregation
       – Format/Schema transformation

• Data Lake
       – Landing Zone for multiple datasets in one store
       – Mixed schema, often raw structured/unstructured data
       – Eg: Hadoop

* Source: Data Integration Blueprint and Modeling: Techniques for a Scalable and Sustainable Architecture, Anthony David Giordano, 2010, IBM Press.
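
As a talking point for the ETL pattern above, here is a minimal sketch of its three steps (collection, format/schema transformation, aggregation) followed by a load. Everything in it is an assumption for illustration: the two feeds, the field names (`account`, `amount`, `acct`, `amt_cents`), and the in-memory dict standing in for a target store. It is not taken from any product named in this deck.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sources: a CSV feed in dollars and a record feed in cents.
CSV_FEED = "account,amount\nA-1,10.50\nA-2,4.00\nA-1,2.25\n"
RECORD_FEED = [{"acct": "A-2", "amt_cents": 600}]


def extract():
    """Collection: pull rows from each source in its native shape."""
    csv_rows = list(csv.DictReader(io.StringIO(CSV_FEED)))
    return csv_rows, RECORD_FEED


def transform(csv_rows, record_rows):
    """Map both feeds onto one schema (account, cents), then aggregate."""
    unified = [(r["account"], round(float(r["amount"]) * 100)) for r in csv_rows]
    unified += [(r["acct"], r["amt_cents"]) for r in record_rows]

    totals = defaultdict(int)
    for account, cents in unified:
        totals[account] += cents
    return dict(totals)


def load(totals, target):
    """Load: write the aggregated rows into the target store (a dict here)."""
    target.update(totals)
    return target


warehouse = load(transform(*extract()), {})
print(warehouse)
```

The point of the sketch is the ordering: the schema is fixed and the data is reshaped before anything reaches the target.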


Use Case Roundtable
Data that’s keeping you up at night…




Scotia iTrade: Geoffrey Li




New Data Integration Solutions

Fresh ideas for new and old problems…




Hadoop: The Data Lake


[Diagram: Hadoop as the data lake. Labels, as they appear on the slide: Extract & Load; Model/Apply Metadata; Transform & Aggregate; Publish Event / Signal Data / Transformation; Publish / Exchange; Explore / Visualize / Report; Analyze]
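
One way to picture the "Extract & Load" then "Model/Apply Metadata" flow is schema-on-read: land the data raw, and apply a schema only when someone reads it. The sketch below is a toy under stated assumptions. The list standing in for the landing zone, the `land` and `read_trades` helpers, and the sample records are all hypothetical, and no Hadoop API is used.

```python
import json

# Hypothetical landing zone: raw records kept exactly as they arrived.
landing_zone = []


def land(source, raw_text):
    """Extract & Load: store the payload untouched, tagged with its source."""
    landing_zone.append({"source": source, "raw": raw_text})


def read_trades(zone):
    """Apply a schema at read time to records from the 'trades' source."""
    for record in zone:
        if record["source"] != "trades":
            continue
        fields = json.loads(record["raw"])
        # Only the fields this reader cares about are typed and kept.
        yield {"symbol": str(fields["symbol"]), "qty": int(fields["qty"])}


land("trades", '{"symbol": "ABC", "qty": "100", "venue": "X"}')
land("weblog", "GET /quote?symbol=ABC 200")
land("trades", '{"symbol": "XYZ", "qty": 5}')

trades = list(read_trades(landing_zone))
print(trades)
```

Nothing is discarded at load time: the weblog line and the extra `venue` field stay in the landing zone for a later reader with a different schema.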




Streaming & Hadoop




http://developer.yahoo.com/blogs/ydn/posts/2013/02/storm-and-hadoop-convergence-of-big-data-and-low-latency-processing/

Databus (LinkedIn)
Databus is a low-latency change capture system that has become an
integral part of LinkedIn’s data processing pipeline. Databus addresses a
fundamental requirement to reliably capture, flow, and process primary
data changes. Databus provides the following features:
   1.    Isolation between sources and consumers
   2.    Guaranteed in-order, at-least-once delivery with high availability
   3.    Consumption from an arbitrary point in time in the change stream, including full bootstrap
         capability of the entire data set
   4.    Partitioned consumption
   5.    Source consistency preservation




  https://github.com/linkedin/databus/wiki
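
To make the feature list above concrete, here is a toy in-memory change stream. It is not the Databus API: `Change`, `ChangeLog`, `Consumer`, and the `seq` field are hypothetical names invented for this sketch. It only illustrates an ordered log that isolates the source from consumers, a per-consumer checkpoint that gives at-least-once delivery, replay from an arbitrary point (a checkpoint of 0 is a full bootstrap), and partitioned consumption by key.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    seq: int    # position in the change stream (hypothetical field name)
    key: str
    value: str


class ChangeLog:
    """Append-only, ordered log that decouples the source from consumers."""

    def __init__(self):
        self._changes = []

    def append(self, key, value):
        change = Change(len(self._changes) + 1, key, value)
        self._changes.append(change)
        return change

    def read_from(self, seq):
        """Return, in order, every change whose sequence number exceeds seq."""
        return self._changes[seq:]


class Consumer:
    def __init__(self, partition, num_partitions):
        self.partition = partition
        self.num_partitions = num_partitions
        self.checkpoint = 0   # last sequence number fully processed
        self.state = {}

    def owns(self, key):
        # Byte-sum partitioning is stable across runs, unlike str hash().
        return sum(key.encode()) % self.num_partitions == self.partition

    def poll(self, log):
        for change in log.read_from(self.checkpoint):
            if self.owns(change.key):
                self.state[change.key] = change.value  # idempotent upsert
            self.checkpoint = change.seq  # advance only after processing


log = ChangeLog()
log.append("user:1", "alice")
log.append("user:2", "bob")
log.append("user:1", "alice2")

c0, c1 = Consumer(0, 2), Consumer(1, 2)
c0.poll(log)
c1.poll(log)
print(c0.state, c1.state)
```

Because the checkpoint moves only after a change is applied, a consumer that crashes mid-batch sees some changes again on restart. That is why the apply step is written as an idempotent upsert.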

