Extending the Enterprise Data Warehouse with Hadoop

                 Robert Lancaster and Jonathan Seidman
                                    Hadoop World 2011
                                     November 8, 2011
Who We Are


•  Robert Lancaster
  –  Solutions Architect, Hotel Supply Team
  –  rlancaster@orbitz.com
  –  @rob1lancaster
  –  Co-organizer of Chicago Big Data and Chicago Machine
     Learning Study Group.
•  Jonathan Seidman
  –  Lead Engineer, Business Intelligence/Big Data Team
  –  Co-founder/organizer of Chicago Hadoop User Group and
     Chicago Big Data
  –  jseidman@orbitz.com
  –  @jseidman

                                                             page 2
Launched in 2001




                   Over 160 million
                   bookings



                                      page 3
Some History…




                page 4
In 2009…


•  The Machine Learning team was formed to improve site
   performance, for example by improving hotel search results.
•  This required access to large volumes of behavioral data for
   analysis.
   –  Fortunately, the required data was collected in session data
      stored in web analytics logs.




                                                                     page 5
The Problem…


•  The only archive of the required data went back about two
   weeks.




[Diagram: Non-transactional data (e.g. searches) | Transactional data
(e.g. bookings) and aggregated non-transactional data → Data Warehouse]



                                                               page 6
Hadoop Provided a Solution…




[Diagram: Detailed non-transactional data (what every user sees, clicks,
etc.) → Hadoop | Transactional data (e.g. bookings) and aggregated
non-transactional data → Data Warehouse]

                                                    page 7
Deploying Hadoop Enabled Multiple Applications…

[Chart: two series, Queries and Searches, in percent (y-axis 0.00% to
100.00%) over x-axis values 1 to 20. Labeled data points: 71.67%, 34.30%,
31.87%, 2.78%.]




                                                                               page 8
And Useful Analyses…




                       page 9
But Brought New Challenges…


•  Most of these efforts are driven by development teams.
•  The challenge now is unlocking the value of this data for non-
   technical users.




                                                                    page 10
In Early 2011…


•  Big Data team is formed under Business Intelligence team at
   Orbitz Worldwide.
•  Allows the Big Data team to work more closely with the data
   warehouse and BI teams.
•  Reflects the importance of big data to the future of the
   company.




                                                                 page 11
A View Shared Beyond Orbitz…


“We strongly believe that Hadoop is the nucleus of the next-
generation cloud EDW…”

 “…but that promise is still three to five years from fruition.”*




                  *James Kobielus, Forrester Research,
                  “Hadoop, Is It Soup Yet?”


                                                            page 12
Two Primary Ways We Use Hadoop to Complement the EDW


•  Extraction and transformation of data for loading into the data
   warehouse – “ETL”.
•  Off-loading of analysis from the data warehouse.




                                                                     page 13
ETL Example: Proposed Dimensional Model




      Raw logs → Hadoop → Dimensional model




                                                        page 14
ETL Example: Click Data Processing




Web servers → Logs → ETL → DW → Data cleansing (stored procedure) → DW

Several hours of processing; ~20% original data size

Current Processing in Data Warehouse


                                                                       page 15
ETL Example: Click Data Processing


•  Moving to Hadoop will:
   –  Remove load from the data warehouse.
   –  Allow additional attributes to be added for processing.
   –  Allow processing to be run more frequently.


Web servers → Logs → Hadoop → Data cleansing (MapReduce) → DW

Proposed Processing in Hadoop
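
The MapReduce data-cleansing step proposed on this slide could look roughly like the map-side sketch below. It is only an illustration, not the pipeline Orbitz ran: the tab-separated record layout (timestamp, session id, URL, user agent), the `BOT_MARKERS` filter, and the function names are all assumptions made for the example.

```python
from typing import Iterable, Iterator, Optional

# Assumed (hypothetical) click record: timestamp, session_id, url, user_agent,
# separated by tabs. The real log format is not given in the slides.
EXPECTED_FIELDS = 4
BOT_MARKERS = ("bot", "crawler", "spider")  # assumed noise filter


def clean_record(line: str) -> Optional[str]:
    """Return a normalized record, or None if the line should be dropped."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != EXPECTED_FIELDS:
        return None  # malformed line
    timestamp, session_id, url, user_agent = (f.strip() for f in fields)
    if not timestamp or not session_id:
        return None  # cannot be tied to a session later
    if any(marker in user_agent.lower() for marker in BOT_MARKERS):
        return None  # looks like automated traffic
    # Emit session_id first so a later reduce step could group by session.
    return "\t".join((session_id, timestamp, url))


def run_mapper(lines: Iterable[str]) -> Iterator[str]:
    """Apply clean_record to every input line and keep the survivors."""
    for line in lines:
        cleaned = clean_record(line)
        if cleaned is not None:
            yield cleaned


sample = [
    "2011-11-08T10:00:00\ts1\t/hotels\tMozilla/5.0\n",
    "2011-11-08T10:00:01\ts2\t/hotels\tGooglebot/2.1\n",
    "truncated line\n",
]
cleaned_sample = list(run_mapper(sample))
```

Under Hadoop Streaming, a mapper reads records from standard input and writes to standard output, so `run_mapper` could be fed `sys.stdin` and its results printed one per line. Dropping rejected records on the map side is one way the cleansed output could end up much smaller than the raw logs.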


                                                          page 16
Analysis Example: Geo-Targeting Ads


•  Facilitated analysis that allows for more personalized ad
   content.
•  Allowed marketing team to analyze over a year's worth of
   search data.
•  Provided analysis that was difficult to perform in the data
   warehouse.




                                                                 page 17
BI Vendors Are Working on Hadoop Integration

Both big (relatively)…




                                               page 18
And small…




             page 19
Example Processing Pipeline for Web Analytics Data




                                                     page 20
Example Use Case: Selection Errors




                                     page 21
Use Case – Selection Errors: Introduction


                                 •  Multiple points of entry.
                                 •  Multiple paths through site.
                                 •  Goal: tie events together to
                                    form picture of customer
                                    behavior.
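
Tying events together, as the goal above describes, could be sketched as grouping events by a shared session key and ordering each group by time. This is a minimal single-machine illustration under stated assumptions: the `Event` fields, the session key, and the action names are hypothetical and do not come from the slides.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Event(NamedTuple):
    session_id: str  # hypothetical key shared by one visitor's events
    timestamp: int   # assumed to be comparable across event sources
    action: str      # hypothetical labels, e.g. "search", "select_hotel"


def build_paths(events: Iterable[Event]) -> Dict[str, List[str]]:
    """Group events by session and order each group by timestamp."""
    by_session: Dict[str, List[Event]] = defaultdict(list)
    for event in events:
        by_session[event.session_id].append(event)
    return {
        sid: [e.action for e in sorted(evs, key=lambda e: e.timestamp)]
        for sid, evs in by_session.items()
    }


events = [
    Event("s1", 20, "select_hotel"),
    Event("s2", 5, "landing_page"),
    Event("s1", 10, "search"),
    Event("s1", 30, "selection_error"),
]
paths = build_paths(events)
```

In a MapReduce setting the grouping would typically fall to the shuffle phase, with the session key as the map output key and the ordering done in the reducer.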




                                                                page 22
Use Case – Selection Errors: Processing




                                          page 23
Use Case – Selection Errors: Visualization




                                             page 24
Example Use Case: Beta Data




                              page 25
Use Case – Beta Data: Introduction




                              •  Hotel Sort Optimization
                              •  Compare A vs. B
                              •  Web Analytics Data
                                 –  What user saw.
                                 –  How user behaved.
                              •  Server Log Data
                                 –  Sorting behavior used.
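
The A vs. B comparison above combines two sources: server logs (which sort was used) and web analytics data (how the user behaved). A minimal sketch of that join follows. The session-id join key, the "booked" outcome, and the function name are assumptions for illustration; the slides do not say how the data was actually joined or which metric was compared.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def conversion_by_variant(
    sort_variant_by_session: Dict[str, str],
    session_outcomes: Iterable[Tuple[str, bool]],
) -> Dict[str, float]:
    """Join behavior to sort variant on session id; return booking rate per variant."""
    sessions: Counter = Counter()
    bookings: Counter = Counter()
    for session_id, booked in session_outcomes:
        variant = sort_variant_by_session.get(session_id)
        if variant is None:
            continue  # no server-log record for this session
        sessions[variant] += 1
        if booked:
            bookings[variant] += 1
    return {v: bookings[v] / sessions[v] for v in sessions}


# Hypothetical data: server log gives the sort variant, web analytics the outcome.
server_log = {"s1": "A", "s2": "B", "s3": "A", "s4": "B"}
web_analytics = [("s1", True), ("s2", True), ("s3", False), ("s4", True), ("s5", True)]
rates = conversion_by_variant(server_log, web_analytics)
```

A real comparison would also need enough sessions per variant and a significance test before declaring a winner.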




                                                             page 26
Use Case – Beta Data: Processing




                                  page 27
Use Case – Beta Data: Visualization




                                      page 28
Example Use Case: RCDC




                         page 29
Use Case – RCDC: Introduction


•  Understand and improve cache behavior.
•  Improve “coverage”
  –  Traditionally search 1 page of hotels at a time.
  –  Get “just enough” information to present to consumers.
  –  Increase amount of availability information we have when
     consumer performs a search.
•  Data needed to support needs beyond reporting.




                                                                page 30
Use Case – RCDC: Processing




                              page 31
Use Case – RCDC: Visualization




                                 page 32
Conclusions


•  Hadoop market is still immature, but growing quickly. Better
   tools are on the way.
   –  Look beyond the usual (enterprise) suspects. Many of the
      most interesting companies in the big data space are small
      startups.
•  Hadoop will not replace your data warehouse, but any
   organization with a large data warehouse should at least be
   exploring Hadoop as a complement to their BI infrastructure.




                                                                   page 33
Conclusions


•  Work closely with your existing data management teams.
   –  Your idea of what constitutes big data might quickly
      diverge from theirs.
•  The flip-side to this is that Hadoop can be an excellent tool to
   off-load resource-consuming jobs from your data warehouse.




                                                                      page 34
