Hadoop as a Data Refinery

Steve Loughran– Hortonworks
@steveloughran
London, October 2012




© Hortonworks Inc. 2012
About me:
• HP Labs:
   – Deployment, cloud infrastructure, Hadoop-in-Cloud
• Apache: member and committer
   – Ant, Axis; author of Ant in Action
   – Hadoop:
      – Dynamic deployments
      – Diagnostics on failures
      – Cloud infrastructure integration
• Joined Hortonworks in 2012
   – UK based: R&D



What is Apache Hadoop?


• Collection of Open Source Projects
   – Apache Software Foundation (ASF)
   – Commercial and community development
   One of the best examples of open source driving innovation and creating a market

• Foundation for Big Data Solutions
   – Stores petabytes of data reliably
   – Runs highly distributed computation
   – Commodity servers & storage
   – Powers data-driven business




Why Hadoop?
    Business Pressure
1   Opportunity to enable innovative new business models

2   Potential new insights that drive competitive advantage

    Technical Pressure
3   Data collected and stored continues to grow exponentially

4   Data is increasingly everywhere and in many formats

5   Traditional solutions not designed for new requirements

    Financial Pressure
6   Cost of data systems, as % of IT spend, continues to grow

7   Cost advantages of commodity hardware & open source

The data refinery in an enterprise

[Diagram: new data sources (audio, video, images; docs, text, XML; web logs and clicks; social graphs and feeds; sensors, devices, RFID; spatial and GPS data; other events) flow into the Big Data Refinery (Apache Hadoop: HDFS, Pig). Refined data moves, via ETL, between the refinery, Business Transactions & Interactions systems (web, mobile, CRM, ERP, SCM, running on SQL/NoSQL/NewSQL stores) and Business Intelligence & Analytics systems (EDW, MPP and NewSQL platforms feeding dashboards, reports and visualization).]
Modernising Business Intelligence
• Before:
  – Current records & short history
  – Analytics/BI systems keep conformed / cleaned / digested data
  – Unstructured data locked in silos, archived offline
  Inflexible: new questions require system redesigns


• Now:
  – Keep raw data in Hadoop for a long time
  – Reprocess/enhance analytics/BI data on demand
  – Experiment directly on all the raw data
  – New products / services can be added very quickly
  Storage and agility justify the new infrastructure


Refineries pull in raw data
Internal: pipelines with Apache Flume
  – Web site logs
  – Real-world events: retail, financial, vehicle movements
  – New data sources you create
   The data you couldn't afford to keep

External: pipelines and bulk deliveries
  – Correlating data: weather, market, competition
  – New sources: Twitter feeds, Infochimps, open government data
  – Real-world events: retail, financial
  – Apache Sqoop for bulk transfer from relational databases
   To help understand your own data
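A Flume pipeline is an agent built from sources, channels and sinks. A minimal single-node configuration sketch for landing web server logs in HDFS (the agent name, log path and HDFS path are illustrative, not from the deck):

```properties
# Hypothetical agent "refinery": tail an access log into HDFS.
refinery.sources  = weblog
refinery.channels = mem
refinery.sinks    = hdfs-out

# Source: follow the web server's access log as it grows
refinery.sources.weblog.type = exec
refinery.sources.weblog.command = tail -F /var/log/httpd/access_log
refinery.sources.weblog.channels = mem

# Channel: buffer events in memory between source and sink
refinery.channels.mem.type = memory

# Sink: write events into date-partitioned HDFS directories
refinery.sinks.hdfs-out.type = hdfs
refinery.sinks.hdfs-out.hdfs.path = hdfs://namenode/refinery/weblogs/%Y-%m-%d
refinery.sinks.hdfs-out.channel = mem
```

Date-partitioned paths keep each day's ingest in its own directory, which makes the downstream refining jobs and retention policies simple.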

Refineries refine raw data
• Clean up raw data
• Filter “cleaned” data

• Forward data to different destinations:
  – Existing BI infrastructure
  – New “Agile Data” infrastructures


• Offload work from the core Data Warehouse
  – ETL operations
  – Report and Chart Generation
  – Ad-hoc queries


      Needs: query, workflow and reporting tools
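The clean-up-and-filter step above can be sketched in a few lines; in practice it runs as a Pig or MapReduce job over HDFS, but the logic is the same. The record layout and the sanity checks here are invented for illustration:

```python
import csv
import io

# Hypothetical raw clickstream records: timestamp, user id, price.
RAW = """\
2012-10-01T09:00:00,u1,19.99
,u2,5.00
2012-10-01T09:00:02,u3,-3.50
2012-10-01T09:00:03,u4,12.00
"""

def clean(rows):
    """Drop records failing basic sanity checks; refining is mostly this."""
    for timestamp, user, price in rows:
        if not timestamp:
            continue                # no event time: discard
        value = float(price)
        if value <= 0:
            continue                # implausible price: filter out
        yield timestamp, user, value

cleaned = list(clean(csv.reader(io.StringIO(RAW))))
```

In a real refinery you would also log and sample the discarded outliers rather than silently dropping them, so the filters themselves can be audited.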
Refineries can store data
• Retain historical transaction data, analyses
• Store (cleaned, filtered, compressed) raw data
• Provide the history for more advanced analysis in
  future applications and queries

• Needs: storage, query tools
  – Storage: HDFS and HBase
  – Languages: Pig & Hive
  – Workflow for scheduled jobs: Oozie
  – Shared schema repository: HCatalog



Hadoop makes storing bulk & historical data affordable
What if I didn't have a Data Warehouse?




Congratulations!


1. HBase: scale, Hadoop integration

2. MongoDB, CouchDB, Riak: good for web UIs

3. Postgres, MySQL, …: transactions
Agile Data




Agile Data
• SQL experts: Hive HQL queries
• Ad-hoc queries: Pig
• Statistics platform: R + Hadoop
• Visualisation tools, including Excel
• New web UI applications




 Because you don’t know all that you are looking for
            when you collect the data


Pig: an Agile Data language
• Optimised for refining data
• Dataflow-driven: much higher level than Java
• Macros and User Defined Functions
• ILLUSTRATE aids development
• For ad-hoc and production use




Example: Packetpig
-- Load Snort alerts from a packet capture, using Packetpig's SnortLoader
snort_alerts = LOAD '$pcap'
  USING com.packetloop.packetpig.loaders.pcap.detection.SnortLoader('$snortconfig');

-- Geolocate each alert's source address, keeping the alert priority
countries = FOREACH snort_alerts
  GENERATE
    com.packetloop.packetpig.udf.geoip.Country(src) AS country,
    priority;

-- Average alert severity per country
countries = GROUP countries BY country;

countries = FOREACH countries
  GENERATE
    group,
    AVG(countries.priority) AS average_severity;

-- Write CSV for the choropleth map visualisation
STORE countries INTO 'output/choropleth_countries' USING PigStorage(',');


web UI: d3.js




Analytics Apps: It takes a Team
• Broad skill-set to make useful apps
• Basically nobody has them all
• Application development is inherently collaborative




Developers: learn statistics via Pig

Data Scientists & Statisticians:
learn Pig (and R)


Russ Jurney @ HUG UK in November
meetup.com/hadoop-users-group-uk/
Challenge:
Becoming a data-driven organisation




Challenges
• Thinking of the right questions to ask

• Conducting valid experiments:
  A/B testing, surveys with effective sampling, …
  – Not: "try a new web design for a week"
  – Not: "please do a site survey" pop-up dialog


• Accepting negative results
  – "no design was better than the other"


• Accepting results you don't agree with
  – “trials imply the proposed strategy won't work”
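Conducting a valid experiment means computing significance rather than eyeballing two conversion rates. A sketch of the standard two-proportion z-test, using only the Python standard library (the traffic numbers are invented; this is an illustration, not a method from the deck):

```python
from math import erf, sqrt

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Two-sided two-proportion z-test; returns (z, p_value)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Design A: 120 conversions in 2400 visits; design B: 135 in 2400.
z, p_value = two_proportion_z(120, 2400, 135, 2400)
```

A p-value above the usual 0.05 threshold is exactly the negative result ("no design was better than the other") that an organisation has to be willing to accept.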

Example: Yahoo!
• Online Application logic driven by big lookup tables

• Lookup data computed periodically on Hadoop
  – Machine learning, other expensive computation offline
  – Personalization, classification, fraud, value analysis…


• Application development requires data science
  – Huge amounts of actually observed data key to modern apps
  – Hadoop used as the science platform
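The pattern is simple even though the Yahoo!-scale version runs on Hadoop: an offline batch job reduces bulk behaviour data to a compact lookup table, and the online application only does cheap lookups against the latest table. All names and data below are invented for illustration:

```python
# Offline (e.g. a weekly Hadoop job): reduce raw behaviour events
# to a small per-user lookup table of dominant interests.
def build_interest_table(events):
    counts_by_user = {}
    for user, category in events:
        counts = counts_by_user.setdefault(user, {})
        counts[category] = counts.get(category, 0) + 1
    # keep only each user's most frequent category: small output, big value
    return {user: max(counts, key=counts.get)
            for user, counts in counts_by_user.items()}

events = [("u1", "sport"), ("u1", "sport"), ("u1", "news"), ("u2", "finance")]
INTERESTS = build_interest_table(events)

# Online serving layer: no model evaluation at request time, just a lookup.
def personalise_homepage(user):
    return INTERESTS.get(user, "general")
```

Splitting the expensive computation from serving is what lets the serving side handle thousands of requests per second while the models improve offline.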




Yahoo! Homepage


[Diagram: a SCIENCE Hadoop cluster runs machine learning over user behaviour to build ever-better categorization models, refreshed weekly. A PRODUCTION Hadoop cluster applies those categorization models to user behaviour to identify user interests, rebuilding the serving maps every five minutes. Serving systems use the maps to build customised home pages with the latest data (thousands per second) for engaged users.]
Copyright Yahoo 2011
Conclusions

Hadoop can live alongside existing BI
systems, as a data refinery

•   Store, refine bulk & unstructured data
•   Archive data for long-term analysis
•   Support ad-hoc queries over bulk data
•   Become the data-science platform



Thank You!
Questions & Answers

hortonworks.com/download





Editor's Notes

  • #6: In the graphic above, Apache Hadoop acts as the Big Data Refinery. It’s great at storing, aggregating, and transforming multi-structured data into more useful and valuable formats. Apache Hive is a Hadoop-related component that fits within the Business Intelligence & Analytics category, since it is commonly used for querying and analyzing data within Hadoop in a SQL-like manner. Apache Hadoop can also be integrated with other EDW, MPP, and NewSQL components such as Teradata, Aster Data, HP Vertica, IBM Netezza, EMC Greenplum, SAP Hana, Microsoft SQL Server PDW and many others. Apache HBase is a Hadoop-related NoSQL key/value store that is commonly used for building highly responsive next-generation applications. Apache Hadoop can also be integrated with other SQL, NoSQL, and NewSQL technologies such as Oracle, MySQL, PostgreSQL, Microsoft SQL Server, IBM DB2, MongoDB, DynamoDB, MarkLogic, Riak, Redis, Neo4J, Terracotta, GemFire, SQLFire, VoltDB and many others. Finally, data movement and integration technologies help ensure data flows seamlessly between the systems in the above diagrams; the lines in the graphic are powered by technologies such as WebHDFS, Apache HCatalog, Apache Sqoop, Talend Open Studio for Big Data, Informatica, Pentaho, SnapLogic, Splunk, Attunity and many others.
  • #8: At the highest level, I describe three broad areas of data processing and outline how these areas interconnect. The three areas are: 1. Business Transactions & Interactions; 2. Business Intelligence & Analytics; 3. Big Data Refinery. The graphic illustrates a vision for how these three types of systems can interconnect in ways aimed at deriving maximum value from all forms of data. Enterprise IT has been connecting systems via classic ETL processing, as illustrated in Step 1 above, for many years in order to deliver structured and repeatable analysis. In this step, the business determines the questions to ask and IT collects and structures the data needed to answer those questions. The “Big Data Refinery”, as highlighted in Step 2, is a new system capable of storing, aggregating, and transforming a wide range of multi-structured raw data sources into usable formats that help fuel new insights for the business. The Big Data Refinery provides a cost-effective platform for unlocking the potential value within data and discovering the business questions worth answering with this data. A popular example of big data refining is processing Web logs, clickstreams, social interactions, social feeds, and other user-generated data sources into more accurate assessments of customer churn or more effective creation of personalized offers. More interestingly, there are businesses deriving value from processing large video, audio, and image files. Retail stores, for example, are leveraging in-store video feeds to help them better understand how customers navigate the aisles as they find and purchase products. Retailers that provide optimized shopping paths and intelligent product placement within their stores are able to drive more revenue for the business.
In this case, while the video files may be big in size, the refined output of the analysis is typically small in size but potentially big in value. The Big Data Refinery platform provides fertile ground for new types of tools and data processing workloads to emerge in support of rich multi-level data refinement solutions. With that as backdrop, Step 3 takes the model further by showing how the Big Data Refinery interacts with the systems powering Business Transactions & Interactions and Business Intelligence & Analytics. Interacting in this way opens up the ability for businesses to get a richer and more informed 360° view of customers, for example. By directly integrating the Big Data Refinery with existing Business Intelligence & Analytics solutions that contain much of the transactional information for the business, companies can enhance their ability to more accurately understand the customer behaviors that lead to the transactions. Moreover, systems focused on Business Transactions & Interactions can also benefit from connecting with the Big Data Refinery. Complex analytics and calculations of key parameters can be performed in the refinery and flow downstream to fuel runtime models powering business applications, with the goal of more accurately targeting customers with the best and most relevant offers, for example. Since the Big Data Refinery is great at retaining large volumes of data for long periods of time, the model is completed with the feedback loops illustrated in Steps 4 and 5. Retaining the past 10 years of historical “Black Friday” retail data, for example, can benefit the business, especially if it’s blended with other data sources such as 10 years of weather data accessed from a third-party data provider. The point here is that the opportunities for creating value from multi-structured data sources available inside and outside the enterprise are virtually endless if you have a platform that can do it cost-effectively and at scale.
  • #10: Real-world data is 'dirty' - you need to clean it up. Examples:
- merge multiple events into one over an extended period
- sanity-check events against your world view (how fast things move, how much things cost) - there is much danger here
- text cleanup; discard empty fields
You may still want to retain the original data to see what was filtered - at the very least, log & sample the outliers.
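Those cleanup rules can be sketched in a few lines of Python. The record shape, the `price` field, and the range limit are assumptions for illustration, not from the slides; the one point taken directly from above is that outliers are kept for inspection rather than silently dropped.

```python
# Hypothetical cleanup pass: strip text, drop empty fields,
# range-check a value, and retain (not delete) the rejects.
def clean(records, max_price=10_000):
    kept, rejected = [], []
    for rec in records:
        # text cleanup + discard empty fields
        rec = {k: v.strip() if isinstance(v, str) else v
               for k, v in rec.items() if v not in ("", None)}
        # sanity check against our "world view" of plausible prices
        price = rec.get("price")
        if price is not None and not (0 <= price <= max_price):
            rejected.append(rec)   # log & sample these later
            continue
        kept.append(rec)
    return kept, rejected

kept, rejected = clean([{"id": "a ", "price": 5},
                        {"id": "b", "price": -1},
                        {"id": "c", "note": ""}])
```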
  • #11: This is taking the metaphor beyond its limits: all that comes next is photos of Grangemouth or Milford Haven. Real-world refineries have giant storage tanks to buffer differences between ingress and egress rates. Here we are proposing keeping the data near the refinery.
  • #12: RCFile (Record Columnar File): http://en.wikipedia.org/wiki/RCFile
HCatalog is a table abstraction and a storage abstraction system that makes it easy for multiple tools to interact with the same underlying data. A common buzzword in the NoSQL world today is "polyglot persistence": basically, you pick the right tool for the job. In the Hadoop ecosystem you have many tools that might be used for data processing - you might use Pig or Hive, or your own custom MapReduce program, or that shiny new GUI-based tool that's just come out. Which one to use might depend on the user, on the type of query you're interested in, or on the type of job you want to run. From another perspective, you might want to store your data in columnar storage for efficient storage and retrieval for particular query types, or in text so that users can write data producers in scripting languages like Perl or Python, or you may want to hook up an HBase table as a data source. As an end-user, I want to use whatever data-processing tool is available to me. As a data designer, I want to optimize how data is stored. As a cluster manager/data architect, I want the ability to share pieces of information across the board, and to move data back and forth fluidly. HCatalog's hopes and promises are the realization of all of the above.
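As a toy illustration of why a columnar format such as RCFile pays off (this is not RCFile itself, just the underlying idea), compare scanning whole rows against scanning a single pivoted column:

```python
# Row-oriented vs column-oriented layout, in miniature.
rows = [{"user": "alice", "url": "/a", "bytes": 120},
        {"user": "bob",   "url": "/b", "bytes": 300}]

# Row-oriented: every record is read in full, even to aggregate one field.
total_row = sum(r["bytes"] for r in rows)

# Column-oriented: the same table pivoted into per-column arrays,
# so an aggregate touches only the column it needs.
columns = {k: [r[k] for r in rows] for k in rows[0]}
total_col = sum(columns["bytes"])

assert total_row == total_col == 420
```

On disk the columnar layout also compresses better, since each column holds values of one type; that, not the Python mechanics, is the real argument for RCFile.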
  • #20: This is an example that went up on our web site recently, using Pig to analyse NetFlow packets and so look at origins over time. That's the kind of thing you can only do with large datasets. Using a language like Pig helps you look at the numbers and decide what the next questions to ask are.
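The Pig script itself isn't reproduced on the slide; this Python sketch mimics its shape - group NetFlow-style records by source address and hour, then count flows - with invented field names:

```python
# Stand-in for a Pig GROUP BY over NetFlow records.
# Fields (src_ip, epoch_seconds) are assumed, not from the talk.
from collections import Counter

def flows_by_origin(flows):
    """flows: iterable of (src_ip, epoch_seconds) tuples.
    Returns a count per (src_ip, hour-bucket)."""
    return Counter((src, ts // 3600) for src, ts in flows)

counts = flows_by_origin([("10.0.0.1", 100), ("10.0.0.1", 200),
                          ("10.0.0.2", 4000)])
```

In Pig the same shape is roughly `GROUP flows BY (src, hour)` followed by a `COUNT` in a `FOREACH`; the point of the slide is that once the grouping is this cheap to express, you iterate on the questions, not the plumbing.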
  • #23: This is important. Once you start becoming more aware of your customers, your potential customers, your internal state and the world outside, you have more information than ever before. Yet you still need to analyse it.
  • #24:
- Conducting valid experiments: A/B testing of two different options must be conducted truly at random, to avoid selection bias or influence by external factors.
- Accepting negative results: it's OK to have an outcome that says "neither option is any better or worse than the other".
- Accepting results you don't agree with: evidence that your idea doesn't work.
No. 3 is hard - and it's why you need large, valid sample sets; otherwise you could dismiss the result as a bad experiment. Governments are classic examples of organisations that don't do this. Badger culling and drug policy are key examples - policy is driven by the beliefs of constituencies (farmers, the Daily Mail) rather than by recognising the evidence and trying to explain to those constituencies that they are mistaken. This isn't a critique of the current administration - the previous one was also belief-driven rather than fact-driven.
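One common way to get truly random yet repeatable A/B assignment is to hash a stable user identifier, rather than splitting by time of day or location (which invites the selection bias warned about above). A sketch, with the bucket scheme invented for illustration:

```python
# Hash-based A/B assignment: deterministic per user, unbiased overall.
import hashlib

def assign(user_id, buckets=("A", "B")):
    # A cryptographic hash spreads users evenly across buckets,
    # independent of when or where they arrive.
    h = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return buckets[h % len(buckets)]

split = [assign(f"user{i}") for i in range(1000)]
```

The same user always lands in the same bucket, so the experience is consistent, and across many users the split is close to 50/50 by construction rather than by hope.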