Hadoop as data refinery

Hadoop as a Data Refinery

Steve Loughran– Hortonworks
@steveloughran
London, October 2012

© Hortonworks Inc. 2012

About me:
• HP Labs:
–Deployment, cloud infrastructure, Hadoop-in-Cloud
• Apache – member and committer
–Ant, Axis ; author: Ant in Action
–Hadoop
–Dynamic deployments
–Diagnostics on failures
–Cloud infrastructure integration
• Joined Hortonworks in 2012
–UK based: R&D

Page 2

What is Apache Hadoop?

• Collection of Open Source Projects One of the best examples of
– Apache Software Foundation (ASF) open source driving innovation
– commercial and community development and creating a market

• Foundation for Big Data Solutions
– Stores petabytes of data reliably
– Runs highly distributed computation
– Commodity servers & storage
– Powers data-driven business

Page 3

Why Hadoop?
Business Pressure
1 Opportunity to enable innovative new business models

2 Potential new insights that drive competitive advantage

Technical Pressure
3 Data collected and stored continues to grow exponentially

4 Data is increasingly everywhere and in many formats

5 Traditional solutions not designed for new requirements

Financial Pressure
6 Cost of data systems, as % of IT spend, continues to grow

7 Cost advantages of commodity hardware & open source

Page 4

The data refinery in an enterprise
Audio, Web, Mobile, CRM,
Video, ERP, SCM, …
Images
New Data Business
Transactions
Docs, Sources
Text, & Interactions
XML

HDFS
Web
Logs,
Clicks
Big Data
SQL NoSQL NewSQL
Social, Refinery
Graph, ETL
Feeds

EDW MPP NewSQL
Sensors,
Devices,
RFID

Business
Pig
Spatial, Intelligence
GPS Apache Hadoop
& Analytics
Events,
Other Dashboards, Reports,
Visualization, …

Page 5

Modernising Business Intelligence
• Before:
– Current records & short history
– Analytics/BI systems keep conformed / cleaned / digested data
– Unstructured data locked silos, archived offline
Inflexible, new questions require system redesigns

• Now
– Keep raw data in Hadoop for a long time
– Reprocess/enhance analytics/BI data on-demand
– Can directly experiment on all raw data
– New products / services can be added very quickly
Storage and agility justifies new infrastructure

Page 6

Refineries pull in raw data
Internal: pipelines with Apache Flume
– Web site logs
– Real-world events: retail, financial, vehicle movements
– New data sources you create
The data you couldn't afford to keep

External: pipelines and bulk deliveries
– Correlating data: weather, market, competition
– New sources -twitter feeds, infochimps, open government
– Real-world events: retail, financial
– Apache Sqoop
To help understand your own data

Page 8

Refineries refine raw data
• Clean up raw data
• Filter “cleaned” data

• Forward data to different destinations:
– Existing BI infrastructure
– New “Agile Data” infrastructures

• Offload work from the core Data Warehouse
– ETL operations
– Report and Chart Generation
– Ad-hoc queries

Needs: query, workflow and reporting tools
Page 9

Refineries can store data
• Retain historical transaction data, analyses
• Store (cleaned, filtered, compressed) raw data
• Provide the history for more advanced analysis in
future applications and queries

• Needs: storage, query tools
– Storage: HDFS and HBase
– Languages: Pig & Hive
– Workflow for scheduled jobs: Oozie
– Shared schema repository: HCatalog

Hadoop makes storing bulk & historical data affordable
Page 10

What if I didn't have a Data
Warehouse?

Page 12

Congratulations!

1. HBase: scale, Hadoop integration

2. mongoDB, CouchDB, Riak
good for web UIs

3. Postgres, MySQL, …
transactions
Page 13

Agile Data

Page 14

Agile Data
• SQL Experts: Hive HQL queries
• Ad-hoc queries: Pig
• Statistics platform: R + Hadoop
• Visualisation tools –including Excel
• New web UI applications

Because you don’t know all that you are looking for
when you collect the data

Page 15

Pig: an Agile Data language
• Optimised for refining data
• Dataflow-driven –much higher level than Java
• Macros and User Defined Functions
• ILLUSTRATE aids development
• For ad-hoc and production use

Page 17

Example: Packetpig
snort_alerts = LOAD '$pcap'
USING
com.packetloop.packetpig.loaders.pcap.detection.SnortLoader('$snortconfig');

countries = FOREACH snort_alerts
GENERATE
com.packetloop.packetpig.udf.geoip.Country(src) as country,
priority;

countries = GROUP countries BY country;

countries = FOREACH countries
GENERATE
group,
AVG(countries.priority) as average_severity;

STORE countries into 'output/choropleth_countries' using PigStorage(',');

Page 18

web UI: d3.js

Page 19

Analytics Apps: It takes a Team
• Broad skill-set to make useful apps
• Basically nobody has them all
• Application development is inherently collaborative

Page 20

Developers: learn statistics via Pig

Data Scientists & Statisticians:
learn Pig (and R)

Russ Jurney @ HUG UK in November
meetup.com/hadoop-users-group-uk/
Page 21

Challenge:
Becoming a data-driven organisation

Page 22

Challenges
• Thinking of the right questions to ask

• Conducting valid experiments:
A/B testing, surveys with effective sampling, …
– Not: "try a web new design for a week"
– Not: "please do a site survey" pop-up dialog

• Accepting negative results
– "no design was better than the other"

• Accepting results you don't agree with
– “trials imply the proposed strategy won't work”

Page 23

Example: Yahoo!
• Online Application logic driven by big lookup tables

• Lookup data computed periodically on Hadoop
– Machine learning, other expensive computation offline
– Personalization, classification, fraud, value analysis…

• Application development requires data science
– Huge amounts of actually observed data key to modern apps
– Hadoop used as the science platform

Architecting
© Hortonworks Inc. 2012 the Future of Big Data
Page 24

Yahoo! Homepage

• Serving Maps SCIENCE » Machine learning to build ever
• Users - Interests HADOOP better categorization models
CLUSTER
• Five Minute CATEGORIZATION
USER
Production BEHAVIOR MODELS (weekly)

• Weekly PRODUCTION
Categorization HADOOP » Identify user interests using
CLUSTER
models SERVING Categorization models
MAPS
(every 5 minutes)
USER
BEHAVIOR

SERVING SYSTEMS ENGAGED USERS

Build customised home pages with latest data (thousands / second)
Copyright Yahoo 2011 25

Conclusions

Hadoop can live alongside existing BI
systems –as a data refinery

• Store, refine bulk & unstructured data
• Archive data for long-term analysis
• Support ad-hoc queries over bulk data
• Become the data-science platform

26

Thank You!
Questions & Answers

hortonworks.com/download

Page 27

Hadoop as data refinery

More Related Content

What's hot (19)

Viewers also liked (20)

Similar to Hadoop as data refinery (20)

More from Steve Loughran (20)

Recently uploaded (20)

Hadoop as data refinery

Editor's Notes