Steve Watt Presentation

Big Data
Steve Watt Emerging Technologies @ HP

1
– Someday Soon (Flickr)

Agenda

Hardware Software Data

• Big Data

• Situational
Applications

3

Situational Applications

4
– eaghra (Flickr)

Web 2.0 Era Topic Map
Produce Process
Inexpensiv
Data e Storage
Explosion
LAM
Social P
Platform Publishin
s g
Platforms

Situational
Applications
Web 2.0 Mashups

Enterpris SOA
e
5

Big Data

7
– blmiers2 (Flickr)

The data just keeps growing…

1024 GIGABYTE= 1 TERABYTE
1024 TERABYTES = 1 PETABYTE
1024 PETABYTES = 1 EXABYTE

1 PETABYTE 13.3 Years of HD Video

20 PETABYTES Amount of Data processed by Google daily

5 EXABYTES All words ever spoken by humanity

Mobile
App Economy for Devices Sensor Web
App for this App for that An instrumented and monitored world

Set Top Tablets, etc. Multiple Sensors in your pocket
Boxes
Real-time
Data

The Fractured Web
Opportunity
Facebook Twitter LinkedIn
Service Economy
Service for this Google NetFlix New York Times

Service for that eBay Pandora PayPal Web 2.0 Data Exhaust of
Historical and Real-time Data

Web 2.0 - Connecting People API Foundation
Web as a Platform
9 Web 1.0 - Connecting Machines Infrastructure

Data Deluge! But filter patterns can
help…
10
Kakadu (Flickr)

Filtering
Socially

Awesome
12

But filter patterns force you down a pre-processed
path
M.V. Jantzen (Flickr)

What if you could ask your own questions?

15
– wowwzers(Flickr)

And go from discovering Something about Everything…

– MrB-MMX (Flickr)

To discovering Everything about Something ?

17

How do we do this?
Lets examine a few techniques for
Gathering,
Storing,
Processing &

18
Delivering Data @ Scale

Gathering Data

Data Marketplaces

19

Gathering Data

Apache Nutch
(Web Crawler)

22

Storing, Reading and Processing - Apache Hadoop
 Cluster technology with a single master and scale out with multiple slaves
 It consists of two runtimes:
 The Hadoop Distributed File System (HDFS)
 Map/Reduce

 As data is copied onto the HDFS it ensures the data is blocked and replicated to other
machines to provide redundancy
 A self-contained job (workload) is written in Map/Reduce and submitted to the Hadoop
Master which in-turn distributes the job to each slave in the cluster.
 Jobs run on data that is on the local disks of the machine they are sent to ensuring data
locality
 Node (Slave) failures are handled automatically by Hadoop. Hadoop may execute or re-
execute a job on any node in the cluster.

Want to know more?
23
“Hadoop – The Definitive Guide (2nd Edition)”

Delivering Data @ Scale

• Structured Data
• Low Latency & Random Access
• Column Stores (Apache HBase or Apache Cassandra)
• faster seeks
• better compression
• simpler scale out
• De-normalized – Data is written as it is intended to be queried

Want to know more?
24
“HBase – The Definitive Guide” & “Cassandra High Performance

Storing, Processing & Delivering : Hadoop + NoSQL

Gather Read/Transfor Low-
m latency Application
Web Data
Nutch Query
Crawl
Serve
Copy

Apache
Hadoop
Log Files
Flume
Connector HDFS NoSQL
Repository
NoSQL
SQOOP Connector/A
Connector PI

Relational
Data
-Clean and Filter Data
(JDBC)
- Transform and Enrich Data
MySQL
- Often multiple Hadoop jobs
25

Some things to keep
in mind…

26
– Kanaka Menehune (Flickr)

Some things to keep in mind…

• Processing arbitrary types of data (unstructured, semi-
structured, structured) requires normalizing data with many different
kinds of readers
Hadoop is really great at this !
• However, readers won’t really help you process truly unstructured data
such as prose. For that you’re going to have to get handy with Natural
Language Processing. But this is really hard.
Consider using parsing services & APIs like Open Calais

Want to know more?
27
“Programming Pig” (O’REILLY)

Statistical real-time decision making

 Capture Historical information

 Use Machine Learning to build decision making models (such as
Classification, Clustering & Recommendation)

 Mesh real-time events (such as sensor data) against Models to make
automated decisions

Want to know more?
29
“Mahout in Action”

Using Apache
Identify Optimal Seed URLs for a Seed List & Crawl to a depth of 2

For example:

http://www.crunchbase.com/companies?c=a&q=private_held
http://www.crunchbase.com/companies?c=b&q=private_held
http://www.crunchbase.com/companies?c=c&q=private_held
http://www.crunchbase.com/companies?c=d&q=private_held
...

Crawl data is stored in sequence files in the segments dir on the HDFS
33

Making the data STRUCTURED

Retrieving HTML

Prelim Filtering on URL

Company POJO then /t Out

35

Aargh!

My viz tool
requires
zipcodes to plot
geospatially!

36

Apache Pig Script to Join on City to get Zip
Code and Write the results to Vertica

ZipCodes = LOAD 'demo/zipcodes.txt' USING PigStorage('t') AS (State:chararray, City:chararray, ZipCode:int);

CrunchBase = LOAD 'demo/crunchbase.txt' USING PigStorage('t') AS

(Company:chararray,City:chararray,State:chararray,Sector:chararray,Round:chararray,Month:int,Year:int,Investor:chararray,Amount:int);

CrunchBaseZip = JOIN CrunchBase BY (City,State), ZipCodes BY (City,State);

STORE CrunchBaseZip INTO

'{CrunchBaseZip(Company varchar(40), City varchar(40), State varchar(40), Sector varchar(40), Round varchar(40), Month int, Year

int, Investor int, Amount varchar(40))}’

USING com.vertica.pig.VerticaStorer(‘VerticaServer','OSCON','5433','dbadmin','');

Total Tech Investments By Year

Total Investments By Zip Code for all Sectors

$1.2 Billion in Boston

$7.3 Billion in San Francisco

$2.9 Billion in Mountain View

$1.7 Billion in Austin

40

Total Investments By Zip Code for Consumer Web

$600 Million in Seattle
$1.2 Billion in Chicago

$1.7 Billion in San Francisco

41

Total Investments By Zip Code for BioTech

$1.3 Billion in Cambridge

$528 Million in Dallas

$1.1 Billion in San Diego

42

Questions?

Steve Watt swatt@hp.com

@wattsteve

stevewatt.blogspot.com

43

Steve Watt Presentation

More Related Content

What's hot (20)

Viewers also liked (9)

Similar to Steve Watt Presentation (20)

Recently uploaded (20)

Steve Watt Presentation

Editor's Notes