Rhat OSS - Cloudera - Mike Olson - Hadoop Data Analytics In The Cloud

Hadoop
Data Analytics in the Cloud

Mike Olson
Chief Executive Officer

Friday, July 17, 2009

Hadoop History

▪ Doug Cutting worked on Nutch (web-scale crawler-based
search), 2002-2004
▪ Google published MapReduce paper in 2004
▪ Cutting adds DFS & MapReduce support to Nutch
▪ Joined by Mike Cafarella
▪ 2006: Yahoo! hires Cutting, Hadoop spins out of Nutch
▪ Web-scale deployments in 2007, 2008 at Y!, Facebook, others
▪ Today: 22 committers to core project
▪ Related projects: HBase, Hive, Pig, Mahout, Hama and others


Why Hadoop?

▪ Large web properties invented MapReduce for large-scale,
reliable, inexpensive analytics
▪ Enterprises generally need these techniques
• Retail, financial services, oil and gas, health care, green
technologies and more
▪ Hardware trends driving toward long-term retention of valuable
source data
▪ New analytical tools are required
▪ Hadoop complements current-generation data warehousing and
analytical products


Where Does Data Come From?
Many Sources Provide Deeper Insight



▪ Simulations and Scientific/Experimental Data
• genome sequencing, medical imaging, wireless sensors



▪ Existing Databases
• product catalogs, historical sales data, transaction histories



▪ User Data
• web logs, website clicks, pictures, videos, bulletin-board posts, etc.



▪ System-Generated Data
• Thousands of systems reporting status every second



▪ Data Comes in All Shapes, Sizes, Schemas and Structures
▪ Hadoop combines many sources regardless of format and structure


Hadoop Technical Overview: HDFS
Storing Data: Distributed Over Many Machines

HDFS: Hadoop Distributed File System



Commodity Servers





Files are broken into blocks and distributed across all
servers. Replication protects data from hardware failure.
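The block-and-replica idea can be sketched in a few lines of Python. This is a toy model, not HDFS code: the function names, the round-robin placement, and the replication factor of 3 are assumptions made for illustration, and real HDFS uses its own placement policy.

```python
def split_into_blocks(file_size, block_size):
    """Return the size of each block for a file of `file_size` bytes."""
    full, remainder = divmod(file_size, block_size)
    return [block_size] * full + ([remainder] if remainder else [])


def place_replicas(num_blocks, servers, replication=3):
    """Assign every block to `replication` distinct servers.

    Round-robin placement is an assumption for this sketch only.
    """
    if replication > len(servers):
        raise ValueError("need at least as many servers as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            servers[(block_id + r) % len(servers)] for r in range(replication)
        ]
    return placement


def blocks_lost(placement, failed_servers):
    """List blocks whose replicas all sit on failed servers."""
    return [
        block_id
        for block_id, hosts in placement.items()
        if all(host in failed_servers for host in hosts)
    ]


if __name__ == "__main__":
    sizes = split_into_blocks(file_size=200, block_size=64)
    layout = place_replicas(len(sizes), ["s1", "s2", "s3", "s4"])
    print(sizes)
    print(layout)
    print(blocks_lost(layout, {"s1", "s2"}))
```

With three replicas on distinct servers, a block is lost only if all three of its hosts fail, which is why losing two of the four servers in this example loses no data.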



Hadoop Technical Overview: MapReduce
Processing Data: Leveraging Data Locality


Data elements processed locally, in parallel
Reliable computation implicitly managed by Hadoop

MapReduce
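A minimal word-count sketch in plain Python illustrates the map, shuffle, and reduce phases. It does not use the Hadoop API; the function names and the `partitions` dictionary (standing in for data already stored on each node) are hypothetical, and the loop runs sequentially where a real cluster would run the map step on each node in parallel.

```python
from collections import defaultdict


def map_words(line):
    """Map phase: emit a (word, 1) pair for every word in one line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce phase: sum the counts collected for one word."""
    return key, sum(values)


def run_job(partitions):
    """Run the three phases over lines grouped by the node that holds them."""
    mapped = []
    for node, lines in partitions.items():
        # Each node maps only the lines it stores (data locality).
        for line in lines:
            mapped.extend(map_words(line))
    return dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    data = {"node1": ["the cloud", "the data"], "node2": ["data in the cloud"]}
    print(run_job(data))
```

Only the small (word, count) pairs leave a node; the input lines themselves are never moved, which is the point of processing data where it is stored.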


Hadoop Technical Overview: Reliability
Fault Tolerance: Handled with Software


Data loss prevented through automatic replication and rebalancing
Computation is restarted automatically without user intervention

Software Fault Tolerance
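Automatic restart can be sketched as a retry loop over the nodes that hold a replica of the input. This is an illustrative simulation, not Hadoop's scheduler: `NodeFailure`, `run_on_node`, `run_with_retries`, and the `failed_nodes` set are all hypothetical names.

```python
class NodeFailure(Exception):
    """Raised when a task is sent to a node that is down."""


def run_on_node(node, task, data, failed_nodes):
    """Run `task` on `node`, or fail if the node is down (simulated)."""
    if node in failed_nodes:
        raise NodeFailure(node)
    return task(data)


def run_with_retries(task, data, replica_nodes, failed_nodes):
    """Try each node holding a replica until one completes the task.

    Returns the result and the list of nodes that were attempted.
    """
    attempted = []
    for node in replica_nodes:
        attempted.append(node)
        try:
            return run_on_node(node, task, data, failed_nodes), attempted
        except NodeFailure:
            continue  # reschedule on the next replica, no user action needed
    raise RuntimeError("all replicas are on failed nodes")


if __name__ == "__main__":
    result, attempted = run_with_retries(
        sum, [1, 2, 3], ["n1", "n2", "n3"], failed_nodes={"n1"}
    )
    print(result, attempted)
```

Because the input is replicated, the rescheduled attempt can read the same data from a different node, so the caller sees only the final result.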


Cloud Deployment Options for Hadoop
▪ In your data center
• Acquire, provision, administer servers
• Choose a virtualization infrastructure?
▪ On dedicated, hosted services
• Scale up or down by coordinating with your MSP
▪ On dynamic web services (AWS and others)
• Spin up, use, shut down a cluster
• Issues: data persistence and location, organizational control

