Cloudera - Amr Awadallah - Hadoop World 2010

Business Analyst Tools for Hadoop
Amr Awadallah
CTO, Cloudera, Inc.
Hadoop World
October 12th, 2010
Copyright 2010 Cloudera Inc. All Rights Reserved.

The Spectrum of Hadoop Users
[Diagram labels]
• Data sources: Logs, Files, Web Data
• Systems: Enterprise Data Warehouse, Web Application, Enterprise Reporting, BI/Analytics, IDEs, Relational Databases, Low-Latency Serving Systems, Cloudera Enterprise
• Users: Analysts, Business Users, Customers, Engineers, Operators

Evolution of Hadoop Query/Programming Languages
1. Java MapReduce: Gives the most flexibility and performance,
but potentially long development cycle (the “assembly
language” of Hadoop).
2. Streaming MapReduce (also Pipes): Allows you to develop in
any programming language of your choice, but slightly lower
performance and less flexibility.
3. Cascading: A thin Java library that sits on top of
MapReduce and lets developers assemble complex processes.
4. Pig: A high-level language out of Yahoo, suitable for batch data
flow workloads.
5. Hive: A SQL interpreter out of Facebook, also includes a
metastore mapping files to their schemas and associated SerDe.
6. Oozie: A workflow server engine, driven by XML (hPDL) definitions,
that enables creating a workflow of jobs composed of any of the above.
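
As a rough illustration of item 2, the sketch below shows the general shape of a Streaming job: a mapper and a reducer that exchange tab-separated key/value lines. The function names and the sample input are assumptions made for illustration, and the shuffle step is simulated locally with a sort rather than run on a cluster.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a Streaming mapper would write to stdout."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts per key; input must be sorted by key, as it is after the shuffle."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_local(lines):
    """Simulate map -> sort (stand-in for the shuffle) -> reduce in one process."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for out in run_local(["hive pig hive", "pig oozie"]):
        print(out)
```

In a real Streaming job the mapper and reducer would be separate scripts reading stdin and writing stdout; they are kept as plain functions here so the flow can be followed in one file.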

Hive vs Pig Example (count distinct values > 0)
• Hive syntax:
SELECT COUNT(DISTINCT col1)
FROM mytable
WHERE col1 > 0;
• Pig syntax:
mytable = LOAD 'myfile' AS (col1, col2, col3);
mytable = FOREACH mytable GENERATE col1;
mytable = FILTER mytable BY col1 > 0;
mytable = DISTINCT mytable;
mytable = GROUP mytable ALL;
mytable = FOREACH mytable GENERATE COUNT(mytable);
DUMP mytable;
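
For readers who know neither language, the intended result of both queries can be sketched in plain Python. This is only an analogy under stated assumptions: the function name and the sample rows are hypothetical, and the comments map each step onto the Pig operator it stands in for.

```python
def count_distinct_positive(rows):
    """rows: iterable of (col1, col2, col3) tuples.

    Returns the number of distinct col1 values greater than 0.
    """
    col1_values = (row[0] for row in rows)          # FOREACH ... GENERATE col1
    positives = (v for v in col1_values if v > 0)   # FILTER ... BY col1 > 0
    distinct = set(positives)                       # DISTINCT
    return len(distinct)                            # group everything, then COUNT


# Hypothetical sample data: col1 holds 3 twice, plus -1, 7 and 0.
sample_rows = [(3, "a", "x"), (3, "b", "y"), (-1, "c", "z"), (7, "d", "w"), (0, "e", "v")]

if __name__ == "__main__":
    print(count_distinct_positive(sample_rows))
```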

Hive Features
• A subset of SQL covering the most common statements
• Agile data types: Array, Map, Struct, and JSON objects
• User Defined Functions and Aggregates
• Regular Expression support
• MapReduce support
• JDBC/ODBC support
• Partitions and Buckets (for performance optimization)
• In the works: Indices, Columnar Storage, Views, MicroStrategy
compatibility, Explode/Collect
• More details: http://wiki.apache.org/hadoop/Hive

The Hadoop Query Tool Ecosystem
Platform: Cloudera Enterprise, Cloudera’s Distribution for Hadoop
• In Memory: PowerPivot, QlikTech, EdgeSpring, Tableau
• ETL: Informatica, Pervasive, IBM DataStage, Microsoft SSIS, Talend, Kettle
• Query Authoring: Karmasphere, Quest (Toad)
• Spreadsheet: IBM BigSheets, Datameer
• BI/OLAP: MicroStrategy, IBM Cognos, SAP BOBJ, Microsoft SSRS, Jaspersoft, Pentaho
• Developer: Karmasphere, Eclipse, Cascading
• Stats/Math: SAS, IBM SPSS, Matlab, R/RHIPE, Mahout, Hama
• Reporting: SAP Crystal, Actuate/BIRT
Hadoop is very flexible; use the right tool for the job at hand.

Toad for Cloud (for Query Authoring)
RDBMS Hadoop
Learn more at: http://www.ToadForCloud.com

Karmasphere (for Developers and Analysts)

Tableau (for Advanced Visualization)

Datameer (for Analysts, Spreadsheet UI)

MicroStrategy (for interactive Dashboards)

Talend (for Extract-Transform-Load, aka ETL)

General Advice for Choosing the Right Tool
• First and foremost, what problem are you trying to solve? And
what is your skill set? Use the tool that gets you there fastest.
• What is the learning curve involved with this new tool?
• Does the tool interoperate with other systems?
• Does the tool leverage your existing investment in Pig/Hive?
• Does the tool lock you in to a proprietary file format?
• Is the tool certified for Cloudera’s Distribution for Hadoop?

Hive Agile Data Types
• STRUCTS:
  • SELECT mytable.mycolumn.myfield FROM …
• MAPS (Hashes):
  • SELECT mytable.mycolumn[mykey] FROM …
• ARRAYS:
  • SELECT mytable.mycolumn[5] FROM …
• JSON:
  • SELECT get_json_object(mycolumn, objpath
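
The four access patterns above have close analogues in ordinary Python containers, sketched below. The row contents, column names, and the `get_json_path` helper are all hypothetical; the helper is a simplified stand-in that handles only dotted keys such as `$.store.name` and is not Hive's implementation of `get_json_object`.

```python
import json

# Hypothetical row with one column per complex type.
row = {
    "struct_col": {"myfield": "value"},          # STRUCT: named fields
    "map_col": {"mykey": 42},                    # MAP: key -> value
    "array_col": [10, 11, 12, 13, 14, 15],       # ARRAY: positional, zero-based
    "json_col": '{"store": {"name": "demo"}}',   # JSON kept as a string
}


def get_json_path(json_text, path):
    """Follow a '$.a.b' style path into a JSON string; return None if it cannot be resolved."""
    if not path.startswith("$"):
        return None
    try:
        node = json.loads(json_text)
    except json.JSONDecodeError:
        return None
    for key in path[1:].split("."):
        if key == "":
            continue  # skip the empty piece before the first dot
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


if __name__ == "__main__":
    print(row["struct_col"]["myfield"])                      # like mycolumn.myfield
    print(row["map_col"]["mykey"])                           # like mycolumn[mykey]
    print(row["array_col"][5])                               # like mycolumn[5]
    print(get_json_path(row["json_col"], "$.store.name"))    # like get_json_object
```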
