Business Analyst Tools for Hadoop
Amr Awadallah
CTO, Cloudera, Inc.
Hadoop World
October 12th, 2010
Copyright 2010 Cloudera Inc. All Rights Reserved.
The Spectrum of Hadoop Users
[Diagram labels. Data sources: Logs, Files, Web Data. Connected systems: Enterprise Data Warehouse, Relational Databases, Low-Latency Serving Systems, Web Application, Enterprise Reporting, BI/Analytics. Users: Engineers (IDEs), Analysts, Business Users, Customers, Operators (Cloudera Enterprise).]
Evolution of Hadoop Query/Programming Languages
1. Java MapReduce: Gives the most flexibility and performance,
but with a potentially long development cycle (the “assembly
language” of Hadoop).
2. Streaming MapReduce (also Pipes): Lets you develop in any
programming language of your choice, at the cost of slightly lower
performance and less flexibility.
3. Cascading: A thin Java library that sits on top of MapReduce
and lets developers assemble complex processes.
4. Pig: A high-level language out of Yahoo, suitable for batch data
flow workloads.
5. Hive: A SQL interpreter out of Facebook that also includes a
metastore mapping files to their schemas and associated SerDe.
6. Oozie: An XML-based (PDL) workflow engine server that enables
creating a workflow of jobs composed of any of the above.
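To make item 2 concrete, here is a minimal sketch of the Streaming model in Python: a mapper emits tab-separated key/value lines and a reducer sums the values for each key, relying on the framework to deliver its input sorted by key. The function names `mapper` and `reducer` and the counting task are illustrative assumptions, not part of the talk; in a real job each function would read from standard input and write to standard output.

```python
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated (value, 1) record per non-empty input line."""
    for line in lines:
        value = line.strip()
        if value:
            yield f"{value}\t1"


def reducer(sorted_lines):
    """Sum the counts for each key; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


# Local simulation of map -> sort -> reduce (sorted() stands in for the shuffle).
mapped = sorted(mapper(["b\n", "a\n", "b\n", "\n"]))
reduced = list(reducer(mapped))
```

The `sorted()` call stands in for the shuffle-and-sort phase that Hadoop performs between the two steps.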
Hive vs Pig Example (count distinct values > 0)
• Hive syntax:
SELECT COUNT(DISTINCT col1)
FROM mytable
WHERE col1 > 0;
• Pig syntax:
mytable = LOAD 'myfile' AS (col1, col2, col3);
mytable = FOREACH mytable GENERATE col1;
mytable = FILTER mytable BY col1 > 0;
mytable = DISTINCT mytable;
mytable = GROUP mytable ALL;
mytable = FOREACH mytable GENERATE COUNT(mytable);
DUMP mytable;
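As a rough cross-check of what both queries compute, the same dataflow can be sketched in plain Python, one line per Pig statement. The function name `count_distinct_positive` and the sample rows are made up for illustration; this shows the logic only, not how Hive or Pig execute it on a cluster.

```python
def count_distinct_positive(rows):
    """Mirror the Pig steps: project col1, keep values > 0, de-duplicate, count."""
    col1 = (row[0] for row in rows)                           # FOREACH ... GENERATE col1
    positive = (v for v in col1 if v is not None and v > 0)   # FILTER ... BY col1 > 0
    distinct = set(positive)                                  # DISTINCT
    return len(distinct)                                      # GROUP ... ALL, then COUNT


# Hypothetical sample data: (col1, col2, col3)
rows = [(3, "a", "x"), (3, "b", "y"), (-1, "c", "z"), (7, "d", "w"), (0, "e", "v")]
result = count_distinct_positive(rows)
```

Only 3 and 7 survive the filter and de-duplication here, which is the same answer `COUNT(DISTINCT col1)` with `WHERE col1 > 0` would give on this data.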
Hive Features
• A subset of SQL covering the most common statements
• Agile data types: Array, Map, Struct, and JSON objects
• User Defined Functions and Aggregates
• Regular Expression support
• MapReduce support
• JDBC/ODBC support
• Partitions and Buckets (for performance optimization)
• In The Works: Indices, Columnar Storage, Views, MicroStrategy
compatibility, Explode/Collect
• More details: http://wiki.apache.org/hadoop/Hive
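The "Partitions and Buckets" bullet can be illustrated with a small sketch. Partitions map to directories named key=value, so a query that filters on a partition column can skip whole directories; buckets split rows into a fixed number of files by hashing a column. The helper names below are hypothetical, and the modulo in `bucket_index` is a toy stand-in for Hive's own hash function, which this sketch does not reproduce.

```python
def partition_path(table_root, partition):
    """Build a directory path with one key=value segment per partition column."""
    segments = [f"{key}={value}" for key, value in partition.items()]
    return "/".join([table_root.rstrip("/")] + segments)


def bucket_index(value, num_buckets):
    """Assign an integer key to a bucket by modulo (toy stand-in for Hive's hash)."""
    return value % num_buckets


def prune(partitions, wanted):
    """Keep only the partitions whose columns match every wanted key/value pair."""
    return [p for p in partitions if all(p.get(k) == v for k, v in wanted.items())]


# Hypothetical table location and partition values.
path = partition_path("/warehouse/mytable/", {"dt": "2010-10-12", "country": "US"})
kept = prune([{"dt": "2010-10-11"}, {"dt": "2010-10-12"}], {"dt": "2010-10-12"})
```

A filter on `dt` only needs to read the directories that `prune` keeps, which is where the performance benefit comes from.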
The Hadoop Query Tool Ecosystem
Base platform: Cloudera Enterprise, Cloudera’s Distribution for Hadoop
• In Memory: PowerPivot, QlikTech, EdgeSpring, Tableau
• ETL: Informatica, Pervasive, IBM DataStage, Microsoft SSIS, Talend, Kettle
• Query Authoring: Karmasphere, Quest (Toad)
• Spreadsheet: IBM BigSheets, Datameer
• BI/OLAP: MicroStrategy, IBM Cognos, SAP BOBJ, Microsoft SSRS, Jaspersoft, Pentaho
• Developer: Karmasphere, Eclipse, Cascading
• Stats/Math: SAS, IBM SPSS, Matlab, R/RHIPE, Mahout, Hama
• Reporting: SAP Crystal, Actuate/BIRT
Hadoop is very flexible; use the right tool for the job at hand.
Toad for Cloud (for Query Authoring)
[Screenshot labels: RDBMS, Hadoop]
Learn more at: http://www.ToadForCloud.com
Karmasphere (for Developers and Analysts)
Tableau (for Advanced Visualization)
Datameer (for Analysts, Spreadsheet UI)
MicroStrategy (for interactive Dashboards)
Talend (for Extract-Transform-Load, aka ETL)
General Advice for Choosing the Right Tool.
• First and foremost, what problem are you trying to solve? And
what is your skill set? Use the tool that gets you there fastest.
• What is the learning curve involved with this new tool?
• Does the tool interoperate with other systems?
• Does the tool leverage your existing investment in Pig/Hive?
• Does the tool lock you in to a proprietary file format?
• Is the tool certified for Cloudera’s Distribution for Hadoop?
Appendix
Hive Agile Data Types
• STRUCTS:
    • SELECT mytable.mycolumn.myfield FROM …
• MAPS (Hashes):
    • SELECT mytable.mycolumn[mykey] FROM …
• ARRAYS:
    • SELECT mytable.mycolumn[5] FROM …
• JSON:
    • SELECT get_json_object(mycolumn, objpath) FROM …
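For readers more familiar with general-purpose languages, the four access patterns above correspond roughly to the following Python operations. The sample `row`, its field names, and the helper `get_path` are illustrative assumptions; `get_path` handles only simple `$.a.b` style paths and is not a reimplementation of Hive's `get_json_object`.

```python
import json

# Hypothetical row with one column of each kind.
row = {
    "mystruct": {"myfield": "hello"},          # STRUCT: access by field name
    "mymap": {"mykey": 42},                    # MAP: access by key
    "myarray": [10, 20, 30, 40, 50, 60],       # ARRAY: access by zero-based index
    "myjson": '{"store": {"name": "books"}}',  # JSON stored as a plain string
}


def get_path(json_text, path):
    """Resolve a simplified '$.a.b' path; return None if any step is missing."""
    node = json.loads(json_text)
    for key in path.lstrip("$").strip(".").split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


struct_value = row["mystruct"]["myfield"]            # mycolumn.myfield
map_value = row["mymap"]["mykey"]                    # mycolumn[mykey]
array_value = row["myarray"][5]                      # mycolumn[5]
json_value = get_path(row["myjson"], "$.store.name")
```

The main difference is the JSON case: the column is an ordinary string that is parsed at query time, whereas structs, maps, and arrays are typed in the table schema.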
