ANALYTICS ON HADOOP

Donald Miner
Solutions Architect
Advanced Technologies Group




© Copyright 2012 EMC Corporation. All rights reserved.                       1
Large Retailer and Pregnancy

“As Pole’s computers crawled through the data, he was able
to identify about 25 products that, when analyzed together,
allowed him to assign each shopper a “pregnancy prediction”
score. More important, he could also estimate her due date to
within a small window, so they could send coupons timed to
very specific stages of her pregnancy.”



Hadoop Origins
 Open source system based on papers
  published by Google
 Google used MapReduce to parse and
  index web pages and calculate PageRank
 Grew out of the need for a system that is:
        –    Linearly and horizontally scalable
        –    Able to store massive amounts of data
        –    Fault tolerant
        –    Ready to analyze HTML files
        –    Cheap to build and maintain


What is Hadoop?
Two Core Components
        – HDFS: scalable storage in the Hadoop Distributed File System
        – MapReduce: compute via the MapReduce distributed processing platform

 Open source system developed by the Apache
  Software Foundation
 Storage and compute in one framework
 Massively scalable



Why is Hadoop Important?
 Business analytics require new approaches
        – Data size
        – Data growth
 The new nature of data
        – Unstructured
        – Numerous sources
 Hadoop makes analytics on large data sets
  more cost effective




Structured and Unstructured Data
[Diagram: a spectrum running from STRUCTURED to UNSTRUCTURED]
 Greenplum DB: SQL, RDBMS, tables and schemas, partitioning,
  indexing, BI tools, GP MapReduce




Structured and Unstructured Data
[Diagram: a spectrum running from STRUCTURED to UNSTRUCTURED]
 Hadoop: schema on load, no ETL, MapReduce, Java, SequenceFile,
  directories, flat files, XML, JSON, …
 Hive and Pig sit nearer the structured end




Leverage Both in a Unified Platform
[Diagram: Greenplum DB and Hadoop placed together on one spectrum
running from STRUCTURED to UNSTRUCTURED]
 Greenplum DB: SQL, RDBMS, tables and schemas, partitioning,
  indexing, BI tools, GP MapReduce
 Hadoop: schema on load, no ETL, MapReduce, Java, SequenceFile,
  directories, flat files, XML, JSON, …, Hive, Pig




Hadoop Use Case
                                Launching our new product:
                                  The Marshmallow House




Marshmallow House Release Analysis
            Greenplum Party




Website Logs
 15 web servers, 5 application servers
 Problem: cross-correlation
 Problem: 500 TB of data, growing by 1 TB/day
 Problem: extracting insights from text




Current System
 SQL database
        – ETL process to collect and parse logs
        – Analyze transactions on the website
        – Can’t work with the text comfortably
 Perl scripts parsing the logs
        – Don’t scale
        – Hard to correlate across systems
        – Hard to deploy




Augmenting Capabilities with Hadoop
 Hadoop helps us extract value in more ways
 Particular analytics we have in mind:
        –    Interest in product by location
        –    Sessionizing our disparate data
        –    Building behavior models of our customers
        –    Analyzing customers’ sentiment toward our products
 Why? Target Marshmallow House purchasers




Geographical Distribution
 Problem: We don’t know how much interest
  there is in each location
 Value: This will allow us to justify and scope
  additional marketing efforts
 Why Hadoop: Searching through text, parsing
  logs, custom data structures




Geographical Distribution
 Solution: Find the IP addresses that showed interest in
  our product, then count them by location
 MapReduce job:
        – map: extract IP addresses from all data, enrich with
          IP geolocation information
        – reduce: group by geographical location, count the
          number of records
        – output: location, count
 Result: Lots of interest in Virginia




Sample MapReduce Java Code
 A MapReduce job consists of a Mapper, a
  Reducer, and a Driver
 The Mapper parses, filters, transforms,
  enriches, and extracts
 The Reducer aggregates, counts, and outputs
 The Driver sets up and submits the job for
  execution




Mapper Code
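
The code originally shown on this slide is not preserved in the text. As a stand-in, here is a minimal sketch of the map step for the geographical distribution job, written in plain Java without the Hadoop libraries so that it runs on its own. The class name GeoMapperSketch, the log line layout, the "marshmallow-house" URL fragment, and the first-octet lookup table are all assumptions made for illustration. In a real job this logic would sit inside a class extending Hadoop's Mapper and would consult a real IP geolocation source.

```java
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the map step; not the presenter's code.
class GeoMapperSketch {
    // Finds the first token that looks like a dotted-quad IPv4 address.
    private static final Pattern IPV4 =
        Pattern.compile("\\b(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})\\b");

    // Toy lookup keyed by the first octet (assumption: a real job would use
    // an IP geolocation database instead of this table).
    private static final Map<String, String> GEO_BY_FIRST_OCTET =
        Map.of("10", "Virginia", "20", "Maryland");

    // map: parse one log line, filter on the product, enrich with a location,
    // and emit (location, 1). Returns empty when the line is filtered out.
    static Optional<Map.Entry<String, Integer>> map(String logLine) {
        if (!logLine.contains("marshmallow-house")) {
            return Optional.empty();          // filter: not about our product
        }
        Matcher m = IPV4.matcher(logLine);
        if (!m.find()) {
            return Optional.empty();          // parse: no IP address found
        }
        String location = GEO_BY_FIRST_OCTET.getOrDefault(m.group(1), "Unknown");
        Map.Entry<String, Integer> pair = Map.entry(location, 1);
        return Optional.of(pair);
    }
}
```

Emitting a count of 1 per matching record keeps the mapper stateless, which is what lets many mappers run in parallel over separate slices of the logs.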




Reducer Code
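
The code originally shown on this slide is not preserved in the text. As a stand-in, here is a minimal sketch of the reduce step for the geographical distribution job, in plain Java without the Hadoop libraries. The class name GeoReducerSketch and the method signature are assumptions for illustration. In a real job this logic would sit inside a class extending Hadoop's Reducer, which is called once per key with all values emitted for that key.

```java
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the reduce step; not the presenter's code.
class GeoReducerSketch {
    // reduce: receives every count emitted for one location and
    // outputs a single (location, total) pair.
    static Map.Entry<String, Integer> reduce(String location, List<Integer> counts) {
        int total = 0;
        for (int count : counts) {
            total += count;                   // aggregate
        }
        return Map.entry(location, total);    // output: location, count
    }
}
```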




Driver Code
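
The code originally shown on this slide is not preserved in the text. In Hadoop itself, the driver configures a Job (mapper class, reducer class, input and output paths) and submits it to the cluster. The sketch below is a simplified, in-memory imitation of that role in plain Java: it wires a mapper and a reducer together, groups the mapper output by key, and runs the reducer once per key. The class name LocalJobDriverSketch, the run and demo methods, and the toy input lines are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;
import java.util.function.BiFunction;
import java.util.function.Function;

// Hypothetical in-memory driver; a real driver would submit a Hadoop Job.
class LocalJobDriverSketch {
    static <K extends Comparable<K>, V, R> Map<K, R> run(
            List<String> input,
            Function<String, Optional<Map.Entry<K, V>>> mapper,
            BiFunction<K, List<V>, R> reducer) {
        // Map phase plus shuffle: run the mapper on every record and
        // group the emitted values by key.
        Map<K, List<V>> grouped = new TreeMap<>();
        for (String record : input) {
            Optional<Map.Entry<K, V>> emitted = mapper.apply(record);
            if (emitted.isPresent()) {
                grouped.computeIfAbsent(emitted.get().getKey(), k -> new ArrayList<>())
                       .add(emitted.get().getValue());
            }
        }
        // Reduce phase: one reducer call per key.
        Map<K, R> output = new TreeMap<>();
        for (Map.Entry<K, List<V>> group : grouped.entrySet()) {
            output.put(group.getKey(), reducer.apply(group.getKey(), group.getValue()));
        }
        return output;
    }

    // Toy job: lines such as "VA hit" count as one record of interest per state.
    static Map<String, Integer> demo() {
        List<String> input = List.of("VA hit", "MD hit", "VA hit", "noise");
        Function<String, Optional<Map.Entry<String, Integer>>> mapper = line -> {
            if (!line.endsWith(" hit")) {
                return Optional.empty();
            }
            Map.Entry<String, Integer> pair = Map.entry(line.substring(0, 2), 1);
            return Optional.of(pair);
        };
        BiFunction<String, List<Integer>, Integer> reducer = (location, ones) -> ones.size();
        return run(input, mapper, reducer);
    }

    public static void main(String[] args) {
        System.out.println(demo());           // prints {MD=1, VA=2}
    }
}
```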




Sessionizing
 Problem: Data is scattered
 Value: Analyze a user’s experience at the session
  level, which shows the bigger picture
 Why Hadoop: Hadoop deals well with heterogeneous
  and hierarchical data




Sessionizing
 Solution: Load the data sets and group by IP and
  temporal locality, then output as a hierarchical data
  structure
 MapReduce job:
        – map: extract IP and date/time, keep the record
        – reduce: group by IP, then group into sessions; format
          into JSON documents and output
 Result: 1 million sessions a day
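
The sessionizing job described above can be sketched in plain Java as follows. This is a minimal in-memory illustration, not the presenter's code: the class name SessionizerSketch, the Hit fields, the 30-minute inactivity gap that separates sessions, and the JSON layout are all assumptions, and the JSON writer does no string escaping.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: group hits by IP, then split each group into sessions.
class SessionizerSketch {
    // One parsed log record: who, when (epoch seconds), and what was requested.
    static final class Hit {
        final String ip;
        final long epochSeconds;
        final String path;
        Hit(String ip, long epochSeconds, String path) {
            this.ip = ip;
            this.epochSeconds = epochSeconds;
            this.path = path;
        }
    }

    // Assumption: more than 30 minutes of inactivity starts a new session.
    static final long SESSION_GAP_SECONDS = 30 * 60;

    static Map<String, List<List<Hit>>> sessionize(List<Hit> hits) {
        // "map" side: key every record by IP address.
        Map<String, List<Hit>> byIp = new TreeMap<>();
        for (Hit h : hits) {
            byIp.computeIfAbsent(h.ip, k -> new ArrayList<>()).add(h);
        }
        // "reduce" side: per IP, sort by time and cut wherever the gap is too long.
        Map<String, List<List<Hit>>> sessionsByIp = new TreeMap<>();
        for (Map.Entry<String, List<Hit>> e : byIp.entrySet()) {
            List<Hit> sorted = new ArrayList<>(e.getValue());
            sorted.sort(Comparator.comparingLong((Hit h) -> h.epochSeconds));
            List<List<Hit>> sessions = new ArrayList<>();
            Hit previous = null;
            for (Hit h : sorted) {
                if (previous == null
                        || h.epochSeconds - previous.epochSeconds > SESSION_GAP_SECONDS) {
                    sessions.add(new ArrayList<>());
                }
                sessions.get(sessions.size() - 1).add(h);
                previous = h;
            }
            sessionsByIp.put(e.getKey(), sessions);
        }
        return sessionsByIp;
    }

    // Formats one non-empty session as a small JSON document (no escaping).
    static String toJson(String ip, List<Hit> session) {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"ip\":\"").append(ip).append("\",\"start\":")
          .append(session.get(0).epochSeconds).append(",\"paths\":[");
        for (int i = 0; i < session.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append('"').append(session.get(i).path).append('"');
        }
        return sb.append("]}").toString();
    }
}
```

Keying by IP address means all records for one visitor reach the same reducer call, so the time-ordered split into sessions needs no coordination between reducers.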




Unstructured and Semi-Structured Data
 Unnatural to store in an RDBMS
 Unstructured: text, documents, media,
  raw sensor data
 Semi-structured: mixed structured/unstructured;
  hierarchical
 Hadoop’s ability to leverage Java gives flexibility
 “Schema on load”
 Data stored as “rich documents”
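
As a small illustration of “schema on load”, assuming the phrase means that data is stored raw and structure is applied only when a job reads it, the plain Java sketch below interprets the same stored line under two different schemas. The class name SchemaOnLoadSketch, the load method, and the space-delimited line layout are assumptions for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the raw line is stored untouched; the reader picks the schema.
class SchemaOnLoadSketch {
    // Applies an ordered list of field names to one space-delimited raw line.
    // The last field absorbs whatever remains of the line.
    static Map<String, String> load(String rawLine, List<String> fieldNames) {
        String[] parts = rawLine.split(" ", fieldNames.size());
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < fieldNames.size(); i++) {
            row.put(fieldNames.get(i), i < parts.length ? parts[i] : null);
        }
        return row;
    }
}
```

Calling load on the line "10.0.0.1 GET /products/marshmallow-house" with the fields ip, method, path yields three columns, while the fields ip, request yield two, with no change to the stored data.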




Behavioral Model
 Problem: We don’t understand the typical ways our
  visitors behave
 Value: Optimize our interface for usability;
  understand our customers
 Why Hadoop: Advanced analytics and machine
  learning are possible because of the flexibility of the
  framework




Behavioral Model
 Solution: Process the sessions and build a general
  model from them
 MapReduce job: Use clustering to group users into
  stereotypes, then use frequent itemset analysis to find
  correlations between our users’ actions
 Results: We have three major types of buyers; casual
  buyers usually visit the Marshmallow House from the
  main page
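
To make the frequent itemset step more concrete, here is a minimal plain Java sketch that counts how often pairs of actions occur together in a session and keeps the pairs that meet a minimum support. Pair counting is only the simplest case of frequent itemset analysis; the sketch leaves out the clustering step and is not Mahout's implementation. The class name FrequentPairsSketch, the action labels, and the minSupport parameter are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: frequent pairs of actions across sessions.
class FrequentPairsSketch {
    // Counts the number of sessions in which each unordered pair of actions
    // appears, then drops pairs seen in fewer than minSupport sessions.
    static Map<String, Integer> frequentPairs(List<Set<String>> sessions, int minSupport) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Set<String> session : sessions) {
            // Sort the actions so each pair always gets the same label.
            List<String> actions = new ArrayList<>(new TreeSet<>(session));
            for (int i = 0; i < actions.size(); i++) {
                for (int j = i + 1; j < actions.size(); j++) {
                    counts.merge(actions.get(i) + " + " + actions.get(j), 1, Integer::sum);
                }
            }
        }
        counts.values().removeIf(count -> count < minSupport);
        return counts;
    }
}
```

In a MapReduce setting the inner loops would form the map step (emit each pair with a count of 1) and the summing and thresholding would form the reduce step.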




Apache Mahout
 Machine learning library built on Hadoop
 Scalable machine learning
 Open source project
 Data mining, advanced analytics, predictive
  modeling
 Main use cases: recommendation engines,
  clustering, classification, frequent itemset
  mining



Hadoop Makes These Possible
 Analysis of unstructured data is possible with Java and
  Hadoop
 Advanced data mining and machine learning
  techniques fit naturally
 Data analysis can be done on the data in its
  original form
 Analyze large amounts of heterogeneous
  data



Provide Feedback & Win!


                                                          125 attendees will receive
                                                           $100 iTunes gift cards. To
                                                           enter the raffle, simply
                                                           complete:
                                                            – 5 session surveys
                                                            – The conference survey
                                                          Download the EMC World
                                                           Conference App to learn
                                                           more: emcworld.com/app



Thank You




