Big data
The technology landscape and its applications.




                                                 Natalino Busa - 12 Feb. 2013
Outline


          ● Big Data: Who art thou?
          ● Big Data: The technology landscape

          ● Hadoop: Overview
          ● Analytics & Machine Learning
          ● Opportunities




Hype cycle on new IT technologies

                                    Gartner 2012




What is big data?

        DATA (structured and un-structured, Logs, ETL, social)


            Velocity               Diversity                Volume




                        BIG DATA


           Hardware                Software                Services

      Infrastructure            Marketing (e.g. Unica)    RDBMS
      (Private) Cloud           Analytics (Tableau)       OLAP
      Networking                Modeling (SAS)            Messaging



Big Data Heat map




How big is big?

SkyTree (tm) defines: Analytics Requirements Index (ARI)

                                 ARI = (# Rows × # Columns) / Time (secs)


Where          # Rows =                   Number of records being analyzed

               # Columns =                Number of variables captured in each record

               Time (secs) =              The timeframe within which to complete the analysis




 Example: For each view (1000 views/sec) produce a personalized banner
 I need to analyze 100 variables on 1000 records (historic data) every 1 ms

 ARI = (1000*100)/0.001 = 100 M values/sec
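
The banner example can be checked with a few lines of Python. This is a minimal sketch of the ARI formula as defined on this slide; the function name `analytics_requirements_index` is an assumption made for illustration and is not part of any Skytree tooling.

```python
def analytics_requirements_index(rows: int, columns: int, seconds: float) -> float:
    """Return ARI = (rows * columns) / seconds, in values per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return rows * columns / seconds


# The banner example: 1000 historic records, 100 variables, a 1 ms budget.
ari = analytics_requirements_index(rows=1000, columns=100, seconds=0.001)
print(f"ARI = {ari / 1e6:.0f} M values/sec")
```

Halving the time budget doubles the index, which is why response time sits next to row and column counts as a measure of "big".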




What data?

Big Data can imply:


           ●   Complex Data refactoring in Batch                  (lots of rows)
           ●   Real-Time Event Processing                         (high-speed responses)
           ●   Multidimensional analysis                          (lots of parameters)

           ●   ... or any combination of those three
[Figure: three axes labeled Response time, Parameters, and Entities]

More data

                                                                           customers +
                                                         customers +       products +
                                  customers +            products +        surveys +
                customers +       products +             surveys +         transactions +
customers       products          surveys                transactions      social messages


Database        Databases         Federated Data         Aggregated Data   Linked Data            Just Data


Structured                                                                                   Unstructured



   ●    In today's IT environments there is a gradual shift
        from structured data to unstructured data.

 RDBMSs are well suited to structured data ->
   but: ETL grows in volume and complexity, and how do we deal with new data (structures)?

 Map-Reduce and NoSQL systems are good with unstructured data ->
   but: how do we query and analyze this data?



Big Data: how to deal with it



        ●   Big Data at rest     (storage, access)
        ●   Big Data in motion   (streaming, dataflows)


        ●   Big Data analytics   (OLAP, OTAP, BI)
        ●   Big Data modeling    (predictive, machine learning)




Big Data at rest

Analytical RDBMSs                (EDW) Oracle, IBM, and various MPPs

Hadoop Distributed Systems       HDFS (distributed file system)
                                 HBase (modeled on Bigtable)




[Diagram labels: Batch, Real-time, Cassandra, HBase, Analytics, Logs, HDFS, EDW]




  ●   Traditional EDW and distributed Big Data / NoSQL solutions are
      complementary to each other.
  ●   These systems do not exclude each other and can coexist to form a
      full enterprise-level solution.


Big Data at rest

No need to get everything out of the Hadoop ecosystem:

NoSQL DBMSs:            Couchbase ( ++ reads, caching)
                         Cassandra ( ++ writes, OLAP)

... hybrid solutions are also possible:

HDFS + Cassandra : in-memory analytics + large DFS
HDFS + Solr/Lucene: fast text search on a distributed file system




Big Data in motion

Stream processing // Dataflow architectures

Used to support the automatic analysis of data-in-motion in real-time or near real-time.

- Identify meaningful patterns
- Trigger action to respond to them as quickly as possible.



                                                       - Storm (from Twitter)
                                                         dataflow processing framework
                                                         ++ multi-language

                                                       - Akka (from Typesafe)
                                                         dataflow actor framework
                                                         ++ speed


                                                       Both are:
                                                       Distributed, fault-tolerant, streaming
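
The "identify a pattern, then trigger an action" loop can be sketched on a single machine in plain Python. This is an illustrative sketch only: it does not use Storm or Akka, and the names `detect_bursts`, `window`, and `threshold` are assumptions made for the example. The pattern here is "the same key seen at least `threshold` times within `window` seconds".

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, Iterator, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, key such as a user id)


def detect_bursts(
    events: Iterable[Event], window: float, threshold: int
) -> Iterator[Tuple[float, str, int]]:
    """Yield (timestamp, key, count) whenever a key reaches the threshold
    inside the sliding time window. Events must arrive in time order."""
    recent: Dict[str, Deque[float]] = defaultdict(deque)
    for timestamp, key in events:
        times = recent[key]
        times.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while times and times[0] <= timestamp - window:
            times.popleft()
        if len(times) >= threshold:
            yield timestamp, key, len(times)  # the "trigger action" hook


stream = [(0.0, "user-1"), (0.2, "user-2"), (0.4, "user-1"),
          (0.5, "user-1"), (2.0, "user-1")]
alerts = list(detect_bursts(stream, window=1.0, threshold=3))
print(alerts)
```

Because the generator consumes one event at a time and keeps only the timestamps inside the window, state stays small; a distributed framework adds partitioning by key and fault tolerance on top of the same idea.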



Big Data Landscape

Machine Learning on Big Data

[Diagram. Labels shown:]
  Sources:          Unstructured, Logs, EDW, Unstructured FS
  Data Interfaces:  REST, flume, scribe, sqoop, hiho
  Platform:         HDFS, HBase, Cassandra, MapR, STORM, Mahout, Hive, Pig,
                    Impala, SAS, R over HDFS, OLAP, OTAP
  BI:               ●   Batch Analytics
                    ●   Visualization
                    ●   Monitoring
                    ●   Marketing
                    ●   Real-Time Analytics
                    ●   Streaming

Lambda Architecture




                                    Logic layer
                                                   Software as a Service
                                                   e.g. real-time predictor




from http://www.manning.com/marz/
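
The core idea of the Lambda Architecture is that a query merges a precomputed batch view with a small real-time view covering the data that arrived since the last batch run. The sketch below illustrates that idea with in-memory counters; the class `LambdaCounter` and its method names are assumptions made for illustration and are not taken from the referenced book.

```python
from collections import Counter
from typing import List


class LambdaCounter:
    """Toy event counter with a batch layer, a speed layer, and merged queries."""

    def __init__(self) -> None:
        self.master_log: List[str] = []  # append-only master dataset
        self.batch_view: Counter = Counter()  # recomputed from scratch
        self.realtime_view: Counter = Counter()  # incremental, short-lived

    def ingest(self, key: str) -> None:
        """New data goes to the master dataset and to the speed layer."""
        self.master_log.append(key)
        self.realtime_view[key] += 1

    def run_batch(self) -> None:
        """Recompute the batch view over all data, then reset the speed layer.
        In this toy the batch run is instantaneous, so a full reset is safe."""
        self.batch_view = Counter(self.master_log)
        self.realtime_view.clear()

    def query(self, key: str) -> int:
        """Serve a query by merging the batch view and the real-time view."""
        return self.batch_view[key] + self.realtime_view[key]


counter = LambdaCounter()
for page in ["home", "cart", "home"]:
    counter.ingest(page)
before_batch = counter.query("home")
counter.run_batch()
counter.ingest("home")
after_batch = counter.query("home")
print(before_batch, after_batch)
```

A logic layer such as a real-time predictor would sit on top of `query`, reading merged views without caring which layer produced each part.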
Why do machine learning on big data




    http://www.skytree.net/why-do-machine-learning-on-big-data/



Machine Learning: What?
          SIMILARITY SEARCH
          Similarity search provides a way to find the
          objects that are the most similar, in an overall
          sense, to the object(s) of interest.
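
A small sketch of similarity search, assuming objects are represented as numeric feature vectors and "most similar" means highest cosine similarity. Both choices are assumptions for illustration; the catalog and its feature values are made up.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(
    query: Sequence[float], items: Dict[str, Sequence[float]], k: int = 1
) -> List[Tuple[str, float]]:
    """Return the k items with the highest cosine similarity to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in items.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical product vectors, e.g. (portable, screen, kitchen).
catalog = {
    "laptop": [1.0, 0.9, 0.0],
    "tablet": [0.9, 1.0, 0.1],
    "blender": [0.0, 0.1, 1.0],
}
top = most_similar([1.0, 1.0, 0.0], catalog, k=2)
print([name for name, _ in top])
```

This brute-force scan compares the query against every item; at big-data scale the same ranking is usually approximated with an index rather than computed exhaustively.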


                                         PREDICTIVE ANALYTICS
                                         Predictive analytics is the science of analyzing current and
                                         historical facts/data to make predictions about future events.



             CLUSTERING AND SEGMENTATION
             Cluster analysis and segmentation represents a purely data
             driven approach to grouping similar objects, behaviors, or
             whatever is represented by the data.
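
As one concrete instance of data-driven grouping, here is a minimal k-means sketch on 2-D points. k-means is only one of many clustering methods and is chosen here as an assumption for illustration; the initialisation (first `k` points) is deliberately naive and the customer data is made up.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def squared_distance(a: Point, b: Point) -> float:
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def k_means(
    points: Sequence[Point], k: int, iterations: int = 10
) -> Tuple[List[Point], List[int]]:
    """Return (centroids, labels) after a fixed number of k-means iterations."""
    centroids = list(points[:k])  # naive, deterministic initialisation
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, labels


# Hypothetical customers described by (visits per week, basket size).
customers = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.1), (0.9, 1.1), (9.2, 8.9)]
centroids, labels = k_means(customers, k=2)
print(labels)
```

No labels are supplied up front: the two segments emerge from the distances between points alone, which is what makes the approach purely data driven.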


From http://www.skytree.net/why-do-machine-learning-on-big-data/use-cases/
Word Counting on Map Reduce
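
Word counting is the standard first example for Map-Reduce. The sketch below simulates the three phases (map, shuffle, reduce) in a single Python process; it does not use Hadoop, and the function names are assumptions chosen to mirror the phase names.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce: sum the values collected for one key."""
    return word, sum(counts)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


counts = word_count(["big data at rest", "big data in motion"])
print(counts)
```

On a cluster, the map calls run in parallel on different blocks of the input and the shuffle moves pairs across the network so that each reducer sees all values for its keys.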




Machine learning on Map Reduce




     From http://www.slideshare.net/hadoop/modeling-with-hadoop-kdd2011
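
One common way to fit statistical models on Map-Reduce is to have each mapper compute small "sufficient statistics" over its partition and let the reducer add them up. The sketch below applies that pattern to mean and variance. It is an illustration of the general pattern under that assumption, not a reproduction of the referenced slides, and all names in it are hypothetical.

```python
from functools import reduce
from typing import Sequence, Tuple

Stats = Tuple[int, float, float]  # (count, sum, sum of squares)


def map_partition(values: Sequence[float]) -> Stats:
    """Map: summarise one data partition as additive statistics."""
    return len(values), sum(values), sum(v * v for v in values)


def combine(a: Stats, b: Stats) -> Stats:
    """Reduce: statistics from two partitions simply add up."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]


def mean_and_variance(partitions: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Population mean and variance computed from per-partition summaries."""
    n, total, total_sq = reduce(combine, map(map_partition, partitions), (0, 0.0, 0.0))
    if n == 0:
        raise ValueError("no data")
    mean = total / n
    return mean, total_sq / n - mean * mean


partitions = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
mean, variance = mean_and_variance(partitions)
print(mean, variance)
```

Because `combine` is associative and commutative, partitions can be processed in any order and on any node, and only three numbers per partition cross the network.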




Machine learning on Map Reduce




From http://www.slideshare.net/hadoop/modeling-with-hadoop-kdd2011
Machine Learning: Use Cases

 E-Commerce / E-Tailing
 ● Product Recommendation Engines
 ● Cross Channel Analytics
 ● Events/Activity Behavior Segmentation

 Product Marketing
 ● Campaign management and optimization
 ● Market and consumer segmentations
 ● Pricing Optimization

 Customer Marketing
 ● Customer Churn Management
 ● (Mobile) User Behavior Prediction
 ● Offer Personalization


Big Data: Opportunities

 Unstructured Data
 ● Clustering
 ● Distributed processing
 ● Distributed Storage

 Modeling & Analytics
 ● Distributed Machine Learning
 ● Fast Online Analytics Cubes

 Streaming and Real-Time processing
 ● Build RT profiles
 ● Decision trees and Predictions
 ● Offer Personalization



Thanks


         linkedin:
         www.linkedin.com/in/natalinobusa

         blog:
         www.natalinobusa.com

