By Thanuja Seneviratne
 What is Big Data
› Top Down Approach to the Topic of Big Data
› Data Science and Data Scientists
› Days of Data Past – Part I
› Days of Data Past – Part II
› Days of Data Past – Part III
› Three V’s
› “Data Lake” Architecture
 Who can use Big Data
› Individual Experience vs Collective Experience
› Business Cases
› Use Cases
 How to Use Big Data
› Coming of Hadoop
› Evolution of Hadoop
› “Other than” HDFS
 Data Science and Data Scientists
› Science of mining, extracting, analyzing,
modeling, visualizing large data sets from
multiple sources
› “Data analyst, data artist”
› Knowledge of math, statistics, predictive
modeling, pattern recognition and learning, data
visualization, data warehousing, etc
› From C.F. Jeff Wu to William S. Cleveland to
“Data Science Journal” and beyond
 Days of Data Past - Part I
› Relational databases and their impact
› Write-first schema (schema-on-write): structure is defined before data is loaded
› ACID compliant
› Row-store technology
› Relationally structured data for smaller data sets
› Relatively cheaper products
 SQL Server, Oracle, etc
› Highly available skill-set
› SQL languages
 Data manipulation – Insert, Select, Update and Delete
 Data definition – Create, Alter, Truncate and Drop
› Influenced LINQ (in .NET) and JPQL (in Java), among others, in application
programming
› Enterprise ready
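
A minimal sketch of the points on this slide, using Python's built-in `sqlite3` module as a stand-in for a server product such as SQL Server or Oracle. The `account` table, its columns and the sample rows are made up for illustration. The schema is declared before any row is written (write-first), and the failed transfer shows atomicity: either both updates apply or neither does.

```python
import sqlite3


def demo_relational():
    """Run DDL, DML and one failing transaction against an in-memory database."""
    conn = sqlite3.connect(":memory:")

    # Data definition: the schema exists before any data is written.
    conn.execute(
        "CREATE TABLE account ("
        " id INTEGER PRIMARY KEY,"
        " owner TEXT NOT NULL,"
        " balance INTEGER NOT NULL CHECK (balance >= 0))"
    )

    # Data manipulation: insert two hypothetical rows.
    conn.executemany(
        "INSERT INTO account (id, owner, balance) VALUES (?, ?, ?)",
        [(1, "alice", 100), (2, "bob", 50)],
    )
    conn.commit()

    # Atomicity: the second UPDATE violates the CHECK constraint, so the
    # whole transaction (including the first UPDATE) is rolled back.
    try:
        with conn:  # commits on success, rolls back on an exception
            conn.execute("UPDATE account SET balance = balance + 500 WHERE id = 2")
            conn.execute("UPDATE account SET balance = balance - 500 WHERE id = 1")
    except sqlite3.IntegrityError:
        pass

    rows = conn.execute("SELECT owner, balance FROM account ORDER BY id").fetchall()
    conn.close()
    return rows
```

Calling `demo_relational()` returns the balances as they were before the failed transfer, since the partial credit to the second account is undone together with the rejected debit.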
 Days of Data Past - Part II
› Enterprise Data Warehouses (EDW) and their impact
› Massively Parallel Processing (MPP) appliances – not all EDWs are
packaged as MPP appliances
› Column-store technology, faster and easier for BI – not all EDWs use
column-store
› Dimensionally structured data for large data sets
› Enterprise storage, not commodity storage
› Expensive premium products
 Teradata, Vertica, SQL Server PDW, etc.
 Some major vendors offer commodity hardware for lower-budget
customers
 Some major vendors offer services in addition to products
› Demanding skill-set
› Enterprise ready
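
To illustrate why column-store layouts tend to suit BI-style aggregation, here is a small sketch in plain Python. It is not how any of the products above are implemented; the `rows` sample data and the function names are assumptions made for the example. The same table is held once as a list of records (row-store) and once as one list per column (column-store).

```python
# Hypothetical sales records, stored row by row.
rows = [
    {"region": "EU", "product": "A", "amount": 120},
    {"region": "US", "product": "B", "amount": 80},
    {"region": "EU", "product": "B", "amount": 200},
]


def to_columns(records):
    """Pivot a row-oriented table into a column-oriented one."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_row_store(records, field):
    """Aggregate by visiting every full record and picking out one field."""
    return sum(record[field] for record in records)


def total_column_store(columns, field):
    """Aggregate by reading only the one column that is needed."""
    return sum(columns[field])


columns = to_columns(rows)
```

Both totals agree; the difference is how much data is touched. The row-store version walks every record, while the column-store version reads a single list, which is the access pattern typical analytic queries favor.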
 Days of Data Past - Part III
› NoSQL data stores and their impact
› Not relational and typically not fully ACID compliant
› Four types
 Key-value stores (KV)
 Document stores
 Graph database stores
 Wide column stores
› Relatively cheaper products
› Commodity storage, not enterprise storage
› Demanding and scarce skill-set
› Not enterprise ready
› NewSQL data stores as an alternative to NoSQL
 Relational and ACID compliant
 SQL driven, so that existing SQL investments stay intact
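
The four NoSQL store types listed on this slide differ mainly in the shape of the data they hold. The sketch below models each shape with plain Python structures; it does not use any real NoSQL product, and every key, name and value in it is invented for illustration.

```python
# Key-value store: an opaque value looked up by its key.
kv_store = {"session:42": b"opaque-bytes"}

# Document store: self-describing, possibly nested documents.
document_store = {
    "user:1": {"name": "Ann", "tags": ["admin"], "address": {"city": "Springfield"}},
}

# Graph store: nodes plus typed edges between them.
graph_store = {
    "nodes": {"ann", "raj", "lee"},
    "edges": [("ann", "FOLLOWS", "raj"), ("lee", "FOLLOWS", "ann")],
}

# Wide column store: row key -> column family -> column -> value,
# where each row may carry a different set of columns.
wide_column_store = {
    "user:1": {"profile": {"name": "Ann"}, "activity": {"day-1": "login"}},
    "user:2": {"profile": {"name": "Raj", "email": "raj@example.com"}},
}


def followers_of(graph, person):
    """Return everyone with a FOLLOWS edge pointing at `person`."""
    return sorted(
        source
        for source, relation, target in graph["edges"]
        if relation == "FOLLOWS" and target == person
    )
```

A relationship query such as `followers_of(graph_store, "raj")` is a direct edge lookup in the graph shape, whereas the key-value shape can only answer "what is stored under this key".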
 Three V’s
› Volume
 Large volumes: currently 100 TB or more
 This benchmark is expected to rise in future
› Velocity
 How quickly data accumulates
 How quickly your data makes sense
 Batch, near-time, real-time
 Batch vs Interactive
› Variety
 Various data sources
 Structured data – relational, ERP, CRM
 Semi-structured data – click streams, weblogs, geographical,
social
 Unstructured data – sensor, textual, machine generated
 “Data Lake” Architecture
› Modern Data Architecture
 Provides a shared service for broad insight across a
large, diverse data set at efficient scale, according to
Hortonworks
 A unified data architecture that integrates with
enterprise end-to-end solutions, according to Teradata
› Caters to 3V-driven big data opportunities
› Raw data whose value is not yet recognized
› Read-first schema (schema-on-read): structure is applied when data is read
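
A read-first schema can be sketched as follows: raw records are kept exactly as they arrived, and each consumer applies its own schema only when reading. The `RAW_EVENTS` sample lines and the `read_with_schema` helper are hypothetical and stand in for files in a data lake; this is an illustration under those assumptions, not a description of any vendor's implementation.

```python
import json

# Raw, unvalidated lines as they might land in a data lake.
RAW_EVENTS = [
    '{"user": "u1", "action": "play", "song": "s9", "ms": 183000}',
    '{"user": "u2", "action": "call", "queue": "billing"}',
    "not even json",
]


def read_with_schema(raw_lines, schema):
    """Yield records shaped by `schema` (field name -> type); skip lines that do not fit."""
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unreadable today, but still kept in the raw store
        if not isinstance(record, dict):
            continue
        try:
            yield {field: cast(record[field]) for field, cast in schema.items()}
        except (KeyError, TypeError, ValueError):
            continue  # record lacks a field this particular reader needs


plays = list(read_with_schema(RAW_EVENTS, {"user": str, "song": str, "ms": int}))
calls = list(read_with_schema(RAW_EVENTS, {"user": str, "queue": str}))
```

The same raw lines serve two different readers, each with its own schema, and nothing had to be decided about structure at write time. That is the trade-off against the write-first approach of relational databases, where ill-fitting records are rejected on the way in.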
 Individual Experience vs Collective Experience
› Need to treat people as individuals instead of as a mass collective
› Predictive modeling to recommend each individual’s best
“intent”
› Implementing Process Communication Models (PCM) to
give better individualized customer service
 Listening to a particular song by a particular artist via mobile
 Calling a call center
› Privacy concerns – the main obstacle in the current big data trend
 Business Cases
› Medical or Healthcare
› Entertainment
› Forensics
› Financial
› Retail
 Use Cases
› Medical or Healthcare
 Find a cure for a disease based on an individual’s medical history,
behavior patterns, food and drug consumption, and similar
patients’ data
› Entertainment
 Provide a recommendation engine for IMDb or Netflix based on an
individual’s viewing patterns
› Forensics
 Identify a serial killer from historical murder data, CSI-style.
Similarly, prevent further incidents that fit the same pattern
› Financial
 Provide a predictive financial model for Wall Street stock market
fluctuations based on historical shareholder patterns
› Retail
 Coming of Hadoop
› GFS and Google’s MapReduce engine, and Google’s
publication of white papers describing them
› A Yahoo team was among the first to implement the white papers,
creating HDFS and an MR engine to scale out Yahoo
search
› Creation of Hadoop (Generation 1) in 2006 and
commitment by Yahoo to production-level Hadoop
› Spawning of the Hortonworks company in 2011 from a
set of Yahoo employees, and the move towards
enterprise hardening
› Spawning of multiple Hadoop distros as products
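
The MapReduce model mentioned on this slide can be sketched in a few lines of single-process Python. This is only an illustration of the programming model (map, shuffle, reduce), not Hadoop's API; the function names are assumptions, and a real engine would spread these phases across many machines reading from a distributed file system.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence over a list of documents."""
    pairs = (pair for document in documents for pair in map_phase(document))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Because each `map_phase` call sees only one document and each `reduce_phase` call sees only one key, both phases can in principle run in parallel on separate nodes, which is what makes the model suitable for scaling out.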
 Evolution of Hadoop
› Hadoop 1.x (Generation 1)
 Data Management – HDFS for redundant data storage from various sources and MapReduce
to process the data
 Data Access Layer (batch, near-time, real-time) - to access data simultaneously in multiple
ways
› Hadoop 2.x (Generation 2)
 Introducing YARN for the Data Management layer
 Governance and Integration at enterprise level – data loading, executing data policies, data
management – introducing Apache Falcon
 Security – authentication and authorization in a layered and secure way – Apache Knox
 Operations – deploy, monitor and manage the platform as a whole – introducing Apache Ambari
› Enterprise Hadoop
 Deployment choice – physical, virtual or cloud; Windows or Linux; distro product from
Hortonworks, Cloudera or others
 Presentation and Applications – enable existing and new applications to generate value from
Hadoop
 Enterprise management and security – empower existing proven enterprise tools to integrate
with Hadoop
 Services or product choice – YARN-enabling always-on, long-running services with Apache
Slider
 Hadoop 2.7 Stack (Hortonworks view)
 “Other than” Hadoop, HDFS
› HDFS-like storage systems with similar
MapReduce engines
› MapR (exposes its storage over NFS)
 Has cloud support too
› EMC, NetApp, Cleversafe, Symantec
› IBM’s BigInsights (a kind of distro of Cloudera, which
is in turn a distro of Hadoop)
› SAP’s HANA suite
› And, of course, the proprietary GFS on which HDFS was originally
based
Big Data - Part I

Editor's Notes

  • #4: In November 1997, C.F. Jeff Wu gave the inaugural lecture entitled "Statistics = Data Science?". In 2001, William S. Cleveland introduced data science as an independent discipline, extending the field of statistics to incorporate "advances in computing with data" in his article "Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics". In April 2002, the International Council for Science's Committee on Data for Science and Technology (CODATA) started the Data Science Journal.
  • #5: ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties that guarantee that database transactions are processed reliably.