Big Data - Part I

 What is Big Data
› Top Down Approach to the Topic of Big Data
› Data Science and Data Scientists
› Days of Data Past – Part I
› Days of Data Past – Part II
› Days of Data Past – Part III
› Three V’s
› “Data Lake” Architecture
 Who can use Big Data
› Individual Experience vs Collective Experience
› Business Cases
› Use Cases
 How to Use Big Data
› Coming of Hadoop
› Evolution of Hadoop
› “Other than” HDFS

 Data Science and Data Scientists
› Science of mining, extracting, analyzing,
modeling, and visualizing large data sets from
multiple sources
› “Data analyst, data artist”
› Knowledge of math, statistics, predictive
modeling, pattern recognition and learning, data
visualization, data warehousing, etc.
› From C.F. Jeff Wu to William S. Cleveland to
“Data Science Journal” and beyond

 Days of Data Past - Part I
› Relational databases and their impact
› Write-first schema (schema-on-write)
› ACID compliant
› Row-store technology
› Relationally structured data for smaller data sets
› Relatively cheaper products
 SQL Server, Oracle, etc
› Highly available skill-set
› SQL language
 Data manipulation – Insert, Select, Update and Delete
 Data definition – Create, Alter, Truncate and Drop
› Influenced LINQ (in .NET), JPQL (in Java), etc. in application
programming
› Enterprise ready
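
The data manipulation and data definition statements listed above can be tried with a minimal sketch such as the following. It uses Python's built-in sqlite3 module with an in-memory database purely for illustration. The `customer` table and its columns are hypothetical, and SQLite has no TRUNCATE statement, so that one is left out.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Data definition: CREATE and ALTER (the schema exists before any write).
cur.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
cur.execute("ALTER TABLE customer ADD COLUMN city TEXT")

# Data manipulation: INSERT, UPDATE, DELETE, SELECT.
cur.executemany(
    "INSERT INTO customer (id, name, city) VALUES (?, ?, ?)",
    [(1, "Ann", "Oslo"), (2, "Bob", "Lima"), (3, "Cy", "Pune")],
)
cur.execute("UPDATE customer SET city = ? WHERE id = ?", ("Rome", 2))
cur.execute("DELETE FROM customer WHERE id = ?", (3,))
conn.commit()

rows = cur.execute("SELECT id, name, city FROM customer ORDER BY id").fetchall()

# Write-first schema: a row that violates the schema is rejected at write time.
try:
    cur.execute("INSERT INTO customer (id, name) VALUES (?, ?)", (4, None))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Data definition: DROP removes the table itself.
cur.execute("DROP TABLE customer")
remaining_tables = cur.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
conn.close()
```

The rejected insert is the point of a write-first schema: bad data is stopped when it is written, not discovered when it is read.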

 Days of Data Past - Part II
› Enterprise Data Warehouses (EDW) and their impact
› Massively Parallel Processing (MPP) appliances – not all EDWs are
packaged as MPP appliances
› Column-store technology, faster and easier for BI – not all EDWs use
column-store
› Dimensionally structured data for large data sets
› Enterprise storage, not commodity storage
› Expensive premium products
 Teradata, Vertica, SQL Server PDW, etc.
 Some major companies offer commodity hardware for lower-price
customers
 Some major companies offer services in addition to products
› Demanding skill-set
› Enterprise ready
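
The difference between the row-store layout of the previous slide and the column-store layout mentioned here can be illustrated with a small sketch. Plain Python lists and dicts stand in for storage pages. The `sales` records and field names are hypothetical, and real engines add compression, indexing, and on-disk layout that this ignores.

```python
# Hypothetical sales records, used only for illustration.
sales = [
    {"region": "EU", "product": "A", "amount": 120},
    {"region": "US", "product": "B", "amount": 80},
    {"region": "EU", "product": "B", "amount": 50},
]

# Row store: each record is kept together, which suits reading or
# writing one whole record at a time (typical transactional work).
row_store = [(r["region"], r["product"], r["amount"]) for r in sales]

# Column store: each attribute is kept together, which suits scanning
# one attribute across many records (typical BI aggregation).
column_store = {
    "region": [r["region"] for r in sales],
    "product": [r["product"] for r in sales],
    "amount": [r["amount"] for r in sales],
}

# Total amount, row-store style: every full record is touched.
total_from_rows = sum(row[2] for row in row_store)

# Total amount, column-store style: only the "amount" column is touched.
total_from_columns = sum(column_store["amount"])

# Number of values touched by each approach, as a rough proxy for I/O.
values_read_rows = sum(len(row) for row in row_store)
values_read_columns = len(column_store["amount"])
```

Both layouts give the same total, but the column layout reads a third of the values here, which is why it tends to favor BI-style aggregates.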

 Days of Data Past - Part III
› NoSQL data stores and their impact
› Not relational and not ACID compliant
› Four types
 Key-value stores (KV)
 Document stores
 Graph database stores
 Wide column stores
› Relatively cheaper products
› Commodity storage, not enterprise storage
› Demanding and scarce skill-set
› Not enterprise ready
› NewSQL data stores as an alternative to NoSQL
 Relational and ACID compliant
 SQL driven, so that existing SQL investments stay intact
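
The four NoSQL store types listed above differ mainly in how a record is shaped and looked up. A rough sketch with plain Python structures follows. All keys, names, and the `followed_by` helper are hypothetical, and no particular product's API is implied.

```python
# Key-value store: an opaque value looked up by a single key.
kv_store = {"session:42": b"opaque-bytes"}

# Document store: self-describing nested documents, looked up by id or field.
doc_store = {
    "user:1": {"name": "Ann", "tags": ["admin"], "address": {"city": "Oslo"}},
}

# Graph store: nodes plus labeled edges; queries traverse relationships.
nodes = {"ann", "bob", "cy"}
edges = [("ann", "FOLLOWS", "bob"), ("bob", "FOLLOWS", "cy")]


def followed_by(person):
    """Return everyone `person` follows, by walking outgoing edges."""
    return sorted(
        dst for src, label, dst in edges if src == person and label == "FOLLOWS"
    )


# Wide-column store: row key -> column family -> sparse columns.
wide_store = {
    "row1": {"profile": {"name": "Ann"}, "metrics": {"logins": 3}},
    "row2": {"profile": {"name": "Bob"}},  # no "metrics" family stored at all
}

ann_city = doc_store["user:1"]["address"]["city"]
ann_follows = followed_by("ann")
row2_families = sorted(wide_store["row2"])
```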

 Three V’s
› Volume
 Large volumes, currently 100 TB or more
 Volumes above this benchmark expected in future
› Velocity
 How quickly data accumulates
 How quickly you can make sense of your data
 Batch, near-time, real-time
 Batch vs Interactive
› Variety
 Various data sources
 Structured data – relational, ERP, CRM
 Semi-structured data – click streams, weblogs, geographical,
social
 Unstructured data – sensor, textual, machine generated

 “Data Lake” Architecture
› Modern Data Architecture
 Provides a shared service for broad insight across a
large, diverse data set at efficient scale, according to
Hortonworks
 A unified data architecture integrated into
end-to-end enterprise solutions, according to Teradata
› Caters to 3V-driven big data opportunities
› Raw data of unrecognized value
› Read-first schema (schema-on-read)
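
The read-first schema idea can be sketched as follows: raw records are stored untouched, and a schema is applied only when a question is asked. The sample lines, field names, and the `read_with_schema` helper are hypothetical illustrations, not part of any Hadoop API.

```python
import json

# Raw data lands in the lake as-is; no schema is enforced at write time.
raw_lines = [
    '{"user": "ann", "amount": "12.50", "ts": "2015-01-01"}',
    '{"user": "bob", "amount": 7}',
    "not even json",
    '{"user": "cy", "clicks": 3}',
]


def read_with_schema(lines, schema):
    """Apply `schema` (field name -> converter) at read time.

    Records that cannot be parsed or lack a required field are skipped
    here, at read time, rather than rejected when they were stored.
    """
    for line in lines:
        try:
            record = json.loads(line)
            row = {field: convert(record[field]) for field, convert in schema.items()}
        except (ValueError, KeyError, TypeError):
            continue
        yield row


# Two different questions impose two different schemas on the same raw data.
spend = list(read_with_schema(raw_lines, {"user": str, "amount": float}))
clicks = list(read_with_schema(raw_lines, {"user": str, "clicks": int}))
```

The same stored lines answer both questions, which is what lets a data lake keep raw data whose value is not yet recognized.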

 Individual Experience vs Collective Experience
› Need to treat customers as individuals instead of as a mass collective
› Predictive modeling to recommend each individual’s best
“intent”
› Implementing Process Communication Models (PCM) to
give better individualized customer service
 Listening to a particular song by a particular artist on a mobile device
 Calling a call center
› Privacy concerns – the main obstacle in the current big data trend

 Business Cases
› Medical or Healthcare
› Entertainment
› Forensics
› Financial
› Retail

 Use Cases
› Medical or Healthcare
 Find a cure for a disease based on an individual’s medical history,
behavior patterns, food and drug consumption, and similar
patients’ data
› Entertainment
 Provide a recommendation engine for IMDb or Netflix based on an
individual’s viewing patterns
› Forensics
 Catch a serial killer using historical murder data, as in CSI.
Similarly, prevent further incidents that follow a similar pattern
› Financial
 Provide a predictive financial model for Wall Street stock market
fluctuations based on historical shareholder patterns
› Retail

 Coming of Hadoop
› GFS and Google’s MapReduce engine, and
Google’s publication of white papers on them
› Yahoo team that was first to decode the white papers and
create HDFS and an MR engine to scale out Yahoo
search
› Creation of Hadoop (Generation 1, later versioned 1.x) in 2006 and
commitment to production-level Hadoop by Yahoo
› Spawning of the Hortonworks company in 2011 from a
set of Yahoo employees, and the move towards
enterprise hardening
› Spawning of multiple Hadoop distros as products
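
The MapReduce model behind the engines mentioned above can be sketched in a single process as map, shuffle, and reduce phases. This is an illustrative word count in plain Python, not Hadoop's API. The function names and sample documents are hypothetical, and a real cluster would run the map and reduce phases in parallel across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input split."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


documents = ["big data is big", "data lake holds raw data"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
grouped = shuffle(mapped)
word_counts = dict(reduce_phase(key, values) for key, values in grouped.items())
```

Because each map call sees only its own split and each reduce call sees only one key, both phases can be spread over commodity machines without coordination beyond the shuffle.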

 Evolution of Hadoop
› Hadoop 1.x (Generation 1)
 Data Management – HDFS for redundant storage of data from various sources and MapReduce
to process the data
 Data Access Layer (batch, near-time, real-time) – to access data simultaneously in multiple
ways
› Hadoop 2.x (Generation 2)
 Introducing YARN for the Data Management layer
 Governance and Integration at enterprise level – data loading, executing data policies, data
management – introducing Apache Falcon
 Security – layered, secure authentication and authorization – Apache Knox
 Operations – deploy, monitor and manage the platform as a whole – introducing Apache Ambari
› Enterprise Hadoop
 Deployment choice – physical, virtual, or cloud; Windows or Linux; distro product from
Hortonworks, Cloudera, or others
 Presentation and Applications – Enable existing and new applications to generate value from
Hadoop
 Enterprise management and security – empower existing proven enterprise tools to integrate
with Hadoop
 Services or product choice – YARN-enabled, always-on, long-running services with Apache
Slider
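
The redundant storage that HDFS provides (see Hadoop 1.x above) rests on splitting files into blocks and keeping several copies of each block on different nodes. A minimal sketch follows. The block size and replication factor are plain parameters chosen for illustration, the node names are hypothetical, and the round-robin placement is a deliberate simplification that does not reproduce HDFS's actual rack-aware policy.

```python
def split_into_blocks(file_size, block_size):
    """Return the size in bytes of each block for a file of `file_size` bytes."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes (simplified round-robin)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


def readable_after_failure(placement, failed_node):
    """True if every block still has at least one replica on a surviving node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


MB = 1024 * 1024
blocks = split_into_blocks(300 * MB, 128 * MB)  # assumed sizes, for illustration
placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"], replication=3)
survives = all(readable_after_failure(placement, n) for n in ["n1", "n2", "n3", "n4"])
```

With three copies of every block on distinct nodes, the loss of any single node leaves the whole file readable, which is what makes commodity hardware acceptable.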

 Hadoop 2.7 Stack (Hortonworks view)

 “Other than” Hadoop, HDFS
› HDFS-like storage systems with similar
MapReduce engines
› MapR (uses an NFS)
 Has cloud support too
› EMC, NetApp, Cleversafe, Symantec
› IBM’s BigInsights (a kind of distro of Cloudera, which
is in turn a distro of Hadoop)
› SAP’s HANA suite
› And, of course, the proprietary GFS on which HDFS was
originally based
