Big Data Intelligence
The Beginning
Prof.Ashok.R | +91-9943900101 | ashok@zettab.com
ZettaB.com
Big Data Training in Coimbatore
Ref: Leskovec, Rajaraman, and Ullman, Mining of Massive Datasets
Caution
• The grass is always greener on the other side
Be inspired!
• Stories.. and more stories…
Be informed!
• The devil is in the details
Be challenged!
Hsuan-Tien Lin
The Dream in 1945
• A dream machine of Vannevar Bush (1945)
• An extended supplement to human memory
• A device that stores an individual's library of books, records, and communications
• Microfilms can be searched, copied, and shared
• Useful for storing and sharing information among lawyers, patent attorneys, doctors, and chemists
• The base concept from which the WWW evolved
Leads to Web and Web Scale Data
• Data Volume: doubles every 1.2 years
5 EB – Total data produced in the 5,000 years up to 2003
20 EB – Data collected by Google alone in a single day now
• Data Variety: structured, semi-structured, and unstructured
XML, JSON, doc, pdf, html, email body, .mp4, .jpeg…
• Data Velocity: a lot happens in a minute
72 hours of new video uploaded to YouTube
3 million searches on Google
200 million emails sent
350 thousand tweets | 1 million searches on Twitter
690 thousand shares | 420 GB of data handled on Facebook
20 million photo views on Flickr
Source: Qmee, Wikibon
https://blog.kissmetrics.com/facebook-statistics/
Unit | Size in bytes | Rice analogy
Byte | 1 | one grain of rice
Kilobyte | 2^10 | cup of rice
Megabyte | 2^20 | 8 bags of rice
Gigabyte | 2^30 | 3 trucks of rice
Terabyte | 2^40 | 2 container ships
Petabyte | 2^50 | fills half the area of Tirupur
Exabyte | 2^60 | fills the area of South India
Zettabyte | 2^70 | fills the Indian Ocean twice
Eras: Desktop, Hobbyist, Internet, Big Data (PB/EB/ZB)
Source: What is big data, Slideshare.net
Big Data Intelligence (BDI)
The ability to understand all of us better by connecting the dots
from massive data sets (with TB/PB Volume, streaming
Velocity and Variety in sources) to
predict the future.
WHY DO WE PREDICT
To Survive
Humans have the largest neural-network brain for storing and processing large volumes of data: 100 billion neurons and 2.5 PB of equivalent memory at 100 million MIPS (about 33K i7 cores)
Vision | Touch | Hearing | Smell | Taste
Scientificamerican.com, Storagecraft.com
1250 MB/s | 125 MB/s | 12.5 MB/s | 1.25 MB/s
You only feel 0.7% of
What you sense
The Prediction Power
• 10,000 hours (7-8 years) of rigorous practice is required to become a world-class expert in anything (Daniel Levitin, neuroscientist)
• This builds the ability to predict 2 seconds before others, the “Two Second Advantage”, which wins over competitors
– Wayne Gretzky, the greatest ice-hockey player of all time, was able to predict where the puck was going to be an instant before it arrived
– Sachin Tendulkar
– Warren Buffett
– Viswanathan Anand
Can Machines Think (to Predict)?
Alan Turing asked this question
in 1950 and proposed a test
to validate it.
Turing Test: Can a Machine Imitate the Brain?
• Which is man, and which is woman?
• Which is machine, and which is woman?
Did any machine pass?
No.
Any machine nearer? (near AI)
IBM Watson Computer
Jeopardy! Quiz Contest.
The challenge is to predict the question and bet with reasonable confidence.
“William Wilkinson’s ‘An Account of the Principalities of Wallachia and Moldavia’ inspired this author’s most famous novel.”
Near AI Solutions
• Natural language processing
• Machine learning
• Predictive analytics
• Face recognition
• Language translation
• Speech recognition
Can machine predict?
• Share price in a stock market next day
• Top 5 products consumers want to buy next week
• Price of Tomato(1 Kg) next month
• No. of cars to be sold next quarter
• Potential criminals in the city/ mega event
• When machine/human will become sick
• Best matched course/school to study
• Best matched job/company to work
Google Story: Where it all began
• 50 billion indexed pages
• Thousands/Millions/Billions of pages may match each search
query
• How to rank them in order to display the most relevant
(important) pages at the top.
• Predict what you want to see. Not what you asked.
Do You Know?
4 billion searches happen in a day
Each query uses 1000 nodes
Result returned in 0.2 seconds.
20 billion pages crawled per day
20 Exabytes of data collected in a day
Page Rank
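• Give pages ranks (scores) based on links to them
– Links from many pages → high rank
– Link from a high-rank page → high rank
“Rank” r_j for page j, where the sum runs over pages i that link to j and d_i is the number of out-links of page i:
$r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$
Source: Parallel Programming With Spark, Matei Zaharia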
Matrix-Vector Multiplication
• PageRank equation in a practical form, with the Google Matrix A. For iteration t+1,
$r^{(t+1)} = A \cdot r^{(t)}$
(Rank vector r is an eigenvector of A)
Iteration is repeated until the rank vector converges
(or the maximum number of iterations is reached)
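A minimal sketch of this iteration in Python, on a hypothetical three-page web (the page names and the `pagerank` function are illustrative, not from the slides). It assumes every page has at least one out-link and leaves out the damping (teleport) term that practical implementations add.

```python
def pagerank(links, max_iter=200, tol=1e-10):
    """Power iteration for r_j(t+1) = sum over i->j of r_i(t) / d_i.

    links maps each page to the list of pages it links to.
    Assumption: every page appears as a key and has at least one out-link.
    """
    pages = sorted(links)
    rank = {p: 1.0 / len(pages) for p in pages}   # start from a uniform vector
    for _ in range(max_iter):
        new_rank = {p: 0.0 for p in pages}
        for i, outs in links.items():
            share = rank[i] / len(outs)            # r_i / d_i
            for j in outs:
                new_rank[j] += share
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:                            # rank vector has converged
            break
    return rank


# Hypothetical toy web: y -> {y, a}, a -> {y, m}, m -> {a}
toy_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
ranks = pagerank(toy_links)
print({page: round(score, 3) for page, score in ranks.items()})
```

On this toy graph the ranks settle near 0.4 for y and a and 0.2 for m, which satisfies the rank equation for each page.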
RAM is not Enough
• Won’t be a problem for a small dimension (N x N)
• Consider N = 1 billion (pages that match a query)
• Dimension is now in billions
– A is a billion x billion matrix
– r is a rank vector with a billion entries
– r (old and new) has two billion entries (16 GB with 8-byte double values)
– A has billion x billion entries (8 exabytes)
In an actual implementation, though, methods such as sparse matrix storage reduce the memory needed.
RAM size of a highly configured server node: 128-512 GB
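The arithmetic behind these sizes can be checked with a few lines of Python. The sparse estimate is an illustration only: the figure of 10 out-links per page and the 4-byte index plus 8-byte value layout are assumptions, not numbers from the slides.

```python
N = 10**9                      # pages, as on the slide
BYTES_PER_DOUBLE = 8

rank_vectors_bytes = 2 * N * BYTES_PER_DOUBLE   # r_old and r_new
dense_matrix_bytes = N * N * BYTES_PER_DOUBLE   # every entry of A stored

# Assumption for illustration: about 10 out-links per page, each stored as
# a 4-byte column index plus an 8-byte value.
AVG_LINKS_PER_PAGE = 10
sparse_matrix_bytes = N * AVG_LINKS_PER_PAGE * (4 + BYTES_PER_DOUBLE)

GB, EB = 10**9, 10**18
print(rank_vectors_bytes / GB, "GB for the two rank vectors")
print(dense_matrix_bytes / EB, "EB for a dense A")
print(sparse_matrix_bytes / GB, "GB for a sparse A under the assumption above")
```

Under that assumption the sparse matrix would fit in the RAM of a single large node, while the dense form is far beyond any one machine.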
Worker Node
Datacenter node: 16 cores
• RAM (main memory): 128-512 GB, 40-60 GB/s
• SSD: 1-4 TB, 1-4 GB/s (x4 disks)
• Disks (secondary): 10-30 TB, 0.2-1 GB/s (x10 disks) (seek)
• Network: 1-10 Gbps
Source: AmpLab, UCB, Dell
Disk is slowest and not Enough
• 50 billion web pages x 20 KB = 1 PB
• 1 computer reads 30-35 MB/sec from disk
~11-13 months to read it all
• Also, it requires 1,000 hard drives (1 TB each) to store it all
Can you wait that long for one search?
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,
http://www.mmds.org
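A quick check of the scan time, as a sketch in Python. The `scan_days` helper and the 1,000-machine case are illustrative assumptions; the corpus size and disk rates come from the slide.

```python
SECONDS_PER_DAY = 86_400


def scan_days(total_bytes, bytes_per_second, machines=1):
    """Days needed to read total_bytes when each machine reads at bytes_per_second."""
    return total_bytes / (bytes_per_second * machines) / SECONDS_PER_DAY


corpus = 50 * 10**9 * 20 * 10**3          # 50 billion pages x 20 KB = 1 PB
slow = scan_days(corpus, 30 * 10**6)      # one disk at 30 MB/s
fast = scan_days(corpus, 35 * 10**6)      # one disk at 35 MB/s
parallel = scan_days(corpus, 35 * 10**6, machines=1000)  # hypothetical cluster

print(round(slow), "days at 30 MB/s;", round(fast), "days at 35 MB/s")
print(round(parallel * 24, 1), "hours with 1,000 machines reading in parallel")
```

A single machine needs on the order of a year, while splitting the same read across 1,000 machines brings it down to hours, which motivates the cluster approach.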
Parallelism using Cluster
• 8-64 nodes/rack, 4-16 racks in a cluster
• 1 Gbps bandwidth within rack, 8 Gbps out of rack
• Node specs :
8-16 cores, 128-512 GB RAM, 10×1 TB disks
Aggregation switch
Rack switch
ToR
But Nodes Fail at Scale
• One server may last up to 3 years (about 1,000 days)
• If you have 1,000 servers, expect to lose 1 per day
• Google has about 1 million servers
– Hence about 1,000 machines will fail every day.
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,
http://www.mmds.org
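The failure estimate is simple expected-value arithmetic. The sketch below assumes, for illustration only, that each server fails independently with a daily probability of 1 in 1,000; the function names are hypothetical.

```python
def expected_failures_per_day(servers, lifetime_days=1000):
    """Expected number of servers failing per day."""
    return servers / lifetime_days


def prob_at_least_one_failure(servers, lifetime_days=1000):
    """Chance that at least one server fails on a given day.

    Assumes independent failures with daily probability 1 / lifetime_days.
    """
    return 1 - (1 - 1 / lifetime_days) ** servers


print(expected_failures_per_day(1_000))        # 1.0
print(expected_failures_per_day(1_000_000))    # 1000.0
print(round(prob_at_least_one_failure(1_000), 2))
```

Even at 1,000 servers, a day with at least one failure is more likely than not, so failure handling has to be built into the system rather than treated as an exception.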
Traditional RDBMS Fail
• Not designed for a variety of data types (text, video, images)
• Not capable of handling big volume (PB/EB/ZB)
• Not designed for parallelism
• Poor fault tolerance at scale (a million servers)
• Slows down due to joins, volume, ACID checks, and high-velocity requests
• Designed for transaction processing, not for deep analytics (intensive computing)
Google Solution: DFS
• Distributed File System
• Divide the bigger data file into smaller chunks of 16-64 MB and store them on different nodes in different racks.
• Chunks are replicated (2-3 copies) for fault tolerance
[Figure: chunks C0, C1, C2, C3, C5 and D0, D1 replicated across chunk servers 1, 2, 3, … N]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,
http://www.mmds.org
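The chunking and replication idea can be sketched as follows. This is a simplified round-robin placement written for illustration; the function names, rack names, and node names are hypothetical, and real systems such as GFS and HDFS use more involved placement policies.

```python
def split_into_chunks(size_bytes, chunk_bytes=64 * 2**20):
    """Return (offset, length) pairs covering a file of size_bytes."""
    return [(off, min(chunk_bytes, size_bytes - off))
            for off in range(0, size_bytes, chunk_bytes)]


def place_replicas(num_chunks, nodes_by_rack, replicas=3):
    """Assign each chunk to `replicas` nodes, each on a different rack."""
    racks = sorted(nodes_by_rack)
    if replicas > len(racks):
        raise ValueError("need at least as many racks as replicas")
    placement = {}
    for chunk in range(num_chunks):
        chosen = []
        for k in range(replicas):
            rack = racks[(chunk + k) % len(racks)]   # rotate over racks
            nodes = nodes_by_rack[rack]
            chosen.append((rack, nodes[chunk % len(nodes)]))
        placement[chunk] = chosen
    return placement


# Hypothetical cluster: 3 racks with 2 chunk servers each, one 200 MB file
cluster = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5", "n6"]}
chunks = split_into_chunks(200 * 2**20)
placement = place_replicas(len(chunks), cluster)
for chunk_id, locations in placement.items():
    print(chunk_id, locations)
```

Because each copy of a chunk sits on a different rack, losing one node or one whole rack still leaves a readable replica elsewhere.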
& Map-Reduce
Map-Reduce environment(Master) takes care of:
• Handling machine failures (with replica nodes)
• Partitioning the input data
• Scheduling workers
• Performing the group by key step
• Managing inter-machine communication
J. Leskovec, A. Rajaraman, J. Ullman: Mining of
Massive Datasets, http://www.mmds.org
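A toy, single-process sketch of the map, group-by-key, and reduce steps, applied to one matrix-vector multiplication of the kind PageRank needs. The `map_reduce` and `matrix_vector_step` helpers are illustrative and run in memory; a real Map-Reduce environment would partition the input, schedule workers, and handle failures across machines.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper on each record, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)          # the group-by-key step
    return {key: reducer(key, values) for key, values in groups.items()}


def matrix_vector_step(entries, r):
    """Compute A . r where entries are (row, col, value) triples of a sparse A."""
    def mapper(entry):
        row, col, a = entry
        yield row, a * r[col]                  # partial product for this row

    def reducer(row, products):
        return sum(products)                   # sum the partial products

    return map_reduce(entries, mapper, reducer)


# Hypothetical 3 x 3 sparse matrix and a uniform starting rank vector
entries = [(0, 0, 0.5), (0, 1, 0.5), (1, 0, 0.5), (1, 2, 1.0), (2, 1, 0.5)]
r = [1 / 3, 1 / 3, 1 / 3]
new_r = matrix_vector_step(entries, r)
print(new_r)
```

Each map call touches only one matrix entry, so the entries can be spread over many machines; only the grouped partial products need to move between them.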
Big Data Platform
M-R App
MapReduce Stack
(Hadoop & Spark)
Distributed File Systems
(HDFS/ GFS)
BUT WITH RESTRICTION
DFS is useful, only when
• Size is big (> 1 TB)
• Files are rarely updated
– Works for Google (to store indexed pages)
– Will not be effective for Airline reservation system
(where frequent data updates are done)
M-R is useful, only when
• Dimension in billions
– Matrix-vector multiplication in Google Pagerank
• Graph with millions of nodes and billions of edges
– FB Network Graph
• Deep analytical application with intensive computing
– Useful for finding users with similar buying patterns for product recommendations on Amazon
– But not useful for managing Amazon's online retail sales (frequent data updates, transactions)
BDI PARADIGM
Google Creates
• DFS (GFS)
• Map-Reduce
• Dremel (BigQuery)
• Pregel
& Apache Follows
• GFS → HDFS
• Map-Reduce → Hadoop, Spark
• Dremel → Drill
• Pregel → Giraph
SCALA
• Uses and runs on the Java Virtual Machine
• Yet simpler to write (more succinct) than Java
– Statically typed, with strong type inference
– Less code
• Functional Programming (+ OOP)
– First-class functions
• Used to develop the Spark stack (which can run on Hadoop 2.0 / YARN)
• Well suited for Map-Reduce applications
– Traits, collections, nested classes
– Immutable datasets
– Scalable
Mining
• Link Analysis
• Classification
• Content based recommendation
• Collaborative Filtering
• Finding similar items
• Clustering, Decomposition…
Machine Learning (Supervised/ Unsupervised)
Cloud
• Amazon AWS
• Google Cloud Platform
• IBM Bluemix
• OpenStack
• Databricks, Cloudera, Hortonworks, MapR,…
• SAP, Oracle…
Spark as a service, Hadoop as a service
One Circle
BIG POTENTIAL
Big Market
• $16.9 billion in 2015
• $50 billion by 2017
• 90 percent of the Fortune 500 already initiated big data
projects
• Big Data Spending : $8M Per company
• 200 TB of stored data per company
– with >1000 employees
Ref: McKinsey 2011
Big Players
• Leaders
– IBM, HP, Dell, SAP, Teradata, Oracle, SAS, Accenture
(>$400 Million)
• Pure players (100% revenue from Big data)
– Palantir, Pivotal, Splunk, Mu Sigma, Actian, Opera Solutions
(>$100 Million)
• Indian Players
– TCS, CapGemini
(>$10 million)
Source: Wikibon, 2013
Big Jobs
J. Leskovec, A. Rajaraman, J. Ullman: Mining of
Massive Datasets, http://www.mmds.org
BIG ENABLERS
Smart phones
• 1.2 billion sold in 2014
– 23.1 % increase over 2013
• Account for 27% of global handsets
– but consume 95% of global traffic (1.5 EB/month)
• Daily SMS count exceeded the world population
Ref: Znet, Fool.com
Nielsen’s Law
Bandwidth doubles every twenty-one months
5G in 2020 and 6G in 2030.
Moore's Law
[Figure: Zilog PC (1980) vs. iPhone (2007)]
Kryder's Law
In 2020, a 2.5-inch disk drive would store about 40 TB and cost about $40.
Storage capacity (doubles every 12 months) grows faster than Moore’s law (processing capacity doubles every 18-24 months).
All Together
Law | What grows | Annual growth rate
Nielsen's Law | Internet bandwidth | 50%
Moore's Law | Computing power | 60%
Kryder’s Law | Storage capacity | 100%
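Doubling periods and annual growth rates are two views of the same number. A small Python sketch of the conversion follows; the helper names are illustrative, and the 18-month figure for Moore's law is the lower end of the 18-24 month range quoted earlier.

```python
import math


def annual_growth_from_doubling(months):
    """Annual growth rate implied by doubling every `months` months."""
    return 2 ** (12 / months) - 1


def doubling_months_from_growth(rate):
    """Months needed to double at a given annual growth rate."""
    return 12 * math.log(2) / math.log(1 + rate)


for law, months in [("Nielsen", 21), ("Moore", 18), ("Kryder", 12)]:
    growth = annual_growth_from_doubling(months)
    print(f"{law}: doubles every {months} months, about {growth:.0%} per year")
```

Doubling every 21, 18, and 12 months corresponds to roughly 49%, 59%, and 100% annual growth, in line with the rounded 50%, 60%, and 100% in the table.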
BIG SOURCES
Social Networks
No. of active users in millions, as of August 2015
http://www.statista.com/
Facebook
Ref: Chassis-plans.com, Wikibon
60 million posts per day
2.6 billion likes per day
375 million photos uploaded per day
15 TB data uploaded per day
600 TB data handled per day
700 TB Graph search DB
300 PB user data
http://allfacebook.com/orcfile b130817
Twitter
Ref: Chassis-plans.com, Wikibon
500 million tweets per day
1.6 billion search queries per day
316 million monthly active users
80% active users on mobile
YouTube
• 100 hours of new video every minute
• 53% of mobile traffic is video
• Average human vision input: ½ million hours per lifetime
• YouTube new uploads: 15 million hours per year
MORE STORIES
House of Cards
Big data analytics picked up on the success of the British version of House of Cards and on the popularity of movies by David Fincher (director) and Kevin Spacey (actor).
Netflix then made a major decision to commit $100 million to two 13-episode seasons of its remake (the US version) with this team, streamed online.
Netflix earned $1 billion in that quarter.
First Emmy-winning streaming show
The Atlantic: May 2012
https://gigaom.com/2013/04/22/netflix-q1-2014-earnings/
Lumiata
Creates personalized treatment recommendations based on patients' health data, using 170 million data points
Raised US$10 million from VCs
Ash Damle, Founder & CEO
MedAware
Avoids prescription errors due to
• Drug mix-up
• Patient mix-up
• Unawareness of clinical data
• Dosage mix-up
Example: Chlorambucil (chemotherapy) prescribed to a patient without cancer, instead of Chloramphenicol (antibiotic)
Uses a mathematical model derived from millions of EMRs, which represents real-world treatment patterns
Raised US$1 million in funding
Windward
• Only platform to analyze maritime data from ships and oceans to maintain ship histories, predict threats, and help make large financial decisions on shipping and commodity flows
• Before 2010, it was impossible to know a vessel’s location once it sailed more than 30 miles offshore. Commercial satellites were then introduced, but the big data collected from ships gave a corrupted picture
Raised $15.8 million.
mnubo
• Analytics of IoT data
• Analytics of connected-car data for driving habits, vehicle failure patterns, inventory management, usage-based insurance, etc.
(36M connected cars will be on the road in 2020)
Raised $6 million
rocana
How many of your servers are talking to blacklisted IPs?
How long has your business been hacked?
Rocana helps IT identify the root cause of performance or security issues at any scale and complexity and resolve underlying issues in real time.
Instead of employing “brute force” searches against millions of log entries, advanced analytics identifies anomalies for investigation.
Raised $19.4 million
Whetlab
• Only 5 data scientists on the team
• Acquired by Twitter for an undisclosed sum to improve its ability to show users the kinds of tweets and content they actually want to see.
Applied Predictive Technologies
Cloud-based cause-and-effect analytics platform to accurately measure the profit impact of pricing, marketing, merchandising, operations, and capital initiatives, tailoring investments in these areas to maximize ROI.
Acquired by MasterCard for $600 million.
Netflix Challenge
• Data: How users have rated movies
– 100.5 million ratings by 5 lakh (500,000) users of 18K movies
• Goal: Predict how a user would rate an unrated movie
– A recommender system problem
– 10% improvement: 1 million dollar prize
Hsuan-Tien Lin
KDD Cup Challenge
• Data: How users rated songs
– 252.8 million ratings by 1 million users of 650K songs (Yahoo!)
• Goal: Recommend new songs that a user would like
Hsuan-Tien Lin
BDI for National Security
• TIA after 9/11
• NATGRID after the Mumbai attack (26/11)
– We could have stopped both if we had connected the pieces of intel from all security agencies with the information tracked from suspects.
More Applications
• Building a Stock Investment Strategy Model
• Predicting Customer Transaction Behavior
• Failure Prediction
• Opinion Mining to Determine User Sentiments
• Financial Loss Prediction
• Insurance Claim Prediction Model
• Bond Trade Price Prediction
• Prediction of Number of Days in the Hospital
• Accelerating Discovery of Drugs for Mutants of H1N1
• Molecular Activity Prediction
• Job Recommendation Engine
https://insofeprojects.wordpress.com/insofe-projects/
WHAT NEXT
A first course on BDI
Day | Topics
Day 1 FN | BDI: The Beginning; DFS and Map-Reduce; Distributed Graph (Pregel); Page Rank algorithm
Day 1 AN | BDI Tools Landscape; Dremel and BigQuery; Naïve Bayes Classifier
Day 2 FN | TF-IDF, Jaccard and Cosine; Collaborative filtering; Shingling, Minhashing; Locality Sensitive Hashing
Day 2 AN | Scala Basics for MR apps; Practice session; More fun with Scala
Day 3 FN | Spark projects using Scala
Day 3 AN | Student project ideas; Q&A
M.S. Options in USA
University | Program
Stanford University | M.S. in CS, specialization in Information Management and Analytics; four-course graduate certificate in mining massive datasets (link)
Northwestern University | Master of Science in Analytics
DePaul University | Master of Science in Predictive Analytics
North Carolina State University | Master of Science in Analytics
University of Ottawa, Canada | M.Sc. in Analytics
University of Connecticut | MS in Business Analytics and Project Management
Source: informationweek.com, IBM Director Dr. Spohrer's short list
PG options in India
Institute | Program
Indian School of Business | Certified Program in Business Analytics (CBA)
Great Lakes Institute of Management | PGP in Business Analytics
IIM Bangalore | Analytics Essentials, BAI
IIM Ahmedabad | Advanced Analytics for Management
Source: AnalyticsVidya.com, analyticsindiamag.com
Road Ahead
“The ultimate search engine would understand exactly what you mean and give back exactly what you want.”
- Larry Page
Evolution of Manager Desk
Tree is God and above all

BDI- The Beginning (Big data training in Coimbatore)

  • 1. Big Data Intelligence The Beginning Prof.Ashok.R | +91-9943900101 | ashok@zettab.com ZettaB.com Big Data Training in Coimbatore Ref: Ullman et.al, Mining Massive Datasets
  • 2. Caution • The grass is always green on the other side Be inspired! • Stories.. and more stories… Be informed! • The devil is in the details Be challenged! 2Hsuan- Tien Lin
  • 3. The Dream in 1945 3 • A dream machine of Vannevar Bush (1945) • An extended supplement to Human Memory • A device which stores individual library such as books, records and communications • Microfilms can be searched, copied and shared • Useful to store and share information among lawyers, patent attorney, Doctor and chemists • The base concept from which WWW evolved
  • 4. Leads to Web and Web Scale Data • Data Volume: doubles every 1.2 years 5 EB – Total data produced in 5000 years till 2003 20 EB – Data collected by Google alone for a day now • Data Variety : structured, semi and unstructured xml, JSON, doc, pdf, html, email body, .mp4,.jpeg… • Data Velocity : Lot happens in a minute 72 hours of new video uploaded in YouTube 3 million searches in google 200 million emails sent 350 thousand Tweets | 1 million searches in Twitter 690 thousand shares | 420 GB data handled in FB 20 million photo views in Flickr 4 Source: Qmee, Wikibon https://blog.kissmetrics.com/facebook-statistics/
  • 5. Desktop Hobbyist Internet Big Data Byte one grain of rice Kilobyte cup of rice Source: What is big data, Slideshare.net Megabyte 8 bags of rice Gigabyte 3 Trucks of rice Terabyte 2 container ships Petabyte Fills half the area of Tirupur Exabyte Fills the area of south india ZettaByte Fills Indian ocean twice PB/EB/ZB 210 220 230 240 250 260 270 1
  • 6. Big Data Intelligence (BDI) The ability to understand all of us better by connecting the dots from massive data sets (with TB/PB Volume, streaming Velocity and Variety in sources) to predict the future. 6
  • 7. WHY DO WE PREDICT 7
  • 8. To Survive 8 With largest neural network brain to store and process big volume of data with 100 billions of neurons and 2.5 PB equivalent memory @ 100 million MIPS (33K i7 cores) Vision | Touch | Hearing | Smell | Taste Scientificamerican.com, Storagecraft.com 1250 MB/s | 125 MB/s | 12.5 MB/s | 1.25 MB/s You only feel 0.7% of What you sense
  • 9. The Prediction Power • 10000 hours (7-8 years) of rigorous practice is required to be the world-class expert—in anything Daniel Levitin, The neurologist • This enables the ability to predict 2 seconds before others- “Two Second Advantage” – Wayne Gretzky, The greatest Ice-hockey player of all time, was able to predict where the puck was going to be, an instant before it arrived – Sachin Tendulkar – Warren Buffet – Viswanath Anand 9 wins the competitors
  • 10. Can Machines Think (to Predict)? 10 Alan Turing asked this question in 1950 and proposed a test to validate it.
  • 11. Which is machine, and which is woman??? Can Machine Imitate Brain? Which is man, and which is woman??? Turing Test 11
  • 12. Did any machine pass? 12 Any machine nearer? (near AI) “William Wilkinson’s ‘An account of the principalities of Wallachia and Modavia’ inspired this author’s most famous novel.” Jeopardy! Quiz Contest. The challenge is to predict the question and bet with reasonable confidence. No.
  • 13. IBM Watson Computer 13 “William Wilkinson’s ‘An account of the principalities of Wallachia and Modavia’ inspired this author’s most famous novel.”
  • 14. Near AI Solutions • Natural language processing • Machine learning • Prediction analytics • Face recognition • Languages translation • Speech recognition 14
  • 15. Can machine predict? • Share price in a stock market next day • Top 5 products consumers want to buy next week • Price of Tomato(1 Kg) next month • No. of cars to be sold next quarter • Potential criminals in the city/ mega event • When machine/human will become sick • Best matched course/school to study • Best matched job/company to work 15
  • 16. Google Story: Where it all began • 50 billion indexed pages • Thousands/Millions/Billions of pages may match each search query • How to rank them in order to display the most relevant (important) pages in the top. • Predict what you want to see. Not what you asked. Do You Know? 4 billion searches happen in a day Each query uses 1000 nodes Result returned in 0.2 seconds. 20 billion pages crawled per day 20 Exabytes of data collected in a day
  • 17. Page Rank • Give pages ranks (scores) based on links to them – Links from many pages  high rank – Link from a high-rank page  high rank Parallel Programming With Spark, Matei Zaharia    ji t it j r r i 1 d “rank” rj for page j
  • 18. Matrix-Vector Multiplication MatrixGoogle A.rr t1t   A • Page rank equation in a practical form, (Rank vector r is the Eigen vector of A) Iteration is repeated, till rank vector converges (or max. iteration reaches) For iteration t+1,
  • 19. RAM is not Enough • Won’t be a problem for small dimension (NxN) • Consider, N=1 billion (pages that match a query) • Dimension is now in billions – A is billion x billion matrix – r is billion size rank vector – r(old,new) has two billion entries ( 16 GB for 8 bytes double values) – A has billion x billion entries ( 8 ExaBytes) Though, we have methods such as sparse matrix to reduce dimensions in actual implementation. RAM size of a highly configured server node: 128-512 GB
  • 20. Worker Node 20 Datacenter node 16 cores 10-30 TB disks (Secondary) 128-512GB RAM (Main memory) 1-4TB (SSD) 1 -10 Gbps 0.2-1GB/s (x10 disks) (Seek) 1-4GB/s (x4 disks) 40-60GB/s Source: AmpLab, UCB, Dell
  • 21. Disk is slowest and not Enough • 50 billion web pages x 20KB = 1 PB • 1 computer reads 30-35 MB/sec from disk ~10 months to read all • Also, it requires 1,000 hard drives to store all 21J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org can you wait that long for one search?
  • 22. Parallelism using Cluster • 8-64 nodes/rack, 4-16 racks in a cluster • 1 Gbps bandwidth within rack, 8 Gbps out of rack • Node specs : 8-16 cores, 128-512 GB RAM, 10×1 TB disks Aggregation switch Rack switch ToR
  • 23. But Nodes Fail at Scale • One server may stay upto 3 years (1,000 days) • If you have 1,000 servers, expect to loose 1/day • Google has 1 Million servers –Hence 1000 machines will fail every day. 23J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
  • 24. Traditional RDBMS Fail • Not designed for variety of data types (Text, Video, Images) • Not capable to handle big volume (PB/EB/ZB) • Not designed for parallelism • Poor fault tolerance at Scale (Million servers) • Slow down due to joins, volume, ACID check and high velocity requests • Designed for transaction processing; Not designed for deep analytics (intensive computing) 24
  • 25. Google Solution: DFS • Distributed File System • Divide the bigger data file into smaller chunks of size 16-64 MB and store them in different nodes in different racks. • Chunks are replicated (2-3) for fault tolerance 25 C0 C1 C2C5 Chunk server 1 D1 C5 Chunk server 3 C1 C3C5 Chunk server 2 … C2D0 D0 C0 C5 Chunk server N C2 D0 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
  • 26. & Map-Reduce Map-Reduce environment(Master) takes care of: • Handling machine failures (with replica nodes) • Partitioning the input data • Scheduling workers • Performing the group by key step • Managing inter-machine communication J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 26
  • 27. Big Data Platform M-R App MapReduce Stack (Hadoop & Spark) Distributed File Systems (HDFS/ GFS)
  • 29. DFS is useful, only when • Size is big (> 1 TB) • Files are rarely updated – Works for Google (to store indexed pages) – Will not be effective for Airline reservation system (where frequent data updates are done) 29
  • 30. M-R is useful only when • Dimensions are in the billions – Matrix-vector multiplication in Google PageRank • The graph has millions of nodes and billions of edges – FB network graph • The application is deep analytics with intensive computing – Useful for finding users with similar buying patterns for product recommendations at Amazon – But not useful for managing Amazon's online retail sales (frequent data updates, transactions)
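The matrix-vector multiplication mentioned on the slide above fits the map and reduce pattern naturally. The sketch below is a toy, in-memory version that assumes the vector fits in memory on every mapper and that the matrix is stored sparsely as (row, column, value) triples; the function names are hypothetical.

```python
from collections import defaultdict


def mapper(entry, vector):
    """Map: for matrix entry (i, j, m_ij), emit (i, m_ij * v_j)."""
    i, j, m_ij = entry
    return i, m_ij * vector[j]


def matrix_vector_mapreduce(entries, vector, num_rows):
    """Multiply a sparse matrix by a vector using map and reduce steps.

    `entries` holds only the non-zero elements as (row, col, value).
    The reduce step sums all partial products that share a row index.
    """
    sums = defaultdict(float)
    for entry in entries:
        row, partial_product = mapper(entry, vector)
        sums[row] += partial_product  # reduce: sum per key (row)
    return [sums[i] for i in range(num_rows)]


if __name__ == "__main__":
    # A 2x2 matrix in which every entry is 0.5, applied to the vector (1, 0).
    matrix = [(0, 0, 0.5), (0, 1, 0.5), (1, 0, 0.5), (1, 1, 0.5)]
    print(matrix_vector_mapreduce(matrix, [1.0, 0.0], num_rows=2))
```

Because each matrix entry is processed independently, the map work can be spread over as many machines as there are chunks of the matrix.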
  • 32. Google Creates • DFS (GFS) • Map-Reduce • Dremel (Big Query) • Pregel
  • 33. & Apache Follows • GFS → HDFS • Map-Reduce → Hadoop, Spark • Dremel → Drill • Pregel → Giraph
  • 34. SCALA • Uses and runs on the Java Virtual Machine • Yet simpler to write (more succinct) than Java – Strong type inference (statically typed) – Less code • Functional programming (+ OOP) – First-class functions • Used to develop the Spark stack (Hadoop 2.0) • Well suited for Map-Reduce applications – Traits, collections, nested classes – Immutable datasets – Scalable
  • 35. Mining • Link Analysis • Classification • Content based recommendation • Collaborative Filtering • Finding similar items • Clustering, Decomposition… Machine Learning (Supervised/ Unsupervised)
  • 36. Cloud • Amazon AWS • Google Cloud Platform • IBM Bluemix • OpenStack • Databricks, Cloudera, Hortonworks, MapR,… • SAP, Oracle… Spark as a service, Hadoop as a service
  • 39. Big Market • $16.9 billion in 2015 • $50 billion by 2017 • 90 percent of the Fortune 500 have already initiated big data projects • Big data spending: $8M per company • 200 TB of stored data per company (with >1,000 employees) Ref: McKinsey 2011
  • 40. Big Players • Leaders – IBM, HP, Dell, SAP, Teradata, Oracle, SAS, Accenture (>$400 million) • Pure players (100% of revenue from big data) – Palantir, Pivotal, Splunk, Mu Sigma, Actian, Opera Solutions (>$100 million) • Indian players – TCS, CapGemini (>$10 million) Ref: Wikibon 2013
  • 41. Big Jobs J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
  • 43. Smart phones • 1.2 billion sold in 2014 – 23.1% increase over 2013 • Account for 27% of global handsets – but consume 95% of global traffic (1.5 EB/month) • Daily SMS count exceeded the world population Ref: Znet, Fool.com
  • 44. Nielsen’s Law • Bandwidth doubles every twenty-one months • 5G in 2020 and 6G in 2030.
  • 46. Kryder's Law • In 2020, a 2.5-inch disk drive would store ~40 TB and cost about $40. • Storage capacity (doubles every 12 months) grows faster than processing capacity under Moore’s law (doubles every 18-24 months).
  • 47. All Together: Annual Growth Rate • Nielsen's Law: Internet bandwidth, 50% • Moore's Law: computing power, 60% • Kryder’s Law: storage capacity, 100%
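The annual growth rates above and the doubling periods quoted on the earlier slides describe the same thing in two ways. Assuming steady compound growth, a quantity growing at annual rate r doubles every 12 · ln 2 / ln(1 + r) months. The sketch below applies that formula; the function name is an illustrative choice.

```python
import math


def doubling_time_months(annual_growth_rate):
    """Months needed to double at a constant compound annual growth rate.

    `annual_growth_rate` is a fraction, e.g. 0.5 for 50% per year.
    Solves (1 + r) ** (t / 12) == 2 for t.
    """
    return 12.0 * math.log(2.0) / math.log(1.0 + annual_growth_rate)


if __name__ == "__main__":
    laws = {
        "Nielsen's Law (bandwidth)": 0.50,
        "Moore's Law (computing power)": 0.60,
        "Kryder's Law (storage capacity)": 1.00,
    }
    for name, rate in laws.items():
        print(f"{name}: doubles every {doubling_time_months(rate):.1f} months")
```

The results land close to the slide figures: roughly twenty-one months for bandwidth, about eighteen months for computing power, and exactly twelve months for storage.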
  • 49. Social Networks • [Chart: number of active users in millions, as of August 2015] Source: http://www.statista.com/
  • 50. Facebook • 60 million posts per day • 2.6 billion likes per day • 375 million photos uploaded per day • 15 TB data uploaded per day • 600 TB data handled per day • 700 TB Graph search DB • 300 PB user data Ref: Chassis-plans.com, Wikibon, http://allfacebook.com/orcfile b130817
  • 51. Twitter • 500 million tweets per day • 1.6 billion search queries per day • 316 million monthly active users • 80% of active users on mobile Ref: Chassis-plans.com, Wikibon
  • 52. YouTube • 100 hours of new video every minute • 53% of mobile traffic is video • Avg human vision input: ½ million hours/life • YouTube new uploads: 15 million hours/year
  • 54. House of Cards • Big data analytics picked up on the success of the British version of House of Cards and the popularity of movies directed by David Fincher or starring Kevin Spacey • Netflix then made a major decision to commit $100 million for two 13-episode seasons of its remake (US version) with that team, streamed online • Netflix earned $1 billion in that quarter • First Emmy-winning streaming show Ref: The Atlantic, May 2012; https://gigaom.com/2013/04/22/netflix-q1-2014-earnings/
  • 55. Lumiata • Creates personalized treatment recommendations based on patients' health data, using 170 million data points • Raised US$10 million from VCs • Ash Damle, Founder & CEO
  • 56. MedAware • Avoids prescription errors due to: – Drug mix-up – Patient mix-up – Unawareness of clinical data – Dosage mix-up • Example: Chlorambucil (chemotherapy) prescribed to a patient without cancer, instead of Chloramphenicol (antibiotic) • Uses a mathematical model derived from millions of EMRs, which represents real-world treatment patterns • Raised US$1 million in funding
  • 57. Windward • Only platform to analyze maritime data from ships and the ocean to maintain ship history, predict threats and help make huge financial decisions on shipping and commodity flows • Before 2010, it was impossible to know a vessel’s location once it sailed more than 30 miles offshore; then commercial satellites were introduced, but the big data collected from ships gave a corrupted picture • Raised $15.8 million
  • 58. mnubo • Analytics of IoT data • Analytics of connected-car data for driving habits, vehicle failure patterns, inventory management, usage-based insurance, etc. (36M connected cars will be on the road in 2020) • Raised $6 million
  • 59. rocana • How many of your servers are talking to blacklisted IPs? How long has your business been hacked? • Rocana helps IT identify the root cause of performance or security issues at any scale and complexity and resolve underlying issues in real time. • Instead of employing “brute force” searches against millions of log entries, advanced analytics identifies anomalies for investigation. • Raised $19.4 million
  • 60. Whetlab • A team of only 5 data scientists • Acquired by Twitter in an undisclosed deal to improve its ability to show users the kinds of tweets and content they actually want to see.
  • 61. Applied Predictive Technologies • Cloud-based cause-and-effect analytics platform to accurately measure the profit impact of pricing, marketing, merchandising, operations, and capital initiatives, tailoring investments in these areas to maximize ROI. • Acquired by MasterCard for $600 million.
  • 62. Netflix Challenge • Data: how users have rated movies – 100.5 million ratings of 18K movies by 5 lakh users • Goal: predict how a user would rate an unrated movie – A recommender system problem – 10% improvement: 1 million dollar prize Ref: Hsuan-Tien Lin
  • 63. KDD Cup Challenge • Data: how users rated songs – 252.8 million ratings of 650K songs by 1 million users (Yahoo!) • Goal: recommend new songs that the user would like Ref: Hsuan-Tien Lin
  • 64. BDI for National Security • TIA after 9/11 • NATGRID after the Mumbai attack (26/11) – We could have stopped both if we had connected the pieces of intel from all security agencies with the info tracked from suspects.
  • 65. More Applications • Building a Stock Investment Strategy Model • Predicting Customer Transaction Behavior • Failure Prediction • Opinion Mining to Determine User Sentiments • Financial Loss Prediction • Insurance Claim Prediction Model • Bond Trade Price Prediction • Prediction of Number of Days in the Hospital • Accelerating Discovery of Drugs for Mutants of H1N1 • Molecular Activity Prediction • Job Recommendation Engine Ref: https://insofeprojects.wordpress.com/insofe-projects/
  • 67. A first course on BDI (Day: Topics) • Day 1 FN: BDI: The Beginning; DFS and Map-Reduce; Distributed Graph (Pregel); Page Rank algorithm • Day 1 AN: BDI Tools Landscape; Dremel and Big Query; Naïve Bayes Classifier • Day 2 FN: TF-IDF, Jaccard and Cosine; Collaborative filtering; Shingling, Minhashing; Locality Sensitive Hashing • Day 2 AN: Scala Basics for MR apps; Practice session; More fun with Scala • Day 3 FN: Spark projects using Scala • Day 3 AN: Student project ideas; Q&A
  • 68. M.S. Options in USA (University: Program) • Stanford University: M.S-CS, Specialization in Information Management and Analytics; four-course graduate certificate in mining massive datasets (link) • Northwestern University: Master of Science in Analytics • DePaul University: Master of Science in Predictive Analytics • North Carolina State University: Master of Science in Analytics • University of Ottawa, Canada: M.Sc in Analytics • University of Connecticut: MS in Business Analytics and Project Management Source: informationweek.com, IBM Director Dr. Spohrer's short list
  • 69. PG options in India (Institute: Program) • Indian School of Business: Certified Program in Business Analytics (CBA) • Great Lakes Institute of Management: PGP in Business Analytics • IIM Bangalore: Analytics Essentials, BAI • IIM Ahmedabad: Advanced Analytics for Management Source: AnalyticsVidya.com, analyticsindiamag.com
  • 70. Road Ahead “The ultimate search engine would understand exactly what you mean and give back exactly what you want.” - Larry Page
  • 72. Tree is God and above all
  • 73. Prof.Ashok.R | +91-9943900101 | ashok@zettab.com ZettaB.com Big Data Training in Coimbatore

Editor's Notes

  • #14: Watson was developed by 25 researchers over four years. The software runs on a supercomputer with 2,880 IBM Power750 cores, or computing brains, and 15 terabytes of memory. One of Watson’s advantages is that it can hit the buzzer to answer a question faster than any human possibly can, in six to 10 milliseconds. Watson won $1 million, and all of its winnings will be donated to charity. Watson is an analytical computing system that specializes in natural human language and provides specific answers to complex questions at rapid speeds. Watson cannot respond to video or audio clues, so these were omitted by the Jeopardy! producers.
  • #46: An Osborne Executive portable computer from 1982, with a Zilog Z80 4 MHz CPU, and a 2007 Apple iPhone with a 412 MHz ARM11 CPU; the Executive weighs 100 times as much, has nearly 500 times as much volume, cost approximately 10 times as much (adjusted for inflation), and has about 1/100th the clock frequency of the smartphone.
  • #55: “House of Cards” is one of the first major test cases of this Big Data-driven creative strategy. For almost a year, Netflix executives have told us that their detailed knowledge of Netflix subscriber viewing preferences clinched their decision to license a remake of the popular and critically well regarded 1990 BBC miniseries. Netflix’s data indicated that the same subscribers who loved the original BBC production also gobbled down movies starring Kevin Spacey or directed by David Fincher. Therefore, concluded Netflix executives, a remake of the BBC drama with Spacey and Fincher attached was a no-brainer, to the point that the company committed $100 million for two 13-episode seasons.
  • #61: http://whatsthebigdata.com/big-data-startups/