BDI- The Beginning (Big data training in Coimbatore)

Big Data Intelligence
The Beginning
Prof.Ashok.R | +91-9943900101 | ashok@zettab.com
ZettaB.com
Big Data Training in Coimbatore
Ref: Ullman et al., Mining of Massive Datasets

Caution
• The grass is always greener on the other side
Be inspired!
• Stories.. and more stories…
Be informed!
• The devil is in the details
Be challenged!
Source: Hsuan-Tien Lin

The Dream in 1945
• The dream machine of Vannevar Bush (1945)
• An extended supplement to human memory
• A device that stores an individual's library, such as books, records and communications
• Microfilms can be searched, copied and shared
• Useful for storing and sharing information among lawyers, patent attorneys, doctors and chemists
• The base concept from which the WWW evolved

Leads to Web and Web Scale Data
• Data Volume: doubles every 1.2 years
5 EB – total data produced in the 5,000 years up to 2003
20 EB – data collected by Google alone in a single day now
• Data Variety: structured, semi-structured and unstructured
XML, JSON, doc, pdf, html, email body, .mp4, .jpeg…
• Data Velocity: a lot happens in a minute
72 hours of new video uploaded to YouTube
3 million searches on Google
200 million emails sent
350 thousand tweets | 1 million searches on Twitter
690 thousand shares | 420 GB of data handled on Facebook
20 million photo views on Flickr
Source: Qmee, Wikibon
https://blog.kissmetrics.com/facebook-statistics/

Unit | Size (bytes) | Rice equivalent
Byte | 1 | One grain of rice
Kilobyte | 2^10 | Cup of rice
Megabyte | 2^20 | 8 bags of rice
Gigabyte | 2^30 | 3 trucks of rice
Terabyte | 2^40 | 2 container ships
Petabyte | 2^50 | Fills half the area of Tirupur
Exabyte | 2^60 | Fills the area of South India
Zettabyte | 2^70 | Fills the Indian Ocean twice
Scale labels on the slide: Desktop, Hobbyist, Internet, Big Data (PB/EB/ZB)
Source: What is big data, Slideshare.net

Big Data Intelligence (BDI)
The ability to understand all of us better by connecting the dots from massive data sets (with TB/PB volume, streaming velocity and variety in sources) to predict the future.

The Prediction Power
• 10,000 hours (7-8 years) of rigorous practice is required to become a world-class expert in anything (Daniel Levitin, neuroscientist)
• This gives the ability to predict 2 seconds before others: the “Two Second Advantage” that beats the competitors
– Wayne Gretzky, the greatest ice hockey player of all time, was able to predict where the puck was going to be an instant before it arrived
– Sachin Tendulkar
– Warren Buffett
– Viswanathan Anand

Can Machines Think (to Predict)?
Alan Turing asked this question in 1950 and proposed a test to validate it.

Turing Test
Can a machine imitate the brain?
(Figure: the imitation game. First, which is the man and which is the woman? Then, which is the machine and which is the woman?)

Did any machine pass?
No.
Any machine nearer? (near AI)
Jeopardy! quiz contest: the challenge is to predict the question and bet with reasonable confidence.
“William Wilkinson’s ‘An Account of the Principalities of Wallachia and Moldavia’ inspired this author’s most famous novel.”

IBM
Watson Computer

Near AI Solutions
• Natural language processing
• Machine learning
• Predictive analytics
• Face recognition
• Language translation
• Speech recognition

Can a machine predict?
• Share price in the stock market the next day
• Top 5 products consumers want to buy next week
• Price of tomatoes (1 kg) next month
• Number of cars to be sold next quarter
• Potential criminals in the city or at a mega event
• When a machine or human will become sick
• Best matched course/school to study
• Best matched job/company to work

Google Story: Where it all began
• 50 billion indexed pages
• Thousands, millions or even billions of pages may match each search query
• How to rank them so that the most relevant (important) pages are displayed at the top?
• Predict what you want to see, not what you asked for.
Do You Know?
4 billion searches happen in a day
Each query uses 1,000 nodes
Results are returned in 0.2 seconds
20 billion pages crawled per day
20 exabytes of data collected in a day

Page Rank
• Give pages ranks (scores) based on links to them
– Links from many pages → high rank
– Link from a high-rank page → high rank
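The “rank” r_j for page j at iteration t+1:

$$ r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i} $$

where the sum runs over the pages i that link to page j, and d_i is the number of out-links of page i.
Source: Parallel Programming With Spark, Matei Zaharia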

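Matrix-Vector Multiplication
• Page rank equation in a practical form. For iteration t+1,

$$ r^{(t+1)} = A \cdot r^{(t)} $$

• A is the Google matrix, with A_{ji} = 1/d_i when page i links to page j (d_i = number of out-links of page i) and 0 otherwise
• The rank vector r is an eigenvector of A
• The iteration is repeated until the rank vector converges (or the maximum number of iterations is reached)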
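The iteration on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the production algorithm: the function name `pagerank`, the three-page web and the tolerance are assumptions made for the example, every page is assumed to have at least one out-link, and no damping (teleport) factor is applied.

```python
def pagerank(links, max_iter=200, tol=1e-12):
    """Power iteration for r(t+1) = A . r(t).

    links maps each page to the list of pages it links to.
    Assumption: every page has at least one out-link (no dead ends),
    and no damping/teleport factor is applied.
    """
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}   # start from a uniform vector
    for _ in range(max_iter):
        new_rank = {p: 0.0 for p in pages}
        for i, out_links in links.items():
            share = rank[i] / len(out_links)       # r_i / d_i
            for j in out_links:
                new_rank[j] += share               # sum over all i that link to j
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:                           # rank vector has converged
            break
    return rank


# Hypothetical three-page web: y links to y and a, a links to y and m, m links to a.
links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
ranks = pagerank(links)
for page, score in ranks.items():
    print(page, round(score, 3))
```

For this small graph the ranks settle at about 0.4 for y, 0.4 for a and 0.2 for m, and they always sum to 1.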

RAM is not Enough
• Not a problem for a small dimension (N x N)
• Consider N = 1 billion (pages that match a query)
• The dimension is now in billions
– A is a billion x billion matrix
– r is a rank vector with a billion entries
– r (old and new) has two billion entries (16 GB at 8 bytes per double value)
– A has billion x billion entries (8 exabytes if stored densely)
Sparse matrix methods do reduce the storage needed in an actual implementation.
RAM size of a highly configured server node: 128-512 GB
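The memory figures on this slide follow from simple arithmetic, sketched below. The sparse estimate is an added illustration under stated assumptions that are not from the slide: an average of 10 out-links per page, and 16 bytes per stored non-zero entry (two 4-byte indices plus one 8-byte value).

```python
N = 1_000_000_000            # pages, i.e. the dimension used on the slide
BYTES_PER_DOUBLE = 8
GB = 10**9
EB = 10**18

rank_vectors_bytes = 2 * N * BYTES_PER_DOUBLE    # r_old and r_new
dense_matrix_bytes = N * N * BYTES_PER_DOUBLE    # every entry of A stored

# Assumptions for the sparse case (hypothetical, for illustration only).
AVG_OUT_LINKS = 10
BYTES_PER_SPARSE_ENTRY = 4 + 4 + 8               # row index, column index, value
sparse_matrix_bytes = N * AVG_OUT_LINKS * BYTES_PER_SPARSE_ENTRY

print("rank vectors :", rank_vectors_bytes / GB, "GB")
print("dense A      :", dense_matrix_bytes / EB, "EB")
print("sparse A     :", sparse_matrix_bytes / GB, "GB")
```

Even under these sparse-storage assumptions the matrix alone is on the order of the largest RAM sizes quoted for a single node, before counting the rank vectors.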

Worker Node
Datacenter node
• 16 cores
• Main memory: 128-512 GB RAM, 40-60 GB/s
• SSD: 1-4 TB, 1-4 GB/s (x4 disks)
• Secondary storage: 10-30 TB disks, 0.2-1 GB/s (x10 disks), plus seek time
• Network: 1-10 Gbps
Source: AmpLab, UCB, Dell

Disk is slowest and not Enough
• 50 billion web pages x 20 KB = 1 PB
• 1 computer reads 30-35 MB/sec from disk
~11-13 months to read it all
• Can you wait that long for one search?
• Also, it requires 1,000 hard drives (at 1 TB each) to store it all
Source: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
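The numbers on this slide can be checked with a short calculation. This is a sketch only: the function names are made up for the example, decimal units are assumed (1 MB = 10^6 bytes), and the 1 TB drive size is an assumption chosen to match the 1,000-drive figure.

```python
PAGES = 50 * 10**9                 # 50 billion web pages
PAGE_BYTES = 20 * 10**3            # 20 KB per page
TOTAL_BYTES = PAGES * PAGE_BYTES   # 10**15 bytes = 1 PB


def read_days(total_bytes, mb_per_sec):
    """Days one machine needs to read total_bytes sequentially from disk."""
    seconds = total_bytes / (mb_per_sec * 10**6)
    return seconds / 86_400


def drives_needed(total_bytes, drive_bytes=10**12):
    """Number of drives of drive_bytes each (assumed 1 TB), rounded up."""
    return -(-total_bytes // drive_bytes)


for speed in (30, 35):
    print(speed, "MB/s ->", round(read_days(TOTAL_BYTES, speed)), "days")
print("drives:", drives_needed(TOTAL_BYTES))
```

With these inputs a single machine needs roughly 330 to 390 days just to read the data once.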

Parallelism using Cluster
• 8-64 nodes/rack, 4-16 racks in a cluster
• 1 Gbps bandwidth within rack, 8 Gbps out of rack
• Node specs :
8-16 cores, 128-512 GB RAM, 10×1 TB disks
(Figure: nodes in each rack connect to a top-of-rack (ToR) rack switch; rack switches connect to an aggregation switch.)

But Nodes Fail at Scale
• One server may stay up for about 3 years (1,000 days)
• If you have 1,000 servers, expect to lose 1 per day
• Google has 1 million servers
– Hence 1,000 machines will fail every day
Source: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets
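The failure estimate is a simple expected-value calculation, sketched below. The function names are illustrative, and the model assumes that failures are independent and spread evenly over a 1,000-day server lifetime.

```python
def expected_failures_per_day(servers, lifetime_days=1000):
    """Average number of servers failing per day."""
    return servers / lifetime_days


def prob_at_least_one_failure(servers, lifetime_days=1000):
    """Chance that at least one server fails on a given day (independent failures)."""
    p_fail = 1 / lifetime_days
    return 1 - (1 - p_fail) ** servers


print(expected_failures_per_day(1_000))        # 1.0
print(expected_failures_per_day(1_000_000))    # 1000.0
print(round(prob_at_least_one_failure(1_000), 2))
```

Under this model a 1,000-server cluster sees at least one failure on roughly 63% of days, so failure handling has to be part of normal operation.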

Traditional RDBMS Fail
• Not designed for a variety of data types (text, video, images)
• Not capable of handling big volumes (PB/EB/ZB)
• Not designed for parallelism
• Poor fault tolerance at scale (a million servers)
• Slow down due to joins, volume, ACID checks and high-velocity requests
• Designed for transaction processing, not for deep analytics (intensive computing)

Google Solution: DFS
• Distributed File System
• Divide a big data file into smaller chunks of 16-64 MB and store them on different nodes in different racks
• Chunks are replicated (2-3 times) for fault tolerance
(Figure: chunks of two files, C0-C5 and D0-D1, stored with replicas across chunk server 1, chunk server 2, chunk server 3, …, chunk server N.)
Source: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets
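The chunk-and-replicate idea can be illustrated with a toy Python sketch. It is not how GFS or HDFS actually place data: the function names, the round-robin placement policy, the rack layout and the 64-byte chunk size (standing in for 16-64 MB) are all assumptions made for the example.

```python
def split_into_chunks(data, chunk_size):
    """Cut a byte string into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def interleave_racks(nodes_by_rack):
    """Order nodes as rack1-node1, rack2-node1, rack1-node2, ... so neighbours differ in rack."""
    order = []
    longest = max(len(nodes) for nodes in nodes_by_rack.values())
    for k in range(longest):
        for rack, nodes in nodes_by_rack.items():
            if k < len(nodes):
                order.append((rack, nodes[k]))
    return order


def place_replicas(num_chunks, nodes_by_rack, replicas=3):
    """Hypothetical round-robin policy: each chunk goes to `replicas` distinct nodes."""
    order = interleave_racks(nodes_by_rack)
    if replicas > len(order):
        raise ValueError("not enough nodes for the requested replication")
    return {
        c: [order[(c + k) % len(order)] for k in range(replicas)]
        for c in range(num_chunks)
    }


data = bytes(200)                                   # a 200-byte stand-in for a big file
chunks = split_into_chunks(data, chunk_size=64)     # 64 bytes here, 16-64 MB in practice
racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
placement = place_replicas(len(chunks), racks, replicas=3)
for chunk_id, nodes in placement.items():
    print("chunk", chunk_id, "->", nodes)
```

Because every chunk lives on several nodes in more than one rack, losing a single node or a single rack switch still leaves a readable copy.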

& Map-Reduce
The Map-Reduce environment (master) takes care of:
• Handling machine failures (with replica nodes)
• Partitioning the input data
• Scheduling workers
• Performing the group by key step
• Managing inter-machine communication
Source: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
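What the programmer writes, as opposed to what the environment handles, can be sketched with the classic word-count example. This is a single-process simulation in plain Python, not Hadoop or Spark code; the function names and the sample documents are assumptions for the illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def group_by_key(pairs):
    """Shuffle step (done by the framework in a real cluster): collect values per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values that share a key."""
    return key, sum(values)


def map_reduce(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    groups = group_by_key(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


docs = ["big data big ideas", "data beats opinion"]   # hypothetical input partitions
counts = map_reduce(docs)
print(counts)
```

In a real cluster only `map_phase` and `reduce_phase` are user code; partitioning, scheduling, grouping by key and failure handling are done by the Map-Reduce environment.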

Big Data Platform
(Figure: layered stack, top to bottom)
• M-R App
• MapReduce Stack (Hadoop & Spark)
• Distributed File Systems (HDFS/GFS)

DFS is useful, only when
• Size is big (> 1 TB)
• Files are rarely updated
– Works for Google (to store indexed pages)
– Will not be effective for an airline reservation system (where data is updated frequently)

M-R is useful, only when
• Dimension in billions
– Matrix-vector multiplication in Google Pagerank
• Graph with millions of nodes and billions of edges
– FB Network Graph
• Deep analytical applications with intensive computing
– Useful for finding users with similar buying patterns, for product recommendations at Amazon
– But not useful for managing Amazon's online retail sales (frequent data updates, transactions)
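The matrix-vector multiplication case can be phrased as a map step and a reduce step. The sketch below runs in one Python process and assumes the vector fits in memory on every mapper; the function names and the small 3 x 3 example are made up for the illustration.

```python
from collections import defaultdict


def map_entry(entry, vector):
    """Map: a non-zero matrix entry (i, j, a_ij) emits the pair (i, a_ij * v_j)."""
    i, j, a_ij = entry
    return i, a_ij * vector[j]


def matvec_map_reduce(entries, vector, n_rows):
    """Group by row index i and reduce by summing, giving x_i = sum_j a_ij * v_j."""
    sums = defaultdict(float)
    for entry in entries:
        row, product = map_entry(entry, vector)
        sums[row] += product
    return [sums[i] for i in range(n_rows)]


# Sparse matrix stored as (row, column, value) triples; zero entries are not stored.
entries = [(0, 0, 0.5), (0, 1, 0.5), (1, 0, 0.5), (1, 2, 1.0), (2, 1, 0.5)]
v = [1 / 3, 1 / 3, 1 / 3]
result = matvec_map_reduce(entries, v, n_rows=3)
print(result)
```

Each matrix entry is processed independently, so the entries can be split across many mappers, which is what makes this step a good fit for Map-Reduce.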

Google Creates
• DFS (GFS)
• Map-Reduce
• Dremel (Big Query)
• Pregel

& Apache Follows
• GFS → HDFS
• Map-Reduce → Hadoop, Spark
• Dremel → Drill
• Pregel → Giraph

SCALA
• Runs on the Java Virtual Machine and can use Java libraries
• Yet simpler to write (more succinct) than Java
– Strong type inference (statically typed)
– Less code
• Functional programming (+ OOP)
– First-class functions
• Used to develop the Spark stack (which can run on Hadoop 2.0)
• Well suited for Map-Reduce applications
– Traits, collections, nested classes
– Immutable datasets
– Scalable

Mining
• Link Analysis
• Classification
• Content based recommendation
• Collaborative Filtering
• Finding similar items
• Clustering, Decomposition…
Machine Learning (Supervised/ Unsupervised)

Cloud
• Amazon AWS
• Google Cloud Platform
• IBM Bluemix
• OpenStack
• Databricks, Cloudera, Hortonworks, MapR,…
• SAP, Oracle…
Spark as a service, Hadoop as a service

Big Market
• $16.9 billion in 2015
• $50 billion by 2017
• 90 percent of the Fortune 500 have already initiated big data projects
• Big data spending: $8M per company
• 200 TB of stored data per company (with >1,000 employees)
Ref: McKinsey 2011

Big Players
• Leaders
– IBM, HP, Dell, SAP, Teradata, Oracle, SAS, Accenture (>$400 million)
• Pure players (100% revenue from big data)
– Palantir, Pivotal, Splunk, Mu Sigma, Actian, Opera Solutions (>$100 million)
• Indian players
– TCS, CapGemini (>$10 million)
Source: Wikibon, 2013

Big Jobs
Source: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

Smart phones
• 1.2 billion sold in 2014
– 23.1% increase over 2013
• Account for 27% of global handsets
– but consume 95% of global traffic (1.5 EB/month)
• Daily SMS count exceeds the world population
Ref: ZDNet, Fool.com

Nielsen’s Law
Bandwidth doubles every twenty-one months
5G in 2020 and 6G in 2030.

Moore's Law
(Figure: from the Zilog PC, 1980, to the iPhone, 2007.)

Kryder's Law
In 2020, a 2.5-inch disk drive would store ~40 TB and cost about $40.
Storage capacity (doubles every 12 months) grows faster than Moore's law (processing capacity doubles every 18-24 months).

All Together
Law | What grows | Annual growth rate
Nielsen's Law | Internet bandwidth | 50%
Moore's Law | Computing power | 60%
Kryder's Law | Storage capacity | 100%
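The annual growth rates follow from the doubling periods quoted for each law: a quantity that doubles every m months grows by a factor of 2^(12/m) per year. A minimal sketch, where the function name is illustrative and the 18-month figure is taken as the lower end of the 18-24 month range for Moore's law:

```python
def annual_growth(doubling_months):
    """Yearly growth rate implied by a doubling period given in months."""
    return 2 ** (12 / doubling_months) - 1


for law, months in [("Nielsen", 21), ("Moore", 18), ("Kryder", 12)]:
    print(f"{law}: {annual_growth(months):.0%}")
```

This prints 49%, 59% and 100%, which round to the 50%, 60% and 100% shown in the table.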

Social Networks
(Figure: number of active users in millions, as of August 2015. Source: http://www.statista.com/)

Facebook
60 million posts per day
2.6 billion likes per day
375 million photos uploaded per day
15 TB of data uploaded per day
600 TB of data handled per day
700 TB graph search DB
300 PB of user data
Ref: Chassis-plans.com, Wikibon
http://allfacebook.com/orcfile b130817

Twitter
500 million tweets per day
1.6 billion search queries per day
316 million monthly active users
80% of active users on mobile
Ref: Chassis-plans.com, Wikibon

YouTube
• 100 hours of new video every minute
• 53% of mobile traffic is video
• Average human vision input: ½ million hours per lifetime
• YouTube new uploads: 15 million hours per year

House of Cards
Big data analytics picked up on the success of the British version of House of Cards, and the popularity of David Fincher (director) and Kevin Spacey (actor) movies.
Netflix then made a major decision to commit $100 million for two 13-episode seasons of its remake (US version) with the above team, streamed online.
Netflix earned $1 billion in that quarter.
First Emmy-winning streaming show.
The Atlantic: May 2012
https://gigaom.com/2013/04/22/netflix-q1-2014-earnings/

Lumiata
Creates personalized treatment recommendations based on patients' health data, using 170 million data points
Raised US$10 million from VCs
Ash Damle, Founder & CEO

MedAware
Avoids prescription errors due to
• Drug mix-up
• Patient mix-up
• Unawareness of clinical data
• Dosage mix-up
Example: Chlorambucil (chemotherapy) prescribed to a patient without cancer, instead of Chloramphenicol (antibiotic)
Uses a mathematical model, derived from millions of EMRs, that represents real-world treatment patterns
Raised US$1 million in funding

Windward
• Only platform to analyze maritime data from ships and oceans to maintain ship history, predict threats and help make huge financial decisions on shipping and commodity flows
• Before 2010, it was impossible to know a vessel's location once it sailed more than 30 miles offshore; then commercial satellites were introduced, but the big data collected from ships gave a corrupted picture
Raised $15.8 million.

mnubo
• Analytics of IoT data
• Analytics of connected-car data for driving habits, vehicle failure patterns, inventory management, usage-based insurance, etc.
(36 million connected cars will be on the road in 2020)
Raised $6 million

rocana
How many of your servers are talking to blacklisted IPs?
How long has your business been hacked?
Rocana helps IT identify the root cause of performance or security issues at any scale and complexity and resolve underlying issues in real time.
Instead of employing “brute force” searches against millions of log entries, advanced analytics identifies anomalies for investigation.
Raised $19.4 million

Whetlab
• A team of only 5 data scientists
• Acquired by Twitter in an undisclosed deal, to improve its ability to show users the kinds of tweets and content they actually want to see.

Applied Predictive Technologies
Cloud-based cause-and-effect analytics platform to accurately measure the profit impact of pricing, marketing, merchandising, operations and capital initiatives, tailoring investments in these areas to maximize ROI.
Acquired by MasterCard for $600 million.

Netflix Challenge
• Data: how users have rated movies
– 100.5 million ratings by 500 thousand (5 lakh) users on 18K movies
• Goal: predict how a user would rate an unrated movie
– A recommender system problem
– 10% improvement: 1 million dollar prize
Source: Hsuan-Tien Lin

KDD Cup Challenge
• Data: how users rated songs
– 252.8 million ratings by 1 million users on 650K songs (Yahoo!)
• Goal: recommend new songs that a user would like
Source: Hsuan-Tien Lin

BDI for National Security
• TIA after 9/11 (11 September 2001)
• NATGRID after the Mumbai attack (26/11)
– We could have stopped both if we had connected the pieces of intel from all security agencies with the information tracked from suspects.

More Applications
• Building a Stock Investment Strategy Model
• Predicting Customer Transaction Behavior
• Failure Prediction
• Opinion Mining to Determine User Sentiments
• Financial Loss Prediction
• Insurance Claim Prediction Model
• Bond Trade Price Prediction
• Prediction of Number of Days in the Hospital
• Accelerating Discovery of Drugs for Mutants of H1N1
• Molecular Activity Prediction
• Job Recommendation Engine
Source: https://insofeprojects.wordpress.com/insofe-projects/

A first course on BDI
Day | Topics
Day 1 FN | BDI: The Beginning; DFS and Map-Reduce; Distributed Graph (Pregel); Page Rank algorithm
Day 1 AN | BDI Tools Landscape; Dremel and Big Query; Naïve Bayes Classifier
Day 2 FN | TF-IDF, Jaccard and Cosine; Collaborative filtering; Shingling, Minhashing; Locality Sensitive Hashing
Day 2 AN | Scala Basics for MR apps; Practice session; More fun with Scala
Day 3 FN | Spark projects using Scala
Day 3 AN | Student project ideas; Q&A

M.S. Options in USA
University | Program
Stanford University | M.S. in CS, specialization in Information Management and Analytics; four-course graduate certificate in Mining Massive Datasets
Northwestern University | Master of Science in Analytics
DePaul University | Master of Science in Predictive Analytics
North Carolina State University | Master of Science in Analytics
University of Ottawa, Canada | M.Sc. in Analytics
University of Connecticut | MS in Business Analytics and Project Management
Source: informationweek.com, IBM Director Dr. Spohrer's short list

PG options in India
Institute | Program
Indian School of Business | Certified Program in Business Analytics (CBA)
Great Lakes Institute of Management | PGP in Business Analytics
IIM Bangalore | Analytics Essentials, BAI
IIM Ahmedabad | Advanced Analytics for Management
Source: AnalyticsVidya.com, analyticsindiamag.com

Road Ahead
“The ultimate search engine would understand exactly what you mean and give back exactly what you want.”
- Larry Page

