Big Data : Challenges & Potential
Solutions
Ashwin Satyanarayana
CST Colloquium
April 16th, 2015
Outline
• Big Data Everywhere
• Motivation (How Big Data solves Speller Issue)
• The New Bottleneck in Big Data
• Search Engine (Big Data)
– Query Document Mismatch
– Query Reformulation
• Potential Solutions
– How much of the data?
• Intelligent Sampling of Big Data
– How to clean up the data?
• Filtering
• Conclusions
Big Data Everywhere!
• Umbrella term – digital data to health data
• Big data gets its name from the massive amounts of
zeros and ones collected in a single year, month, day
— even an hour.
• Lots of data is being collected
and warehoused
– Web data, e-commerce
– purchases at department/
grocery stores
– Bank/Credit Card
transactions
– Social Network
How much data?
640K ought to be
enough for anybody.
Opportunities?
Big Data and the U.S. Gov't
• "It's about discovering meaning
and information from millions
and billions of data points… We
therefore can't overstate the
need for software engineers …
working together to find new
and faster ways to identify and
separate relevant data from non-
relevant data"
– Janet Napolitano, March 2011
Where are we headed?
• Smart Data
– employees can promote content targeting potential customers.
• Identity Data
– your identity data tells the story of who you are in the digital age, including
what you like, what you buy, your lifestyle choices and at what time or
intervals all of this occurs.
• People Data
– who your audience likes and follows on social media, what links they click,
how long they stay on the site that they clicked over to and how many
converted versus bounced.
– based on people data combined with on-site analytics, a site can customize
experiences for users based on how those customers want to use the site.
Big Data – How does it solve
problems?
Motivating Example: Speller Problem Solved!
[Figures: the Google speller and the Microsoft Word speller on the same misspelled query]
More Data beats Better Algorithms
• Before (Big Data): Algorithms ≫ Data
• Now (Big Data): Data ≫ Algorithms
User Click Data is the key!
• A user types a query A and does not click any result
• The user alters the query to B, and clicks on a result
• Google/Bing captures that transition: A -> B
• If many users make the same alteration from A to B, then
Google learns that B is the better alteration for A.
• “Farenheight” -> “Fahrenheit” -> Click
– Farenheight -> Fahrenheit -> 1045 times
– Farenheight -> Fare height -> 3 times
• Thus, Google/Bing offers as its spelling suggestion the alteration
that most often results in a click, as in the sketch below.
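A minimal sketch of this aggregation, assuming a hypothetical log of (original query, altered query, clicked) tuples; the log format and function name are illustrative, not Google's or Bing's actual pipeline:

```python
from collections import Counter

def suggest_spelling(query_log, original):
    """Return the alteration of `original` that most often led to a click.

    query_log: iterable of (original_query, altered_query, clicked) tuples,
    a hypothetical stand-in for real search-engine click logs.
    """
    clicks = Counter(
        altered
        for orig, altered, clicked in query_log
        if orig == original and clicked
    )
    if not clicks:
        return None  # no evidence: make no suggestion
    best, _count = clicks.most_common(1)[0]
    return best

# Toy log reproducing the slide's counts.
log = ([("farenheight", "fahrenheit", True)] * 1045
       + [("farenheight", "fare height", True)] * 3)
print(suggest_spelling(log, "farenheight"))  # -> fahrenheit
```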
Other Search Engine Problems
Solved using Big Data
New Bottleneck
• Hardware was the primary bottleneck: thin
network pipes, small hard disks and slow processors.
– As hardware improved, software emerged that distributed
storage and processing across large numbers of independent servers.
• Hadoop is the software platform behind this “collect everything” approach.
• Instead of spending vast amounts of time sifting
through an avalanche of data, data scientists need
tools to determine what subset of data to analyze and
how to clean and make sense of it.
Motivation: Training a Neural Network
for Face Recognition
[Figure: a neural network trained on face images labeled M (male) or F (female) to output Male/Female]
Motivation: Face Recognition
• Challenge 1: How many faces from a
large database are needed to train a
system so that it can distinguish
between male and female faces?
Is this enough?
Motivation: Face Recognition
• Challenge 2: Predictive accuracy
depends on the quality of data in the
training set. Hence improving the quality
of training data (i.e. handling noise)
improves the accuracy over unseen
instances.
[Figure: training instances with class labels (Male/Female), illustrating three cases: no noise, feature noise, and class noise]
The Curse of Large Datasets
There are two main challenges in dealing with large datasets:
• Running Time: reducing the amount of time it takes for the mining
algorithm to train.
• Predictive Accuracy: poor-quality instances in the training data lead
to lower predictive accuracy. Large datasets are usually
heterogeneous and subject to more noisy instances [Liu & Motoda,
DMKD 2002].
The goal of our work is to address these questions and provide some
solutions.
Two Potential Solutions
• There are two potential solutions to these problems.
• Running time:
– Scale up existing data mining algorithms (e.g.,
parallelize them)
– Scale down the data → Intelligent Sampling
• Predictive accuracy:
– Better mining algorithms
– Improve the quality of the training data → Bootstrapping
→ We focus first on scaling down the data, while improving its quality
Learning Curve Phenomenon
• Is it necessary to apply the learner
to all of the available data?
• A learning curve depicts the
relationship between sample
size and accuracy [Provost,
Jensen & Oates 1999].
• Problem: determining n_min efficiently.
Given a data mining algorithm M and a dataset D of N instances, we would like the
smallest sample D_i of size n_min such that:
Pr(acc(D) − acc(D_i) > ε) ≤ δ
where ε is the maximum acceptable decrease in accuracy (approximation) and
δ is the probability of failure.
Prior Work: Static Sampling
• Arithmetic Sampling [John & Langley,
KDD 1996] uses a schedule
S_a = ⟨n0, n0 + nδ, n0 + 2·nδ, …, n0 + k·nδ⟩
Example: an arithmetic schedule is
⟨100, 200, 300, …⟩
Drawback: if n_min is a large multiple of
nδ, the approach requires many
runs of the underlying algorithm, and
the total number of instances used may
exceed N.
Prior Work in Progressive Sampling
• Geometric Sampling [Provost, Jensen &
Oates, KDD 1999] uses a schedule
S_g = ⟨n0, a·n0, a²·n0, …, aᵏ·n0⟩
An example geometric schedule is:
⟨100, 200, 400, 800, …⟩
Drawbacks:
(a) In practice, geometric sampling
suffers from overshooting n_min.
(b) The schedule does not depend on the
dataset at hand; a good sampling algorithm
is expected to have low bias and
low sampling variance.
For example, in the KDD CUP dataset,
where n_min = 56,600,
the geometric schedule is:
{100, 200, 400, 800, 1600, 3200, 6400,
12800, 25600, 51200, 102400}.
Notice that the last sample has
overshot n_min by 45,800 instances. Both schedules are sketched below.
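A small sketch of both schedule generators under the parameters named on the slides; the cap n_max is our own addition:

```python
def arithmetic_schedule(n0, n_delta, n_max):
    """Arithmetic schedule S_a = <n0, n0 + n_delta, n0 + 2*n_delta, ...>."""
    n = n0
    while n <= n_max:
        yield n
        n += n_delta

def geometric_schedule(n0, a, n_max):
    """Geometric schedule S_g = <n0, a*n0, a^2*n0, ...>."""
    n = n0
    while n <= n_max:
        yield n
        n *= a

# The KDD CUP example from the slide, where n_min = 56,600.
sched = list(geometric_schedule(100, 2, 110_000))
print(sched)                          # [100, 200, 400, ..., 51200, 102400]
last = next(n for n in sched if n >= 56_600)
print(last - 56_600)                  # overshoot: 45800 instances
```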
Intelligent Dynamic Adaptive Sampling
(a) The sampling schedule is adaptive
to the dataset under consideration.
(b) An adaptive stopping criterion
determines when the learning curve
has reached a point of diminishing
returns.
Definition: Chebyshev Inequality
• Definition 1 (Chebyshev Inequality) [Bertrand and Chebyshev, 1845]: in any
probability distribution, the probability that the estimate p′ is more than
ε away from the true value p after m independently drawn points is
bounded by:
Pr[ |p − p′| ≥ ε ] ≤ σ²(p′) / (ε² · m)
What are we trying to solve?
Pr(acc(D) − acc(D_nmin) > ε) ≤ δ
Challenge: how do we compute acc(D), the accuracy on the entire dataset?
Myopic Strategy: One Step at a time
û(D_i) = (1/|D_i|) · Σ_{x ∈ D_i} f(x)   [average over the sample D_i]
The instance function f(x) used here is a
Bernoulli trial: the 0–1 classification accuracy.
The myopic strategy proceeds one sample at a time, comparing û(D_i)
with û(D_{i+1}), as in the sketch below.
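A minimal sketch of this estimator, assuming a fitted scikit-learn-style classifier; `model`, `X_i` and `y_i` are placeholders:

```python
import numpy as np

def u_hat(model, X_i, y_i):
    """Estimate accuracy on sample D_i as the mean of Bernoulli trials
    f(x): 1 if the model classifies x correctly, else 0."""
    f = (model.predict(X_i) == np.asarray(y_i)).astype(float)
    return f.mean()  # (1/|D_i|) * sum_{x in D_i} f(x)
```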
• True value: p = u(D_a) − u(D_b), where |D_a| > |D_b| > n_min
(in the plateau region of the learning curve, p = 0)
• Estimated value: p′ = u(D_a) − u(D_b), where |D_a| > |D_b| ≥ 1
• Myopic strategy: p′ = u(D_i) − u(D_{i−1})
Substituting into the Chebyshev inequality with true value 0:
Pr[ |0 − (u(D_i) − u(D_{i−1}))| ≥ ε ] ≤ σ²BOOT(u(D_i) − u(D_{i−1})) / (ε² · m) ≤ δ
which, solved for the number of samples, gives
m ≥ σ²BOOT(u(D_i) − u(D_{i−1})) / (ε² · δ)
Here ε is the approximation parameter, δ is the confidence parameter, and
σ²BOOT is the bootstrapped variance of p′ (bootstrapping is also used to
improve the quality of the training data). A sketch of the bootstrapped
variance term follows.
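A sketch of the bootstrapped variance of the myopic difference, assuming per-instance 0/1 correctness vectors for the two latest samples; the resample count n_boot and the helper name are our own:

```python
import numpy as np

rng = np.random.default_rng(0)

def boot_variance(correct_i, correct_prev, n_boot=200):
    """Bootstrapped variance of p' = u(D_i) - u(D_{i-1}).

    correct_i, correct_prev: 0/1 numpy arrays of per-instance correctness
    on samples D_i and D_{i-1} (the Bernoulli trials f(x)).
    """
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        # Resample each correctness vector with replacement (Efron bootstrap).
        ui = rng.choice(correct_i, size=correct_i.size, replace=True).mean()
        up = rng.choice(correct_prev, size=correct_prev.size, replace=True).mean()
        diffs[b] = ui - up
    return diffs.var(ddof=1)
```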
Four Possible Cases for Convergence
(a) |p − p′| ≤ ε and σ²(p′)/(ε²·m) ≤ δ : Converged
(b) |p − p′| > ε and σ²(p′)/(ε²·m) > δ : Add more instances
(c) |p − p′| > ε and σ²(p′)/(ε²·m) ≤ δ : False Positive
(d) |p − p′| ≤ ε and σ²(p′)/(ε²·m) > δ : False Negative
The stopping test below checks the bound that separates these cases.
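Under the same assumptions, a sketch of the resulting stopping test; it reuses the boot_variance helper from the previous sketch, and treating m as the size of the current sample is our own assumption:

```python
def has_converged(correct_i, correct_prev, eps=0.001, delta=0.05, n_boot=200):
    """Chebyshev-based stopping rule: declare convergence when the bound
    sigma^2_BOOT / (eps^2 * m) on Pr[|u(D_i) - u(D_{i-1})| >= eps]
    drops below delta (the observable half of case (a))."""
    m = correct_i.size                        # assumed: points in current sample
    var_boot = boot_variance(correct_i, correct_prev, n_boot)
    return var_boot / (eps**2 * m) <= delta   # otherwise: add more instances
```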
Empirical Results
• (Full): S_N = ⟨N⟩, a single sample with all the instances. This is the most commonly
used method; it suffers from both speed and memory drawbacks.
• (Geo): Geometric Sampling, in which the training set size is increased
geometrically, S_g = ⟨|D1|, a·|D1|, a²·|D1|, a³·|D1|, …, aᵏ·|D1|⟩, until convergence is
reached. In our experiments we use |D1| = 100 and a = 2, as in the Provost et al.
study.
• (IDASA): our adaptive sampling approach using Chebyshev bounds. We use ε =
0.001 and δ = 0.05 (95% probability) in all our experiments.
• (Oracle): S_o = ⟨n_min⟩, the optimal schedule determined by an omniscient oracle; we
determined n_min empirically by analyzing the full learning curve beforehand. We use
these results to compare the optimum performance with the other methods.
• We compare these methods using two performance criteria:
• (a) The mean computation time, averaged over 20 runs for each dataset.
• (b) The total number of instances needed to converge: if the sampling schedule is
S = {|D1|, |D2|, |D3|, …, |Dk|}, then the total number of instances is
|D1| + |D2| + |D3| + … + |Dk|.
Empirical Results (for Non-incremental Learner)
Empirical Results (for Incremental Learners)
Table 1: Comparison of the total number of instances required for convergence by the different methods to obtain the same accuracy.
Table 2: Comparison of the mean computation time required for convergence by the different methods to obtain the same accuracy.
Time is in CPU seconds (measured with the Linux time command).
Conclusions
1) Big data is evolving, and Smart data, Identity data and People data are here to stay.
Think of them as the human discovery of fire, the wheel and wheat.
2) Search engines use Big Data for their relevance and ranking problems.
3) Main bottleneck: data scientists need tools to determine what subset of data to
analyze and how to clean and make sense of it.
4) Intelligent Sampling addresses what subset to use.
5) Bootstrap filtering helps clean up the data.
6) In conclusion, Big Data is big and it is here to stay. And it’s only getting bigger!
Questions & Suggestions
Filtering
Bayesian Analysis of Ensemble Filtering
[Figure: the training data D is fed to three base learners, L1 (k-NN classifier), L2 (decision tree) and L3 (linear machine), yielding models θ1, θ2, θ3. For a training instance (x_j, y_j), each model casts a vote δ(y_j, θ_i(x_j)), where δ = 1 if θ_i(x_j) matches the true label y_j, else 0.]
Bayesian Analysis of Filtering
Techniques
• Bayesian Analysis of Ensemble Filtering:
– Ensemble Filtering [Brodley & Friedl, 1999, JAIR]: an ensemble detects
mislabeled instances by constructing a set of classifiers (m base-level detectors).
– A majority vote filter tags an instance as mislabeled if more than half of the m
classifiers classify it incorrectly.
– Equivalently, if fewer than half of the classifiers agree with the label, the instance
is treated as noisy and removed:
Pr(y = t | x, Θ*) = (1/|Θ*|) · Σ_{θ ∈ Θ*} δ(t, θ(x))
– Example: say δ(y_j, θ1(x_j)) = 1 (non-noisy),
δ(y_j, θ2(x_j)) = 0 (noisy),
δ(y_j, θ3(x_j)) = 0 (noisy).
Then Pr(y | x, Θ*) = 1/3 ≈ 0.33 < ½, so the instance x_j is treated as noisy and removed.
A sketch of this majority-vote filter follows.
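A minimal sketch of the majority-vote filter, with scikit-learn classifiers standing in for the three base learners shown in the figure (LogisticRegression substitutes for the linear machine); real ensemble filtering uses cross-validated predictions, so this illustrates only the voting rule:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression  # stand-in for "linear machine"

def majority_vote_filter(X, y):
    """Flag instance j as noisy if fewer than half the detectors label it
    correctly; return the retained (X, y). X, y are numpy arrays."""
    detectors = [KNeighborsClassifier(), DecisionTreeClassifier(),
                 LogisticRegression(max_iter=1000)]
    votes = np.zeros(len(y))
    for clf in detectors:
        clf.fit(X, y)                      # illustration only; use CV in practice
        votes += (clf.predict(X) == y)     # delta(y_j, theta_i(x_j))
    frac = votes / len(detectors)          # Pr(y | x, Theta*)
    keep = frac >= 0.5                     # below 1/2 -> treated as noisy
    return X[keep], y[keep]
```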
What Computation is Being Performed?
Pr(y = t | x, Θ*) = (1/k) · Σ_{θ_i : t = θ_i(x)} 1 = (1/k) · Σ_i Pr(y = t | x, θ_i)
The computation can be interpreted as a crude form of model averaging which
ignores predictive uncertainty and model uncertainty:
Pr(y = t | x, Θ*) = Σ_i Pr(y = t | x, θ_i) · Pr(θ_i | D)
where Pr(θ_i | D) = 1/k assigns equal weight to all the models, and
Pr(y = t | x, θ_i) is the “probability” that the model predicts the correct output.
Bootstrapped Model Averaging
– Now that we know what computation is occurring in
the filtering techniques, we propose a better
approximation to model averaging.
– There are many model-averaging approaches; we
choose this one because it is efficient for large datasets.
– This particular case uses Efron’s non-parametric
bootstrapping approach, as shown below:
Pr(y | x, D) = Σ_{D′, θ = L(D′)} Pr(y | x, θ) · Pr(D′ | D)
where the datasets D′ are the different bootstrap
samples generated from the original dataset D.
Introduction to Bootstrapping
• Basic idea:
– Draw datasets by sampling with replacement from the original dataset
– Each perturbed dataset has the same size as the original training set
[Figure: bootstrap samples D1′, D2′, …, Db′ drawn from the original dataset D]
Bootstrap Averaging
[Figure: each bootstrap sample D1′, D2′, …, Db′ is fed to the learner L (decision tree), producing models θ1, θ2, …, θb, whose predictions are averaged]
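A compact sketch combining the two steps above, bootstrap resampling and model averaging, for noise removal; decision trees stand in for the learner L, and b = 25 is an arbitrary choice:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def bma_noise_filter(X, y, b=25):
    """Average b bootstrap-trained trees; flag an instance as noisy when
    less than half the averaged vote mass supports its label."""
    n = len(y)
    correct = np.zeros(n)
    for _ in range(b):
        idx = rng.integers(0, n, size=n)      # draw D' with replacement
        tree = DecisionTreeClassifier().fit(X[idx], y[idx])
        correct += (tree.predict(X) == y)     # Pr(y | x, theta) as a 0/1 vote
    avg = correct / b                         # approximates Pr(y | x, D)
    keep = avg >= 0.5
    return X[keep], y[keep]
```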
Bootstrapped Model Averaging (BMA) vs Filtering
• We now compare bootstrapped model averaging with partition filtering
(10 folds), using an 80% training set and a fixed 20% test set. We eliminate noisy
instances as in partition filtering, and obtain the following results on the LED24 dataset:
– 2500 instances with 10% noise -> 250 noisy instances
– Partition filtering removed 194 instances (162 were actually noisy, 32 were good ones)
– BMA filtering removed 189 instances (178 were noisy, 11 were non-noisy)
– Overlap of noisy instances between the two techniques: 151 instances
• Thus we can conclude that the bootstrap-averaged model produces a better
approximation to model averaging than the model averaging implicitly performed by
filtering.
Why remove noise?
• 1) Removing noisy instances increases
the predictive accuracy. This has been
shown by Quinlan [Qui86] and Brodley
and Friedl ([BF96][BF99])
• 2) Removing noisy instances creates a
simpler model. We show this empirically
in the Table.
• 3) Removing noisy instances reduces the
variance of the predictive accuracy: This
makes a more stable learner whose
accuracy estimate is more likely to be the
true value.
Motivation for Correction: Spell Checker
[Figure: a spell checker can either filter (flag) a misspelled word or correct it]
Empirical Comparison of Noise Handling Techniques
• We compare the following approaches to noise
handling:
– Robust Algorithms: to avoid overfitting, post-pruning
and stopping conditions are used that prevent further
splitting of a leaf node.
– Partition Filtering: discards instances that the C4.5
decision tree misclassifies, and builds a new tree using
the remaining data.
– LC Approach: corrects the labels of instances that the C4.5
decision tree misclassifies, and builds a new tree using the
corrected data. A sketch contrasting filtering and correction follows.
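To make the contrast concrete, a hedged sketch with a scikit-learn decision tree standing in for C4.5; this illustrates the two strategies, not the exact partition-filtering or LC procedures:

```python
from sklearn.tree import DecisionTreeClassifier

def filter_then_retrain(X, y):
    """Partition-filtering flavor: drop misclassified instances, retrain."""
    mask = DecisionTreeClassifier().fit(X, y).predict(X) == y
    return DecisionTreeClassifier().fit(X[mask], y[mask])

def correct_then_retrain(X, y):
    """LC flavor: relabel misclassified instances with the tree's own
    prediction (correct ones keep their label), then retrain."""
    y_corrected = DecisionTreeClassifier().fit(X, y).predict(X)
    return DecisionTreeClassifier().fit(X, y_corrected)
```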
Empirical Results
Conclusion
Our research yielded the following results:
• (What size?) Sampling:
– Dynamic Adaptive Sampling using Chernoff
Inequality
– General purpose Adaptive Sampling using Chebyshev
Inequality
• (Cleaning) Filtering:
– Bayesian Analysis of other filtering techniques
– Bootstrap Model Averaging for Noise Removal
– LC Approach
Questions & Suggestions