Open Problems in Big Data Analytics: A Practitioner’s View
Dr. Vijay Srinivas Agneeswaran, Director and Head, Big-data R&D, Innovation Labs, Impetus
Invited Talk, National Conference on Distributed Machine Learning, Feb 2015
Contents
State-of-art in Big Data Analytics
Big Data Computations: Characterization
Big Data pipelines: open problems
State of Art in Big Data Analytics
• Start from business questions
• How quickly and accurately can we get answers?
• Data gets stored in HDFS
• Various frameworks to process data
  • Spark – machine learning
  • Giraph/GraphLab – graph processing
  • Storm – real-time processing
State of Art in Big Data Analytics
• HDFS the right storage?
• Alternatives
  • Cassandra, MapR – M7, QFS, Cleversafe, Isilon, etc.
http://www.inktank.com/news-events/new/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
State of Art in Big Data Analytics
• Spark the right platform for processing?
• Alternatives
  • Flink
  • Forge – meta domain-specific language
State of Art in Big Data Analytics
• Spark Streaming/Storm the right platform for stream processing?
Big Data Computations
Computations/Operations
• Giant 1 (simple stats) is perfect for Hadoop 1.0.
• Giants 2 (linear algebra), 3 (N-body), 4 (optimization): Spark from UC Berkeley is efficient?
• Logistic regression, kernel SVMs, conjugate gradient descent, collaborative filtering, Gibbs sampling, alternating least squares.
• Example is social group-first approach for consumer churn analysis [2]
• Interactive/On-the-fly data processing – Storm.
• OLAP – data cube operations. Dremel/Drill
• Data sets – not embarrassingly parallel?
• Deep Learning: Artificial Neural Networks/Deep Belief Networks
• Machine vision from Google [3]
• Speech analysis from Microsoft
• Giant 5 – Graph processing – GraphLab, Pregel, Giraph
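To make the "Giant 1 (simple stats)" point concrete, here is a minimal sketch of why such statistics suit a single map/reduce pass: each partition is summarized independently, and the partial summaries merge with an associative operation. The function names and the sample data are illustrative assumptions, not part of the talk, and plain Python stands in for Hadoop.

```python
from functools import reduce

def map_partition(values):
    """Map step: summarize one partition as (count, sum, sum of squares)."""
    n = len(values)
    s = sum(values)
    ss = sum(v * v for v in values)
    return (n, s, ss)

def combine(a, b):
    """Reduce step: merge two partial summaries (associative and commutative)."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def mean_and_variance(partitions):
    """One pass over every partition, then one merge of the partial results."""
    partials = (map_partition(p) for p in partitions)
    n, s, ss = reduce(combine, partials, (0, 0.0, 0.0))
    mean = s / n
    variance = ss / n - mean * mean  # population variance
    return mean, variance

# Hypothetical data, split into three blocks as a distributed file system might.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
mean, variance = mean_and_variance(partitions)
```

Because no partition needs to see another partition's data and the whole computation finishes in one pass, this class of problem is embarrassingly parallel.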
[1] National Research Council. Frontiers in Massive Data Analysis. Washington, DC: The National Academies Press, 2013.
[2] Yossi Richter, Elad Yom-Tov, Noam Slonim: Predicting Customer Churn in Mobile Networks through Analysis of Social Groups. In: Proceedings of SIAM International Conference on Data Mining, 2010, pp. 732-741.
[3] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew W. Senior, Paul A. Tucker, Ke Yang, Andrew Y. Ng: Large Scale Distributed Deep Networks. NIPS 2012:
Big Data Pipelines
1. Nuance – incompleteness
2. Scale
3. Timeliness
4. Privacy
5. Human Loop
Big Data Pipelines: Data Acquisition
• Needle in a Haystack.
• BlinkDB?
• Automatic metadata discovery
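The BlinkDB bullet points at approximate query processing: answering from a sample, with an error bound, instead of scanning everything. The sketch below illustrates only that general idea with a uniform random sample and a normal-approximation error bar. It is not BlinkDB's API or its sampling strategy, and the data and names are hypothetical.

```python
import math
import random

def approximate_mean(population, sample_size, seed=0):
    """Estimate the mean from a uniform sample, with a rough 95% error bar."""
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)
    mean = sum(sample) / sample_size
    var = sum((x - mean) ** 2 for x in sample) / (sample_size - 1)
    half_width = 1.96 * math.sqrt(var / sample_size)  # normal approximation
    return mean, half_width

# Hypothetical table column: 100,000 synthetic values.
data_rng = random.Random(42)
population = [data_rng.gauss(50.0, 10.0) for _ in range(100_000)]

estimate, half_width = approximate_mean(population, sample_size=1_000)
exact = sum(population) / len(population)  # full scan, for comparison only
```

The estimate touches 1% of the rows here. The trade-off between sample size and the width of the error bar is the "how quickly and accurately" question in miniature.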
Big Data Pipelines: Information Extraction
• Error models for data cleaning
• Multimedia data
Big Data Pipelines: Analytics
• Multi-dimensional data
The network to identify the individual digits from the input image
http://neuralnetworksanddeeplearning.com/chap1.html
Copyright @Impetus Technologies, 2014
DLNs for Face Recognition
Copyright @Impetus Technologies, 2015
DLN for Face Recognition
http://www.slideshare.net/hammawan/deep-neural-networks
Success stories of DLNs
• Android voice recognition system – based on DLNs
• Improves accuracy by 25% compared to state-of-art
• Microsoft Skype Translate software and Digital assistant Cortana
• 1.2 million images, 1000 classes (ImageNet Data) – error rate of 15.3%, better than state of art at 26.1%
Success stories of DLNs (continued)
• Senna system – PoS tagging, chunking, NER, semantic role labeling, syntactic parsing
• Comparable F1 score with state-of-art with huge speed advantage (5 days VS few hours).
• DLNs VS TF-IDF: 1 million documents, relevance search. 3.2ms VS 1.2s.
• Robot navigation
Conclusions
• Hadoop = HDFS + Map-Reduce
  • Useful for large scale embarrassingly parallel processing of data sets
  • Not so good for iterative, interactive computing.
• Beyond Hadoop Map-Reduce philosophy
  • Optimization and other problems.
  • Real-time computation
  • Processing specialized data structures
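The "not so good for iterative computing" point can be sketched with gradient descent for logistic regression, one of the optimization workloads named earlier. Every iteration is a full map-and-reduce pass over the data, and in classic MapReduce each such pass would typically be a separate job that rereads its input from storage. The code is plain Python with hypothetical data and parameter values. It is an illustration of the access pattern, not an implementation from the talk.

```python
import math

def partial_gradient(partition, w, b):
    """Map step: gradient of the logistic loss over one partition."""
    gw, gb = 0.0, 0.0
    for x, y in partition:
        p = 1.0 / (1.0 + math.exp(-(w * x + b)))
        gw += (p - y) * x
        gb += (p - y)
    return gw, gb

def train(partitions, iterations=200, lr=0.5):
    """Gradient descent: one full pass over all partitions per iteration."""
    n = sum(len(p) for p in partitions)
    w, b = 0.0, 0.0
    for _ in range(iterations):
        # Map over partitions, then reduce the partial gradients.
        grads = [partial_gradient(p, w, b) for p in partitions]
        gw = sum(g[0] for g in grads) / n
        gb = sum(g[1] for g in grads) / n
        w -= lr * gw
        b -= lr * gb
    return w, b

# Hypothetical 1-D data in two partitions: the label is 1 when x is positive.
partitions = [[(-2.0, 0), (-1.0, 0)], [(1.0, 1), (2.0, 1)]]
w, b = train(partitions)
```

With 200 iterations, the same four records are read 200 times. Keeping the working set in memory across iterations is the design choice that platforms such as Spark make to address this.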
Thank You!
Mail • vijay.sa@impetus.co.in
LinkedIn • http://in.linkedin.com/in/vijaysrinivasagneeswaran
Blogs • blogs.impetus.com
Twitter • @a_vijaysrinivas
References
• Divyakant Agrawal et al., Challenges and Opportunities with Big Data, Computing Research Association White Paper, available from http://www.cra.org/ccc/files/docs/init/bigdatawhitepaper.pdf.
• Vijay Srinivas Agneeswaran et al., Distributed Deep Learning over Spark, available at: http://www.datasciencecentral.com/profiles/blogs/implementing-a-distributed-deep-

Editor's Notes

  • #13: Reference: http://neuralnetworksanddeeplearning.com/chap1.html. Consider the problem of identifying the individual digits in an input image. Each image is 28 by 28 pixels. The network is designed as follows. Input layer (image): 28*28 = 784 neurons, each corresponding to one pixel. Output layer: one neuron per digit to be identified, i.e. 10 (0 to 9). The intermediate hidden layer can be tried with varying numbers of neurons; let us fix it at 10 nodes.
  • #14: Reference: http://neuralnetworksanddeeplearning.com/chap1.html. How about recognizing a human face in a given set of random images? Attack this problem in the same fashion as explained earlier. Input: image pixels. Output: is it a face or not? (a single node). A face can be recognized by answering questions like “Is there an eye in the top left?”, “Is there a nose in the middle?”, etc. Each question corresponds to a hidden layer.
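
The layer sizes in note #13 (784 inputs, 10 hidden nodes, 10 outputs) can be sketched as a forward pass. This is a minimal illustration under stated assumptions: the weights are random and untrained, the sigmoid activation is assumed rather than taken from the notes, and the input image is synthetic, so the predicted digit carries no meaning.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_layer(n_in, n_out, rng):
    """Random weights and biases for one fully connected layer (untrained)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.gauss(0.0, 0.1) for _ in range(n_out)]
    return weights, biases

def forward(layer, inputs):
    """Apply one layer: sigmoid(W x + b), one output per neuron."""
    weights, biases = layer
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

rng = random.Random(0)
hidden = init_layer(784, 10, rng)  # 28 * 28 = 784 pixels -> 10 hidden neurons
output = init_layer(10, 10, rng)   # 10 hidden neurons -> 10 digit classes (0 to 9)

# Hypothetical grayscale image, flattened to 784 values in [0, 1).
image = [rng.random() for _ in range(784)]
activations = forward(output, forward(hidden, image))
predicted_digit = max(range(10), key=lambda d: activations[d])
```

The face detector in note #14 would follow the same shape, with a single output neuron in place of ten.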