Introduction to Big Data 
Roi Blanco
2
What is Big Data? 
• A fashionable term used by some IT vendors to re-market
old-fashioned hardware and software 
• “The term itself is vague, but it is getting at something that is real… 
Big Data is a tagline for a process that has the potential to transform 
everything.” Jon Kleinberg 
• What I want to talk about: 
– Big Data science, cool use cases 
– Access to data, tools to process the data (Hadoop and friends’ ecosystem) 
– What’s next (now!) 
3
Now, that’s Big data 
4
Data? 
• Advances in digital sensors, communications, computation, and 
storage have created huge collections of data, capturing information 
of value to business, science, government, and society. 
• Example: search engine companies 
– transformed how people find and make use of information on a daily basis. 
• Other forms of big data are transforming the activities of companies
and scientific researchers. 
• Machine learning on large data-sets for decision making, product 
shaping. 
5
Motivation 
• BIG DATA is an OPEN SOURCE Software Revolution 
• BIG DATA Analytics 2.0 
• What is happening right now 
• Why do we need new tools? 
• Improve decision making: 
• Measure and react in REAL-TIME 
6
Data Explosion 
text 
audio 
video 
images 
relational 
picture from Big Data Integration 
7
Real Time Decision Making 
Companies need to know: 
• what is happening right now, 
in real time, to be able to 
• react 
• anticipate and detect new 
business opportunities. 
8
Wal-Mart 
9
LHC 
10
WWW 
11
Mobile 
12
Intelligence agencies 
13
Social media 
14
Big Data 3(+3) Vs 
• Volume 
• Variety 
• Velocity 
• Value 
• Variability 
• Veracity 
15
Volume vs Velocity 
16
Controversy of Big Data 
• All data is BIG now 
• Hype to sell Hadoop 
based systems 
• Ethical concerns about 
accessibility 
• Limited access to Big 
Data creates new digital 
divides 
17
Controversy of Big Data 
• Statistical Significance: 
– When the number of 
variables grows, the 
number of spurious 
correlations also grows 
– Leinweber: S&P 500 
stock index correlated 
with butter production 
in Bangladesh 
18
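The statistical point on this slide can be checked with a small simulation. The sketch below is illustrative only: it assumes purely random Gaussian data, and the function names, the 20-point series length and the seed are hypothetical choices, not anything from Leinweber's study. It correlates one random "target" series against a growing pool of unrelated random series and reports the strongest correlation found.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_spurious_correlation(num_variables, num_points=20, seed=0):
    """Largest |correlation| between a random target and unrelated random variables."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(num_points)]
    best = 0.0
    for _ in range(num_variables):
        # Each candidate is generated independently of the target.
        candidate = [rng.gauss(0, 1) for _ in range(num_points)]
        best = max(best, abs(pearson(target, candidate)))
    return best


if __name__ == "__main__":
    for k in (1, 10, 100, 1000):
        print(k, round(strongest_spurious_correlation(k), 3))
```

Because the seed is fixed, each larger pool contains the smaller ones, so the reported maximum can only stay the same or increase as variables are added, even though none of them is related to the target.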
Need for Big Data 
McKinsey Global Institute (MGI) Report on Big Data, 2011 
19 
• WEF defined data as an asset 
just like gold or currency 
• Business opportunities to 
exploit by companies that can 
analyze information in the 
right way 
• What do your customers 
need? 
• What will they demand in the 
future?
Need for Big Data 
20 
• How do you know the 
investment was worth it? 
• In the success cases, 
predictive analysis 
has led to income 
improvement of ~70% 
McKinsey Global Institute (MGI) Report on Big Data, 2011
Crude Oil 
21
Data Analysis 
• Most businesses are still running on small data! 
• Is more data always better? 
– Hardly 
– past a certain point, return on adding more data diminishes to the point that 
you’re only wasting time gathering more 
• Do you need data? 
– Of course 
– … but the right data (+ interpretation) 
• Unbiased, context 
• Big data is not a magic wand for inferring causality 
• Most AI problems have been tackled from a data perspective 
– Still, unsolved (Google’s cat detector). 
22
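One common way to quantify the diminishing return of more data is the standard error of a sample mean, which shrinks with the square root of the sample size. The sketch below assumes independent samples and a simple mean estimate; this is an illustrative assumption, not a claim made on the slide, and `standard_error` is a hypothetical helper name.

```python
import math


def standard_error(std_dev, sample_size):
    """Standard error of the mean for independent samples."""
    return std_dev / math.sqrt(sample_size)


if __name__ == "__main__":
    # Each step uses 100x more data than the previous one.
    for n in (100, 10_000, 1_000_000):
        print(n, standard_error(1.0, n))
```

Under these assumptions, 100 times more data buys only a 10 times tighter estimate, which is why the right data matters more than simply more of it.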
What is data science? 
23
Why is interest in Machine Learning increasing? 
• Data is everywhere 
– Increasingly captured 
– Increasingly comprehensive 
• Storage is now much cheaper, and so is processing 
– In-house Hadoop clusters 
– Cloud-based processing (Amazon EC2) 
• Data is important 
– Machine learning provides effective development methodology 
– … when you cannot program a solution by hand 
– … but you have data available 
• Let the data figure out the program 
• Any company with large data sets will have an interest 
24
(HADOOP) 
25
Big Data Challenges 
Sort 10TB on 1 node = 2 days 
100-node cluster = 30 min 
26
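The two sort timings on this slide imply a near-linear speedup. The arithmetic can be sketched as follows; the helper names are hypothetical and the only inputs are the figures quoted on the slide.

```python
MINUTES_PER_DAY = 24 * 60


def speedup(single_node_minutes, cluster_minutes):
    """How many times faster the cluster finishes than a single node."""
    return single_node_minutes / cluster_minutes


def parallel_efficiency(speedup_factor, num_nodes):
    """Fraction of ideal linear scaling that was actually achieved."""
    return speedup_factor / num_nodes


if __name__ == "__main__":
    single_node = 2 * MINUTES_PER_DAY   # 2 days on 1 node = 2880 minutes
    cluster = 30                        # 30 minutes on 100 nodes
    factor = speedup(single_node, cluster)
    print(factor)                            # 96.0
    print(parallel_efficiency(factor, 100))  # 0.96
```

A 96x speedup on 100 nodes is about 96% of ideal linear scaling.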
Big Data Challenges 
“Fat” servers imply high cost 
– use cheap commodity nodes instead 
A large number of cheap nodes implies frequent failures 
– leverage automatic fault-tolerance 
27
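Why many cheap nodes imply frequent failures can be seen with a short probability sketch. It assumes nodes fail independently, and the 0.1% daily failure rate is a made-up illustrative figure, not a measurement from the slides.

```python
def prob_any_failure(num_nodes, per_node_failure_prob):
    """Probability that at least one of num_nodes independent nodes fails."""
    return 1.0 - (1.0 - per_node_failure_prob) ** num_nodes


# Hypothetical figure: each node has a 0.1% chance of failing on a given day.
DAILY_FAILURE_PROB = 0.001

if __name__ == "__main__":
    for nodes in (1, 100, 1000, 10000):
        print(nodes, round(prob_any_failure(nodes, DAILY_FAILURE_PROB), 4))
```

Even with a failure rate that is negligible for one machine, a failure somewhere in a large cluster becomes an everyday event, so the framework has to handle it automatically.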
Big Data Challenges 
We need a new data-parallel programming model for clusters of commodity 
machines 
28
MapReduce 
Published in 2004 by Google 
– MapReduce: Simplified Data Processing on Large Clusters 
Popularized by the Apache Hadoop project, driven largely by Yahoo! 
– Now used by virtually everybody else: Facebook, Twitter, 
Amazon, … 
29
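The programming model can be illustrated with the classic word-count example. This is a single-process sketch of the map, shuffle and reduce steps only; it is not the Hadoop API, and all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: combine all values seen for one key."""
    return key, sum(values)


def word_count(documents):
    """Run the three steps in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big clusters"]))
```

In a real cluster the mappers and reducers run on different machines and the shuffle moves data over the network. The user still writes only the two small functions.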
Who uses Hadoop? 
30
Map Reduce Philosophy 
– hide complexity 
– make it scalable 
– make it cheap 
1. System Shall Manage and Heal 
Itself 
2. Performance Shall Scale 
Linearly 
3. Compute Should Move to Data 
4. Simple Core, Modular and 
Extensible 
31
Hadoop High-Level Architecture 
Name Node 
Maintains mapping of file blocks 
to data node slaves 
Job Tracker 
Schedules jobs across 
task tracker slaves 
Data Node 
Stores and serves 
blocks of data 
Hadoop Client 
Contacts Name Node for data 
or Job Tracker to submit jobs 
Task Tracker 
Runs tasks (work units) 
within a job 
Share Physical Node 
32
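Pig 
Pig: Similar to SQL 
-- Load rows with three integer fields from 'data'
A = LOAD 'data' USING PigStorage() AS (f1:int, f2:int, f3:int); 
-- Group the rows by the first field
B = GROUP A BY f1; 
-- For each group, count the rows it contains
C = FOREACH B GENERATE COUNT(A); 
DUMP C; 
33 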
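For readers who do not know Pig Latin, the group-and-count pattern in the script above can be mirrored in plain Python. This is a rough single-machine analogue under the assumption that each row is a tuple `(f1, f2, f3)`; the sample rows and the function name are hypothetical.

```python
from collections import Counter


def count_by_first_field(rows):
    """Count rows per value of f1, like GROUP A BY f1 followed by a count."""
    return Counter(f1 for f1, _f2, _f3 in rows)


if __name__ == "__main__":
    # Hypothetical sample rows standing in for the contents of 'data'.
    rows = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]
    print(count_by_first_field(rows))
```

The difference is that Pig compiles the same logic into MapReduce jobs, so it keeps working when the input no longer fits on one machine.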
Pig powers 
34
HBase 
• Apache HBase™ is the Hadoop database, a distributed, scalable, big key-value store 
– Linear and modular scalability 
– Strictly consistent reads and writes 
– Automatic and configurable sharding of tables 
– Failover support 
– Interoperable with Java, Hadoop 
35 
Hive 
• Apache project for querying 
and analyzing datasets in 
HDFS 
– Tools to enable easy data 
extract/transform/load (ETL) 
– A mechanism to impose 
structure on a variety of 
data formats 
– Access to files stored either 
directly in Apache HDFS™ 
or in other data storage 
systems such as Apache 
HBase™ 
– Query execution via 
MapReduce 
36
Apache S4 
37
Twitter Storm 
38
Apache Mahout 
39
MOVING TOWARDS (NEAR)REALTIME
Runaway Complexity 
41
Future 
• Process data fast enough 
– BI analytics 
• Key drivers: connected devices/services 
– Tablets, smartphones, etc. 
– Your data is “always connected to the cloud” 
– Low latency (again)/enormous amount of data 
• User data 
– Categorize data to infer knowledge about a user 
• Targeting, personalization 
• 100B events per day 
– ML: from information to knowledge 
– Behavioral targeting (user features) 
• How likely am I to be interested in fashion? For how long? 
• Map to behavioral targeting categories, segment for targeting 
42
Future (II) 
• Data processed in batches 
– There are gaps! 
– Things you’ve calculated half an hour ago 
– Ok for monthly reports, not for online NRT prediction 
– Think of GEO targeting 
• You can’t go fast enough with MR 
– From big long windows to small incremental iterations 
– Micro-batches updating user knowledge 
• Use cases 
– Ad campaign allocation 
• Delay between click and deducting budget from an advertiser (overspending) 
– Personalization and targeting 
• Y! Homepage 
• Use every event on the stream to detect the interest 
– How do we train machine learning models when the data is arriving non-stop? 
• You want parameters to adapt, to change slowly 
• Maybe 99% of the data is the same! Updating incrementally is better 
43
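The idea of parameters that adapt slowly as events stream in can be sketched with an exponentially weighted update. This is only one simple way to learn incrementally, chosen here for illustration; the function names, the learning rate and the encoding of events as 0/1 interest signals are assumptions, not the method used in the systems the slides describe.

```python
def update_interest(current, event_value, learning_rate=0.1):
    """Move the estimate a small step toward the newest observation."""
    return current + learning_rate * (event_value - current)


def stream_interest(events, learning_rate=0.1, initial=0.0):
    """Yield the running interest estimate after each event on the stream."""
    estimate = initial
    for event_value in events:
        estimate = update_interest(estimate, event_value, learning_rate)
        yield estimate


if __name__ == "__main__":
    # Hypothetical stream: 1 = the user clicked on fashion content, 0 = did not.
    clicks = [1, 1, 0, 1, 0, 0, 0, 0]
    for step, score in enumerate(stream_interest(clicks), start=1):
        print(step, round(score, 3))
```

Each event costs constant time and memory, nothing is recomputed from scratch, and old behaviour fades gradually, which also gives a rough answer to the "for how long?" question about a user's interest.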
Beyond Hadoop 
• YARN 
– What if you just want to interact with the data in Hadoop? 
• Hive (SQL-like), HBase (NoSQL) and Pig (scripted data access) 
– Those apps are great but limited to running as a single application system with 
MapReduce at the core 
– Spark (see below) and Storm have been ported to YARN already 
• Streaming 
– SAMOA 
• RDDs 
– Spark 
• Shark (Hive on Spark) 
• Analytics Architecture 
– Visualization http://visualize.yahoo.com/mail/ 
44
Future Challenges for Big Data 
• Evaluation 
• Time evolving data 
• Distributed mining 
• Compression 
• Visualization 
• Hidden Big Data 
45
Hadoop 2.0 
• No longer “only” running MR jobs 
– MR + low-latency and streaming processing 
• Iterative processing 
– Hold data in memory to re-process 
• Figure out which questions to ask of the data 
– BI users who want to explore the data really fast 
• Possible thanks to YARN + Storm(S4) + Spark + … ? 
– 350PB of data 
– >30K nodes with Yarn 
– 400K jobs per day (6 jobs/sec) 
– 10M hours of compute with YARN 
46
Future key take-aways 
• Scalability 
• Performance 
• Flexibility 
• Programming paradigms 
– MAP/MAP/MAP … or REDUCE/REDUCE/REDUCE 
47
Big Data Myths 
• Big Data is new 
• Big Data is objective 
• Big Data doesn’t discriminate 
• Big Data makes things smart 
• Big Data is anonymous 
• You can opt-out 
48
Big Data vs Big Reality 
• Big Data is an oxymoron 
• Big Data raises bigger issues. The term suggests assembling many 
facts to create greater, previously unseen truths. It suggests the 
certainty of math. 
• It's not the data itself but what you do with it that counts. 
49

