Making Sense out of Big Data
Peter Morgan - July 2013
Table of Contents
1. Definition and Overview
2. Data Sources
3. Databases
4. Big Data Analytics
Glossary
References
1. Definition and Overview
What is big data?
More and more data is being collected and stored each day
Four main components
• Data
– Structured and unstructured
• Databases
– Proprietary and open source
• Query language
– Querying the database
• Analytics
– Analysing the data
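As a rough illustration of how the four components fit together, the sketch below uses Python's built-in sqlite3 module as a stand-in database. The table name, column names and sample rows are invented for the example, and a real big data deployment would use a distributed store rather than an in-memory SQLite database.

```python
import sqlite3

# 1. Data: a few structured records (hypothetical sample values).
rows = [("alice", "login", 3), ("bob", "purchase", 1), ("alice", "purchase", 2)]

# 2. Database: an in-memory SQLite database stands in for the real store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (user_name TEXT, event_type TEXT, n_events INTEGER)"
)
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

# 3. Query language: SQL retrieves and aggregates the stored data.
query = (
    "SELECT user_name, SUM(n_events) FROM events "
    "GROUP BY user_name ORDER BY user_name"
)
totals = dict(conn.execute(query).fetchall())

# 4. Analytics: derive a simple insight from the query result.
most_active = max(totals, key=totals.get)
print(totals, most_active)
conn.close()
```

The same four roles appear in larger systems; only the scale and the tools change.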
How big is big?
• Large data sets
– Greater than 1,000 Terabytes? (1 Petabyte)
– 1,000,000 Terabytes? (1 Exabyte)
• An Excel 2013 worksheet can hold 1,048,576 rows by 16,384 columns
– About 10 gigabytes of data
• Only going to get bigger
– 90% of all data was produced in the past two years!
– The rate is increasing
• Recall
– Giga = 10⁹
– Tera = 10¹²
– Peta = 10¹⁵
– Exa = 10¹⁸
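The prefixes above can be checked with a few lines of Python. This is a minimal sketch: the helper name terabytes_in is made up for the example, and the Excel figure counts cells only. How many bytes each cell occupies is not stated on the slide, so the "about 10 gigabytes" estimate depends on that assumption.

```python
# Decimal (SI) prefixes as listed on the slide.
PREFIXES = {"giga": 10**9, "tera": 10**12, "peta": 10**15, "exa": 10**18}


def terabytes_in(prefix: str) -> int:
    """Return how many terabytes make up one unit of the given prefix."""
    return PREFIXES[prefix] // PREFIXES["tera"]


# Maximum number of cells in one Excel 2013 worksheet (rows x columns).
excel_cells = 1_048_576 * 16_384

print(terabytes_in("peta"))  # 1000
print(terabytes_in("exa"))   # 1000000
print(excel_cells)           # 17179869184
```

A full worksheet therefore has roughly 17 billion cells, which at about one byte per cell is the same order of magnitude as the slide's estimate.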
Big Data Evolution
2. Data Sources
Where does the data come from?
• Science – particle, astrophysics
• Industry – oil, finance, telecom
– Actually all verticals
• Social – Facebook, LinkedIn, Twitter
• Medicine – genome, neuroscience
• Government – census, education, police
• Sports – statistics
• Environment – weather, sensors
Unstructured Data
• 80% of data is unstructured
• NoSQL
• Document based
– Documents
– Texts, tweets
– Emails
– Machine logs
– Blogs
– Web pages
– Photos
– Videos (YouTube)
• Graph based
– Social media sites
– Facebook has 1.1 billion users (MicroStrategy, July 27, 2013)
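To make the document-based and graph-based categories more concrete, here is a small sketch using only the Python standard library. The tweet fields, user names and the followers_of helper are hypothetical; production systems would use a document store or a graph database rather than plain dictionaries.

```python
import json

# Document model: each record is a self-describing, nested document.
tweet = {
    "id": 1,
    "user": "ann",
    "text": "Learning about big data",
    "tags": ["bigdata", "nosql"],
}
serialized = json.dumps(tweet)   # documents are often stored as JSON
restored = json.loads(serialized)

# Graph model: users are nodes, "follows" relationships are edges.
follows = {
    "ann": {"bob", "cat"},
    "bob": {"cat"},
    "cat": set(),
}


def followers_of(user: str) -> set:
    """Return everyone who follows the given user."""
    return {src for src, targets in follows.items() if user in targets}


print(restored["tags"], sorted(followers_of("cat")))
```

Neither structure needs a fixed table schema, which is one reason such data is usually handled outside a relational database.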
Why do we need to use big data?
Used in the public and private sectors to:
• Make faster and more accurate business decisions
• Make accurate predictions
• Gain competitive advantage
• Implement smarter marketing – CRM
• Discover new opportunities
• Enhance Business Intelligence
• Enable fraud detection
• Reduce crime
• Improve scientific research
• Quicken analysis (up to real time)
– Weeks, days → minutes, seconds
Big Data Startup - Case Study
• Rocket Fuel
• No. 4 on Forbes' 2013 Most Promising Companies in America list
• Digital advertising startup
• Screens over 26 billion ads per day
• “Advertising that learns” big data platform
• Distributed planet-scale computing engine
• Hadoop implementation
• Founders from Yahoo!, Salesforce.com, DoubleClick
• Targeting algorithms use lifestyle, purchase intent and social data
Some big statistics
3. Databases
Database Timeline
Relational databases – SQL
Proprietary
• Oracle DB
• IBM DB2
• Microsoft SQL Server
• SAP
• EMC
Open Source
• MySQL
• PostgreSQL
• Drizzle
• Firebird
Non-relational databases – NoSQL
• Bigtable – Google
• Cassandra – Facebook
• DynamoDB – Amazon
• HBase – Hadoop
• MongoDB – 10gen
• Neo4j – Neo Technology
• CouchDB – Apache
• Couchbase
• Riak – Basho
• Redis – Pivotal
4. Big Data Analytics
Big Data Analytics - Incumbents
• Oracle – Exadata, Exalytics
• Microsoft – HDInsight, xVelocity
• IBM – Netezza, Cognos, BigInsights
• SAP – HANA, Business Objects
• EMC – Pivotal (Greenplum)
• HP – Vertica, HAVEn
• All integrate with Hadoop
Big Data Analytics – Pure Plays
• Pure plays – definition:
– Have been around for more than 20 years
– Purely data analytics companies
• Teradata – Aster
• SAS
• MicroStrategy
Big Data Analytics – New Entrants
• Hortonworks
• Cloudera
• MapR
• Acunu
• Pentaho
• Tableau
• Talend
• Splunk
(Some of) IBM’s Big Data Acquisitions
• Algorithmics
– Oct 2011, $400 million
• OpenPages
– Oct 2010, ?
• Netezza
– Sept 2010, $1.7 billion
• SPSS
– Jan 2010, $1.2 billion
• Cognos
– Jan 2008, $4.9 billion
• About $10 billion in four years
http://en.wikipedia.org/wiki/List_of_mergers_and_acquisitions_by_IBM
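The headline total can be sanity-checked by adding up the prices given on the slide. The sketch below does only that; OpenPages is excluded because no price is listed for it, so the sum is a lower bound rather than the exact amount spent.

```python
# Purchase prices in billions of US dollars, as listed on the slide.
# OpenPages is left out because the slide gives no price for it.
prices_bn = {
    "Algorithmics": 0.4,
    "Netezza": 1.7,
    "SPSS": 1.2,
    "Cognos": 4.9,
}

total_bn = round(sum(prices_bn.values()), 1)
print(f"Listed total: ${total_bn} billion")
```

The four listed deals come to $8.2 billion, so "about $10 billion" is a rounded figure that leaves room for the unlisted OpenPages price.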
Big Data Science Tools
• Hadoop
• NoSQL
• MapReduce
• R
• Matlab
• Python
• Statistics
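Of the tools listed, MapReduce is the easiest to show in miniature. The following is a single-process sketch of the map, shuffle and reduce steps applied to a word count. It does not use Hadoop or any Hadoop API, and the function names and sample lines are chosen for illustration only.

```python
from collections import defaultdict


def map_phase(line: str):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


lines = ["big data is big", "data about data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

In a real cluster the map and reduce calls run in parallel on many machines, and the framework performs the shuffle over the network.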
Big Data Hadoop Stack
• Hadoop is the de facto big data operating system
• Inspired by Google's MapReduce and file system papers, with much early development at Yahoo! (2005)
• It is distributed, open source and managed by the Apache Software Foundation
Analytic Technologies
• A/B testing
• Genetic algorithms
• Machine learning
• Natural language processing
• Neural networks
• Pattern recognition
• Anomaly detection
• Decision trees
• Predictive modeling
• Regression analysis
• Sentiment analysis
• Signal processing
• Simulations
• Time series analysis
• Visualization
• Multivariate analysis
• Text analytics
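As one small example from this list, anomaly detection can be sketched with a z-score test using Python's statistics module. This is only one of many possible approaches. The threshold of 2.0 and the daily counts are assumptions made for the illustration, not values from the presentation.

```python
import statistics


def find_anomalies(values, threshold=2.0):
    """Return values whose z-score magnitude exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical daily transaction counts with one obvious outlier.
daily_counts = [100, 102, 98, 101, 99, 103, 97, 250]
print(find_anomalies(daily_counts))
```

A simple rule like this is often a first pass in fraud detection, before more elaborate machine learning models are tried.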
Glossary
• OLTP = Online Transaction Processing
• OLAP = Online Analytical Processing
• ODBC = Open Database Connectivity
• IMDB = In-Memory Database
• CRUD = Create, Read, Update, Delete
• ETL = Extract, Transform and Load
• CDO = Chief Data Officer
• NLP = Natural Language Processing
• GQL = Graph Query Language
• AaaS = Analytics as a Service
• EDW = Enterprise Data Warehouse
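To illustrate one glossary term, ETL, the sketch below extracts rows from a CSV source, transforms them and loads them into a table. It uses only the Python standard library. The CSV content, the sales table and its columns are hypothetical, and an in-memory SQLite database stands in for an enterprise data warehouse.

```python
import csv
import io
import sqlite3

# Extract: read raw rows from a CSV source (an in-memory string here).
raw = "name,amount\n alice ,10\nBOB,5\n"
records = list(csv.DictReader(io.StringIO(raw)))

# Transform: normalise names and convert amounts to integers.
clean = [(r["name"].strip().lower(), int(r["amount"])) for r in records]

# Load: write the cleaned rows into a target table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (name TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", clean)
loaded = conn.execute("SELECT name, amount FROM sales ORDER BY name").fetchall()
print(loaded)
conn.close()
```

The load step also exercises the Create and Read parts of CRUD.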
References
• MicroStrategy website, 27 July 2013, Michael Saylor presentation at MicroStrategy World 2013, http://www.microstrategy.com/
• Teradata website, www.teradata.com
• Wikipedia, http://en.wikipedia.org/wiki/
• Google Images, www.google.co.uk
• IBM website, www.ibm.com
• YouTube, www.youtube.com
• Hadoop, www.hortonworks.com
Any Questions?