Random notes on big data
Chen Peng, Jianqiang Wang, Yang Huang
April 19, 2013
What is big data
● Volume: Gigabytes -> Terabytes -> Petabytes.
● Velocity: time-sensitive, streaming, real-time.
  Jet engine: 20TB/hr
  GE: (minds + machines)
● Variety: structured/unstructured.
● Value: insights, analytical systems.
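For a rough sense of scale, the sketch below converts the quoted jet-engine rate of 20TB/hr into per-second and per-day figures. It assumes decimal units (1 TB = 1000 GB) and a constant rate; both are illustrative assumptions, not claims from the slide.

```python
# Back-of-the-envelope conversion of a sustained data rate.
# Assumptions (not from the slides): decimal units and a constant rate.
TB_PER_HOUR = 20          # figure quoted on the slide
GB_PER_TB = 1000          # decimal (SI) units assumed
SECONDS_PER_HOUR = 3600
HOURS_PER_DAY = 24


def gb_per_second(tb_per_hour: float) -> float:
    """Convert a rate in TB/hour to GB/second."""
    return tb_per_hour * GB_PER_TB / SECONDS_PER_HOUR


def tb_per_day(tb_per_hour: float) -> float:
    """Convert a rate in TB/hour to TB/day."""
    return tb_per_hour * HOURS_PER_DAY


if __name__ == "__main__":
    print(f"{gb_per_second(TB_PER_HOUR):.2f} GB/s")
    print(f"{tb_per_day(TB_PER_HOUR):.0f} TB/day")
```

Under these assumptions a single source at this rate produces several gigabytes every second, which is why collection and storage appear among the challenges.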
Challenges: collect, store, organize, analyze and share
External
> web sites (blogs/reviews)
> social media (Facebook, LinkedIn,
Google+, Twitter)
> images and videos
> ...
Internal
> transactions
> server logs
> machines and sensors
> emails
> ...
Variety
Value Hierarchy
Raw Data
Normalized
Insight
Recommendation
Transact
Data is now a strategic asset
Technology stack & corresponding
firms
Google Cloud Platform
Google App Engine: scalable application development and execution environment
Google Compute Engine: virtual machines; run arbitrary workloads at scale (e.g. Hadoop, scientific computing)
Google Cloud Storage: storage; connecting glue between each step of the data pipeline
Google BigQuery: data analysis; querying large datasets, plus third-party apps for visualization (e.g. Tableau)
Big data analytics
Analytics is the scientific process of transforming data into
insights for making better decisions.
Data -> Insight -> Decision
Data ("resource"): IT logs, cloud, social media, sensors, experiments, etc.
Insight ("product"): statistical & operations research modeling
Decision ("goal"): judgement, constraints, intuition
Predictive analytics extracts information from data and
uses it to predict future trends and behavior patterns.
regression models
discrete choice models
time series models
classification models (decision tree, random forest, support vector machine,
neural network, etc.)
clustering models (k-means, density based, graph based, etc.)
association analysis
...
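As a concrete instance of the first item in the list above, here is a minimal sketch of a one-variable regression model fitted by ordinary least squares and used to predict the next value. The function names and the monthly sales figures are hypothetical, chosen only for illustration.

```python
# Ordinary least squares for a single predictor, written out by hand.
# The data below is made up purely for illustration.
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Return (slope, intercept) of the line minimizing squared error."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two or more paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(slope: float, intercept: float, x: float) -> float:
    """Evaluate the fitted line at x."""
    return slope * x + intercept


if __name__ == "__main__":
    months = [1, 2, 3, 4, 5]
    sales = [12.0, 14.0, 16.0, 18.0, 20.0]  # hypothetical, exactly linear
    slope, intercept = fit_line(months, sales)
    print(predict(slope, intercept, 6))
```

Real data is rarely exactly linear; the point is only the shape of the workflow: fit on past observations, then evaluate the model on a future input.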
Big data analytics
Descriptive Analytics
Predictive Analytics
Prescriptive Analytics
Always keep in mind...
> business objectives are the origin of every data mining solution
> data preparation is more than half of the data mining process
> all patterns are subject to change
> there will always be new knowledge
Always pause and ask yourself:
Does this work relate to the business question we are trying to answer?
Is the original business question still valid?
Industry Use-cases/Application
Healthcare: drug development; patient monitoring; electronic medical records
Utilities: smart grid optimization
Retail & marketing: customer loyalty and churn analysis; targeted product and services offerings; product sentiment analysis; marketing campaign optimization
Financial services: fraud detection & prevention; anti-money laundering
Telecom: customer churn mitigation; geospatial analytics; call data record (CDR) analysis
Use cases by industry
Industry applications of big data
analytics
Customer acquisition
Predict customers' buying habits in order to promote relevant products at
multiple touch points.
http://www.youtube.com/watch?feature=player_embedded&v=3WspJ16Ubhw
Clinical decision support
Experts use predictive analysis in health care primarily to determine which
patients are at risk of developing certain conditions, like diabetes, asthma, heart
disease, and other lifetime illnesses.
Cross sale
Predictive analytics can help analyze customers' spending, usage and other
behavior, leading to efficient cross-selling, that is, selling additional
products to current customers (e.g. beer & diapers).
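The beer and diaper example is usually told as an association-analysis story. The sketch below computes the three standard measures (support, confidence, lift) for a rule such as "diapers -> beer". The baskets and function names are hypothetical, made up for illustration.

```python
# Support, confidence and lift for an association rule A -> B.
# The baskets below are invented for illustration only.
from typing import Iterable, List, Set


def support(transactions: List[Set[str]], items: Iterable[str]) -> float:
    """Fraction of transactions containing every item in `items`."""
    wanted = frozenset(items)
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


def confidence(transactions: List[Set[str]],
               antecedent: Iterable[str],
               consequent: Iterable[str]) -> float:
    """Estimated P(consequent | antecedent)."""
    a = frozenset(antecedent)
    s_a = support(transactions, a)
    if s_a == 0:
        raise ValueError("antecedent never occurs")
    return support(transactions, a | frozenset(consequent)) / s_a


def lift(transactions: List[Set[str]],
         antecedent: Iterable[str],
         consequent: Iterable[str]) -> float:
    """Confidence relative to the consequent's baseline frequency."""
    s_c = support(transactions, consequent)
    if s_c == 0:
        raise ValueError("consequent never occurs")
    return confidence(transactions, antecedent, consequent) / s_c


if __name__ == "__main__":
    baskets = [
        {"beer", "diapers", "chips"},
        {"diapers", "milk"},
        {"beer", "diapers"},
        {"milk", "bread"},
        {"beer", "chips"},
    ]
    print(confidence(baskets, ["diapers"], ["beer"]))
    print(lift(baskets, ["diapers"], ["beer"]))
```

A lift above 1 suggests the two items appear together more often than independence would predict, which is the signal a cross-sale offer would build on.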
Ads targeting
http://www.slideshare.net/dennyglee/yahoo-tao-case-study-excerpt
Fraud detection
A predictive model can help weed out the "bads" and reduce a business's
exposure to fraud.
Image and Speech Recognition
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/people/jeff/MIT_BigData_Sep2012.pdf
Operations
Jet Engine + Humans
http://www.youtube.com/watch?v=JHc4ZTTWKrQ
Industry applications of big data
analytics
Amazon warehouse operational efficiency: http://www.youtube.com/watch?v=Kafs9tZskuo
Beer and diaper
What are those startups doing?
Bloomreach
http://www.youtube.com/watch?feature=player_embedded&v=K12awAj4tW8
Datastax
http://www.nytimes.com/2013/02/25/business/media/for-house-of-cards-using-big-data-to-guarantee-its-popularity.html?pagewanted=all
Paraccel
http://www.paraccel.com/solutions/paraccel-solutions-big-data.php#.UXG207WG3Ct
Kaggle
http://www.kaggle.com/c/acm-sf-chapter-hackathon-big
VC funding for "Big Data"
Data from 71 start-ups. Funding is
counted starting from 2004.
VC Funding Activity
Data from 71 start-ups. Funding is
counted starting from 2004.
Interesting view points
"Special (domain) knowledge becomes less relevant;
organizations should focus on collecting people who know
how to extract value and insights from data."
"In God we trust. All others must bring data."
"The usefulness of a variable in a model is inversely
related to the time you spend creating it."
"Noise is convex but information is concave."
"Big data is sexy but small data is beautiful."
Figure: noise and information plotted against data size
Interesting view points
"All models are wrong, but some are useful."
"Big data is like teenage sex: everyone talks about it,
nobody really knows how to do it; everyone thinks everyone
else is doing it, so they claim they are doing it."
"Statistics: The Art and Science of Learning from Data"
The danger of big data
Open discussion
Potential opportunities / challenges for
entrepreneurs?
- visualization
- internet of things
- analytics as a service (a³s)
Standardization vs. customization
Human and data interaction
- data vs. intuition
Back-Up Slides
Data Science vs. OR
risk management strategic planning
predictive analytics optimization
Risk
Measurable of Objective
skill sets of data scientists
Big data types
● Web & social media: clickstream, web content, Amazon reviews, Facebook postings & 'likes'...
● M2M: smart meters, oil rig sensor readings, GPS signals...
● Transaction: retail store, healthcare claims, utility billing...
● Biometrics: fingerprint, face, voice, handwriting...
● Human-generated data: call logs, emails, surveys...
Web & social media
● Transaction: orders, revenue, ...
● Conversion: click-through, convert to purchase, ...
● Session: length, bounce rate
● Lifetime value: repeat, frequency, ...
● Social interaction: intensity, influence, ...
Shopping cart analysis
CTR prediction
Personalization
Retention/customer churn
A/B testing
Targeted ads
Lifetime value
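Two items from the lists above, conversion rate and A/B testing, can be tied together with a small sketch: compute the rate for each variant, then compare them with a two-proportion z-test under the normal approximation. The visitor and conversion counts are hypothetical, and the pooled z-test is one common choice rather than the method any particular team uses.

```python
# Conversion rates and a pooled two-proportion z-test for an A/B test.
# All counts are invented; the normal approximation assumes large samples.
import math


def rate(successes: int, trials: int) -> float:
    """Simple ratio such as CTR (clicks/impressions) or conversion rate."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    return successes / trials


def two_proportion_z(s_a: int, n_a: int, s_b: int, n_b: int) -> float:
    """z statistic for H0: both variants share one underlying rate."""
    p_a, p_b = rate(s_a, n_a), rate(s_b, n_b)
    pooled = (s_a + s_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in outcomes; test is undefined")
    return (p_b - p_a) / se


def two_sided_p_value(z: float) -> float:
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    # Hypothetical experiment: variant A vs. variant B of a landing page.
    z = two_proportion_z(200, 10_000, 260, 10_000)
    print(f"A: {rate(200, 10_000):.3%}  B: {rate(260, 10_000):.3%}")
    print(f"z = {z:.2f}, p = {two_sided_p_value(z):.4f}")
```

A small p-value suggests the difference in conversion is unlikely to be chance alone; whether the lift matters commercially is a separate business question.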
Interesting data visualization
projects
wind map
http://hint.fm/wind/gallery/oct-30.js.html
Some analytical problems people
deal with at Google ...
● search ranking
Processing Pipeline
Hadoop
MapReduce
log
sensor
web
...
Structured
Data
Note: Hadoop is an open-source software framework that supports data-intensive distributed
applications, licensed under the Apache v2 license. It supports running applications on large
clusters of commodity hardware. It originated from Google's MapReduce design and was further
developed and promoted by Yahoo.
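To make the MapReduce idea in the note concrete, here is a minimal single-process sketch of the three phases (map, shuffle, reduce) applied to word counting. It is plain Python, not the Hadoop API; the function names are hypothetical, and a real cluster would run the map and reduce steps in parallel across machines.

```python
# In-memory illustration of the MapReduce programming model (word count).
# This is NOT Hadoop code; it only mirrors the map/shuffle/reduce phases.
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, as the framework does between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


if __name__ == "__main__":
    logs = ["big data", "big clusters of commodity hardware"]
    print(word_count(logs))
```

Because each map call sees one record and each reduce call sees one key, both phases can be spread over many machines, which is what lets the model scale on commodity hardware.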
SQL
HIVE
Dremel ...
Analytics
Big Data
Cloud
Computing
http://www.forbes.com/sites/davefeinleib/2012/06/19/the-big-data-landscape/
How big is big?
When your data set becomes so large that you have to
start innovating around how to collect, store, organize,
analyze and share it ...
External
> web sites (blogs/reviews)
> social media (Facebook,
LinkedIn, Google+, Twitter)
> images and videos
> ...
Internal
> transactions
> server logs
> machines and sensors
> emails
> ...
Data types (columns): Web & social media, M2M, Transaction, Biometrics, Human-generated
Health care: sentiment analysis, patient monitoring, genetic testing, electronic medical records
Utilities: smart meters
Retail: loyalty programs, RFID tags, recommendation and market basket, face recognition
Telcos: customer churn, location-based
IT: machine log
Example of semantic graph
Call Data Record
What is Hadoop
