Big Data Use Cases in the cloud
Peter Sirota, GM Elastic MapReduce
@petersirota
What is Big Data?
Computer generated data
 Application server logs (web sites, games)
 Sensor data (weather, water, smart grids)
 Images/videos (traffic, security cameras)
Human generated data
 Twitter “Firehose” (50 million tweets/day, 1,400% growth per year)
 Blogs/Reviews/Emails/Pictures
Social graphs
 Facebook, LinkedIn, contacts
Big Data is full of valuable, unanswered questions!
Why is Big Data Hard (and Getting Harder)?
Data Volume
 Unconstrained growth
 Current systems don’t scale
Data Structure
 Need to consolidate data from multiple data sources
in multiple formats across multiple businesses
Changing Data Requirements
 Faster response times on fresher data
 Sampling is not good enough and history is important
 Increasing complexity of analytics
 Users demand inexpensive experimentation
We need tools built specifically for Big Data!
Innovation #1:
Apache Hadoop
The MapReduce computational paradigm
Open source, scalable, fault-tolerant, distributed system
Hadoop lowers the cost of developing a distributed
system for data processing
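The MapReduce paradigm named on this slide can be illustrated with a minimal, single-process sketch. This is not Hadoop code: the `map_reduce` helper, the `status_mapper` and `count_reducer` functions, and the `"<path> <http_status>"` log format are all assumptions made for illustration. The sketch only shows the shape of the three phases (map, shuffle, reduce) that Hadoop distributes across many machines.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Single-process imitation of the map, shuffle, and reduce phases."""
    groups = defaultdict(list)
    for record in records:
        # Map phase: each record yields zero or more (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle phase: gather all values that share a key.
            groups[key].append(value)
    # Reduce phase: collapse each key's list of values into one result.
    return {key: reducer(key, values) for key, values in groups.items()}


def status_mapper(line):
    # Hypothetical log format: "<path> <http_status>".
    _path, status = line.split()
    yield status, 1


def count_reducer(_key, values):
    return sum(values)


logs = ["/home 200", "/cart 500", "/home 200", "/search 404"]
status_counts = map_reduce(logs, status_mapper, count_reducer)
print(status_counts)
```

Because the mapper looks at one record at a time and the reducer looks at one key at a time, both phases can in principle be split across machines, which is the property a framework like Hadoop exploits.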
Innovation #2:
Amazon Elastic Compute Cloud (EC2)
“provides resizable compute capacity in the cloud.”
Amazon EC2 lowers the cost of operating a
distributed system for data processing
Amazon Elastic MapReduce =
Amazon EC2 + Hadoop
Elastic MapReduce applications
Targeted advertising / Clickstream analysis
Security: anti-virus, fraud detection, image recognition
Pattern matching / Recommendations
Data warehousing / BI
Bio-informatics (Genome analysis)
Financial simulation (Monte Carlo simulation)
File processing (resize jpegs, video encoding)
Web indexing
Clickstream Analysis –
Big Box Retailer came to Razorfish
 3.5 billion records
 71 million unique cookies
 1.7 million targeted ads required per day
Problem: Improve Return on Ad Spend (ROAS)
Clickstream Analysis –
Targeted Ad
User recently purchased a sports movie and is searching for video games (1.7 million per day)
Lots of experimentation but final design:
 100-node on-demand Elastic MapReduce cluster running Hadoop
Processing time dropped from 2+ days to 8 hours
(with lots more data)
Increased Return On Ad Spend by 500%
World’s largest handmade marketplace
 8.9 million items
 1 billion page views per month
 $320MM in gross merchandise sales (GMS) in 2010
• Easy to ‘backfill’ and run experiments: just boot up a cluster with 100, 500, or 1000 nodes
[Diagram: Production DB snapshots, Web event logs, ETL – Step 1, ETL – Step 2, Jobs]
Recommendations
The Taste Test http://www.etsy.com/tastetest
Recommendations
etsy.com/gifts
Gift Ideas for Facebook Friends
Yelp
• Yelp generates close to 400GB of logs per day
MapReduce at Yelp
• Yelp does not have a physical MapReduce cluster
• Running 250 production clusters per week
• All of those run on Elastic MapReduce
Features driven by MapReduce
More MapReduce uses
• Analyze ad stats (reporting, billing, algorithm inputs)
• Analyze A/B test results
• Detect duplicate business listings
• Email bounce processing
• Identify bots based on traffic patterns
9/23/2011 Amazon EMR Strata Justin Moore - @injust
Big Data @ foursquare
How do we use EMR?
• Map-Reduce
– Run algorithms on our entire dataset
– Streaming jobs, complex analyses
• Hive
– Business intelligence
– Exploratory analyses
– Infographics!
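As a rough illustration of what a streaming job looks like, the sketch below follows the Hadoop Streaming contract: the mapper emits tab-separated key/value lines, the framework sorts them by key, and the reducer sums each run of identical keys. The `user_id,venue_id,timestamp` log format and the sample records are hypothetical, and the code is in Python purely for illustration. In a real job the mapper and reducer would be separate scripts reading standard input and writing standard output; here they are plain functions so the example runs on its own.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'venue_id<TAB>1' for each well-formed check-in line."""
    for line in lines:
        # Hypothetical log format: "user_id,venue_id,timestamp".
        fields = line.strip().split(",")
        if len(fields) == 3:
            yield f"{fields[1]}\t1"


def reducer(sorted_lines):
    """Sum the counts per key; the input must already be sorted by key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for venue_id, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{venue_id}\t{total}"


checkins = [
    "u1,v9,2011-09-01T10:00",
    "u2,v3,2011-09-01T10:05",
    "u3,v9,2011-09-01T10:07",
]
# sorted() stands in for the sort the framework performs between phases.
result = list(reducer(sorted(mapper(checkins))))
print(result)
```

The reducer relies on `groupby`, which only merges adjacent equal keys, so it is correct only because the input arrives sorted.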
How big is our data?
• Global reach (North Pole, Space)
• Native app for almost every smartphone, SMS, web, mobile-web
• 10M+ users, 15M+ venues, ~1B check-ins
• Terabytes of log data
Our Stack
Computing venue-to-venue similarity
• Spin up 40-node cluster
• Submit Ruby streaming job
– Invert User x Venue matrix
– Grab Co-occurrences
– Compute similarity
• Spin down cluster
• Load data to app server
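The three job steps above (invert the matrix, grab co-occurrences, compute similarity) can be sketched in a few lines. The slide does not say which similarity measure foursquare used, and their job was a Ruby streaming job, so everything here is an assumption for illustration: the `venue_similarity` function, the sample check-ins, and the choice of cosine similarity over binary visit vectors.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def venue_similarity(checkins):
    """Return {(venue_a, venue_b): cosine similarity} from (user, venue) pairs."""
    # Step 1: index the User x Venue matrix both ways
    # (venues per user and users per venue).
    venues_by_user = defaultdict(set)
    visitors = defaultdict(set)
    for user, venue in checkins:
        venues_by_user[user].add(venue)
        visitors[venue].add(user)

    # Step 2: count the users who visited both venues of each pair.
    co_counts = defaultdict(int)
    for venues in venues_by_user.values():
        for a, b in combinations(sorted(venues), 2):
            co_counts[(a, b)] += 1

    # Step 3: cosine similarity over binary visit vectors (assumed measure).
    return {
        (a, b): shared / sqrt(len(visitors[a]) * len(visitors[b]))
        for (a, b), shared in co_counts.items()
    }


sample = [
    ("ann", "cafe"), ("ann", "gym"),
    ("bob", "cafe"), ("bob", "gym"), ("bob", "park"),
    ("cat", "cafe"),
]
sims = venue_similarity(sample)
for pair, score in sorted(sims.items()):
    print(pair, round(score, 3))
```

Pairs are stored with the venue names in sorted order, so each pair appears once. Pairs with no shared visitors are simply absent from the result rather than stored as zero.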
Who is checking in?
What are people doing?
Where are our users?
When do people go to a place?
Thursday Friday Saturday Sunday
Why are people checking in?
• Explore their city, discover new places
• Find friends, meet up
• Save with local deals
• Get insider tips on venues
• Personal analytics, diary
• Follow brands and celebrities
• Earn points, badges, gamification of life
• The list grows…
How can we leverage these insights?
Join us!
foursquare is hiring
www.foursquare.com/jobs
Justin Moore
@injust
justin@foursquare.com
http://aws.amazon.com/elasticmapreduce/

