Data + Algorithms = Knowledge




Facebook Analytics


                  With Elastic Map/Reduce
                      – a Hands-on Workshop

                                            November 12, 2012
                                        J Singh, DataThinks.org




Take-away Messages

• Map Reduce is simple, Hadoop is one implementation of MR…
   – …made even simpler by services like Elastic Map Reduce


• But Map Reduce requires a different style of programming…
   – …and a different set of techniques for debugging


• Facebook data can get big very quickly…
   – …and storage and bandwidth costs can dominate your solution


• Analytics is an iterative (agile) process…
   – …each iteration requires evaluating results, and tuning the algorithms,
     possibly the acquisition of more data

                       © J Singh, 2012                                  2
Signing Up for AWS

The steps required to obtain an AWS account
   • Create an AWS account (http://aws.amazon.com).
    –   http://www.slideshare.net/AmazonWebServices/video-how-to-sign-up-for-amazon-web-services-8700872
    –   Requires a valid credit card and phone-based identification.
   • Sign in to the AWS Management Console
    – http://aws.amazon.com/console




Elastic Map Reduce Resources

• Summary of the offering

• Elastic MapReduce Training

• Getting Started Guide

• Developers Guide




MapReduce Conceptual Underpinnings

• Based on Functional Programming model
   – From Lisp
       • (map square '(1 2 3 4))   → (1 4 9 16)
       • (reduce plus '(1 4 9 16)) → 30
   – From APL
       • +/ N   where N ← 1 2 3 4


• Easy to distribute (based on each element of the vector)

• New for Map/Reduce: Nice failure/retry semantics
   – Matters when hundreds or thousands of low-end servers are
     running at the same time
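
As a minimal sketch, the two Lisp calls above look like this in Python. The names `square`, `numbers`, `squares`, and `total` are illustrative only.

```python
from functools import reduce
from operator import add

def square(x):
    return x * x

numbers = [1, 2, 3, 4]

# map: apply the function to each element independently (easy to distribute)
squares = list(map(square, numbers))

# reduce: fold the mapped values into a single result
total = reduce(add, squares)

print(squares)  # [1, 4, 9, 16]
print(total)    # 30
```

Each call to `square` depends on one element only, which is why the map step can be spread across many machines.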
MapReduce Flow




Elastic Map Reduce – Summary

• Hadoop installed and maintained by Amazon
   – We can focus on programming
   – Offers a few options on map and reduce programs

• Streaming
   – Map and Reduce programs
     connect through stdin and
     stdout
   – Allows Map and Reduce to be
     written in any language
• Hive, Pig
   – Translates to Map/Reduce JARs
   – Can cascade M/R pipelines
• Custom JAR – for special cases
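
A rough sketch of the Streaming option, assuming a word count: the mapper and reducer are plain functions over lines of text, which is all a streaming program sees on stdin. The function names are hypothetical, and `sorted()` stands in for the sort that Hadoop performs between the two steps.

```python
from itertools import groupby

def map_words(lines):
    """Mapper: emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield word.lower() + "\t1"

def reduce_counts(sorted_lines):
    """Reducer: input arrives sorted by key, so equal words are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield word + "\t" + str(total)

# Local dry run; on a cluster each function would read sys.stdin instead.
mapped = sorted(map_words(["the cat", "The hat"]))
counts = list(reduce_counts(mapped))
print(counts)
```

Trying the pair locally like this is a cheap check before paying for a cluster run.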

Elastic Map Reduce – Architecture

• Starting with data in S3

• EMR Service initiates the job
• Hadoop Master coordinates
  operation
• Slave nodes are initiated and
  data loaded into them
• Extra nodes can be invoked if
  needed

• Results are copied back into S3
   – Nodes are destroyed

Elastic Map Reduce – Word Count

• Use the AWS Management Console >> Elastic MapReduce
  – Define Job Flow
      • Hadoop Version 1.0.3
      • Run your own application
          – Streaming
  – Specify Parameters
      • For input files,
        elasticmapreduce/samples/wordcount/input
      • For output files, you need to define your own S3 bucket
          – In a separate browser tab, AWS Management Console >> S3
          – Bucket names can include lowercase letters, numbers, periods, and dashes
      • Mapper code can be seen at http://goo.gl/EbCme
          – Copy this code to one of your buckets
          – Specify path <your-bucket>/wordSplitter.py
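
A quick way to check a bucket name before creating it, as a sketch. The pattern assumes the commonly documented S3 rules (3 to 63 characters, starting and ending with a letter or digit) and does not cover every rule, such as names shaped like IP addresses. The function name is hypothetical.

```python
import re

# Lowercase letters, digits, periods and dashes; 3 to 63 characters;
# first and last character must be a letter or digit.
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")

def looks_like_valid_bucket_name(name):
    """Return True if the name passes this simplified check."""
    return bool(_BUCKET_RE.match(name))

print(looks_like_valid_bucket_name("my-bucket.2012"))
print(looks_like_valid_bucket_name("MyBucket"))
```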
Elastic Map Reduce – Word Count (p2)

• Configure EC2 Instances
• Advanced Options
   – Optional: Amazon EC2 Key Pair
       • To log into the master and make changes to a running job
          – E.g., add extra nodes to speed up processing
   – Amazon S3 Log Path
       • <your-bucket>/log-2012-11-12--19-30
• Accept all other defaults and go!
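
The log path above follows a date and time pattern. A small sketch that builds such a path for any run, so each job flow gets its own log folder; `log_path` is a hypothetical helper.

```python
from datetime import datetime

def log_path(bucket, when):
    """Build '<bucket>/log-YYYY-MM-DD--HH-MM' for the given start time."""
    return bucket + "/" + when.strftime("log-%Y-%m-%d--%H-%M")

print(log_path("your-bucket", datetime(2012, 11, 12, 19, 30)))
# your-bucket/log-2012-11-12--19-30
```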




Monitoring Operation

• AWS Management Console provides a view into the
  operation




  – These screen-shots were taken at minute 27 of a 30-minute
    run
  – Configuration default in this case was for 2 map slots
  – First slot became available at 12:00, second around 12:10

Elastic Map Reduce – Debugging

• AWS console and the log files provide clues on what went
  wrong and how to fix it

• Make a change that will break the operation and examine
  the AWS console to find the error you introduced
   – Introduce a parsing error in the mapper program
   – Uncomment these lines to have it raise an exception
                 import random
                 # randint(0, 1000) can return 0, so about 1 call in 1001 raises ZeroDivisionError
                 x = 1 / random.randint(0,1000)
   – Save the file to an S3 bucket and run
   – Can you find where EMR reveals what happened?
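
One technique that may make such failures easier to find: have the mapper write the offending input line to stderr before it fails, since Hadoop Streaming typically keeps each task's stderr in its logs. This is a sketch under that assumption; `run_mapper` and `flaky_split` are hypothetical names, not the workshop's code.

```python
import io
import sys

def run_mapper(lines, process, out=sys.stdout, err=sys.stderr):
    """Apply process(line) to each line; report the failing line on stderr."""
    for number, line in enumerate(lines, start=1):
        try:
            for key, value in process(line):
                out.write(f"{key}\t{value}\n")
        except Exception as exc:
            err.write(f"mapper failed on line {number} ({line!r}): {exc!r}\n")
            raise  # re-raise so the task still fails visibly

def flaky_split(line):
    """Stand-in for a broken mapper: fails on lines with no words."""
    words = line.split()
    1 / len(words)  # ZeroDivisionError when the line is empty
    return [(word.lower(), 1) for word in words]

# Local dry run with in-memory streams instead of stdout and stderr.
out, err = io.StringIO(), io.StringIO()
try:
    run_mapper(["hello world", ""], flaky_split, out, err)
except ZeroDivisionError:
    pass
print(err.getvalue(), end="")
```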


Facebook Analytics – Summary

• Extend the architecture
   – Import Facebook data into S3
   – Change Map Reduce programs as required




Facebook Analytics – Observations

• Fetching and staging data is the real challenge in putting
  together an analytics solution
   – For unstructured data, it requires
       • An understanding of the data model at the source
       • Custom code to read it


   – For structured data, consider Pig/Hive (higher-level Hadoop
     components)
       • Pig/Hive can read/write tables formatted as CSV/TSV files in S3
          – Either we need to bring files into S3
          – Or point Pig/Hive at a JDBC connection
       • An opportunity to rethink the ETL pipeline?


Facebook Analytics – Data Collection

• The exercise is based on everyone's Facebook data
• Log into http://apps.facebook.com/map-reduce-workshop
   – Requires permission to get
       • Information about you,
       • Your friends,
       • Your likes, your friends' likes.
   – Randomly selects 10 of those friends
   – Randomly selects 25 of their likes
   – Anonymizes your friends' Facebook IDs before storing into
     S3
• All data, even though opaque, will be deleted at the end of
 the workshop
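
The sampling and anonymizing steps could look roughly like this. The slides do not say how IDs are anonymized, so the salted SHA-256 hash here is an assumption, as are the names `anonymize` and `collect`. The sketch takes up to 25 likes per selected friend.

```python
import hashlib
import random

def anonymize(facebook_id, salt):
    """One-way salted hash of an ID (assumed scheme, not the workshop's)."""
    digest = hashlib.sha256((salt + facebook_id).encode("utf-8"))
    return digest.hexdigest()[:16]

def collect(friend_likes, salt, max_friends=10, max_likes=25, rng=random):
    """friend_likes maps a friend's ID to that friend's list of like IDs."""
    friends = rng.sample(sorted(friend_likes), min(max_friends, len(friend_likes)))
    records = {}
    for friend in friends:
        likes = friend_likes[friend]
        records[anonymize(friend, salt)] = rng.sample(likes, min(max_likes, len(likes)))
    return records

# Made-up data: 20 friends with 40 likes each.
fake = {str(i): [str(j) for j in range(40)] for i in range(20)}
records = collect(fake, salt="workshop", rng=random.Random(0))
print(len(records))
```

Only the hashed IDs appear as keys in `records`, so the stored data cannot be matched to a friend without the salt.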

Facebook Analytics – Data Collected




Original = 75   Friends = 750        Likes = up to about 20,000

• Each user record shows anonymized user ID and their likes
   –   4110002004281   ['21506845769', '345722385482735', '93433060687']
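
A record in this shape can be parsed with the standard library. A sketch, assuming the user ID and the list are separated by whitespace and the list is written as a Python literal, as in the sample above; `parse_record` is a hypothetical helper.

```python
import ast

def parse_record(line):
    """Split 'user_id  [like, like, ...]' into (user_id, list_of_likes)."""
    user_id, likes_text = line.split(None, 1)
    # literal_eval only accepts literals, so it is safe on untrusted text
    likes = ast.literal_eval(likes_text.strip())
    return user_id, likes

sample = "4110002004281   ['21506845769', '345722385482735', '93433060687']"
print(parse_record(sample))
```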




Facebook Analytics – Likes Count

• Use the AWS Management Console >> Elastic MapReduce
  – Define Job Flow
      • Hadoop Version 1.0.3
      • Run Your Own Application
         – Streaming
  – Specify Parameters
      • For input files, use bucket datathinks-users
      • For output files, you need to define your own S3 bucket
         – In a separate browser tab, AWS Management Console >> S3
      • Mapper: copy goo.gl/PcLK4 into a bucket you own
  – Advanced options:
      • Choose a fresh log file location
  – Accept all other defaults and go!
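
The workshop's mapper is at the link above and is not reproduced here. As a hedged sketch of what a likes count involves: the mapper emits one pair per like, and the counts are summed per like ID. The function names are hypothetical and the record layout is assumed to be "user ID, whitespace, Python list".

```python
import ast
from collections import Counter

def likes_mapper(lines):
    """Emit (like_id, 1) for every like in every user record."""
    for line in lines:
        parts = line.split(None, 1)
        if len(parts) < 2:
            continue  # skip blank lines and users with no likes field
        for like_id in ast.literal_eval(parts[1].strip()):
            yield like_id, 1

def count_likes(pairs):
    """Reducer equivalent: sum the counts per like ID."""
    counts = Counter()
    for like_id, n in pairs:
        counts[like_id] += n
    return counts

records = ["u1\t['a', 'b']", "u2\t['a']", ""]
counts = count_likes(likes_mapper(records))
print(counts["a"], counts["b"])
```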
Viewing the Results

• The results of Data Analysis are available in S3.
   – Partial example:     139784736075551      1
                          140413412750046      6
                          184331976202         3
                          220854914702193      1
                          29092950651          1


• How to interpret the results.
   – Sort by frequency, then examine most frequent likes
       • 140413412750046 is cryptic
       • But http://www.facebook.com/pages/w/140413412750046
         reveals what it is (DataThinks)
• Requires further action: what to do with the results?
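
Sorting the output by frequency takes a few lines once the result files are downloaded from S3. A sketch, assuming each line holds a like ID and a count separated by whitespace as in the partial example above; `most_frequent` is a hypothetical helper.

```python
def most_frequent(lines, top=3):
    """Return the `top` (like_id, count) rows, highest count first."""
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            rows.append((parts[0], int(parts[1])))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:top]

sample = [
    "139784736075551\t1",
    "140413412750046\t6",
    "184331976202\t3",
    "220854914702193\t1",
    "29092950651\t1",
]
print(most_frequent(sample, top=2))
# [('140413412750046', 6), ('184331976202', 3)]
```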
Algorithm Discussion

• An algorithm based on exact matches for likes may be
  too restrictive
  – 'Ella Fitzgerald' != 'Duke Ellington'
  – But people who like Ella Fitzgerald may be reachable the
    same way as people who like Duke Ellington

  – An idea to explore further:
      • Is there a way to find IDs that we might consider equivalent?
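
One possible way to explore that question, offered as an assumption rather than the workshop's method: treat two likes as near-equivalent when largely the same users like both, measured by Jaccard similarity. All names below are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def users_by_like(records):
    """Invert (user_id, likes) records into like_id -> set of user IDs."""
    index = defaultdict(set)
    for user_id, likes in records:
        for like_id in likes:
            index[like_id].add(user_id)
    return index

def similar_pairs(records, threshold=0.5):
    """Return (like_a, like_b, jaccard) for pairs at or above the threshold."""
    index = users_by_like(records)
    pairs = []
    for a, b in combinations(sorted(index), 2):
        overlap = len(index[a] & index[b])
        union = len(index[a] | index[b])
        score = overlap / union
        if score >= threshold:
            pairs.append((a, b, score))
    return pairs

records = [
    ("u1", ["ella", "duke"]),
    ("u2", ["ella", "duke"]),
    ("u3", ["ella", "abba"]),
]
pairs = similar_pairs(records)
print(pairs)
```

Comparing every pair of likes is quadratic, so on real data this would need a cheaper candidate-selection step first.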




Data Collected and Embellished




Original = 75   Friends = 750   Likes = 15,000   Similar Likes = 150,000




Extended Facebook Analytics – Summary

• Extend the architecture
   – Get mappers to fetch “similar likes” from the internet
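
A sketch of such a mapper. The slides do not say which service supplies the similar likes, so `fetch_similar` is a hypothetical callable passed in by the caller; here a dictionary lookup stands in for the network call. The cache avoids fetching the same like twice, which matters when bandwidth costs dominate.

```python
def expand_likes(records, fetch_similar):
    """Yield (user_id, like_id) for each like and each of its similar likes."""
    cache = {}
    for user_id, likes in records:
        seen = set()
        for like_id in likes:
            if like_id not in cache:
                cache[like_id] = list(fetch_similar(like_id))
            for candidate in [like_id] + cache[like_id]:
                if candidate not in seen:
                    seen.add(candidate)
                    yield user_id, candidate

# Stand-in for the remote lookup (hypothetical data).
table = {"ella": ["duke"], "duke": ["ella"]}
calls = []

def fake_fetch(like_id):
    calls.append(like_id)
    return table.get(like_id, [])

records = [("u1", ["ella"]), ("u2", ["ella", "duke"])]
expanded = list(expand_likes(records, fake_fetch))
print(expanded)
```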




Facebook Analytics – Showing Results

• The other challenge in putting together an analytics
  solution is displaying results
   – Demo of our results page




Take-away Messages

• Map Reduce is simple, Hadoop is one implementation of MR…
   – …made even simpler by services like Elastic Map Reduce


• But Map Reduce requires a different style of programming…
   – …and a different set of techniques for debugging


• Facebook data can get big very quickly…
   – …and storage and bandwidth costs can dominate your solution


• Analytics is an iterative (agile) process…
   – …each iteration requires evaluating results, and tuning the algorithms,
     possibly the acquisition of more data

Thank you

• J Singh
   – President, Early Stage IT
       • Technology Services and Strategy for Startups


• DataThinks.org is a service of Early Stage IT
   – “Big Data” analytics solutions





