Spotting Hadoop in the wild
Practical use cases from Last.fm and Massive Media

@klbostee
• “Data scientist is a job title for an employee who analyses data, particularly large amounts of it, to help a business gain a competitive edge” —WhatIs.com
• “Someone who can obtain, scrub, explore, model and interpret data, blending hacking, statistics and machine learning” —Hilary Mason, bit.ly
• 2007: Started using Hadoop as PhD student
• 2009: Data & Scalability Engineer at Last.fm
• 2011: Data Scientist at Massive Media
• Created Dumbo, a Python API for Hadoop (see the sketch below)
• Contributed some code to Hadoop itself
• Organized several HUGUK meetups
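Since Dumbo comes up here, a minimal word count in its style may help. The program shape (mapper and reducer as plain Python generators passed to dumbo.run) follows Dumbo's documented API; the example itself is illustrative, not from the deck.

```python
# Minimal Dumbo word count: mappers and reducers are plain Python
# generators, and dumbo.run() wires them into a Hadoop Streaming job.
def mapper(key, value):
    for word in value.split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)
```

You would launch it with something like `dumbo start wordcount.py -hadoop /usr/lib/hadoop -input books.txt -output counts` (paths hypothetical).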
What are those yellow things?




Core principles

• Distributed
• Fault tolerant
• Sequential reads and writes
• Data locality
Pars pro toto

[Stack diagram: Pig and Hive on top; HBase and ZooKeeper below them; MapReduce beneath; HDFS at the bottom]

Hadoop itself is basically the kernel that provides a file system and task scheduler.
Hadoop file system

[Diagram: files A and B are split into Hadoop blocks, much larger than Linux blocks, and each block is replicated across several DataNodes]

No random writes!
Hadoop task scheduler

[Diagram: jobs A and B are split into tasks, which TaskTrackers run on the same nodes as the DataNodes holding the input blocks]
Some practical tips

• Install a distribution
• Use compression (see the sketch after this list)
• Consider increasing your block size
• Watch out for small files
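As a hedged illustration of the compression and block size tips: a sketch that launches a Hadoop Streaming job with both settings. The property names (mapred.output.compress, dfs.block.size) are the Hadoop 1.x-era ones and differ in newer releases; all paths are hypothetical.

```python
# Launch a Hadoop Streaming job with compressed output and larger blocks.
# Property names are 1.x-era; adjust for your Hadoop version.
import subprocess

subprocess.check_call([
    "hadoop", "jar", "/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar",
    # Compress job output to cut storage and I/O on HDFS.
    "-D", "mapred.output.compress=true",
    "-D", "mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec",
    # Write output with 256 MB blocks instead of the old 64 MB default.
    "-D", "dfs.block.size=268435456",
    "-input", "/logs/raw",              # hypothetical paths
    "-output", "/logs/counts",
    "-mapper", "mapper.py",
    "-reducer", "reducer.py",
    "-file", "mapper.py",
    "-file", "reducer.py",
])
```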
HBase

[Stack diagram again: Pig, Hive, HBase, ZooKeeper, MapReduce and HDFS, with HBase highlighted]

HBase is a database on top of HDFS that can easily be accessed from MapReduce.
Data model

[Table diagram: sorted row keys against columns X and Y in column family A and columns U and V in column family B]

• Configurable number of versions per cell (see the sketch below)
• Each cell version has a timestamp
• TTL can be specified per column family
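A minimal sketch of this data model using the happybase Python client (my choice of client; the slides don't name one). Host, table, and column names are hypothetical.

```python
# Versions and TTL in HBase's data model, via happybase.
import happybase

connection = happybase.Connection("hbase-host")  # hypothetical host

# Column family A keeps 3 versions per cell; family B expires cells
# after one day via a TTL.
connection.create_table("metrics", {
    "A": dict(max_versions=3),
    "B": dict(time_to_live=86400),
})

table = connection.table("metrics")
table.put(b"row-1", {b"A:x": b"42"})  # HBase timestamps the new version

# Read back up to 3 timestamped versions of the same cell.
for value, timestamp in table.cells(b"row-1", b"A:x", versions=3,
                                    include_timestamp=True):
    print(timestamp, value)
```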
Random becomes sequential

[Diagram: incoming KeyValues are appended to a commit log with a sequential write and collected in a sorted memstore, which is flushed to HDFS with another sequential write]

High write throughput!
+ efficient scans
+ free empty cells
+ no fragmentation
+ ...
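The mechanics in the diagram are easy to mimic in a few lines. A toy Python sketch of this write path (purely illustrative, not HBase internals):

```python
# Toy LSM-style write path: every put is one sequential append to a commit
# log plus an in-memory insert; a full memstore is flushed to an immutable
# sorted file with one more sequential write. No random disk writes anywhere.
class ToyRegion:
    def __init__(self, log_path, flush_threshold=4):
        self.log = open(log_path, "ab")   # commit log, append-only
        self.memstore = {}                # key -> value, sorted at flush time
        self.flush_threshold = flush_threshold
        self.flushed_files = 0

    def put(self, key, value):
        self.log.write(key + b"\t" + value + b"\n")  # sequential write #1
        self.log.flush()
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        name = "storefile-%d" % self.flushed_files
        with open(name, "wb") as f:       # sequential write #2
            for key in sorted(self.memstore):
                f.write(key + b"\t" + self.memstore[key] + b"\n")
        self.memstore.clear()
        self.flushed_files += 1
```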
Horizontal scaling

[Diagram: the sorted row key space is split into regions, which are spread across RegionServers (a toy lookup sketch follows below)]

• Each region has its own commit log and memstores
• Moving regions is easy since the data is all in HDFS
• Strong consistency as each region is served only once
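Because the key space is sorted, finding the region for a row key is just a binary search over the regions' start keys. A toy sketch, with made-up split points and server names:

```python
# Each region owns a half-open key range; locate a key's region by
# binary search over the sorted region start keys. Purely illustrative.
import bisect

region_starts = [b"", b"g", b"n", b"t"]        # b"" = the very first region
region_servers = ["rs1", "rs2", "rs3", "rs1"]  # a server can host many regions

def region_for(row_key):
    index = bisect.bisect_right(region_starts, row_key) - 1
    return region_servers[index]

print(region_for(b"hadoop"))  # -> rs2, range [g, n)
print(region_for(b"zebra"))   # -> rs1, range [t, ...)
```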
Some practical tips

• Restrict the number of regions per server
• Restrict the number of column families
• Use compression
• Increase file descriptor limits on nodes
• Use a large enough buffer when scanning (see the sketch below)
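A hedged illustration of the scan-buffer tip, again with the happybase Python client (an assumption; the slides don't name a client). Its batch_size parameter controls how many rows each round trip to the RegionServer fetches; too small a buffer turns a long scan into one RPC per handful of rows.

```python
# Scan a contiguous key range with a generously sized row buffer.
import happybase

connection = happybase.Connection("hbase-host")  # hypothetical host
table = connection.table("metrics")              # hypothetical table

for row_key, columns in table.scan(row_prefix=b"en||be|||",
                                   batch_size=1000):
    print(row_key, columns)
```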
Look, a herd of Hadoops!
• “Last.fm lets you effortlessly keep a record of what you listen to from any player. Based on your taste, Last.fm recommends you more music and concerts” —Last.fm
• Over 60 billion tracks scrobbled since 2003
• Started using Hadoop in 2006, before Yahoo
• “Massive Media is the social media company behind the successful digital brands Netlog.com and Twoo.com. We enable members to meet nearby people instantly” —MassiveMedia.eu
• Over 80 million users on web and mobile
• Using Hadoop for about a year now
Hadoop adoption

                                  Last.fm   Massive Media
1. Business intelligence             √           √
2. Testing and experimentation       √           √
3. Fraud and abuse detection         √           √
4. Product features                  √           √
5. PR and marketing                  √
Business intelligence

Testing and experimentation

Fraud and abuse detection

Product features

PR and marketing
Let’s dive into the first use case!
Goals and requirements

• Timeseries graphs of 1000 or so metrics
• Segmented over about 10 dimensions

1. Scale with a very large number of events
2. History for graphs must be long enough
3. Accessing the graphs must be instantaneous
4. Possibility to analyse in detail when needed
Attempt #1

• Log table in MySQL
• Generate graphs from this table on-the-fly

1. Large number of events      √
2. Long enough history         ✗
3. Instantaneous access        ✗
4. Analyse in detail           √
Attempt #2

• Counters in MySQL table
• Update counters on every event

1. Large number of events      ✗
2. Long enough history         √
3. Instantaneous access        √
4. Analyse in detail           ✗
Attempt #3

• Put log files in HDFS through syslog-ng
• MapReduce on logs and write to HBase

1. Large number of events      √
2. Long enough history         √
3. Instantaneous access        √
4. Analyse in detail           √
Architecture

[Diagram: syslog-ng writes log files into HDFS; MapReduce jobs turn them into counters in HBase; a realtime processing path also feeds HBase, and ad-hoc results come straight out of MapReduce]
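A hedged sketch of the MapReduce step in this pipeline, in the Dumbo style mentioned earlier. The log line layout (tab-separated timestamp and metric) and the hour bucketing are assumptions, not details from the deck.

```python
# Count events per (metric, hour) from log lines shipped by syslog-ng.
# Assumed line layout: "timestamp<TAB>metric<TAB>rest...".
def mapper(key, value):
    fields = value.split("\t")
    timestamp, metric = fields[0], fields[1]
    yield (metric, timestamp[:13]), 1   # bucket by hour, e.g. "2012-01-12T15"

def reducer(key, values):
    yield key, sum(values)              # one counter per metric and hour

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)
```

A follow-up step (or the same job with an HBase output format) would then write these counters into the metrics table.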
HBase schema

• Separate table for each time granularity
• Global segmentations in row keys (see the sketch after this list)
  • <language>||<country>||...|||<timestamp>
  • * for “not specified”
  • trailing *s are omitted
• Further segmentations in column keys
  • e.g. payments_via_paypal, payments_via_sms
• Related metrics in same column family
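A small sketch of that row key scheme: global segments joined with "||", "*" for unspecified segments, trailing "*" segments dropped, and "|||" before the timestamp. Anything beyond what the slide states (the two example dimensions, the timestamp format) is an assumption.

```python
# Build a row key of the form <language>||<country>||...|||<timestamp>.
def row_key(timestamp, language="*", country="*"):
    segments = [language, country]
    while segments and segments[-1] == "*":   # trailing *s are omitted
        segments.pop()
    return "||".join(segments) + "|||" + timestamp

print(row_key("2012-01-12T15", language="en", country="be"))
# -> en||be|||2012-01-12T15
print(row_key("2012-01-12T15", language="en"))
# -> en|||2012-01-12T15  (country "*" dropped)
```

Because row keys are sorted, all timestamps for one segment combination sit next to each other, so a timeseries graph is a single contiguous scan.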
Questions?


