Hadoop at Musicmetric

     Dr Jameel Syed
         April 2012
Music has moved online
• The world has changed
  –   Do you buy vinyl/tapes/CDs of music?
  –   Do you buy music downloads?
  –   Do you download illegal content from BitTorrent?
  –   Do you listen to music on YouTube?
  –   Do you “like” bands on Facebook?
  –   Do you subscribe to Spotify?
  –   Do you listen to the weekly charts on the radio on a Sunday afternoon?
• What’s happening online?
How popular am I?
Who are my fans?
Where are my fans?
What is the press saying?
Who is popular?
Data Science in the Music Industry
• Raw Data
    – Social media/networks (Facebook, YouTube,
      Twitter, Last.fm...)
    – BitTorrent
    – Online reviews
• Raw Data -> Derived Data -> Insight
    – Who is popular right now/in the immediate
      future?
    – What was the effect of appearing at a festival?
    – Which artists are (becoming) popular with
      listeners with certain demographics (in a
      region)?
• Data processing, machine learning &
  statistical methods
    –   Sentiment analysis
    –   Named Entity Recognition
    –   Ranking
    –   Segmentation
Data Pipeline - Overview

Raw Data -> Data Processing (Anomaly Detection -> Aggregation) -> Key-Value Store -> API -> Web Application
• Engineering approach
  – KISS
  – Decoupled components
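A minimal sketch of what decoupled pipeline stages could look like, with each stage as a plain function that only knows its input and output. The median-based outlier rule, the record layout `(artist, hour, value)`, the `threshold` parameter and the dict standing in for the key-value store are all assumptions for illustration; the slides do not say how anomaly detection or aggregation were implemented.

```python
from collections import defaultdict
from statistics import median


def detect_anomalies(records, threshold=5.0):
    """Yield only records whose value is at most `threshold` times the
    per-artist median (a stand-in rule, not the method from the talk)."""
    by_artist = defaultdict(list)
    for artist, _hour, value in records:
        by_artist[artist].append(value)
    medians = {artist: median(values) for artist, values in by_artist.items()}
    for artist, hour, value in records:
        m = medians[artist]
        if m == 0 or value <= threshold * m:
            yield artist, hour, value


def aggregate(records):
    """Sum the values per artist."""
    totals = defaultdict(int)
    for artist, _hour, value in records:
        totals[artist] += value
    return dict(totals)


def load_into_store(totals, store):
    """Write aggregates into a dict that stands in for the key-value store."""
    store.update(totals)
    return store


# Hypothetical raw data: (artist, hour, value)
raw = [
    ("artist_a", 0, 10),
    ("artist_a", 1, 12),
    ("artist_a", 2, 900),  # spike that the stand-in rule filters out
    ("artist_b", 0, 3),
    ("artist_b", 1, 4),
]

store = load_into_store(aggregate(detect_anomalies(raw)), {})
print(store)
```

Because each stage takes and returns simple data, any one of them can be replaced or rerun without touching the others.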
Data Pipeline - Input

Raw Data -> Data Processing (Anomaly Detection -> Aggregation) -> Key-Value Store -> API -> Web Application
• Input
  – Distributed data collection from public internet
    sources
      • Real-time system constraints: 24/7 hourly data
      • Changing format, scope
  – Customers providing private data feeds
      • e.g. sales and streaming data
Data Pipeline - Output

Raw Data -> Data Processing (Anomaly Detection -> Aggregation) -> Key-Value Store -> API -> Web Application
• Output
  – Sparse data requests about hundreds of thousands of artists
  – Timeliness
  – Lots of combinations (by country/city, by release/track,
    diff/cumulative, hourly/daily/weekly, charts…)
  – Need to reprocess over EVERYTHING (new metadata, re-delivery of data, anomaly detection)
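To make two of the output combinations concrete (diff vs cumulative, hourly vs daily), here is a small sketch. It assumes the source delivers cumulative counts as time-ordered `(timestamp, count)` pairs; the function names and sample values are hypothetical and not taken from the talk.

```python
from collections import defaultdict
from datetime import datetime


def to_diffs(cumulative):
    """Turn time-ordered (timestamp, cumulative_count) pairs into
    per-interval increases. The first reading has no predecessor and is skipped."""
    diffs = []
    previous = None
    for ts, total in cumulative:
        if previous is not None:
            diffs.append((ts, total - previous))
        previous = total
    return diffs


def roll_up_daily(hourly):
    """Sum hourly (timestamp, value) pairs into per-day totals keyed by ISO date."""
    daily = defaultdict(int)
    for ts, value in hourly:
        daily[ts.date().isoformat()] += value
    return dict(daily)


# Hypothetical cumulative hourly counts for one artist
hourly_cumulative = [
    (datetime(2012, 4, 1, 22), 100),
    (datetime(2012, 4, 1, 23), 130),
    (datetime(2012, 4, 2, 0), 150),
    (datetime(2012, 4, 2, 1), 155),
]

hourly_diffs = to_diffs(hourly_cumulative)
daily_diffs = roll_up_daily(hourly_diffs)
print(daily_diffs)
```

Every extra dimension (country/city, release/track, weekly) multiplies the number of such series, which is one reason a full reprocess is expensive.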
Why Hadoop?
• Outgrew initial solution for data processing
  over existing data
  – How long should daily processing take?
  – I/O (disk seeks)
• Additional data
  – BitTorrent scale-up
  – iTunes sales
  – Spotify plays
Hadoop Cluster
•    12 physical servers + 2 KVM virtual machines
•    Cloudera CDH3/Ubuntu 10.04 LTS
•    2x Quad Core Xeon E5620 2.4 GHz (HT, 32nm)
•    24 GB RAM, 4x 2 TB WD
•    Gigabit Ethernet (no link aggregation yet)
•    ~2.5 kW (max 4 kW)

•    Hosts (private Hadoop network)
       – mm-addax: Edge Server
       – mm-impala, mm-gazelle: NFS Server, DHCP/PXE/DNS
       – mm-rhino-01: Primary Name Node, Job Tracker, ZooKeeper
       – mm-rhino-02: Secondary Name Node
       – mm-rhino-03 … mm-rhino-11: Data Node 01 … Data Node 09
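A rough back-of-the-envelope estimate of HDFS capacity for this cluster. It assumes the 4x 2TB disk spec applies to each of the nine data nodes, that all of that disk space is given to HDFS, and that the replication factor is the HDFS default of 3; none of this is stated on the slide, so treat the result as an upper bound rather than a measured figure.

```python
DATA_NODES = 9       # Data Node 01 to Data Node 09 in the diagram
DISKS_PER_NODE = 4   # "4x 2TB WD", assumed to be per server
DISK_TB = 2
REPLICATION = 3      # assumption: HDFS default dfs.replication

# Total raw disk across the data nodes
raw_tb = DATA_NODES * DISKS_PER_NODE * DISK_TB

# Each block is stored REPLICATION times, so unique data capacity is lower
usable_tb = raw_tb / REPLICATION

print(f"raw: {raw_tb} TB, usable at {REPLICATION}x replication: {usable_tb:.0f} TB")
```

In practice the usable figure would be lower still, since the operating system, logs and MapReduce scratch space also need disk.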
Data Storage & Processing
Private Data, Public Data -> Hadoop (Raw data -> Processed -> Time series) -> Voldemort
Steps: Push To Hadoop -> Preprocess -> Generate timeseries -> HDFS to KVS
RabbitMQ: To Hadoop, Preprocess, Timeseries, To_KVS
•   E.g. per 1 TB of BitTorrent input data:
    – Pre-processed: 200 GB
    – Raw time series: 37 GB
    – Filtered/artist data: 2.5 GB
    – KVS: 1.9 GB
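The stage sizes above imply a steep reduction from raw input to what is served. A short sketch of the arithmetic, assuming 1 TB = 1000 GB (decimal units); the sizes are the ones on the slide, while the function name is only illustrative.

```python
# Sizes per 1 TB of BitTorrent input, in GB, as given on the slide
stages_gb = [
    ("BitTorrent input", 1000.0),  # assumption: 1 TB = 1000 GB
    ("Pre-processed", 200.0),
    ("Raw time series", 37.0),
    ("Filtered/artist data", 2.5),
    ("KVS", 1.9),
]


def reduction_factors(stages):
    """Return how many times smaller each stage is than the first one."""
    input_size = stages[0][1]
    return {name: input_size / size for name, size in stages}


factors = reduction_factors(stages_gb)
for name, factor in factors.items():
    print(f"{name}: {factor:.1f}x smaller than the input")
```

Under that assumption the key-value store holds roughly 1/500 of the input volume, with the largest single step between the raw time series and the filtered per-artist data.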
Opportunities
• Hive/Pig/HBase
• Mahout
• Nutch
Open Questions & Challenges
• Organizational readiness
    – Planning
    – Access
    – Experience
• Cluster maintenance
    – Unlikely to replicate production setup
    – 24/7 (ish)
    – What can be switched off when (and is it handled automatically)?
• Resource scheduling
• Workflow
• Amazon EMR vs own hardware?
    – Predictable workload/cost?
    – In for a penny, in for a pound
    – Hotel California
• DBA equivalent on Hadoop? HDA
We are hiring

jobs@musicmetric.com
      @tilapia
