Hadoop at Musicmetric

Dr Jameel Syed
April 2012

Music has moved online
• The world has changed
– Do you buy vinyl/tapes/CDs of music?
– Do you buy music downloads?
– Do you download illegal content from BitTorrent?
– Do you listen to music on YouTube?
– Do you “like” bands on Facebook?
– Do you subscribe to Spotify?
– Do you listen to the weekly charts on the radio on a Sunday afternoon?
• What’s happening online?

Data Science in the Music Industry
• Raw Data
– Social media/networks (Facebook, YouTube,
Twitter, Last.fm...)
– BitTorrent
– Online reviews
• Raw Data -> Derived Data -> Insight
– Who is popular right now/in the immediate
future?
– What was the effect of appearing at a festival?
– Which artists are (becoming) popular with
listeners with certain demographics (in a
region)?
• Data processing, machine learning &
statistical methods
– Sentiment analysis
– Named Entity Recognition
– Ranking
– Segmentation

Data Pipeline - Overview

[Diagram: Raw Data -> Data Processing (Aggregation, Anomaly Detection) -> Key-Value Store -> API -> Web Application]

• Engineering approach
– KISS
– Decoupled components

Data Pipeline - Input

• Input
– Distributed data collection from public internet
sources
• Real-time system constraints: 24/7 hourly data
• Changing format, scope
– Customers providing private data feeds
• e.g. sales and streaming data

Data Pipeline - Output

• Output
– Sparse data requests about hundreds of thousands of artists
– Timeliness
– Lots of combinations (by country/city, by release/track,
diff/cumulative, hourly/daily/weekly, charts…)
– Need to reprocess over EVERYTHING (new metadata, re-delivery of data, anomaly detection)
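
The output combinations above (hourly/daily, diff/cumulative, by country) can be pictured as a map step that re-keys hourly records and a reduce step that sums them. The following is a minimal sketch in plain Python, not the Musicmetric implementation; the record layout, artist names, and function names are all hypothetical.

```python
from collections import defaultdict
from itertools import accumulate

# Hypothetical hourly records: (artist, country, hour timestamp, count).
HOURLY = [
    ("artist_a", "GB", "2012-04-01T10", 5),
    ("artist_a", "GB", "2012-04-01T11", 7),
    ("artist_a", "US", "2012-04-01T11", 2),
    ("artist_a", "GB", "2012-04-02T09", 4),
]

def map_to_daily(record):
    """Map step: key each hourly record by (artist, country, day)."""
    artist, country, hour, count = record
    day = hour.split("T")[0]
    return (artist, country, day), count

def reduce_sum(pairs):
    """Reduce step: sum the counts that share a key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

def cumulative(daily, artist, country):
    """Turn per-day differences for one artist/country into running totals."""
    days = sorted(d for (a, c, d) in daily if a == artist and c == country)
    diffs = [daily[(artist, country, d)] for d in days]
    return list(zip(days, accumulate(diffs)))

daily = reduce_sum(map_to_daily(r) for r in HOURLY)
print(daily[("artist_a", "GB", "2012-04-01")])
print(cumulative(daily, "artist_a", "GB"))
```

Because the daily and cumulative views are derived from the hourly records, reprocessing over everything amounts to re-running the same steps on the full input.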

Why Hadoop?
• Outgrew initial solution for data processing
over existing data
– How long should daily processing take?
– I/O (disk seeks)
• Additional data
– BitTorrent scale-up
– iTunes sales
– Spotify plays

Hadoop Cluster
• 12 physical servers + 2 KVM virtual machines
• Cloudera CDH3/Ubuntu 10.04 LTS
• 2x Quad Core Xeon E5620 2.4GHz (HT, 32nm)
• 24GB RAM, 4x 2TB WD disks
• Gigabit Ethernet (no link aggregation yet)
• ~2.5kW (max 4kW)

[Diagram: cluster layout on a private Hadoop network]
• mm-addax: Edge Server
• mm-rhino-01: Primary Name Node
• mm-rhino-02: Secondary Name Node
• mm-rhino-03 … mm-rhino-11: Data Node 01 … Data Node 09
• Also shown: mm-impala, mm-gazelle; Job Tracker, ZooKeeper, NFS Server, DHCP/PXE/DNS
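
As a rough check on what this hardware offers, the storage figures can be worked through as below. This is a back-of-envelope sketch: it assumes every data node has the stated 4x 2TB disks, that all disk space goes to HDFS, and that the replication factor is the HDFS default of 3, none of which the slides confirm.

```python
DATA_NODES = 9       # Data Node 01 to Data Node 09 in the diagram
DISKS_PER_NODE = 4   # "4x 2TB WD"
DISK_TB = 2
REPLICATION = 3      # assumption: HDFS default replication, not stated in the slides

raw_tb = DATA_NODES * DISKS_PER_NODE * DISK_TB   # total raw disk across data nodes
usable_tb = raw_tb / REPLICATION                 # space left after replication

print(f"raw: {raw_tb}TB, usable at {REPLICATION}x replication: {usable_tb:.0f}TB")
```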

Data Storage & Processing
[Diagram: data flow]
• Inputs: Private Data, Public Data
• Hadoop: Raw data -> Processed -> Time series, then on to Voldemort
• Steps: Push To Hadoop -> Preprocess -> Generate timeseries -> HDFS to KVS
• RabbitMQ: To Hadoop, Preprocess, Timeseries, To_KVS
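
One way to read the diagram is that each RabbitMQ label names a queue that triggers the processing step above it, which keeps the steps decoupled. The sketch below illustrates that reading only: Python's `queue.Queue` stands in for RabbitMQ, and the `run_stage` helper and batch identifier are hypothetical.

```python
import queue

# Queue names follow the slide; queue.Queue is a stand-in for RabbitMQ.
STAGES = ("to_hadoop", "preprocess", "timeseries", "to_kvs")
QUEUES = {name: queue.Queue() for name in STAGES}

# Each stage consumes from its own queue and notifies the next one.
NEXT = {"to_hadoop": "preprocess", "preprocess": "timeseries",
        "timeseries": "to_kvs", "to_kvs": None}

def run_stage(name, log):
    """Drain one queue, record the work done, and notify the next stage."""
    q = QUEUES[name]
    while not q.empty():
        batch = q.get()
        log.append((name, batch))  # the real processing step would run here
        if NEXT[name] is not None:
            QUEUES[NEXT[name]].put(batch)

log = []
QUEUES["to_hadoop"].put("bittorrent/2012-04-01")  # hypothetical batch id
for stage in STAGES:
    run_stage(stage, log)
print(log)
```

Since a stage only knows its input queue and the name of the next one, any step can be stopped, rerun, or replaced without touching the others.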

• E.g. BitTorrent data, per 1TB of input:
– Pre-processed: 200GB
– Raw time series: 37GB
– Filtered/artist data: 2.5GB
– KVS: 1.9GB
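
The reduction at each step is easier to see as a share of the input. A small worked calculation, assuming 1TB is taken as 1000GB:

```python
INPUT_GB = 1000.0  # assumption: 1TB taken as 1000GB

# Stage sizes as quoted on the slide, per 1TB of BitTorrent input.
STAGES = [
    ("Pre-processed", 200.0),
    ("Raw time series", 37.0),
    ("Filtered/artist data", 2.5),
    ("KVS", 1.9),
]

def percent_of_input(size_gb, input_gb=INPUT_GB):
    """Size of a stage's output as a percentage of the raw input."""
    return 100.0 * size_gb / input_gb

for name, size_gb in STAGES:
    print(f"{name}: {size_gb}GB = {percent_of_input(size_gb):.2f}% of input")
```

On these figures the key-value store holds about 0.19% of the input volume, a reduction of roughly 500 to 1.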

Opportunities
• Hive/Pig/HBase
• Mahout
• Nutch

Open Questions & Challenges
• Organizational readiness
– Planning
– Access
– Experience
• Cluster maintenance
– Unlikely to replicate production setup
– 24/7 (ish)
– What can be switched off when (and is it handled automatically)?
• Resource scheduling
• Workflow
• Amazon EMR vs own hardware?
– Predictable workload/cost?
– In for a penny, in for a pound
– Hotel California
• DBA equivalent on Hadoop? HDA

We are hiring

jobs@musicmetric.com
@tilapia
