Tone deaf: finding structure in Last.fm data
Andrei Zhabinski 
hadoop engineer @ adform.com
Data => Algorithm
Data => Features => Algorithm
“raw” features 
name = fry 
gender = male 
age = 1029 
height = 171 
weight = 68 
occupation = delivery boy 
income = $510 
balance = $4.3 billion
transformed / combined features 
log(age) = 6.94 
income^2 = 260100 
height * weight = 11628
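A minimal sketch of how the transformed features above could be derived from the raw ones. The dictionary layout and variable names are assumptions made for illustration; log is taken as the natural logarithm, which matches the 6.94 shown.

```python
import math

# Raw features from the slide (the dict layout is an illustrative assumption).
raw = {"age": 1029, "height": 171, "weight": 68, "income": 510}

# Transformed / combined features.
transformed = {
    "log(age)": round(math.log(raw["age"]), 2),        # natural logarithm
    "income^2": raw["income"] ** 2,
    "height * weight": raw["height"] * raw["weight"],
}

for name, value in transformed.items():
    print(f"{name} = {value}")
```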
expert knowledge 
(domain specific) 
planet(GeoIP) = Earth 
working_hours(delivery boy) = 8
expert knowledge is really useful 
Part-of-Speech Tags 
Stemming/Lemmatization 
Gabor filters 
Haar wavelets 
SURF descriptors
but what if I’m not an expert?
Helps to find structure in “raw” data
Clustering
Principal Component Analysis
Autoencoders
RBM
consider image data 
raw pixel matrix => structured patterns
deep learning
Images are too mainstream!
Last.Fm dataset 360K 
(http://mtg.upf.edu/node/1671)
● userid, artname, plays 
● data from 2001-2009 
● 360K unique users 
● 290K unique artists 
● 17M records
Most popular
     artname                  audience
 1   radiohead                77348
 2   the beatles              76339
 3   coldplay                 66738
 4   red hot chili peppers    48989
 5   muse                     47015
 6   metallica                45301
 7   pink floyd               44506
 8   the killers              41280
 9   linkin park              39833
10   nirvana                  39534
# of plays statistics 
julia> describe(dataset) 
plays 
Min 0.0 
1st Qu. 35.0 
Median 94.0 
Mean 215.18519944356137 
3rd Qu. 224.0 
Max 419157.0 
NAs 0 
NA% 0.0%
419157 * 3 min = 873 days
2.39 years
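The arithmetic behind this estimate, as a short sketch. The 3-minute track length is the slide's own rough assumption, and the constant names are made up for illustration.

```python
MAX_PLAYS = 419157        # maximum of the plays column
MINUTES_PER_TRACK = 3     # rough average track length assumed on the slide

total_minutes = MAX_PLAYS * MINUTES_PER_TRACK
days = total_minutes / (60 * 24)   # minutes => days
years = days / 365                 # days => years

print(f"{days:.0f} days, {years:.2f} years")
```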
NOFX
Audience histogram
Noisy and computationally hard
Let’s reduce the dataset
artists: 290K => 1K
records: 17M => 8M
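The slides do not say how the artists were selected. One plausible reading, sketched below purely as an assumption, is to keep the artists with the largest audience (number of distinct listeners) and drop all other records. The function name and toy records are hypothetical.

```python
from collections import Counter

def keep_top_artists(records, n_artists):
    """Keep only records whose artist is among the n_artists with the largest audience.

    records: iterable of (userid, artname, plays) tuples.
    """
    records = list(records)
    # Audience = number of distinct users who played the artist.
    audience = Counter()
    for _user, artist in {(u, a) for u, a, _ in records}:
        audience[artist] += 1
    top = {artist for artist, _ in audience.most_common(n_artists)}
    return [r for r in records if r[1] in top]

# Toy data, invented for illustration only.
toy = [
    ("u1", "radiohead", 120), ("u2", "radiohead", 15),
    ("u1", "muse", 40), ("u3", "muse", 3),
    ("u2", "nofx", 900),
]
reduced = keep_top_artists(toy, n_artists=2)
```

On the real data this would be called with n_artists=1000.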
As simple as possible
plays >= 10 => 1
plays < 10 => 0
Like-matrix 
user #1 user #2 ... user #n 
artist #1 0 0 ... 1 
artist #2 1 0 ... 0 
... ... ... ... ... 
artist #m 1 1 ... 0
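A minimal sketch of building such a like-matrix from (userid, artname, plays) records with the threshold of 10 plays. The function name, the dense list-of-lists layout and the toy records are assumptions; real data of this size would need a sparse representation.

```python
def like_matrix(records, threshold=10):
    """Build a binary artist-by-user matrix: 1 if plays >= threshold, else 0."""
    records = list(records)
    users = sorted({u for u, _, _ in records})
    artists = sorted({a for _, a, _ in records})
    user_idx = {u: j for j, u in enumerate(users)}
    artist_idx = {a: i for i, a in enumerate(artists)}
    # Pairs that never occur in the data stay 0.
    matrix = [[0] * len(users) for _ in artists]
    for user, artist, plays in records:
        if plays >= threshold:
            matrix[artist_idx[artist]][user_idx[user]] = 1
    return artists, users, matrix

# Toy data, invented for illustration only.
toy = [
    ("u1", "radiohead", 120), ("u2", "radiohead", 15),
    ("u1", "muse", 40), ("u3", "muse", 3),
]
artists, users, matrix = like_matrix(toy)
```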
Restricted Boltzmann Machines
Two formulations:
● neural network - vertices are “neurons” connected via weighted edges
● Markov random field - vertices are random variables, edges show dependencies
feature #1
feature #2
feature #3
feature #4
“mega”-feature #1
“mega”-feature #2
“mega”-feature #3
An RBM “compresses” its input, converting “raw” data into a higher-level representation
short probability theory review
Random Variable
Not a single value, but a set of values with associated probabilities
Probability Distribution
probability distribution
X=0    X=1
0.3    0.7
joint probability distribution
        X1=0    X1=1
X2=0    0.15    0.25
X2=1    0.4     0.2
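A short sketch of what a joint distribution table gives you: marginal and conditional probabilities fall out by summing and dividing. The values below are illustrative and chosen so that the table sums to 1; the variable names are assumptions.

```python
# joint[x2][x1] = P(X1 = x1, X2 = x2); illustrative values that sum to 1.
joint = [
    [0.15, 0.25],  # X2 = 0
    [0.40, 0.20],  # X2 = 1
]

# Marginals: sum out the other variable.
p_x1 = [sum(row[x1] for row in joint) for x1 in (0, 1)]
p_x2 = [sum(row) for row in joint]

# Conditional P(X1 = 1 | X2 = 0) = P(X1 = 1, X2 = 0) / P(X2 = 0).
p_x1_given_x2_0 = joint[0][1] / p_x2[0]
```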
Like-matrix 
user #1 user #2 ... user #n 
artist #1 0 0 ... 1 
artist #2 1 0 ... 0 
... ... ... ... ... 
artist #m 1 1 ... 0 
each column (user) is an observation (object)
Like-matrix 
user #1 user #2 ... user #n 
artist #1 0 0 ... 1 
artist #2 1 0 ... 0 
... ... ... ... ... 
artist #m 1 1 ... 0 
each row (artist) is a binary random variable (feature)
artist #1 
artist #2 
artist #3 
artist #4 
? ? ? 
2 sets of random variables:
● visible variables - we know their values (e.g. “likes”)
● hidden variables - some hidden counterparts of visible ones (genres? countries? decades?)
Ideally, we would like to get the full joint probability distribution table
V0   V1   V2   ...   H1   H2   ...   P(...)
0    0    0    ...   0    0    ...   0.002
0    0    0    ...   0    1    ...   0.0004
0    0    0    ...   1    0    ...   0.007
total number of combinations
2^1100 =
13582985290493858492773514283592667786034938469
31744549748519669727813092754241848720539208320
75605922985782629538473834750387255432349299711
55548342800628721885763499406390331782864144164
68073076683716052622317651279843577212995655335
52860322030803807757597323201989850948840040691
16123084147875437183658467465148948790552744165
376
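A quick check of why the full table is out of reach. The split of the 1100 binary variables into 1000 visible and 100 hidden units is an assumption (the slides give only the total exponent); the count itself does not depend on it.

```python
n_visible = 1000   # assumed: one visible unit per artist kept after the reduction
n_hidden = 100     # assumed: the slides do not state the hidden layer size

# Each binary variable doubles the number of rows in the table.
n_states = 2 ** (n_visible + n_hidden)

print(len(str(n_states)), "decimal digits")
```

A table with that many rows cannot be stored or estimated directly, which is why a compact parametric model is used instead.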
RBM Training: step 1
Sample hiddens given visibles
RBM Training: step 2
Sample visibles given hiddens (reconstruct)
RBM Training: step 3
Calculate error and adjust weights (dW)
training result - weight matrix that minimizes reconstruction error
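The three steps above resemble one-step contrastive divergence (CD-1), a common way to train RBMs. The slides do not name the algorithm, so treat the sketch below as an assumption rather than the author's implementation. All function names, the learning rate, the epoch count and the toy data are hypothetical, and plain Python lists are used only to keep the example self-contained.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample(probs, rng):
    """Draw independent binary values with the given probabilities."""
    return [1 if rng.random() < p else 0 for p in probs]

def hidden_probs(v, W, b_hid):
    # P(h_j = 1 | v) for every hidden unit j.
    return [sigmoid(b_hid[j] + sum(W[j][i] * v[i] for i in range(len(v))))
            for j in range(len(W))]

def visible_probs(h, W, b_vis):
    # P(v_i = 1 | h) for every visible unit i.
    return [sigmoid(b_vis[i] + sum(W[j][i] * h[j] for j in range(len(h))))
            for i in range(len(b_vis))]

def train_rbm(data, n_hidden, epochs=100, lr=0.1, seed=0):
    rng = random.Random(seed)
    n_visible = len(data[0])
    # W[j][i] connects hidden unit j with visible unit i.
    W = [[rng.gauss(0, 0.01) for _ in range(n_visible)] for _ in range(n_hidden)]
    b_vis = [0.0] * n_visible
    b_hid = [0.0] * n_hidden
    for _ in range(epochs):
        for v0 in data:
            # Step 1: sample hiddens given visibles.
            ph0 = hidden_probs(v0, W, b_hid)
            h0 = sample(ph0, rng)
            # Step 2: sample visibles given hiddens (reconstruct).
            v1 = sample(visible_probs(h0, W, b_vis), rng)
            ph1 = hidden_probs(v1, W, b_hid)
            # Step 3: adjust weights by the gap between data and reconstruction.
            for j in range(n_hidden):
                for i in range(n_visible):
                    W[j][i] += lr * (ph0[j] * v0[i] - ph1[j] * v1[i])
                b_hid[j] += lr * (ph0[j] - ph1[j])
            for i in range(n_visible):
                b_vis[i] += lr * (v0[i] - v1[i])
    return W, b_vis, b_hid

def reconstruct(v, W, b_vis, b_hid):
    """Deterministic reconstruction: probabilities instead of samples."""
    return visible_probs(hidden_probs(v, W, b_hid), W, b_vis)

# Toy like-vectors (one per user), invented for illustration only.
data = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
W, b_vis, b_hid = train_rbm(data, n_hidden=2)
recon = reconstruct(data[0], W, b_vis, b_hid)
```

Each row of W is what a later slide reads as a learned concept.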
Hidden variables after training
● each hidden unit activates visible ones with different weights
● weights represent the strength of some concepts
W1 W2 W3 W4
Learned concepts: pop / metal
Beyonce vs. Metallica
Learned concepts: hard rock / shock value
Dream Theater vs. Lady Gaga
Learned concepts: classic / alternative
Bob Dylan vs. Linkin Park
Artist portrait
each artist may be described via a weighted vector of concepts
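One way to read a portrait off a trained weight matrix, sketched under the assumption that W[j][i] is the weight between hidden unit (concept) j and visible unit (artist) i. The function name, concept labels and numbers are hypothetical.

```python
def artist_portrait(W, artists, concept_names, artist):
    """Return {concept: weight} for one artist, i.e. one column of W."""
    i = artists.index(artist)
    return {concept_names[j]: W[j][i] for j in range(len(W))}

# Hypothetical weight matrix: 2 concepts x 3 artists, numbers invented.
W = [
    [0.2, -0.8, 0.5],
    [0.4, 0.1, -0.3],
]
artists = ["artist a", "artist b", "artist c"]
concepts = ["concept 1", "concept 2"]

portrait = artist_portrait(W, artists, concepts, "artist b")
```

Note that the concept labels are given by a human after inspecting which artists a hidden unit weighs most strongly; the model itself only produces the numbers.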
Artist portrait: Radiohead
axis                     weight
classic rock/grunge       0.23
pop/rock                  0.24
electronic/country       -0.23
pop rock/hard rock        0.37
hardcore/alternative     -0.1
Artist portrait: Daft Punk
axis                     weight
classic rock/grunge       0.09
pop/rock                  0.18
electronic/country       -0.82
famous/little-known      -0.56
girl band/boy band        0.15
Recommendations 
Features
Questions? 
https://github.com/dfdx
