Text	mining	of	Beauty	Blogs:
What do women talk about?
Артем	Просветов
Data	Scientist,	CleverDATA
cleverdata.ru |		info@cleverdata.ru
Raw blog data
Raw data: 98,496 pages, stored as roughly 1,000,000 files.
Ready for analysis: 58,719 English pages (59.6%).
The remaining 40.4% was discarded: empty pages and pages with errors, non-English pages (23,461), photo/video pages without text (2,315), and articles from techcrunch.com (3,402).
Pages → Authors
From ~60k pages → ~2,000 authors.
Mean blog post size (in words)
Two populations of bloggers can be distinguished:
• 'Twitter-style' authors with short posts (~20%)
• full-length bloggers averaging 200-500 words per post (~80%)
Sentiment analysis
APIs and services used:
- Sentity (https://sentity.io/)
- Twinword (https://www.twinword.com/)
- Textualinsights (http://www.textualinsights.com/)
- VivekN (https://github.com/vivekn/sentiment-web)
Results:
• The resulting sentiment rate is based on 4 independent rating systems.
• The majority of the blogs have a positive sentiment rate.
• The mean sentiment rate is 0.72 ("positive, warm").
• All these results are intuitively consistent and agree well with manual tests.
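The slides do not say how the four scores are merged into one sentiment rate. A minimal sketch, assuming every service's output has already been rescaled to one common scale and that the combined rate is a plain average; the dictionary keys and the example scores are hypothetical.

```python
from statistics import mean

def combined_sentiment(scores):
    """Average per-service sentiment scores that share one common scale.

    `scores` maps a service name to its score for one blog post.
    Services that returned nothing (None) are skipped.
    """
    valid = [s for s in scores.values() if s is not None]
    if not valid:
        raise ValueError("no sentiment scores available")
    return mean(valid)

# Hypothetical scores for a single post, one entry per service.
post_scores = {
    "sentity": 0.80,
    "twinword": 0.70,
    "textualinsights": 0.65,
    "vivekn": None,  # assumed failed request, ignored by the average
}
rate = combined_sentiment(post_scores)
```

Averaging is only one option; a median or a weighted mean would fit the same interface if some services proved more reliable than others.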
Estimation of blog efficiency
We used several traffic rank systems:
• Alexa Rank, which audits and publishes how frequently various websites are visited.
• Yandex Thematic Citation Index (TIC), which estimates the "credibility" of an Internet resource from a qualitative assessment of the links pointing to it from other sites.
• Google PageRank, which counts the number and quality of links to a blog to produce a rough estimate of how important the website is.
Relevance Rate
The content relevance rate is based on fuzzy string matching:
- Every company product name was string-matched against all blog pages.
- String matching is based on the Levenshtein metric.
- Pages with a 90% matching rate were marked.
- Tests with direct brand-name matching showed about 90-100% accuracy for each product name, depending on the words in the title.
- The resulting relevance rate for each author is the sum of the marks on all of that author's pages.
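The matching procedure can be sketched as follows. This is an illustration under assumptions, not the code used in the study: the similarity is taken as 1 minus the edit distance divided by the longer string length, the product name is compared against every window of the same number of words in the page text, and a page is marked when the best window reaches the threshold. The page texts and author names are made up.

```python
from collections import Counter

def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # match or substitute
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Similarity in [0, 1]: 1 minus distance normalised by the longer length."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def best_match(product, text):
    """Best similarity between the product name and any equally long word window."""
    p_words = product.lower().split()
    t_words = text.lower().split()
    n = len(p_words)
    target = " ".join(p_words)
    windows = [" ".join(t_words[i:i + n])
               for i in range(max(1, len(t_words) - n + 1))]
    return max(similarity(target, w) for w in windows)

def relevance_by_author(pages, products, threshold=0.9):
    """Count, per author, the (page, product) pairs that reach the threshold."""
    marks = Counter()
    for author, text in pages:
        for product in products:
            if best_match(product, text) >= threshold:
                marks[author] += 1
    return marks

# Hypothetical pages as (author, text) pairs.
pages = [
    ("author_a", "My review of the BlaBlaBla Face Serum after two weeks"),
    ("author_a", "Weekend haul and a new lipstick"),
    ("author_b", "Trying BlaBlaBla Face Serumm for the first time"),  # typo on purpose
]
marks = relevance_by_author(pages, ["BlaBlaBla Face Serum"])
```

The misspelled "Serumm" still counts as a mention because one extra character keeps the similarity above 0.9, which is the point of fuzzy rather than exact matching.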
Levenshtein distance
The Levenshtein distance is a string metric for measuring the difference between two sequences.
Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.
Example: the Levenshtein-based similarity score between 'beer' and 'bread' is 44/100 (the raw edit distance is 3).
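The 44/100 figure quoted for 'beer' and 'bread' is a similarity score rather than a raw distance. The sketch below computes both with the standard dynamic-programming recurrence. The 0-100 normalisation shown (a substitution counted as one deletion plus one insertion, divided by the combined length) is an assumption about how the score was obtained; it does reproduce 44 for this pair.

```python
def levenshtein(a, b, sub_cost=1):
    """Minimum total cost of insertions and deletions (cost 1 each) and
    substitutions (cost `sub_cost`) that turn string a into string b."""
    prev = list(range(len(b) + 1))            # cost from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                            # cost from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else sub_cost
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

def similarity_percent(a, b):
    """Similarity on a 0-100 scale; a substitution counts as delete + insert."""
    total = len(a) + len(b)
    if total == 0:
        return 100
    return round(100 * (total - levenshtein(a, b, sub_cost=2)) / total)

distance = levenshtein("beer", "bread")        # 3 single-character edits
score = similarity_percent("beer", "bread")    # 44
```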
Sentiments vs Blog size
The most active authors write with a sentiment rate in a narrow range: 0.74 +/- 0.03.
[Plot axes: Sentiment rate, Blog size (pages)]
Discussion vs Blog size
The most discussed blogs belong to authors with mid-sized blogs.
[Plot axes: Log(Blog size), Mean discussion]
Words vs Pages
Again, 2 kinds of bloggers:
- 'Twitter-style' authors with short posts
- full-length bloggers
[Plot axes: Log(mean words per page), Log(Blog size)]
Discussion vs Sentiments
If you want to start a big discussion, you should praise something.
All highly discussed authors have a positive sentiment rate (>= 0.4).
[Plot axes: Sentiment rate, Mean discussion]
Using the Klout score for bloggers
We use the Klout service to rank authors by online social influence.
Klout measures the size of a user's social media network and correlates the content created to measure how other users interact with that content.
- The median Klout score is 40.1.
Sentiments vs Klout score
One can distinguish a population of beginner bloggers with a low Klout score who tend to amplify sentiment.
[Plot axes: Sentiment rate, Klout score]
The final author rating is based on:
• Number of blog pages
• Mean discussion size
• Alexa Rank + Yandex TIC + Google PageRank
• Relevance rate
• Sentiment rate
• Klout score
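The slides list the components of the rating but not the formula that combines them. One plausible way to do it, shown purely as an assumption, is to rescale every component to [0, 1] across authors and take a weighted mean. The feature names, the equal weights and the three authors below are hypothetical, and `alexa_rank` stands in for the whole traffic component (a lower Alexa Rank means more traffic, so it is inverted).

```python
def min_max(values):
    """Rescale a list of numbers to [0, 1]; a constant list maps to 0.5."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def author_rating(authors, weights, lower_is_better=()):
    """Weighted mean of min-max normalised components for every author."""
    names = list(authors)
    total_weight = sum(weights.values())
    rating = dict.fromkeys(names, 0.0)
    for feature, weight in weights.items():
        scaled = min_max([authors[n][feature] for n in names])
        if feature in lower_is_better:
            scaled = [1.0 - s for s in scaled]   # flip so that higher is better
        for name, s in zip(names, scaled):
            rating[name] += weight * s / total_weight
    return rating

# Hypothetical authors and component values.
authors = {
    "a": {"pages": 200, "discussion": 10.0, "alexa_rank": 50_000,
          "relevance": 5, "sentiment": 0.71, "klout": 60},
    "b": {"pages": 60, "discussion": 70.0, "alexa_rank": 900_000,
          "relevance": 1, "sentiment": 0.77, "klout": 40},
    "c": {"pages": 20, "discussion": 1.0, "alexa_rank": 2_000_000,
          "relevance": 0, "sentiment": 0.50, "klout": 20},
}
weights = dict.fromkeys(
    ["pages", "discussion", "alexa_rank", "relevance", "sentiment", "klout"], 1.0)
rating = author_rating(authors, weights, lower_is_better={"alexa_rank"})
best = max(rating, key=rating.get)
```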
Make your data clever
• List of the most PR-effective authors
• Pragmatic statistical information
• Key recommendations for bloggers
Blog efficiency rating
• Alexa Rank
• Yandex Thematic Citation Index
• Google PageRank
Blog relevance rating
• Based on fuzzy string matching
• Blog rating according to mentions of company products in the text
Sentiment analysis
• 4 independent sentiment rating systems are combined
• The resulting sentiment rate is fully consistent with tests
TOP Rated Authors
Name                             Url                                 Sentiment  Pages  Mean comments
Hayley Carr                      http://www.londonbeautyqueen.com    0.71       229    10.9
Luzanne                          http://pinkpeonies.co.za            0.77       66     68.3
Allison                          http://www.neversaydiebeauty.com    0.70       182    42.9
Mica Kelly, Beth, Jessica Diner  http://blog.birchbox.co.uk          0.74       196    0.26
Poonam                           http://beautyandmakeupmatters.com   0.78       142    4.3
Silvie                           http://mysillylittlegang.com        0.74       571    0.64
Testing the result
Hayley Carr (Top Rated Author):
"BlaBlaBla is definitely a brand to be reckoned with... All of the BlaBlaBla products have multiple purposes, as well as smelling and feeling fabulous; the packaging is clean and fresh whilst still looking great in your bathroom, as well as having unique application methods that only aid the product performance... It's definitely worth checking out this growing brand, before it starts taking over the world."
Authors	←→		Products
In	order	to	associate	a	blogger	
with	a	product	we	must:
• Find	products	for	promotion
• Find	main	topics	of	each	blogger
• Match	topics	of	each	blogger	with product	names
• Find	best	combinations	of	blogger	and	product
Finding the most promising products for promotion
Finding topics: the document-term matrix
Let's build a document-term matrix in which each row is a document, each column is a term, and a colored cell indicates that the term appears in the document at least once.
We can use the TF-IDF method to obtain the document-term matrix.
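As a small illustration of the binary version of this matrix, the sketch below builds it with the standard library only. The tokenizer (lowercase runs of letters and apostrophes) and the three example documents are assumptions made for the example.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def document_term_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] is 1 if term j occurs in document i."""
    tokenized = [set(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*tokenized))
    matrix = [[1 if term in tokens else 0 for term in vocabulary]
              for tokens in tokenized]
    return vocabulary, matrix

# Hypothetical documents.
docs = [
    "This face oil smells great",
    "Great serum, great price",
    "Body oil review",
]
vocabulary, matrix = document_term_matrix(docs)
```

Each row of `matrix` corresponds to one document and each column to one entry of `vocabulary`, which is the layout the heatmap on the slide visualizes.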
Finding topics: TF-IDF
• Term frequency TF(t, d) is the number of times that term t occurs in document d.
• Inverse document frequency (IDF) measures how much information a word provides, that is, whether the term is common or rare across all documents.
• Term frequency-inverse document frequency (TF-IDF) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.
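The definitions above can be turned into a few lines of code. The slides do not give the exact IDF formula, so this sketch assumes the classic unsmoothed form, IDF(t) = log(N / df(t)), where N is the number of documents and df(t) the number of documents containing t. The token lists are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists. Returns (vocabulary, weight matrix)."""
    n_docs = len(documents)
    counts = [Counter(doc) for doc in documents]        # raw term frequencies
    vocabulary = sorted(set().union(*documents))
    # Document frequency: in how many documents each term occurs.
    df = {t: sum(1 for c in counts if t in c) for t in vocabulary}
    idf = {t: math.log(n_docs / df[t]) for t in vocabulary}
    weights = [[c[t] * idf[t] for t in vocabulary] for c in counts]
    return vocabulary, weights

# Hypothetical, already tokenized documents.
docs = [["face", "oil", "oil"], ["face", "serum"], ["face", "mask"]]
vocabulary, weights = tf_idf(docs)
```

With this formula a term that occurs in every document ("face" here) gets weight 0, while a term concentrated in one document ("oil") gets a high weight there.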
Topic extraction: NMF
• Non-negative matrix factorization (NMF) is a variant of matrix factorization in which we start with the document-term matrix D, approximate it as the product of two matrices W and T, and constrain the elements of W and T to be non-negative.
• This lets us interpret each row of the T matrix as a topic.
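The slides do not say which NMF solver was used. As a sketch of the idea, the code below factors a small non-negative matrix D into W (documents x topics) and T (topics x terms) with the well-known Lee-Seung multiplicative update rules, written with plain lists so it runs without extra libraries. The example matrix, the number of topics and the iteration count are assumptions.

```python
import random

def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frobenius_error(d, w, t):
    """Squared Frobenius norm of D - W T."""
    approx = matmul(w, t)
    return sum((x - y) ** 2 for rd, ra in zip(d, approx) for x, y in zip(rd, ra))

def nmf(d, n_topics, n_iter=200, eps=1e-9, seed=0):
    """Factor non-negative D into non-negative W and T so that D is close to W T."""
    rng = random.Random(seed)
    n_docs, n_terms = len(d), len(d[0])
    # Strictly positive random start, so every entry can be rescaled.
    w = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in range(n_docs)]
    t = [[rng.random() + 0.1 for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # T <- T * (W^T D) / (W^T W T), element-wise
        wt = transpose(w)
        num, den = matmul(wt, d), matmul(matmul(wt, w), t)
        t = [[t[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_terms)]
             for i in range(n_topics)]
        # W <- W * (D T^T) / (W T T^T), element-wise
        tt = transpose(t)
        num, den = matmul(d, tt), matmul(w, matmul(t, tt))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_topics)]
             for i in range(n_docs)]
    return w, t

# Hypothetical document-term counts: two documents about skincare terms,
# two about makeup terms.
terms = ["oil", "serum", "face", "lipstick", "mascara", "eye"]
d = [[2, 1, 3, 0, 0, 0],
     [1, 2, 2, 0, 0, 0],
     [0, 0, 0, 3, 1, 2],
     [0, 0, 0, 1, 2, 2]]
w, t = nmf(d, n_topics=2)
# The highest-weighted terms of each row of T describe one topic.
topics = [sorted(zip(row, terms), reverse=True)[:3] for row in t]
```

Because the updates only multiply by non-negative ratios, W and T stay non-negative throughout, which is what makes the rows of T readable as additive topics.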
Topic extraction
• For each author we build a document-term matrix.
• For each document-term matrix we perform matrix factorization and find the main topics.
• For each product we match the product name against the author's main topics and compute the rate of intensity.
• If an author has the exact product name in one of their titles, we set the rate of intensity to 0 (the author has already reviewed the product).
The intensity matrix
Thus, for each author-product pair we find the rate of intensity. We can visualize it as a heatmap in which products are sorted by mean rate of intensity and authors are sorted by author rating.
Note: the top-rated authors show high intensity across the matrix.
The intensity matrix
Next we extract the strongest peaks from the product-author intensity matrix.
After each peak is extracted, the column containing it is dropped, so each author gets only one product.
We need recommendations for only 4 products, and we can select the 40 best-rated authors for this task.
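The peak extraction can be sketched as a greedy loop over the intensity matrix (rows are products, columns are authors). Dropping the author's column after each peak follows the description above; additionally retiring the product's row, so that each product also gets one author, is an assumption made here. The matrix values and names are hypothetical.

```python
def extract_peaks(intensity, products, authors, n_pairs):
    """Greedily pick the highest remaining cell of the product x author matrix.

    After each pick the author's column is dropped (as described above) and,
    by assumption, the product's row as well.
    """
    free_products = set(range(len(products)))
    free_authors = set(range(len(authors)))
    pairs = []
    while len(pairs) < n_pairs and free_products and free_authors:
        best = None
        for p in sorted(free_products):
            for a in sorted(free_authors):
                # Strict ">" keeps the earlier (higher-rated) author on ties.
                if best is None or intensity[p][a] > best[0]:
                    best = (intensity[p][a], p, a)
        value, p, a = best
        pairs.append((products[p], authors[a], value))
        free_authors.discard(a)     # drop the column containing the peak
        free_products.discard(p)    # assumption: one author per product
    return pairs

# Hypothetical intensity matrix: 2 products x 3 authors.
products = ["Body Oil", "Face Serum"]
authors = ["author_1", "author_2", "author_3"]
intensity = [
    [0.9, 0.8, 0.1],
    [0.7, 0.2, 0.3],
]
pairs = extract_peaks(intensity, products, authors, n_pairs=2)
```

In this toy case author_1 is the best match for both products, so after the first peak assigns author_1 to Body Oil, Face Serum falls back to its best remaining author.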
In	order	to	associate	a	blogger	
with	a	product	we	must:
• Find	products	for	promotion	
• Find	main	topics	of	each	blogger
• Match	topics	of	each	blogger	with product	names
• Find	best	combinations	of	blogger	and	product
• Profit!
The resulting associations
Product                   Author                 Url
BlaBlaBla Body Oil        Allison                http://www.neversaydiebeauty.com
BlaBlaBla Wrinkle Repair  Cindy Batchelor        http://mystylespot.net
BlaBlaBla Face Serum      Marie Papachatzis      http://iamthemakeupjunkie.blogspot.ru
BlaBlaBla Face Oil        Emily - Style Lobster  http://stylelobster.com
Data Science Weekend 2017. CleverDATA. Text mining of beauty blogs: what do women talk about?
