Thinking in (Text) Clustering
(No math, be not afraid)
Yueshen Xu (lecturer)
ysxu@xidian.edu.cn / xuyueshen@163.com
Data and Knowledge Engineering Research Center
Xidian University
Text Mining & NLP & ML
Software Engineering 2017/4/13
Outline
• Background
• What can be clustered?
• Problems in K-XXX (Means/Medoid/Center…)
• Similarity Measure
• Convex and Concave
• Problems in Gaussian Mixture Model
• Problems in Matrix Factorization
• Multinomial and Sparsity
Keywords: Clustering, K-Means/Medoid, Similarity Computation, GMM, MF, Multinomial Distribution
Basics, not state-of-the-art
Background
• Information Overload
We need: Summarization, Visualization, Dimensionality Reduction
Big Data, Cloud Computing, Artificial Intelligence, Deep Learning, …, etc.
Background
Dimensionality Reduction (DR)
• Clustering
  ➢ Text Clustering, Webpage Clustering, Image Clustering…
• Summarization
  ➢ Document Summarization, Image Summarization…
• Factorization
  ➢ Rating Matrix Factorization, Image Non-negative Factorization
• Basic Requirements: Automatic, Applicable, Explainable
Clustering (Text)
Some Concepts
• Related Research Areas
  ➢ Dimensionality Reduction (DR)
  ➢ Text Mining
  ➢ Natural Language Processing
  ➢ Computational Linguistics
  ➢ Information Retrieval
  ➢ Artificial Intelligence
  ➢ (Text) Clustering
[Diagram of related areas: Information Retrieval, Computational Linguistics, Natural Language Processing, LSA/Topic Model, Text Mining, DR, Data Mining, Artificial Intelligence, Machine Learning, Machine Translation, (Text) Clustering]
• We all know what (text) clustering is, right?
• A widely accepted topic, since everyone knows it
What can be clustered?
Data Sample 1: (1.2, 1.4, 2.234, 3.231), (8.2, 6.4, 4.243, 5.41), (5.234, 3.56, 4.454, 6.78)
Data Sample 2: (1), (0), (1), (0), (1), (1), (1), (0), (1), (0)
Data Sample 3: (China, modern, people, gov.), (policy, paper, conference, chair), (report, solution, UN, UK)
Data Sample 4: (aaabbbccc), (dddfffggg), (hhhiiiijjj)
Data Sample 5: (▲▼♦), (♣♠█), (■□●)
Is there anything that cannot be clustered?
Yes, but not related to us
What can be clustered?
Anything over which a similarity measure can be defined
Matrix topology
All kinds of data can be clustered
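
As a rough illustration of "anything over which a similarity measure can be defined", the sketch below applies one measure (Jaccard similarity, chosen here only as an example, not taken from the slides) to term lists in the style of Data Sample 3 and to raw strings in the style of Data Sample 4. The list doc_c and the string s3 are made up for the example.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, each treated as a set."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0  # two empty collections are considered identical
    return len(sa & sb) / len(sa | sb)

# Documents as lists of terms (doc_c is a hypothetical extra document).
doc_a = ["China", "modern", "people", "gov."]
doc_b = ["policy", "paper", "conference", "chair"]
doc_c = ["China", "policy", "people", "report"]

# Raw strings compared through their character sets (s3 is hypothetical).
s1, s2, s3 = "aaabbbccc", "dddfffggg", "aabbccdd"

sim_docs = jaccard(doc_a, doc_c)   # shares "China" and "people"
sim_strs = jaccard(s1, s3)         # shares the characters a, b, c
sim_none = jaccard(s1, s2)         # no character in common

print(sim_docs, sim_strs, sim_none)
```

Once such a function exists, any clustering method that only needs pairwise similarities can be applied, whatever the raw form of the data.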
K-Means Trap
Defects of K-Means, K-Medoid, K-XXX
• How many clusters (K)?
• Where are the initial centers?
• Do the data really form a sphere?
• Do the data really follow a Minkowski/Euclidean distance?
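
The question about initial centers can be made concrete with a small sketch. It runs plain Lloyd iterations (a minimal K-Means written for this note, not a library routine) on four hypothetical points, once from each of two hand-picked initializations, and compares the resulting sum of squared errors.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
            for p in points]

def kmeans(points, centers, n_iter=100):
    """Lloyd iterations starting from the given initial centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(n_iter):
        labels = assign(points, centers)
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:
                # New center is the coordinate-wise mean of the members.
                new_centers.append(tuple(sum(xs) / len(members)
                                         for xs in zip(*members)))
            else:
                new_centers.append(centers[k])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    labels = assign(points, centers)
    sse = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    return centers, labels, sse

# Four corners of a wide rectangle (hypothetical data).
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

_, labels_good, sse_good = kmeans(points, [(0, 0), (4, 0)])  # left/right split
_, labels_bad, sse_bad = kmeans(points, [(2, 0), (2, 1)])    # stuck: top/bottom

print(sse_good, sse_bad)
```

Both runs converge, but the second one stops at a clearly worse partition, which is one reason practical implementations restart from several initializations.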
How about these?
What kind of data does K-XXX fit better?
What kind of data do methods relying on distance/similarity computation fit better?
CONVEX
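
One way to see why centroid-based methods prefer convex clusters is the sketch below, which uses two hypothetical concentric rings. Both rings have (numerically) the same centroid, and each centroid lies far from every member of its own ring, so assigning points to the nearest centroid cannot tell the rings apart.

```python
import math

def ring(radius, n=12):
    """n points evenly spaced on a circle of the given radius around the origin."""
    return [(radius * math.cos(2 * math.pi * i / n),
             radius * math.sin(2 * math.pi * i / n)) for i in range(n)]

def centroid(points):
    """Coordinate-wise mean of a list of 2-D points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

inner, outer = ring(1.0), ring(5.0)
c_inner, c_outer = centroid(inner), centroid(outer)

# Distance between the two cluster centroids: essentially zero.
centroid_gap = math.dist(c_inner, c_outer)

# Distance from the outer ring's centroid to its closest member: the radius.
nearest_member = min(math.dist(c_outer, p) for p in outer)

print(centroid_gap, nearest_member)
```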
Alternative
• Gaussian Mixture Model
Why Gaussian? → Central limit theorem
Is the central limit theorem always applicable in real-world cases?
1. Parameter Tuning
2. High applicability of Gaussian distribution
How to estimate parameters? → Expectation-Maximization
No closed-form solution
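
Because there is no closed-form solution, the parameters are refined iteratively. The following is a minimal sketch of Expectation-Maximization for a one-dimensional Gaussian mixture; the data, the initial values, and the variance floor of 1e-6 are assumptions made for this example only.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, means, variances, weights, n_iter=50):
    """EM for a 1-D Gaussian mixture, starting from the given parameters."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            probs = [weights[j] * normal_pdf(x, means[j], variances[j])
                     for j in range(k)]
            total = sum(probs)
            resp.append([p / total for p in probs])
        # M-step: re-estimate weight, mean and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a collapsing component
    return means, variances, weights

# Two well-separated groups (hypothetical data).
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, variances, weights = em_gmm_1d(data, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])

print(means, variances, weights)
```

The number of components, the starting values, and the number of iterations all have to be chosen by hand, which is the parameter tuning the slide refers to.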
Alternative
• Matrix Factorization
No closed-form solution
Because we are not in the department of math
SVD, PMF, NMF, Tensor Factorization…
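
To show what an iterative solution looks like, here is a small sketch of non-negative matrix factorization with the standard multiplicative update rules, written in plain Python. The document-term matrix v, the rank, the random initialization, and the epsilon in the denominators are all assumptions for this example; reading a cluster off the largest entry of each row of W is one common convention, not something the slides prescribe.

```python
import random

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(r) for r in zip(*m)]

def frob_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Factorize V (n x m) into non-negative W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    initial_error = frob_error(v, w, h)
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(rank)]
             for i in range(n)]
    return w, h, initial_error, frob_error(v, w, h)

# Rows are documents, columns are terms (hypothetical counts).
v = [[3, 2, 0, 0],
     [2, 3, 0, 0],
     [0, 0, 4, 1],
     [0, 0, 1, 4]]

w, h, initial_error, final_error = nmf(v, rank=2)

# One cluster label per document: the factor with the largest weight.
labels = [max(range(2), key=lambda a: row[a]) for row in w]
print(initial_error, final_error, labels)
```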
Triangle
Is there no perfect method here?
What we probably want
• No constraint on the form of the data
• No assumption about the data distribution
• Closed-form solution
Triangle borrowed from distributed computing
Triangle (Cont.)
I do not know whether such a method exists
[Triangle diagram. Corners: Form, Distribution, Closed-form solution. Methods placed on the triangle: Hierarchical Clustering (?), GMM/Gaussian Process, K-Means/Medoid, Matrix Factorization. Three positions are marked "impossible".]
Multinomial Distribution
Discrete Data (Text)
One document:
(0,0,0,China,0,0,0,0,0,0,0,report,0,0,0,0,0,0,0,0,0,policy,0,0,0,0,0,0,0,meeting,0,0,0,meeting,0,0,0,0,report,0,…)
Multinomial distribution
Clustering → Sampling: Markov Chain Monte Carlo
Friendly to sparsity
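
A brief sketch of why the multinomial view is friendly to sparsity: a document is stored only as its non-zero word counts, and its log-likelihood under a cluster's word distribution is a sum over those few entries. The two clusters, their counts, and the add-one smoothing are assumptions for the example; the sampling step (MCMC) mentioned on the slide is not shown.

```python
import math
from collections import Counter

def smoothed_word_probs(counts, vocab, alpha=1.0):
    """Word distribution of a cluster with additive (Laplace) smoothing."""
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts.get(w, 0) + alpha) / total for w in vocab}

def multinomial_log_likelihood(doc_counts, word_probs):
    """Log-probability of the counts, dropping the multinomial coefficient,
    which is the same for every cluster and does not affect the comparison."""
    return sum(c * math.log(word_probs[w]) for w, c in doc_counts.items())

# The document from the slide, kept as sparse counts (zeros are never stored).
doc = Counter(["China", "report", "policy", "meeting", "meeting", "report"])

# Hypothetical word counts of two existing clusters.
clusters = {
    "politics": Counter({"China": 5, "policy": 4, "report": 3, "meeting": 3}),
    "sports": Counter({"match": 5, "team": 4, "report": 2}),
}

vocab = set(doc)
for counts in clusters.values():
    vocab |= set(counts)

scores = {name: multinomial_log_likelihood(doc, smoothed_word_probs(counts, vocab))
          for name, counts in clusters.items()}
best = max(scores, key=scores.get)
print(best, scores)
```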
Sparsity
Sparsity brings a lot of problems
• Also in clustering → What can we do?
➢ Ensemble Learning (Ensemble clustering)
➢ Pre-filling missing values
➢ Tuning ☺
➢ …
10,000 words → 1 term
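
One of the problems sparsity causes for similarity-based clustering can be sketched as follows, assuming a vocabulary of 10,000 words and three short hypothetical documents. Two documents that a human would call related can share no term at all, so their cosine similarity is exactly zero.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {term: count}."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

vocab_size = 10_000  # assumed vocabulary size

doc1 = {"policy": 2, "meeting": 1, "report": 1}
doc2 = {"government": 1, "conference": 2, "paper": 1}  # related topic, no shared term
doc3 = {"policy": 1, "conference": 1}

density = len(doc1) / vocab_size   # fraction of non-zero entries in doc1's vector
sim_12 = cosine(doc1, doc2)        # no overlap at all
sim_13 = cosine(doc1, doc3)        # one shared term

print(density, sim_12, sim_13)
```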
References
• My previous tutorials/notes (ZJU/UIC/Netease/ITRZJU, as a Ph.D. student)
  • 'Random Thoughts in Clustering'
  • 'Non-parametric Bayesian learning in discrete data'
  • 'The research of topic modeling in text mining'
  • 'Matrix factorization with user generated content'
  • …, etc.
• Website
  • You can download all of my slides
    ➢ http://web.xidian.edu.cn/ysxu/teach.html
    ➢ http://liu.cs.uic.edu/yueshenxu/
    ➢ http://www.slideshare.net/obamaxys2011
    ➢ https://www.researchgate.net/profile/Yueshen_Xu
Q&A
