Gautier Marti, Hong Kong Machine Learning Meetup Season 4 Episode 4
A quick demo of Top2Vec
With application on 2020 10-K business descriptions
Discover latent “topics”
• Automatically find the abstract
“topics” that occur in a
collection of documents

• For example, for news:

topic 1: “trade war”

topic 2: “inflation”

topic 3: “sport”

topic 4: “healthcare”

topic 5: “environment”

…
Topic Modeling
LDA: Standard method with practical shortcomings
• LDA: Main method for topic modeling for
the past 20 years

• LDA relies on a bag-of-words representation,
which ignores the ordering and semantics of words

• LDA relies on lots of arbitrary meta-
parameters such as

- custom stop-word lists

- stemming and lemmatization

• LDA requires the number of topics to be
known in advance (unrealistic)

• As a result, topics found by LDA are not
always very stable…
Latent Dirichlet Allocation (LDA)
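
To make the bag-of-words limitation concrete, here is a minimal sketch using only the Python standard library. The helper `bag_of_words` and the two example sentences are illustrative assumptions, not part of the talk; real LDA pipelines use richer tokenization.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count lowercase whitespace-separated tokens; word order is discarded."""
    return Counter(text.lower().split())


# Two sentences with opposite meanings but the same words.
a = bag_of_words("the dog bites the man")
b = bag_of_words("the man bites the dog")

# As bags of words the two sentences are indistinguishable.
same_bag = a == b
print(same_bag)
```

Any model that only sees these counts cannot tell the two sentences apart, which is the sense in which ordering is ignored.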
Distributed Representations of Topics
• Recent arXiv paper (2020)

• GitHub: ddangelov/Top2Vec

• Does not require the number of
topics to be known in advance

• Does not require text
preprocessing
Top2Vec
5-step algorithm
• Step 1: Create a joint embedding
of documents and words

• For example, using:

- Universal Sentence Encoder

- BERT Sentence Transformer
Top2Vec: How does it work?
5-step algorithm
• Step 2: Project the document
and word vectors to a lower-dimensional
space (e.g. 2D)

• For example, using:

- UMAP
Top2Vec: How does it work?
5-step algorithm
• Step 3: Find dense clusters;

a cluster is a topic

• For example, using:

- HDBSCAN
Top2Vec: How does it work?
5-step algorithm
• Step 4: Find the centroid of each
cluster in the original high-dimensional space



This centroid is the topic vector
Top2Vec: How does it work?
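
Step 4 can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the library's implementation: `topic_vectors` is a hypothetical helper, the centroid is taken as the arithmetic mean, and the label `-1` is treated as noise, following the HDBSCAN convention. The toy vectors stand in for the original high-dimensional embeddings, not the 2D projection.

```python
from collections import defaultdict


def topic_vectors(doc_vectors, labels, noise_label=-1):
    """Average the document vectors of each cluster: one topic vector per cluster."""
    groups = defaultdict(list)
    for vec, label in zip(doc_vectors, labels):
        if label != noise_label:  # skip documents the clustering marked as noise
            groups[label].append(vec)
    # zip(*vecs) iterates over coordinates, so each mean is taken per dimension.
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in groups.items()
    }


# Toy data: four 3-dimensional document vectors and their cluster labels.
doc_vectors = [[1.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 2.0, 2.0], [9.0, 9.0, 9.0]]
labels = [0, 0, 1, -1]
centroids = topic_vectors(doc_vectors, labels)
print(centroids)
```

The clusters are found in the reduced space, but the averaging uses the original vectors, so each topic vector lives in the same space as the word vectors.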
5-step algorithm
• Step 5: Find the n closest word
vectors to the topic vector





These words define the topic
Top2Vec: How does it work?
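
Step 5 amounts to a nearest-neighbour lookup in the joint embedding space. The sketch below assumes cosine similarity as the notion of "closest" (the slide does not say which measure is used); `closest_words`, the toy 2D word vectors, and the topic vector are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def closest_words(topic_vector, word_vectors, n=3):
    """Return the n words whose vectors are most similar to the topic vector."""
    ranked = sorted(
        word_vectors,
        key=lambda word: cosine(topic_vector, word_vectors[word]),
        reverse=True,
    )
    return ranked[:n]


# Toy vocabulary embedded in 2D for readability.
word_vectors = {
    "tariff": [0.9, 0.1],
    "export": [0.8, 0.3],
    "goal": [0.1, 0.9],
    "match": [0.0, 1.0],
}
top_words = closest_words([1.0, 0.1], word_vectors, n=2)
print(top_words)
```

With these toy vectors the topic vector points along the first axis, so the two trade-related words rank first.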
Top2Vec Demo: Application on 10-K filings
2020 10-K business descriptions
What is a 10-K?
• Forms 10-K are mandatory
annual reports for U.S. public
companies

• Company management offers a
detailed picture of

- the company’s business,

- the risks it faces,

- the operating/financial results

• Available for free on

https://www.sec.gov/edgar.shtml
10-K corporate filings
Business description
• Each 10-K starts with a thorough
business description

• We learn about its main products
and services, subsidiaries it owns,
markets it operates in

• It may also include:

- competition the company faces

- regulations that apply to it

- labor issues

- special operating costs

- seasonal factors

- …
10-K corporate filings
Topic modeling of business descriptions
1. Download the 2020 10-Ks

2. Extract business descriptions

3. Apply Top2Vec



Blog + Colab notebook:

https://marti.ai/ml/2021/11/14/top2vec-10k-business.html

Top2Vec Demo
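
The "Apply Top2Vec" step might look roughly like the sketch below. It is not the notebook's code: `summarize_topics` is a hypothetical wrapper, and it assumes the third-party `top2vec` package is installed and that the business descriptions are already available as a list of strings. The import sits inside the function so the definition loads even where the package is missing.

```python
def summarize_topics(documents, embedding_model="doc2vec", n_words=10):
    """Fit Top2Vec on a list of strings and return the top words of each topic."""
    from top2vec import Top2Vec  # third-party: pip install top2vec

    # The number of topics is not passed in; the model discovers it.
    model = Top2Vec(documents, embedding_model=embedding_model)
    topic_words, word_scores, topic_nums = model.get_topics(model.get_num_topics())
    return {
        int(num): list(words[:n_words])
        for num, words in zip(topic_nums, topic_words)
    }
```

A call such as `summarize_topics(business_descriptions, embedding_model="universal-sentence-encoder")`, where `business_descriptions` is a hypothetical list holding one description per filing, would return a dictionary mapping each topic number to its defining words. Top2Vec needs a reasonably large corpus to find dense clusters, so a handful of documents is not enough.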
Results
• Top2Vec found 10 topics

• 9 topics correspond to sectors

• 1 topic corresponds to
COVID-19 disruptions
Top2Vec Demo
So what?
• Top2Vec as a tool works well

• On corporate filings, it essentially
recovers sectors as topics…

not very informative!

• Can we uncover residual topics?
Top2Vec on 10-K
