Text mining and association analysis:
Exploring text data for creating topic models
Kyuson Lim
Department of Mathematics & Statistics, McMaster University
E-mail: limk15@mcmaster.ca
Contents
I Introduction
I Motivation
I Analysis results of 6 methods
I Literature review
I Interpretation
I Conclusion
Why text mining?
I Data transformation: businesses and industries are overwhelmed with unstructured data.
I Telecommunication industry: analysis of customers' reasons for terminating service.
I Government agencies: news issues, local opinions, topic clustering of annual reports.
I Data mining: web crawling of news headlines, social media, movie reviews.
I Business intelligence, exploratory data analysis, consumer behavior.
I Easy to interpret, wide variety of applications, appealing outputs, easily combined with other results.
I Variety of algorithms: BERT (Bidirectional Encoder Representations from Transformers), BTM (Biterm Topic Model), LDA (Latent Dirichlet Allocation).
Figure: Example by Kyuson in 2020
How is it different?
I Natural language processing (NLP): used to understand human language by analyzing text, speech, or grammatical syntax.
I Advanced ML models and AI (artificial intelligence), e.g. Siri.
I Geared towards mimicking natural human communication and capturing syntax and meaning.
I Extracts grammatical structure and sentiment.
I Text mining: used to extract information from unstructured and structured content.
I Extracts information from unstructured data.
I Uses statistical models to analyze the qualitative meaning of content.
I Uses word frequencies, patterns, and correlations between words to explain the text.
Figure: Example of Journals (Author submitted)
Goal of the analysis
I Effectively portray the output and present it as a collection of keywords and their connections.
I Wordcloud: created using data visualization.
I Hierarchical clustering and correlation graph.
I K-medoids clustering for classification.
I Association between issues and causal discovery on the timeline of words.
I Gaussian graphical model: visually portrays the connections.
I Local smoothing regression: fitted to word frequencies over time.
I BTM: topic clustering of documents, the main tool for co-occurrence-based topic modelling.
What do we gain?
I Statistics Canada (StatsCan COVID-19):
I January to December 2020 of the Canadian Perspectives Survey Series (CPSS).
I Covers living and lifestyle issues of people aged over 20 in Canada.
I Topics in sociological and economic issues.
I Keywords and issues of the Covid-19 pandemic in 2020.
I Measure of the impact on daily life, endemics.
I Interpretation is simple, concise and scientific.
I Future usage: recap and pandemic solutions.
I Exploratory data analysis with advanced analytic methods.
I Data visualization: an efficient and practical skill to acquire.
I Application to real-life data: data analyst, complex data.
Text mining pre-processing
I Web crawling and text tokenization.
1. On the Statistics Canada website, inspect the HTML source code to identify the markup that contains the headlines to be parsed.
2. Then, build code in R that crawls each headline in a loop and saves it.
3. Tokenize, converting all words to lowercase.
I Pre-process to filter out unnecessary words.
I Eliminate adverbs and special characters using the packages "stopwords" and "tokenizers".
I Construct a term table consisting of words and their frequencies.
I Crawl the published dates and convert them to date format.
Figure: Example of web crawling
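The tokenizing, filtering and term-table steps above can be sketched as follows. This is a minimal illustration in Python rather than the R code used for the slides; the headlines and the stop-word list are hypothetical stand-ins for the crawled Statistics Canada titles and the "stopwords" package.

```python
import re
from collections import Counter

# Hypothetical headlines standing in for the crawled titles.
headlines = [
    "Impact of COVID-19 on Canadian businesses",
    "Mental health of Canadians during the pandemic",
    "The COVID-19 pandemic and its impact on workers",
]

# A tiny illustrative stop-word list (assumption, not the list used in the slides).
STOPWORDS = {"of", "on", "the", "and", "its", "during"}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (digits and symbols are dropped)."""
    return re.findall(r"[a-z]+", text.lower())

def term_table(docs):
    """Count how often each non-stop-word appears across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(t for t in tokenize(doc) if t not in STOPWORDS)
    return counts

table = term_table(headlines)
for word, freq in table.most_common(3):
    print(word, freq)
```

The resulting counter plays the role of the term table: one row per word with its frequency.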
Literature review of wordcloud
I Visualizes keyword metadata on websites as free-form text.
I Popularized mainly by Web 2.0 websites and blogs.
I Past experience of publishing in a government report.
I Association rule: support analysis (co-occurrences).
I Commonly known as market basket analysis.
I Machine learning (ML) method for discovering interesting relations between variables.
I Visually identifies influential factors and is attractive for readers trying to understand the data.
Figure: Web 2.0 and association rule analysis: support
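As a rough illustration of the support measure behind association rules, the sketch below counts how many documents contain each pair of words and divides by the number of documents. It is written in Python for illustration; the documents are hypothetical and `pair_support` is a made-up helper name, not a function from any package named in the slides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical tokenized headlines (one set of words per document).
docs = [
    {"covid", "pandemic", "impact"},
    {"covid", "pandemic", "canadian"},
    {"pandemic", "canadian", "health"},
    {"covid", "impact"},
]

def pair_support(documents):
    """Return (co-occurrence count, support) for every word pair.

    Support is the share of documents that contain both words.
    """
    counts = Counter()
    for words in documents:
        for pair in combinations(sorted(words), 2):
            counts[pair] += 1
    n = len(documents)
    return {pair: (c, c / n) for pair, c in counts.items()}

support = pair_support(docs)
print(support[("covid", "pandemic")])
```

Pairs are stored in alphabetical order so that ("covid", "pandemic") and ("pandemic", "covid") are counted as the same co-occurrence.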
Wordcloud and co-occurrences
I Wordcloud and co-occurrence bar graph of the 6 most frequently used words.
I "covid" and "pandemic" are used together 79 times.
I "pandemic" and "Canadian" are used together 21 times.
I Apart from "covid" and "pandemic", most articles involve "Canadian" and "impact".
I Frequencies and keywords are combined to form phrases.
I The quantitative bar graph supplements the unstructured wordcloud by ranking the importance of issues through keyword identification.
I Investigates which words constitute the topics of Covid-19's impact in 2020.
Figure: Frequency table and combination of wordcloud with co-occurrence graph
Interpretation: Wordcloud and co-occurrences
I Investigation of minor and major words.
I Identifies some unique words such as "data", "differences", "survey", "statistics" and "study".
I Economic and sociological issues dominate the majority of articles.
I Interest in living issues: "price", "mental", "concerns", "home", and "workers".
I Future work on living issues and on overcoming inflation, economic support.
I A unique overview that can be extended to various text data and published as a Shiny app.
Figure: Frequency table and combination of wordcloud with co-occurrence graph
Literature review of hierarchical clustering
I Seeks to construct a hierarchy of clusters, classifying the words into groups based on their dissimilarity.
1. The number of times a word is used gives its coordinates in the space.
2. For any two words, the distance in the space is calculated as a measure of dissimilarity.
3. At the beginning of the clustering process, each element is in a cluster of its own.
4. Using the distance matrix, we can then cluster the words.
5. For two clusters, the distance is the maximum distance between any pair of elements from the two clusters (complete linkage).
6. Then the two clusters separated by the shortest distance are combined.
7. Iteratively, the two most similar (closest) clusters or words are joined until 2 clusters remain.
I Correlation analysis: co-occurrences of words across all documents in the sparse matrix.
I If a word occurs in a particular document, then the sparse matrix entry for the corresponding row and column is 1; otherwise it is 0.
I An efficient representation of the information contained in the term-document matrix.
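The seven clustering steps and the binary term-document matrix can be sketched together. The following Python illustration uses a hypothetical 5-word, 5-document matrix and assumes Euclidean distance between rows (the slides do not name the distance); `agglomerate` and `pearson` are illustrative helper names.

```python
from itertools import combinations

# Hypothetical binary term-document matrix: rows are words, columns are documents
# (1 if the word occurs in the document, 0 otherwise).
tdm = {
    "covid":    [1, 1, 1, 0, 1],
    "pandemic": [1, 1, 1, 0, 0],
    "health":   [0, 0, 1, 1, 0],
    "survey":   [0, 0, 0, 1, 1],
    "study":    [0, 0, 0, 1, 1],
}

def euclidean(a, b):
    """Distance between two word vectors (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def complete_linkage(c1, c2):
    """Cluster distance = maximum distance over all cross-cluster word pairs."""
    return max(euclidean(tdm[u], tdm[v]) for u in c1 for v in c2)

def agglomerate(words, n_clusters=2):
    """Start with singletons and repeatedly merge the two closest clusters."""
    clusters = [frozenset([w]) for w in words]
    while len(clusters) > n_clusters:
        a, b = min(combinations(clusters, 2), key=lambda p: complete_linkage(*p))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters

def pearson(a, b):
    """Pearson correlation between two rows of the term-document matrix."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

result = agglomerate(list(tdm))
print([sorted(c) for c in result])
print(round(pearson(tdm["covid"], tdm["pandemic"]), 3))
```

Each pass of the loop corresponds to one merge in the dendrogram; recording the merge distances would give the heights of the tree.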
Hierarchical clustering and correlation analysis
I Dendrogram: a tree structure that visualizes how clusters are combined by distance.
I 2 clusters are established: one of 18 words and one of 3 words ("covid", "health" and "pandemic").
I The words "covid" and "pandemic" have a strong correlation (0.35).
I The word "impact" has a negative correlation with the word "health" (-0.1).
I Correlation is calculated over all words.
Figure: Hierarchical clustering dendrogram and correlation analysis.
Interpretation: Hierarchical clustering and correlation
I Sub-hierarchies formulate topics together with their issues.
I The words "examines", "study", "using", and "survey" are grouped together in the same hierarchy.
I A main topic of 3 words and 18 words of sub-topics.
I A similar result holds for the dissimilarity between the word "impact" and the words "health" and "pandemic".
Figure: Hierarchical clustering dendrogram and correlation analysis.
Literature review of K-medoid clustering
I Partitions the data into groups and attempts to minimize the distance between the points in a cluster and a point designated as the center of that cluster.
I Uses the Manhattan distance to define dissimilarity.
I K-medoids minimizes a sum of pairwise dissimilarities.
I K-medoids chooses actual data points as centers (called medoids).
I Build steps construct the clusters and swap steps adjust the assignment of points.
Figure: Iteration steps for constructing the k-medoid clustering.
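The build and swap steps can be sketched in a few lines. This is a simplified PAM-style illustration in Python, not the implementation used for the slides; the two-dimensional word vectors are hypothetical and `pam` is an illustrative name.

```python
from itertools import product

# Hypothetical word-frequency vectors (assumption, for illustration only).
points = {
    "covid":    (9, 8),
    "pandemic": (8, 9),
    "health":   (5, 1),
    "impact":   (1, 2),
    "survey":   (2, 1),
    "data":     (1, 1),
}

def manhattan(a, b):
    """Manhattan (city-block) distance between two vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def total_cost(medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(manhattan(p, points[m]) for m in medoids) for p in points.values())

def pam(k):
    names = list(points)
    # Build step: greedily add the medoid that lowers the total cost the most.
    medoids = []
    while len(medoids) < k:
        best = min((n for n in names if n not in medoids),
                   key=lambda n: total_cost(medoids + [n]))
        medoids.append(best)
    # Swap step: exchange a medoid with a non-medoid for as long as the cost improves.
    improved = True
    while improved:
        improved = False
        for i, cand in product(range(k), names):
            if cand in medoids:
                continue
            trial = medoids[:i] + [cand] + medoids[i + 1:]
            if total_cost(trial) < total_cost(medoids):
                medoids, improved = trial, True
    # Assign every word to its nearest medoid.
    labels = {n: min(medoids, key=lambda m: manhattan(points[n], points[m])) for n in names}
    return medoids, labels

medoids, labels = pam(2)
print(medoids, labels)
```

Because each swap is accepted only when it lowers the cost, the procedure stops at a local optimum, which is why the result can depend on the starting medoids.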
K-medoid clustering analysis
I The method is greedy (it makes locally optimal choices), so it is a heuristic that can yield different solutions.
I 19 words fall in cluster 2, and one word is contained in both cluster 2 and cluster 3.
I Similar to the previous result, most words are contained in cluster 2.
I Determining the number of clusters:
I The "elbow" method calculates how much of the variability in the data is explained by the clustering.
I The point where the increase changes drastically is taken as the optimal cut-off for the number of clusters.
Figure: K-medoid clustering analysis and variance explained.
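One common way to quantify "variability explained" for the elbow method is one minus the ratio of within-cluster to total sum of squares. The Python sketch below assumes that definition (with squared Euclidean distance) and uses hypothetical points and cluster labels; it is not necessarily the exact quantity plotted in the figure.

```python
def explained_variance(points, labels):
    """Share of the total sum of squares explained by a clustering (1 - WSS / TSS)."""
    def centroid(pts):
        return tuple(sum(c) / len(pts) for c in zip(*pts))

    def sum_of_squares(pts):
        c = centroid(pts)
        return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in pts)

    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    wss = sum(sum_of_squares(g) for g in groups.values())
    return 1 - wss / sum_of_squares(points)

# Hypothetical data and hypothetical cluster labels for k = 1, 2, 3.
points = [(1, 1), (1, 2), (2, 1), (8, 9), (9, 8), (5, 1)]
labelings = {
    1: [0, 0, 0, 0, 0, 0],
    2: [0, 0, 0, 1, 1, 0],
    3: [0, 0, 0, 1, 1, 2],
}
explained = {k: explained_variance(points, labs) for k, labs in labelings.items()}
for k, share in explained.items():
    print(k, round(share, 3))
```

Plotting the explained share against k and looking for the point where the gain flattens gives the elbow.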
Interpretation: K-medoid clustering
I Similar to the hierarchical clustering analysis in terms of sub-topics and main topics.
I The words "covid" and "pandemic" are separated from the major cluster.
I The word "health" is contained in the other cluster (cluster 2).
I The clusters do not overlap, indicating an adequate fit to the data. Two clusters account for 45.64% of the variability in the data.
I Reflects a meaningful relation between words; the result reflects how people perceived living conditions and issues.
word     covid   pandemic  Canada    impact     Canadians  business  impacts
cluster  1       3         2         2          2          2         2
word     people  health    Canadian  economics  survey     article   data  examine
cluster  2       3         2         2          2          2         2     2
Table: Words classified by cluster in k-medoids
Time series analysis: local smoothing regression
I A non-parametric regression method that combines a moving average (MA) with polynomial regression.
I A smooth curve is fitted to the trend to identify whether one word influences another and thereby causes certain issues in 2020.
I Overall, the result shows that causality cannot be established, as the trend is the same for all 7 words.
I The decreasing trend of the word "Canada" after June shifts to the word "Canadian", as people aim to describe more specific interests.
I "Canadian" issues appear more frequently in the headlines from July to August, as does the word "health", indicating that the impact of health issues on Canadians was most severe in that period.
Figure: Time series analysis by local smoothing regression
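The idea of mixing a moving window with polynomial regression can be illustrated with a bare-bones local linear smoother. This Python sketch assumes tricube weights, a degree-1 local fit and distinct x values; the monthly counts are hypothetical and `loess_point` is an illustrative name, not the routine used for the figure.

```python
def loess_point(xs, ys, x0, span=0.75):
    """Fit a tricube-weighted local line around x0 and return its value at x0.

    Assumes at least 3 distinct x values.
    """
    n = len(xs)
    k = min(n, max(3, round(span * n)))            # number of neighbours in the window
    h = sorted(abs(x - x0) for x in xs)[k - 1]     # bandwidth: distance to the k-th neighbour
    w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw  # weighted means
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)

# Hypothetical monthly frequencies of one word in 2020 (months 1..12).
months = list(range(1, 13))
counts = [2, 3, 10, 14, 12, 11, 9, 8, 6, 7, 5, 4]
smoothed = [loess_point(months, counts, m) for m in months]
print([round(v, 1) for v in smoothed])
```

A smaller `span` follows the monthly counts more closely, while a larger one gives a smoother trend line.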
Literature review of Gaussian Graphical Model
I Explicitly captures the statistical dependency between the variables of interest in the form of a network graph.
I Each node in the graph corresponds to one of the words in the text data.
I Missing edges in the graph correspond to conditional independence relations.
I Start with the complete graph and take a stepwise approach, comparing graphs by their BIC values, to remove edges.
I Apply a specific threshold to the partial correlations and remove all edges below the threshold.
Figure: Undirected Gaussian graphical model for the dependency structure
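The thresholding variant can be sketched directly: partial correlations follow from the inverse of the correlation matrix (the precision matrix) as -p_ij / sqrt(p_ii * p_jj), and edges below the threshold are dropped. The Python illustration below uses a hypothetical 3-word correlation matrix and an arbitrary threshold of 0.1; it does not cover the stepwise BIC search.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def partial_correlations(corr):
    """Partial correlation of each pair given all other variables."""
    prec = invert(corr)
    n = len(corr)
    return [[1.0 if i == j else -prec[i][j] / (prec[i][i] * prec[j][j]) ** 0.5
             for j in range(n)] for i in range(n)]

# Hypothetical correlation matrix for three words (assumption, not the slide data).
words = ["covid", "pandemic", "health"]
corr = [[1.00, 0.50, 0.25],
        [0.50, 1.00, 0.50],
        [0.25, 0.50, 1.00]]

pcor = partial_correlations(corr)
threshold = 0.1  # illustrative cut-off
edges = [(words[i], words[j])
         for i in range(len(words)) for j in range(i + 1, len(words))
         if abs(pcor[i][j]) >= threshold]
print(edges)
```

In this toy matrix the marginal correlation between "covid" and "health" is fully accounted for by "pandemic", so their partial correlation vanishes and the edge is removed, which is exactly a conditional independence statement.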
Interpretation of Gaussian Graphical Model
I Conditional independence relationships between sets of words serve as a practical inference.
I The words "Canadian" and "covid" cannot connect with "article" and "Canada" except through the nodes "impact" and "pandemic", which gives a conditional independence relationship.
I Also, (health) ⊥ (impact, Canada) | (article), by the connected edges.
I (Canadian) ⊥ (impact, pandemic) | (covid) and (health, article) ⊥ (Canadian, covid) | (impact, pandemic).
I Structural interpretation and result of partial correlation:
I Most articles that deal with "health" issues are relevant to "pandemic" and "impact" in 2020.
I The words "covid" and "article" are conditionally independent (with 2), and "Canadian" and "impact" are also conditionally independent (with 9).
Figure: Undirected Gaussian graphical model for the dependency structure
Literature review of biterm topic model (BTM)
I First introduced in 2013 to address the inadequacies of topic models on short documents by modelling word co-occurrences globally, across the whole corpus.
I Well suited to topic clustering of short texts, as it is a probabilistic generative model.
I Learns topics by directly modelling the generation of word co-occurrence patterns (biterms) in the corpus, rather than by modelling each document as a mixture of topics.
I The R package BTM was used to perform the biterm topic modelling.
I Crawl the plain-text data, then pre-process and tokenize the inputs. The output gives a unique tag for each sentence and the characteristics of each word.
I Perform tagging on the titles and extract co-occurrences of nouns, adjectives and verbs within a distance of 3 words.
I Build the biterm topic model with 5 topics, providing the set of biterms to cluster, with the tuning parameters supplied as inputs.
I The R package "ggraph" is used to automatically plot the topic clustering data.
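To make the notion of a biterm concrete, the sketch below extracts unordered word pairs that occur within 3 positions of each other in a title. It is a Python illustration only: the tokenized titles are hypothetical, the window is assumed to mean a position difference of at most 3, and `biterms` is an illustrative helper rather than a function of the BTM package.

```python
from collections import Counter

def biterms(tokens, window=3):
    """Return unordered word pairs whose positions differ by at most `window`."""
    pairs = []
    for i, left in enumerate(tokens):
        for right in tokens[i + 1:i + 1 + window]:
            if left != right:
                pairs.append(tuple(sorted((left, right))))
    return pairs

# Hypothetical titles, already tokenized and reduced to nouns, adjectives and verbs.
titles = [
    ["mental", "health", "canadians", "pandemic"],
    ["price", "outlook", "business", "pandemic", "impact"],
]

counts = Counter(b for t in titles for b in biterms(t))
print(counts.most_common(3))
```

The pooled biterm counts over all titles are the kind of input a biterm topic model is fitted to, which is why it copes better with short texts than models that rely on per-document word counts.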
Interpretation of the biterm topic model (BTM)
I Some unique words not observed in the earlier analyses appear, including "international", "postsecondary", and "student".
I Economic and sociological issues are somewhat separated, yielding different topics.
I As a natural consequence of the Covid-19 pandemic, the words "medical", "protective", "business" and "personal" appear as topics of interest.
I BTM yields comprehensive, grouped topics of words through the application of mixture models.
I The words "mental" and "health" are closely connected, pointing to the issue of public health.
I The words "outlook", "price" and "service" show the worries and living conditions of Canadians under the impact of Covid-19.
Figure: Biterm clusters for 5 topics
Conclusion and discussion
I There is some variation and there are minor differences between the methods.
I BTM provides the most useful guidance on the data we analyzed, with coherent and consistent topics.
I Hierarchical clustering shows 2 clusters, but k-medoid clustering shows 3 clusters.
Figure: Models used to analyze the text data.
Conclusion and discussion
I Each method differs in its mathematical and statistical foundation, leading us to explore the data from different angles and guiding us through different results.
I Analyzing term frequencies and co-frequencies, clustering, and formulating topic models help us better understand the keyword topics of the Covid-19 pandemic in Canada.
I The wordcloud revealed the keywords and co-occurrences in the data.
I Hierarchical clustering and k-medoid clustering grouped the keywords and investigated their correlation.
I Two groups of keywords were observed: a first group of main-topic keywords ("covid", "pandemic" and "health") and a second group of sub-topic keywords for issues.
I Local smoothing regression on the time series data investigated whether differing trends suggest a causal relationship.
I The trend is similar for the 7 most frequent words, so no formal statement on causal inference can be made.
I The Gaussian graphical model drew conditional independence and structural dependence relationships between the 7 most frequent words.
I Through conditional independence, issues are organized by the co-occurrences of phrases into structural dependencies.
I BTM gave more concise and specific topics that clearly differentiate the relevant keywords.
I The 5 topics yield the problems and issues, with their keywords, that Canadians must overcome to end the Covid-19 pandemic.
References
I Agrawal, R., Imielinski, T., & Swami, A. (1993, June). Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data (pp. 207-216).
I Wijffels, J. (2020). BTM: Biterm topic models for short text. URL: https://CRAN.R-project.org/package=BTM. R package version 0.3, 1.
I Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. (1987). Occam’s razor. Information Processing Letters, 24(6), 377-380.
I Gershon, N., & Page, W. (2001). What storytelling can do for information visualization. Communications of the ACM, 44(8), 31-37.
I Becue-Bertaut, M. (2019). Textual data science with R. CRC Press.
I Kim, J. M., Yoon, J., Hwang, S. Y., & Jun, S. (2019). Patent keyword analysis using time series and copula models. Applied Sciences, 9(19), 4071.
I Le Pennec, E., & Slowikowski, K. (2019). ggwordcloud: A word cloud geom for 'ggplot2'. R package version 0.5.0.
I Galili, T. (2015). dendextend: An R package for visualizing, adjusting and comparing trees of hierarchical clustering. Bioinformatics, 31(22), 3718-3720.
I Schubert, E., & Rousseeuw, P. J. (2019, October). Faster k-medoids clustering: improving the PAM, CLARA, and CLARANS algorithms. In International Conference on Similarity Search and Applications (pp. 171-187). Springer, Cham.
I James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112, p. 18). New York: Springer.
I Akerkar, R. (Ed.). (2020). Big Data in Emergency Management: Exploitation Techniques for Social and Mobile Data. Springer Nature.
I Kim, J. M., & Jun, S. (2015). Graphical causal inference and copula regression model for apple keywords by text mining. Advanced Engineering Informatics, 29(4), 918-929.

More Related Content

PDF
20142014_20142015_20142115
PDF
PDF
Statistics for Managers notes.pdf
PDF
Data science landscape in the insurance industry
PDF
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
PDF
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
PDF
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
PDF
Knowledge Graph Futures
20142014_20142015_20142115
Statistics for Managers notes.pdf
Data science landscape in the insurance industry
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
Knowledge Graph Futures

Similar to Text mining and its association analysis.pdf (13)

PDF
Pt2520 Unit 6 Data Mining Project
PPTX
Social Media and Text Analytics
DOCX
2To ADD names From ADD name Date ADD date Subject ADD ti.docx
DOCX
2To ADD names From ADD name Date ADD date Subject ADD ti.docx
PDF
ugc carelist journals ugc carelist journals
PPTX
Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...
DOCX
3282016 Additional Book Resourceshttpscourserooma.cap.docx
PDF
Supervised Multi Attribute Gene Manipulation For Cancer
PDF
Corporate bankruptcy prediction using Deep learning techniques
PDF
Database Concepts 8th Edition Kroenke Test Bank
PDF
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
PDF
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
PDF
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE
Pt2520 Unit 6 Data Mining Project
Social Media and Text Analytics
2To ADD names From ADD name Date ADD date Subject ADD ti.docx
2To ADD names From ADD name Date ADD date Subject ADD ti.docx
ugc carelist journals ugc carelist journals
Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...
3282016 Additional Book Resourceshttpscourserooma.cap.docx
Supervised Multi Attribute Gene Manipulation For Cancer
Corporate bankruptcy prediction using Deep learning techniques
Database Concepts 8th Edition Kroenke Test Bank
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE
Ad

More from KyusonLim (7)

PPTX
ROC Korean drought presentation.pptx
PDF
Dag in mmhc
PDF
Regularization and variable selection via elastic net
PDF
ideas of mathematics -17tilings (final)
PDF
BlUP and BLUE- REML of linear mixed model
PDF
Missing value imputation (slide)
PDF
Survival analysis 1
ROC Korean drought presentation.pptx
Dag in mmhc
Regularization and variable selection via elastic net
ideas of mathematics -17tilings (final)
BlUP and BLUE- REML of linear mixed model
Missing value imputation (slide)
Survival analysis 1
Ad

Recently uploaded (20)

PDF
MBA _Common_ 2nd year Syllabus _2021-22_.pdf
PPTX
Introduction to pro and eukaryotes and differences.pptx
PDF
Complications of Minimal Access-Surgery.pdf
PPTX
Climate Change and Its Global Impact.pptx
PDF
Race Reva University – Shaping Future Leaders in Artificial Intelligence
PDF
English Textual Question & Ans (12th Class).pdf
PDF
FOISHS ANNUAL IMPLEMENTATION PLAN 2025.pdf
PPTX
Unit 4 Computer Architecture Multicore Processor.pptx
PDF
LIFE & LIVING TRILOGY - PART - (2) THE PURPOSE OF LIFE.pdf
PDF
BP 704 T. NOVEL DRUG DELIVERY SYSTEMS (UNIT 2).pdf
PDF
Literature_Review_methods_ BRACU_MKT426 course material
PDF
LIFE & LIVING TRILOGY - PART (3) REALITY & MYSTERY.pdf
PDF
1.3 FINAL REVISED K-10 PE and Health CG 2023 Grades 4-10 (1).pdf
PDF
My India Quiz Book_20210205121199924.pdf
PDF
Journal of Dental Science - UDMY (2021).pdf
PPTX
ELIAS-SEZIURE AND EPilepsy semmioan session.pptx
PDF
CISA (Certified Information Systems Auditor) Domain-Wise Summary.pdf
PPTX
Education and Perspectives of Education.pptx
PDF
BP 505 T. PHARMACEUTICAL JURISPRUDENCE (UNIT 2).pdf
DOCX
Cambridge-Practice-Tests-for-IELTS-12.docx
MBA _Common_ 2nd year Syllabus _2021-22_.pdf
Introduction to pro and eukaryotes and differences.pptx
Complications of Minimal Access-Surgery.pdf
Climate Change and Its Global Impact.pptx
Race Reva University – Shaping Future Leaders in Artificial Intelligence
English Textual Question & Ans (12th Class).pdf
FOISHS ANNUAL IMPLEMENTATION PLAN 2025.pdf
Unit 4 Computer Architecture Multicore Processor.pptx
LIFE & LIVING TRILOGY - PART - (2) THE PURPOSE OF LIFE.pdf
BP 704 T. NOVEL DRUG DELIVERY SYSTEMS (UNIT 2).pdf
Literature_Review_methods_ BRACU_MKT426 course material
LIFE & LIVING TRILOGY - PART (3) REALITY & MYSTERY.pdf
1.3 FINAL REVISED K-10 PE and Health CG 2023 Grades 4-10 (1).pdf
My India Quiz Book_20210205121199924.pdf
Journal of Dental Science - UDMY (2021).pdf
ELIAS-SEZIURE AND EPilepsy semmioan session.pptx
CISA (Certified Information Systems Auditor) Domain-Wise Summary.pdf
Education and Perspectives of Education.pptx
BP 505 T. PHARMACEUTICAL JURISPRUDENCE (UNIT 2).pdf
Cambridge-Practice-Tests-for-IELTS-12.docx

Text mining and its association analysis.pdf

  • 1. A text mining and association analysis: Exploring text data for creating topic models Kyuson Lim 1Department of Mathematics & Statistics, McMaster University E-mail: limk15@mcmaster.ca
  • 2. Content I Introduction I Motivation I Analysis result of 6 methods I Literature review I Interpretation I Conclusion
  • 3. Why text mining? I Data transformation: business/industry overwhelmed with unstructured data. I Telecommunication industry: analysis on customer termination reasoning. I Government agency: news issues, local opinions, topic clusterings on annual report. I Data mining: web crawling on news headlines, social media, movie reviews. I Business intelligence, exploratory data analysis, consumer behaviors. I Easy to interpret, variety of applications, attractable outcomes, combinations with other result. I Variety of algorithm: Bert (Bidirectional Encoder Representations from Transformers), BTM (Biterm Topic Models), LDA (Latent Dirichlet Allocation) Figure: Example by Kyuson in 2020
  • 4. How is different? I Natural language processing (NLP): used to understand human language by analyzing text, speech, or grammatical syntax. I Advanced ML models and AI (artificial intelligence), ie. Siri. I Geared towards mimicking natural human communication, syntax meaning. I Extract grammatical structure and the sentiment. I Text mining: used to extract information from unstructured and structured content. I Extract information from unstructured data. I Statistical models to analyze qualitative meaning of content. I Frequency of words, patterns, and correlation within words to explain the text. Figure: Example of Journals (Author submitted)
  • 5. Goal of the analysis I Effectively portray the output and present in a collection of keywords by its connects. I Wordcloud, creation using data visualization. I Hierarchal clustering and correlation graph. I K-medoids clustering on classification. I Association between issues and causal discovery on timeline of words. I Gaussian graphical model: visually portray for connection. I Local smoothing regression: fit on timelines in frequencies of words. I BTM: topic clustering on documents, ultimate tool for co-occurrence based topic model.
  • 6. What gain? I Statistics Canada (StatsCan COVID-19): I January to December 2020 of the Canadian Perspectives Survey Series (CPSS). I Covers livings and lifestyle issues of aged over 20 in Canada. I Topics in sociological and economical issues. I Keywords and issues of Covid-19 pandemics in 2020. I Measure of impact on daily life, endemics. I Interpretation is simple, concise and scientific. I Future usage: recap and pandemic solution. I Exploratory data analysis on advanced analytic method. I Data visualization: efficient and practical skill to earn. I Application for real life data: data analyst, complex data.
  • 7. Text mining pre-processing I Web crawling and tokenize text. 1. In the website Statistics Canada, using html source code, analyze the relevant code for headlines which parses through contents. 2. Then, build code in R to crawl each headlines by the loop to save with. 3. Tokenize by making all words to be a small letters. I Pre-process for filtering the unnecessary words. I Eliminate adverbs and special characters using package "stopwords" and "tokenizers". I Construct a term table, consist of words and frequencies. I Crawl published dates and change format into dates. Figure: Example of web crawling
  • 8. Literature review of wordcloud I Visualize keyword metadata on websites in free form text. I Mainly familiarized by the Web 2.0 websites and blogs. I Past experience to publish in government report. I Association rule: support analysis (co-occurences). I Commonly known to be called as a market basket analysis. I Machine learning (ML) method for discovering the interesting relations between variables. I Identify visually influential factors and attractive for readers to understand the data. Figure: Web 2.0 and association rule analysis: support
  • 9. Wordcloud and co-occurences I Wordcloud and co-occurrence bar graph of top 6 ranked most frequent used words. I "Covid" and "pandemic" has been used in 79 times. I "pandemic" and "Canadian" has been used 21 times. I Most "articles" are "Canadian" and "impact" except for "covid" and "pandemics". I Overall data on frequencies and keywords to combine for phrase. I Supplement unstructured form of wordcloud by quantitative bar graph on the importance of issues by keywords identification. I Investigate to find words that constitute topics of impact by Covid-19 in 2020. Figure: Frequency table and combination of wordcloud with co-occurences graph
  • 10. Interpretation: Wordcloud and co-occurences I Minor and major words investigation. I Identify some unique words such as "data", "differences", "survey", "statistics" and "study". I Economical and sociological issues for majority of articles. I Interest in living issues: "price", "mental", "concerns", "home", and "workers". I Future work for living issues and overcome for inflation, economic support. I Unique overview to improve in various text data, publish in shiny app. Figure: Frequency table and combination of wordcloud with co-occurences graph
  • 11. Literature review of hierarchal clustering I Seeks to construct a hierarchy of clusters, which classify the words into groups based on the dissimilarity of the words. 1. The number of times of the word used is the coordinates in the space. 2. For any two words, the distance in the space is calculated as a measure of dissimilarity. 3. At the beginning of the clustering process, each element is in a cluster of its own. 4. Within the distance matrix, we can then cluster the words. 5. For two clusters, the distance is the maximum distance among any pair of elements from the two clusters. 6. Then the two clusters separated by the shortest distance are combined. 7. Iteratively, two most similar (close) clusters or word is joint until there is 2 cluster. I Correlation analysis: co-occurrences among all documents of words in the sparse matrix. I If a word occurs in a particular document, then the sparse matrix entry for corresponding to that row and column is 1, else it is 0. I An efficient representation of the information contained in the term document matrix.
  • 12. Hierarchal clustering and correlation analysis I Dendrogram: tree structure, visualize clusters of combination by the distance. I 2 clusters establishes for 18 words and 3 words with "covid", "health" and "pandemic". I A word "covid" and "pandemic" has strong correlation (0.35). I A "impact" has negative correlation with the word "health" (-0.1). I Correlation is calculated based on all words to account with. Figure: A hierarchal clustering of dendrogram and correlation analysis.
  • 13. Interpretation: Hierarchal clustering and correlation I Sub-hierarchy to formulate a result of topics with issues. I Words "examines", "study", "using", and "survey" are grouped together in the same hierarchy I Main topic of 3 words and 18 words of sub-topics. I Similar result for the dissimilarity between word "impact" and words of "health" and "pandemic". Figure: A hierarchal clustering of dendrogram and correlation analysis.
  • 14. Literature review of K-medoid clustering I Partitions the data into groups and attempt to minimize the distance between points by defining a point of the center in that cluster. I Use the Manhattan distance to define the dissimilarity. I A k-medoid minimizes a sum of pairwise dissimilarities. I A k-medoid chooses datapoints in the data as centers (called medoids). I Build steps to construct the clusters and swap steps to adjust boundaries of points. Figure: A iteration steps for constructing the k-medoid clustering.
  • 15. K-medoid clustering analysis I The method is greedy (local optimal choice) to be heuristic for many solutions. I 19 words in cluster 2 and a word is contained in both cluster 2 and 3. I A similar result to contain most words in the cluster 2. I Determine the number of clusters: I An "elbow" method, calculate how much variability in the data to be explained by the clustering. I Identify the drastic point of increase to be the optimal cut-off for the choice in the number of clusters. Figure: A k-medoid clustering analysis and variance explained.
  • 16. Interpretation: K-medoid clustering I Similar to hierarchical clustering analysis in sub-topics and main topics. I The words of "covid" and "pandemic" is separated from the major cluster. I The word of "health" is contained with the other cluster (cluster 2). I No cluster overlaps to be adequate fit for the data. Two number of cluster accounts for 45.64% of the variability in the data. I Reflect a meaningful relation between words where the result is reflected on how people perceived in livings and issues. cluster covid pandemic Canada impact Canadians business impacts number 1 3 2 2 2 2 2 cluster people health Canadian economics survey article data examine number 2 3 2 2 2 2 2 2 Table: Table of words classified by clusters in k-medoids
  • 17. Time series analysis: local smoothing regression I A type of non-parametric regression method that is mixed type of a moving average (MA) and polynomial regression. I A smooth curve fitted for the trend changes to identify if one word has impact on the other to cause some issues in 2020. I Overall result shows that the causality is not possible, as the trend is the same for all 7 words. I The decreasing trend of word "Canada" after June has been moved to the word "Canadian" as people aims to describe more specified interest. I A "Canadian" issues are more frequently appeared in the headlines at the period of July to August as the word "health" does indicating that the impact on Canadian people for health issues are most severed. Figure: A time series data analysis by local smoothing regression
  • 18. Literature review of Gaussian Graphical Model I Explicitly capture the statistical dependency between the variables of interest in the form of a network graph. I Each node in the graph corresponds to one of the word in the text data. I A missing edges in graph correspond to conditional independence relations. I Start with complete graph, take stepwise approach with BIC values (compare graphs) to disconnect the edge. I Apply a specific threshold for the partial correlation and remove all edge less than the threshold. Figure: Undirected Gaussian graphical model for the dependency structure
  • 19. Interpretation of Gaussian Graphical Model
I Conditional independence relationships between sets of words serve as a practical inference.
I The words "Canadian" and "covid" connect to "article" and "Canada" only through the intermediate nodes "impact" and "pandemic", which yields a conditional independence relationship.
I Also, (Health) ⊥ (impact, Canada) | (article), by the connected edges.
I (Canadian) ⊥ (impact, pandemic) | (covid) and (health, article) ⊥ (Canadian, covid) | (impact, pandemic)
I Structural interpretation and result of partial correlation:
I Most articles that deal with "health" issues are relevant to "pandemic" and "impact" in 2020.
I The words "covid" and "article" are conditionally independent (with 2), and "Canadian" and "impact" are also conditionally independent (with 9).
Figure: Undirected Gaussian graphical model for the dependency structure
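Statements such as (Canadian) ⊥ (impact, pandemic) | (covid) can be read off an undirected graph by checking separation: two sets are conditionally independent given a third if every path between them passes through the conditioning set. The Python sketch below checks this with a breadth-first search. The edge list is an assumption, chosen only to be consistent with the statements on this slide; it is not read from the figure.

```python
from collections import deque


def separated(edges, a_set, b_set, cond):
    """True if every path from a_set to b_set passes through a node in cond."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    blocked = set(cond)
    queue = deque(n for n in a_set if n not in blocked)
    seen = set(queue)
    while queue:
        node = queue.popleft()
        if node in b_set:
            return False  # reached b_set without crossing the conditioning set
        for nxt in adj.get(node, ()):
            if nxt not in seen and nxt not in blocked:
                seen.add(nxt)
                queue.append(nxt)
    return True


# Hypothetical edge list (an assumption, not taken from the figure).
ggm_edges = [
    ("Canadian", "covid"),
    ("covid", "impact"),
    ("covid", "pandemic"),
    ("impact", "article"),
    ("pandemic", "article"),
    ("impact", "Canada"),
    ("article", "health"),
]

print(separated(ggm_edges, {"health"}, {"impact", "Canada"}, {"article"}))
print(separated(ggm_edges, {"Canadian"}, {"impact", "pandemic"}, {"covid"}))
print(separated(ggm_edges, {"Canadian"}, {"health"}, set()))
```

With an empty conditioning set the last call finds a path from "Canadian" to "health", so the two words are dependent marginally even though they are separated once the intermediate nodes are conditioned on.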
  • 20. Literature review of biterm topic model (BTM)
I First introduced in 2013 to address the shortcomings of conventional topic models on short documents by modelling word co-occurrences over the whole corpus.
I A strong choice for topic clustering of short texts, as it is a probabilistic generative model designed for them.
I Learns topics by directly modeling the generation of word co-occurrence patterns (biterms), treating the whole corpus, rather than each document, as a mixture of topics.
I The R package BTM was used to perform the biterm topic modeling.
I Crawl the plain-text data and pre-process it into tokens. The output gives a unique tag for each sentence and the characteristics of each word.
I Perform tagging on the titles and extract co-occurrences of nouns, adjectives and verbs within a distance of 3 words.
I Build the biterm topic model with 5 topics and provide the set of biterms to cluster, with the tuning parameters supplied as input.
I The R package "ggraph" is used to automatically plot the topic clustering results.
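The biterm extraction step can be illustrated with a short Python sketch. It assumes the headline has already been tokenized and filtered to nouns, adjectives and verbs, and it reads "within 3 words distance" as positions differing by at most 3; both are assumptions, and the headline itself is hypothetical. Fitting the topic model on these biterms (done with the R package BTM in the analysis) is not shown.

```python
from collections import Counter


def biterms(tokens, window=3):
    """Unordered pairs of distinct words at most `window` positions apart."""
    pairs = []
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != word:
                pairs.append(tuple(sorted((word, tokens[j]))))
    return pairs


# Hypothetical headline, already reduced to nouns, adjectives and verbs.
headline = ["mental", "health", "impact", "canadian", "business"]
headline_biterms = biterms(headline, window=3)
biterm_counts = Counter(headline_biterms)

print(len(headline_biterms))
print(biterm_counts.most_common(3))
```

Because a biterm is counted over the whole corpus rather than per document, even very short headlines contribute usable co-occurrence information.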
  • 21. Interpretation of the biterm topic model (BTM)
I Some distinctive words that had not surfaced in the earlier analyses appear here, including "international", "postsecondary", and "student".
I The economic issues and the sociological issues are somewhat separated, yielding different results.
I As a natural consequence of the Covid-19 pandemic, the words "medical", "protective", "business" and "personal" appear as topics of interest.
I The BTM yields comprehensive, grouped topics of words through the application of mixture models.
I The words "mental" and "health" are closely connected, pointing to the issue of public health.
I The words "outlook", "price" and "service" show the worries and living conditions of Canadians under the impact of Covid-19.
Figure: Biterm clusters for 5 topics
  • 22. Conclusion and discussion
I There are some variations and minor differences between the methods.
I The BTM provides the clearest guidance on the data we analyzed, with coherent and consistent topics.
I Hierarchical clustering shows 2 clusters, but k-medoid clustering shows 3 clusters.
Figure: Model used to analyze text data.
  • 23. Conclusion and discussion
I Each method differs in its mathematical and statistical foundation, leading us to explore the data and guiding us through the different results of the analysis.
I Analyzing term frequencies and term co-frequencies, clustering, and formulating topic models give a better understanding of the keyword topics of the Covid-19 pandemic in Canada.
I A wordcloud showed the keywords and their co-occurrences in the data.
I Hierarchical clustering and k-medoid clustering grouped the keywords and investigated their correlation.
I Two groups of keywords were observed: a first group of main topics ("covid", "pandemic" and "health") and a second group of sub-topics for issues.
I A local smoothing regression on the time series data investigated whether differing trends point to a causal relationship.
I The trend is similar for the top 7 most frequent words, so no formal statement on causal inference can be made.
I A Gaussian graphical model drew out the conditional independence and structural dependence relationships between the top 7 ranked words.
I Through conditional independence, the issues are organized by the co-occurrences of phrases into structural dependencies.
I The BTM gave more concise and specific topics, clearly differentiating the relevant keywords.
I The 5 topics yield, through their keywords, the problems and issues Canadians must overcome to bring the Covid-19 pandemic to an end.
  • 24. References
I Agrawal, R., Imielinski, T., & Swami, A. (1993, June). Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data (pp. 207-216).
I Wijffels, J. (2020). BTM: Biterm topic models for short text. URL: https://CRAN.R-project.org/package=BTM. R package version 0.3, 1.
I Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. (1987). Occam's razor. Information Processing Letters, 24(6), 377-380.
I Gershon, N., & Page, W. (2001). What storytelling can do for information visualization. Communications of the ACM, 44(8), 31-37.
I Becue-Bertaut, M. (2019). Textual data science with R. CRC Press.
I Kim, J. M., Yoon, J., Hwang, S. Y., & Jun, S. (2019). Patent keyword analysis using time series and copula models. Applied Sciences, 9(19), 4071.
I Le Pennec, E., & Slowikowski, K. (2019). ggwordcloud: A word cloud geom for 'ggplot2'. R package version 0.5.0.
I Galili, T. (2015). dendextend: an R package for visualizing, adjusting and comparing trees of hierarchical clustering. Bioinformatics, 31(22), 3718-3720.
I Schubert, E., & Rousseeuw, P. J. (2019, October). Faster k-medoids clustering: improving the PAM, CLARA, and CLARANS algorithms. In International Conference on Similarity Search and Applications (pp. 171-187). Springer, Cham.
I James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112, p. 18). New York: Springer.
I Akerkar, R. (Ed.). (2020). Big Data in Emergency Management: Exploitation Techniques for Social and Mobile Data. Springer Nature.
I Kim, J. M., & Jun, S. (2015). Graphical causal inference and copula regression model for apple keywords by text mining. Advanced Engineering Informatics, 29(4), 918-929.