Introduction
ITMS
Preprocessing Data
Data Visualization
Cluster Analysis
Topic Modeling
Google Book API
Future Directions
References
Interactive Visual Data Analysis
Part Two
Interactive Text Mining Suite
Olga Scrivner
Indiana University
Workshop in Methods
Outline
1 Introduce a web application for text processing and mining
2 Learn about natural language processing techniques
3 Develop practical skills
Data Mining
“As our collective knowledge continues to be digitized and
stored (...) it becomes more difficult to find and discover what
we are looking for.” (Blei 2012)
New Ways of Exploring Data Collections
Word clouds (Vuillemot et al., 2009)
Visualization Methods
Social network graphs (Rydberg-Cox, 2011)
Visualization Methods
Tracking emotion and sentiment in fairy tales
(Mohammad, 2012)
Topic Modeling
Discovering the underlying themes of a collection: Science magazine,
1990-2000 (Blei 2012)
Technological and Methodological Obstacles
Many tools require some programming skills (Mallet,
Meta, R and Python libraries)
GUI tools are limited to certain formats and functions
(Voyant, PaperMachine)
Lack of active control by users
Interactive Text Mining Suite
A user-friendly tool for quantitative analysis and
visualization of unstructured data
Platform-independent
Interactive
ITMS Structure
1 File Uploads
Upload files (txt, pdf, rdf, and Google Books API)
2 Data Preparation
Data preprocessing (stopwords, stemming, metadata)
3 Data Visualization
Word frequencies, Cluster analysis and topic modeling
Workshop Files
Download 3 text files:
http://ssrc.indiana.edu/seminars/wim.shtml
NY Times articles (3 documents in plain text format)
ITMS web site:
http://www.interactivetextminingsuite.com
Upload File
Preprocessing Data
Before performing data analysis we should preprocess data.
Preprocessing Options
Select preprocessing options and click apply.
Stopwords
Stopwords (e.g. the, and): select Default for English
Manual Removal of Stopwords
As needed, remove any additional stopwords that you consider
noise, e.g. paper, shows, etc.
Select Apply.
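The two stopword steps (a default list, then manual additions) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how ITMS itself works: `DEFAULT_STOPWORDS` is a tiny hypothetical list (real default lists are much longer), and `tokenize` and `remove_stopwords` are names invented for this example.

```python
import re

# Tiny illustrative stopword list (hypothetical); real default lists are far longer.
DEFAULT_STOPWORDS = {"the", "and", "a", "an", "of", "to", "in", "is", "it"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(tokens, extra=()):
    """Drop the default stopwords plus any user-supplied extras."""
    stop = DEFAULT_STOPWORDS | set(extra)
    return [t for t in tokens if t not in stop]


tokens = tokenize("The paper shows the results and the method")
# Default list only: function words are dropped.
filtered = remove_stopwords(tokens)
# Manual removal: also drop words treated as noise for this corpus.
custom = remove_stopwords(tokens, extra=["paper", "shows"])
print(filtered)
print(custom)
```

Keeping the manual additions as a separate argument leaves the default list untouched, so the same defaults can be reused across corpora.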
Stemming
To improve the analysis, you can stem all your tokens: instead
of worked, works, and working, you will have a single
stem, work.
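To make the idea concrete, here is a deliberately naive suffix-stripping sketch in Python. It is not the stemmer ITMS uses (the slides do not say which one that is); production stemmers such as Porter's apply many more rules. The function name `naive_stem` and the three-suffix list are assumptions made for illustration.

```python
def naive_stem(token):
    """Strip one common English suffix; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


# The three inflected forms from the slide all collapse to one stem.
stems = [naive_stem(t) for t in ["worked", "works", "working", "work"]]
print(stems)
```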
Metadata Extraction
You can extract or upload metadata. A date stamp (year) is
required for chronological topic modeling.
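One simple way to obtain a year per document is a regular-expression search over the text. The sketch below is an assumption about how such extraction could be done, not a description of ITMS's own metadata extraction; the file names, the sample texts, and the 1500-2099 year range are all hypothetical.

```python
import re

# Match a standalone four-digit year between 1500 and 2099 (assumed range).
YEAR_PATTERN = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")


def extract_year(text):
    """Return the first four-digit year found in the text, or None."""
    match = YEAR_PATTERN.search(text)
    return int(match.group(1)) if match else None


# Hypothetical documents keyed by file name.
docs = {
    "a.txt": "Published March 3, 1998 in the city edition",
    "b.txt": "No date appears in this text",
}
years = {name: extract_year(text) for name, text in docs.items()}
print(years)
```

Documents that come back as `None` would need their year supplied by hand, for example through an uploaded metadata file.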
Visualization
Word Cloud Representation
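A word cloud sizes each word by how often it occurs, so the data behind it is a word-frequency table. A minimal Python sketch of that counting step follows; the sample sentence and the function name `word_frequencies` are made up for illustration, and no drawing is attempted.

```python
import re
from collections import Counter


def word_frequencies(text, top_n=3):
    """Count lowercase alphabetic tokens and return the top_n most frequent."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens).most_common(top_n)


top = word_frequencies(
    "data mining finds patterns in data; text mining mines text data"
)
print(top)
```

In practice this step would run after stopword removal and stemming, so that function words do not dominate the cloud.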
Customization
Cluster Analysis
You need to have at least three documents
Documents will be grouped based on their term similarity
measures
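The slide does not say which similarity measure is used, so the sketch below assumes cosine similarity over raw term counts, a common choice. It compares three made-up documents pairwise and picks the most similar pair, which is the first merge an agglomerative clustering would make. All names and texts are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations


def term_vector(text):
    """Represent a document as a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Three hypothetical documents (the minimum the slide asks for).
docs = {
    "doc1": "stocks fell as markets reacted to rates",
    "doc2": "markets rallied while stocks recovered",
    "doc3": "the team won the final match",
}
vectors = {name: term_vector(text) for name, text in docs.items()}
pairs = {
    (x, y): cosine_similarity(vectors[x], vectors[y])
    for x, y in combinations(sorted(vectors), 2)
}
# The most similar pair would be grouped first.
closest = max(pairs, key=pairs.get)
print(closest, round(pairs[closest], 3))
```

With only two documents there is a single pair and nothing to group, which is one way to see why at least three are needed.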
References
Topic Modeling
LDA (Latent Dirichlet Allocation)
STM (Structural Topic Model)
Chronological topic visualization (LDA): requires metadata
Topic Modeling Tuning
Number of topics (how many different themes)
Number of words per topic (how many words describe each theme)
Number of iterations
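To show where these three settings enter the computation, here is a toy collapsed Gibbs sampler for LDA in plain Python. It is a teaching sketch under stated assumptions, not the implementation behind ITMS (which builds on R packages): the function name `lda_gibbs`, the priors `alpha` and `beta`, the seed, and the four tiny documents are all hypothetical, and the topics found on such a small corpus are not meaningful.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics=2, words_per_topic=3, iterations=50,
              alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; returns the top words per topic."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # Count tables: topic totals, per-topic word counts, per-document topic counts.
    topic_total = [0] * num_topics
    topic_word = [Counter() for _ in range(num_topics)]
    doc_topic = [[0] * num_topics for _ in docs]
    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assign = []
        for w in doc:
            k = rng.randrange(num_topics)
            doc_assign.append(k)
            topic_total[k] += 1
            topic_word[k][w] += 1
            doc_topic[d][k] += 1
        assignments.append(doc_assign)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                topic_total[k] -= 1
                topic_word[k][w] -= 1
                doc_topic[d][k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                topic_total[k] += 1
                topic_word[k][w] += 1
                doc_topic[d][k] += 1
    # Report the most frequent words currently assigned to each topic.
    return [
        [w for w, c in topic_word[t].most_common(words_per_topic) if c > 0]
        for t in range(num_topics)
    ]


docs = [
    "market stocks trade market prices".split(),
    "stocks prices market trade".split(),
    "team game score team players".split(),
    "players game team score".split(),
]
topics = lda_gibbs(docs, num_topics=2, words_per_topic=3, iterations=100)
for number, words in enumerate(topics, start=1):
    print(number, words)
```

More topics split the collection into finer themes, more words per topic only changes how each theme is displayed, and more iterations give the sampler longer to settle at the cost of run time.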
Topic Model Selection
LDA Topic Model
STM Topic Model
Other Formats - Google Books
Before switching to another data format, refresh your
browser.
Start with File Uploads and select Structured Data.
Other Formats - Google Books
Enter your search terms and submit
The current limit is 40 books
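For readers curious what such a search looks like outside the application, the sketch below builds a query URL for the public Google Books volumes endpoint without sending any request. That endpoint caps `maxResults` at 40 per request, which may be where the 40-book limit comes from; this is an inference, not something the slides state. The helper name `build_books_query` and the search terms are hypothetical.

```python
from urllib.parse import urlencode

BOOKS_ENDPOINT = "https://www.googleapis.com/books/v1/volumes"
MAX_RESULTS = 40  # per-request cap of the volumes endpoint


def build_books_query(search_terms, max_results=MAX_RESULTS):
    """Build a Google Books volumes search URL (no request is sent)."""
    params = {"q": search_terms, "maxResults": min(max_results, MAX_RESULTS)}
    return f"{BOOKS_ENDPOINT}?{urlencode(params)}"


# Asking for more than the cap is clamped to 40.
url = build_books_query("text mining", max_results=100)
print(url)
```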
Future Options
The Shiny web application is highly customizable
1 Part-of-speech tagging (tm package)
2 Network analysis (igraph package)
3 Named Entity Recognition (NLP package)
4 Twitter streaming (twitteR package) - will require the user's own
Twitter set-up for streaming; instructions on how to set it up
will be provided
Open to other suggestions and collaboration - contact
obscrivn@indiana.edu
Acknowledgements
I would like to thank WIM for providing this opportunity.
Contributors: Jefferson Davis, Irina Trapido, Jay Lee
References I
[1] Many open-source R packages: tm, shiny, NLP, stringi, stringr, topicmodels, lda, and many more.
[2] Baayen, Harald. 2008. Analyzing Linguistic Data: A Practical Introduction to Statistics. Cambridge: Cambridge University Press.
[3] Gries, Stefan Th. 2015. Quantitative designs and statistical techniques. In Douglas Biber & Randi Reppen (eds.), The Cambridge Handbook of English Corpus Linguistics. Cambridge: Cambridge University Press.
[4] Jockers, Matthew. 2014. Text Analysis with R for Students of Literature. Quantitative Methods in the Humanities and Social Sciences. Cham: Springer International Publishing.
[5] Moretti, Franco. 2005. Graphs, Maps, Trees: Abstract Models for a Literary History. Verso.
[6] Oelke, Daniella, Dimitrios Kokkinakis, and Mats Malm. 2012. Advanced visual analytics methods for literature analysis. Proceedings of the 6th EACL Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pages 35–44.
Image credits: https://media.giphy.com/media/10zsjaH4g0GgmY/giphy.gif
Introduction to Text Mining and Visualization with Interactive Web Application