Introduction to Text Mining

Discover the world at Leiden University
Ben Companjen | Code4Lib2017

Plan for this workshop
1. Housekeeping
• Only one device on the WiFi network please!
2. Front matter
• About Ben
• About you(r goals)
• Goals for the workshop
3. Setup
• Pair up?
• Data distribution
4. Text Mining
5. Discussion & wrap-up
• Where to go from here?
• Practical issues

About me
• Digital Scholarship Librarian at Leiden University Libraries
- As part of the small Centre for Digital Scholarship team, I work with researchers and library colleagues to support digital scholarship, e.g. by explaining and helping with text and data mining, data modelling and information system design.
• @bencomp on Twitter, bencomp on IRC
• I'm not a wizard of text mining

About you: survey responses
• Not everyone knows Python
- hopefully we can help each other if needed
• The majority have not worked with Jupyter Notebook
- basic commands in one slide coming up
• The majority have heard or read about text mining, but have no practical experience
- some may have more experience than me
• You seem eager to get your hands dirty!

Goals for this workshop
• Understand definitions of text mining and related terms
• Get hands-on experience with a few text mining techniques
• Get a sense of the scope and limitations of text mining

Setup for today
• Work in pairs, preferably with someone who has Jupyter Notebook set up or has seen Python before
• Jupyter Notebook lets us mix text, code and output in one file and run it on any platform
• Notebooks have been prepared with code and explanations
• Step through the notebook, then change some parameters and run again
• Discuss the process and results with the whole group before continuing with the next notebook

Examples of text mining applications
• Find out what people think of a product
• Categorise an article with similar articles
• Detect plagiarism in theses
• Find out who wrote the lyrics to the Dutch national anthem
• Track the use and meaning of certain words or phrases over time

Text mining: definitions
• Text mining and text analytics
• Natural Language Processing
• Document
• Corpus

Approaching text mining
• Example
- Question: What are these documents about?
- Task: Extract topics (distinguishing words) from documents
- Method: Topic modelling
- Algorithm: Latent Semantic Analysis
- Tool: Gensim
• Question -> Task -> Method -> Algorithm -> Tool
- I.e. you normally start from the question, not from a tool that does the mining
• Algorithm-in-Tool + Parameters + Model + Data
- I.e. it may take more than one go to get a good answer

Data for text mining
• Plain text, no markup
• No para-text, such as headers and footers, title pages, etc.
• Preprocessing is almost always necessary
- Remove common words that carry little meaning ('the', 'a', 'be')
- Handle punctuation
- Convert to lowercase
- Split documents into paragraphs, sentences and/or words
• It depends on the type of mining you want to do (which depends on the question)
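
The preprocessing steps above can be sketched in plain Python. This is only an illustration, not the code from the workshop notebooks: the stop-word list is a tiny made-up sample, and the function names split_sentences and preprocess are hypothetical.

```python
import re

# Tiny illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"the", "a", "an", "be", "is", "are", "of", "and", "to", "in"}


def split_sentences(text):
    """Naively split text into sentences on '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase the text, drop punctuation, split into words, remove stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in stop_words]


text = "The library is open. Text mining is fun!"
sentences = split_sentences(text)
tokens = [preprocess(s) for s in sentences]
print(sentences)
print(tokens)
```

Which of these steps to keep depends on the question: for example, sentence splitting matters for entity recognition, while stop-word removal mostly helps word counting and topic modelling.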

Voyant Tools: short demo
• Question: What words and phrases are frequent in the Code4Lib2017 talks and workshops abstracts?
• Task: Show most used words and phrases in C4L2017 abstracts
• Method: Word cloud
• Algorithm: Count words and count phrases
• Tool: Voyant
• Voyant Tools provides an online environment for "hermeneutics", i.e. exploration of text to inform further investigation
- Word cloud
- Frequent terms
- Links (collocates)
- Frequencies
- Key Words In Context
- Phrases
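
The "count words and count phrases" algorithm and the Key Words In Context view can be approximated in a few lines of Python. This sketch does not reproduce how Voyant works internally; the two abstracts are made-up stand-ins, and the helper names (tokenize, ngrams, kwic) are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    """Return all phrases of n consecutive tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def kwic(tokens, keyword, window=2):
    """Return each occurrence of keyword with `window` tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


# Made-up stand-ins for the abstracts.
abstracts = [
    "Linked data in the library catalogue",
    "Text mining the library catalogue with open source tools",
]

word_counts = Counter()
phrase_counts = Counter()
for abstract in abstracts:
    toks = tokenize(abstract)
    word_counts.update(toks)
    phrase_counts.update(ngrams(toks, 2))  # count per document, not across documents

print(word_counts.most_common(3))
print(phrase_counts.most_common(3))
print(kwic(tokenize(abstracts[1]), "library"))
```

A word cloud is then just a visualisation of word_counts, with font size proportional to the count.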

Jupyter Notebook
• Web-based interface for editing and running Notebooks (i.e. code and text)
• Written in Python, but works with different backends ("kernels") for various programming languages
• A notebook consists of cells, each with either code or text
• Two modes: edit and command
- edit mode: write code, (Markdown) text
- command mode: cut/copy/paste cells, toggle output, etc.
• Some useful keys:
- (in command mode) Enter: go into edit mode; edit selected cell
- (in edit mode) Esc: go from edit to command mode
- Ctrl+Enter: run code/parse Markdown and keep the current cell selected – use for code-and-try cycles
- Shift+Enter: run code/parse Markdown and select next cell – use for running cells in a row
- Alt+Enter: run code/parse Markdown and create a cell after this one
- (in command mode) M: make cell a Markdown cell instead of code

Notebook 0: Corpora and Vector Spaces
• Adapted from a Gensim tutorial
• Load a corpus (efficiently)
• Transform it into a dictionary and matrix for further use
• Key concepts:
- bag of words
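
To make the bag-of-words idea concrete, here is a minimal sketch in plain Python. It deliberately does not use Gensim, so the function names (build_dictionary, doc_to_bow, stream_corpus) are hypothetical and only mirror the general idea of a dictionary plus sparse count vectors; the three documents are made up.

```python
from collections import Counter


def build_dictionary(tokenized_docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def doc_to_bow(tokens, token2id):
    """Return a sparse bag of words: sorted (token_id, count) pairs.

    Word order is thrown away; tokens missing from the dictionary are skipped.
    """
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


def stream_corpus(lines, token2id):
    """Yield one vector at a time, so the whole corpus need not sit in memory."""
    for line in lines:
        yield doc_to_bow(line.lower().split(), token2id)


docs = [
    "human computer interaction",
    "computer system response time",
    "human system system",
]
token2id = build_dictionary(d.split() for d in docs)
bows = list(stream_corpus(docs, token2id))
print(token2id)
print(bows)
```

Stacking these sparse vectors gives the document-term matrix that later steps work on.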

Notebook 1: Naive classification
• Adapted from an NLTK example
• Train a classifier to classify texts into a limited number of categories
• Key concepts:
- words as features
- training, machine learning
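
As an illustration of "words as features" and training, the sketch below implements a small naive Bayes classifier in plain Python. It assumes that the notebook's "naive" classification refers to naive Bayes; it is not the NLTK code from the notebook, and the training texts and function names are made up.

```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    """Count how often each word occurs in each category."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for tokens, label in labelled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokens)
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocabulary


def classify(tokens, model):
    """Return the category with the highest log-probability (add-one smoothing)."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # prior of the category
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocabulary:
                continue  # ignore words never seen during training
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training data: (tokens, category) pairs.
training = [
    ("great fun workshop".split(), "positive"),
    ("great helpful talk".split(), "positive"),
    ("boring slow talk".split(), "negative"),
    ("slow confusing workshop".split(), "negative"),
]
model = train(training)
prediction = classify("great helpful workshop".split(), model)
print(prediction)
```

The "naive" part is the assumption that words occur independently of each other given the category, which is why the per-word log-probabilities can simply be added.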

Notebook 2: Topics and Transformations
• Also adapted from a Gensim tutorial
• Transform the raw word counts into weighted counts and train a topic model
• Key concepts:
- TF-IDF
- Topic Model
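
The step from raw word counts to weighted counts can be sketched as follows. This uses one common TF-IDF variant (count times the natural log of N over document frequency); Gensim's defaults may differ in log base and normalisation, and the topic model itself is not covered here. The documents and the function name tfidf are made up.

```python
import math
from collections import Counter


def tfidf(tokenized_docs):
    """Weight each term count by log(N / number of documents containing the term)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each term once per document
    weighted = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weighted.append(
            {term: count * math.log(n_docs / doc_freq[term])
             for term, count in counts.items()}
        )
    return weighted


docs = [
    "library data library".split(),
    "library text mining".split(),
    "text mining tools".split(),
]
weights = tfidf(docs)
for doc_weights in weights:
    print(doc_weights)
```

Note that "library" occurs twice in the first document but in two of the three documents overall, so its weight stays modest, while a word occurring in every document would get weight zero: frequent-everywhere words do not distinguish documents.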

Notebook 3: Similarity queries
• Also adapted from a Gensim tutorial
• Find a topic that is close to the topic of a new (unseen) document or query
• Key concepts:
- similarity in vector space
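
Similarity in vector space is commonly measured with the cosine of the angle between two vectors. The sketch below assumes cosine similarity and compares plain term-weight vectors rather than topic vectors from a trained model; the vectors and function names are made up for illustration.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors given as {term: weight} dicts."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(query, documents):
    """Return (document index, similarity) pairs, most similar first."""
    scores = [(i, cosine_similarity(query, doc)) for i, doc in enumerate(documents)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Made-up document vectors and query.
documents = [
    {"text": 1.0, "mining": 1.0},
    {"library": 2.0, "catalogue": 1.0},
    {"text": 1.0, "library": 1.0},
]
query = {"text": 1.0, "mining": 2.0}
ranking = most_similar(query, documents)
print(ranking)
```

Because the cosine ignores vector length, a short query and a long document can still be close as long as they point in the same direction.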

Notebook 4: Entity recognition
• Based on http://textminingonline.com/getting-started-with-spacy
• Find references to entities (e.g. persons, organisations, locations) in running text
• Key concepts:
- Named Entity Recognition (list-based vs. rule-based)
- Named Entity Disambiguation
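
The difference between list-based and rule-based recognition can be shown with a toy example in plain Python. This is not how spaCy recognises entities; the gazetteer, the capitalisation rule and the function names are all assumptions made for illustration, and disambiguation is not attempted.

```python
import re

# Hypothetical gazetteer: a hand-made list of known names and their entity types.
GAZETTEER = {
    "Leiden University": "ORG",
    "Ben Companjen": "PERSON",
    "Leiden": "LOC",
}


def list_based_entities(text, gazetteer=GAZETTEER):
    """Find gazetteer names in text, preferring longer names over shorter ones."""
    found = []
    taken = set()  # character positions already claimed by a longer match
    for name in sorted(gazetteer, key=len, reverse=True):
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            span = range(match.start(), match.end())
            if taken.isdisjoint(span):
                taken.update(span)
                found.append((match.start(), name, gazetteer[name]))
    # Report entities in the order they appear in the text.
    return [(name, label) for _, name, label in sorted(found)]


def rule_based_candidates(text):
    """Rule: two or more consecutive capitalised words form a candidate entity."""
    return re.findall(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)+\b", text)


text = "Ben Companjen works at Leiden University in Leiden."
entities = list_based_entities(text)
candidates = rule_based_candidates(text)
print(entities)
print(candidates)
```

The list finds only names it already knows but can give them a type; the rule finds unknown names too, but cannot say what kind of entity they are and misses single-word names such as the final "Leiden".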

Recap of notebook tutorials
• Transform strings to vectors – if necessary
• Extract topics from documents in corpus as a model for your corpus
• Compare new documents to the model to classify them
• Extract entity references

Where (to|will you) go from here?
• There are many tasks, methods and tools already and more are being developed
- libraries (e.g. NLTK)
- All-in-one library and pre-trained models (e.g. textacy)
- (Cloud) API providers (e.g. Google Natural Language API, Apache Stanbol)
• Evaluate before use – of course
• What are your thoughts?

Further reading and tutorials
• Gensim has more tutorials
• spaCy provides links to tutorials
• https://github.com/JonathanReeve/advanced-text-analysis-workshop-2017/blob/master/advanced-text-analysis.ipynb
• Ted Underwood and his team blog about their use of text mining in research
• Methods Commons provides recipes for various kinds of (text) analyses
• Programming Historian has lessons on text and data mining and more

Thank you!
