Discover the world at Leiden University
Introduction to Text Mining
Ben Companjen | Code4Lib2017
1
Plan for this workshop
1. Housekeeping
• Only one device on the WiFi network please!
2. Front matter
• About Ben
• About you(r goals)
• Goals for the workshop
3. Setup
• Pair up?
• Data distribution
4. Text Mining
5. Discussion & wrap-up
• Where to go from here?
• Practical issues
2
About me
• Digital Scholarship Librarian at Leiden University Libraries
- As part of the small Centre for Digital Scholarship team, I work with researchers and library colleagues to support digital
scholarship, e.g. by explaining and helping with text and data mining, data modelling and information system design.
• @bencomp on Twitter, bencomp on IRC
• I'm not a wizard of text mining
3
About you: survey responses
• Not everyone knows Python
- hopefully we can help each other if needed
• The majority has not worked with Jupyter Notebook
- basic commands in one slide coming up
• The majority has heard or read about text mining, but has no practical experience
- some may have more experience than me
• You seem eager to get your hands dirty!
4
Goals for this workshop
• Understand definitions of text mining and related terms
• Get hands-on experience with a few text mining techniques
• Get a sense of the scope and limitations of text mining
5
Setup for today
• Work in pairs, preferably with someone who has Jupyter Notebook set up or has seen Python before
• Jupyter Notebook allows us to mix text, code and output in one file and run it on any platform
• Notebooks have been prepared with code and explanations
• Step through the notebook, then change some parameters and run again
• Discuss the process and results with the whole group before continuing with the next notebook
6
Examples of text mining applications
• Find out what people think of a product
• Categorise an article with similar articles
• Detect plagiarism in theses
• Find out who wrote the lyrics to the Dutch national anthem
• Track the use and meaning of certain words or phrases over time
7
Text mining: definitions
• Text mining and text analytics
• Natural Language Processing
• Document
• Corpus
8
Approaching text mining
• Example
- Question: What are these documents about?
- Task: Extract topics (distinguishing words) from documents
- Method: Topic modelling
- Algorithm: Latent Semantic Analysis
- Tool: Gensim
• Question -> Task -> Method -> Algorithm -> Tool
- I.e. you don't normally start by picking a tool; the question determines the task, method, algorithm and finally the tool
• Algorithm-in-Tool + Parameters + Model + Data
- I.e. it may take more than one go to get a good answer
9
Data for text mining
• Plain text, no markup
• No para-text, such as headers and footers, title pages, etc.
• Preprocessing is almost always necessary
- Remove common but uninformative words, so-called stop words ('the', 'a', 'be')
- Handle punctuation
- Convert to lowercase
- Split documents into paragraphs, sentences and/or words
• It depends on the type of mining you want to do (which depends on the question)
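The preprocessing steps listed on this slide can be sketched in a few lines of plain Python. This is only an illustration, not the code from the workshop notebooks: the stop word list is a tiny made-up sample, the sentence splitter is deliberately naive, and the word pattern assumes English text in ASCII.

```python
import re

# Tiny illustrative stop word list (an assumption); real lists are much longer.
STOP_WORDS = {"the", "a", "an", "be", "is", "are", "of", "and", "to", "in"}


def split_sentences(text):
    """Naively split a document after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def preprocess(sentence):
    """Lowercase, drop punctuation, split into words and remove stop words."""
    lowered = sentence.lower()
    # Keep runs of letters, digits and apostrophes; everything else is dropped.
    words = re.findall(r"[a-z0-9']+", lowered)
    return [w for w in words if w not in STOP_WORDS]


text = "Text mining is fun. The data must be cleaned first!"
sentences = split_sentences(text)
tokens = [preprocess(s) for s in sentences]
print(tokens)
```

Which of these steps you keep depends on the question: for authorship attribution, for example, stop words and punctuation may be exactly the signal you need.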
10
Voyant Tools: short demo
• Question: What words and phrases are frequent in the Code4Lib2017 talk and workshop abstracts?
• Task: Show most used words and phrases in C4L2017 abstracts
• Method: Word cloud
• Algorithm: Count words and count phrases
• Tool: Voyant
• Voyant Tools provides an online environment for "hermeneutics", i.e. exploration of text to inform further
investigation
- Word cloud
- Frequent terms
- Links (collocates)
- Frequencies
- Key Words In Context
- Phrases
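The "count words and count phrases" algorithm behind the demo can be approximated in plain Python. The sketch below is not how Voyant works internally; it only shows the idea, treats a "phrase" as two adjacent words (an assumption), and uses made-up stand-ins for the abstracts.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its words (ASCII letters and digits only)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    """Return all runs of n adjacent tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Made-up stand-ins for the abstracts; not the real Code4Lib2017 data.
abstracts = [
    "Linked data for libraries",
    "Text mining for libraries",
    "Linked data and text mining",
]

word_counts = Counter()
phrase_counts = Counter()
for abstract in abstracts:
    tokens = tokenize(abstract)
    word_counts.update(tokens)
    phrase_counts.update(ngrams(tokens, 2))

print(word_counts.most_common(3))
print(phrase_counts.most_common(3))
```

A word cloud is then just a visualisation of `word_counts`, with font size proportional to the count.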
11
Jupyter Notebook
• Web-based interface for editing and running Notebooks (i.e. code and text)
• Written in Python, but works with different backends ("kernels") for various programming languages
• A notebook consists of cells, each with either code or text
• Two modes: edit and command
- edit mode: write code, (Markdown) text
- command mode: cut/copy/paste cells, toggle output, etc.
• Some useful keys:
- (in command mode) Enter: go into edit mode; edit selected cell
- (in edit mode) Esc: go from edit to command mode
- Ctrl+Enter: run code/parse Markdown and keep the current cell selected – use for code-and-try cycles
- Shift+Enter: run code/parse Markdown and select next cell – use for running cells in a row
- Alt+Enter: run code/parse Markdown and create a cell after this one
- (in command mode) M: make cell a Markdown cell instead of code
12
Notebook 0: Corpora and Vector Spaces
• Adapted from Gensim tutorial
• Load a corpus (efficiently)
• Transform it into a dictionary and matrix for further use
• Key concepts:
- bag of words
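To make "bag of words" concrete, here is a minimal sketch in plain Python. It is similar in spirit to what Gensim's `Dictionary` and `doc2bow` produce, but it is not the notebook code; the function names `build_dictionary` and `doc_to_bow` and the sample documents are made up for illustration.

```python
def build_dictionary(tokenized_docs):
    """Map each distinct word to an integer id, in order of first appearance."""
    word_to_id = {}
    for doc in tokenized_docs:
        for word in doc:
            if word not in word_to_id:
                word_to_id[word] = len(word_to_id)
    return word_to_id


def doc_to_bow(doc, word_to_id):
    """Return sorted (word_id, count) pairs; words not in the dictionary are ignored."""
    counts = {}
    for word in doc:
        if word in word_to_id:
            word_id = word_to_id[word]
            counts[word_id] = counts.get(word_id, 0) + 1
    return sorted(counts.items())


# Hypothetical, already tokenized documents.
docs = [
    ["human", "computer", "interface"],
    ["computer", "system", "computer"],
]
dictionary = build_dictionary(docs)
corpus = [doc_to_bow(doc, dictionary) for doc in docs]
print(dictionary)
print(corpus)
```

Word order is thrown away: each document becomes a sparse vector of counts, and the whole corpus becomes a (sparse) document-term matrix.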
13
Notebook 1: Naive classification
• Adapted from NLTK example
• Train a classifier to classify texts into a limited number of categories
• Key concepts:
- words as features
- training, machine learning
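A rough sketch of "words as features" and "training", assuming that "naive" here refers to a Naive Bayes classifier (the slide does not say so explicitly). This is a hand-rolled illustration with add-one smoothing, not NLTK's implementation, and the training texts are made up.

```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    """Count words per category; labelled_docs is a list of (tokens, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocabulary = set()
    for tokens, label in labelled_docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return word_counts, label_counts, vocabulary


def classify(tokens, model):
    """Return the label with the highest log-probability for the given tokens."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Start from the prior: how common is this category in the training data?
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokens:
            if word not in vocabulary:
                continue  # words never seen in training carry no information
            # Add-one smoothing avoids zero probabilities for unseen pairs.
            p = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training data: tokenized texts with a category each.
training = [
    (["great", "fun", "film"], "pos"),
    (["wonderful", "great", "acting"], "pos"),
    (["boring", "bad", "film"], "neg"),
    (["bad", "awful", "plot"], "neg"),
]
model = train(training)
print(classify(["great", "acting"], model))
print(classify(["bad", "plot"], model))
```

With real data you would hold back part of the labelled texts to measure how often the classifier is right.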
14
Notebook 2: Topics and Transformations
• Also adapted from a Gensim tutorial
• Transform the raw word counts into weighted counts and train a topic model
• Key concepts:
- TF-IDF
- Topic Model
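The step from raw counts to weighted counts can be sketched as follows. This uses one common TF-IDF variant (raw term count times the natural log of N/df, without normalisation); Gensim's defaults differ in detail, so treat it as an illustration of the idea rather than a reimplementation. The documents are made up.

```python
import math
from collections import Counter


def tfidf(tokenized_docs):
    """Return one {word: weight} dict per document, weight = tf * log(N / df)."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))
    weighted = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weighted.append(
            {word: count * math.log(n_docs / doc_freq[word])
             for word, count in counts.items()}
        )
    return weighted


# Hypothetical tokenized documents.
docs = [
    ["library", "data", "data"],
    ["library", "code"],
    ["library", "text", "mining"],
]
weights = tfidf(docs)
print(weights[0])
```

A word that occurs in every document ("library" here) gets weight 0, while a word that is frequent in one document but rare elsewhere ("data") gets a high weight. A topic model is then trained on these weighted vectors instead of the raw counts.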
15
Notebook 3: Similarity queries
• Also adapted from Gensim tutorial
• Find a topic that is close to the topic of a new (unseen) document or query
• Key concepts:
- similarity in vector space
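"Similarity in vector space" is usually measured as the cosine of the angle between two vectors. The sketch below assumes sparse vectors stored as `{word: weight}` dicts and uses made-up documents; the notebook works with topic vectors from Gensim instead, but the comparison step is the same idea.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors given as {word: weight} dicts."""
    dot = sum(weight * vec_b.get(word, 0.0) for word, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is not similar to anything
    return dot / (norm_a * norm_b)


# Hypothetical document vectors and an unseen query.
corpus = {
    "doc_a": {"text": 1, "mining": 1, "python": 1},
    "doc_b": {"library": 2, "catalogue": 1},
}
query = {"text": 1, "mining": 1}

# Rank the documents from most to least similar to the query.
ranked = sorted(
    corpus,
    key=lambda name: cosine_similarity(query, corpus[name]),
    reverse=True,
)
print(ranked)
```

Because the cosine ignores vector length, a short query can still match a long document well.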
16
Notebook 4: Entity recognition
• Based on http://textminingonline.com/getting-started-with-spacy
• Find references to entities (e.g. persons, organisations, locations) in running text
• Key concepts:
- Named Entity Recognition (list-based vs. rule-based)
- Named Entity Disambiguation
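The list-based flavour of Named Entity Recognition can be illustrated without any library: look up known names from a list (a "gazetteer") in running text. This is a minimal sketch, not what spaCy does; the gazetteer entries, labels and the helper `find_entities` are made up for the example.

```python
import re

# Hypothetical gazetteer: known names and their entity types.
GAZETTEER = {
    "Leiden University": "ORG",
    "Leiden": "LOC",
    "Ben Companjen": "PERSON",
}


def find_entities(text, gazetteer):
    """Return (start, end, name, label) tuples for gazetteer names found in text.

    Longer names are matched first, so 'Leiden University' is not also
    reported as the location 'Leiden'.
    """
    found = []
    taken = set()  # character positions already covered by a match
    for name in sorted(gazetteer, key=len, reverse=True):
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            span = range(match.start(), match.end())
            if any(i in taken for i in span):
                continue
            taken.update(span)
            found.append((match.start(), match.end(), name, gazetteer[name]))
    return sorted(found)


text = "Ben Companjen works at Leiden University in Leiden."
entities = find_entities(text, GAZETTEER)
for start, end, name, label in entities:
    print(start, end, name, label)
```

Deciding which real-world thing a match refers to (is "Leiden" the city, or shorthand for the university?) is the separate disambiguation step, and a plain list lookup cannot do it.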
17
Recap of notebook tutorials
• Transform strings to vectors – if necessary
• Extract topics from documents in corpus as a model for your corpus
• Compare new documents to the model to classify them
• Extract entity references
18
Where (to|will you) go from here?
• There are many tasks, methods and tools already and more are being developed
- libraries (e.g. NLTK)
- All-in-one library and pre-trained models (e.g. textacy)
- (Cloud) API providers (e.g. Google Natural Language API, Apache Stanbol)
• Evaluate before use – of course
• What are your thoughts?
19
Further reading and tutorials
• Gensim has more tutorials
• spaCy provides links to tutorials
• https://github.com/JonathanReeve/advanced-text-analysis-workshop-2017/blob/master/advanced-text-analysis.ipynb
• Ted Underwood and his team blog about their use of text mining in research
• Methods Commons provides recipes for various kinds of (text) analyses
• Programming Historian has lessons on text and data mining and more
20
Thank you!
21

Editor's Notes

  • #7: NB If we finish the whole programme ahead of time, let's continue tinkering, or look at other texts and tools
  • #8: We will not be
  • #9: Some definitions to start off with (others will follow when needed):
      - Text mining is a subfield of text analytics, which in general aims to find structure and meaning in unstructured or semi-structured text. Text mining is perhaps more the mathematical part of getting structured data from text.
      - NLP is roughly the same as text analytics.
      - A document is a text.
      - A corpus is a collection (or "body") of texts.
  • #12:
      - Loading a corpus
      - Word cloud
      - Frequent terms
      - Links
      - Frequencies (more interesting to see frequencies over time)
      - Key Words In Context
      - Phrases