SAS Global 2021 Introduction to Natural Language Processing
Natural Language Processing—An Introduction
Colleen M. Farrelly, Staticlysm
Brief bio –
Colleen M. Farrelly is a machine learning scientist whose expertise includes
supervised learning, unsupervised learning, psychometrics, topological data
analysis, and natural language processing. She has an analytics book in review
that touches upon the analysis of text data with topological data analysis tools.
Introduction
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.
Text Data and Applications
• What do all of these have in
common?
• Clinical case notes
• Chatbot conversations
• Client email interactions
• Court case
summaries/transcripts
• Published research articles
• Tweets
• Voice recordings
Text Data and Applications
• Commonalities
• Text data
• Contain potentially
informative features for
predicting an outcome or
categorizing data
• May contain information
not available in structured
datasets
• Linguistic insight on the
speaker/writer
Example
Legal
• Imagine both the witness and the robber in these two examples.
• How might these observations impact the outcome of a police investigation?
• Statement 1:
• She pulled the gun, took the money, and ran.
• Statement 2:
• The petite blonde pulled a shotgun on the clerk at station 2, filled a bag with cash from the
register, and absconded with the money and a handful of pens.
• How many suspects might the police have to stop to find Bonnie and Clyde?
Which witness statement might have more impact on a jury?
• How might differences in clinical case notes by clinicians inform health outcome
models? How might they reflect on the individual clinician?
Making Sense of Text Data
• Natural language
processing (NLP)
• Collection of tools to parse
human language into
something understandable by
algorithms
• What is said
• Computational linguistics
• Deriving insight about human
behavior or traits based on
text data
• How it’s said
Common NLP Tools
An Overview
Parsing Documents/Sentences
An Example
• Tokens (words or punctuation)
• Punctuation (non-word tokens)
• Stop words (less important words)
• Root words (stemming/lemmatizing)
Bonnie hopped into Clyde’s new car.
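The parsing steps above can be sketched in plain Python. This is a minimal illustration, not how NLTK or spaCy implement them: the stop-word list and the `naive_stem` helper are hypothetical stand-ins, far cruder than the stemmers and lemmatizers those libraries provide.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real toolkits ship much larger ones.
STOP_WORDS = {"a", "an", "the", "into", "on", "and", "of", "to", "in"}


def tokenize(text):
    """Split text into word tokens (keeping possessives) and punctuation tokens."""
    text = text.replace("\u2019", "'")  # normalize curly apostrophes
    return re.findall(r"[A-Za-z]+(?:'[a-z]+)?|[^\w\s]", text)


def naive_stem(token):
    """Toy stemmer (an assumption for illustration): strips 's, -ing, and -ed."""
    word = token.lower()
    if word.endswith("'s"):
        word = word[:-2]
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "hopp" -> "hop".
            if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
                word = word[:-1]
            break
    return word


sentence = "Bonnie hopped into Clyde's new car."
tokens = tokenize(sentence)
words = [t for t in tokens if t[0].isalpha()]            # word tokens
punctuation = [t for t in tokens if not t[0].isalpha()]  # non-word tokens
content = [w for w in words if w.lower() not in STOP_WORDS]
roots = [naive_stem(w) for w in content]

print(tokens)
print(punctuation)
print(roots)
```

In practice a library tokenizer and lemmatizer would replace both helpers; the point here is only the order of the steps: tokenize, separate punctuation, drop stop words, reduce to roots.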
Tagging Features
• Parts of speech
• Clauses
• Grammatical relations
• Entity recognition
Bonnie hopped into Clyde’s new car.
Deriving Sentiment
• Language-dependent
• Sentiment dictionaries
• Positive/negative/neutral
(AFINN, for instance)
• Emotion groups from
psychological models
Bonnie hopped into Clyde’s new car.
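A dictionary-based sentiment score can be sketched as a lookup and a sum. The lexicon below is a hypothetical toy in the style of AFINN (integer scores per word); its values are made up for illustration and are not taken from the real AFINN list.

```python
import re

# Hypothetical toy lexicon: illustrative scores only, NOT the real AFINN values.
TOY_LEXICON = {
    "love": 3, "great": 3, "happy": 3,
    "slow": -1, "broken": -2, "terrible": -3,
}


def sentiment_score(text, lexicon=TOY_LEXICON):
    """Sum the lexicon scores of every word in the text; unknown words score 0."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(word, 0) for word in words)


def sentiment_label(score):
    """Map a numeric score to positive/negative/neutral."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


examples = [
    "Bonnie hopped into Clyde's new car.",
    "I love the great new car.",
    "The app is slow and the login is broken.",
]
scores = [sentiment_score(text) for text in examples]
labels = [sentiment_label(score) for score in scores]
print(list(zip(scores, labels)))
```

The example sentence from the slide scores as neutral here because none of its words appear in the toy lexicon, which also shows the main limitation of dictionary methods: coverage depends entirely on the word list, and the list is language-specific.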
Vectorizing/Summarizing Results
• Many options for turning
NLP results into usable
data in machine learning
and statistical tools:
• Vectorization
• Word frequency matrices
• Summary tables
Bonnie hopped into Clyde’s new car.
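A word frequency matrix, one of the options listed above, can be sketched with the standard library. The first document is Statement 1 from the legal example; the second is a hypothetical shorter statement added only so the matrix has two rows.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic runs as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequency_matrix(documents):
    """Return (vocabulary, matrix): one row per document, one column per term."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[count[term] for term in vocabulary] for count in counts]
    return vocabulary, matrix


documents = [
    "She pulled the gun, took the money, and ran.",  # Statement 1 from the example
    "She took the money.",                           # hypothetical second statement
]
vocabulary, matrix = term_frequency_matrix(documents)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is a fixed-length numeric vector, which is the form most machine learning and statistical tools expect as input.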
Using Statistical Tools to Understand NLP
An Overview
Summary Statistics
• Common summary
statistic uses
1. Conversation length
(example: engagement
metric)
2. Swear count (example:
escalation marker)
3. Conversation sentiment
over time (example:
engagement and
satisfaction)
4. Keyword frequency
(example: products with
most issues)
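Several of these summary statistics reduce to counting. The sketch below assumes a hypothetical support-chat transcript and hypothetical word lists (`ESCALATION_WORDS`, `PRODUCT_KEYWORDS`); none of these come from the talk.

```python
import re
from collections import Counter

# Hypothetical transcript: (speaker, message) pairs.
conversation = [
    ("client", "My router keeps dropping the connection."),
    ("agent", "Sorry to hear that. Have you restarted the router?"),
    ("client", "Yes, twice. This damn router is useless."),
]

# Hypothetical word lists chosen for illustration.
ESCALATION_WORDS = {"damn", "hell"}
PRODUCT_KEYWORDS = {"router", "modem", "app"}


def words(text):
    """Lowercase word tokens of a message."""
    return re.findall(r"[a-z']+", text.lower())


def summarize(conversation):
    """Compute length, swear count, and keyword frequency for one conversation."""
    all_words = [w for _, message in conversation for w in words(message)]
    return {
        "turns": len(conversation),
        "word_count": len(all_words),
        "swear_count": sum(w in ESCALATION_WORDS for w in all_words),
        "keyword_counts": Counter(w for w in all_words if w in PRODUCT_KEYWORDS),
    }


summary = summarize(conversation)
print(summary)
```

Sentiment over time could be added in the same way by scoring each message separately and keeping the scores in turn order.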
Use as Machine Learning Features
• Examples combining
NLP data with data
from structured
databases
1. Clustering (example:
types of churn from
client feedback and
account data)
2. Predictive modeling
(example: patient
outcomes from case
notes and medical
records)
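Combining text-derived features with structured data usually means building one feature row per record. The following is a minimal sketch under stated assumptions: the account fields, the feedback text, and the `COMPLAINT_WORDS` list are all hypothetical, and the resulting rows would still need to be passed to a clustering or predictive model of your choice.

```python
import re

# Hypothetical structured account data keyed by client ID.
accounts = {
    "c1": {"tenure_months": 24, "monthly_spend": 80.0},
    "c2": {"tenure_months": 3, "monthly_spend": 20.0},
}

# Hypothetical free-text client feedback keyed by the same IDs.
feedback = {
    "c1": "Great service, but the price keeps going up.",
    "c2": "Cancel my account. The app is broken and support never answers.",
}

# Hypothetical complaint vocabulary.
COMPLAINT_WORDS = {"cancel", "broken", "never"}


def text_features(text):
    """Derive simple numeric features from one piece of text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {
        "n_words": len(tokens),
        "n_complaint_words": sum(t in COMPLAINT_WORDS for t in tokens),
    }


def build_rows(accounts, feedback):
    """Merge structured fields and text features into one row per client."""
    rows = {}
    for client_id, account in accounts.items():
        row = dict(account)
        row.update(text_features(feedback.get(client_id, "")))
        rows[client_id] = row
    return rows


rows = build_rows(accounts, feedback)
for client_id, row in rows.items():
    print(client_id, row)
```

Clients with no feedback get zero-valued text features here; whether that is appropriate, or whether missing text should be flagged separately, depends on the model.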
Psychometric Applications
• Some published papers:
1. Personality trait
identification in industrial
psychology research
2. Author identification in
plagiarism software
3. Quantification of release
risk in justice systems
4. Quantification of relapse
risk in mental health
applications
Other Uses of NLP
Other Common NLP Applications
• Chatbots
• Personal assistants
• Translation services
• Sentence completion
In General
Useful References/Software
Main NLP Software Options
• NLTK (Python)
• spaCy (Python)
• Stanford CoreNLP (Java)
• John Snow Labs/Spark NLP (Spark)
Some NLP Literature
• Dunnmon, J. A., Ratner, A. J., Saab, K., Khandwala, N., Markert, M., Sagreiya, H., ...
& Ré, C. (2020). Cross-modal data programming enables rapid medical machine
learning. Patterns, 100019.
• Maas, A., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011, June).
Learning word vectors for sentiment analysis. In Proceedings of the 49th annual
meeting of the association for computational linguistics: Human language
technologies (pp. 142-150).
• Pennebaker, J. W. (2011). The secret life of pronouns. New Scientist, 211(2828),
42-45.
• Polsley, S., Jhunjhunwala, P., & Huang, R. (2016, December). CaseSummarizer: a
system for automated summarization of legal texts. In Proceedings of COLING
2016, the 26th international conference on Computational Linguistics: System
Demonstrations (pp. 258-262).
• Velupillai, S., Suominen, H., Liakata, M., Roberts, A., Shah, A. D., Morley, K., ... &
Chapman, W. (2018). Using clinical Natural Language Processing for health
outcomes research: Overview and actionable suggestions for future advances.
Journal of biomedical informatics, 88, 11-19.
Thank you!
Contact Information
cfarrelly@med.miami.edu