Introduction to Natural
Language Processing
Pham Quang Nhat Minh
FPT Technology Research Institute (FTRI)
minhpqn@fpt.edu.vn
IBM Watson won Jeopardy!
Lecture Outline
● What is Natural Language Processing?
● Why is NLP hard?
● Brief history of NLP
● Fundamental tasks in NLP
● Some applications of NLP
What is Natural Language
Processing?
● A field of computer science, artificial
intelligence, and computational linguistics
● To get computers to perform useful tasks
involving human languages
− Human-Machine communication
− Improving human-human communication
● E.g., machine translation
− Extracting information from texts
Why is NLP interesting?
● Language is central to many human activities
− Reading, writing, speaking, listening
● Voice can be used as a user interface in many
applications
− Remote controls, virtual assistants like Siri, ...
● NLP is used to extract insights from massive
amounts of textual data
− E.g., generating hypotheses from medical and health reports
● NLP has many applications
● NLP is hard!
Why is NLP hard?
● Highly ambiguous
● The sentence "I made her duck" has several possible
meanings (example from Jurafsky and Martin)
− I cooked waterfowl for her.
− I cooked waterfowl belonging to her.
− I created the (plaster?) duck she owns.
− I caused her to quickly lower her head or body.
− I waved my magic wand and turned her into
undifferentiated waterfowl.
Why is NLP hard?
I shot an elephant in my pajamas.
Why is NLP hard?
● Natural languages are highly ambiguous at all levels
− Lexical (word meaning)
− Syntactic
− Semantic
− Discourse
● Natural languages are fuzzy
● Natural languages involve reasoning about the world
− E.g., it is unlikely that an elephant would be wearing pajamas
Brief history of NLP
● Foundational Insights: 1940s and 1950s
− Two foundational paradigms
● Automata
● Probabilistic/information-theoretic models
● The two camps: 1957-1970
− Symbolic paradigm: the work of Chomsky and others on
formal language theory and generative syntax (1950s to
mid-1960s)
− Stochastic paradigm
● Pursued mainly in departments of statistics
Brief history of NLP
● Four paradigms: 1970-1983, explosion in
research in speech and language processing
− Stochastic paradigm
− Logic-based paradigm
− Natural language understanding
− Discourse modeling paradigm
● Empiricism and Finite State Models Redux:
1983-1993
Brief history of NLP
● The Field Comes Together: 1994-1999
− Probabilistic and data-driven models became
quite standard
● The Rise of Machine Learning: 2000-present
− Large amounts of spoken and textual data became
available
− Widespread availability of high-performance
computing systems
Fundamental Tasks in NLP
● Word Segmentation
● Part-of-speech (POS) tagging
● Syntactic Analysis
● Semantic Analysis
Word Segmentation
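● In some languages, words are not separated by spaces,
or a single word may consist of several space-separated syllables
− Japanese: 毎年うちの研究室の学生が1-2名国語研でアルバイトさせてもらっているので、今日は新しくアルバイトする B4 学生の紹介である。
(English: "Every year one or two students from our lab work part-time at the National Institute for Japanese Language and Linguistics, so today I introduce the fourth-year student who is newly starting there.")
− Vietnamese: Nhật Bản luôn là thị trường thương mại quan trọng của
Việt Nam (segmented: Nhật_Bản luôn là thị_trường thương_mại
quan_trọng của Việt_Nam)
(English: "Japan has always been an important trading market for Vietnam.")
● In such languages, word segmentation is the first
step of an NLP system.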
Word Segmentation
● A possible solution is maximum matching
− Start at the beginning of the string, then choose the
longest word in the dictionary that matches the input at the
current position; move past it and repeat.
− Nhật_Bản luôn là thị trường thương mại quan trọng của Việt
Nam
● "Nhật Bản" is a word in the dictionary, but "Nhật Bản luôn" is not
● Problems:
− Maximum matching cannot handle unknown (out-of-dictionary) words
− Dependencies between words in the same sentence are not
exploited
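The maximum matching procedure can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular tool: the function name `max_match`, the `max_len` limit, and the toy dictionary are assumptions made for this example. Syllables that match no dictionary entry are emitted as single-syllable words.

```python
def max_match(syllables, dictionary, max_len=4):
    """Greedy left-to-right maximum matching over a list of syllables."""
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first and shrink until the dictionary matches.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            # A single syllable is always accepted, so unknown words fall through here.
            if n == 1 or candidate in dictionary:
                words.append(candidate.replace(" ", "_"))
                i += n
                break
    return words


# Toy dictionary (hypothetical, just large enough for the example sentence).
TOY_DICTIONARY = {
    "Nhật Bản", "luôn", "là", "thị trường",
    "thương mại", "quan trọng", "của", "Việt Nam",
}

sentence = "Nhật Bản luôn là thị trường thương mại quan trọng của Việt Nam"
segmented = max_match(sentence.split(), TOY_DICTIONARY)
print(" ".join(segmented))
```

Because the search is greedy and purely dictionary-driven, it shows both problems listed above: an unknown multi-syllable word is split into single syllables, and no sentence context is used to choose between competing matches.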
Word Segmentation
● Most successful word segmentation tools are
based on machine-learning techniques.
● Word segmentation tools achieve high
accuracy
− vn.vitk (https://github.com/phuonglh/vn.vitk)
obtained 97% accuracy on test data
POS Tagging
● Each word in a sentence can be classified into
classes such as verbs, adjectives, nouns, etc.
● POS tagging is the process of assigning each word in a
sentence a particular part of speech, based on:
− Its definition
− Its context in the sentence
● The/DT grand/JJ jury/NN commented/VBD on/IN
a/DT number/NN of/IN other/JJ topics/NNS ./.
Sequence Labeling
● Many NLP problems can be viewed as
sequence labeling
● Each token in a sequence is assigned a label.
● Labels of tokens are dependent on the labels
of other tokens in the sequence, particularly
their neighbors.
● John/NNP saw/VBD the/DT saw/NN and/CC decided/VBD to/TO
take/VB it/PRP to/IN the/DT table/NN .
Sequence Labeling as Classification
● Classify each token independently
● Use information about the surrounding tokens as
features (sliding window).
[Figure: a classifier reads a window of "John saw the saw and decided to take it to the table." and outputs the tag NNP]
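A sliding-window feature extractor might look like the sketch below. The feature names, the window size, and the boundary markers `<s>` and `</s>` are assumptions for illustration; the resulting dictionary could be passed to any standard classifier.

```python
def window_features(tokens, i, size=2):
    """Features for tokens[i]: the token itself plus neighbors within `size` positions."""
    features = {
        "word": tokens[i].lower(),
        "is_capitalized": tokens[i][0].isupper(),
    }
    for offset in range(-size, size + 1):
        if offset == 0:
            continue
        j = i + offset
        # Positions outside the sentence are padded with boundary markers.
        if j < 0:
            neighbor = "<s>"
        elif j >= len(tokens):
            neighbor = "</s>"
        else:
            neighbor = tokens[j].lower()
        features[f"word[{offset:+d}]"] = neighbor
    return features


tokens = "John saw the saw and decided to take it to the table .".split()
print(window_features(tokens, 1))  # first "saw" (a verb in this sentence)
print(window_features(tokens, 3))  # second "saw" (a noun in this sentence)
```

The two occurrences of "saw" share the same `word` feature but differ in their neighbors, which is the information a per-token classifier needs to tag them differently.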
Probabilistic Sequence Models
• Model the probability of (token sequence,
tag sequence) pairs, estimated from an annotated data set.
• Exploit dependencies between tokens
• Typical sequence models
• Hidden Markov Models (HMMs)
• Conditional Random Fields (CRFs)
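To make the HMM case concrete, here is a minimal sketch of Viterbi decoding, the standard algorithm for finding the most probable tag sequence under a first-order HMM. All probabilities below are made up for illustration (in practice they would be estimated from an annotated corpus), and the `floor` value used for unseen events is an assumption.

```python
import math


def viterbi(tokens, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most probable tag sequence for tokens under a first-order HMM."""
    def lp(table, key):
        # Log-probability lookup; unseen events get a small floor probability.
        return math.log(table.get(key, floor))

    # best[i][t] = (log-prob of the best path ending in tag t at position i, previous tag)
    best = [{t: (lp(start_p, t) + lp(emit_p, (t, tokens[0])), None) for t in tags}]
    for i in range(1, len(tokens)):
        column = {}
        for t in tags:
            score, prev = max(
                (best[i - 1][p][0] + lp(trans_p, (p, t)), p) for p in tags
            )
            column[t] = (score + lp(emit_p, (t, tokens[i])), prev)
        best.append(column)

    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(tokens) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]


# Hand-set toy parameters (hypothetical, not estimated from data).
TAGS = ["NNP", "VBD", "DT", "NN"]
START = {"NNP": 0.5, "DT": 0.4, "NN": 0.1}
TRANS = {("NNP", "VBD"): 0.8, ("VBD", "DT"): 0.7, ("DT", "NN"): 0.9,
         ("NN", "VBD"): 0.5, ("VBD", "NN"): 0.1}
EMIT = {("NNP", "John"): 1.0, ("VBD", "saw"): 0.6,
        ("NN", "saw"): 0.2, ("DT", "the"): 1.0}

print(viterbi(["John", "saw", "the", "saw"], TAGS, START, TRANS, EMIT))
```

With these parameters the second "saw" is tagged NN even though VBD has the higher emission probability, because the transition from DT to NN is far more likely than from DT to VBD. This is the tag dependency that independent per-token classification does not model.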
Syntax Analysis
● The task of recognizing a sentence and assigning a
syntactic structure to it.
My dog also likes eating sausage.
(ROOT
  (S
    (NP (PRP$ My) (NN dog))
    (ADVP (RB also))
    (VP (VBZ likes)
      (S
        (VP (VBG eating)
          (NP (NN sausage)))))
    (. .)))
Syntax analysis
● An important task in NLP with many
applications
− Intermediate stage of representation for semantic
analysis
− Plays an important role in applications like question
answering and information extraction
− E.g., What books were written by British women
authors before 1800?
Syntax analysis
● A challenging task in NLP
− Ambiguity problem: one sentence may have many
possible parse trees
● Vietnamese language processing (VNLP) still
lacks accurate syntactic parsers (to the author's
knowledge)
− Accuracy is about 78-84%
Approaches to Syntax analysis
● Top-down parsing
● Bottom-up parsing
● Dynamic programming methods
− CYK algorithm
− Earley algorithm
− Chart parsing
● Probabilistic Context-Free Grammars (PCFGs)
− Assign probabilities to derivations
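As an illustration of the dynamic programming approach, the sketch below uses the CYK algorithm to count parse trees for the earlier example "I shot an elephant in my pajamas". The tiny grammar in Chomsky normal form is an assumption written for this example only.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form (hypothetical): (parent, left child, right child).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # PP attaches to the verb phrase
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # PP attaches to the noun phrase
    ("PP", "P", "NP"),
]
LEXICON = {
    "I": ["NP"], "shot": ["V"], "an": ["Det"], "my": ["Det"],
    "elephant": ["N"], "pajamas": ["N"], "in": ["P"],
}


def cyk_count(tokens, start="S"):
    """Count the distinct parse trees of tokens rooted in `start`."""
    n = len(tokens)
    # chart[(i, j)][A] = number of parse trees deriving tokens[i:j] from A
    chart = defaultdict(lambda: defaultdict(int))
    for i, token in enumerate(tokens):
        for category in LEXICON.get(token, []):
            chart[(i, i + 1)][category] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point between the two children
                for parent, left, right in BINARY_RULES:
                    chart[(i, j)][parent] += chart[(i, k)][left] * chart[(k, j)][right]
    return chart[(0, n)][start]


print(cyk_count("I shot an elephant in my pajamas".split()))
print(cyk_count("I shot an elephant".split()))
```

The full sentence gets two parses under this grammar, one where "in my pajamas" modifies the shooting and one where it modifies the elephant, while the shorter sentence gets one. A PCFG would attach a probability to each rule so that the parser can rank such alternatives.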
Semantic Analysis
● Two levels
● Lexical semantics
− Representing the meaning of words
− Word sense disambiguation (e.g., the word "bank")
● Compositional semantics
− How words combine to form larger meanings
Meaning representations
• First-order predicate calculus
• E.g., Maharani serves vegetarian food.
=> Serves(Maharani, VegetarianFood)
• E.g., I only have five dollars and I don’t have a lot of
time.
=> Have(Speaker, FiveDollars) ∧ ¬Have(Speaker,
LotOfTime)
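One way to see what such representations buy us is to evaluate them against a small set of facts. The sketch below is only an illustration: the fact store and the helper `holds` are hypothetical, and treating unlisted facts as false (a closed-world assumption) is a simplification that is not part of first-order logic itself.

```python
# Toy fact store (hypothetical): a fact is treated as true only if it is listed here.
FACTS = {
    ("Serves", "Maharani", "VegetarianFood"),
    ("Have", "Speaker", "FiveDollars"),
}


def holds(predicate, *args):
    """Return True if the ground atom predicate(args) is among the known facts."""
    return (predicate, *args) in FACTS


# Serves(Maharani, VegetarianFood)
sentence_1 = holds("Serves", "Maharani", "VegetarianFood")

# Have(Speaker, FiveDollars) ∧ ¬Have(Speaker, LotOfTime)
sentence_2 = holds("Have", "Speaker", "FiveDollars") and not holds("Have", "Speaker", "LotOfTime")

print(sentence_1, sentence_2)
```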
Syntax-driven semantic analysis
Some applications
● Information Retrieval
● Information Extraction
● Question Answering
● Text Summarization
● Machine Translation
Information Retrieval
● Query: “list of good sushi restaurants in kyoto?”
[Figure: Architecture of an ad hoc IR system. Query → Query processing → Search (vector space model or probabilistic) → Ranked documents; Document collection → Indexing → Search]
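The vector space model mentioned in the figure can be sketched as follows: documents and the query become TF-IDF vectors, and documents are ranked by cosine similarity to the query. The three documents, the whitespace tokenizer, and the smoothed IDF formula (log(N/df) + 1) are assumptions chosen to keep the example small.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def build_index(docs):
    """Return IDF weights and one sparse TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {term: math.log(len(docs) / count) + 1.0 for term, count in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]
    return idf, vectors


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def search(query, docs):
    """Return the documents with non-zero similarity to the query, best first."""
    idf, vectors = build_index(docs)
    # Query terms that never occur in the collection are ignored.
    q = {t: tf * idf[t] for t, tf in Counter(tokenize(query)).items() if t in idf}
    scored = [(cosine(q, vec), i) for i, vec in enumerate(vectors)]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [docs[i] for score, i in ranked if score > 0.0]


# Made-up document collection for illustration.
DOCS = [
    "good sushi restaurants in kyoto",
    "ramen shops in tokyo",
    "history of kyoto temples",
]
results = search("sushi restaurants kyoto", DOCS)
print(results)
```

Here indexing is repeated on every call for brevity; a real system would build the index once, as the architecture figure suggests.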
Information Extraction
● Extract from unstructured text the information
that is pre-specified in templates
− Fill a number of slots/attributes
● Example: use the template [PERSON, go,
LOCATION, TIME] to extract information about
where an individual goes.
− “President Obama went to Hanoi yesterday.”
− [PERSON = “President Obama”, go, LOCATION =
“Hanoi”, TIME = “yesterday”]
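For this one template, slot filling can be approximated with a hand-written pattern. The regular expression below is a deliberately naive sketch (capitalized words stand in for names, and only three time expressions are listed); real IE systems typically rely on named-entity recognition and parsing rather than a single pattern.

```python
import re

# Naive pattern for the [PERSON, go, LOCATION, TIME] template (an assumption for this example).
GO_PATTERN = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)*) went to "
    r"(?P<location>[A-Z][a-z]+) "
    r"(?P<time>yesterday|today|tomorrow)"
)


def fill_go_template(sentence):
    """Return the filled template as a dict, or None if the pattern does not match."""
    match = GO_PATTERN.search(sentence)
    if match is None:
        return None
    return {
        "PERSON": match.group("person"),
        "LOCATION": match.group("location"),
        "TIME": match.group("time"),
    }


filled = fill_go_template("President Obama went to Hanoi yesterday.")
print(filled)
```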
Question Answering
● A system that automatically returns answers to a
user’s question by retrieving information from a
collection of documents.
● Differences from an information retrieval system:
− A QA system aims to return an exact answer instead of
documents related to the user’s question.
● Q: Who invented the Internet? A: Robert E. Kahn and Vint
Cerf.
− A QA system requires more complicated semantic
analysis.
Question Answering
● Factoid question answering:
− Who/What/Where/When
− Answers are often phrases.
● Non-factoid question answering:
− Definition questions
− How/Why
− Answers may span multiple sentences (a paragraph)
Figure credit: Dr. Ngo Xuan Bach (http://tinyurl.com/jk2dv33)
Text Summarization
● Text summarization is the process of distilling the
most important information from a text to produce
an abridged version for a particular task or user.
● Useful in the era of information explosion
● Categories of text summarization:
− Single-document/Multi-document summarization
− Extractive/Abstractive summarization
− Query-focused text summarization
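A very simple extractive, single-document summarizer can be sketched by scoring each sentence by the average corpus frequency of its content words and keeping the top-scoring sentences in their original order. This frequency heuristic, the stopword list, and the sample text are assumptions for illustration; it is one basic baseline and not the method behind any particular product.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, far from complete).
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "it", "was", "for", "on"}


def content_words(sentence):
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]


def summarize(text, k=1):
    """Return the k highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        # Average frequency, so long sentences are not favored just for their length.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    top = sorted(range(len(sentences)), key=lambda i: -score(sentences[i]))[:k]
    return [sentences[i] for i in sorted(top)]


text = "The bus was tested in August. The bus project was funded by loans. Weather was nice."
summary = summarize(text, k=1)
print(summary)
```

An abstractive summarizer would instead generate new sentences, which requires a language generation model rather than sentence selection.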
Example of text summarisation
• https://www.bloomberg.com/view/articles/2016-08-23/china-s-super-bus-exposes-dark-side-of-p2p-lending
• It looked like the future: a wide, elevated Chinese bus that would speed
atop tracks straddling the road while multiple lanes of traffic flowed below.
And the future looked surprisingly near. In early August, a prototype of the
Transit Elevated Bus -- or TEB -- was tested in northern China.
• Demand for such loans has exploded in recent years, growing in volume
from $4.3 billion in 2013 to $71 billion in 2015. The appeal is twofold.
First, China's big state-owned banks have traditionally focused their
attention on other companies in the state sector, at the expense of
consumers and small businesses.
• Meanwhile, cash-rich Chinese are anxious to find yields higher than the
anemic rates paid by China's state banks, which typically fall below 3
percent. China's dodgy stock markets aren't a terribly appealing alternative,
while the attractiveness of Chinese real estate varies by region.
Output of Skype’s summarization chatbot
Machine Translation
● The use of computers to automate some or all of
the process of translating from one language to
another.
● Fully automatic machine translation is one of
the most challenging and active topics in NLP.
● Recent advances in deep learning have driven the
rise of neural machine translation.
Example (Google translation)
It looked like the future: a wide, elevated Chinese
bus that would speed atop tracks straddling the
road while multiple lanes of traffic flowed below.
And the future looked surprisingly near.
Nó trông giống như tương lai: một rộng, xe buýt cao
Trung Quốc sẽ tăng tốc trên đường ray trải dài
đường trong khi nhiều tuyến đường giao thông chảy
bên dưới. Và tương lai có vẻ ngạc nhiên gần.
Approaches in Machine Translation
• Rule-based methods
− Transfer-based MT
− Interlingual MT
− Dictionary-based MT
• Statistical MT
• Example-based MT
• Hybrid MT
Bernard Vauquois' pyramid showing comparative depths of
intermediary representation, interlingual machine translation at the
peak, followed by transfer-based, then direct translation.
How to learn NLP?
• Build background knowledge in:
− Probability and statistics
− Basic math (linear algebra, calculus)
− Machine learning
− Programming
• Read a textbook or take an online NLP course:
− Speech and Language Processing, by Daniel Jurafsky and
James H. Martin
− YouTube playlist (Dan Jurafsky & Chris Manning: Natural
Language Processing): http://tinyurl.com/lb57fxf
How to learn NLP?
• Practice with programming exercises:
− 100 NLP drill exercises: https://github.com/minhpqn/nlp_100_drill_exercises
− NLP Programming Tutorial, by Graham Neubig: http://www.phontron.com/teaching.php
• Compete in Kaggle data science challenges
(kaggle.com)
Try some NLP applications
• Try the Stanford CoreNLP and Stanford Parser
demos
− http://nlp.stanford.edu:8080/corenlp
− http://nlp.stanford.edu:8080/parser
• Solve SAT-style math questions
− http://euclid.allenai.org
References
1. Jurafsky, Daniel and Martin, James H. Speech and Language Processing.
2. An Introduction to Natural Language Processing series (http://tinyurl.com/hdg58wx)
− Section 1 (http://tinyurl.com/ztkwb2b)
− Section 2: Some Brief History (http://tinyurl.com/j48or27)
− Section 3: Fundamental Tasks in NLP (http://tinyurl.com/zk7dgzv)
− Section 4: Some Applications (http://tinyurl.com/jk2dv33)
