RUDOLF EREMYAN
MACHINE LEARNING SOFTWARE ENGINEER
INTRODUCTION TO NATURAL LANGUAGE
PROCESSING
CONTACTS: EREMYAN.RUDOLF@GMAIL.COM HTTPS://WWW.LINKEDIN.COM/IN/RUDOLFEREMYAN/
CHATBOT FRAMEWORK FOR GEORGIAN
LANGUAGE
TI BOT FOR TBC
BANK
• 35K LIKES
• 100K CONVERSATIONS
• 8K ACTIVE USERS PER MONTH
• 41.5K USERS ASKED ABOUT
WEATHER
• 1K P2P TRANSACTIONS IN
AUGUST
SENTIMENT ANALYSIS ON FACEBOOK
COMMENTS
NATURAL LANGUAGE PROCESSING
https://en.wikipedia.org/wiki/Natural_language_processing
NATURAL LANGUAGE PROCESSING (NLP) IS A FIELD
OF COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE AND
COMPUTATIONAL LINGUISTICS CONCERNED WITH THE
INTERACTIONS BETWEEN COMPUTERS AND HUMAN
(NATURAL) LANGUAGES, AND, IN PARTICULAR,
CONCERNED WITH PROGRAMMING COMPUTERS TO
FRUITFULLY PROCESS LARGE NATURAL LANGUAGE
CORPORA.
THE HISTORY OF NLP
https://en.wikipedia.org/wiki/Natural_language_processing
1950 - ALAN TURING PUBLISHED
AN ARTICLE TITLED "COMPUTING
MACHINERY AND
INTELLIGENCE" WHICH
PROPOSED WHAT IS NOW
CALLED THE TURING TEST AS A
CRITERION OF INTELLIGENCE.
THE HISTORY OF NLP
https://en.wikipedia.org/wiki/Natural_language_processing
1954 - THE GEORGETOWN
EXPERIMENT INVOLVED FULLY
AUTOMATIC TRANSLATION OF
MORE THAN SIXTY RUSSIAN
SENTENCES INTO ENGLISH. THE
AUTHORS CLAIMED THAT WITHIN
THREE OR FIVE YEARS, MACHINE
TRANSLATION WOULD BE A SOLVED
PROBLEM.
THE HISTORY OF NLP
https://en.wikipedia.org/wiki/Natural_language_processing
1970 - MANY PROGRAMMERS BEGAN TO WRITE "CONCEPTUAL ONTOLOGIES", WHICH STRUCTURED REAL-
WORLD INFORMATION INTO COMPUTER-UNDERSTANDABLE DATA. EXAMPLES ARE QUALM (LEHNERT, 1977),
POLITICS (CARBONELL, 1979), AND PLOT UNITS (LEHNERT 1981). DURING THIS TIME, MANY CHATTERBOTS
WERE WRITTEN, INCLUDING PARRY AND RACTER.
• WORDNET
• EUROWORDNET
• SENTIWORDNET
THE HISTORY OF NLP
https://en.wikipedia.org/wiki/Natural_language_processing
1980 - THERE WAS A REVOLUTION IN NLP WITH
THE INTRODUCTION OF MACHINE LEARNING
ALGORITHMS FOR LANGUAGE PROCESSING. PART-
OF-SPEECH TAGGING INTRODUCED THE USE OF
HIDDEN MARKOV MODELS TO NLP, AND
INCREASINGLY, RESEARCH HAS FOCUSED ON
STATISTICAL MODELS, WHICH MAKE SOFT,
PROBABILISTIC DECISIONS BASED ON ATTACHING
REAL-VALUED WEIGHTS TO THE FEATURES MAKING
UP THE INPUT DATA.
THE HISTORY OF NLP
https://en.wikipedia.org/wiki/Natural_language_processing
IN RECENT YEARS, THERE HAS BEEN A FLURRY OF RESULTS SHOWING DEEP
LEARNING TECHNIQUES ACHIEVING STATE-OF-THE-ART RESULTS IN MANY
NATURAL LANGUAGE TASKS, FOR EXAMPLE IN LANGUAGE MODELING,
PARSING AND MANY OTHERS.
HAVE YOU EVER USED ANY NLP PRODUCTS?
NLP APPLICATIONS
TEXT CLASSIFICATION
TEXT CLUSTERING
TEXT SUMMARISATION
MACHINE TRANSLATION
SEMANTIC SEARCH
SENTIMENT ANALYSIS
QUESTION ANSWERING
INFORMATION EXTRACTION
NLP. TEXT CLASSIFICATION
Document classification or
document categorization is a
problem in library science,
information science and computer
science. The task is to assign a
document to one or more classes or
categories. This may be done
"manually" or algorithmically.
Popular algorithms:
1. Multinomial Naive Bayes
2. SVM
3. Neural Networks
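A minimal sketch of the Multinomial Naive Bayes approach from the list above, assuming scikit-learn is installed; the documents and labels are invented for illustration.

```python
# Toy text classification: bag-of-words counts + Multinomial Naive Bayes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["cheap loans apply now", "meeting rescheduled to monday",
        "win a free prize today", "project report attached"]
labels = ["spam", "ham", "spam", "ham"]                 # made-up labels

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)                      # documents -> term-count matrix
clf = MultinomialNB().fit(X, labels)

print(clf.predict(vectorizer.transform(["free loans today"])))  # likely ['spam']
```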
NLP. TEXT CLUSTERING
Document clustering (or text
clustering) is the application of
cluster analysis to textual
documents. It has applications in
automatic document organization,
topic extraction and fast information
retrieval or filtering.
Popular algorithms:
1. k-Means
2. DBSCAN
3. Deep Learning
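A quick sketch of k-Means clustering over TF-IDF vectors with scikit-learn; the four sample documents are made up, and with real data the number of clusters has to be chosen or tuned.

```python
# Cluster documents: TF-IDF features + k-Means.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["the cat sat on the mat", "dogs and cats are pets",
        "stock markets fell sharply", "investors sold their shares"]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)      # cluster id per document, e.g. [0 0 1 1]
```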
NLP. TEXT SUMMARISATION
Automatic summarization is the
process of shortening a text
document with software, in order
to create a summary with the
major points of the original
document. Technologies that
can make a coherent summary
take into account variables such
as length, writing style and
syntax.
Popular algorithms:
1. LDA
2. Deep Learning
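As a concrete illustration of extractive summarization (a simpler approach than the LDA and deep learning methods named above), sentences can be ranked by the frequency of the words they contain; a rough sketch with NLTK, assuming the 'punkt' tokenizer data is downloaded:

```python
# Naive extractive summarization: keep the sentences with the highest word-frequency score.
from collections import Counter
from nltk.tokenize import sent_tokenize, word_tokenize   # needs nltk.download('punkt')

def summarize(text, n_sentences=2):
    sentences = sent_tokenize(text)
    words = [w.lower() for w in word_tokenize(text) if w.isalpha()]
    freq = Counter(words)
    # Score each sentence by the summed corpus frequency of its words.
    scores = {s: sum(freq[w.lower()] for w in word_tokenize(s) if w.isalpha())
              for s in sentences}
    top = sorted(sentences, key=scores.get, reverse=True)[:n_sentences]
    return " ".join(sorted(top, key=sentences.index))     # keep original sentence order
```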
NLP. MACHINE TRANSLATION
MT performs simple substitution of words in
one language for words in another, but that
alone usually cannot produce a good
translation of a text because recognition of
whole phrases and their closest counterparts
in the target language is needed. Solving this
problem with corpus-based statistical and neural
techniques is a rapidly growing field that is
leading to better translations, handling
differences in linguistic typology, translation of
idioms, and the isolation of anomalies.
Algorithms:
1. Rule based
2. Statistical methods
3. Encoder-Decoder
NLP. SEMANTIC SEARCH
Semantic search seeks to
improve search accuracy by
understanding searcher intent
and the contextual meaning of
terms as they appear in the
searchable dataspace, whether
on the Web or within a closed
system, to generate more
relevant results.
Approaches:
1. Entity Recognition
2. User context
NLP. SENTIMENT ANALYSIS
Sentiment Analysis is the
process of determining whether a
piece of writing is positive,
negative or neutral. It's also
known as opinion mining,
deriving the opinion or attitude of
a speaker.
Algorithms:
1. Lexicon-based
2. Machine Learning (SVM)
3. Deep Learning (RNN, LSTM)
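A lexicon-based example using the VADER analyzer that ships with NLTK (the 'vader_lexicon' data must be downloaded first); the sample sentence is arbitrary.

```python
# Lexicon-based sentiment with VADER (run nltk.download('vader_lexicon') first).
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("I love this product, the camera is amazing!"))
# {'neg': 0.0, 'neu': ..., 'pos': ..., 'compound': ...}; compound > 0 means positive overall
```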
NLP. QUESTION ANSWERING
Question answering (QA) is a
computer science discipline within
the fields of information retrieval and
natural language processing (NLP),
which is concerned with building
systems that automatically answer
questions posed by humans in a
natural language.
Algorithms:
1. Rule based
2. Machine Learning
3. Deep Learning
NLP. INFORMATION EXTRACTION
Information extraction is the task of automatically
extracting structured information from unstructured
and/or semi-structured machine-readable documents.
NLP TOOLS
1. MORPHOLOGICAL ANALYZER
2. POS TAGGER
3. STEMMER
4. PARSERS
5. NAMED ENTITY RECOGNIZER
NLP. STEMMER
Stemmers remove morphological affixes from words, leaving only the word stem.
bananas -> banana
flies -> fli
cats -> cat
dogs -> dog
How about “flies” -> fly?
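The examples above can be reproduced with NLTK's Porter stemmer:

```python
# Stemming with NLTK's Porter stemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["bananas", "flies", "cats", "dogs"]:
    print(word, "->", stemmer.stem(word))
# bananas -> banana, flies -> fli, cats -> cat, dogs -> dog
```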
NLP. MORPHOLOGICAL ANALYZER
Lemmatization usually refers to doing things properly
with the use of a vocabulary and morphological
analysis of words, normally aiming to remove
inflectional endings only and to return the base or
dictionary form of a word, which is known as the
lemma.
flies -> fly
went -> go
am, are, is -> be
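The same idea with NLTK's WordNet lemmatizer (the 'wordnet' data must be downloaded); note that irregular forms such as "went" are only resolved when the part of speech is supplied.

```python
# Lemmatization with NLTK's WordNet lemmatizer (run nltk.download('wordnet') first).
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("flies"))           # fly  (default POS is noun)
print(lemmatizer.lemmatize("went", pos="v"))   # go
print(lemmatizer.lemmatize("are", pos="v"))    # be
```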
NLP. POS TAGGER
A Part-Of-Speech Tagger (POS Tagger) is a piece of
software that reads text in some language and
assigns parts of speech to each word (and other
token), such as noun, verb, adjective, etc., although
generally computational applications use more fine-
grained POS tags like 'noun-plural'.
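A small example with NLTK's default tagger (needs the 'punkt' and 'averaged_perceptron_tagger' data):

```python
# Part-of-speech tagging with NLTK.
import nltk

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))
# [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ'), ...]
```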
NLP. PARSER
A natural language parser is a program that works out the grammatical structure of
sentences, for instance, which groups of words go together (as "phrases") and which
words are the subject or object of a verb.
(Figures: a dependency tree and a constituency tree.)
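A dependency-parse sketch with spaCy, assuming the small English model has been installed (python -m spacy download en_core_web_sm); the sentence is arbitrary.

```python
# Dependency parsing with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The dog chased the ball")
for token in doc:
    print(token.text, token.dep_, "<-", token.head.text)
# e.g. dog nsubj <- chased, ball dobj <- chased
```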
NLP. NAMED ENTITY RECOGNIZER
Named-entity recognition (NER) (also known as entity identification, entity chunking and
entity extraction) is a subtask of information extraction that seeks to locate and classify
named entities in text into pre-defined categories such as the names of persons,
organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
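Using the same spaCy model, named entities can be read off a parsed document; the example sentence and the exact labels produced are illustrative.

```python
# Named-entity recognition with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Rudolf works at TBC Bank in Tbilisi and sent $100 in August 2017.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. Rudolf PERSON, TBC Bank ORG, Tbilisi GPE, $100 MONEY, August 2017 DATE
```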
PROJECT. THEORETICAL PART
THERE IS A DATASET OF LABELED TEXTS;
OUR TASK IS TO CREATE A MACHINE LEARNING
PIPELINE FOR TEXT CLASSIFICATION,
TRAINED ON THE GIVEN DATA.
PROJECT. ML PIPELINE
text preprocessing
feature extraction
training classifier
evaluation
PROJECT. TEXT PREPROCESSING
• Removing non-text (e.g., ads, javascript)
• Dealing with text encoding (e.g., Unicode)
• Normalization
– extra-terrestrial/extraterrestrial, extra terrestrial
• Stemming
– computer/computation
• Morphological analysis
– car/cars
• Capitalization
– Now/NOW, led/LED
• Named entity extraction
– USA/usa
• Tokenization
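A minimal preprocessing function covering a few of the steps listed above (removing non-text, tokenization, lowercasing, stop-word removal, stemming), assuming NLTK with the 'punkt' and 'stopwords' data:

```python
# Minimal text preprocessing pipeline with NLTK.
import re
from nltk.corpus import stopwords            # needs nltk.download('stopwords')
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize      # needs nltk.download('punkt')

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)                  # drop HTML tags / non-text
    tokens = word_tokenize(text.lower())                  # tokenize + normalize case
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    return [stemmer.stem(t) for t in tokens]              # stem each remaining token

print(preprocess("<p>The cats were flying over the USA!</p>"))
```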
PROJECT. FEATURE EXTRACTION
1. TF-IDF SCHEME
2. WORD EMBEDDING (WORD2VEC)
PROJECT. FEATURE EXTRACTION. TF-IDF
“TF-IDF is a weighting scheme that assigns each term in a
document a weight based on its term frequency (tf) and inverse
document frequency (idf). The terms with higher weight scores
are considered to be more important. It’s one of the most popular
weighting schemes in Information Retrieval”
PROJECT. FEATURE EXTRACTION. TF-IDF
Term Frequency (TF)
“Term Frequency, which measures how frequently a term occurs in a document.
Since every document is different in length, it is possible that a term would appear
much more times in long documents than shorter ones. Thus, the term frequency is
often divided by the document length as a way of normalization”
TF(t) = (Number of times term t appears in a document) / (Total number of terms
in the document)
PROJECT. FEATURE EXTRACTION. TF-IDF
Inverse Document Frequency(IDF)
“IDF: Inverse Document Frequency, which measures how important a term is. While
computing TF, all terms are considered equally important. However it is known that
certain terms, such as "is", "of", and "that", may appear a lot of times but have
little importance. Thus we need to weigh down the frequent terms while scale up
the rare ones, by computing the following:”
IDF(t) = log_e(Total number of documents / Number of documents with term t in
it)
Base 10 logarithms are just as good as these although the values are considerably smaller.
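A direct implementation of the two formulas above on a toy corpus (in practice a library such as scikit-learn's TfidfVectorizer is used, which adds smoothing and normalization on top of this basic scheme):

```python
# TF-IDF exactly as defined above: TF = count / doc length, IDF = ln(N / df).
import math

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "dog", "barked"]]

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term, docs):
    df = sum(1 for d in docs if term in d)    # number of documents containing the term
    return math.log(len(docs) / df)           # natural log, as in the slide

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("cat", docs[0], docs))   # rare term -> higher weight (~0.37)
print(tf_idf("the", docs[0], docs))   # appears in every document -> IDF = 0
```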
PROJECT. FEATURE EXTRACTION. WORD2VEC
WORD2VEC is used for learning vector
representations of words, called "word
embeddings".
PROJECT. TRAINING CLASSIFIER
CLASSIFICATION ALGORITHMS
1. Support Vector Machines
2. k-Nearest Neighbors
3. Multinomial Naive Bayes
PROJECT. TRAINING CLASSIFIER
Support Vector Machines
PROJECT. TRAINING CLASSIFIER
k-Nearest Neighbors
PROJECT. TRAINING CLASSIFIER
Multinomial Naive Bayes
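A sketch (not from the slides) that wires feature extraction and classifier training into one scikit-learn Pipeline; the two training documents and labels are invented placeholders, and LinearSVC can be swapped for KNeighborsClassifier or MultinomialNB from the list above.

```python
# TF-IDF features + linear SVM in a single scikit-learn Pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

X_train = ["great phone, love the camera", "terrible battery, broke in a week"]
y_train = ["positive", "negative"]        # placeholder labeled data

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),   # feature extraction
    ("clf", LinearSVC()),           # classifier (swap in MultinomialNB / kNN as needed)
])
pipeline.fit(X_train, y_train)
print(pipeline.predict(["the camera is great"]))
```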
PROJECT. TEXT CLASSIFICATION EVALUATION
“If you cannot measure it, you cannot improve it”
Lord Kelvin
Main metrics for Text Classification:
Precision and Recall
Precision and recall are the measures used in the information
retrieval domain to measure how well an information retrieval
system retrieves the relevant documents requested by a user.
The measures are defined as follows:

Precision = Total number of documents retrieved that are
relevant / Total number of documents that are retrieved.

Recall = Total number of documents retrieved that are
relevant / Total number of relevant documents in the database.
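Computed with scikit-learn on a toy set of gold labels and predictions (the numbers below follow directly from the definitions above):

```python
# Precision and recall for a text classifier, via scikit-learn metrics.
from sklearn.metrics import classification_report, precision_score, recall_score

y_true = ["spam", "ham", "spam", "ham", "spam"]   # gold labels (toy data)
y_pred = ["spam", "ham", "ham",  "ham", "spam"]   # classifier output

print(precision_score(y_true, y_pred, pos_label="spam"))  # 1.0  (no false positives)
print(recall_score(y_true, y_pred, pos_label="spam"))     # 0.67 (one spam document missed)
print(classification_report(y_true, y_pred))
```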
NLP FRAMEWORKS
QUESTIONS?