RUDOLF EREMYAN
MACHINE LEARNING SOFTWARE ENGINEER
INTRODUCTION TO NATURAL LANGUAGE
PROCESSING
CONTACTS: EREMYAN.RUDOLF@GMAIL.COM HTTPS://WWW.LINKEDIN.COM/IN/RUDOLFEREMYAN/
CHATBOT FRAMEWORK FOR GEORGIAN
LANGUAGE
TI BOT FOR TBC
BANK
• 35K LIKES
• 100K CONVERSATIONS
• 8K ACTIVE USERS PER MONTH
• 41.5K USERS ASKED ABOUT
WEATHER
• 1K P2P TRANSACTIONS IN
AUGUST
SENTIMENT ANALYSIS ON FACEBOOK
COMMENTS
NATURAL LANGUAGE PROCESSING
https://en.wikipedia.org/wiki/Natural_language_processing
NATURAL LANGUAGE PROCESSING (NLP) IS A FIELD
OF COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE AND
COMPUTATIONAL LINGUISTICS CONCERNED WITH THE
INTERACTIONS BETWEEN COMPUTERS AND HUMAN
(NATURAL) LANGUAGES, AND, IN PARTICULAR,
CONCERNED WITH PROGRAMMING COMPUTERS TO
FRUITFULLY PROCESS LARGE NATURAL LANGUAGE
CORPORA.
THE HISTORY OF NLP
https://en.wikipedia.org/wiki/Natural_language_processing
1950 - ALAN TURING PUBLISHED
AN ARTICLE TITLED "COMPUTING
MACHINERY AND
INTELLIGENCE" WHICH
PROPOSED WHAT IS NOW
CALLED THE TURING TEST AS A
CRITERION OF INTELLIGENCE.
THE HISTORY OF NLP
https://en.wikipedia.org/wiki/Natural_language_processing
1954 - THE GEORGETOWN
EXPERIMENT INVOLVED FULLY
AUTOMATIC TRANSLATION OF
MORE THAN SIXTY RUSSIAN
SENTENCES INTO ENGLISH. THE
AUTHORS CLAIMED THAT WITHIN
THREE OR FIVE YEARS, MACHINE
TRANSLATION WOULD BE A SOLVED
PROBLEM.
THE HISTORY OF NLP
https://en.wikipedia.org/wiki/Natural_language_processing
1970 - MANY PROGRAMMERS BEGAN TO WRITE "CONCEPTUAL ONTOLOGIES", WHICH STRUCTURED REAL-
WORLD INFORMATION INTO COMPUTER-UNDERSTANDABLE DATA. EXAMPLES ARE QUALM (LEHNERT, 1977),
POLITICS (CARBONELL, 1979), AND PLOT UNITS (LEHNERT 1981). DURING THIS TIME, MANY CHATTERBOTS
WERE WRITTEN, INCLUDING PARRY AND RACTER.
• WORDNET
• EUROWORDNET
• SENTIWORDNET
THE HISTORY OF NLP
https://en.wikipedia.org/wiki/Natural_language_processing
1980 - THERE WAS A REVOLUTION IN NLP WITH
THE INTRODUCTION OF MACHINE LEARNING
ALGORITHMS FOR LANGUAGE PROCESSING. PART-
OF-SPEECH TAGGING INTRODUCED THE USE OF
HIDDEN MARKOV MODELS TO NLP, AND
INCREASINGLY, RESEARCH HAS FOCUSED ON
STATISTICAL MODELS, WHICH MAKE SOFT,
PROBABILISTIC DECISIONS BASED ON ATTACHING
REAL-VALUED WEIGHTS TO THE FEATURES MAKING
UP THE INPUT DATA.
THE HISTORY OF NLP
https://en.wikipedia.org/wiki/Natural_language_processing
IN RECENT YEARS, THERE HAS BEEN A FLURRY OF RESULTS SHOWING DEEP
LEARNING TECHNIQUES ACHIEVING STATE-OF-THE-ART RESULTS IN MANY
NATURAL LANGUAGE TASKS, FOR EXAMPLE IN LANGUAGE MODELING,
PARSING AND MANY OTHERS.
HAVE YOU EVER USED ANY NLP PRODUCTS?
NLP APPLICATIONS
TEXT CLASSIFICATION
TEXT CLUSTERING
TEXT SUMMARISATION
MACHINE TRANSLATION
SEMANTIC SEARCH
SENTIMENT ANALYSIS
QUESTION ANSWERING
INFORMATION EXTRACTION
NLP. TEXT CLASSIFICATION
Document classification or
document categorization is a
problem in library science,
information science and computer
science. The task is to assign a
document to one or more classes or
categories. This may be done
"manually" or algorithmically.
Popular algorithms:
1. Multinomial Naive Bayes
2. SVM
3. Neural Networks
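A minimal sketch of the Multinomial Naive Bayes approach from the list above, assuming scikit-learn is installed; the documents and labels are invented for illustration.

```python
# Toy text classification: bag-of-words counts + Multinomial Naive Bayes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["cheap loans apply now", "meeting rescheduled to monday",
        "win a free prize today", "project report attached"]
labels = ["spam", "ham", "spam", "ham"]                 # made-up labels

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)                      # documents -> term-count matrix
clf = MultinomialNB().fit(X, labels)

print(clf.predict(vectorizer.transform(["free loans today"])))  # likely ['spam']
```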
NLP. TEXT CLUSTERING
Document clustering (or text
clustering) is the application of
cluster analysis to textual
documents. It has applications in
automatic document organization,
topic extraction and fast information
retrieval or filtering.
Popular algorithms:
1. k-Means
2. DBSCAN
3. Deep Learning
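A quick sketch of k-Means clustering over TF-IDF vectors with scikit-learn; the four sample documents are made up, and with real data the number of clusters has to be chosen or tuned.

```python
# Cluster documents: TF-IDF features + k-Means.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["the cat sat on the mat", "dogs and cats are pets",
        "stock markets fell sharply", "investors sold their shares"]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)      # cluster id per document, e.g. [0 0 1 1]
```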
NLP. TEXT SUMMARISATION
Automatic summarization is the
process of shortening a text
document with software, in order
to create a summary with the
major points of the original
document. Technologies that
can make a coherent summary
take into account variables such
as length, writing style and
syntax.
Popular algorithms:
1. LDA
2. Deep Learning
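As a concrete illustration of extractive summarization (a simpler approach than the LDA and deep learning methods named above), sentences can be ranked by the frequency of the words they contain; a rough sketch with NLTK, assuming the 'punkt' tokenizer data is downloaded:

```python
# Naive extractive summarization: keep the sentences with the highest word-frequency score.
from collections import Counter
from nltk.tokenize import sent_tokenize, word_tokenize   # needs nltk.download('punkt')

def summarize(text, n_sentences=2):
    sentences = sent_tokenize(text)
    words = [w.lower() for w in word_tokenize(text) if w.isalpha()]
    freq = Counter(words)
    # Score each sentence by the summed corpus frequency of its words.
    scores = {s: sum(freq[w.lower()] for w in word_tokenize(s) if w.isalpha())
              for s in sentences}
    top = sorted(sentences, key=scores.get, reverse=True)[:n_sentences]
    return " ".join(sorted(top, key=sentences.index))     # keep original sentence order
```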
NLP. MACHINE TRANSLATION
MT performs simple substitution of words in
one language for words in another, but that
alone usually cannot produce a good
translation of a text because recognition of
whole phrases and their closest counterparts
in the target language is needed. Solving this
problem with corpus-based statistical and neural
techniques is a rapidly growing field that is
leading to better translations, handling
differences in linguistic typology, translation of
idioms, and the isolation of anomalies.
Algorithms:
1. Rule based
2. Statistical methods
3. Encoder-Decoder
NLP. SEMANTIC SEARCH
Semantic search seeks to
improve search accuracy by
understanding searcher intent
and the contextual meaning of
terms as they appear in the
searchable dataspace, whether
on the Web or within a closed
system, to generate more
relevant results.
Approaches:
1. Entity Recognition
2. User context
NLP. SENTIMENT ANALYSIS
Sentiment Analysis is the
process of determining whether a
piece of writing is positive,
negative or neutral. It's also
known as opinion mining,
deriving the opinion or attitude of
a speaker.
Algorithms:
1. Lexicon-based
2. Machine Learning (SVM)
3. Deep Learning (RNN, LSTM)
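A lexicon-based example using the VADER analyzer that ships with NLTK (the 'vader_lexicon' data must be downloaded first); the sample sentence is arbitrary.

```python
# Lexicon-based sentiment with VADER (run nltk.download('vader_lexicon') first).
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("I love this product, the camera is amazing!"))
# {'neg': 0.0, 'neu': ..., 'pos': ..., 'compound': ...}; compound > 0 means positive overall
```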
NLP. QUESTION ANSWERING
Question answering (QA) is a
computer science discipline within
the fields of information retrieval and
natural language processing (NLP),
which is concerned with building
systems that automatically answer
questions posed by humans in a
natural language.
Algorithms:
1. Rule based
2. Machine Learning
3. Deep Learning
NLP. INFORMATION EXTRACTION
Information extraction is the task of automatically
extracting structured information from unstructured
and/or semi-structured machine-readable documents.
NLP TOOLS
1. MORPHOLOGICAL ANALYZER
2. POS TAGGER
3. STEMMER
4. PARSERS
5. NAMED ENTITY RECOGNIZER
NLP. STEMMER
Stemmers remove morphological affixes from words, leaving only the word stem.
bananas -> banana
flies -> fli
cats -> cat
dogs -> dog
How about “flies” -> fly?
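The examples above can be reproduced with NLTK's Porter stemmer:

```python
# Stemming with NLTK's Porter stemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["bananas", "flies", "cats", "dogs"]:
    print(word, "->", stemmer.stem(word))
# bananas -> banana, flies -> fli, cats -> cat, dogs -> dog
```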
NLP. MORPHOLOGICAL ANALYZER
Lemmatization usually refers to doing things properly
with the use of a vocabulary and morphological
analysis of words, normally aiming to remove
inflectional endings only and to return the base or
dictionary form of a word, which is known as the
lemma.
flies -> fly
went -> go
am, are, is -> be
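The same idea with NLTK's WordNet lemmatizer (the 'wordnet' data must be downloaded); note that irregular forms such as "went" are only resolved when the part of speech is supplied.

```python
# Lemmatization with NLTK's WordNet lemmatizer (run nltk.download('wordnet') first).
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("flies"))           # fly  (default POS is noun)
print(lemmatizer.lemmatize("went", pos="v"))   # go
print(lemmatizer.lemmatize("are", pos="v"))    # be
```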
NLP. POS TAGGER
A Part-Of-Speech Tagger (POS Tagger) is a piece of
software that reads text in some language and
assigns parts of speech to each word (and other
token), such as noun, verb, adjective, etc., although
generally computational applications use more fine-
grained POS tags like 'noun-plural'.
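A small example with NLTK's default tagger (needs the 'punkt' and 'averaged_perceptron_tagger' data):

```python
# Part-of-speech tagging with NLTK.
import nltk

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))
# [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ'), ...]
```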
NLP. PARSER
A natural language parser is a program that works out the grammatical structure of
sentences, for instance, which groups of words go together (as "phrases") and which
words are the subject or object of a verb.
(Figures: a dependency tree and a constituency tree.)
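A dependency-parse sketch with spaCy, assuming the small English model has been installed (python -m spacy download en_core_web_sm); the sentence is arbitrary.

```python
# Dependency parsing with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The dog chased the ball")
for token in doc:
    print(token.text, token.dep_, "<-", token.head.text)
# e.g. dog nsubj <- chased, ball dobj <- chased
```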
NLP. NAMED ENTITY RECOGNIZER
Named-entity recognition (NER) (also known as entity identification, entity chunking and
entity extraction) is a subtask of information extraction that seeks to locate and classify
named entities in text into pre-defined categories such as the names of persons,
organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
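Using the same spaCy model, named entities can be read off a parsed document; the example sentence and the exact labels produced are illustrative.

```python
# Named-entity recognition with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Rudolf works at TBC Bank in Tbilisi and sent $100 in August 2017.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. Rudolf PERSON, TBC Bank ORG, Tbilisi GPE, $100 MONEY, August 2017 DATE
```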
PROJECT. THEORETICAL PART
THERE IS A DATASET OF LABELED TEXTS;
OUR TASK IS TO CREATE A MACHINE LEARNING
PIPELINE FOR TEXT CLASSIFICATION,
TRAINED ON THE GIVEN DATA.
PROJECT. ML PIPELINE
text preprocessing
feature extraction
training classifier
evaluation
PROJECT. TEXT PREPROCESSING
• Removing non-text (e.g., ads, javascript)
• Dealing with text encoding (e.g., Unicode)
• Normalization
– extra-terrestrial/extraterrestrial, extra terrestrial
• Stemming
– computer/computation
• Morphological analysis
– car/cars
• Capitalization
– Now/NOW, led/LED
• Named entity extraction
– USA/usa
• Tokenization
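A minimal preprocessing function covering a few of the steps listed above (removing non-text, tokenization, lowercasing, stop-word removal, stemming), assuming NLTK with the 'punkt' and 'stopwords' data:

```python
# Minimal text preprocessing pipeline with NLTK.
import re
from nltk.corpus import stopwords            # needs nltk.download('stopwords')
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize      # needs nltk.download('punkt')

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)                  # drop HTML tags / non-text
    tokens = word_tokenize(text.lower())                  # tokenize + normalize case
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    return [stemmer.stem(t) for t in tokens]              # stem each remaining token

print(preprocess("<p>The cats were flying over the USA!</p>"))
```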
PROJECT. FEATURE EXTRACTION
1. TF-IDF SCHEME
2. WORD EMBEDDING (WORD2VEC)
PROJECT. FEATURE EXTRACTION. TF-IDF
“TF-IDF is a weighting scheme that assigns each term in a
document a weight based on its term frequency (tf) and inverse
document frequency (idf). The terms with higher weight scores
are considered to be more important. It’s one of the most popular
weighting schemes in Information Retrieval”
PROJECT. FEATURE EXTRACTION. TF-IDF
Term Frequency (TF)
“Term Frequency, which measures how frequently a term occurs in a document.
Since every document is different in length, it is possible that a term would appear
much more times in long documents than shorter ones. Thus, the term frequency is
often divided by the document length as a way of normalization”
TF(t) = (Number of times term t appears in a document) / (Total number of terms
in the document)
PROJECT. FEATURE EXTRACTION. TF-IDF
Inverse Document Frequency(IDF)
“IDF: Inverse Document Frequency, which measures how important a term is. While
computing TF, all terms are considered equally important. However it is known that
certain terms, such as "is", "of", and "that", may appear a lot of times but have
little importance. Thus we need to weigh down the frequent terms while scale up
the rare ones, by computing the following:”
IDF(t) = log_e(Total number of documents / Number of documents with term t in
it)
Base 10 logarithms are just as good as these although the values are considerably smaller.
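A direct implementation of the two formulas above on a toy corpus (in practice a library such as scikit-learn's TfidfVectorizer is used, which adds smoothing and normalization on top of this basic scheme):

```python
# TF-IDF exactly as defined above: TF = count / doc length, IDF = ln(N / df).
import math

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "dog", "barked"]]

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term, docs):
    df = sum(1 for d in docs if term in d)    # number of documents containing the term
    return math.log(len(docs) / df)           # natural log, as in the slide

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("cat", docs[0], docs))   # rare term -> higher weight (~0.37)
print(tf_idf("the", docs[0], docs))   # appears in every document -> IDF = 0
```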
PROJECT. FEATURE EXTRACTION. WORD2VEC
WORD2VEC is used for learning vector
representations of words, called "word
embeddings".
PROJECT. TRAINING CLASSIFIER
CLASSIFICATION ALGORITHMS
1. Support Vector Machines
2. k-Nearest Neighbors
3. Multinomial Naive Bayes
PROJECT. TRAINING CLASSIFIER
Support Vector Machines
PROJECT. TRAINING CLASSIFIER
k-Nearest Neighbors
PROJECT. TRAINING CLASSIFIER
Multinomial Naive Bayes
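A sketch (not from the slides) that wires feature extraction and classifier training into one scikit-learn Pipeline; the two training documents and labels are invented placeholders, and LinearSVC can be swapped for KNeighborsClassifier or MultinomialNB from the list above.

```python
# TF-IDF features + linear SVM in a single scikit-learn Pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

X_train = ["great phone, love the camera", "terrible battery, broke in a week"]
y_train = ["positive", "negative"]        # placeholder labeled data

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),   # feature extraction
    ("clf", LinearSVC()),           # classifier (swap in MultinomialNB / kNN as needed)
])
pipeline.fit(X_train, y_train)
print(pipeline.predict(["the camera is great"]))
```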
PROJECT. TEXT CLASSIFICATION EVALUATION
“If you cannot measure it, you cannot improve it”
Lord Kelvin
Main metrics for Text Classification:
Precision and Recall
Precision and recall are the measures used in the information
retrieval domain to measure how well an information retrieval
system retrieves the relevant documents requested by a user.
The measures are defined as follows:

Precision = Total number of documents retrieved that are
relevant / Total number of documents that are retrieved.

Recall = Total number of documents retrieved that are
relevant / Total number of relevant documents in the database.
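Computed with scikit-learn on a toy set of gold labels and predictions (the numbers below follow directly from the definitions above):

```python
# Precision and recall for a text classifier, via scikit-learn metrics.
from sklearn.metrics import classification_report, precision_score, recall_score

y_true = ["spam", "ham", "spam", "ham", "spam"]   # gold labels (toy data)
y_pred = ["spam", "ham", "ham",  "ham", "spam"]   # classifier output

print(precision_score(y_true, y_pred, pos_label="spam"))  # 1.0  (no false positives)
print(recall_score(y_true, y_pred, pos_label="spam"))     # 0.67 (one spam document missed)
print(classification_report(y_true, y_pred))
```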
NLP FRAMEWORKS
QUESTIONS?