Voice and Text Summary Generation using
ML and NLP
Mini Project Report
Submitted in partial fulfillment of the requirements for the degree of
Bachelor of Engineering (Computer Engineering)
by:
Aditya Singh TU3F1819005
Prathamesh Chaskar TU3F1819006
Shruti Rathod TU3F1819029
Under the Guidance of
Dr. Mire Archana Vasant
Department of Computer Engineering
TERNA ENGINEERING COLLEGE
Nerul (W), Navi Mumbai 400706
(University of Mumbai)
(2020-21)
Internal Approval Sheet
TERNA ENGINEERING COLLEGE, NERUL
Department of Computer Engineering
Academic Year 2020-21
CERTIFICATE
This is to certify that the mini project entitled “Voice and Text Summary
Generation using ML and NLP” is a bonafide work of
Aditya Singh TU3F1819005
Prathamesh Chaskar TU3F1819006
Shruti Rathod TU3F1819029
submitted to the University of Mumbai in partial fulfillment of the requirement
for the award of the Bachelor of Engineering (Computer Engineering).
Guide Head of Department Principal
Approval Sheet
Project Report Approval
This Mini Project Report – entitled “Voice and Text Summary Generation
using ML and NLP” by following students is approved for the degree of
B.E. in "Computer Engineering".
Submitted by:
Aditya Singh TU3F1819005
Prathamesh Chaskar TU3F1819006
Shruti Rathod TU3F1819029
Examiners Name & Signature:
1.---------------------------------------------------------
2.----------------------------------------------------------
Date: ---------------------------------
Place: ---------------------------------
Declaration
We declare that this written submission represents our ideas in our own words and
where others' ideas or words have been included, we have adequately cited and
referenced the original sources. We also declare that we have adhered to all
principles of academic honesty and integrity and have not misrepresented or
fabricated or falsified any idea/data/fact/source in our submission. We understand
that any violation of the above will be cause for disciplinary action by the Institute
and can also evoke penal action from the sources which have thus not been
properly cited or from whom proper permission has not been taken when needed.
Aditya Singh TU3F1819005 ---------------------------
Prathamesh Chaskar TU3F1819006 ---------------------------
Shruti Rathod TU3F1819029 ---------------------------
Date: _____________________
Place: _____________________
Acknowledgement
We would like to express our sincere gratitude towards our guide Dr. Mire
Archana Vasant for the help, guidance and encouragement she provided during
the project development. This work would not have been possible without her
valuable time, patience and motivation. We thank her for making our stint
thoroughly pleasant and enriching. It was a great learning experience and an
honor being her students.
We are deeply thankful to Dr. Mire Archana Vasant (H.O.D., Computer
Department) and the entire team in the Computer Department. They supported us
with scientific guidance, advice and encouragement; they were always helpful and
enthusiastic, and this inspired us in our work.
We take the privilege to express our sincere thanks to Dr. L. K. Ragha, our
Principal, for providing encouragement and support throughout our work.
Aditya Singh TU3F1819005 ---------------------------
Prathamesh Chaskar TU3F1819006 ---------------------------
Shruti Rathod TU3F1819029 ---------------------------
Date: _____________________
Place: _____________________
Abstract
Text summarization is a subdomain of Natural Language Processing (NLP) that
deals with extracting summaries from huge chunks of texts. It is a process of
generating a concise and meaningful summary of text from multiple text resources
such as books, news articles, blog posts, research papers, emails and more.
Machine learning algorithms can be trained to comprehend documents and identify
the sections that convey important facts and information before producing the
required summarized texts. This project uses machine learning and NLP to
generate a condensed form of the text and also provide the entities present in the
given input.
The number of sentences picked for compression may depend on the compression
ratio of the summary.
Summarization systems often need to have additional evidence that they can utilize
in order to specify the most important topics of the documents. For example, a
scientific paper or a resume will contain a considerable amount of information
which needs to be summarized and noted. An aspect of machine learning
known as Named Entity Recognition is used to fulfill this purpose and provide
the crucial data and information present in the document.
A trained Named Entity Recognition model helps to recognize the entities in a
document and can be very useful for extracting valuable information from
information-centric documents such as applications, resumes and research papers.
In this report, machine learning is used to generate the summaries and also to
collect the named entities in a document and save them using a file system.
Table of Contents
Chapter 1 Introduction
 1.1 Aim and Objectives of Project
 1.2 Scope
 1.3 Organization of the Report
Chapter 2 Literature Survey
 2.1 Existing System in market use
Chapter 3 Software Analysis
 3.1 Proposed System
Chapter 4 Design and Implementation
 4.1 System Architecture
 4.2 Summarization Flow Diagram
 4.3 spaCy NER Architecture
 4.4 NER Flow Diagram
Chapter 5 Methodology
 5.1 Approach
 5.2 Preprocessing
 5.3 Vectorizing
 5.4 Cosine Similarity
 5.5 Ranking
 5.6 Named Entity Recognition
Chapter 6 Implementation Details
 6.1 Working of the Code
Chapter 7 Performance Evaluation
Chapter 8 Results and Conclusion
 8.1 Project Screenshots
Chapter 9 References
List of Figures
4.1 System Architecture Diagram
4.2 Summarization Flow Diagram
4.3 spaCy NER Architecture
4.4 NER Flow Diagram
5.4 Cosine Similarity Diagram
6.1 Sample NER Training Data
7.1 NER Sample Losses
7.2 Scatter Plot for NER
7.3 NER Losses with Iterations
Chapter 1
Introduction
Automatic text processing is a research field which is currently extremely active.
An important task in this field is automatic summarization, which consists of
reducing the size of a text document while preserving its crucial information
content. A summarization system produces a condensed representation of the
input given by the user for their consumption. This application has been
researched a lot, yet it is still at a nascent stage compared to manual
summarization. It is very useful to automatically condense the unstructured text
of an article into a summary.
Abstractive summarization selects words based on semantic understanding; it
interprets and examines the text using advanced natural language techniques,
in much the same way a human reads a text article and then summarizes it in
their own words. Extractive methods, in contrast, attempt to summarize the
articles by selecting a subset of words that retain the most important points.
Purely extractive summaries sometimes give better results.
In addition to a wholesome summary of a text document, special documents such
as resumes, scientific papers and research papers need additional evidence that
can be utilized for a much better understanding of the condensed form of the
document. For this purpose, a Named Entity Recognition model is trained to
recognize words of special importance, such as organizations, locations, names,
currencies, dates, times, etc., and provide the user with this additional
information to cover all important aspects of the document summary.
1.1 Aim and Objectives of Project
In order to automate the manual process of generating a meaningful condensed
form of a text document, we aim to generate summaries of a given text document
and extract the important entities mentioned in it, using machine learning and
natural language processing, namely GloVe word embeddings and a trained
Named Entity Recognition model. The objective of this project is to use the
machine learning algorithm known as Global Vectors for Word Representation
(GloVe), with the statistics of word occurrence as its primary source of data, to
generate summaries of the provided input, and to train a Named Entity
Recognition (NER) model that uses linguistic grammar-based techniques to
extract the important entities from that input.
1.2 Scope
The scope of this project is computer-based automated production of condensed
versions of documents. This automation is necessary for an information-driven
society. Its applications lie in areas such as speech summaries, commentary,
sports, news, scientific and business reports, human resources, resume filtering,
etc. In addition, the extracted information about the entities present in the
original document, such as names, locations, organizations, times, dates,
currencies, numbers and more, can be used to evaluate complex documents.
1.3 Organization of the report
Chapter 1 gives a brief overview of the aim of developing this project. The
problem definition tells us about the expected outcome of the project for the
application.
Chapter 2 of the report includes the literature survey on the existing system.
Chapter 3 describes the software used in the design, along with our proposed
system.
Chapter 4 presents the flow diagrams that give an abstract view of the system,
along with a detailed explanation of the technologies and techniques used in the
development of the project.
Chapter 5 describes the project modules that are designed and used in our
proposed system.
Chapter 6 shows the overall working of the system.
Chapter 7 evaluates the performance of the project.
Chapter 8 describes the results achieved through this project and its applications.
It concludes with a summary of the entire project and the future scope for
research and development.
Chapter 9 lists the references used in this report.
Chapter 2
Literature Survey
1) Automatic Text Summarization Survey[1]
Year published: 2017
The text summarization covered is based on the query-based and generic
approaches as well as neural networks. A neural network is trained on a corpus
of articles and then modified through feature fusion to produce a summary
from the highest-ranked sentences of the article. Through feature fusion, the
network discovers the importance of the various features used to determine the
summary-worthiness of each sentence.
Part-of-speech tagging is the process of grouping words according to their
grammatical category, such as nouns, verbs, adverbs, adjectives, etc.
In stop-word filtering, stopwords are words which are filtered out before or
after processing of the corpus. The choice of stopwords is subjective and depends
on the application; for example, "a", "an", "in" and "by" can be considered stop
words in a plain text document.
In query-based text summarization, the sentences of a given document are
scored based on the frequency counts of words. The sentences with the highest
scores are then extracted for the output summary, together with their
structural context.
2) GloVe: Global Vectors for Word Representation[2]
Year published: 2014
The statistics of word occurrences in a corpus are the primary source of
information available to all unsupervised methods for learning word
representations. Semantic vector space models of language represent each
word with a real-valued vector. These vectors can be used as features in a
variety of applications such as document classification, question answering,
named entity recognition, determining the similarity between multiple
documents, etc. Most word vector methods rely on the distance or angle
between pairs of word vectors as the primary method of evaluating the
intrinsic quality of such a set of word representations.
3) Named Entity Recognition[3]
Year published: 2011
NER systems have been created that use linguistic grammar-based
techniques as well as statistical models based on machine learning.
Predefined NER models are available, but an NER model can also be trained
to perform better or to suit specific needs, and multiple NER models can be
created for specific purposes. These models should be robust across multiple
domains, as they are expected to be applied to diverse inputs. NER is a
fundamental task at the core of Natural Language Processing systems. It has
two basic subtasks: first, the identification of proper names in the text, and
second, the classification of these names into a set of trained categories of
interest such as persons, organizations, locations, etc.
2.1 Existing System in market use:
Abstractive summarizers are so called because they do not select sentences
from the originally given text passage to create the summary. Instead, they
produce a paraphrasing of the main contents of the given text, using a
vocabulary set different from the original document. This is very similar to a
manual summarization process: we create a semantic representation of the
document, then pick words from our general vocabulary and create a short
summary that represents all the points of the original document.
The basic structure for implementing this type of system is the
sequence-to-sequence RNN. These models are designed to create an output
sequence of words from an input sequence of words.
The Encoder-Decoder Networks
The encoder is responsible for taking in the input and generating a final state
vector. The encoder may contain LSTM, RNN or GRU layers; LSTM layers are
mostly used because they mitigate the exploding and vanishing gradient
problems. The model also uses the attention mechanism, which aims to focus
on some specific sequences from the input only, rather than the entire input
sequence, to predict a word. This approach is very similar to the human
approach to the problem.
(Diagram: Input → Encoder → internal state vector → Decoder → Output)
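For illustration only (this project itself uses an extractive approach rather than a trained seq2seq model), a minimal sketch of such an encoder-decoder network in Keras might look as follows; the vocabulary size and layer dimensions are arbitrary assumptions:

from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model

vocab_size, embed_dim, latent_dim = 10000, 128, 256  # assumed sizes

# Encoder: reads the input sequence and produces a final state vector.
encoder_inputs = Input(shape=(None,))
encoder_embed = Embedding(vocab_size, embed_dim)(encoder_inputs)
_, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_embed)

# Decoder: generates the output sequence, initialized with the encoder state.
decoder_inputs = Input(shape=(None,))
decoder_embed = Embedding(vocab_size, embed_dim)(decoder_inputs)
decoder_outputs, _, _ = LSTM(latent_dim, return_sequences=True,
                             return_state=True)(decoder_embed,
                                                initial_state=[state_h, state_c])
outputs = Dense(vocab_size, activation='softmax')(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], outputs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')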
Chapter 3
Software Analysis
The software we use, along with its purposes, is listed as follows:
1) Natural Language Toolkit
The Natural Language Toolkit is a suite of libraries and programs for symbolic
and statistical natural language processing for English, written in the Python
programming language. Summary generation and named entity recognition
are subtasks of information extraction. We use NLTK in this project for
multiple purposes.
From NLTK we use the Punkt Sentence Tokenizer. This tokenizer divides a
text into a list of sentences by using an unsupervised algorithm to build a
model for abbreviation words, collocations and words that start sentences.
model for abbreviation words, collocations and words that start sentences.
The NLTK data package includes a pre-trained Punkt tokenizer for the English
language.
Another is stopwords. Stopwords are English words which do not add much
meaning to a sentence; they can safely be ignored without sacrificing the
meaning of the sentence, for example "the", "he", "have", etc. We use NLTK
to download the stopwords, which are available for various other languages
as well.
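As a small illustration of both utilities (assuming the punkt and stopwords data have been downloaded as shown in Chapter 6):

from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords

text = "Dr. Smith arrived at 9 a.m. He gave a lecture on NLP."
print(sent_tokenize(text))  # Punkt avoids splitting on the abbreviation "Dr."
stop_words = set(stopwords.words('english'))
print([w for w in text.split() if w.lower() not in stop_words])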
2) spaCy
spaCy is an open-source software library for advanced natural language
processing, written in the Python programming language and Cython. It focuses
on providing software for production usage. It supports deep learning
workflows by connecting statistical model training with machine learning
libraries. spaCy provides good features for named entity recognition, text
categorization, part-of-speech tagging, etc.
spaCy provides pipelines such as en_core_web_sm, a small English pipeline
trained on written web text such as blogs, news and comments, which includes
vocabulary, vectors, syntax and entities.
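A minimal sketch of using this pipeline (assuming en_core_web_sm has been installed, e.g. via python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Apple was founded by Steve Jobs in California.")
print([(token.text, token.pos_) for token in doc])   # part-of-speech tags
print([(ent.text, ent.label_) for ent in doc.ents])  # named entities, e.g. ORG, PERSON, GPE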
3) Scikit-learn
It is a free machine learning library for Python. It provides various features
like classification, regression and clustering algorithms. In our project we
use a metric of sklearn to compute the similarity between the texts and
sentences. We use the metric called cosine similarity. Cosine similarity is a
measure of similarity between two non-zero vectors of an inner product space.
It is defined as the cosine of the angle between them, which is also the same as
the inner product of the same vectors after both are normalized to length 1.
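A small sketch of this metric (the vectors are made up for illustration):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([[1.0, 2.0, 3.0]])
b = np.array([[2.0, 4.0, 6.0]])  # b is a scaled copy of a
print(cosine_similarity(a, b))   # [[1.]] -- parallel vectors, maximal similarity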
4) Networkx
NetworkX is a Python library for studying graphs and networks. It is used for
the creation, manipulation and study of the structure, dynamics and functions
of complex networks.[9] We use it to PageRank the nodes of the graph created
from the similarity matrix. The PageRank algorithm outputs a probability
distribution used to represent the likelihood that a sentence is worthy of
inclusion in a summary.
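A minimal sketch of this ranking step (the similarity values are made up for illustration):

import numpy as np
import networkx as nx

sim_mat = np.array([[0.0, 0.8, 0.1],
                    [0.8, 0.0, 0.3],
                    [0.1, 0.3, 0.0]])   # pairwise sentence similarities
graph = nx.from_numpy_array(sim_mat)    # weighted undirected graph
scores = nx.pagerank(graph)             # {node index: rank}
print(sorted(scores, key=scores.get, reverse=True))  # sentence indices by rank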
3.1 Proposed System
This project uses extractive summary generation techniques. Extractive
summarization is the extraction of the most important sentences from the whole
document, such that they coherently represent the document. There can be
multiple approaches depending on the summarization application; for
summarizing reviews, for example, we may want to pick the highly positive and
negative sentences as the summary, while another approach picks sentences
which contain objects and entities such as names, locations, persons, dates,
times, etc.
If we want to summarize a report or an article, we want to pick the important or
main sentences from the article. To address the problem of finding the main and
important sentences, we use a ranking algorithm on the nodes of a generated
graph; one such algorithm is PageRank, which takes an unsupervised approach.
To get additional information such as the named entities, we use a custom
Named Entity Recognition model built with spaCy. NER is a subtask of
information extraction that seeks to locate and classify specified categories of
entities in a body of text.
Chapter 4
Design and Implementation
4.1 System Architecture
(Diagram: Document → File System → Identify Roles → Text Processing →
Intermediate State → Sentence Scoring → Sentence Extraction →
Summarization → Result)
4.2 Summarization Flow diagram
(Flow diagram: Text File → Sentence Segmentation → Sentence Tokenization →
Stop Word Removal → Vectorizing → Similarity Matrix → NetworkX Graph →
Local Word Score → Total Sentence Score → Sentence Extraction →
Summary Document)
4.3 spaCy NER Architecture
(Diagram: Text → Tokenizer → Tagger → Parser → NER → Document;
together these components form the spaCy NLP pipeline)
4.4 NER Flow Diagram
(Flow diagram: Training Data (Text + Labels) → Model → Predict →
Gradient update → Save → Updated Model)
Chapter 5
Methodology
5.1 Approach
The type of approach we use is called the extractive approach. In this
approach, a subset of words that represents the most important points is pulled
from a piece of the text and combined to make a summary. The dataset we use
consists of large samples of text with valid sources.
The extracted features can be of two kinds: statistical features, which are based
on the frequency of some elements in the text, and linguistic features, which are
extracted from a simplified argumentative structure of the text.
First, we load the punkt and stopwords resources from NLTK for the purposes
described above. We load spaCy for its part-of-speech tagging, which is
provided by the spaCy pipeline called en_core_web_sm. In corpus linguistics,
part-of-speech tagging, also called grammatical tagging, is the process of
marking up a word in a text as corresponding to a particular part of speech,
based on both its definition and its context. In simplified form, it is the
identification of words as nouns, verbs, adjectives, adverbs, etc.
Unsupervised tagging techniques use an untagged corpus as their training data
and produce the tagset by induction; that is, they observe the patterns in the
words used and derive part-of-speech categories themselves. For example, the
statistics readily reveal that "the", "a" and "an" occur in similar contexts,
while "eat" occurs in a very different context. With sufficient iterations,
classes of words emerge that are very similar to those a human linguist would
expect. The tagger converts the sentences into the desired form: a list of tuples
where each tuple has a word and a tag element in it.
5.2 Preprocessing
In addition to part-of-speech tagging, we need to score the individual words as
well as the combined sentences in the text. For this we use the Punkt sentence
tokenizer, an NLTK project. This tokenizer divides a text into a list of
sentences by using an unsupervised algorithm to build a model for abbreviation
words, collocations and words that start sentences. The NLTK data package
includes a trained Punkt tokenizer for the English language.
Before going further, we need to do some preprocessing of the text data that
we have. We have to filter out useless data before processing further. In
natural language processing, these useless words are termed stop words. A stop
word is a very commonly used word such as "the", "a", "an", "in", etc. that an
engine has been programmed to ignore, both when indexing entries for
searching and when retrieving them as results. We do not want these words to
take up space in the system and consume valuable processing time, so we
remove them by storing a list of words that we consider to be stop words. The
Natural Language Toolkit in Python has lists of stopwords stored for multiple
languages. These words are not important for the algorithm and need to be
removed.
In addition to removing the stop words, we also clean the sentences by:
1) Converting to lowercase
2) Removing non-alphabetic characters
3) Removing extraneous characters
4) Removing stop words
5) Lemmatizing the text
5.3 Vectorizing
GloVe is an unsupervised learning algorithm used for obtaining vector
representations of words. Word vectors are simply vectors of numbers that
represent the meaning of a word. Traditional machine learning approaches
such as one-hot encoding and bag-of-words models, which use dummy
variables to represent the presence or absence of a word in an observation or
sentence, can be useful for some machine learning tasks, but they do not
capture information about a word's meaning or context. This means potential
relationships, such as contextual closeness, are not captured across collections
of words.
Such encodings often provide an insufficient baseline for complex natural
language processing and lack the sophistication needed for more complex tasks
such as translation, summary generation or categorization. Word vectors
represent words as multidimensional continuous floating-point numbers where
semantically similar words are mapped to proximate points in geometric
space.[5] A word vector is a row of real-valued numbers, as opposed to dummy
numbers, where each point captures a dimension of the word's meaning and
where semantically similar words have similar vectors.
5.4 Cosine Similarity
After generating the word vectors, we generate the similarity matrix. We
generate this matrix using cosine similarity, computed between samples X and
Y as the normalized dot product of X and Y. It is a metric used to determine
how similar two entities are irrespective of their size. Mathematically, it
measures the cosine of the angle between two vectors projected in a
multi-dimensional space.[6]
So mathematically, if X and Y are two vectors, the cosine equation
cos(θ) = (X · Y) / (|X| |Y|) gives the cosine of the angle between them. If the
angle between the vectors is close to 0 degrees, i.e. the cosine of the angle is
close to 1, the two vectors are very similar to each other. If the angle is closer
to 90 degrees, the cosine is close to 0 and the vectors are nearly orthogonal,
i.e. unrelated, while a cosine close to -1 means the vectors point in opposite
directions, which makes them very different from each other.[6]
(Figure 5.4: Cosine similarity diagram comparing similar vectors A, B and C)
The advantage of using cosine similarity over other similarity metrics such as
Euclidean distance is that cosine similarity does not take the size of the sample
into account. In other words, it does not take the magnitude of the vectors into
consideration, only the angle formed between the two vectors, which is very
beneficial for comparing similar texts of different lengths.
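A small worked example of the formula (vectors chosen arbitrarily):

import numpy as np

def cos_sim(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

x = np.array([1.0, 2.0])
y = np.array([2.0, 4.0])   # same direction as x, twice the magnitude
z = np.array([-2.0, 1.0])  # perpendicular to x
print(cos_sim(x, y))  # ~1.0 -- very similar, regardless of magnitude
print(cos_sim(x, z))  # 0.0 -- orthogonal, unrelated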
5.5 Ranking
In this part we rank the features of the text, the preprocessed words and then
the sentences as a whole, so as to combine them into an extractive summary.
For this we make use of the similarity matrix to construct a graph on which the
ranking algorithm is performed. The NetworkX package provides for the
creation, manipulation and study of the structure, dynamics and functions of
complex networks and graphs.
We represent the similarity matrix as a NumPy ndarray, which serves as the
adjacency matrix of a graph. On this adjacency matrix we then run a ranking
algorithm called PageRank. PageRank returns the rank of each node in the
graph represented by the adjacency matrix; it computes a ranking of the nodes
based on the structure of the incoming links.[7] Originally it was designed as an
algorithm to rank web pages. It works by counting the number and quality of
links to a node to determine a rough estimate of how important the node is.
The underlying assumption is that more important nodes are likely to receive
more references in multiple contexts.
PageRank outputs a probability distribution used to represent the likelihood
that a random walker arrives at a given node. It can be calculated for
collections of documents of any size.[8]
5.6 Named Entity Recognition
This process is used to extract additional, very specific information from the
document. This additional information extraction is very useful for documents
like scientific papers, research papers, resumes or job applications, etc.
Using spaCy we get language processing pipelines. When we call the pipeline
on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is
then processed in several different steps; this is also known as the processing
pipeline.[3] The pipeline used by trained models typically includes a tagger, a
lemmatizer, a parser and the entity recognizer. Each pipeline component
returns the processed Doc, which is then passed to the next component.
spaCy does provide an NER pipeline component, but we train our own NER
model and put it into that NLP pipeline. For the training dataset, the model is
provided with over 200 resume samples with marked entities, shuffled after
each training iteration so that the model does not overfit to the order of the
samples. Using these we calculate the rankings of the multiple nodes and
features. Then we calculate some extra parameters such as sentence position,
sentence length, sentence weights, position tags, word weights, inverse sentence
frequency, token weights and token frequencies, and using these we generate
the extractive summaries.
Chapter 6
Implementation Details
The implementation of this project is done using the following resources, in
addition to the imported Python packages.
1) glove.6B.100d.txt
The GloVe algorithm is used for obtaining the vector representations of
words. The complete GloVe 6B set consists of four files with four embedding
representations, which are[5]:
a) glove.6B.50d.txt for 6 billion tokens and 50 features
b) glove.6B.100d.txt for 6 billion tokens and 100 features
c) glove.6B.200d.txt for 6 billion tokens and 200 features
d) glove.6B.300d.txt for 6 billion tokens and 300 features
We are using the one with 100 features, per our project's current need.
The contents of this GloVe file are numerical representations of words in
almost every context as continuous numbers, including the special characters
commonly used in English text.
An example of the content in the file is given as:
the -0.038194 -0.24487 0.72812 -0.39961 0.083172 0.043953
, -0.10767 0.11053 0.59812 -0.54361 0.67396 0.10663
of -0.1529 -0.24279 0.89837 0.16996 0.53516
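A minimal sketch of loading this file into a dictionary mapping words to vectors (assuming glove.6B.100d.txt is in the working directory); the resulting word_embeddings dictionary is what the text_rank function in Chapter 6 consumes:

import numpy as np

word_embeddings = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.split()
        # first field is the token, the remaining 100 fields are the vector
        word_embeddings[parts[0]] = np.asarray(parts[1:], dtype='float32')
print(len(word_embeddings))  # vocabulary size of the 6B set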
2) Pipeline en_core_web_sm
It is a pipeline trained for English on written web text such as blogs, news,
comments and articles.
This machine learning model is used for part-of-speech tagging in our project.
It comes in three variants: small, medium and large. The difference is mostly
in size and statistical accuracy; the larger models take much longer to load
than the small model.
3) Data for Training NER model
The custom NER model is built and trained on a dataset consisting of
documents which contain high amounts of additional information. The current
choice of documents in the training dataset is resumes and job applications.
The dataset is provided as a .pkl file. This type of file is created by pickle, a
Python module that enables objects to be serialized to files on disk and
deserialized back into the program at runtime. It contains the byte stream
which represents the objects in the file.
6.1 Working of the Code
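All snippets below assume the following imports; they are consolidated here (as our addition, not part of the original listing) so that each function shown is runnable:

import pickle
import random
import re

import networkx as nx
import nltk
import numpy as np
import pandas as pd
import spacy
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import cosine_similarity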
1) Loading the SpaCy pipeline model for parts-of-speech tagging, tokenizer
and stopwords
nltk.download('punkt')
nltk.download('stopwords')
nlp_spacy = spacy.load('en_core_web_sm')
stop_words = stopwords.words('english')  # list used by remove_stopwords below
2) Removing the extra spaces from an article.
It takes the article or the document as an input and removes the extra spaces
and some special characters from the body.
def remove_extraneous_text(sentence:str)->str:
    # Remove multiple spaces
    sentence = re.sub(" +", " ", sentence)
    # Drop anything before a dateline marker such as ") --"
    if ") --" in sentence:
        sentence = sentence.split(") --")[-1]
    return sentence
3) Removing StopWords
This function takes the sentences as the input and removes all the stopwords
from the sentence. It returns the same sentence without any stopwords.
def remove_stopwords(sentence:str)->str:
    sentence = " ".join([word for word in sentence.split()
                         if word not in stop_words])
    return sentence
4) Text Lemmatization
This function takes a sentence as input and uses spaCy to convert each word
into its lemma. The lemma is the base or dictionary form of a word (for
example, "running" lemmatizes to "run"). We use the nlp_spacy object defined
when loading the spaCy pipeline for this purpose.

def lemmatize_text(sentence:str)->str:
    sentence = nlp_spacy(sentence)
    sentence = ' '.join([word.lemma_ if word.lemma_ != "-PRON-"
                         else word.text for word in sentence])
    return sentence
5) Cleaning the Text
This function does the cleaning and preprocessing of the text data given to it
as input. It calls the functions above to clean each sentence: converting to
lowercase, removing non-alphabetic characters, removing extraneous
characters, removing stopwords and lemmatizing words.

def clean_text(sentence:str)->str:
    sentence = sentence.lower()
    sentence = re.sub("[^a-zA-Z]", " ", sentence)
    sentence = remove_extraneous_text(sentence)
    sentence = remove_stopwords(sentence)
    sentence = lemmatize_text(sentence)
    return sentence
6) Total terms
This function takes a list of cleaned sentences as input and returns the total
number of terms or tokens across those sentences.

def get_total_terms(cleaned_sentences:list)->int:
    total_terms = 0
    for sentence in cleaned_sentences:
        total_terms += len(sentence.split())
    return total_terms
7) Term Frequencies
This function takes a list as an input and returns a dictionary containing
tokens as keys and their frequencies as values.
def get_term_frequencies(cleaned_sentences:list)->dict:
    freq_dict = {}
    for sentence in cleaned_sentences:
        for word in sentence.split():
            freq_dict[word] = freq_dict.get(word, 0) + 1
    return freq_dict
8) Term weights
This function takes a list of sentences as input and returns a dictionary
containing tokens as keys and their weightage as values. The weight of each
token is calculated using the formula:
TW(ti) = (TF(ti) * 1000) / Nt
where "ti" is each token, "TW" is the term weight, "TF" is the term
frequency and "Nt" is the total number of terms.

def get_term_weights(cleaned_sentences:list)->dict:
    total_terms = get_total_terms(cleaned_sentences)
    term_freq_dict = get_term_frequencies(cleaned_sentences)
    term_weights = dict()
    for key, value in term_freq_dict.items():
        term_weights[key] = (value * 1000) / total_terms
    return term_weights
9) Inverse sentence Frequency
This function takes in a list of sentences and returns a dictionary containing
tokens as keys and their inverse sentence frequency as values. The inverse
sentence frequency is calculated as:
ISF(ti) = log(Ns / Nti)
where "ti" is each token, "ISF" is the inverse sentence frequency, "Ns" is the
total number of sentences in the paragraph and "Nti" is the number of
sentences in which "ti" appears.

def inverse_sentence_frequency(cleaned_sentences:list)->dict:
    vocabulary = set()
    for sentence in cleaned_sentences:
        vocabulary = vocabulary.union(set(sentence.split()))
    isf = dict()
    number_of_sentences = len(cleaned_sentences)
    for word in vocabulary:
        number_of_appearances = 0
        for sentence in cleaned_sentences:
            if word in sentence.split():  # whole-token match, not substring
                number_of_appearances += 1
        isf[word] = np.log(number_of_sentences / number_of_appearances)
    return isf
10) Word weights
Takes in a list of sentences and returns a dictionary containing Tokens as keys and
their resultant weightage as values. The weightage is calculated as:
RW(ti) = ISF(ti) * TW(ti)
where “ti” is each token, “RW” is resultant weightage, “ISF” is inverse sentence
frequency and “TW” is term weightage.
def word_weights(cleaned_sentences:list)->dict:
    term_weights = get_term_weights(cleaned_sentences)
    inverse_sentence_freq = inverse_sentence_frequency(cleaned_sentences)
    resultant_weights = dict()
    for word in term_weights.keys():
        resultant_weights[word] = term_weights[word] * inverse_sentence_freq[word]
    return resultant_weights
11) Parts of speech tagging
Takes in a list of sentences and returns a list of lists, where each Token is
represented as a tuple of the form (Token, POS tag).
def pos_tagging(cleaned_sentences:list)->list:
    tagged_sentences = []
    for sentence in cleaned_sentences:
        sentence_nlp = nlp_spacy(sentence)
        tagged_sentence = []
        for word in sentence_nlp:
            tagged_sentence.append((word, word.pos_))
        tagged_sentences.append(tagged_sentence)
    return tagged_sentences
12) Sentence Weights
Takes in a list of POS tagged sentences and total number of terms. Returns a list
containing the sentence weight of each sentence. The sentence weight is calculated
as:
SW(si) = Number of nouns and verbs in sentence / total number of terms in
paragraph.
def sentence_weights(tagged_sentences:list, total_terms:int)->list:
    sent_weights = []
    for sentence in tagged_sentences:
        relevance_count = 0
        for word, tag in sentence:
            if tag == 'NOUN' or tag == 'VERB':
                relevance_count += 1
        sent_weights.append(relevance_count / total_terms)
    return sent_weights
13) Sentence Position
Takes in a list of sentences and returns a weight for each sentence based on its
position. For this we have defined some weight checkpoints in the list named
weights.

def sentence_position(cleaned_sentences:list)->list:
    sent_position = []
    number_of_sentences = len(cleaned_sentences)
    weights = [0, 0.25, 0.23, 0.14, 0.08, 0.05, 0.04, 0.06, 0.04, 0.04, 0.15]
    for i in range(1, len(cleaned_sentences) + 1):
        sent_position.append(weights[int(np.ceil(10 * (i / number_of_sentences)))])
    return sent_position
14) Sentence Length
Takes in a list of sentences and returns a list containing length of each sentence.
def sentence_length(cleaned_sentences:list)->list:
    sent_len = []
    for sentence in cleaned_sentences:
        sent_len.append(len(sentence.split()))
    return sent_len
15) Text ranking
Takes a list of sentences and the GloVe word embeddings as input and returns a
dictionary containing sentence indices as keys and ranks as values. The ranking
is done using the PageRank algorithm.
def text_rank(sentences:list, word_embeddings:dict)->dict:
    # Clean sentences for the PageRank algorithm.
    clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ", regex=True)
    clean_sentences = [s.lower() for s in clean_sentences]
    clean_sentences = [remove_stopwords(r) for r in clean_sentences]
    # Replace each word with its GloVe embedding. The sentence vector is the
    # average of the embeddings of all words in that sentence.
    sentence_vectors = []
    for i in clean_sentences:
        if len(i) != 0:
            v = sum([word_embeddings.get(w, np.zeros((100,)))
                     for w in i.split()]) / (len(i.split()) + 0.001)
        else:
            v = np.zeros((100,))
        sentence_vectors.append(v)
    # Initialize a similarity matrix over pairs of sentences.
    sim_mat = np.zeros([len(sentences), len(sentences)])
    # Calculate cosine similarity for each pair of sentences.
    for i in range(len(sentences)):
        for j in range(len(sentences)):
            if i != j:
                sim_mat[i][j] = cosine_similarity(
                    sentence_vectors[i].reshape(1, 100),
                    sentence_vectors[j].reshape(1, 100))[0, 0]
    # Create a PageRank graph from the similarity matrix.
    nx_graph = nx.from_numpy_array(sim_mat)
    scores = nx.pagerank(nx_graph)
    return scores
16) Ranking Features
Takes a list of sentences as input and returns a dictionary containing ranking of
each sentence. The ranking is calculated using word and sentence level features.
def feature_rank(sentences:list)->dict:
    cleaned_sentences = [clean_text(sentence) for sentence in sentences]
    term_weights = word_weights(cleaned_sentences)
    tagged_sentences = pos_tagging(cleaned_sentences)
    total_terms = get_total_terms(cleaned_sentences)
    sent_weights = sentence_weights(tagged_sentences, total_terms)
    sent_position = sentence_position(cleaned_sentences)
    sent_len = sentence_length(cleaned_sentences)
    sentence_scores = []
    for index, sentence in enumerate(cleaned_sentences):
        score = 0
        for word in sentence.split():
            score += term_weights[word]
        score *= sent_weights[index]
        score += sent_position[index]
        if sent_len[index] != 0:
            score /= sent_len[index]
        else:
            score = 0
        sentence_scores.append(score)
    # Normalize the scores so they sum to 1.
    sentence_scores = np.array(sentence_scores) / np.sum(sentence_scores)
    final_scores = dict()
    for i in range(len(sentence_scores)):
        final_scores[i] = sentence_scores[i]
    return final_scores
17) After this, the extractive summary is generated by calling all these
functions with their respective parameters and selecting the top-ranked
sentences, as sketched below.
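A minimal sketch of such a driver, assuming the PageRank scores and feature scores are combined with equal weight and the summary keeps roughly the top 20% of sentences (both assumptions for illustration, based on the behaviour described in Chapter 7):

from nltk.tokenize import sent_tokenize

def generate_summary(article:str, word_embeddings:dict, ratio:float=0.2)->str:
    sentences = sent_tokenize(article)
    rank_scores = text_rank(sentences, word_embeddings)
    feat_scores = feature_rank(sentences)
    # Combine both scores with equal weight (an assumption for illustration).
    combined = {i: rank_scores[i] + feat_scores[i] for i in range(len(sentences))}
    top_n = max(1, int(len(sentences) * ratio))
    # Pick the highest-scoring sentences, then restore their original order.
    top_idx = sorted(sorted(combined, key=combined.get, reverse=True)[:top_n])
    return " ".join(sentences[i] for i in top_idx)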
18) NER
Open the training data used to train the spaCy NER model.
train_data = pickle.load(open('train_data.pkl', 'rb'))
19) Create a new blank English pipeline in spaCy to train using the training
dataset.
nlp = spacy.blank('en')
20) Function to train the model.
def train_model(train_data):
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')  # create the NER pipe if not present
        nlp.add_pipe(ner, last=True)  # add the NER pipe at the end of the pipeline
    else:
        ner = nlp.get_pipe('ner')     # reuse the existing NER pipe
    for _, annotations in train_data:        # one resume at a time from our 200 samples
        for ent in annotations['entities']:  # each entity tuple, e.g. (0, 15, 'Name')
            ner.add_label(ent[2])            # add only the label, e.g. 'Name'
    # The loop above registers all custom NER labels from the training data.
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):    # disable other pipes; train only NER
        optimizer = nlp.begin_training()
        for itn in range(10):                # train the model for 10 iterations
            print('Starting iteration ' + str(itn))
            random.shuffle(train_data)       # shuffle the data in every iteration
            losses = {}
            for text, annotations in train_data:
                try:
                    nlp.update(
                        [text],
                        [annotations],
                        drop=0.2,
                        sgd=optimizer,
                        losses=losses
                    )
                    print(losses)
                except Exception:
                    print('Pass')

train_model(train_data)
21) Saving the model and using it.
# Saving our trained model for re-use.
nlp.to_disk('nlp_model')
nlp_model = spacy.load('nlp_model')
# Checking all the custom NER labels created
print(nlp_model.get_pipe('ner').labels)
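A brief sketch of applying the saved model to a document (the input file name is a hypothetical placeholder):

resume_text = open('sample_resume.txt').read()  # hypothetical input file
doc = nlp_model(resume_text)
for ent in doc.ents:
    print(ent.label_, ':', ent.text)  # the custom labels learned from the resume dataset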
22) Sample form of training data for Named Entity Recognition.
TRAIN_DATA = [
("Walmart is a leading e-commerce company", {"entities": [(0, 7, "ORG")]}),
("I reached Chennai yesterday.", {"entities": [(19, 28, "GPE")]}),
("I recently ordered a book from Amazon", {"entities": [(24,32, "ORG")]}),
("I was driving a BMW", {"entities": [(16,19, "PRODUCT")]}),
("I ordered this from ShopClues", {"entities": [(20,29, "ORG")]}),
("Fridge can be ordered in Amazon ", {"entities": [(0,6, "PRODUCT")]}),
("I bought a new Washer", {"entities": [(16,22, "PRODUCT")]}),
("I bought a old table", {"entities": [(16,21, "PRODUCT")]}),
("I bought a fancy dress", {"entities": [(18,23, "PRODUCT")]}),
("I rented a camera", {"entities": [(12,18, "PRODUCT")]}),
("I rented a tent for our trip", {"entities": [(12,16, "PRODUCT")]}),
("I rented a screwdriver from our neighbour", {"entities": [(12,22,
"PRODUCT")]}),
("I repaired my computer", {"entities": [(15,23, "PRODUCT")]}),
("I got my clock fixed", {"entities": [(16,21, "PRODUCT")]}),
("I got my truck fixed", {"entities": [(16,21, "PRODUCT")]}),
("Flipkart started it's journey from zero", {"entities": [(0,8, "ORG")]}),
("I recently ordered from Max", {"entities": [(24,27, "ORG")]}),
("Flipkart is recognized as leader in market",{"entities": [(0,8, "ORG")]}),
("I recently ordered from Swiggy", {"entities": [(24,29, "ORG")]})
]
Chapter 7
Performance Evaluation
The performance evaluation of the summary generation is done on the basis of
the length of the generated summary compared to the original text. For most
plain text documents, the program has successfully generated summaries while
preserving the meaningful important words as well as the meaning of each word
in context.
The output of extractive summary generation is a summary of about 20% to
30% of the total length of the original article or document. As a use case, we
took a text document containing an article about the Ganesh Chaturthi festival,
with a total length of 3304 characters. After processing it through the program,
the resulting summary contains 666 characters, which is approximately 20% of
the original document (666 / 3304 ≈ 0.20).
The training of the NER model pipeline is evaluated on the basis of the pipeline
losses for each data entity after each training iteration. For a small dataset the
losses will be small. Our dataset contains 200 data points, trained in
randomized order. The initial losses in the first iteration were above 10000,
and with subsequent iterations the losses came down.
{'ner': 205.58794659748673}
{'ner': 216.2805469001487}
{'ner': 421.50882047068626}
{'ner': 431.6458966203809}
Pass
{'ner': 523.0486146875501}
{'ner': 533.3927916462893}
Pass
Pass
{'ner': 618.948973077106}
Pass
{'ner': 779.149585860538}
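A minimal sketch of how such loss values can be plotted (the values are copied from the log above):

import matplotlib.pyplot as plt

losses = [205.59, 216.28, 421.51, 431.65, 523.05, 533.39, 618.95, 779.15]
plt.scatter(range(len(losses)), losses)
plt.xlabel('Update step')
plt.ylabel('NER loss')
plt.title('Scatter plot for NER training losses')
plt.show()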
(Figure 7.2: Scatter plot for NER training losses)
(Figure 7.3: NER losses with iterations)
Chapter 8
Results and Conclusion
This project is now able to generate extractive summaries of a given text
document, and the size of the generated summary is about 20% of the original
document. It also extracts valuable additional information from special
documents, such as entities for names, locations, organizations, etc.
NER systems have been created that use linguistic grammar-based techniques
as well as statistical models based on machine learning. Hand-crafted
grammar-based systems typically obtain better precision in reading and
understanding the contexts of documents.
8.1 Project Screenshots
(Screenshot: input text)
(Screenshot: output summary)
(Screenshot: snippet of training losses of the NER model)
(Screenshot: sample resume input to the NER model)
(Screenshot: configuration of the NER model)
(Screenshot: metadata of the NER model)
(Screenshot: output using the NER model)
(Screenshot: GloVe data snippet)
(Screenshot: GUI)
Chapter 9
References
1. A Survey on Automatic Text Summarization,
http://pressacademia.org/archives/pap/v5/29.pdf
2. GloVe, https://www.aclweb.org/anthology/D14-1162.pdf
3. NER, https://www.aclweb.org/anthology/D11-1141.pdf
4. spaCy Language Processing Pipelines, https://spacy.io/usage/processing-pipelines
5. GloVe: Global Vectors for Word Representation, Jeffrey Pennington, Richard
Socher, Christopher D. Manning, Computer Science Department, Stanford
University, Stanford, CA 94305
6. Learning similarity with cosine similarity ensemble, Information Sciences,
volume 307,
https://www.sciencedirect.com/science/article/abs/pii/S0020025515001243
7. PageRank Algorithm, https://ieeexplore.ieee.org/abstract/document/1344743
8. Exploring network structure, dynamics, and function using NetworkX,
https://www.osti.gov/biblio/960616
9. NetworkX documentation, https://networkx.org/documentation/stable/index.html

More Related Content

PDF
IRJET - Text Summarizer.
PDF
A template based algorithm for automatic summarization and dialogue managemen...
PDF
Zomato eda report
PDF
An automatic text summarization using lexical cohesion and correlation of sen...
PDF
Intelligent Hiring with Resume Parser and Ranking using Natural Language Proc...
PDF
Architecture of an ontology based domain-specific natural language question a...
PDF
Resume Parsing with Named Entity Clustering Algorithm
PDF
I2 madankarky1 jharibabu
IRJET - Text Summarizer.
A template based algorithm for automatic summarization and dialogue managemen...
Zomato eda report
An automatic text summarization using lexical cohesion and correlation of sen...
Intelligent Hiring with Resume Parser and Ranking using Natural Language Proc...
Architecture of an ontology based domain-specific natural language question a...
Resume Parsing with Named Entity Clustering Algorithm
I2 madankarky1 jharibabu

What's hot (20)

PPTX
Resume parser
PDF
Algorithm Procedure and Pseudo Code Mining
PDF
Compression tech
DOCX
295B_Report_Sentiment_analysis
PDF
Design of optimal search engine using text summarization through artificial i...
PDF
IRJET- Design and Implementation of Sentiment Analyzer for Top Engineering Co...
PDF
Summarization of Software Artifacts : A Review
PDF
IRJET- QUEZARD : Question Wizard using Machine Learning and Artificial Intell...
PDF
IRJET- Resume Information Extraction Framework
PDF
TALASH: A SEMANTIC AND CONTEXT BASED OPTIMIZED HINDI SEARCH ENGINE
PDF
Context Sensitive Search String Composition Algorithm using User Intention to...
PDF
Cs344 project
PDF
Novel Scoring System for Identify Accurate Answers for Factoid Questions
PDF
Single document keywords extraction in Bahasa Indonesia using phrase chunking
PDF
76 s201906
PDF
Generation of Question and Answer from Unstructured Document using Gaussian M...
PDF
IRJET- Automated Document Summarization and Classification using Deep Lear...
PDF
QUESTION ANSWERING SYSTEM USING ONTOLOGY IN MARATHI LANGUAGE
PDF
TECHNIQUES FOR COMPONENT REUSABLE APPROACH
PDF
Performance analysis on secured data method in natural language steganography
Resume parser
Algorithm Procedure and Pseudo Code Mining
Compression tech
295B_Report_Sentiment_analysis
Design of optimal search engine using text summarization through artificial i...
IRJET- Design and Implementation of Sentiment Analyzer for Top Engineering Co...
Summarization of Software Artifacts : A Review
IRJET- QUEZARD : Question Wizard using Machine Learning and Artificial Intell...
IRJET- Resume Information Extraction Framework
TALASH: A SEMANTIC AND CONTEXT BASED OPTIMIZED HINDI SEARCH ENGINE
Context Sensitive Search String Composition Algorithm using User Intention to...
Cs344 project
Novel Scoring System for Identify Accurate Answers for Factoid Questions
Single document keywords extraction in Bahasa Indonesia using phrase chunking
76 s201906
Generation of Question and Answer from Unstructured Document using Gaussian M...
IRJET- Automated Document Summarization and Classification using Deep Lear...
QUESTION ANSWERING SYSTEM USING ONTOLOGY IN MARATHI LANGUAGE
TECHNIQUES FOR COMPONENT REUSABLE APPROACH
Performance analysis on secured data method in natural language steganography
Ad

Similar to Group 04 te_a_mini project_ report (20)

PDF
IRJET- Transcription of Conferences
PDF
Resume Parser with Natural Language Processing
PDF
A Comparative Study of Automatic Text Summarization Methodologies
PDF
Text Summarization and Conversion of Speech to Text
PDF
Development of Information Extraction for Data Analysis using NLP
DOC
Minor project report format for 2018 2019 final
DOCX
Project for Student Result System
PPTX
SOFTWARE ENGINEERING PROJECT FOR AI AND APPLICATION
PDF
Automatic Text Summarization Using Natural Language Processing (1)
PDF
IRJET- Voice based Billing System
PDF
Modern Resume Analyser For Students And Organisations
PDF
Build up and tune PC website(prototype)
PDF
IRJET- Towards Efficient Framework for Semantic Query Search Engine in Large-...
PDF
Knowledge management system SOP using semantic networks connected with person...
PDF
TEXT SENTIMENTS FOR FORUMS HOTSPOT DETECTION
PDF
TEXT SENTIMENTS FOR FORUMS HOTSPOT DETECTION
PDF
IRJET - BOT Virtual Guide
PDF
IRJET- Course outcome Attainment Estimation System
PDF
IRJET- PDF Extraction using Data Mining Techniques
IRJET- Transcription of Conferences
Resume Parser with Natural Language Processing
A Comparative Study of Automatic Text Summarization Methodologies
Text Summarization and Conversion of Speech to Text
Development of Information Extraction for Data Analysis using NLP
Minor project report format for 2018 2019 final
Project for Student Result System
SOFTWARE ENGINEERING PROJECT FOR AI AND APPLICATION
Automatic Text Summarization Using Natural Language Processing (1)
IRJET- Voice based Billing System
Modern Resume Analyser For Students And Organisations
Build up and tune PC website(prototype)
IRJET- Towards Efficient Framework for Semantic Query Search Engine in Large-...
Knowledge management system SOP using semantic networks connected with person...
TEXT SENTIMENTS FOR FORUMS HOTSPOT DETECTION
TEXT SENTIMENTS FOR FORUMS HOTSPOT DETECTION
IRJET - BOT Virtual Guide
IRJET- Course outcome Attainment Estimation System
IRJET- PDF Extraction using Data Mining Techniques
Ad

Recently uploaded (20)

PPT
DU, AIS, Big Data and Data Analytics.ppt
PPTX
SET 1 Compulsory MNH machine learning intro
PPTX
retention in jsjsksksksnbsndjddjdnFPD.pptx
PPT
Predictive modeling basics in data cleaning process
PPTX
Pilar Kemerdekaan dan Identi Bangsa.pptx
PDF
Systems Analysis and Design, 12th Edition by Scott Tilley Test Bank.pdf
PDF
Data Engineering Interview Questions & Answers Cloud Data Stacks (AWS, Azure,...
PPTX
QUANTUM_COMPUTING_AND_ITS_POTENTIAL_APPLICATIONS[2].pptx
PPTX
New ISO 27001_2022 standard and the changes
PPTX
Phase1_final PPTuwhefoegfohwfoiehfoegg.pptx
PPTX
DS-40-Pre-Engagement and Kickoff deck - v8.0.pptx
PPT
Image processing and pattern recognition 2.ppt
PPTX
sac 451hinhgsgshssjsjsjheegdggeegegdggddgeg.pptx
PDF
Data Engineering Interview Questions & Answers Data Modeling (3NF, Star, Vaul...
PPTX
Leprosy and NLEP programme community medicine
PDF
Introduction to Data Science and Data Analysis
PPTX
chrmotography.pptx food anaylysis techni
PPT
statistic analysis for study - data collection
PPTX
SAP 2 completion done . PRESENTATION.pptx
PPTX
IMPACT OF LANDSLIDE.....................
DU, AIS, Big Data and Data Analytics.ppt
SET 1 Compulsory MNH machine learning intro
retention in jsjsksksksnbsndjddjdnFPD.pptx
Predictive modeling basics in data cleaning process
Pilar Kemerdekaan dan Identi Bangsa.pptx
Systems Analysis and Design, 12th Edition by Scott Tilley Test Bank.pdf
Data Engineering Interview Questions & Answers Cloud Data Stacks (AWS, Azure,...
QUANTUM_COMPUTING_AND_ITS_POTENTIAL_APPLICATIONS[2].pptx
New ISO 27001_2022 standard and the changes
Phase1_final PPTuwhefoegfohwfoiehfoegg.pptx
DS-40-Pre-Engagement and Kickoff deck - v8.0.pptx
Image processing and pattern recognition 2.ppt
sac 451hinhgsgshssjsjsjheegdggeegegdggddgeg.pptx
Data Engineering Interview Questions & Answers Data Modeling (3NF, Star, Vaul...
Leprosy and NLEP programme community medicine
Introduction to Data Science and Data Analysis
chrmotography.pptx food anaylysis techni
statistic analysis for study - data collection
SAP 2 completion done . PRESENTATION.pptx
IMPACT OF LANDSLIDE.....................

Group 04 te_a_mini project_ report

generating a concise and meaningful summary of text from multiple text resources such as books, news articles, blog posts, research papers and emails. Machine learning algorithms can be trained to comprehend documents and identify the sections that convey important facts and information before producing the required summarized text. This project uses machine learning and NLP to generate a condensed form of the input text and also to report the entities that appear in the provided input. The number of sentences picked for the summary depends on the compression ratio chosen for it. Summarization systems often need additional evidence that they can utilize to identify the most important topics of a document. For example, a scientific paper or a resume contains a considerable amount of information that needs to be summarized and noted. A machine learning technique known as Named Entity Recognition is used to fulfill this purpose and provide the crucial data and information present in the document. A trained Named Entity Recognition model helps to recognize the entities in a document and can be very useful for extracting valuable information from information-centric documents such as applications, resumes and research papers. In this report, machine learning is used to generate the summaries and also to collect the named entities in a document and save them using a file system.
Table of Contents

Chapter 1 Introduction
    1.1 Aim and Objectives of Project
    1.2 Scope
    1.3 Organization of the Report
Chapter 2 Literature Survey
    2.1 Existing System in Market Use
Chapter 3 Software Analysis
    3.1 Proposed System
Chapter 4 Design and Implementation
    4.1 System Architecture
    4.2 Summarization Flow Diagram
    4.3 spaCy NER Architecture
    4.4 NER Flow Diagram
Chapter 5 Methodology
    5.1 Approach
    5.2 Preprocessing
    5.3 Vectorizing
    5.4 Cosine Similarity
    5.5 Ranking
    5.6 Named Entity Recognition
Chapter 6 Implementation Details
    6.1 Working of the Code
Chapter 7 Performance Evaluation
Chapter 8 Results and Conclusion
    8.1 Project Screenshots
Chapter 9 References
List of Figures

4.1 System Architecture Diagram
4.2 Summarization Flow Diagram
4.3 spaCy NER Architecture
4.4 NER Flow Diagram
5.4 Cosine Similarity Diagram
6.1 Sample NER Training Data
7.1 NER Sample Losses
7.2 Scatter Plot for NER
7.3 NER Losses with Iterations
Chapter 1
Introduction

Automatic text processing is a research field that is currently extremely active. An important task in this field is automatic summarization, which consists of reducing the size of a text document while preserving its crucial information content. A summarizer is a system that produces a condensed representation of the input given by the user for their consumption. This application has been researched extensively, yet it is still at a nascent stage compared to manual summarization. It is very useful to automatically condense the unstructured text of an article into a summary.

Abstractive summarization selects words based on semantic understanding. Such systems interpret and examine the text using advanced natural language techniques, in a way similar to how a human reads a text article and then summarizes it in their own words. Extractive methods attempt to summarize articles by selecting a subset of sentences that retain the most important points; purely extractive summaries sometimes give better results.

In addition to a wholesome summary of a text document, special documents such as resumes and scientific or research papers need additional evidence that can be utilized for a much better understanding of the condensed form of the document. For this purpose, a Named Entity Recognition model is trained to recognize words of special importance, such as organizations, locations, names, currencies, dates and times, and to provide the user with this additional information so that all important aspects of the document summary are covered.

1.1 Aim and Objectives of Project
In order to automate the manual process of producing a meaningful condensed form of a text document, we aim to generate summaries of a given text document and extract the important entities mentioned in it, using machine learning and natural language processing: the GloVe algorithm and a trained Named Entity Recognition model.

The objective of this project is to use the machine learning algorithm known as Global Vectors for Word Representation (GloVe), which takes the statistics of word occurrence as its primary source of data, and to train a Named Entity Recognition (NER) system that uses linguistic grammar-based techniques, in order to generate summaries of the provided input.

1.2 Scope

The scope of this project is the computer-based automated production of condensed versions of documents. This automation is necessary for our information-driven society. Its applications lie in areas such as speech summaries, commentary, sports, news, scientific and business reports, human resources and resume filtering. In addition, information about the entities present in the original document, such as names, locations, organizations, times, dates, currencies and numbers, can be used to evaluate complex documents.
1.3 Organization of the Report

Chapter 1 gives a brief overview of the aim behind developing this project; the problem definition describes the expected outcome of the project for the application. Chapter 2 includes the literature survey on existing systems. Chapter 3 describes the software used for the design, along with our proposed system. Chapter 4 presents the flow diagrams that give an abstract view of the system, with a detailed explanation of the technologies and techniques used in the development of the project. Chapter 5 describes the project modules designed and used in our proposed system. Chapter 6 shows the overall working of the system. Chapter 7 evaluates the performance of the project. Chapter 8 describes the tasks achieved through this project and its applications.
It also concludes the report, giving a summary of the entire project and the future scope for research and development. Chapter 9 lists the references.
Chapter 2
Literature Survey

1) Automatic Text Summarization Survey [1]
Year published: 2017
The text summarization is based on a query-generic approach and on neural networks. A neural network is trained on a corpus of articles and then modified through feature fusion to produce a summary of the highest-ranked sentences of the article. Through feature fusion, the network discovers the importance of the various features used to determine the summary-worthiness of each sentence. Part-of-speech tagging is the process of grouping words according to their speech category, such as nouns, verbs, adverbs and adjectives. In stop-word filtering, stopwords are words which are filtered out before or after processing of the corpus; the choice is entirely subjective and depends on the needs. For example, "a", "an", "in" and "by" can be considered stop words in a plain text document. In query-based text summarization, the scoring of the sentences of a given document is based on the frequency counts of words; the sentences with the highest scores are then extracted for the output summary together with their structural context.

2) GloVe: Global Vectors for Word Representation [2]
Year published: 2014
The statistics of word occurrences in a corpus is the primary source of information available to all unsupervised methods for learning word representations. Semantic vector space models of language represent each word with a real-valued vector. These vectors can be used as features in a variety of applications such as document classification, question answering, named entity recognition and determining the similarity between multiple documents. Most word vector methods rely on the distance or angle between pairs of word vectors as the primary method of evaluating the intrinsic quality of such a set of words.

3) Named Entity Recognition [3]
Year published: 2011
NER systems have been created that use linguistic grammar-based techniques as well as statistical models based on machine learning. Predefined NER models are available, but an NER model can also be trained to perform better or to fit specific needs, and multiple NER models can be created for specific purposes. These models should be robust across multiple domains, as they are expected to be applied to diverse data. NER is a fundamental task at the core of Natural Language Processing systems. It has basically two tasks: first, the identification of proper names in the text, and second, the classification of these names into a set of trained categories of interest such as persons, names and organizations.

2.1 Existing System in Market Use

Abstractive summarizers are so called because they do not select sentences from the originally given text passage to create the summary. Instead, they produce a paraphrasing of the main contents of the given text, using a vocabulary different from that of the original document. This is very similar to a manual summarization process: we create a semantic representation of the document, then pick words from a general vocabulary and create a short summary that represents all the points of the original document. The basic structure for implementing this type of system is the sequence-to-sequence RNN: models designed to create an output sequence of words from an input sequence of words.

In the encoder-decoder network, the encoder is responsible for taking in the input and generating a final state vector. The encoder may contain LSTM, RNN or GRU layers; LSTM layers are used most often because they mitigate the exploding and vanishing gradient problems. Such systems also use the attention mechanism, which aims to focus on specific parts of the input sequence, rather than the entire input sequence, to predict each word. This approach is very similar to how a human would solve the problem.

[Figure: Encoder-Decoder network - Input -> Encoder -> Decoder -> Output]
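For orientation only, the following is a minimal sketch of such an encoder-decoder model in Keras. The layer sizes and vocabulary sizes are illustrative assumptions, not the configuration of any system surveyed here.

import tensorflow as tf
from tensorflow.keras import layers, Model

vocab_in, vocab_out, units = 10000, 10000, 256  # illustrative sizes

# Encoder: consume the input sequence and keep only the final LSTM state.
enc_inputs = layers.Input(shape=(None,))
enc_emb = layers.Embedding(vocab_in, units)(enc_inputs)
_, state_h, state_c = layers.LSTM(units, return_state=True)(enc_emb)

# Decoder: generate the summary tokens, initialized with the encoder state.
dec_inputs = layers.Input(shape=(None,))
dec_emb = layers.Embedding(vocab_out, units)(dec_inputs)
dec_out, _, _ = layers.LSTM(units, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])
outputs = layers.Dense(vocab_out, activation='softmax')(dec_out)

model = Model([enc_inputs, dec_inputs], outputs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.summary()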
Chapter 3
Software Analysis

The software we are using, with its purposes, is listed as follows.

1) Natural Language Toolkit
The Natural Language Toolkit is a suite of libraries and programs for symbolic and statistical natural language processing of English, written in the Python programming language. Summary generation and named entity recognition are subtasks of information extraction, and we use NLTK in this project for multiple purposes. From NLTK we use the Punkt sentence tokenizer, which divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviations, collocations and words that start sentences. The NLTK data package includes a pre-trained Punkt tokenizer for English. We also use NLTK's stopwords: English words which do not add much meaning to a sentence and can safely be ignored without sacrificing its meaning, for example "the", "he" and "have". We use NLTK to download the stopwords, which are available for various other languages as well.

2) spaCy
spaCy is an open-source software library for advanced natural language processing, written in Python and Cython. It focuses on providing software for production usage. It supports deep learning workflows by connecting statistical model training with machine learning libraries. spaCy provides good features for named entity recognition, text categorization, part-of-speech tagging, etc. It ships small pipelines such as en_core_web_sm, a small English pipeline trained on written web text such as blogs, news and comments, which includes vocabulary, vectors, syntax and entities.

3) Scikit-learn
Scikit-learn is a free machine learning library for Python. It provides various features such as classification, regression and clustering algorithms. In our project we use a scikit-learn metric called cosine similarity to compute the similarity between texts and sentences. Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space, defined as the cosine of the angle between them, which is the same as the inner product of the vectors after both are normalized to length 1.
4) NetworkX
NetworkX is a Python library for studying graphs and networks. It is used for the creation, manipulation and study of the structure, dynamics and functions of complex networks.[9] We use it to run PageRank over the nodes of the graph created from the similarity matrix. The PageRank algorithm outputs a probability distribution that represents the likelihood for a sentence to be worthy of inclusion in a summary.

3.1 Proposed System

This project uses extractive summary generation techniques. Extractive summarization is the extraction of the most important sentences from the whole document, sentences which coherently represent the document. There can be multiple approaches depending on the summarization application: for summarizing reviews we may want to pick the highly positive and negative sentences as the summary, or we may want sentences that contain particular objects and entities such as names, locations, persons, dates and times. If we want to summarize a report or an article, we want to pick the main, important sentences from the article. To address the problem of identifying the main and important sentences, we use a ranking algorithm over the nodes of a generated graph; one such algorithm is PageRank, which has an unsupervised approach. To get additional information such as the named entities, we use a custom Named Entity Recognition model built with spaCy. NER is a subtask of information extraction that seeks out and categorizes specified entities in a body of text.
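Before moving on to the design, here is a small illustration of the NLTK pieces described above (Punkt sentence splitting and stopword removal), assuming the corpora have been downloaded; the sample text is illustrative only.

import nltk
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')

text = "Text summarization condenses documents. It keeps the important sentences."
sentences = sent_tokenize(text)  # Punkt splits the text into sentences
stop_words = set(stopwords.words('english'))
filtered = [" ".join(w for w in s.split() if w.lower() not in stop_words)
            for s in sentences]
print(sentences)
print(filtered)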
Chapter 4
Design and Implementation

4.1 System Architecture

[Figure 4.1: System Architecture Diagram - Document / File System -> Identify Roles -> Text Processing -> Intermediate State -> Sentence Scoring -> Sentence Extraction -> Summarization Result]
4.2 Summarization Flow Diagram

[Figure 4.2: Summarization Flow Diagram - Text File -> Sentence Segmentation -> Sentence Tokenization -> Stop Word Removal -> Vectorizing -> Similarity Matrix -> NetworkX Graph -> Local Word Score -> Total Sentence Score -> Sentence Extraction -> Summary Document]
4.3 spaCy NER Architecture

[Figure 4.3: spaCy NER Architecture - Text -> Tokenizer -> Tagger -> Parser -> NER -> Document]
4.4 NER Flow Diagram

[Figure 4.4: NER Flow Diagram - Training Data (Text + Labels) -> Model -> Predict -> Gradient -> Save Updated Model]
Chapter 5
Methodology

5.1 Approach

The approach we are using is called an extractive approach. In this approach, a subset of sentences that represents the most important points is pulled from the text and combined to make a summary. The data we use consists of large samples of text from valid sources. The extracted features can be of two kinds: statistical features, which are based on the frequency of elements in the text, and linguistic features, which are extracted from a simplified argumentative structure of the text.

First, we load the Punkt tokenizer and the stopword list from NLTK for their specific purposes. We load spaCy for its part-of-speech tagging, provided by the spaCy pipeline en_core_web_sm. In corpus linguistics, part-of-speech tagging, also called grammatical tagging, is the process of marking up a word in a text as corresponding to a particular part of speech, based on both its definition and its context. In simplified form, it is the identification of words as nouns, verbs, adjectives, adverbs, etc. Unsupervised tagging techniques use an untagged corpus as their training data and produce the tagset by induction; that is, they observe the patterns in how words are used and derive part-of-speech categories themselves. For example, the statistics readily reveal that "the", "a" and "an" occur in similar contexts, while "eat" occurs in a very different context. With sufficient iterations, classes of words emerge that are very similar to those a human linguist would propose. The tagger converts the sentences into the desired form: a list of tuples, where each tuple contains a word and its tag.
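As a quick illustration of this tagging step, the following minimal sketch runs the en_core_web_sm pipeline over one sentence and prints the (word, tag) tuples; it assumes the model has been downloaded, and the sentence is illustrative only.

import spacy

# Load the small English pipeline used for part-of-speech tagging.
nlp_spacy = spacy.load('en_core_web_sm')

doc = nlp_spacy("The quick brown fox jumps over the lazy dog")
# Each token carries a coarse part-of-speech tag in .pos_.
tagged = [(token.text, token.pos_) for token in doc]
print(tagged)
# e.g. [('The', 'DET'), ('quick', 'ADJ'), ('brown', 'ADJ'), ('fox', 'NOUN'), ...]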
5.2 Preprocessing

In addition to part-of-speech tagging, we need to score the individual words as well as the combined individual sentences in the text. For this we use the Punkt sentence tokenizer, an NLTK project. This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviations, collocations and words that start sentences. The NLTK data package includes a trained Punkt tokenizer for English.

Before proceeding further, we need to preprocess the text data and filter out the useless data. In natural language processing, these useless words are termed stop words. A stop word is a very commonly used word such as "the", "a", "an" or "in" that an engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as results. We do not want these words to take up space in the system or consume valuable processing time, so we remove them by keeping a list of words considered to be stop words. The Natural Language Toolkit in Python provides stopword lists in multiple languages. These words are not important for the algorithm and need to be removed. In addition to removing the stop words, we also clean the sentences by:

1) Converting to lowercase
2) Removing non-alphabetic characters
3) Removing extraneous characters
4) Removing stop words
5) Lemmatizing the text

5.3 Vectorizing

GloVe is used for obtaining vector representations of words. It is an unsupervised learning algorithm for obtaining vector representations for words. Word vectors are simple vectors of numbers that represent the meaning of a word. Traditional machine learning approaches such as one-hot encoding and bag-of-words models, which use dummy variables to represent the presence or absence of a word in an observation or a sentence, can be useful for some machine learning tasks, but they do not capture information about a word's meaning or context. This means potential relationships such as contextual closeness are not captured across collections of words. Such encodings often provide an insufficient baseline for complex natural language processing and lack the sophistication needed for more complex tasks such as translation, summary generation or categorization. Word vectors represent words as multidimensional continuous floating-point numbers, where semantically similar words are mapped to proximate points in geometric space.[5] A word vector is a row of real-valued numbers, as opposed to dummy numbers, where each point captures a dimension of the word's meaning and where semantically similar words have similar vectors.
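To make this concrete, the sketch below assumes a word_embeddings dictionary mapping words to 100-dimensional NumPy vectors (a loading sketch for such a dictionary appears in Chapter 6) and compares two word pairs; the specific words are illustrative only.

import numpy as np

def cos(u: np.ndarray, v: np.ndarray) -> float:
    # Cosine similarity: normalized dot product of two vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical lookups into a preloaded {word: vector} dictionary.
king, queen, car = (word_embeddings[w] for w in ("king", "queen", "car"))
print(cos(king, queen))  # expected to be high for related words
print(cos(king, car))    # expected to be lower for unrelated words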
5.4 Cosine Similarity

After generating the word vectors, we generate the similarity matrix using cosine similarity. It computes the similarity between samples X and Y as the normalized dot product of X and Y. It is a metric used to determine how similar two entities are irrespective of their size. Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space.[6] So if X and Y are two vectors, the cosine equation gives the angle between them. If the angle between the vectors is close to 0 degrees, i.e. the cosine of the angle is close to 1, the two vectors are very similar to each other. If the angle between them is close to 90 degrees, the cosine is close to 0 and the vectors are orthogonal, hence unrelated; if the cosine is close to -1, the vectors point in opposite directions, which makes them very different from each other.[6]

[Figure 5.4: Cosine Similarity Diagram - similar vectors A, B, C]

The advantage of using cosine similarity over other similarity metrics such as Euclidean distance is that cosine similarity does not take the size of the samples into account; in other words, it does not consider the magnitude of the vectors, only the angle formed between them, which is very beneficial for comparing similar texts of different lengths.
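A minimal sketch of this computation with scikit-learn, using two small illustrative vectors:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([[1.0, 2.0, 3.0]])    # row vectors, shape (1, n)
b = np.array([[2.0, 4.0, 6.0]])    # same direction as a, twice the magnitude
c = np.array([[-1.0, -2.0, -3.0]]) # opposite direction

print(cosine_similarity(a, b))  # [[1.]]  same direction, magnitude ignored
print(cosine_similarity(a, c))  # [[-1.]] opposite direction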
5.5 Ranking

In this part we rank the features of the text, first the preprocessed words and then the sentences as a whole, so as to combine them to form an extractive summary. For this we use the similarity matrix to construct a graph on which the ranking algorithm is performed. The NetworkX package provides for the creation, manipulation and study of the structure, dynamics and functions of complex networks and graphs. We use the NumPy ndarray of the similarity matrix as an adjacency matrix, which gives the adjacency-matrix representation of a graph. On this graph we then run a ranking algorithm called PageRank. PageRank computes a ranking of the nodes in the graph based on the structure of the incoming links.[7] Originally it was designed as an algorithm to rank web pages. It works by counting the number and quality of links to a node to determine a rough estimate of how important the node is; the underlying assumption is that more important nodes are likely to receive more references in multiple contexts. PageRank outputs a probability distribution representing the likelihood that a random walker arrives at a given node, and it can be calculated for collections of documents of any size.[8]
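A minimal sketch of this ranking step on a toy 3x3 similarity matrix, using the same NetworkX calls as the implementation in Chapter 6; the matrix values are illustrative only.

import numpy as np
import networkx as nx

# Toy pairwise sentence-similarity matrix (diagonal left at zero).
sim_mat = np.array([
    [0.0, 0.8, 0.1],
    [0.8, 0.0, 0.3],
    [0.1, 0.3, 0.0],
])

graph = nx.from_numpy_array(sim_mat)  # weighted undirected graph
scores = nx.pagerank(graph)           # {node_index: rank}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))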
5.6 Named Entity Recognition

This process is used to extract additional, very specific information from the document. This additional information extraction is very useful for documents such as scientific and research papers, resumes and job applications. Through spaCy we get language processing pipelines: when we call the pipeline on a text, spaCy first tokenizes the text to produce a Doc object, which is then processed in several steps; this is also known as the processing pipeline.[4] The pipeline used by the trained pipelines typically includes a tagger, a lemmatizer, a parser and the entity recognizer. Each pipeline component returns the processed Doc, which is then passed to the next component. spaCy does provide a default NER pipeline component, but we train our own NER model and put it into that NLP pipeline. For the training dataset, the model is provided with over 200 resume samples with marked entities, shuffled after each training iteration so that it does not generalize from the order of the data. Using these, we calculate the rankings of the nodes and features. Then we calculate some extra parameters such as sentence position, sentence length, sentence weights, position tags, word weights, inverse sentence frequency, token weights and token frequencies, and using these we generate the extractive summaries.

Chapter 6
Implementation Details

The implementation of this project uses the following resources in addition to the imported Python packages.

1) glove.6B.100d.txt
The GloVe algorithm is used for obtaining the vector representations of words. The complete GloVe 6B set consists of four files for four embedding sizes[5]:
a) glove.6B.50d.txt for 6 billion tokens and 50 features
b) glove.6B.100d.txt for 6 billion tokens and 100 features
c) glove.6B.200d.txt for 6 billion tokens and 200 features
d) glove.6B.300d.txt for 6 billion tokens and 300 features
We use the 100-feature file for our project's current needs. The file contains numerical representations, as continuous numbers, of words in almost every context, including special characters commonly used in English text. An example of the file's content:

the -0.038194 -0.24487 0.72812 -0.39961 0.083172 0.043953
, -0.10767 0.11053 0.59812 -0.54361 0.67396 0.10663
of -0.1529 -0.24279 0.89837 0.16996 0.53516
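Given this format (one word followed by its vector components per line), a minimal loading sketch, assuming the file sits in the working directory, is:

import numpy as np

word_embeddings = {}
# Each line: "<word> v1 v2 ... v100"; build a {word: vector} dictionary.
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.split()
        word = parts[0]
        vector = np.asarray(parts[1:], dtype='float32')
        word_embeddings[word] = vector

print(len(word_embeddings))        # vocabulary size (400,000 for the 6B set)
print(word_embeddings['the'][:6])  # first few components of one vector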
2) Pipeline en_core_web_sm
It is a pipeline model trained for English on written web text such as blogs, news, comments and articles. This machine learning model is used for part-of-speech tagging in our project. It comes in three variants: small, medium and large. The difference is mostly in size and statistical strength; the larger models take much longer to load than the small model.

3) Data for training the NER model
The custom NER model is built and trained on a dataset consisting of documents which contain high amounts of additional information; the current choice of documents is resumes and job applications. The dataset is provided as a .pkl file. This type of file is created by pickle, a Python module that enables objects to be serialized to files on disk and deserialized back into the program at runtime; it contains the byte stream which represents the objects in the file.

6.1 Working of the Code

1) Loading the imports, the spaCy pipeline model for part-of-speech tagging, the tokenizer and the stopwords:

import re
import random
import pickle
import nltk
import spacy
import numpy as np
import pandas as pd
import networkx as nx
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import cosine_similarity

nltk.download('punkt')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
nlp_spacy = spacy.load('en_core_web_sm')

2) Removing the extra spaces from an article. This function takes the article or document as input and removes extra spaces and some special characters from the body.

def remove_extraneous_text(sentence: str) -> str:
    # Collapse multiple spaces into one.
    sentence = re.sub(" +", " ", sentence)
    # Drop anything before a ") --" marker (e.g. a news dateline).
    if ") --" in sentence:
        sentence = sentence.split(") --")[-1]
    return sentence

3) Removing stopwords. This function takes a sentence as input, removes all the stopwords from it, and returns the same sentence without any stopwords.

def remove_stopwords(sentence: str) -> str:
    sentence = " ".join([word for word in sentence.split() if word not in stop_words])
    return sentence

4) Text lemmatization
This function takes a sentence as input and uses spaCy to convert each word into its lemma. A lemma is the base or dictionary form of a word, to which its inflected forms are reduced. We use the nlp_spacy pipeline loaded above for this purpose.

def lemmatize_text(sentence: str) -> str:
    sentence = nlp_spacy(sentence)
    # spaCy 2.x marks pronoun lemmas as "-PRON-"; keep the original text for those.
    sentence = ' '.join([word.lemma_ if word.lemma_ != "-PRON-" else word.text for word in sentence])
    return sentence

5) Cleaning the text
This function does the cleaning and preprocessing of the text data given to it as input. It calls the functions above to clean the sentences by converting to lowercase, removing non-alphabetic characters, removing extraneous characters, removing the stopwords and lemmatizing the words.

def clean_text(sentence: str) -> str:
    sentence = sentence.lower()
    sentence = re.sub("[^a-zA-Z]", " ", sentence)
    sentence = remove_extraneous_text(sentence)
    sentence = remove_stopwords(sentence)
    sentence = lemmatize_text(sentence)
    return sentence

6) Total terms
This function takes a list of cleaned sentences as input and returns the total number of terms (tokens) across them.

def get_total_terms(cleaned_sentences: list) -> int:
    total_terms = 0
    for sentence in cleaned_sentences:
        total_terms += len(sentence.split())
    return total_terms

7) Term frequencies
This function takes a list of sentences as input and returns a dictionary containing tokens as keys and their frequencies as values.

def get_term_frequencies(cleaned_sentences: list) -> dict:
    freq_dict = {}
    for sentence in cleaned_sentences:
        for word in sentence.split():
            freq_dict[word] = freq_dict.get(word, 0) + 1
    return freq_dict

8) Term weights
This function takes a list of sentences as input and returns a dictionary containing tokens as keys and their weights as values. The weight of each token is calculated using the formula:

TW(ti) = (TF(ti) * 1000) / Nt

where ti is each token, TW is the term weight, TF is the term frequency and Nt is the total number of terms.

def get_term_weights(cleaned_sentences: list) -> dict:
    total_terms = get_total_terms(cleaned_sentences)
    term_freq_dict = get_term_frequencies(cleaned_sentences)
    term_weights = dict()
    for key, value in term_freq_dict.items():
        term_weights[key] = (value * 1000) / total_terms
    return term_weights

9) Inverse sentence frequency
This function takes a list of sentences and returns a dictionary containing tokens as keys and their inverse sentence frequency as values. The inverse sentence frequency is calculated as:

ISF(ti) = log(Ns / Nti)

where ti is each token, ISF is the inverse sentence frequency, Ns is the total number of sentences in the paragraph and Nti is the number of sentences in which ti appears.

def inverse_sentence_frequency(cleaned_sentences: list) -> dict:
    vocabulary = set()
    for sentence in cleaned_sentences:
        vocabulary = vocabulary.union(set(sentence.split()))
    isf = dict()
    number_of_sentences = len(cleaned_sentences)
    for word in vocabulary:
        number_of_appearances = 0
        for sentence in cleaned_sentences:
            if word in sentence:
                number_of_appearances += 1
        isf[word] = np.log(number_of_sentences / number_of_appearances)
    return isf

10) Word weights
This function takes a list of sentences and returns a dictionary containing tokens as keys and their resultant weights as values. The weight is calculated as:

RW(ti) = ISF(ti) * TW(ti)
where ti is each token, RW is the resultant weight, ISF is the inverse sentence frequency and TW is the term weight.

def word_weights(cleaned_sentences: list) -> dict:
    term_weights = get_term_weights(cleaned_sentences)
    inverse_sentence_freq = inverse_sentence_frequency(cleaned_sentences)
    resultant_weights = dict()
    for word in term_weights.keys():
        resultant_weights[word] = term_weights[word] * inverse_sentence_freq[word]
    return resultant_weights

11) Part-of-speech tagging
This function takes a list of sentences and returns a list of lists, where each token is represented as a tuple of the form (token, POS tag).

def pos_tagging(cleaned_sentences: list) -> list:
    tagged_sentences = []
    for sentence in cleaned_sentences:
        sentence_nlp = nlp_spacy(sentence)
        tagged_sentence = []
        for word in sentence_nlp:
            tagged_sentence.append((word, word.pos_))
        tagged_sentences.append(tagged_sentence)
    return tagged_sentences
12) Sentence weights
This function takes a list of POS-tagged sentences and the total number of terms, and returns a list containing the weight of each sentence. The sentence weight is calculated as:

SW(si) = (number of nouns and verbs in the sentence) / (total number of terms in the paragraph)

def sentence_weights(tagged_sentences: list, total_terms: int) -> list:
    sent_weights = []
    for sentence in tagged_sentences:
        relevance_count = 0
        for word, tag in sentence:
            if tag == 'NOUN' or tag == 'VERB':
                relevance_count += 1
        sent_weights.append(relevance_count / total_terms)
    return sent_weights

13) Sentence position
This function takes a list of sentences and returns a weight for each sentence based on its position. For this we have defined weight checkpoints in the list named weights.

def sentence_position(cleaned_sentences: list) -> list:
    sent_position = []
    number_of_sentences = len(cleaned_sentences)
    weights = [0, 0.25, 0.23, 0.14, 0.08, 0.05, 0.04, 0.06, 0.04, 0.04, 0.15]
    for i in range(1, len(cleaned_sentences) + 1):
        sent_position.append(weights[int(np.ceil(10 * (i / number_of_sentences)))])
    return sent_position

14) Sentence length
This function takes a list of sentences and returns a list containing the length of each sentence.

def sentence_length(cleaned_sentences: list) -> list:
    sent_len = []
    for sentence in cleaned_sentences:
        sent_len.append(len(sentence.split()))
    return sent_len

15) Text ranking
This function takes a list of sentences and the GloVe word embeddings as input and returns a dictionary containing the sentence index as key and its rank as value. The ranking is based on the PageRank algorithm.

def text_rank(sentences: list, word_embeddings: dict) -> dict:
    # Clean sentences for the PageRank algorithm.
    clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ", regex=True)
    clean_sentences = [s.lower() for s in clean_sentences]
    clean_sentences = [remove_stopwords(r) for r in clean_sentences]
    # Replace each word with its GloVe embedding. The sentence vector is the
    # average of the embeddings of all words in that sentence.
    sentence_vectors = []
    for i in clean_sentences:
        if len(i) != 0:
            v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()]) / (len(i.split()) + 0.001)
        else:
            v = np.zeros((100,))
        sentence_vectors.append(v)
    # Initialize a similarity matrix for each pair of sentences.
    sim_mat = np.zeros([len(sentences), len(sentences)])
    # Calculate the cosine similarity for each pair of sentences.
    for i in range(len(sentences)):
        for j in range(len(sentences)):
            if i != j:
                sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1, 100),
                                                  sentence_vectors[j].reshape(1, 100))[0, 0]
    # Create a PageRank graph using the similarity matrix.
    nx_graph = nx.from_numpy_array(sim_mat)
    scores = nx.pagerank(nx_graph)
    return scores

16) Ranking features
This function takes a list of sentences as input and returns a dictionary containing the ranking of each sentence. The ranking is calculated using word-level and sentence-level features.

def feature_rank(sentences: list) -> dict:
    cleaned_sentences = [clean_text(sentence) for sentence in sentences]
    term_weights = word_weights(cleaned_sentences)
    tagged_sentences = pos_tagging(cleaned_sentences)
    total_terms = get_total_terms(cleaned_sentences)
    sent_weights = sentence_weights(tagged_sentences, total_terms)
    sent_position = sentence_position(cleaned_sentences)
    sent_len = sentence_length(cleaned_sentences)
    sentence_scores = []
    for index, sentence in enumerate(cleaned_sentences):
        score = 0
        for word in sentence.split():
            score += term_weights[word]
        score *= sent_weights[index]
        score += sent_position[index]
        if sent_len[index] != 0:
            score /= sent_len[index]
        else:
            score = 0
        sentence_scores.append(score)
    sentence_scores = np.asarray(sentence_scores)
    sentence_scores = sentence_scores / np.sum(sentence_scores)
    final_scores = dict()
    for i in range(len(sentence_scores)):
        final_scores[i] = sentence_scores[i]
    return final_scores

17) After this, the extractive summary generation takes place by calling all these functions with their respective parameters.

18) NER: open the training data to train the spaCy NER model.

train_data = pickle.load(open('train_data.pkl', 'rb'))

19) Create a new blank pipeline for the spaCy model, to be trained further using the training dataset.

nlp = spacy.blank('en')

20) Function to train the model.

def train_model(train_data):
    if 'ner' not in nlp.pipe_names:  # check whether NER is present in the pipeline
        ner = nlp.create_pipe('ner')  # create an NER pipe if not present
        nlp.add_pipe(ner, last=True)  # add the NER pipe at the end
    else:
        ner = nlp.get_pipe('ner')
    for _, annotations in train_data:  # one resume at a time from our training data of 200 resumes
        for ent in annotations['entities']:  # each (start, end, label) tuple under the 'entities' key
            ner.add_label(ent[2])  # add only the label of each tuple, e.g. 'Name' from (0, 15, 'Name')
    # At this point all custom NER labels from the training data have been added.
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']  # all pipes except NER
    with nlp.disable_pipes(*other_pipes):  # disable the other pipes, as we want to train only NER
        optimizer = nlp.begin_training()
        for itn in range(10):  # train the model for 10 iterations
            print('Starting iteration ' + str(itn))
            random.shuffle(train_data)  # shuffle the data in every iteration
            losses = {}
            for text, annotations in train_data:
                try:
                    nlp.update(
                        [text],
                        [annotations],
                        drop=0.2,
                        sgd=optimizer,
                        losses=losses
                    )
                    print(losses)
                except Exception:
                    print('Pass')

train_model(train_data)
21) Saving the model and reusing it.

# Save our trained model for re-use.
nlp.to_disk('nlp_model')
nlp_model = spacy.load('nlp_model')
# Inspect the loaded pipeline.
print(nlp_model)

22) Sample form of the training data for Named Entity Recognition. Each annotation is a (start, end, label) tuple of character offsets into the sentence.

TRAIN_DATA = [
    ("Walmart is a leading e-commerce company", {"entities": [(0, 7, "ORG")]}),
    ("I reached Chennai yesterday.", {"entities": [(10, 17, "GPE")]}),
    ("I recently ordered a book from Amazon", {"entities": [(31, 37, "ORG")]}),
    ("I was driving a BMW", {"entities": [(16, 19, "PRODUCT")]}),
    ("I ordered this from ShopClues", {"entities": [(20, 29, "ORG")]}),
    ("Fridge can be ordered in Amazon ", {"entities": [(0, 6, "PRODUCT")]}),
    ("I bought a new Washer", {"entities": [(15, 21, "PRODUCT")]}),
    ("I bought a old table", {"entities": [(15, 20, "PRODUCT")]}),
    ("I bought a fancy dress", {"entities": [(17, 22, "PRODUCT")]}),
    ("I rented a camera", {"entities": [(11, 17, "PRODUCT")]}),
    ("I rented a tent for our trip", {"entities": [(11, 15, "PRODUCT")]}),
    ("I rented a screwdriver from our neighbour", {"entities": [(11, 22, "PRODUCT")]}),
    ("I repaired my computer", {"entities": [(14, 22, "PRODUCT")]}),
    ("I got my clock fixed", {"entities": [(9, 14, "PRODUCT")]}),
    ("I got my truck fixed", {"entities": [(9, 14, "PRODUCT")]}),
    ("Flipkart started it's journey from zero", {"entities": [(0, 8, "ORG")]}),
    ("I recently ordered from Max", {"entities": [(24, 27, "ORG")]}),
    ("Flipkart is recognized as leader in market", {"entities": [(0, 8, "ORG")]}),
    ("I recently ordered from Swiggy", {"entities": [(24, 30, "ORG")]})
]
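As a minimal check of the saved model, the sketch below runs the loaded pipeline over a sample string and lists the recognized entities; the sample text is illustrative only, and the entities printed depend entirely on how the training converged.

# Run the trained pipeline over new text and list the recognized entities.
sample = "John Doe, Software Engineer at Example Corp, Mumbai."
doc = nlp_model(sample)
for ent in doc.ents:
    print(ent.label_, '->', ent.text)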
Chapter 7
Performance Evaluation

The performance of the summary generation is evaluated on the basis of the length of the generated summary compared to the original text. For most plain-text documents, the program has successfully generated summaries while preserving the meaningful, important words as well as the meaning of each word in context. The output of extractive summary generation is a summary of about 20% to 30% of the total length of the original article or document. As a use case, we took a text document containing an article about the Ganesh Chaturthi festival with a total length of 3304 characters. After processing it through the program, the resulting summary contained 666 characters, which is approximately 20% of the original document.

The training of the NER model pipeline is evaluated on the basis of the pipeline losses for each data entity after each iteration of training. For a small dataset the losses will be small. Our dataset contains 200 data points, trained in random order. The loss at the start of the first iteration was above 10000, and with subsequent iterations the losses came down.

{'ner': 205.58794659748673}
{'ner': 216.2805469001487}
{'ner': 421.50882047068626}
{'ner': 431.6458966203809}
Pass
{'ner': 523.0486146875501}
{'ner': 533.3927916462893}
Pass
Pass
{'ner': 618.948973077106}
Pass
{'ner': 779.149585860538}
[Figure: Scatter plot for NER training losses]
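A sketch of how such a scatter plot could be produced, assuming the running loss value was recorded after every update into a list named loss_history (an illustrative variable; the training loop above only prints the losses):

import matplotlib.pyplot as plt

# loss_history: one running 'ner' loss value per nlp.update() call.
plt.scatter(range(len(loss_history)), loss_history, s=8)
plt.xlabel('Update step')
plt.ylabel("Accumulated 'ner' loss within iteration")
plt.title('NER training losses')
plt.show()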
Chapter 8
Results and Conclusion

This project is now able to generate extractive summaries of a given text document, with the size of the generated summary about 20% of the original document. It also extracts valuable additional information from special documents, such as entities for names, locations and organizations. NER systems have been created that use linguistic grammar-based techniques as well as statistical models such as machine learning; hand-crafted grammar-based systems typically obtain better precision in reading and understanding the contexts of documents.

8.1 Project Screenshots

[Screenshot: Input text]
[Screenshot: Output summary]
[Screenshot: Training losses of the NER model]
[Screenshot: Sample input to the NER model - a resume]
[Screenshot: Configuration of the NER model]
[Screenshot: Metadata of the NER model]
[Screenshot: Output using the NER model]
[Screenshot: GloVe data snippet]
[Screenshot: GUI]
Chapter 9
References

1. A Survey on Automatic Text Summarization, http://pressacademia.org/archives/pap/v5/29.pdf
2. GloVe, https://www.aclweb.org/anthology/D14-1162.pdf
3. NER, https://www.aclweb.org/anthology/D11-1141.pdf
4. spaCy Language Processing Pipelines, https://spacy.io/usage/processing-pipelines
5. Jeffrey Pennington, Richard Socher, Christopher D. Manning, GloVe: Global Vectors for Word Representation, Computer Science Department, Stanford University, Stanford, CA 94305
6. Learning similarity with cosine similarity ensemble, Information Sciences, volume 307, https://www.sciencedirect.com/science/article/abs/pii/S0020025515001243
7. PageRank Algorithm, https://ieeexplore.ieee.org/abstract/document/1344743
8. Exploring network structure, dynamics, and function using NetworkX, https://www.osti.gov/biblio/960616
9. NetworkX documentation, https://networkx.org/documentation/stable/index.html