AI GEEKS
• Fetch from an API - https://github.com/bear/python-twitter
• Crawl websites - https://github.com/scrapy/scrapy
• Use a browser hack - https://github.com/Jefferson-Henrique/GetOldTweets-python
• If the data is in CSV:
import pandas as pd
df = pd.read_csv(file_name, index_col=None, header=0,
                 usecols=["field_name1", "field_name2", …])
• If the files are stored as JSON:
def load_json(path_to_json):
    import os
    import json
    import pandas as pd
    # collect all .json files in the directory
    list_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]
    # set up an empty dictionary keyed by file name
    resultdict = {}
    for f in list_files:
        with open(os.path.join(path_to_json, f), "r") as inputjson:
            resultdict[f] = json.load(inputjson)
    # one column per file; transpose so each file becomes a row
    df = pd.DataFrame(resultdict)
    df2 = df.T
    return df2
• Try converting the JSON into a dict, or the CSV into a dict?
• Why? (See the sketch below.)
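• Code (a minimal sketch, assuming the DataFrame df loaded above and an illustrative 'DocID' column):
# one dict per row, keyed by column name
records = df.to_dict(orient='records')
# or a single dict keyed by the document identifier
by_id = df.set_index('DocID').to_dict(orient='index')
# a dict gives constant-time lookup by key, which is why it is often
# preferred to repeatedly scanning a DataFrame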
Breaking text into the tokens you want to feed into the NLP algorithm
from nltk.tokenize import RegexpTokenizer
# keep only word characters, i.e. separate words without punctuation
tokenizer = RegexpTokenizer(r'\w+')
# convert to lower case to avoid duplication of the same word
raw = text.lower()
tokens = tokenizer.tokenize(raw)
• Stop words are commonly occurring words which don't contribute to topic
modelling.
• the, and, or
• However, sometimes removing stop words affects topic modelling.
• For example, "Thor The Ragnarok" is a single topic, but if we use the stop-word mechanism,
"The" will be removed.
Commonly occurring words which don't provide context
from stop_words import get_stop_words
from stop_words import LANGUAGE_MAPPING
from stop_words import AVAILABLE_LANGUAGES
# create English stop words list
english_stop_words = get_stop_words('en')
# remove stop words from tokens
stopped_tokens = [i for i in tokens if i not in english_stop_words]
• Why can't we load our own stop words in a list and filter out the tokens that match them?
• Can we use the stop words repository for other purposes?
• Manually add the Malay language to the stop words corpus
• Make a language detection mechanism using stop words (a sketch follows)
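• Code (a minimal sketch of stop-word-based language detection; the helper name detect_language is ours, and we assume every entry of AVAILABLE_LANGUAGES is accepted by get_stop_words):
from stop_words import get_stop_words, AVAILABLE_LANGUAGES

def detect_language(tokens):
    scores = {}
    for lang in AVAILABLE_LANGUAGES:
        lang_stop_words = set(get_stop_words(lang))
        # score = how many tokens appear in this language's stop-word list
        scores[lang] = sum(1 for t in tokens if t in lang_stop_words)
    # the language whose stop words cover the most tokens wins
    return max(scores, key=scores.get)

print(detect_language("the cat sat on the mat and looked at me".split()))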
• A common NLP technique to reduce topically similar words to their root. For example,
"stemming," "stemmer," "stemmed" all have similar meanings; stemming reduces
those terms to "stem."
• Important for topic modeling, which would otherwise view those terms as separate
entities and reduce their importance in the model.
• It's a bunch of rules for reducing a word:
• sses -> es
• ies -> i
• ational -> ate
• tional -> tion
• s -> ∅
• when rules conflict, the longest rule wins
• Bad idea unless you customize it.
Arabic Stemming Process
Simple Stemming Process
• Code
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
plurals = ['caresses', 'flies', 'dies', 'mules', 'denied', 'died', 'agreed', 'owned', 'humbled',
           'sized', 'meeting', 'stating', 'siezing', 'itemization', 'sensational', 'traditional',
           'reference', 'colonizer', 'plotted']
singles = [stemmer.stem(plural) for plural in plurals]
print(' '.join(singles))
• Output:
caress fli die mule deni die agre own humbl size meet state siez item sensat tradit refer
colon plot
• Code
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
stemmer2 = SnowballStemmer("english", ignore_stopwords=True)
print(stemmer.stem("having"))
print(stemmer2.stem("having"))
• Output:
have
having
• Find out whether there is any support for the Snowball stemmer in the Malay language. If not,
find out how you can implement your own (a rough sketch follows).
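• Code (a toy affix-stripping sketch in the spirit of the Snowball rules above; the affix lists and the helper name simple_malay_stem are illustrative only, not a complete Malay stemmer):
MALAY_PREFIXES = ('meng', 'men', 'mem', 'me', 'ber', 'ter', 'di', 'pe')
MALAY_SUFFIXES = ('kan', 'an', 'i')

def simple_malay_stem(word):
    # strip at most one prefix, keeping a stem of at least four characters
    for prefix in MALAY_PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= 4:
            word = word[len(prefix):]
            break
    # strip at most one suffix under the same length constraint
    for suffix in MALAY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            word = word[:-len(suffix)]
            break
    return word

print(simple_malay_stem('makanan'))   # -> makan
print(simple_malay_stem('berjalan'))  # -> jalan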
• It goes one step further than stemming.
• It obtains grammatically correct words and distinguishes words by their word
sense with the use of a vocabulary (e.g., "type" can mean write or category).
• It is a much more difficult and expensive process than stemming.
• Code
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn

# a thin wrapper around the WordNet corpus (not strictly needed for lemmatizing)
class MyWordNet:
    def __init__(self, wn):
        self._wordnet = wn

run = MyWordNet(wn)
lemma = WordNetLemmatizer()
lem = map(lemma.lemmatize, stopped_tokens)
itemarr = []
for item in lem:
    itemarr.append(item)
print(itemarr)
• from nltk import FreqDist
• fdist = FreqDist(itemarr)
• fdist2 = FreqDist(stopped_tokens)
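• A small usage note: both distributions can be inspected with most_common, e.g. to compare the vocabulary before and after lemmatization.
print(fdist.most_common(10))
print(fdist2.most_common(10))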
• from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'All my cats in a row',
    'When my cat sits down, she looks like a Furby toy!',
    'The cat from outer space',
    'Sunshine loves to sit like this for some reason.'
]
vectorizer = CountVectorizer()
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)
https://pythonprogramminglanguage.com/bag-of-words/
• LDA
• LSI
• TFIDF
• Doc2Vec
• LDA stands for Latent Dirichlet Allocation.
• It is basically a distribution of words in topic k (let's say 50 topics) combined with the
probability of topic k occurring in document d (let's say 5,000 documents).
• Mechanism - it uses a special kind of distribution called the Dirichlet distribution,
which is nothing but a multivariate generalization of the Beta probability density
function (a small illustration follows).
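• Code (a tiny illustration of the Dirichlet distribution mentioned above; each draw is a probability vector over k topics that sums to 1):
import numpy as np
k = 5
alpha = [0.1] * k          # a small alpha gives sparse, peaked topic mixtures
print(np.random.dirichlet(alpha, size=3))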
David Blei, Andrew Ng, Michael I. Jordan (authors of the original LDA paper)
Sentence 1: I spend the evening watching football.
Sentence 2: I ate nachos and guacamole.
Sentence 3: I spend the evening watching football while eating nachos and guacamole.
LDA might say something like:
Sentence 1 is 100% about Topic 1
Sentence 2 is 100% Topic 2
Sentence 3 is 65% Topic 1, 35% Topic 2
But it also tells us that
Topic 1 is about football (50%) and evening (50%), and
Topic 2 is about nachos (50%) and guacamole (50%).
https://ai.stanford.edu/~ang/papers/nips01-lda.pdf
• LDA works on a set of documents, so each document needs to be uniquely identified.
How can the tokenized and lemmatized documents be stored?
Options:
1. JSON
2. DICT
3. PANDAS DATAFRAME
4. CSV
• Remember, we load the data into a DF.
• Can we iterate the DF and store it in a DICT?
• Code
local_dict = {}
for index, row in df2.iterrows():
    # preprocess the text data by applying stop words and/or stemming/lemmatization;
    # itemarr stands for the processed tokens of this row
    local_dict[str(row['unique_identifier_of_record_in_Dataframe'])] = itemarr
• What are the benefits of storing the processed data in a DICT instead of a DF?
• Save the dictionary in a pickle for later use by various models. How to do it?
• Code:
from six.moves import cPickle as pickle
with open(dict_file_name, 'wb') as f:
    pickle.dump(local_dict, f)
Once saved, how do we reload it?
• Code
with open(dict_file_name, 'rb') as f:
    reload_dict = pickle.load(f)
Code:
# turn our tokenized documents into an id <-> term dictionary
import numpy as np
import gensim
from gensim import corpora, models
dictionary = corpora.Dictionary(texts)
# convert tokenized documents into a document-term matrix:
# our dictionary must be converted into a bag-of-words
random_seed = 69
state = np.random.RandomState(random_seed)
corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('./corpus/' + product_name + '.corpus.mm', corpus)
# generate LDA model
ldaModel = gensim.models.ldamodel.LdaModel(corpus,
                                           num_topics=n_topics,
                                           id2word=dictionary,
                                           random_state=state)
• Try printing the dictionary and corpus to understand their structure
# generate matrix similarities to be used
from gensim.similarities import MatrixSimilarity
print(ldaModel[corpus])
index = MatrixSimilarity(ldaModel[corpus])
# generate the list of unique document identifiers
DocIdList = list(reload_dict.keys())
# we've already stored the list of words against each DocID in reload_dict,
# so we can generate a bag-of-words vector using only the words present in the given Doc ID
# let's suppose the Doc ID is 101; then reload_dict[101] gives the list of processed tokens
vec_bow = dictionary.doc2bow(reload_dict[101])
# as in the previous example, we get the LDA vector of the bag of words of the given document
vec_lda = ldaModel[vec_bow]
# now, using the similarity matrix, we can find out how similar this document is to every other one
sims = index[vec_lda]
print(sims)
# however, it is not sorted
• Can you sort the similarity matrix?
• Can you find the top 5 document IDs which are most similar and store them in a JSON file? (A sketch follows.)
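• Code (a minimal sketch for both exercises, following the hint in the editor's notes; sims and DocIdList come from the snippets above, and the output file name is illustrative):
import json

# sort document indices by descending similarity
sorted_sims = sorted(enumerate(sims), key=lambda item: -item[1])

# the most similar entry is usually the query document itself, so skip it and keep the next five
top5 = {}
for doc_index, similarity in sorted_sims[1:6]:
    top5[DocIdList[doc_index]] = float(similarity)

with open('top5_similar_docs.json', 'w') as f:
    json.dump(top5, f, indent=2)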
• Based on the principle that words that are used in the same contexts tend to have
similar meanings.
• Identifies patterns in the relationships between the terms and concepts contained in
an unstructured collection of text.
• Uses a mathematical technique called singular value decomposition (SVD).
• https://en.wikipedia.org/wiki/Latent_semantic_analysis#Latent_semantic_indexing
https://en.wikipedia.org/wiki/Singular-value_decomposition
• Generate the model using gensim, similar to what was done for LDA (a sketch follows).
• Find similar documents using the LSI model.
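• Code (a minimal sketch, reusing the corpus, dictionary, n_topics and reload_dict built in the LDA snippets):
from gensim import models
from gensim.similarities import MatrixSimilarity

lsiModel = models.LsiModel(corpus, id2word=dictionary, num_topics=n_topics)
lsi_index = MatrixSimilarity(lsiModel[corpus])

# query with the same document 101 used earlier
vec_lsi = lsiModel[dictionary.doc2bow(reload_dict[101])]
print(lsi_index[vec_lsi])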
TF-IDF is a numerical statistic that is intended to reflect how important a word is
to a document in a collection or corpus.
• Term frequency of t in document d = the number of times that term t occurs in
document d.
• Term frequency adjusted for document length = the number of times that
term t occurs in document d / the number of words in d.
• Inverse document frequency - how much information the word provides, that is,
whether the term is common or rare across all documents:
IDF = log(total number of documents / number of documents containing the term t)
• Generate the model using gensim, similar to what was done for LDA and LSI (a sketch follows).
• Find similar documents using the TF-IDF model.
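• Code (a minimal sketch, reusing the corpus, dictionary and reload_dict from the LDA snippets; gensim's TfidfModel re-weights the bag-of-words counts):
from gensim import models
from gensim.similarities import MatrixSimilarity

tfidfModel = models.TfidfModel(corpus)
tfidf_index = MatrixSimilarity(tfidfModel[corpus], num_features=len(dictionary))

# query with document 101 again and print its similarity to every document
vec_tfidf = tfidfModel[dictionary.doc2bow(reload_dict[101])]
print(tfidf_index[vec_tfidf])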
https://arxiv.org/pdf/1301.3781.pdf
https://arxiv.org/pdf/1605.02019.pdf
The lda2vec model adds in skip-grams:
a word predicts another word in the same window,
as in word2vec, but there is also the notion of a context vector
which only changes at the document level, as in LDA.
• Source: https://github.com/TropComplique/lda2vec-pytorch
• Go to 20newsgroups/.
• Run get_windows.ipynb to prepare data.
• Run python train.py for training.
• Run explore_trained_model.ipynb.
• To use this on your own data you need to edit get_windows.ipynb. There are also
hyperparameters in 20newsgroups/train.py, utils/training.py, and utils/lda2vec_loss.py.
Editor's Notes
  • #6: Hint: check out the structure of the JSON using json.loads. For CSV, it's better to store it as a list of dictionaries where each row is a dictionary and the keys are the headers. Search is faster because a dict looks up values by key.
  • #11: Go to languages.json and add malay.txt against 'ms' as the language code. For language detection, score the text against the stop words of every available language in languages.json and use the language with the highest number of matching words.
  • #16: Visit SnowBall website
  • #37: For sorting: sims = sorted(enumerate(sims), key=lambda item: -item[1]). For finding the top documents, try: for index, similarity in sims[1:30]: json.dumps(df[df['DocID'] == DocIdList[index]].to_dict(orient='index'))