PREDICTION MODELS
BASED ON MAX-STEMS
(or harnessing imbalanced data)
Episode One: One-Word Based
Ahmet Furkan EMREHAN
(matahmet@gmail.com)
11/17/2021
EMREHAN 1
PREDICTION MODELS
BASED ON MAX-STEMS
• Episode One: One-Word Based
• Episode Two: A Combinatorial Approach
• Episode Three: Effect of Hyperparameters
• Episode Four: Advanced Examinations
INTRODUCTION
• The quantity of information is growing rapidly. Written information in particular grows day by day with social media apps: tweets, comments and tags contribute greatly to that bulk of written information.
PROBLEM
• Labelling written information (in practice, sentences) is a problem in supervised learning within the text mining literature.
• Moreover, the frequencies of labels are imbalanced in most cases. For example, most headlines in a news portal are labelled «breaking news» or «news flash» in order to attract attention.
MOTIVATION
• Documents (docs) in this context are sentences, and sentences are composed of ordered words. For each label, one counts how often a word occurs in the sentences with that known label (in the train set).
• The frequencies of words can give an idea about the label of the sentences in which they occur. My models in this study are based on that approach.
• This study proposes a set of solutions for those two problems (labelling and imbalanced data).
• It is intended as a contribution to the supervised learning literature, in the form of a collection of prediction models for text mining.
METHOD (Word to Stem)
• Using words to predict the label of a sentence entails an approach based on the structure of the relevant language. This study focuses on agglutinative languages (e.g. Turkish, Hungarian, Estonian, Basque, Japanese, Korean).
• In an agglutinative language, the stem of a word is the core part that creates «meaning». In most cases, a word has the form of a stem with derivational and/or inflectional affixes (morphemes).
• Computing frequencies on whole words may be inefficient, because each word appears in specific derivational and inflectional forms.
• For this reason, using the stem is more convenient than using the word: the stem carries the meaning or concept that the word bears, in pure form (without affixes).
METHOD (Stem to Max-Stem)
• As the length of a stem decreases, its scope of meaning expands. A short stem may cover a scope broader than the scope of the word itself.
• In such cases, choosing the longest derivational form of the stem that the word still includes serves the purpose of reasonably delimiting the scope of meaning of the word.
• That approach is extended to all cases in order to preserve the meaning of the word (for more discussion, see Step1_turkish_stems_ReadMe.txt).
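The max-stem idea can be sketched as a longest-prefix lookup. This is only a minimal sketch: it assumes that "the word includes the stem" means "the word starts with the stem", and `TOY_STEMS` is a hypothetical miniature list standing in for the Turkish stem list used in the study. The function name `max_stem` is also an assumption.

```python
def max_stem(word, stem_list):
    """Return the longest entry of stem_list that is a prefix of word, or None."""
    candidates = [s for s in stem_list if word.startswith(s)]
    return max(candidates, key=len) if candidates else None

# Hypothetical miniature stem list (assumption, for illustration only).
TOY_STEMS = {"çal", "çalı", "gir", "kedi", "müze", "sanat", "yıl", "yıldır"}

for word in ["müzesine", "girmeye", "yıldır", "çalışıyor"]:
    # The longest matching stem wins, e.g. "yıldır" over "yıl".
    print(word, "->", max_stem(word, TOY_STEMS))
```

With this toy list the sketch prefers «yıldır» to «yıl» and «çalı» to «çal», which is the behaviour discussed in the notes of the application part.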
COMPONENTS OF MODELS
• $p$: index of categories (labels)
• $Label_p$: category with index $p$
• $n$: number of categories (labels)
• $doc_i$: document in the test set with index $i$ (a sentence or just a headline)
• $stem_{ij}$: stem with index $j$ of $doc_i$ (the stem can be chosen as the max-stem described on the previous slides)
• $m_i$: number of stems $stem_{ij}$ of $doc_i$
• $\Sigma^{p}$: number of documents in the train set labelled with category $p$
• $\Sigma_{ij}^{p}$: number of documents in the train set that include $stem_{ij}$ and are labelled with category $p$
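COMPONENTS OF MODELS
• $\Lambda_{ij} := Label_q$ where $q = \arg\max_{p} \Sigma_{ij}^{p}$
• $\Lambda_{i}^{p}$: number of $\Lambda_{ij}$ that are equal to $Label_p$
• $\lambda_{ij}$: length of $stem_{ij}$
• $\rho_{i}^{p} := \sum_{j=1}^{m_i} \dfrac{\Sigma_{ij}^{p}}{\Sigma^{p}}$ (in case $\Sigma^{p} = 0$, $\rho_{i}^{p} := 0$)
• $\Pi_{ij}^{p} := \dfrac{\Sigma_{ij}^{p}}{\sum_{q=1}^{n} \Sigma_{ij}^{q}}$ (it can be considered as the probability that $stem_{ij}$ is labelled with category $p$)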
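The per-stem components can be sketched in a few lines. This is a sketch under assumptions, not the code of the notebooks: the function names and the three-document `toy_train` set are hypothetical, and a train document is assumed to count once per stem regardless of how often the stem occurs in it. The counts `[15, 2, 0, 0]` are the ones given for the stem «kedi» in the worked example of the application part.

```python
LABELS = ["DÜNYA", "SPOR", "SANAT", "Teknoloji"]  # Label_1 .. Label_4

def count_sigma(stem, train_docs, labels=LABELS):
    """Sigma_ij^p: number of train documents that contain `stem`, per label."""
    counts = [0] * len(labels)
    for stems, label in train_docs:
        if stem in stems:  # a document counts once (assumption)
            counts[labels.index(label)] += 1
    return counts

def pi_and_lambda(sigma, labels=LABELS):
    """Return (list of Pi_ij^p over p, Lambda_ij) from the counts Sigma_ij^p."""
    total = sum(sigma)
    if total == 0:  # stem unseen in the train set
        return [0.0] * len(sigma), None
    pi = [s / total for s in sigma]
    q = max(range(len(sigma)), key=lambda p: sigma[p])  # first maximum on ties
    return pi, labels[q]

# Hypothetical three-document train set, only to exercise count_sigma.
toy_train = [(["kedi", "sanat"], "SANAT"), (["kedi", "gir"], "DÜNYA"), (["gir"], "SPOR")]
print(count_sigma("kedi", toy_train))

# Counts given for the stem "kedi" in the worked example.
pi, lam = pi_and_lambda([15, 2, 0, 0])
print([round(x, 2) for x in pi], lam, len("kedi"))  # Pi values, Lambda_ij, lambda_ij
```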
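COMPONENTS OF MODELS
• $\Pi_{i}^{p}$ (average of the positive $\Pi_{ij}^{p}$) $:= \operatorname{average}_{j^*} \Pi_{ij^*}^{p}$, taken over all $j^*$ that meet the condition $\Pi_{ij^*}^{p} > 0$; in case $\Sigma_{ij}^{p} = 0$ for all $p = 1, 2, \dots, n$, $\Pi_{i}^{p} := 0$
• $\Pi_{i}^{p}$ (maximum of $\Pi_{ij}^{p}$ over $j$) $:= \max_{j} \Pi_{ij}^{p}$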
General Scheme for Prediction Models
X_test → $doc_i$
→ $doc_i = lower(doc_i)$
→ $doc_i = clean(doc_i)$
→ $token_i = clean(doc_i)$
→ $stems_i = max\_stems(token_i)$ = [$stem_{i1}$, $stem_{i2}$, …, $stem_{im_i}$]
→ Analyze_word → for each $stem_{ij}$, $j = 1, \dots, m_i$: $\Sigma_{ij}^{p}$, $\Pi_{ij}^{p}$, $\Lambda_{ij}$, $\lambda_{ij}$
→ Predictions
→ Evaluations (accuracy, confusion etc.)
Inputs shown on the diagram: X_train, y_train; y_test
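The first steps of the scheme (lower, clean, tokens) might look like the sketch below. What `clean` does in the notebooks is not stated on the slides; here it is assumed to drop digits and punctuation, which is consistent with the worked example, whose stem list contains no "2". The function name `preprocess` is hypothetical.

```python
import re

def preprocess(doc):
    """Lower-case a headline and keep only its alphabetic tokens (assumed cleaning)."""
    # Note: str.lower() does not apply the Turkish casing rules for "I" and "İ".
    lowered = doc.lower()
    # Runs of letters only: word characters that are neither digits nor "_".
    return re.findall(r"[^\W\d_]+", lowered)

print(preprocess("2 kedi 2 yıldır sanat müzesine girmeye çalışıyor"))
```

The tokens produced this way are the words that the max-stem step would then reduce to stems.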
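Model 1
• $predict_1(doc_i) = Label_q$, where
$$\begin{cases} q \text{ is such that } \Sigma_{ij}^{p} = 0 < \Sigma_{ij}^{q} \text{ for all } p \neq q \text{ and for all } j = 1, \dots, m_i, & \text{if such a } q \text{ exists} \\ q = \arg\max_{c(r,j)} \sum_{j} \Sigma_{ij}^{r} \; {}^{*}, & \text{otherwise} \end{cases}$$
• * $c(r, j) := \left(r = \arg\operatorname{2nd\,max}_{p} \Lambda_{i}^{p}\right)$ and $\left(\Lambda_{ij} = Label_r\right)$ (because $\arg\operatorname{2nd\,max}_{p} \Lambda_{i}^{p}$ may not be unique)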
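Model 1
• $$predict_1(doc_i) = \begin{cases} Label_q, & \text{if } \Sigma_{ij}^{p} = 0 < \Sigma_{ij}^{q} \text{ for all } p \neq q \\ Label_q \text{ where } q = \arg\operatorname{2nd\,max}_{p} \Lambda_{i}^{p}, & \text{otherwise } {}^{*} \end{cases}$$
* in case $q$ is not unique, $q$ is chosen as the minimum index meeting the condition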
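A possible reading of the simpler form of Model 1 is sketched below. Several points are assumptions rather than statements of the slides: "arg 2nd max" is read as the second entry after sorting the categories by vote count (ties resolved by the smaller index), stems unseen in the train set cast no vote, and the «No Prediction» label is returned when every count is zero. The example rows passed to `predict1` are hypothetical; only the vote vector `[4, 1, 1, 0]` comes from the worked example.

```python
LABELS = ["DÜNYA", "SPOR", "SANAT", "Teknoloji"]  # Label_1 .. Label_4

def second_max_index(votes):
    """Index of the second entry when indices are sorted by votes, largest first.

    sorted() is stable, so equal vote counts keep the smaller index first.
    """
    order = sorted(range(len(votes)), key=lambda p: -votes[p])
    return order[1]

def predict1(sigma_rows, labels=LABELS):
    """sigma_rows[j][p] holds Sigma_ij^p for stem j and category p."""
    n = len(labels)
    if all(sum(row) == 0 for row in sigma_rows):
        return "No Prediction"
    # First branch: one category q is the only one with non-zero counts, for every stem.
    for q in range(n):
        if all(row[q] > 0 and sum(row) == row[q] for row in sigma_rows):
            return labels[q]
    # Otherwise each seen stem votes for its arg max category (Lambda_ij) ...
    votes = [0] * n
    for row in sigma_rows:
        if sum(row) > 0:
            votes[max(range(n), key=lambda p: row[p])] += 1
    # ... and the category in second place of the vote ranking is returned.
    return labels[second_max_index(votes)]

# Vote counts Lambda_i^p of the worked example: SPOR and SANAT tie for second place.
print(LABELS[second_max_index([4, 1, 1, 0])])
# Hypothetical count rows, only to exercise both branches.
print(predict1([[3, 0, 0, 0], [5, 0, 0, 0]]))
print(predict1([[5, 1, 0, 0], [1, 4, 0, 0], [6, 0, 2, 0]]))
```

Returning the second-ranked category rather than the first is what lets the model step away from the dominant class when the stems disagree.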
Model 2
• $predict_2(doc_i) = Label_q$ where $q = \arg\max_{p} \rho_{i}^{p}$
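Model 2 can be sketched directly from the definition of $\rho_{i}^{p}$. The numbers below are the ones given in the application part (class sizes of the train set and the per-category sums of $\Sigma_{ij}^{p}$ for the worked example); the function names are hypothetical, and returning «No Prediction» when every ratio is zero is an assumption.

```python
LABELS = ["DÜNYA", "SPOR", "SANAT", "Teknoloji"]  # Label_1 .. Label_4
CLASS_TOTALS = [7384, 1568, 229, 116]  # Sigma^p in the train set
STEM_SUMS = [311, 99, 29, 8]           # sum over j of Sigma_ij^p, worked example

def rho(stem_sums, class_totals):
    """rho_i^p: stem counts relative to the class size; 0 when the class is empty."""
    return [s / t if t else 0.0 for s, t in zip(stem_sums, class_totals)]

def predict2(stem_sums, class_totals=CLASS_TOTALS, labels=LABELS):
    r = rho(stem_sums, class_totals)
    if not any(r):
        return "No Prediction"  # assumption for the all-zero case
    return labels[r.index(max(r))]  # first maximum on ties

print([round(x, 3) for x in rho(STEM_SUMS, CLASS_TOTALS)])
print(predict2(STEM_SUMS))
```

Dividing by the class size $\Sigma^{p}$ is what keeps the large «DÜNYA» class from winning by raw counts alone.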
Model 3
• $predict_3(doc_i) = Label_q$ where $q = \arg\max_{p} \Pi_{i}^{p}$
Model 4
• $predict_4(doc_i) = Label_q$, where
$$\begin{cases} q = \arg\max_{p} \Pi_{i}^{p}, & \text{if } q \text{ is unique} \\ q = \arg\max_{r} \Pi_{i}^{r} \text{ and } r = \arg\max_{p} \Sigma_{ij}^{p}, & \text{otherwise} \end{cases}$$
Model 5
• $predict_5(doc_i) = Label_q$ where $q = \arg\max_{p} \left( \max_{j} \lambda_{ij} \cdot \dfrac{\Sigma_{ij}^{p}}{\Sigma^{p}} \right)$
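The length-weighted score of Model 5 can be sketched as follows. The slides give full counts for only two stems of the worked example («kedi» and «çalı»), so the call below is restricted to those two and is not expected to reproduce the Model 5 prediction of the slides, which uses all six stems. Treating an empty class as score 0 and the all-zero case as «No Prediction» are assumptions, as is the function name.

```python
LABELS = ["DÜNYA", "SPOR", "SANAT", "Teknoloji"]  # Label_1 .. Label_4
CLASS_TOTALS = [7384, 1568, 229, 116]  # Sigma^p in the train set

def predict5(lengths, sigma_rows, class_totals=CLASS_TOTALS, labels=LABELS):
    """lengths[j] is lambda_ij; sigma_rows[j][p] is Sigma_ij^p."""
    scores = []
    for p, total in enumerate(class_totals):
        if total == 0:  # empty class: score 0 (assumption)
            scores.append(0.0)
            continue
        # max over stems of lambda_ij * Sigma_ij^p / Sigma^p
        scores.append(max(lam * row[p] / total for lam, row in zip(lengths, sigma_rows)))
    if not any(scores):
        return "No Prediction", scores
    return labels[scores.index(max(scores))], scores

# Only the two stems with published counts: kedi (length 4) and çalı (length 4).
label, scores = predict5([4, 4], [[15, 2, 0, 0], [156, 25, 7, 4]])
print(label, [round(s, 3) for s in scores])
```

The sketch shows how a small class can reach a high score from very few matching documents, because each count is divided by the class size.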
Case «No Prediction»
• In some cases, no stem of a document is included in any document of the train set. The prediction functions then trivially return the label «No Prediction». The probability of this case is nearly zero if the train set is sufficiently large.
• However, the label «No Prediction» is more probable in the models of this study that use the Combinatorial Approach: the probability that all elements of a combination (a group of stems of a document in the test set) occur in the same document of the train set is obviously lower than the probability that a single stem of the test document occurs in a document of the train set.
• Some examples of that case are observed in Episode Two.
Case «Not Unique»
• In some cases, the values that generate predictions, such as "$\arg\max_{p} \Pi_{i}^{p}$" and "$\arg\operatorname{2nd\,max}_{p} \Lambda_{i}^{p}$", may not be unique because of equal values. The models then choose the label with the minimum index as the prediction, which corresponds to the behaviour of the list structure in Python.
• I use extra parameters (figuratively, tiebreakers), $\Sigma_{ij}^{p}$ and $\sum_{j} \Sigma_{ij}^{r}$, in order to avoid that case.
• Moreover, as the train set gets larger, the probability that equal values exist is expected to diminish.
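The "minimum index" behaviour mentioned above follows from how Python lists are searched. A minimal sketch, with hypothetical scores and hypothetical function names:

```python
def argmax_min_index(values):
    """Index of the largest value; list.index returns the first match on ties."""
    return values.index(max(values))

def max_is_unique(values):
    """True when exactly one entry attains the maximum."""
    return values.count(max(values)) == 1

scores = [0.5, 0.5, 0.25, 0.0]  # hypothetical scores for Label_1 .. Label_4
print(argmax_min_index(scores), max_is_unique(scores))
```

A check such as `max_is_unique` is the kind of test a model would need before falling back to a tiebreaker.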
Application (introduction)
• We use data from «nayn.co», a news portal in Turkish. The data is imported from the URL «https://raw.githubusercontent.com/naynco/nayn.data/master/classification_clean.csv».
• The head of the data is presented below.
[Image: head of the data]
• There are 11622 documents («Title» column), each with a label: «DÜNYA» (World), «SPOR» (Sports), «SANAT» (Art) or «Teknoloji» (Technology). The data is imbalanced in favor of the category «DÜNYA»: the counts [and percentages] of the categories are 9226 [79%], 1967 [17%], 285 [2%] and 144 [1%], respectively.
General Scheme for Application of Prediction Models
X_test → $doc_i$
→ $doc_i = lower(doc_i)$
→ $doc_i = clean(doc_i)$
→ $token_i = clean(doc_i)$
→ $stems_i = max\_stems(token_i)$ = [$stem_{i1}$, $stem_{i2}$, …, $stem_{im_i}$]
→ Analyze_doc → for each $stem_{ij}$, $j = 1, \dots, m_i$: $\Sigma_{ij}^{p}$, $\Pi_{ij}^{p}$, $\Lambda_{ij}$, $\lambda_{ij}$
→ Predictions
→ Evaluations (accuracy, confusion etc.)
Inputs shown on the diagram: Turkish Stem List (32001 stems) (for application); X_train, y_train; y_test
Notebooks shown on the diagram: Step2_Preprocess_word.ipynb, Step3_Classifier_V10.ipynb, Step4_Prediction_Models_ub.ipynb, Step6_Run_Integrated_Model_ub.ipynb
Application (computations)
• The Pandas and Sklearn libraries in Python are used to apply the methods. The test size is chosen as 0.2 and the random_state parameter of the partition as 57.
• The values of the model parameters are computed below.
• Number of categories: $n = 4$
• Indexes and names of categories: $p = 1, 2, 3$ and $4$; $Label_p$ = "DÜNYA", "SPOR", "SANAT" and "Teknoloji", respectively
• Counts of categories in the train set: $\Sigma^{1} = 7384$, $\Sigma^{2} = 1568$, $\Sigma^{3} = 229$ and $\Sigma^{4} = 116$
Application (computations)
• Now let's show an example and compute its parameters (the components of the models). We deal with the document with index number $i = 38296$ and rank 1356 in the test set (the index number may not be related to the rank).
• $doc_i$: 2 kedi 2 yıldır sanat müzesine girmeye çalışıyor (en: 2 cats have been trying to enter an art museum for 2 years)
• $label_i$: DÜNYA (en: world)
• stems: ['kedi', 'yıldır', 'sanat', 'müze', 'gir', 'çalı']
• Output of analyze_doc(doc, X_train, y_train) for $j = 1, \dots, m_i = 6$:
[Table: columns $stem_{ij}$, $\lambda_{ij}$, $\Sigma_{ij}^{p}$, $\Pi_{ij}^{p}$, $\Lambda_{ij}$, category of 2nd max $\Sigma_{ij}^{p}$ (not used)]
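Application (computations)
• $i = 38296$
• stems: $stem_{i1}$ = "kedi", $stem_{i2}$ = "yıldır", $stem_{i3}$ = "sanat", $stem_{i4}$ = "müze", $stem_{i5}$ = "gir", $stem_{i6}$ = "çalı"
• length of stems: $\lambda_{i1} = 4$, $\lambda_{i2} = 6$, $\lambda_{i3} = 5$, $\lambda_{i4} = 4$, $\lambda_{i5} = 3$, $\lambda_{i6} = 4$
• $\Sigma_{ij}^{p}$: number of documents in the train set that include $stem_{ij}$ and are labelled with category $p$
• for $j = 1$: $\Sigma_{i1}^{1} = 15$, $\Sigma_{i1}^{2} = 2$, $\Sigma_{i1}^{3} = 0$, $\Sigma_{i1}^{4} = 0$; for $j = 6$: $\Sigma_{i6}^{1} = 156$, $\Sigma_{i6}^{2} = 25$, $\Sigma_{i6}^{3} = 7$, $\Sigma_{i6}^{4} = 4$
• $\Lambda_{ij} := Label_q$ where $q = \arg\max_{p} \Sigma_{ij}^{p}$: $\Lambda_{i1}$ = "DÜNYA", $\Lambda_{i2}$ = "SPOR", $\Lambda_{i3}$ = "SANAT", $\Lambda_{i4}$ = "DÜNYA", $\Lambda_{i5}$ = "DÜNYA", $\Lambda_{i6}$ = "DÜNYA"
• $\Lambda_{i}^{p}$ (number of $\Lambda_{ij}$ equal to $Label_p$): $\Lambda_{i}^{1} = 4$, $\Lambda_{i}^{2} = 1$, $\Lambda_{i}^{3} = 1$, $\Lambda_{i}^{4} = 0$
• $\Pi_{ij}^{p}$ for $j = 2$: $\Pi_{i2}^{1} = 0.21$, $\Pi_{i2}^{2} = 0.77$, $\Pi_{i2}^{3} = 0.01$, $\Pi_{i2}^{4} = 0$; for $j = 4$: $\Pi_{i4}^{1} = 0.53$, $\Pi_{i4}^{2} = 0.06$, $\Pi_{i4}^{3} = 0.41$, $\Pi_{i4}^{4} = 0$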
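Application (computations)
• $\rho_{i}^{p}$: $\rho_{i}^{1} = \dfrac{311}{7384} = 0.042$, $\rho_{i}^{2} = \dfrac{99}{1568} = 0.063$, $\rho_{i}^{3} = \dfrac{29}{229} = 0.127$, $\rho_{i}^{4} = \dfrac{8}{116} = 0.069$
• $\Pi_{i}^{p}$ (average of the positive $\Pi_{ij}^{p}$):
$\Pi_{i}^{1} = \dfrac{0.88 + 0.21 + 0.43 + 0.53 + 0.81 + 0.81}{6} = 0.612$, $\Pi_{i}^{2} = \dfrac{0.12 + 0.77 + 0.06 + 0.12 + 0.13}{5} = 0.24$,
$\Pi_{i}^{3} = \dfrac{0.01 + 0.57 + 0.41 + 0.04 + 0.04}{5} = 0.214$, $\Pi_{i}^{4} = \dfrac{0.03 + 0.02}{2} = 0.025$
• $\Pi_{i}^{p}$ (maximum of $\Pi_{ij}^{p}$ over $j$): $\Pi_{i}^{1} = 0.88$, $\Pi_{i}^{2} = 0.77$, $\Pi_{i}^{3} = 0.57$, $\Pi_{i}^{4} = 0.03$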
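The two document-level $\Pi$ values above can be checked with a short sketch. The matrix below collects the rounded $\Pi_{ij}^{p}$ values that appear in the sums; assigning them to rows $j = 1, \dots, 6$ is my reading of those sums (the slides state the rows explicitly only for $j = 2$ and $j = 4$), and the function names are hypothetical.

```python
LABELS = ["DÜNYA", "SPOR", "SANAT", "Teknoloji"]  # Label_1 .. Label_4

# Rounded Pi_ij^p values; rows j = 1..6, columns p = 1..4 (row assignment assumed).
PI = [
    [0.88, 0.12, 0.00, 0.00],  # kedi
    [0.21, 0.77, 0.01, 0.00],  # yıldır
    [0.43, 0.00, 0.57, 0.00],  # sanat
    [0.53, 0.06, 0.41, 0.00],  # müze
    [0.81, 0.12, 0.04, 0.03],  # gir
    [0.81, 0.13, 0.04, 0.02],  # çalı
]

def pi_average(pi_rows, p):
    """Average of Pi_ij^p over the stems j with Pi_ij^p > 0; 0 if there are none."""
    positive = [row[p] for row in pi_rows if row[p] > 0]
    return sum(positive) / len(positive) if positive else 0.0

def pi_maximum(pi_rows, p):
    """Maximum of Pi_ij^p over the stems j."""
    return max(row[p] for row in pi_rows)

averages = [pi_average(PI, p) for p in range(len(LABELS))]
maxima = [pi_maximum(PI, p) for p in range(len(LABELS))]
print([round(a, 3) for a in averages])
print(maxima)
print(LABELS[averages.index(max(averages))], LABELS[maxima.index(max(maxima))])
```

Both aggregates peak at «DÜNYA», which agrees with the labels reported for Model 3 and Model 4 on the prediction slide.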
Application (computations)
• Some notes:
The algorithm that finds the stem of a word cannot be said to work perfectly, owing to the morphological nature of the Turkish language:
word: yıldır [… for a year] → stem: yıl [year], but the algorithm gives: yıldır(mak) [(to) discourage]
word: çalışıyor [(they) try to] → stem: çalış(mak) [(to) try (to do something)], but the algorithm gives: çalı [bush]
But it works reasonably well in other cases:
word: müzesine [to museum] → stem: müze [museum]
word: girmeye [for the purpose of entering] → stem: gir(mek) [(to) enter]
The imperfect cases are caused by the Turkish stem list that the algorithm uses, because excluding derivational forms from the stem list may lose the true stem.
For example: çalışıyor → çalış(mak) (the true stem, but excluded because it is a derivational form) → çal(mak) (the original stem, but not related to the modern meaning of çalış(mak)). Among these structures, the algorithm gives «çalı», which has a different meaning but is covered by «çalış(mak)». However, this is not a big problem: nearly all documents that include «çalı» are related to «çalış(mak)», because «çalı» is not a common word in modern Turkish.
At this point, the morphological problem amounts to computing a «larger meaning scope than it should be», not a «narrower one than it should be».
Application (prediction)
• for $i = 38296$:
• $predict_1(doc_i)$ = "SPOR"
• $predict_2(doc_i)$ = "SANAT"
• $predict_3(doc_i)$ = "DÜNYA"
• $predict_4(doc_i)$ = "DÜNYA"
• $predict_5(doc_i)$ = "SPOR"
Application (results)
Confusion Matrix for Model 1 (count); rows: observed, columns: prediction
Observed     DÜNYA  SANAT   SPOR  Teknoloji  Total
DÜNYA         1509     20    302         11   1842
SANAT           21     21     14          0     56
SPOR            86      0    312          1    399
Teknoloji       25      0      2          1     28
Confusion Matrix for Model 1 (percentage, rounded to 2 digits); rows: observed, columns: prediction
Observed     DÜNYA  SANAT   SPOR  Teknoloji  Total
DÜNYA         0.82   0.01   0.16       0.01   1.00
SANAT         0.38   0.38   0.25       0.00   1.00
SPOR          0.22   0.00   0.78       0.00   1.00
Teknoloji     0.89   0.00   0.07       0.04   1.00
Accuracy Rate for Model 1: 0.79
Confusion Matrix for Model 2 (count); rows: observed, columns: prediction
Observed     DÜNYA  SANAT   SPOR  Teknoloji  Total
DÜNYA         1189    196    165        292   1842
SANAT            5     39      7          5     56
SPOR            41     26    319         13    399
Teknoloji        6      3      2         17     28
Confusion Matrix for Model 2 (percentage, rounded to 2 digits); rows: observed, columns: prediction
Observed     DÜNYA  SANAT   SPOR  Teknoloji  Total
DÜNYA         0.65   0.11   0.09       0.16   1.00
SANAT         0.09   0.70   0.13       0.09   1.00
SPOR          0.10   0.07   0.80       0.03   1.00
Teknoloji     0.21   0.11   0.07       0.61   1.00
Accuracy Rate for Model 2: 0.67
Application (results)
Confusion Matrix for Model 3 (count); rows: observed, columns: prediction
Observed     DÜNYA  SANAT   SPOR  Teknoloji  Total
DÜNYA         1838      1      2          1   1842
SANAT           55      0      1          0     56
SPOR           291      1    107          0    399
Teknoloji       28      0      0          0     28
Confusion Matrix for Model 3 (percentage, rounded to 2 digits); rows: observed, columns: prediction
Observed     DÜNYA  SANAT   SPOR  Teknoloji  Total
DÜNYA         1.00   0.00   0.00       0.00   1.00
SANAT         0.98   0.00   0.02       0.00   1.00
SPOR          0.73   0.00   0.27       0.00   1.00
Teknoloji     1.00   0.00   0.00       0.00   1.00
Accuracy Rate for Model 3: 0.84
Confusion Matrix for Model 4 (count); rows: observed, columns: prediction
Observed     DÜNYA  SANAT   SPOR  Teknoloji  Total
DÜNYA         1824      0     16          2   1842
SANAT           44      9      3          0     56
SPOR           153      1    245          0    399
Teknoloji       28      0      0          0     28
Confusion Matrix for Model 4 (percentage, rounded to 2 digits); rows: observed, columns: prediction
Observed     DÜNYA  SANAT   SPOR  Teknoloji  Total
DÜNYA         0.99   0.00   0.01       0.00   1.00
SANAT         0.79   0.16   0.05       0.00   1.00
SPOR          0.38   0.00   0.61       0.00   1.00
Teknoloji     1.00   0.00   0.00       0.00   1.00
Accuracy Rate for Model 4: 0.89
Application (results)
Confusion Matrix for Model 5 (count); rows: observed, columns: prediction
Observed     DÜNYA  SANAT   SPOR  Teknoloji  Total
DÜNYA          963    241    243        395   1842
SANAT            8     33      7          8     56
SPOR            56     33    275         35    399
Teknoloji        6      3      3         16     28
Confusion Matrix for Model 5 (percentage, rounded to 2 digits); rows: observed, columns: prediction
Observed     DÜNYA  SANAT   SPOR  Teknoloji  Total
DÜNYA         0.52   0.13   0.13       0.21   1.00
SANAT         0.14   0.59   0.13       0.14   1.00
SPOR          0.14   0.08   0.69       0.09   1.00
Teknoloji     0.21   0.11   0.11       0.57   1.00
Accuracy Rate for Model 5: 0.55
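The accuracy rate and the diagonal of the percentage table can be recomputed from the counts. The sketch below does this for the Model 5 counts; the function names are hypothetical, and the rows and columns follow the order of the tables (DÜNYA, SANAT, SPOR, Teknoloji).

```python
LABELS = ["DÜNYA", "SANAT", "SPOR", "Teknoloji"]  # row and column order of the tables

# Counts of the Model 5 table: rows are observed labels, columns are predictions.
MODEL5 = [
    [963, 241, 243, 395],
    [8, 33, 7, 8],
    [56, 33, 275, 35],
    [6, 3, 3, 16],
]

def accuracy(matrix):
    """Share of documents on the diagonal (observed label equals prediction)."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def recall_per_class(matrix):
    """Diagonal entry divided by the row total, for every observed label."""
    return [row[k] / sum(row) if sum(row) else 0.0 for k, row in enumerate(matrix)]

print(round(accuracy(MODEL5), 2))
print(dict(zip(LABELS, (round(r, 2) for r in recall_per_class(MODEL5)))))
```

Reading the per-class values next to the overall accuracy shows how each model treats the small categories, which a single accuracy figure does not reveal on imbalanced data.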
End of Episode One
