PREDICTION MODELS BASED ON MAX-STEMS Episode One: One-Word Based

PREDICTION MODELS
BASED ON MAX-STEMS
(or harnessing imbalanced data)
Episode One: One-Word Based
Ahmet Furkan EMREHAN
(matahmet@gmail.com)
11/17/2021
EMREHAN 1

PREDICTION MODELS
BASED ON MAX-STEMS
 Episode One: One-Word Based
 Episode Two: A Combinatorial Approach
 Episode Three: Effect of Hyperparameters
 Episode Four: Advanced Examinations

INTRODUCTION
 The quantity of information is growing at a rampant pace.
Correspondingly, the volume of written information soars day by day with social media apps.
Tweets, comments, and tags contribute greatly to that bulk of written
information.

PROBLEM
 Labelling written information (in practice, sentences) is a problem in
supervised learning within the text mining literature.
 Moreover, label frequencies are imbalanced in most cases. For example,
most headlines in a news portal are labelled «breaking news» or
«news flash» in order to attract attention.

MOTIVATION
 Documents (docs) in this context are sentences. Sentences are composed of
ordered words. For each word, one computes its frequency per label over the
sentences with known labels (the train set).
 Word frequencies can give an idea about the label of the sentences in which
the words occur. The models in this study are based on that approach.
 This study proposes a set of solutions for those two problems (labelling and
imbalanced data).
 It is intended as a contribution to the supervised learning literature in the form of a
set of prediction models for text mining.

METHOD (Word to Stem)
 Using words to predict the label of a sentence entails an approach based on
the structure of the relevant language. This study focuses on agglutinative
languages (e.g. Turkish, Hungarian, Estonian, Basque, Japanese, Korean).
 Naturally, in an agglutinative language, the stem of a word is the core part that creates
«meaning». In most cases, a word takes the form of a stem with derivational and/or
inflectional affixes (morphemes).
 Using the word itself to compute frequencies may not be efficient, because each
specific derivational and inflectional form of the word is counted separately.
 For this reason, using the stem is more convenient than using the word: the
stem carries the meaning or concept that the word bears in pure form (without
affixes).

METHOD (Stem to Max-Stem)
 As the length of a stem decreases, its scope of meaning expands
semantically. A short stem may cover a broad meaning that goes beyond the scope of
the word.
 In such cases, choosing the derivational form of the stem that has maximum length
and is still contained in the word (the max-stem) serves the purpose of reasonably
delimiting the scope of meaning of the word.
 That approach is extended to all cases in order to guarantee that the
meaning of the word is preserved. (For more discussion: Step1_turkish_stems_ReadMe.txt)
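The max-stem idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: "contained in the word" is read here as "is a prefix of the word" (Turkish builds words by suffixing), the function name `max_stem` is hypothetical, and the tiny stem list is a toy stand-in for the real stem list.

```python
def max_stem(word, stem_list):
    """Return the longest entry of stem_list that is a prefix of word, or None."""
    candidates = [s for s in stem_list if word.startswith(s)]
    return max(candidates, key=len) if candidates else None

# Hypothetical toy stem list, for illustration only.
toy_stems = {"çal", "çalı", "gir", "müze", "yıl", "yıldır"}

print(max_stem("müzesine", toy_stems))   # müze
print(max_stem("yıldır", toy_stems))     # yıldır (longer than "yıl")
print(max_stem("çalışıyor", toy_stems))  # çalı (longer than "çal")
print(max_stem("kitap", toy_stems))      # None (no stem in the toy list matches)
```

Choosing the longest match keeps the stem as specific as the list allows, which is the point of the max-stem: a shorter stem such as "yıl" would widen the meaning scope.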

COMPONENTS OF MODELS
 $p$: index of categories (labels)
 $Label_p$: category with index $p$
 $n$: number of categories (labels)
 $doc_i$: document in the test set with index $i$ (a sentence or just a headline)
 $stem_{ij}$: stem with index $j$ of $doc_i$
(the stem can be chosen as the max-stem mentioned in the previous slides)
 $m_i$: number of stems $stem_{ij}$ of $doc_i$
 $\Sigma^p$: number of documents labelled with category $p$ in the train set
 $\Sigma_{ij}^p$: number of documents in the train set that include $stem_{ij}$ and are labelled with category $p$

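 $\Lambda_{ij} := Label_q$ where $q = \arg\max_p \Sigma_{ij}^p$
 $\Lambda_i^p$: number of $\Lambda_{ij}$ that equal $Label_p$
 $\lambda_{ij}$: length of $stem_{ij}$
 $\rho_i^p := \dfrac{\sum_{j=1}^{m_i} \Sigma_{ij}^p}{\Sigma^p}$ *
* In case $\Sigma^p = 0$, $\rho_i^p := 0$.
 $\Pi_{ij}^p := \dfrac{\Sigma_{ij}^p}{\sum_{q=1}^{n} \Sigma_{ij}^q}$
(it can be considered as the probability that $stem_{ij}$ is labelled with category $p$)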
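As a small illustration of the per-stem quantities, the sketch below derives $\Pi_{ij}^p$ and $\Lambda_{ij}$ from a list of counts $\Sigma_{ij}^p$. The function names are hypothetical, returning zeros for an unseen stem is an assumption, and the counts are the ones reported for "kedi" and "çalı" in the worked example of these slides.

```python
LABELS = ["DÜNYA", "SPOR", "SANAT", "Teknoloji"]

def stem_probabilities(counts):
    """Pi_ij^p: share of each category among train documents containing the stem."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)   # assumption: stem never seen in the train set
    return [c / total for c in counts]

def stem_label(counts, labels=LABELS):
    """Lambda_ij: label with the largest count (the first index wins a tie)."""
    return labels[counts.index(max(counts))]

sigma_kedi = [15, 2, 0, 0]     # Sigma_i1^p for "kedi"
sigma_cali = [156, 25, 7, 4]   # Sigma_i6^p for "çalı"

print([round(x, 2) for x in stem_probabilities(sigma_kedi)])  # [0.88, 0.12, 0.0, 0.0]
print(stem_label(sigma_kedi), stem_label(sigma_cali))          # DÜNYA DÜNYA
```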

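 $\Pi_i^p$ (average form) $:= \operatorname{average}_{j^*} \Pi_{ij^*}^p$, taken over all $j^*$ that meet the condition $\Pi_{ij^*}^p > 0$ *
* In case $\Sigma_{ij}^p = 0$ for all $p = 1, 2, \dots, n$, $\Pi_i^p := 0$.
 $\Pi_i^p$ (maximum form) $:= \max_j \Pi_{ij}^p$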

General Scheme for Prediction Models
X_test → $doc_i$ → $doc_i = lower(doc_i)$ → $doc_i = clean(doc_i)$ → $token_i = clean(doc_i)$ → $stems_i = max\_stems(token_i)$
$stems_i$ → $stem_{i1}, \dots, stem_{im_i}$
X_train, y_train → Analyze_word → for each $stem_{ij}$ ($j = 1, \dots, m_i$): $\Sigma_{ij}^p$, $\Pi_{ij}^p$, $\Lambda_{ij}$, $\lambda_{ij}$
→ Predictions
Predictions, y_test → Evaluations (accuracy, confusion etc.)
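The first steps of the scheme can be sketched as follows. The bodies of `lower` and `clean` are assumptions (the slides only name the steps), and `tokenize` is a hypothetical helper; the sample sentence is the document used later in the application.

```python
import re

def lower(doc):
    # Assumption: plain str.lower(); Turkish dotted/dotless I may need extra care.
    return doc.lower()

def clean(doc):
    # Assumption: keep only lowercase Turkish letters and whitespace.
    return re.sub(r"[^a-zçğıöşü\s]", " ", doc)

def tokenize(doc):
    # Hypothetical helper: split the cleaned text on whitespace.
    return doc.split()

doc = "2 kedi 2 yıldır sanat müzesine girmeye çalışıyor"
tokens = tokenize(clean(lower(doc)))
print(tokens)
# ['kedi', 'yıldır', 'sanat', 'müzesine', 'girmeye', 'çalışıyor']
```

Dropping the digits here matches the stem list shown for this document in the application, which contains no entry for "2".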

Model 1
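 $predict_1(doc_i) = Label_q$, where
$q$ is such that $\Sigma_{ij}^p = 0 < \Sigma_{ij}^q$ for all $p \neq q$ and for all $j = 1, \dots, m_i$, if such a $q$ exists;
$q = \arg\max_{c(r,j)} \sum_j \Sigma_{ij}^r$ otherwise. *
 * $c(r, j) := (r = \arg \text{2nd} \max_p \Lambda_i^p)$ and $(\Lambda_{ij} = Label_r)$ (because $\arg \text{2nd} \max_p \Lambda_i^p$ may not be unique).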
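One possible reading of Model 1 is sketched below; it is not the author's code. Assumptions: every stem occurs at least once in the train set, "2nd max" means the second entry of the vote counts sorted in descending order, and the tiebreaker sums $\Sigma_{ij}^r$ over the stems $j$ with $\Lambda_{ij} = Label_r$. Apart from the first row, the counts are hypothetical toy values.

```python
LABELS = ["DÜNYA", "SPOR", "SANAT", "Teknoloji"]

def predict1(sigma, labels=LABELS):
    """sigma[j][p] holds Sigma_ij^p for stem j and category p of one test document."""
    n = len(labels)
    # Case 1: every stem occurs only in documents of one and the same category q.
    for q in range(n):
        if all(row[q] > 0 and sum(row) == row[q] for row in sigma):
            return labels[q]
    # Lambda_ij as a category index, then Lambda_i^p as votes per category.
    stem_cat = [row.index(max(row)) for row in sigma]
    votes = [stem_cat.count(p) for p in range(n)]
    second = sorted(votes, reverse=True)[1]            # 2nd largest vote count
    tied = [p for p in range(n) if votes[p] == second]

    def support(r):
        # Tiebreaker: sum of Sigma_ij^r over stems j whose Lambda_ij is Label_r.
        return sum(row[r] for row, c in zip(sigma, stem_cat) if c == r)

    return labels[max(tied, key=support)]

toy_sigma = [
    [15, 2, 0, 0],   # "kedi", counts from the slides
    [3, 11, 1, 0],   # hypothetical
    [6, 0, 8, 0],    # hypothetical
    [9, 1, 7, 0],    # hypothetical
]
print(predict1(toy_sigma))                       # SPOR (votes 2, 1, 1, 0; support 11 vs 8)
print(predict1([[0, 0, 4, 0], [0, 0, 1, 0]]))    # SANAT (case 1)
```

Voting for the second-ranked category rather than the first is what pushes predictions away from the dominant class.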

Model 1
$Label_q$ if $\Sigma_{ij}^p = 0 < \Sigma_{ij}^q$ for all $p \neq q$;
$Label_q$ where $q = \arg \text{2nd} \max_p \Lambda_i^p$ otherwise. *
* In case $q$ is not unique, $q$ is chosen as the minimum index meeting the condition.

Model 2
 $predict_2(doc_i) = Label_q$ where $q = \arg\max_p \rho_i^p$
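Model 2 can be illustrated with the numbers reported for document $i = 38296$ in the application part of these slides. The function names are hypothetical; the sketch only shows the ratio and the arg max.

```python
LABELS = ["DÜNYA", "SPOR", "SANAT", "Teknoloji"]

def rho(stem_count_sums, class_totals):
    """rho_i^p = (sum over j of Sigma_ij^p) / Sigma^p, set to 0 when Sigma^p = 0."""
    return [s / t if t else 0.0 for s, t in zip(stem_count_sums, class_totals)]

def predict2(stem_count_sums, class_totals, labels=LABELS):
    values = rho(stem_count_sums, class_totals)
    return labels[values.index(max(values))]   # first index wins a tie

sums = [311, 99, 29, 8]            # sum over j of Sigma_ij^p for document i = 38296
totals = [7384, 1568, 229, 116]    # Sigma^p in the train set

print([round(v, 3) for v in rho(sums, totals)])  # [0.042, 0.063, 0.127, 0.069]
print(predict2(sums, totals))                    # SANAT
```

Dividing by $\Sigma^p$ rescales each category by its size, so a small category such as «SANAT» can win even though its raw counts are far below those of «DÜNYA».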

Model 3
 $predict_3(doc_i) = Label_q$ where $q = \arg\max_p \Pi_i^p$

Model 4
 $predict_4(doc_i) =$
$Label_q$ where $q = \arg\max_p \Pi_i^p$, if $q$ is unique;
$Label_q$ where $q = \arg\max_r \Pi_i^r$ and $r = \arg\max_p \Sigma_{ij}^p$, otherwise.

Model 5
 $predict_5(doc_i) = Label_q$ where $q = \arg\max_p \left( \max_j \, \lambda_{ij} \cdot \dfrac{\Sigma_{ij}^p}{\Sigma^p} \right)$
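A minimal sketch of the Model 5 score follows. The function name and the zero score for an empty category are assumptions. The input uses only the two stems of the worked example whose counts are listed in the slides ("kedi" and "çalı"), so its output is not expected to match the prediction reported for the full six-stem document.

```python
LABELS = ["DÜNYA", "SPOR", "SANAT", "Teknoloji"]

def predict5(lengths, sigma, class_totals, labels=LABELS):
    """Return (label, scores) with score_p = max_j lambda_ij * Sigma_ij^p / Sigma^p."""
    scores = []
    for p, total in enumerate(class_totals):
        if total == 0:
            scores.append(0.0)   # assumption: an empty category scores zero
            continue
        scores.append(max(l * row[p] / total for l, row in zip(lengths, sigma)))
    return labels[scores.index(max(scores))], scores

lengths = [4, 4]                        # lambda for "kedi" and "çalı"
sigma = [[15, 2, 0, 0],                 # Sigma_i1^p, "kedi"
         [156, 25, 7, 4]]               # Sigma_i6^p, "çalı"
totals = [7384, 1568, 229, 116]         # Sigma^p in the train set

label, scores = predict5(lengths, sigma, totals)
print(label)                            # Teknoloji (for this two-stem subset only)
print([round(s, 3) for s in scores])
```

The stem length $\lambda_{ij}$ acts as a weight, so longer (more specific) stems contribute more to the score than short ones with the same relative frequency.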

Case «No Prediction»
 In some cases, no stem of a test document is included in any document in the train set.
The prediction functions then trivially return the label «No Prediction».
The probability of this is nearly zero if the train set is sufficiently large.
 However, «No Prediction» is more likely in the models with the
Combinatorial Approach in this study, because the probability that all elements
of a combination (a group of stems from a document in the test set) occur in the same
train document is obviously lower than the probability that a single stem
of a test document occurs in a train document.
 Some examples of that case are observed in Episode Two.
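The «No Prediction» case amounts to a guard in front of any prediction function. The sketch below is hypothetical in its names and in the signature of `predict`; it only shows the check that no stem has a nonzero count.

```python
def has_training_evidence(sigma):
    """True if at least one stem of the document occurs in some train document."""
    return any(count > 0 for row in sigma for count in row)

def predict_with_fallback(sigma, predict):
    # `predict` stands for any prediction function taking the count table (hypothetical).
    if not has_training_evidence(sigma):
        return "No Prediction"
    return predict(sigma)

unseen = [[0, 0, 0, 0], [0, 0, 0, 0]]   # two stems, four categories, no matches
seen = [[0, 0, 0, 0], [3, 1, 0, 0]]     # second stem occurs in train documents

print(predict_with_fallback(unseen, predict=lambda s: "DÜNYA"))  # No Prediction
print(predict_with_fallback(seen, predict=lambda s: "DÜNYA"))    # DÜNYA
```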

Case «Not Unique»
 In some cases, the values that generate predictions, such as $\arg\max_p \Pi_i^p$ and $\arg \text{2nd} \max_p \Lambda_i^p$, may not be
unique because of equal values. The models then choose the label with the minimum index as the
prediction, which corresponds to the behaviour of the list structure in Python.
 I use extra parameters (figuratively, tiebreakers), $\Sigma_{ij}^p$ and $\sum_j \Sigma_{ij}^r$, in order to
avoid that case.
 Moreover, as the train set gets larger, the probability of equal values is expected to diminish.
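The minimum-index behaviour can be seen directly with a Python list. The vote counts below are the $\Lambda_i^p$ values of the worked example in these slides; reading "2nd max" as the second entry of the descending sort is an assumption.

```python
votes = [4, 1, 1, 0]                         # Lambda_i^p for p = 1..4 (list indices 0..3)

second = sorted(votes, reverse=True)[1]      # value of the 2nd max
tied = [p for p, v in enumerate(votes) if v == second]

print(second)               # 1
print(tied)                 # [1, 2] -> two categories share the 2nd max
print(votes.index(second))  # 1 -> list.index returns the first (minimum) index
```

Without a tiebreaker, the result depends only on the order in which the categories are listed, which is why the extra parameters are introduced.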

Application (introduction)
 We use data from «nayn.co», a news portal in the Turkish language. The data is imported from the URL
«https://raw.githubusercontent.com/naynco/nayn.data/master/classification_clean.csv».
 The head of the data is presented below.
There are 11622 documents («Title» column), each with a label: «DÜNYA» (World), «SPOR» (Sports), «SANAT» (Art) or
«Teknoloji» (Technology). The data is imbalanced in favor of the category «DÜNYA»: the counts [and percentages] of the
categories are 9226 [79%], 1967 [17%], 285 [2%] and 144 [1%], respectively.
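The reported percentages follow from the counts, as this short check shows. It uses only the numbers quoted above; the variable names are illustrative.

```python
counts = {"DÜNYA": 9226, "SPOR": 1967, "SANAT": 285, "Teknoloji": 144}

total = sum(counts.values())
shares = {label: round(100 * c / total) for label, c in counts.items()}

print(total)   # 11622
print(shares)  # {'DÜNYA': 79, 'SPOR': 17, 'SANAT': 2, 'Teknoloji': 1}
```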

General Scheme for Application of
Prediction Models
X_test → $doc_i$ → $doc_i = lower(doc_i)$ → $doc_i = clean(doc_i)$ → $token_i = clean(doc_i)$ → $stems_i = max\_stems(token_i)$
Turkish Stem List (32001 stems) (for application) → $stems_i = max\_stems(token_i)$ → $stem_{i1}, \dots, stem_{im_i}$
X_train, y_train → Analyze_doc → for each $stem_{ij}$ ($j = 1, \dots, m_i$): $\Sigma_{ij}^p$, $\Pi_{ij}^p$, $\Lambda_{ij}$, $\lambda_{ij}$
→ Predictions
Predictions, y_test → Evaluations (accuracy, confusion etc.)
Notebooks: Step2_Preprocess_word.ipynb, Step3_Classifier_V10.ipynb, Step4_Prediction_Models_ub.ipynb, Step6_Run_Integrated_Model_ub.ipynb

Application (computations)
 The Pandas and Sklearn libraries in Python are used to apply the methods. The test size is
chosen as 0.2 and the random_state parameter for the partition as 57.
 The values of the model parameters are computed below.
 Number of categories: $n = 4$
 Indexes and names of categories: $p = 1, 2, 3, 4$ with $Label_p$ = "DÜNYA", "SPOR", "SANAT"
and "Teknoloji", respectively
 Number of documents per category in the train set: $\Sigma^1 = 7384$, $\Sigma^2 = 1568$, $\Sigma^3 = 229$ and $\Sigma^4 = 116$
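A quick arithmetic check ties these numbers together. It assumes that the test share of 0.2 is rounded up to a whole number of documents, which is consistent with the reported counts; no library call is made here.

```python
import math

n_docs = 11622
test_size = 0.2

n_test = math.ceil(test_size * n_docs)   # assumption: test share rounded up
n_train = n_docs - n_test

train_counts = [7384, 1568, 229, 116]    # Sigma^p for p = 1..4

print(n_test, n_train)       # 2325 9297
print(sum(train_counts))     # 9297
```

The per-category train counts add up to the train size, so the four $\Sigma^p$ values cover the whole train set.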

 Now let's show an example and compute its parameters (the components of the models). We deal with the document with
index number $i = 38296$, rank in test set $= 1356$ (the index number may not be related to the rank).
 $doc_i$: 2 kedi 2 yıldır sanat müzesine girmeye çalışıyor
(en: 2 cats have been trying to enter an art museum for 2 years)
 $label_i$: DÜNYA (en: world)
 stems: ['kedi', 'yıldır', 'sanat', 'müze', 'gir', 'çalı']
 Output of analyze_doc(doc, X_train, y_train) for $j = 1, \dots, m_i = 6$:
table with columns $stem_{ij}$, $\lambda_{ij}$, $\Sigma_{ij}^p$, $\Pi_{ij}^p$, $\Lambda_{ij}$ and category of 2nd max $\Sigma_{ij}^p$ (not used)

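 $i = 38296$
 stems: $stem_{i1}$ = "kedi", $stem_{i2}$ = "yıldır", $stem_{i3}$ = "sanat", $stem_{i4}$ = "müze", $stem_{i5}$ = "gir", $stem_{i6}$ = "çalı"
 length of stems: $\lambda_{i1} = 4$, $\lambda_{i2} = 6$, $\lambda_{i3} = 5$, $\lambda_{i4} = 4$, $\lambda_{i5} = 3$, $\lambda_{i6} = 4$
 $\Sigma_{ij}^p$: number of documents in the train set that include $stem_{ij}$ and are labelled with category $p$
 for $j = 1$ and $j = 6$: $\Sigma_{i1}^1 = 15$, $\Sigma_{i1}^2 = 2$, $\Sigma_{i1}^3 = 0$, $\Sigma_{i1}^4 = 0$; $\Sigma_{i6}^1 = 156$, $\Sigma_{i6}^2 = 25$, $\Sigma_{i6}^3 = 7$, $\Sigma_{i6}^4 = 4$
 $\Lambda_{ij} := Label_q$ where $q = \arg\max_p \Sigma_{ij}^p$: $\Lambda_{i1}$ = "DÜNYA", $\Lambda_{i2}$ = "SPOR", $\Lambda_{i3}$ = "SANAT",
$\Lambda_{i4}$ = "DÜNYA", $\Lambda_{i5}$ = "DÜNYA", $\Lambda_{i6}$ = "DÜNYA"
 $\Lambda_i^p$ (number of $\Lambda_{ij}$ that equal $Label_p$): $\Lambda_i^1 = 4$, $\Lambda_i^2 = 1$, $\Lambda_i^3 = 1$, $\Lambda_i^4 = 0$
 $\Pi_{ij}^p$ for $j = 2$ and $j = 4$: $\Pi_{i2}^1 = 0.21$, $\Pi_{i2}^2 = 0.77$, $\Pi_{i2}^3 = 0.01$, $\Pi_{i2}^4 = 0$; $\Pi_{i4}^1 = 0.53$, $\Pi_{i4}^2 = 0.06$, $\Pi_{i4}^3 = 0.41$, $\Pi_{i4}^4 = 0$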

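 $\rho_i^p$: $\rho_i^1 = \dfrac{311}{7384} = 0.042$, $\rho_i^2 = \dfrac{99}{1568} = 0.063$, $\rho_i^3 = \dfrac{29}{229} = 0.127$, $\rho_i^4 = \dfrac{8}{116} = 0.069$
 $\Pi_i^p$ (average over the stems with $\Pi_{ij}^p > 0$):
$\Pi_i^1 = \dfrac{0.88 + 0.21 + 0.43 + 0.53 + 0.81 + 0.81}{6} = 0.612$, $\Pi_i^2 = \dfrac{0.12 + 0.77 + 0.06 + 0.12 + 0.13}{5} = 0.24$,
$\Pi_i^3 = \dfrac{0.01 + 0.57 + 0.41 + 0.04 + 0.04}{5} = 0.214$, $\Pi_i^4 = \dfrac{0.03 + 0.02}{2} = 0.025$
 $\Pi_i^p$ (maximum over $j$): $\Pi_i^1 = 0.88$, $\Pi_i^2 = 0.77$, $\Pi_i^3 = 0.57$, $\Pi_i^4 = 0.03$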
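The two aggregate forms of $\Pi_i^p$ can be recomputed from the rounded per-stem values. The table below arranges the terms of the sums above by stem, assuming each sum lists its terms in stem order and that a missing term means $\Pi_{ij}^p = 0$; the function name is hypothetical.

```python
LABELS = ["DÜNYA", "SPOR", "SANAT", "Teknoloji"]

# Rounded Pi_ij^p values per stem (rows) and category (columns), see assumption above.
pi = [
    [0.88, 0.12, 0.00, 0.00],  # kedi
    [0.21, 0.77, 0.01, 0.00],  # yıldır
    [0.43, 0.00, 0.57, 0.00],  # sanat
    [0.53, 0.06, 0.41, 0.00],  # müze
    [0.81, 0.12, 0.04, 0.03],  # gir
    [0.81, 0.13, 0.04, 0.02],  # çalı
]

def average_positive(column):
    """Average of the strictly positive entries, 0 if there are none."""
    positive = [v for v in column if v > 0]
    return sum(positive) / len(positive) if positive else 0.0

columns = list(zip(*pi))                                   # one tuple per category
avg_form = [round(average_positive(c), 3) for c in columns]
max_form = [max(c) for c in columns]

print(avg_form)  # [0.612, 0.24, 0.214, 0.025]
print(max_form)  # [0.88, 0.77, 0.57, 0.03]
print(LABELS[avg_form.index(max(avg_form))], LABELS[max_form.index(max(max_form))])  # DÜNYA DÜNYA
```

For this document both forms select «DÜNYA» under the arg max.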

 Some Notes:
The algorithm that finds the stem of a word cannot be said to work perfectly, due to the morphological nature of the Turkish language:
word: yıldır [... for a year] → stem: yıl [year], but the algorithm gives: yıldır(mak) [(to) discourage]
word: çalışıyor [(they) are trying to] → stem: çalış(mak) [(to) try (to do something)], but the algorithm gives: çalı [bush]
But it works reasonably well:
word: müzesine [to the museum] → stem: müze [museum]
word: girmeye [for the purpose of entering] → stem: gir(mek) [(to) enter]
The imperfect cases stem from the Turkish stem list that the algorithm uses: excluding derivational forms from the
stem list may cause the true stem to be lost.
For example: çalışıyor → çalış(mak) (the true stem, but a derivational form and therefore excluded) → çal(mak) (the original stem, but unrelated to the modern meaning
of çalış(mak)). Among these structures, the algorithm gives «çalı», which has a different meaning but is contained in «çalış(mak)». However, this is not a big deal:
nearly all documents that include «çalı» are related to «çalış(mak)», because «çalı» is not a common word in modern Turkish.
The morphological problem at this point leads to a «larger meaning scope than it should be», not a «narrower
one than it should be».

Application (prediction)
 for $i = 38296$:
 $predict_1(doc_i)$ = "SPOR"
 $predict_2(doc_i)$ = "SANAT"
 $predict_3(doc_i)$ = "DÜNYA"
 $predict_4(doc_i)$ = "DÜNYA"
 $predict_5(doc_i)$ = "SPOR"

Application (results)
Confusion Matrix for Model 1: counts (left) and row percentages rounded to 2 digits (right); rows are observed, columns are predicted
Observed    DÜNYA  SANAT  SPOR  Teknoloji  Total  |  DÜNYA  SANAT  SPOR  Teknoloji  Total
DÜNYA        1509     20   302         11   1842  |   0.82   0.01  0.16       0.01      1
SANAT          21     21    14          0     56  |   0.38   0.38  0.25       0         1
SPOR           86      0   312          1    399  |   0.22   0     0.78       0         1
Teknoloji      25      0     2          1     28  |   0.89   0     0.07       0.04      1
Accuracy Rate for Model 1: 0.79

Confusion Matrix for Model 2: counts (left) and row percentages (right)
Observed    DÜNYA  SANAT  SPOR  Teknoloji  Total  |  DÜNYA  SANAT  SPOR  Teknoloji  Total
DÜNYA        1189    196   165        292   1842  |   0.65   0.11  0.09       0.16      1
SANAT           5     39     7          5     56  |   0.09   0.7   0.13       0.09      1
SPOR           41     26   319         13    399  |   0.1    0.07  0.8        0.03      1
Teknoloji       6      3     2         17     28  |   0.21   0.11  0.07       0.61      1
Accuracy Rate for Model 2: 0.67

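Confusion Matrix for Model 3: counts (left) and row percentages rounded to 2 digits (right); rows are observed, columns are predicted
Observed    DÜNYA  SANAT  SPOR  Teknoloji  Total  |  DÜNYA  SANAT  SPOR  Teknoloji  Total
DÜNYA        1838      1     2          1   1842  |   1      0     0          0         1
SANAT          55      0     1          0     56  |   0.98   0     0.02       0         1
SPOR          291      1   107          0    399  |   0.73   0     0.27       0         1
Teknoloji      28      0     0          0     28  |   1      0     0          0         1
Accuracy Rate for Model 3: 0.84

Confusion Matrix for Model 4: counts (left) and row percentages (right)
Observed    DÜNYA  SANAT  SPOR  Teknoloji  Total  |  DÜNYA  SANAT  SPOR  Teknoloji  Total
DÜNYA        1824      0    16          2   1842  |   0.99   0     0.01       0         1
SANAT          44      9     3          0     56  |   0.79   0.16  0.05       0         1
SPOR          153      1   245          0    399  |   0.38   0     0.61       0         1
Teknoloji      28      0     0          0     28  |   1      0     0          0         1
Accuracy Rate for Model 4: 0.89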

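Confusion Matrix for Model 5: counts (left) and row percentages rounded to 2 digits (right); rows are observed, columns are predicted
Observed    DÜNYA  SANAT  SPOR  Teknoloji  Total  |  DÜNYA  SANAT  SPOR  Teknoloji  Total
DÜNYA         963    241   243        395   1842  |   0.52   0.13  0.13       0.21      1
SANAT           8     33     7          8     56  |   0.14   0.59  0.13       0.14      1
SPOR           56     33   275         35    399  |   0.14   0.08  0.69       0.09      1
Teknoloji       6      3     3         16     28  |   0.21   0.11  0.11       0.57      1
Accuracy Rate for Model 5: 0.55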
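The accuracy rates can be recomputed from the count matrices as the diagonal sum divided by the total. In the sketch below the matrices are keyed by model number in the order in which they appear in the slides, which is an assumption for the four matrices that carry no model name; the share of observed «DÜNYA» documents is added only for reference.

```python
# Rows: observed, columns: predicted, both in the order DÜNYA, SANAT, SPOR, Teknoloji.
matrices = {
    1: [[1509, 20, 302, 11], [21, 21, 14, 0], [86, 0, 312, 1], [25, 0, 2, 1]],
    2: [[1189, 196, 165, 292], [5, 39, 7, 5], [41, 26, 319, 13], [6, 3, 2, 17]],
    3: [[1838, 1, 2, 1], [55, 0, 1, 0], [291, 1, 107, 0], [28, 0, 0, 0]],
    4: [[1824, 0, 16, 2], [44, 9, 3, 0], [153, 1, 245, 0], [28, 0, 0, 0]],
    5: [[963, 241, 243, 395], [8, 33, 7, 8], [56, 33, 275, 35], [6, 3, 3, 16]],
}

def accuracy(matrix):
    """Correct predictions (diagonal) divided by all test documents."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

rates = {model: round(accuracy(m), 2) for model, m in matrices.items()}
print(rates)  # {1: 0.79, 2: 0.67, 3: 0.84, 4: 0.89, 5: 0.55}

# Share of observed DÜNYA documents in the test set, for reference.
print(round(sum(matrices[1][0]) / sum(sum(r) for r in matrices[1]), 2))  # 0.79
```

Since 79% of the test documents are «DÜNYA», overall accuracy alone says little about the small categories; the diagonal of the percentage matrices shows how each category fares.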
End of Episode One
