PREDICTION MODELS
BASED ON MAX-STEMS
(or harnessing imbalanced data)
Episode Two: A Combinatorial Approach
Ahmet Furkan EMREHAN
(matahmet@gmail.com)
MOTIVATION
• This study extends the previous study (chapter one) with a combinatorial approach.
• In chapter one, I examined five models that use the distribution of stems separately. Combinations of stems with s elements have the potential to make prediction more efficient, because documents that share a combination of stems may be semantically closer than documents matched by the one-stem-based prediction defined in chapter one.
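ADAPTATION OF COMPONENTS OF MODELS
• $p$, $\mathrm{Label}_p$, $n$, $\mathrm{doc}_i$ and $\Sigma^{p}$, the components of the general parameters, are defined in Chapter One, Slide 8.
• $\mathrm{combo}_{ij}^{s}$: the combination, indexed by $j$, of $s$ stems of $\mathrm{doc}_i$. The stems can be chosen as the max-stems mentioned in the previous slides.
• $m_i^{s}$: the number of combinations $\mathrm{combo}_{ij}^{s}$.
• $\Sigma_{ij}^{p,s}$: the number of documents in the train set that include all elements of $\mathrm{combo}_{ij}^{s}$ and are labelled with the category with index $p$.
• $\Lambda_{ij}^{s} := \mathrm{Label}_q$, where $q = \arg\max_{p} \Sigma_{ij}^{p,s}$.
• $\Lambda_{i}^{p,s}$: the number of $\Lambda_{ij}^{s}$ that are equal to $\mathrm{Label}_p$.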
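The counting definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code of the original notebooks: the toy train set, the helper names `get_combos`, `sigma` and `majority_label`, and the representation of each document as a set of stems are all hypothetical.

```python
from collections import Counter
from itertools import combinations

LABELS = ["DÜNYA", "SPOR", "SANAT", "Teknoloji"]  # Label_1 .. Label_4

# Hypothetical toy train set: every document is already reduced to its set of stems.
TRAIN = [
    ({"sanat", "müze", "sergi"}, "SANAT"),
    ({"sanat", "müze"}, "SANAT"),
    ({"kedi", "gir", "müze"}, "DÜNYA"),
    ({"gir", "çalı", "maç"}, "SPOR"),
]


def get_combos(stems, s):
    """All s-element combinations of the distinct stems of one document."""
    return [frozenset(c) for c in combinations(sorted(set(stems)), s)]


def sigma(combo):
    """Sigma_ij^{p,s}: per-category number of train documents containing the whole combo."""
    counts = {label: 0 for label in LABELS}
    for stems, label in TRAIN:
        if combo <= stems:  # every stem of the combination occurs in the document
            counts[label] += 1
    return counts


def majority_label(counts):
    """Lambda_ij^s: the label with the largest count (the lowest index wins ties)."""
    return max(LABELS, key=lambda label: counts[label])


doc_stems = ["sanat", "müze", "gir"]
combos = get_combos(doc_stems, 2)
# Assumption: only combinations covered by at least one train document are kept.
covered = [c for c in combos if sum(sigma(c).values()) > 0]
# Lambda_i^{p,s}: how many covered combinations point to each label.
label_counts = Counter(majority_label(sigma(c)) for c in covered)
```

In this toy run the pair ('sanat', 'müze') is covered by two SANAT documents and ('gir', 'müze') by one DÜNYA document, while ('gir', 'sanat') is not covered at all and is dropped.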
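ADAPTATION OF COMPONENTS OF MODELS
• $\lambda_{ij}^{s}$: the sum of the lengths of the stems in $\mathrm{combo}_{ij}^{s}$.
• $\rho_{i}^{p,s} := \dfrac{\sum_{j=1}^{m_i^{s}} \Sigma_{ij}^{p,s}}{\Sigma^{p}}$ *
• * In case $\Sigma^{p} = 0$, $\rho_{i}^{p,s} = 0$.
• $\Pi_{ij}^{p,s} := \dfrac{\Sigma_{ij}^{p,s}}{\sum_{q=1}^{n} \Sigma_{ij}^{q,s}}$ (it can be considered as the probability that $\mathrm{combo}_{ij}^{s}$ is labelled with the category with index $p$).
• Average form: $\Pi_{i}^{p,s} := \operatorname{average}_{j^*} \Pi_{ij^*}^{p,s}$, taken over all $j^*$ that meet the condition $\Pi_{ij^*}^{p,s} > 0$ *
* In case $\Sigma_{ij}^{p,s} = 0$ for all $p = 1, 2, \dots, n$, $\Pi_{i}^{p,s} = 0$.
• Maximum form: $\Pi_{i}^{p,s} := \max_{j} \Pi_{ij}^{p,s}$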
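A minimal sketch of these ratios, assuming the counts $\Sigma_{ij}^{p,s}$ of one document are stored as a table with one row per combination and one column per category. The function names and all numbers are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def pi_ij(sigma_row):
    """Pi_ij^{p,s}: share of each category among the train documents covering one combination."""
    total = sum(sigma_row)
    return [count / total if total else 0.0 for count in sigma_row]


def rho(sigma_table, category_totals):
    """rho_i^{p,s}: counts summed over all combinations, divided by the category size Sigma^p."""
    sums = [sum(column) for column in zip(*sigma_table)]
    return [s / t if t else 0.0 for s, t in zip(sums, category_totals)]


def pi_average(pi_table):
    """Average of the strictly positive Pi_ij^{p,s} per category (0 if none is positive)."""
    result = []
    for column in zip(*pi_table):
        positive = [value for value in column if value > 0]
        result.append(sum(positive) / len(positive) if positive else 0.0)
    return result


def pi_maximum(pi_table):
    """Largest Pi_ij^{p,s} per category."""
    return [max(column) for column in zip(*pi_table)]


sigma_table = [[2, 0, 1, 0], [0, 2, 0, 0]]  # hypothetical Sigma_ij^{p,s}, one row per j
category_totals = [10, 4, 2, 1]             # hypothetical Sigma^p
pi_table = [pi_ij(row) for row in sigma_table]
rho_values = rho(sigma_table, category_totals)
```

Dividing by the category size is what lets a small category compete with a large one: here the second and third categories reach 0.5 while the first, with the same summed count as the second, stays at 0.2.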
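General Scheme for Prediction Models with Combinatorial Approach
[Flowchart, summarized in reading order]
• Input: each doc_i in X_test.
• Preprocessing: doc_i = lower(doc_i), doc_i = clean(doc_i), token_i = clean(doc_i), stems_i = max_stems(token_i).
• Combinations: $\mathrm{combo}_{i}^{s}$ = get_combo(stems_i), which yields $\mathrm{combo}_{i1}^{s}, \mathrm{combo}_{i2}^{s}, \dots, \mathrm{combo}_{i m_i^{s}}^{s}$.
• Analyze_word_s, using X_train and y_train, gives $\Sigma_{ij}^{p,s}$, $\Pi_{ij}^{p,s}$, $\Lambda_{ij}^{s}$ and $\lambda_{ij}^{s}$ for each $j = 1, 2, \dots, m_i^{s}$.
• Output: predictions with the combinatorial approach, followed by evaluations (accuracy, confusion matrix, etc.) against y_test.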
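Model 1
• $\mathrm{predict\_cb1}(\mathrm{doc}_i, s) = \begin{cases} \mathrm{Label}_q & \text{if } \Sigma_{ij}^{p,s} = 0 < \Sigma_{ij}^{q,s} \text{ for all } p \neq q \\ \mathrm{Label}_q \text{ where } q = \arg \operatorname{2nd\,max}_{p} \Lambda_{i}^{p,s} & \text{otherwise *} \end{cases}$
* In case $q$ is not unique, $q$ is chosen as the minimum index meeting the condition.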
4/6/2022
EMREHAN 6
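One possible reading of Model 1 as code is sketched below. It rests on two assumptions that the slide does not spell out: the first branch is taken only when the exclusive condition holds for every combination $j$, and "2nd max" means the second-largest distinct value of $\Lambda_{i}^{p,s}$. The function name follows the slide, but the signature and the table layout (one row per combination, one column per category) are hypothetical.

```python
LABELS = ["DÜNYA", "SPOR", "SANAT", "Teknoloji"]


def predict_cb1(sigma_table, labels=LABELS):
    """Sketch of Model 1; sigma_table[j][p] holds Sigma_ij^{p,s} for one document."""
    if not sigma_table:
        return None  # no covered combination: "No Prediction"
    n = len(labels)
    # First branch (assumed to be required for every j): only category q has non-zero counts.
    for q in range(n):
        if all(row[q] > 0 and sum(row) == row[q] for row in sigma_table):
            return labels[q]
    # Lambda_i^{p,s}: how often each category is the arg max (lowest index wins ties).
    wins = [0] * n
    for row in sigma_table:
        wins[row.index(max(row))] += 1
    # Assumed meaning of "2nd max": second-largest distinct value, lowest index on ties.
    distinct = sorted(set(wins), reverse=True)
    target = distinct[1] if len(distinct) > 1 else distinct[0]
    return labels[wins.index(target)]


exclusive = predict_cb1([[1, 0, 0, 0], [3, 0, 0, 0]])
mixed = predict_cb1([[2, 0, 1, 0], [1, 0, 0, 0], [0, 2, 0, 0]])
```

Choosing the runner-up category in the mixed case is a deliberate bias against the dominant class, which fits the imbalanced-data motivation of the slides.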
Model 2
• $\mathrm{predict\_cb2}(\mathrm{doc}_i, s) = \mathrm{Label}_q$, where $q = \arg\max_{p} \rho_{i}^{p,s}$.
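Model 3
• $\mathrm{predict\_cb3}(\mathrm{doc}_i, s) = \mathrm{Label}_q$, where $q = \arg\max_{p} \Pi_{i}^{p,s}$ and $\Pi_{i}^{p,s}$ is the average of the positive $\Pi_{ij}^{p,s}$.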
Model 4
• $\mathrm{predict\_cb4}(\mathrm{doc}_i, s) = \mathrm{Label}_q$, where $q = \arg\max_{p} \Pi_{i}^{p,s}$ and $\Pi_{i}^{p,s} = \max_{j} \Pi_{ij}^{p,s}$ *
* In case $q$ is not unique, $q$ is chosen as the minimum index meeting the condition.
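Models 2, 3 and 4 share one selection step: take the category with the highest score and fall back to the lowest index on ties. A small sketch of that step follows; the helper name `argmax_label` and the score vectors are hypothetical, and treating all-zero scores as «No Prediction» is an assumption.

```python
LABELS = ["DÜNYA", "SPOR", "SANAT", "Teknoloji"]


def argmax_label(scores, labels=LABELS):
    """Label of the highest score; list.index returns the lowest index on ties."""
    if max(scores, default=0) <= 0:
        return None  # assumption: nothing scored, so no prediction is made
    return labels[scores.index(max(scores))]


# Hypothetical score vectors, one entry per category p = 1..4.
by_rho = argmax_label([0.2, 0.5, 0.1, 0.0])      # Model 2 style: scores are rho_i^{p,s}
by_pi_max = argmax_label([1.0, 1.0, 0.0, 0.0])   # Model 4 style: tie between p = 1 and p = 2
uncovered = argmax_label([0.0, 0.0, 0.0, 0.0])
```

With the maximum form, several categories often reach exactly 1, so the tie rule decides the outcome; in this example it returns the first category.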
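Model 5
• $\mathrm{predict\_cb5}(\mathrm{doc}_i, s) = \mathrm{Label}_q$, where $q = \arg\max_{p} \left( \max_{j} \lambda_{ij}^{s} \cdot \dfrac{\Sigma_{ij}^{p,s}}{\Sigma^{p}} \right)$.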
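The Model 5 score weights each relative count by the total stem length of its combination. The sketch below is an illustration only: the signature is hypothetical, the two combinations and their counts are invented, and returning 0 for an empty category mirrors the convention stated for $\rho$ rather than anything the slide says about Model 5. The category sizes are the train-set counts reported in the application.

```python
LABELS = ["DÜNYA", "SPOR", "SANAT", "Teknoloji"]


def predict_cb5(lambdas, sigma_table, category_totals, labels=LABELS):
    """Sketch of Model 5: arg max over p of max over j of lambda_ij * Sigma_ij^p / Sigma^p."""
    if not sigma_table:
        return None  # no covered combination
    scores = []
    for p, total in enumerate(category_totals):
        if total == 0:
            scores.append(0.0)  # assumption: an empty category cannot win
            continue
        scores.append(max(lam * row[p] / total for lam, row in zip(lambdas, sigma_table)))
    return labels[scores.index(max(scores))]


lambdas = [9, 7]                             # hypothetical lambda_ij^s
sigma_table = [[0, 0, 2, 0], [3, 1, 0, 0]]   # hypothetical Sigma_ij^{p,s}
category_totals = [7384, 1568, 229, 116]     # Sigma^p reported for the train set
prediction = predict_cb5(lambdas, sigma_table, category_totals)
```

The third category wins with 9 * 2 / 229, although the first category has the larger raw count, because its score 7 * 3 / 7384 is divided by a much larger category size.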
A Trivial Result
• $\mathrm{predict\_cb}k(\mathrm{doc}_i, 1) = \mathrm{predict}k(\mathrm{doc}_i)$ for $k = 1, 2, \dots, 5$.
• Moreover, for $s = 1$ all parameters are equal to the corresponding parameters in chapter one (slides 8-10). For example, $m_i^{1} = m_i$, $\Sigma_{ij}^{p,1} = \Sigma_{ij}^{p}$ and $\Pi_{i}^{p,1} = \Pi_{i}^{p}$.
Application (introduction)
• We use data from «nayn.co», a news portal in the Turkish language. The data is imported from the URL «"https://raw.githubusercontent.com/naynco/nayn.data/master/classification_clean.csv"», as done in chapter one.
• The head of the data is presented below.
[Table: head of the data]
There are 11622 documents («Title» column), each with a label: «DÜNYA» (World), «SPOR» (Sports), «SANAT» (Art) or «Teknoloji» (Technology). The data is imbalanced in favor of the category «DÜNYA»: the counts [and percentages] of the categories are 9226 [79%], 1967 [17%], 285 [2%] and 144 [1%], respectively.
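The percentages follow directly from the counts. A short check, using only the figures quoted above (the variable names are illustrative):

```python
counts = {"DÜNYA": 9226, "SPOR": 1967, "SANAT": 285, "Teknoloji": 144}
total = sum(counts.values())

# Share of each category, rounded to a whole percent as on the slide.
shares = {label: round(100 * count / total) for label, count in counts.items()}
```

The two smallest categories together make up less than 4% of the documents, which is the imbalance the models try to handle.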
General Scheme for Application of Prediction Models with Combinatorial Approach
[Flowchart, summarized in reading order]
• Input: each doc_i in X_test, together with the Turkish Stem List (32001 stems) used for the application.
• Preprocessing: doc_i = lower(doc_i), doc_i = clean(doc_i), token_i = clean(doc_i), stems_i = max_stems(token_i).
• Combinations: $\mathrm{combo}_{i}^{s}$ = get_combo(stems_i), which yields $\mathrm{combo}_{i1}^{s}, \mathrm{combo}_{i2}^{s}, \dots, \mathrm{combo}_{i m_i^{s}}^{s}$.
• Analyze_word_s, using X_train and y_train, gives $\Sigma_{ij}^{p,s}$, $\Pi_{ij}^{p,s}$, $\Lambda_{ij}^{s}$ and $\lambda_{ij}^{s}$ for each $j = 1, 2, \dots, m_i^{s}$.
• Output: predictions with the combinatorial approach, followed by evaluations (accuracy, confusion matrix, etc.) against y_test.
• Notebooks shown in the flowchart: Step2_Preprocess_word.ipynb, Step3_Classifier_V10.ipynb, Step4_Prediction_Models_cb.ipynb, Step6_Run_Integrated_Model_cb.ipynb.
Application (computations)
• The Pandas and Sklearn libraries in Python are used to apply the methods. The test size is chosen as 0.2 and the random_state parameter for the partition as 57.
• The values of the parameters in the models are computed below.
• Number of categories: $n = 4$
• Indexes and names of the categories: $p = 1, 2, 3$ and $4$; $\mathrm{Label}_p$ = "DÜNYA", "SPOR", "SANAT" and "Teknoloji", respectively
• Counts of the categories in the train set: $\Sigma^{1} = 7384$, $\Sigma^{2} = 1568$, $\Sigma^{3} = 229$ and $\Sigma^{4} = 116$
Application (computations)
• Now let's show an example and compute its parameters (the components of the models). We apply the models to the document used in chapter one, with index number $i = 38296$ and rank in the test set $= 1356$ (the index number may not be related to the rank). We examine the case $s = 2$, because the set of $\mathrm{combo}_{ij}^{s}$ of the document is empty for $s > 2$.
• $\mathrm{doc}_i$: 2 kedi 2 yıldır sanat müzesine girmeye çalışıyor. (en: 2 cats have been trying to enter an art museum for 2 years.)
• $\mathrm{label}_i$: DÜNYA (en: world)
• stems: ['kedi', 'yıldır', 'sanat', 'müze', 'gir', 'çalı']
• Output of analyze_doc(doc, X_train, y_train): $j = 1, \dots, m_i$ with $m_i = 6$
• Output of analyze_doc_s(doc, X_train, y_train, 2): $j = 1, \dots, m_i^{2}$ with $m_i^{2} = 7$
[Table: output of analyze_doc_s, with columns $\mathrm{combo}_{ij}^{2}$, $\lambda_{ij}^{2}$, $\Sigma_{ij}^{p,2}$, $\Pi_{ij}^{p,2}$, $\Lambda_{i}^{p,2}$ and the category of the 2nd max $\Sigma_{ij}^{p,2}$ (not used)]
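Application (computations)
• $i = 38296$, $s = 2$
• combo: $\mathrm{combo}_{i1}^{2}$ = ['sanat', 'müze'], $\mathrm{combo}_{i2}^{2}$ = ['kedi', 'gir'], $\mathrm{combo}_{i3}^{2}$ = ['müze', 'çalı'], $\mathrm{combo}_{i4}^{2}$ = ['kedi', 'yıldır'], $\mathrm{combo}_{i5}^{2}$ = ['gir', 'çalı'], $\mathrm{combo}_{i6}^{2}$ = ['yıldır', 'çalı'], $\mathrm{combo}_{i7}^{2}$ = ['sanat', 'çalı']
• Sum of the lengths of the stems in $\mathrm{combo}_{ij}^{s}$: $\lambda_{i1}^{2} = 9$, $\lambda_{i2}^{2} = 7$, $\lambda_{i3}^{2} = 8$, $\lambda_{i4}^{2} = 9$, $\lambda_{i5}^{2} = 7$, $\lambda_{i6}^{2} = 10$, $\lambda_{i7}^{2} = 9$
• $\Sigma_{ij}^{p,s}$: the number of documents in the train set that include all elements of $\mathrm{combo}_{ij}^{s}$ and are labelled with the category with index $p$
• For $j = 3$ and $j = 7$: $\Sigma_{i3}^{1,2} = 2$, $\Sigma_{i3}^{2,2} = 0$, $\Sigma_{i3}^{3,2} = 1$, $\Sigma_{i3}^{4,2} = 0$, $\Sigma_{i7}^{1,2} = 1$, $\Sigma_{i7}^{2,2} = 0$, $\Sigma_{i7}^{3,2} = 0$, $\Sigma_{i7}^{4,2} = 0$
• $\Lambda_{ij}^{2} := \mathrm{Label}_q$, where $q = \arg\max_{p} \Sigma_{ij}^{p,2}$: $\Lambda_{i1}^{2}$ = "SANAT", $\Lambda_{i2}^{2}$ = "DÜNYA", $\Lambda_{i3}^{2}$ = "DÜNYA", $\Lambda_{i4}^{2}$ = "DÜNYA", $\Lambda_{i5}^{2}$ = "DÜNYA", $\Lambda_{i6}^{2}$ = "SPOR", $\Lambda_{i7}^{2}$ = "DÜNYA"
• $\Lambda_{i}^{p,2}$: the number of $\Lambda_{ij}^{2}$ that are equal to $\mathrm{Label}_p$: $\Lambda_{i}^{1,2} = 5$, $\Lambda_{i}^{2,2} = 1$, $\Lambda_{i}^{3,2} = 1$, $\Lambda_{i}^{4,2} = 0$
• $\Pi_{ij}^{p,2}$ for $j = 1$ and $j = 5$: $\Pi_{i1}^{1,2} = 0$, $\Pi_{i1}^{2,2} = 0$, $\Pi_{i1}^{3,2} = 1$, $\Pi_{i1}^{4,2} = 0$, $\Pi_{i5}^{1,2} = 0.75$, $\Pi_{i5}^{2,2} = 0.25$, $\Pi_{i5}^{3,2} = 0$, $\Pi_{i5}^{4,2} = 0$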
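Two of the steps above are easy to replay by hand. The sketch below derives the shares for $j = 3$ from the reported counts and tallies the reported majority labels; the variable names are illustrative, and the input values are copied from the slide.

```python
from collections import Counter

# Sigma_i3^{p,2} for p = 1..4 (DÜNYA, SPOR, SANAT, Teknoloji), as reported.
sigma_i3 = [2, 0, 1, 0]
pi_i3 = [round(count / sum(sigma_i3), 2) for count in sigma_i3]

# Majority label of each of the seven combinations, as reported.
lambda_ij = ["SANAT", "DÜNYA", "DÜNYA", "DÜNYA", "DÜNYA", "SPOR", "DÜNYA"]
lambda_counts = Counter(lambda_ij)
```

The shares 0.67 and 0.33 obtained for $j = 3$ reappear in the averages computed on the next slide, and the tally matches the reported counts 5, 1, 1 and 0.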
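Application (computations)
• $\rho_{i}^{p,2}$: $\rho_{i}^{1,2} = \frac{8}{7384} = 0.001$, $\rho_{i}^{2,2} = \frac{3}{1568} = 0.002$, $\rho_{i}^{3,2} = \frac{3}{229} = 0.013$, $\rho_{i}^{4,2} = \frac{0}{116} = 0$
• $\Pi_{i}^{p,2}$ (average form): $\Pi_{i}^{1,2} = \frac{1 + 0.67 + 1 + 0.75 + 1}{5} = 0.884$, $\Pi_{i}^{2,2} = \frac{0.25 + 1}{2} = 0.625$, $\Pi_{i}^{3,2} = \frac{1 + 0.33}{2} = 0.665$, $\Pi_{i}^{4,2} = 0$
• $\Pi_{i}^{p,2}$ (maximum form): $\Pi_{i}^{1,2} = 1$, $\Pi_{i}^{2,2} = 1$, $\Pi_{i}^{3,2} = 1$, $\Pi_{i}^{4,2} = 0$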
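The arithmetic on this slide can be replayed with plain Python. The inputs are the sums, category sizes and positive shares quoted in the worked example; the variable names are illustrative.

```python
sums = [8, 3, 3, 0]              # sum over j of Sigma_ij^{p,2}, for p = 1..4
totals = [7384, 1568, 229, 116]  # Sigma^p in the train set
rho = [round(s / t, 3) for s, t in zip(sums, totals)]

# Positive Pi_ij^{p,2} values per category, as listed in the averages.
positive_pi = {
    1: [1, 0.67, 1, 0.75, 1],
    2: [0.25, 1],
    3: [1, 0.33],
    4: [],
}
pi_avg = {p: round(sum(v) / len(v), 3) if v else 0 for p, v in positive_pi.items()}
pi_max = {p: max(v) if v else 0 for p, v in positive_pi.items()}
```

The first category averages five positive values, so its sum 4.42 is divided by 5, which gives 0.884.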
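Application (Predictions)
• For $i = 38296$:
• $\mathrm{predict\_cb1}(\mathrm{doc}_i, 2)$ = SPOR
• $\mathrm{predict\_cb2}(\mathrm{doc}_i, 2)$ = SANAT
• $\mathrm{predict\_cb3}(\mathrm{doc}_i, 2)$ = DÜNYA
• $\mathrm{predict\_cb4}(\mathrm{doc}_i, 2)$ = DÜNYA
• $\mathrm{predict\_cb5}(\mathrm{doc}_i, 2)$ = SANAT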
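The first four predictions can be traced back to the aggregates reported for this document. The sketch below assumes that Model 1 takes its "otherwise" branch (the combinations point to several categories), that "2nd max" means the second-largest distinct count, and that Models 3 and 4 use the average and the maximum form respectively. Model 5 is left out because it needs per-combination counts that the slides list only for some $j$.

```python
labels = ["DÜNYA", "SPOR", "SANAT", "Teknoloji"]

lambda_counts = [5, 1, 1, 0]                      # Lambda_i^{p,2}
rho = [8 / 7384, 3 / 1568, 3 / 229, 0.0]          # rho_i^{p,2}
pi_avg = [0.884, 0.625, 0.665, 0.0]               # average form
pi_max = [1, 1, 1, 0]                             # maximum form


def first_argmax(values):
    """Index of the largest value; the lowest index wins ties."""
    return values.index(max(values))


# Model 1: second-largest distinct count, lowest index among the tied categories.
second = sorted(set(lambda_counts), reverse=True)[1]
pred1 = labels[lambda_counts.index(second)]
pred2 = labels[first_argmax(rho)]
pred3 = labels[first_argmax(pi_avg)]
pred4 = labels[first_argmax(pi_max)]
```

Under these assumptions the four values agree with the predictions listed on the slide: the tie between SPOR and SANAT in Model 1 and the three-way tie in Model 4 are both resolved by the lowest index.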
Application (Results)
Confusion Matrix for Model 1 (count), s = 2 (rows: observed, columns: prediction)
Observed     DÜNYA  SANAT  SPOR  Teknoloji  No Prediction  Total
DÜNYA         1204     47   519         47             25   1842
SANAT           15     23    15          0              3     56
SPOR           201      3   182          0             13    399
Teknoloji       18      0     8          2              0     28
Confusion Matrix for Model 1 (percentage, rounded to 2 digits), s = 2
Observed     DÜNYA  SANAT  SPOR  Teknoloji  No Prediction  Total
DÜNYA         0.65   0.03  0.28       0.03           0.01      1
SANAT         0.27   0.41  0.27          0           0.05      1
SPOR           0.5   0.01  0.46          0           0.03      1
Teknoloji     0.64      0  0.29       0.07              0      1
Accuracy Rate for Model 1: 0.61
Confusion Matrix for Model 2 (count), s = 2 (rows: observed, columns: prediction)
Observed     DÜNYA  SANAT  SPOR  Teknoloji  No Prediction  Total
DÜNYA         1286    174   130        227             25   1842
SANAT            3     41     9          0              3     56
SPOR            30     18   324         14             13    399
Teknoloji       14      0     2         12              0     28
Confusion Matrix for Model 2 (percentage, rounded to 2 digits), s = 2
Observed     DÜNYA  SANAT  SPOR  Teknoloji  No Prediction  Total
DÜNYA          0.7   0.09  0.07       0.12           0.01      1
SANAT         0.05   0.73  0.16          0           0.05      1
SPOR          0.08   0.05  0.81       0.04           0.03      1
Teknoloji      0.5      0  0.07       0.43              0      1
Accuracy Rate for Model 2: 0.72
Application (Results)
Confusion Matrix for Model 3 (count), s = 2 (rows: observed, columns: prediction)
Observed     DÜNYA  SANAT  SPOR  Teknoloji  No Prediction  Total
DÜNYA         1755     17    25         20             25   1842
SANAT           39      8     6          0              3     56
SPOR           159      8   214          5             13    399
Teknoloji       28      0     0          0              0     28
Confusion Matrix for Model 3 (percentage, rounded to 2 digits), s = 2
Observed     DÜNYA  SANAT  SPOR  Teknoloji  No Prediction  Total
DÜNYA         0.95   0.01  0.01       0.01           0.01      1
SANAT          0.7   0.14  0.11          0           0.05      1
SPOR           0.4   0.02  0.54       0.01           0.03      1
Teknoloji        1      0     0          0              0      1
Accuracy Rate for Model 3: 0.85
Confusion Matrix for Model 4 (count), s = 2 (rows: observed, columns: prediction)
Observed     DÜNYA  SANAT  SPOR  Teknoloji  No Prediction  Total
DÜNYA         1805      3     7          2             25   1842
SANAT           47      2     4          0              3     56
SPOR           284      0   102          0             13    399
Teknoloji       28      0     0          0              0     28
Confusion Matrix for Model 4 (percentage, rounded to 2 digits), s = 2
Observed     DÜNYA  SANAT  SPOR  Teknoloji  No Prediction  Total
DÜNYA         0.98      0     0          0           0.01      1
SANAT         0.84   0.04  0.07          0           0.05      1
SPOR          0.71      0  0.26          0           0.03      1
Teknoloji        1      0     0          0              0      1
Accuracy Rate for Model 4: 0.82
Application (Results)
Confusion Matrix for Model 5 (count), s = 2 (rows: observed, columns: prediction)
Observed     DÜNYA  SANAT  SPOR  Teknoloji  No Prediction  Total
DÜNYA          977    289   189        362             25   1842
SANAT            3     40    10          0              3     56
SPOR            25     27   308         26             13    399
Teknoloji       10      0     3         15              0     28
Confusion Matrix for Model 5 (percentage, rounded to 2 digits), s = 2
Observed     DÜNYA  SANAT  SPOR  Teknoloji  No Prediction  Total
DÜNYA         0.53   0.16   0.1        0.2           0.01      1
SANAT         0.05   0.71  0.18          0           0.05      1
SPOR          0.06   0.07  0.77       0.07           0.03      1
Teknoloji     0.36      0  0.11       0.54              0      1
Accuracy Rate for Model 5: 0.58
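The accuracy rates can be recomputed from the count matrices. In the sketch below the rows are the observed categories and the columns the predictions, in the order DÜNYA, SANAT, SPOR, Teknoloji, No Prediction; the function name is illustrative. The reported rates are reproduced when the «No Prediction» documents stay in the denominator, which is how the calculation is written here.

```python
# Rows: observed DÜNYA, SANAT, SPOR, Teknoloji.
# Columns: predicted DÜNYA, SANAT, SPOR, Teknoloji, No Prediction.
MATRICES = {
    1: [[1204, 47, 519, 47, 25], [15, 23, 15, 0, 3], [201, 3, 182, 0, 13], [18, 0, 8, 2, 0]],
    2: [[1286, 174, 130, 227, 25], [3, 41, 9, 0, 3], [30, 18, 324, 14, 13], [14, 0, 2, 12, 0]],
    3: [[1755, 17, 25, 20, 25], [39, 8, 6, 0, 3], [159, 8, 214, 5, 13], [28, 0, 0, 0, 0]],
    4: [[1805, 3, 7, 2, 25], [47, 2, 4, 0, 3], [284, 0, 102, 0, 13], [28, 0, 0, 0, 0]],
    5: [[977, 289, 189, 362, 25], [3, 40, 10, 0, 3], [25, 27, 308, 26, 13], [10, 0, 3, 15, 0]],
}


def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all test documents."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


rates = {model: round(accuracy(matrix), 2) for model, matrix in MATRICES.items()}
```

Models 3 and 4 reach the highest overall rates mainly through the DÜNYA row, while their Teknoloji rows contain no correct prediction, so accuracy alone hides how the small categories fare.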
A Note:
For 25, 3 and 13 documents labelled «DÜNYA», «SANAT» and «SPOR», respectively, every prediction is «No Prediction», because no combination of s = 2 stems of those test documents is covered by a document in the train set. Trivially, predictions based on combinations of s > 2 stems are also «No Prediction» for these documents.
End of Chapter Two
