International Journal of Electrical and Computer Engineering (IJECE)
Vol. 9, No. 1, February 2019, pp. 652~659
ISSN: 2088-8708, DOI: 10.11591/ijece.v9i1.pp652-659
Journal homepage: http://iaescore.com/journals/index.php/IJECE
The effect of training set size in authorship attribution:
application on short Arabic texts
Mohammed AL-Sarem1, Abdel-Hamid Emara2
1Department of Information System, Taibah University, Madinah, KSA
1Department of Computer Science, Saba'a Region University, Mareb, Yemen
2Department of Computer Science, Taibah University, Madinah, KSA
2Computers and Systems Engineering Department, Al-Azhar University, Egypt
Article Info ABSTRACT
Article history:
Received Apr 14, 2018
Revised Sep 3, 2018
Accepted Sep 26, 2018
Authorship attribution (AA) is a subfield of linguistic analysis that aims to
identify the original author of a text among a set of candidate authors. Many
methods and models have been developed for other languages; however, the
number of related works for Arabic is limited. Moreover, the impact of short
text length and training set size has not been well addressed: to the best of
our knowledge, no published work, for Arabic or other languages, investigates
this question. We therefore propose to study this effect, taking into account
different stylometric feature combinations. The Mahalanobis distance (MD),
linear regression (LR), and multilayer perceptron (MLP) are selected as AA
classifiers. During the experiment, the training set size is increased and
the accuracy of each classifier is recorded. The results are quite interesting
and show distinct classifier behaviors. Combining word-based stylometric
features with n-grams provides the best accuracy, reaching 93% on average.
Keywords:
Arabic language
Authorship attribution
Training set size
Linear regression
Mahalanobis distance
MLP classifier
Copyright © 2019 Institute of Advanced Engineering and Science.
All rights reserved.
Corresponding Author:
Mohammed Al-Sarem,
Department of Information System, Taibah University,
PO box 344, Medina, Kingdom of Saudi Arabia.
Email: mohsarem@gmail.com
1. INTRODUCTION
Unlike text categorization, authorship attribution (AA) is a subfield of linguistic analysis that
aims to identify the original author of a text among a set of candidate authors [1]. The basic idea is as
follows: given sets of texts of known authorship, the author of an unseen text is estimated by matching the
anonymous text to one author of the candidate set. The estimation function depends on the extracted human
"stylometric fingerprint" features. Although an author's writing style can change from topic to topic, some
persistent, uncontrolled habits and writing styles remain stable over time.
Despite the importance of the Arabic language, the global position it holds, and its wide
use by hundreds of millions of people, only a very small number of authorship attribution studies that employ
machine-learning methods have been published for Arabic texts [2]. Abbasi and Chen, for example, applied support
vector machines (SVM) and C4.5 decision trees to political and social Arabic web forum messages [3].
Despite their notable results, the dataset had to be quite large to extract enough features. Stamatatos [4]
also tested SVM on Arabic newspaper texts. Although that experiment was conducted to investigate the
class imbalance problem, its findings encouraged us to investigate the effect of text length and training set
size on the performance of AA classifiers. In [2], linear discriminant analysis (LDA) was used, with
function words as features. At the training and testing phases, the texts were divided into two
chunks, the first of 1000 words and the second of 2000 words. The findings indicate that
the longer the text, the higher the obtained performance. The best performance they obtained with the
second chunk was around 87% accuracy. The same findings were reported in [5], where the authors
examined the performance of several classifiers, namely SVM, MLP, linear regression, the Stamatatos
distance, and the Manhattan distance. The results confirmed the findings of Eder (2013) for the English
language [6], namely that the minimum size of textual data is 2500 words per document.
Existing works on authorship attribution are conducted with long texts. Besides that, researchers
in this domain face several problems, especially in Arabic:
a. Research on AA in Arabic is scarce, and the problem is not tackled as much as in other languages;
b. The absence of benchmark datasets and fully adaptable support tools increases the effort required to
extract the necessary features;
c. The nature of the Arabic language adds extra steps to text preprocessing, and words in Arabic tend to be
short, which might reduce the effectiveness of some stylistic features, such as word length
distribution [3].
Like any supervised learning problem, solving an AA problem goes through three key phases: (i) a
data preprocessing phase, in which dataset collection, feature extraction, feature selection, and
dimension reduction are the typical steps; (ii) a training/testing phase, in which the classifiers are selected
and the model is built; and (iii) pattern evaluation, in which k-fold cross validation with different feature
combinations is conducted and the accuracy of the models is computed to assess the performance of the
selected classifier.
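As an illustration of phase (iii), k-fold cross validation can be sketched in a few lines of Python. This is not the authors' implementation; `train_fn` is a hypothetical interface standing in for whichever classifier is being evaluated:

```python
import random

def k_fold_accuracy(docs, labels, train_fn, k=4, seed=0):
    """Estimate classifier accuracy by k-fold cross validation.

    docs/labels: parallel lists; train_fn(train_docs, train_labels)
    must return a predict(doc) callable (hypothetical interface).
    """
    idx = list(range(len(docs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]          # k disjoint folds
    accs = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        predict = train_fn([docs[i] for i in train],
                           [labels[i] for i in train])
        correct = sum(predict(docs[i]) == labels[i] for i in held_out)
        accs.append(correct / len(held_out))
    return sum(accs) / k                           # mean fold accuracy
```

Each document is held out exactly once, so the mean fold accuracy uses every labeled document for testing while keeping training and test sets disjoint.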
This paper tries to draw researchers' attention to enhancing research in the
Arabic language domain. Since the AA problems in the Arabic domain are many, the current study focuses, on one
hand, on investigating the effect of training set size with short texts, in order to
provide insights into how the training set size affects accuracy, and, on the other hand, on
helping guide others in selecting a classifier suited to the size of the available training set so that it gives the
optimum result.
To the best of our knowledge, at the time of preparing this research no published work,
for Arabic or other languages, addresses this question. Since numerous methods have been
developed to tackle the AA problem, we limit our investigation to selected methods and writing
style features; however, the methodology remains valid for assessing other classifiers with different writing style
combinations.
This paper is organized as follows: Section II gives a general overview of the authorship attribution
problem, the selected writing style features, and the selected classifiers. Section III presents the conducted
experiment and its results, and Section IV compares our results with other existing works in terms
of accuracy. Finally, Section V summarizes the paper.
2. AUTHORSHIP ATTRIBUTION
2.1 The selected writing style features
To identify the author of an unseen text, a set of popular features is extracted to quantify the
writing style. Several features can be found in the literature; the interested reader is referred to [1], [5], [7]
and [8]. These features can be categorized into seven main groups: lexical, character, syntactic, semantic,
content-specific, structural, and language-specific. In this paper, lexical features, namely word-based
features, have been selected. They were combined with the n-gram method and part-of-speech (POS) tags and
organized into different sets for testing as follows:
Set1: Word-based Lexical Features (WLF)
Set2: N-gram
Set3: Part-of-Speech (POS)
Set4: N-gram + POS
Set5: WLF + N-gram
Set6: WLF + POS
Set7: WLF + N-gram + POS
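As an illustration (not the authors' code), Set1 and Set2 can be sketched in plain Python. The specific WLF choices below (word count, average word length, type/token ratio) are common examples and are assumptions; POS extraction is omitted because it requires an Arabic POS tagger:

```python
from collections import Counter

def word_features(text):
    """Word-based lexical features (a sketch of Set1: WLF)."""
    words = text.split()
    total = len(words)
    return {
        "n_words": total,
        "avg_word_len": sum(len(w) for w in words) / total,
        "vocab_richness": len(set(words)) / total,  # type/token ratio
    }

def word_ngrams(text, n=5):
    """Word n-gram counts (a sketch of Set2; the paper uses n = 5)."""
    words = text.split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
```

The combined sets (Set4 through Set7) would simply concatenate the feature vectors produced by the individual extractors.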
2.2 The selected classifiers
Various machine learning methods can be used for AA; since many candidate classifiers exist, our
selection was made arbitrarily, and the following methods were used: Mahalanobis distance (MD),
linear regression (LR), and multilayer perceptron (MLP).
2.2.1. Mahalanobis distance
The Mahalanobis distance is a measure of the distance between an observation
$\vec{x} = (x_1, x_2, \ldots, x_N)^T$ and a distribution $D$ with mean vector
$\vec{\mu} = (\mu_1, \mu_2, \ldots, \mu_N)^T$ and covariance matrix $S$:

$$D_M(\vec{x}) = \sqrt{(\vec{x} - \vec{\mu})^T S^{-1} (\vec{x} - \vec{\mu})} \quad (1)$$
It can also be defined as a dissimilarity measure between two random vectors $\vec{x}$ and $\vec{y}$ of the
same distribution with covariance matrix $S$:

$$d(\vec{x}, \vec{y}) = \sqrt{(\vec{x} - \vec{y})^T S^{-1} (\vec{x} - \vec{y})} \quad (2)$$
In the case where the covariance matrix is the identity matrix, the Mahalanobis distance
reduces to the Euclidean distance. In addition, if the covariance matrix is diagonal, the resulting
distance measure is called the normalized Euclidean distance:

$$d(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{N} \frac{(x_i - y_i)^2}{s_i^2}} \quad (3)$$

where $s_i$ is the standard deviation of $x_i$ and $y_i$ over the sample set. The Mahalanobis distance is
widely used in LDA, clustering analysis, and classification techniques.
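Equations (1) through (3) are easy to check numerically. The sketch below is illustrative rather than part of the paper's toolchain, and assumes NumPy is available:

```python
import numpy as np

def mahalanobis(x, y, S):
    """Mahalanobis distance between vectors x and y, eq. (2)."""
    d = np.asarray(x, float) - np.asarray(y, float)
    return float(np.sqrt(d @ np.linalg.inv(S) @ d))
```

With `S = np.eye(n)` the distance reduces to the Euclidean distance, and with a diagonal `S` it equals the normalized Euclidean distance of eq. (3), which is a quick sanity check on an implementation.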
2.2.2. Linear regression
LR is one of the oldest methods widely used in predictive analysis. The main idea behind it is
to fit a straight line to a set of data points by minimizing the sum of squared errors. The LR model assumes
that the relationship between the dependent variable $y_i$ and the $p$-vector of regressors $x_i$ is linear.
Formally, the model takes the form:

$$y = X\beta + \varepsilon \quad (4)$$

where $y$ is the regressand or response variable, $X$ contains the regressors or predictor variables, $\beta$ is the
vector of regression coefficients ("estimated effects"), and $\varepsilon$ is an error term.
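A minimal least-squares fit of eq. (4) can be written as follows (assuming NumPy; the helper name is hypothetical):

```python
import numpy as np

def fit_linear_regression(X, y):
    """Least-squares estimate of beta in y = X beta + eps, eq. (4)."""
    # Prepend a column of ones so beta[0] is the intercept.
    X = np.column_stack([np.ones(len(X)), np.asarray(X, float)])
    beta, *_ = np.linalg.lstsq(X, np.asarray(y, float), rcond=None)
    return beta  # [intercept, slope(s)]
```

For classification, such a fitted model is typically wrapped with a decision rule (e.g., rounding or one-vs-rest scoring over candidate authors).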
2.2.3. Multilayer perceptron
The MLP is a classical neural network classifier in which the errors at the output are used to train
the network [9]. When applying an MLP classifier, one should take into consideration the high computational
effort of training [10] and the risk of falling into a local minimum, which leads to classification errors. By
nature, the MLP is a class of feedforward network. It consists of at least three layers of nodes, with one or
more hidden layers. The network is trained using one of several back-propagation techniques.
Figure 1 represents an MLP with a single hidden layer (the interested reader can refer to [11]).
Figure 1. A MLP with single hidden layer
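For concreteness, the forward pass of a network like the one in Figure 1 can be sketched as follows. The sigmoid hidden layer and linear output are illustrative assumptions, and back-propagation training is omitted:

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """One forward pass through a single-hidden-layer MLP:
    a sigmoid hidden layer followed by a linear output layer."""
    h = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))  # hidden activations
    return W2 @ h + b2                         # output scores
```

In an AA setting, `x` would be the stylometric feature vector of a document and the output scores would be compared across candidate authors.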
3. EXPERIMENT AND FINDING RESULTS
3.1. Dataset
Our corpus consists of 4631 text documents extracted from the Dar Al-ifta AL
Misriyyah website. The texts are Islamic fatwas issued from 1896 to 1996. The original texts were unstructured
and required a special tool to extract the fatwa structure from the whole documents; for this purpose, a C# tool
was implemented. Since we seek to investigate the effects of size and word length on AA classifiers, we
grouped the fatwas by word length and size into different groups. The set with the smallest text size is
used later in the experiment. Table 3 summarizes the statistical features of the dataset. The text
size ranges widely, from 11 words to 1500 words per text.
3.2. Design of experiment
As mentioned previously, the AA problem can be considered a supervised learning task, which means
the model must be trained and then tested on a set of unseen texts. Of the whole text mining
process, we focus here only on the training/testing phases and evaluate the performance of the classifiers by the
number of correctly attributed texts. As said earlier, combinations of WLF, n-gram, and POS features are used
and three classifiers are employed (Mahalanobis distance, MLP, and linear regression).
Figure 2 presents, in addition to the typical solution of an AA problem, the procedure we followed
during the experiment (the lines colored in red). The general procedure of the experiment is as follows:
a. After executing the whole training and attribution phase, the results are recorded. At the
next iteration, the training set size is increased and the whole procedure is executed again. We
limit the experiment to four iterations; however, the number of iterations can be increased
depending on how the training sets are partitioned.
b. The classifier type remains the same until all the partitioned training sets have been investigated.
c. All writing feature combinations are examined, and the accuracy of the model is recorded.
To avoid bias, the partitioned training sets were collected carefully and t-tests were
conducted to ensure that the training sets do not differ significantly. The training sets were organized as
follows:
follows:
a. TS1: the training set with 330 documents
b. TS2: the training set with 378 documents
c. TS3: the training set with 414 documents
d. TS4: the training set with 438 documents
Figure 2. Design of experiment
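The loop structure of the procedure above can be sketched as follows; `train_fn` and `accuracy_fn` are hypothetical hooks standing in for the actual training and evaluation steps, not the authors' code:

```python
def run_experiment(training_sets, test_set, classifiers, feature_sets,
                   train_fn, accuracy_fn):
    """Sketch of the experimental design: for each classifier and feature
    combination, retrain on each (growing) training set and record accuracy."""
    results = {}
    for clf in classifiers:               # classifier stays fixed while the
        for feats in feature_sets:        # training sets are swept (step b)
            for name, ts in training_sets.items():
                model = train_fn(clf, feats, ts)
                results[(clf, feats, name)] = accuracy_fn(model, feats, test_set)
    return results
```

Running this over TS1 through TS4 yields the accuracy grids reported in Tables 1, 2, and 4.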
3.3. Experimental findings
3.3.1. Authorship attribution using WLF features
We recorded the accuracy of the classifiers with the different training sets, as
shown in Table 1. As we can see, the classifiers performed differently across the training sets: in most of
them, MLP outperforms the other classifiers as the training set is increased.
Table 1. Performance of classifiers with WLF features
Classifier Type TS1 TS2 TS3 TS4
MLP 75% 79.167% 91.67% 77.08%
LR 83.33% 70.83% 91.67% 72.916%
MD 58.33% 37.5% 66.67% 64.58%
An interesting finding is that the accuracy of the MLP classifier increased with the
training set size up to TS3, then dropped sharply. The same can be said of the LR and MD methods, with some
nuances. LR performed better than MD and also gives better results when the training
set is quite small.
3.3.2. Authorship attribution using N-gram features
We next investigate the impact of training set size on AA using word n-grams. The value of n
can be chosen freely; to capture more words, we set n to five. Table 2
shows the performance of the classifiers for each training set.
Table 2. Performance of classifiers with N-gram
Classifier Type TS1 TS2 TS3 TS4
MLP 100% 91.67% 91.67% 89.58%
LR 100% 87.5% 91.67% 89.58%
MD 100% 95.83% 94.44% 95.83%
Table 3. Statistical features of the dataset
Statistic  # words  # sentences  Avg. words/sent.  # paragraphs  Avg. word length
Mean       125.26   4.51         28.04             12.4975       4.514755
St. Dev.   147.73   3.2850       22.115            12.5138       0.2897
Variance   21824.9  10.79        489.079           156.596       0.08395
Max.       1844     31           195.5             160           5.585
Min.       11       2            4.75              160           3.761
Comparing the classifiers' performance with n-grams to their performance with WLF, we note an
improvement in AA accuracy. However, as the training set size increases, the performance deteriorates.
Furthermore, the MD method outperforms the other classifiers even as the training set grows, as shown
in Figure 3.
Figure 3. Classifier Performance with N-gram feature
3.3.3. Authorship attribution using POS features
With the POS method, the classifiers behave differently. The accuracy of both the MLP and
LR methods increases with the training set size up to a certain point, then decreases again as the size
grows; the drop is more significant for MLP than for LR. The MD
method shows the opposite behavior, and its accuracy remains better than that of the other classifiers, as
shown in Figure 4.
Figure 4. Classifier performance with POS
3.3.4. Authorship attribution with all features
In this part, we investigate the performance of the classifiers with different feature combinations;
Table 4 summarizes the results. In most cases, LR outperforms the other
classifiers. MD yields low accuracy, especially when the WLF features are combined with
POS. In general, combining POS with n-grams enhances the performance of the attribution task. However, the
accuracy is still lower than that of the WLF and n-gram combination. It is also notable that classifier
performance is negatively affected by increasing the training set size, and, contrary to expectations, applying
the full combination of WLF, n-grams, and POS decreases the accuracy of AA.
Table 4. Performance of classifiers with different feature combinations
Feature Set  Classifier  TS1     TS2     TS3     TS4
WLF+N-gram   MLP         69.44%  83.33%  91.67%  83.33%
WLF+N-gram   LR          62.5%   83.33%  91.67%  66.67%
WLF+N-gram   MD          41.67%  60.42%  61.11%  50%
WLF+POS      MLP         58.33%  66.67%  83.33%  79.167%
WLF+POS      LR          48.61%  70.83%  86.11%  79.167%
WLF+POS      MD          34.22%  41.67%  47.22%  37.5%
All          MLP         66.67%  83.33%  69.44%  87.5%
All          LR          62.5%   87.5%   94.44%  62.5%
All          MD          30.56%  56.25%  61.11%  41.67%
N-gram+POS   MLP         79.17%  81.25%  91.67%  91.67%
N-gram+POS   LR          86.11%  79.17%  91.67%  87.5%
N-gram+POS   MD          62.5%   62.5%   77.78%  75%
4. COMPARISON WITH OTHER WORKS
As concluded in the previous sections, combining word-based stylometric features
with n-grams provides the best accuracy, reaching 93% on average. This section therefore compares the
accuracy of the classifiers used in the reviewed works with our findings.
Even though the results come from different datasets, the comparison gives an indication of the performance of
the different methods. Table 5 presents the accuracy of the classifiers used in the works
mentioned throughout this paper regardless of text size, whilst Table 6
presents the accuracy taking text size into consideration.
Table 5. Comparison of the best reached accuracy in this work with accuracy of other methods regardless the
text size
Reference Accuracy Data
Our Work 100% Islamic Fatwas collected from Dar Al-ifta AL Misriyyah website
Shaker and Corne [2] 87.63% Arabic books obtained from the website of the Arab Writers Union
Abbasi and Chen [3] 85.4% Arabic web forum messages from Yahoo groups
Stamatatos [4] 93.4% Arabic newspaper report of Alhayat
Ouamour et al. [5] 100% Ancient historical books in Arabic
Altheneyan and El Bachir Menai [12] 97.84% Arabic books collected from Alwaraq website
Table 6. Comparison of the best reached accuracy in this work with accuracy of other methods regarding the
text size
Reference Accuracy Data Size
Our Work 93% [11-1500] words per document
Ouamour et al. [5] 80% [100-1500] words per document
Al-Ayyoub et al. [8] 59.8% Up to 140 words per tweets
5. CONCLUSIONS AND FUTURE WORK
This study has investigated the effect of increasing training set size on the performance of authorship
attribution classifiers. The experiment was designed to measure the accuracy of classifiers on short Arabic
texts and to investigate the performance of AA classifiers using different combinations of
stylometric features. We limited our experiment to three well-known classifiers, namely linear
regression, multilayer perceptron, and Mahalanobis distance; however, the research methodology remains
valid for other classifiers.
The overall results show that the classifiers behave differently depending on the AA features used
and the training set size. The MLP classifier exceeds the other classifiers when the WLF features are used;
up to a certain point, MLP benefits from the increase in the training set size, after which its accuracy
decreases. With the n-gram features, the classifiers' performance decreases as the training set size increases;
nevertheless, generally speaking, the n-gram features provide the best results among all the stylometric
features. With POS features, the classifiers again behave differently: MLP degrades significantly compared
with LR, while MD shows a different behavior and provides better accuracy than both MLP and LR.
We also applied the classifiers with different feature combinations (WLF+n-gram, WLF+POS, n-
gram+POS, and the set of all features). MD yields low accuracy, especially when the WLF features are
combined with POS. Combining POS with n-grams improves the performance of the attribution task. Contrary
to expectations, applying the full combination of WLF, n-grams, and POS decreases the
accuracy of AA.
For future work, we intend to extend the experiments to investigate the accuracy of other classifiers
and larger training sets. We also plan to investigate the impact of other feature selection methods on the
performance of AA.
REFERENCES
[1] A. Al-Falahi, M. Ramdani, M. Bellafkih, and M. Al-Sarem, "Authorship Attribution in Arabic Poetry Context Using
Markov Chain Classifier," IEEE, 2015.
[2] K. Shaker and D. Corne, "Authorship Attribution in Arabic Using a Hybrid of Evolutionary Search and Linear
Discriminant Analysis," in 2010 UK Workshop on Computational Intelligence (UKCI), pp. 1-6, 2010,
doi: 10.1109/UKCI.2010.5625580.
[3] A. Abbasi and H. Chen, "Applying Authorship Analysis to Arabic Web Content," in P. Kantor, G. Muresan,
F. Roberts, D. D. Zeng, F.-Y. Wang, H. Chen, and R. C. Merkle (Eds.), Intelligence and Security Informatics, vol.
3495, Springer-Verlag, Berlin, Heidelberg, pp. 183-197, 2005.
[4] E. Stamatatos, "Author Identification: Using Text Sampling to Handle the Class Imbalance Problem," Information
Processing & Management, vol. 44, no. 2, pp. 790-799, 2008, doi: 10.1016/j.ipm.2007.05.012.
[5] S. Ouamour, S. Khennouf, S. Bourib, H. Hadjadj, and H. Sayoud, "Effect of the Text Size on Stylometry -
Application on Arabic Religious Texts," in Advanced Computational Methods for Knowledge Engineering, pp. 215-
228, Springer, Cham, 2016.
[6] M. Eder, "Does Size Matter? Authorship Attribution, Small Samples, Big Problem," Literary and Linguistic
Computing, 2013, doi: 10.1093/llc/fqt066.
[7] S. El Manar El Bouanani and I. Kassou, "Authorship Analysis Studies: A Survey," International Journal of
Computer Applications, vol. 86, no. 12, January 2014.
[8] M. Al-Ayyoub, Y. Jararweh, A. Rababa'ah, and M. Aldwairi, "Feature Extraction and Selection for Arabic Tweets
Authorship Authentication," Journal of Ambient Intelligence and Humanized Computing, vol. 8, pp. 383-393, 2017,
doi: 10.1007/s12652-017-0452-1.
[9] H. Sayoud, "Automatic Speaker Recognition - Connexionnist Approach," PhD thesis, USTHB University, Algiers,
2003.
[10] F. J. Tweedie, S. Singh, and D. I. Holmes, "Neural Network Applications in Stylometry: The Federalist Papers,"
Computers and the Humanities, vol. 30, pp. 1-10, 1996.
[11] S. K. Pal and S. Mitra, "Multilayer Perceptron, Fuzzy Sets, and Classification," IEEE Transactions on Neural
Networks, vol. 3, no. 5, pp. 683-697, 1992.
[12] A. S. Altheneyan and M. E. B. Menai, "Naïve Bayes Classifiers for Authorship Attribution of Arabic Texts," Journal
of King Saud University - Computer and Information Sciences, vol. 26, no. 4, pp. 473-484, 2014.
BIOGRAPHIES OF AUTHORS
Mohammed Al-Sarem is an assistant professor of information systems at Taibah University,
Al Madinah Al Monawarah, KSA. He received his PhD in Informatics from Hassan II
University, Mohammedia, Morocco, in 2014. His research interests center on e-learning,
educational data mining, Arabic text mining, and intelligent and adaptive systems. He has published
several research papers and participated in several local and international conferences.
Abdel Hamid Emara received his BSc, MSc, and PhD in Systems and Computers Engineering
from Al-Azhar University in 1992, 2000, and 2006, respectively. He works as a lecturer in the Systems
and Computers Engineering Department at Al-Azhar University in Cairo, Egypt, and is currently an
Assistant Professor in the Computer Science Department at Taibah University, Al Madinah Al
Monawarah, KSA. He has 12 years of experience in both academia and research. His research
interests include knowledge discovery and data mining, data security, data analysis, and artificial
intelligence. He has published papers in quality international journals.
PDF
Visual Aids for Exploratory Data Analysis.pdf
PDF
COURSE DESCRIPTOR OF SURVEYING R24 SYLLABUS
PDF
Automation-in-Manufacturing-Chapter-Introduction.pdf
PDF
BIO-INSPIRED ARCHITECTURE FOR PARSIMONIOUS CONVERSATIONAL INTELLIGENCE : THE ...
Level 2 – IBM Data and AI Fundamentals (1)_v1.1.PDF
Artificial Intelligence
null (2) bgfbg bfgb bfgb fbfg bfbgf b.pdf
Categorization of Factors Affecting Classification Algorithms Selection
R24 SURVEYING LAB MANUAL for civil enggi
III.4.1.2_The_Space_Environment.p pdffdf
Nature of X-rays, X- Ray Equipment, Fluoroscopy
Fundamentals of Mechanical Engineering.pptx
22EC502-MICROCONTROLLER AND INTERFACING-8051 MICROCONTROLLER.pdf
Exploratory_Data_Analysis_Fundamentals.pdf
Sorting and Hashing in Data Structures with Algorithms, Techniques, Implement...
Management Information system : MIS-e-Business Systems.pptx
Abrasive, erosive and cavitation wear.pdf
Unit I ESSENTIAL OF DIGITAL MARKETING.pdf
Artificial Superintelligence (ASI) Alliance Vision Paper.pdf
Software Engineering and software moduleing
Visual Aids for Exploratory Data Analysis.pdf
COURSE DESCRIPTOR OF SURVEYING R24 SYLLABUS
Automation-in-Manufacturing-Chapter-Introduction.pdf
BIO-INSPIRED ARCHITECTURE FOR PARSIMONIOUS CONVERSATIONAL INTELLIGENCE : THE ...

The effect of training set size in authorship attribution: application on short Arabic texts

All rights reserved.

Corresponding Author:
Mohammed Al-Sarem,
Department of Information System, Taibah University,
PO box 344, Medina, Kingdom of Saudi Arabia.
Email: mohsarem@gmail.com

1. INTRODUCTION

Unlike text categorization, authorship attribution (AA) is a subfield of linguistic analysis that aims to identify the original author of a text among a set of candidate authors [1]. The task can be stated as follows: given sets of texts of known authorship, the author of an unseen text is estimated by matching the anonymous text to one author of the candidate set. The estimation function depends on the extracted human "stylometric fingerprint" features. Although an author's writing style can change from topic to topic, some persistent, uncontrolled habits remain stable over time.

Despite the importance of the Arabic language, the global position it holds, and its use by hundreds of millions of people, only a very small number of authorship attribution studies employing machine-learning methods have been published for Arabic texts [2]. Abbasi and Chen, for example, applied support vector machines (SVM) and C4.5 decision trees to political and social Arabic web forum messages [3]. Although they obtained notable results, the dataset had to be quite large to extract enough features. Stamatatos [4] also tested the use of SVM on Arabic newspaper texts. Although that experiment was designed to investigate the class imbalance problem, its findings encouraged us to investigate how training set size and text length affect the performance of AA classifiers. In [2], linear discriminant analysis (LDA) was used, with function words as features.
During the training and testing phases, the texts were divided into two chunks: the first chunk of 1000 words and the second of 2000 words. The findings indicate that the longer the text, the higher the obtained performance. The best performance they obtained with the
second chunk was around 87% accuracy. The same findings were reported in [5], in which the authors examined the performance of several classifiers, namely SVM, MLP, linear regression, the Stamatatos distance, and the Manhattan distance. The results confirmed the findings of Eder in 2013 for the English language [6], namely that the minimum size of textual data is 2500 words per document.

Existing works on authorship attribution are conducted with long texts. Beyond that, researchers in this domain face several problems, especially in Arabic:
a. Research on AA in Arabic is scarce, and the problem is not tackled as extensively as in other languages;
b. The absence of benchmark datasets and fully adaptable support tools increases the effort required to extract the necessary features;
c. The nature of the Arabic language adds extra text-preprocessing steps, and words in Arabic tend to be short, which might reduce the effectiveness of some stylistic features, such as the word length distribution [3].

Like any supervised learning problem, solving an AA problem goes through three key phases: (i) data preprocessing, where dataset collection, feature extraction, feature selection, and dimension reduction are the typical steps; (ii) training/testing, where the classifiers are selected and the model is built; and (iii) pattern evaluation, where k-fold cross validation with different feature combinations is conducted and the accuracy of the models is computed.

This paper tries to draw researchers' attention to the need for more work on the Arabic language domain. Since the AA problems in the Arabic domain are many, the current study focuses on investigating the effect of training set size with short text lengths.
The reason behind this selection is to provide insight into how the training set size impacts attribution accuracy, and to help guide others in selecting a suitable classifier, based on the size of the training set, that will give the optimal result. To the best of our knowledge, no published work addresses this question, even for other languages. Since numerous methods have been developed to tackle the AA problem, we limit our investigation to selected methods and writing-style features; the approach, however, remains valid for assessing other classifiers with different writing-style combinations.

This paper is organized as follows: Section II gives a general overview of the authorship attribution problem, the selected writing styles, and the selected classifiers. Section III presents the conducted experiment and its results, and Section IV compares our results with other existing works in terms of accuracy. Finally, Section V summarizes the whole paper.

2. AUTHORSHIP ATTRIBUTION
2.1. The selected writing style features

To identify the author of an unseen text, a set of popular features is extracted in order to quantify the writing style. Several such features can be found in the literature; the interested reader is referred to [1], [5], [7], and [8]. These features can be categorized into seven main groups: lexical, character, syntactic, semantic, content-specific, structural, and language-specific. In this paper, lexical features, namely word-based features, have been selected. These features were combined with the n-gram method and part-of-speech tags, and organized into different sets for testing as follows:
Set1: Word-based Lexical Features (WLF)
Set2: N-gram
Set3: Part-of-Speech (POS)
Set4: N-gram + POS
Set5: WLF + N-gram
Set6: WLF + POS
Set7: WLF + N-gram + POS

2.2. The selected classifiers

Since various machine learning methods can be used for AA and many classifiers could be selected, our selection was made arbitrarily, and the following methods were used: Mahalanobis distance (MD), Linear Regression (LR), and Multilayer Perceptron (MLP).

2.2.1. Mahalanobis distance

The Mahalanobis distance is a measure of the distance between an observation $\vec{x} = (x_1, x_2, \ldots, x_N)^T$ and the mean vector $\vec{\mu} = (\mu_1, \mu_2, \ldots, \mu_N)^T$ of a distribution D with covariance matrix S, defined as follows:
$D_M(\vec{x}) = \sqrt{(\vec{x} - \vec{\mu})^T S^{-1} (\vec{x} - \vec{\mu})}$   (1)

It can also be defined as a dissimilarity measure between two random vectors $\vec{x}$ and $\vec{y}$ of the same distribution with covariance matrix S:

$d(\vec{x}, \vec{y}) = \sqrt{(\vec{x} - \vec{y})^T S^{-1} (\vec{x} - \vec{y})}$   (2)

In the case where the covariance matrix is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance. If the covariance matrix is diagonal, the resulting distance measure is called the normalized Euclidean distance:

$d(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{N} \frac{(x_i - y_i)^2}{s_i^2}}$   (3)

where $s_i$ is the standard deviation of $x_i$ and $y_i$ over the sample set. The Mahalanobis distance is widely used in LDA, cluster analysis, and classification techniques.

2.2.2. Linear regression

LR is one of the oldest methods in predictive analysis. The main idea behind it is to minimize the sum of squared errors when fitting a straight line to a set of data points. The LR model assumes that the relationship between the dependent variable $y_i$ and the p-vector of regressors $x_i$ is linear. Formally, the model takes the form:

$y = X\beta + \varepsilon$   (4)

where $y$ is the regressand or response variable, $X$ contains the regressors or predictor variables, $\beta$ is the vector of regression coefficients ("estimated effects"), and $\varepsilon$ is an error term.

2.2.3. Multilayer perceptron

The MLP is a classical neural network classifier in which the errors of the output are used to train the network [9]. When applying an MLP classifier, one should take into consideration the high computational effort required for training [10] and the risk of falling into a local minimum, which leads to classification errors. By nature, the MLP is a class of feed-forward network. It consists of layers of nodes with at least one hidden layer. To train the network, several back-propagation techniques have been utilized.
Figure 1 represents an MLP with a single hidden layer (the interested reader can refer to [11]).

Figure 1. An MLP with a single hidden layer
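As a minimal illustration of Eq. (1)-(3) (a sketch only: the 2x2 vectors and covariance values below are hypothetical, not taken from the paper), the following Python snippet computes the Mahalanobis distance and checks the two special cases mentioned above:

```python
import math

def mat_inv_2x2(S):
    """Invert a 2x2 covariance matrix analytically (assumes a non-zero determinant)."""
    (a, b), (c, d) = S
    det = a * d - b * c
    return [[d / det, -b / det],
            [-c / det, a / det]]

def mahalanobis(x, mu, S):
    """Eq. (1): D_M(x) = sqrt((x - mu)^T S^{-1} (x - mu))."""
    diff = [xi - mi for xi, mi in zip(x, mu)]
    Sinv = mat_inv_2x2(S)
    q = sum(diff[i] * Sinv[i][j] * diff[j] for i in range(2) for j in range(2))
    return math.sqrt(q)

x, mu = [3.0, 4.0], [0.0, 0.0]

# Identity covariance: reduces to the plain Euclidean distance (here 5.0).
print(mahalanobis(x, mu, [[1.0, 0.0], [0.0, 1.0]]))

# Diagonal covariance: reduces to the normalized Euclidean distance of Eq. (3),
# i.e. sqrt((3/2)^2 + (4/1)^2) with s_1 = 2 and s_2 = 1.
print(mahalanobis(x, mu, [[4.0, 0.0], [0.0, 1.0]]))
```

In a real stylometric pipeline the vectors would be feature vectors (e.g. word-based lexical features) rather than hand-picked numbers, and the covariance matrix would be estimated from the training documents of each candidate author.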
3. EXPERIMENT AND RESULTS
3.1. Dataset

Our corpus consists of 4631 text documents extracted from the Dar Al-ifta Al-Misriyyah website. The texts are Islamic fatwas issued from 1896 to 1996. The original texts were unstructured and required a special tool to extract the fatwa structure from the whole documents; for this purpose, a C# tool was implemented. Since we seek to investigate the effects of size and word length on AA classifiers, we grouped the fatwas by word length and size into different groups. A set with the smallest text size is then used during the experiment. Table 3 summarizes the statistical features of the dataset. The word count ranges widely, varying from 11 to 1500 words per text.

3.2. Design of experiment

As mentioned previously, the AA problem can be considered a supervised learning task, which means the model must be trained and then tested on a set of unseen texts. Of the whole text mining process, we focus here only on the training/testing phases and evaluate the performance of the classifiers by the number of correctly attributed texts. As stated earlier, combinations of WLF, n-gram, and POS features are used, and three classifiers are employed (Mahalanobis distance, MLP, and linear regression). Figure 2 presents, in addition to the typical solution of an AA problem, the path we followed during the experiment (the lines colored in red). The general procedure of the experiment is as follows:
a. After executing the whole training and attribution phase, the results are recorded. At the next iteration, the training set size is increased and the whole procedure is executed again. We limit the experiment to four iterations.
However, the number of iterations can be increased depending on how the training sets are partitioned.
b. The classifier type remains the same until all partitioned training sets have been investigated.
c. All writing-feature combinations are examined, and the accuracy of the model is recorded.

To avoid bias, the partitioned training sets were assembled carefully, and t-tests were conducted to ensure that the training sets do not differ significantly. The training sets were organized as follows:
a. TS1: a training set of 330 documents
b. TS2: a training set of 378 documents
c. TS3: a training set of 414 documents
d. TS4: a training set of 438 documents

Figure 2. Design of experiment
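The incremental procedure above can be sketched as follows. This is an illustrative stand-in, not the paper's pipeline: the "documents" are synthetic one-dimensional stylometric scores for two hypothetical authors, and a nearest-centroid rule stands in for the MD/LR/MLP classifiers; only the loop structure (grow the training set, re-train, record accuracy) mirrors the experiment:

```python
import random

random.seed(1)

# Synthetic stand-in data: each "document" is one stylometric score,
# labeled by its author's true mean.
def make_docs(n, mean):
    return [(random.gauss(mean, 1.0), mean) for _ in range(n)]

docs = make_docs(300, 0.0) + make_docs(300, 2.0)
random.shuffle(docs)
test, pool = docs[:100], docs[100:]

def centroid_classify(train, x):
    # Nearest-centroid rule: attribute x to the author whose mean score is closest.
    groups = {}
    for value, label in train:
        groups.setdefault(label, []).append(value)
    centroids = {label: sum(vs) / len(vs) for label, vs in groups.items()}
    return min(centroids, key=lambda label: abs(centroids[label] - x))

# Grow the training set step by step (TS1..TS4 sizes from the paper)
# and record the accuracy at each step.
accuracies = {}
for size in (330, 378, 414, 438):
    train = pool[:size]
    correct = sum(centroid_classify(train, v) == label for v, label in test)
    accuracies[size] = correct / len(test)
    print(f"TS size {size}: accuracy {accuracies[size]:.2%}")
```

In the actual study, each iteration would additionally sweep over the seven feature sets and the three classifiers before moving to the next training set size.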
3.3. Experimental findings
3.3.1. Authorship attribution using WLF features

We recorded the accuracy of the classifiers on the different training sets, as shown in Table 1. The classifiers performed differently across the training sets. On most sets, the MLP outperformed the other classifiers as the training set size increased.

Table 1. Performance of classifiers with WLF features
Classifier  TS1      TS2      TS3      TS4
MLP         75%      79.167%  91.67%   77.08%
LR          83.33%   70.83%   91.67%   72.916%
MD          58.33%   37.5%    66.67%   64.58%

An interesting finding is that the accuracy of the MLP classifier increased as the training set size grew, then dropped sharply. The same can be said, with some nuance, of the LR and MD methods. LR performed better than MD; it also gives better results when the training set is quite small.

3.3.2. Authorship attribution using n-gram features

We next investigated the impact of training set size on AA using word n-grams. The value of n can, in principle, be chosen arbitrarily; to capture more words, we set n to five. Table 2 shows the performance of the classifiers on each training set.

Table 2. Performance of classifiers with n-grams
Classifier  TS1      TS2      TS3      TS4
MLP         100%     91.67%   91.67%   89.58%
LR          100%     87.5%    91.67%   89.58%
MD          100%     95.83%   94.44%   95.83%

Table 3. Statistical features of the dataset
Feature    # words   # sentences  Avg. words/sent.  # paragraphs  Avg. word length
Mean       125.26    4.51         28.04             12.4975       4.514755
St. Dev.   147.73    3.2850       22.115            12.5138       0.2897
Variance   21824.9   10.79        489.079           156.596       0.08395
Max.       1844      31           195.5             160           5.585
Min.       11        2            4.75              160           3.761

Comparing the classifiers' performance with n-grams against their performance with WLF, we note an improvement in AA accuracy. However, performance deteriorated as the training set size increased. Furthermore, the MD method outperformed the other classifiers even as the training set size grew, as shown in Figure 3.

Figure 3. Classifier performance with the n-gram feature
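Word n-grams of the kind used here (Set2) can be extracted with a short sketch like the following; the whitespace tokenization and the English sample sentence are simplifying assumptions, since a real Arabic pipeline would need proper tokenization and normalization first:

```python
def word_ngrams(text, n):
    """Return the contiguous word n-grams of a text (naive whitespace tokenization)."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sample = "the quick brown fox jumps over the lazy dog"

# With n = 5 (as in the experiment), a 9-word text yields 9 - 5 + 1 = 5 five-grams.
print(word_ngrams(sample, 5)[0])    # "the quick brown fox jumps"
print(len(word_ngrams(sample, 5)))  # 5
```

Each extracted n-gram would then become one dimension of the feature vector fed to the MD, LR, or MLP classifier, typically weighted by its frequency in the document.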
3.3.3. Authorship attribution using POS features

With the POS method, the performance of the classifiers differs. The accuracy of both the MLP and LR methods increases with the training set size up to a certain point, then decreases as the size grows further; the MLP decreases significantly compared with LR. The MD method shows the opposite behavior, and its accuracy remains better than that of the other classifiers, as shown in Figure 4.

Figure 4. Classifier performance with POS

3.3.4. Authorship attribution with all features

In this part, we investigate the performance of the classifiers with different feature combinations. Table 4 summarizes the results. In most cases, LR outperforms the other classifiers. MD yields low accuracy, especially when the WLF features are combined with POS. In general, combining POS with n-grams enhances the performance of the attribution task; however, the accuracy is still lower than with the WLF and n-gram combination. It is also notable that the classifiers' performance is negatively affected by increasing the training set size. Contrary to expectations, applying the full combination of WLF, n-grams, and POS decreases the accuracy of AA.

Table 4. Performance of classifiers with different combinations
Features      Classifier  TS1      TS2      TS3      TS4
WLF+N-gram    MLP         83.33%   91.67%   83.33%   69.44%
              LR          66.67%   91.67%   83.33%   62.5%
              MD          50%      61.11%   60.42%   41.67%
WLF+POS       MLP         79.167%  83.33%   66.67%   58.33%
              LR          79.167%  86.11%   70.83%   48.61%
              MD          37.5%    47.22%   41.67%   34.22%
All           MLP         87.5%    69.44%   83.33%   66.67%
              LR          62.5%    94.44%   87.5%    62.5%
              MD          41.67%   61.11%   56.25%   30.56%
N-gram+POS    MLP         91.67%   91.67%   81.25%   79.17%
              LR          87.5%    91.67%   79.17%   86.11%
              MD          75%      77.78%   62.5%    62.5%

4.
COMPARISON WITH OTHER WORKS

As concluded in the previous sections, combining word-based stylometric features with n-grams provides the best accuracy, reaching 93% on average. This section therefore compares, in terms of accuracy, the performance of the classifiers used in the reviewed works with our findings. Even though the reported results come from different datasets, the comparison gives an indication of the performance of the different methods. Table 5 presents the comparison regardless of text size: it lists the accuracy of the classifiers used in the works mentioned throughout this paper. Table 6 presents the accuracy taking the text size into consideration.
Table 5. Comparison of the best accuracy reached in this work with the accuracy of other methods, regardless of text size
Reference                            Accuracy  Data
Our work                             100%      Islamic fatwas collected from the Dar Al-ifta Al-Misriyyah website
Shaker and Corne [2]                 87.63%    Arabic books obtained from the website of the Arab Writers Union
Abbasi and Chen [3]                  85.4%     Arabic web forum messages from Yahoo groups
Stamatatos [4]                       93.4%     Arabic newspaper reports from Alhayat
Ouamour et al. [5]                   100%      Ancient historical books in Arabic
Altheneyan and El Bachir Menai [12]  97.84%    Arabic books collected from the Alwaraq website

Table 6. Comparison of the best accuracy reached in this work with the accuracy of other methods, regarding text size
Reference             Accuracy  Data size
Our work              93%       [11-1500] words per document
Ouamour et al. [5]    80%       [100-1500] words per document
Al-Ayyoub et al. [8]  59.8%     Up to 140 words per tweet

5. CONCLUSIONS AND FUTURE WORK

This study has investigated the effect of increasing the training set size on the performance of authorship attribution classifiers. The experiment was designed to measure the accuracy of classifiers on short Arabic texts, and to investigate the performance of AA classifiers using different combinations of stylometric features. We limited our experiment to three well-known classifiers, namely linear regression, multilayer perceptron, and Mahalanobis distance; the research methodology, however, remains valid for other classifiers.

The overall results show that the classifiers behave differently depending on the AA features and training set size used. The MLP classifier exceeds the other classifiers when the WLF features are used; up to a certain point it benefits from the increase in training set size, after which its accuracy decreases. The n-gram methods lead to decreasing classifier performance as the training set size increases.
However, generally speaking, the n-gram features provide the best results among all the stylometric features. With POS features, the classifiers show different behaviors: the MLP classifier decreases significantly compared with LR, whereas MD behaves differently and provides better accuracy than both MLP and LR. We also applied the classifiers with different feature combinations (WLF+n-gram, WLF+POS, n-gram+POS, and the set of all features). MD provides low accuracy, especially when the WLF features are combined with POS. Combining POS with n-grams improves the performance of the attribution task. Contrary to the expected results, applying the full combination of WLF, n-grams, and POS decreases the accuracy of AA.

For future work, we intend to extend the experiments to investigate the accuracy of other classifiers and to larger training sets. We also plan to investigate the impact of other feature selection methods on the performance of the AA task.

REFERENCES
[1] A. Al-Falahi, M. Ramdani, M. Bellafkih, and M. Al-Sarem, "Authorship Attribution in Arabic Poetry Context Using Markov Chain Classifier," IEEE, 2015.
[2] K. Shaker and D. Corne, "Authorship Attribution in Arabic Using a Hybrid of Evolutionary Search and Linear Discriminant Analysis," in 2010 UK Workshop on Computational Intelligence (UKCI), pp. 1-6, 2010, doi: 10.1109/UKCI.2010.5625580.
[3] A. Abbasi and H. Chen, "Applying Authorship Analysis to Arabic Web Content," in P. Kantor, G. Muresan, F. Roberts, D. D. Zeng, F.-Y. Wang, H. Chen, and R. C. Merkle (Eds.), Intelligence and Security Informatics, vol. 3495, Springer-Verlag, Berlin, Heidelberg, pp. 183-197, 2005.
[4] E. Stamatatos, "Author Identification: Using Text Sampling to Handle the Class Imbalance Problem," Information Processing & Management, 44(2), 790-799, 2008, doi: 10.1016/j.ipm.2007.05.012.
[5] S. Ouamour, S. Khennouf, S. Bourib, H. Hadjadj, and H. Sayoud,
"Effect of the Text Size on Stylometry: Application on Arabic Religious Texts," in Advanced Computational Methods for Knowledge Engineering, pp. 215-228, Springer, Cham, 2016.
[6] M. Eder, "Does Size Matter? Authorship Attribution, Small Samples, Big Problem," Literary and Linguistic Computing, 2013, doi: 10.1093/llc/fqt066.
[7] S. El Manar El Bouanani and I. Kassou, "Authorship Analysis Studies: A Survey," International Journal of Computer Applications (0975-8887), vol. 86, no. 12, January 2014.
[8] M. Al-Ayyoub, Y. Jararweh, A. Rababa'ah, and M. Aldwairi, "Feature Extraction and Selection for Arabic Tweets Authorship Authentication," Journal of Ambient Intelligence and Humanized Computing, 8:383-393, 2017, doi: 10.1007/s12652-017-0452-1.
[9] H. Sayoud, "Automatic Speaker Recognition: Connexionist Approach," PhD thesis, USTHB University, Algiers, 2003.
[10] F. J. Tweedie, S. Singh, and D. I. Holmes, "Neural Network Applications in Stylometry: The Federalist Papers," Computers and the Humanities, vol. 30, pp. 1-10, 1996.
[11] S. K. Pal and S. Mitra, "Multilayer Perceptron, Fuzzy Sets, and Classification," IEEE Transactions on Neural Networks, 3(5), 683-697, 1992.
[12] A. S. Altheneyan and M. E. B. Menai, "Naïve Bayes Classifiers for Authorship Attribution of Arabic Texts," Journal of King Saud University - Computer and Information Sciences, 26(4), 473-484, 2014.

BIOGRAPHIES OF AUTHORS

Mohammed Al-Sarem is an assistant professor of information systems at Taibah University, Al Madinah Al Monawarah, KSA. He received his PhD in Informatics from Hassan II University, Mohammedia, Morocco, in 2014. His research interests center on e-learning, educational data mining, Arabic text mining, and intelligent and adaptive systems. He has published several research papers and participated in several local and international conferences.

Abdel Hamid Emara received his BSc, MSc, and PhD in Systems and Computers Engineering from Al-Azhar University in 1992, 2000, and 2006, respectively. He has worked as a lecturer in the Systems and Computers Engineering Department at Al-Azhar University in Cairo, Egypt, and has 12 years of combined academic and research experience. He is currently an assistant professor in the Computer Science Department at Taibah University, Al-Madinah Al Monawarah, KSA. His research interests include knowledge discovery and data mining, data security, data analysis, and artificial intelligence. He has published papers in quality international journals.