Application of linguistic cues in the analysis of
language of hate groups
Bartlomiej Balcerzak¹, Wojciech Jaworski¹,², and Adam Wierzbicki¹
¹ Polish-Japanese Institute of Information Technology, Koszykowa 86, 02-008 Warsaw, Poland
² Institute of Informatics, University of Warsaw, Banacha 2, 02-097 Warsaw, Poland
Abstract. Hate speech and fringe ideologies are social phenomena that seem to thrive on-line. Via the Internet, various members of the political and religious fringe are able to propagate their ideas with less effort than through more traditional media. In this article we attempt to use linguistic cues, such as the occurrence of parts of speech, to distinguish the language of fringe groups from strictly informative sources. The aim of this research is to provide a preliminary model for identifying deceptive and persuasive materials on-line, examples of which include aggressive marketing and hate speech. In this paper we focus on the political aspect. Our research has shown that sentence length and the occurrence of adjectives and adverbs can help identify differences between the language of fringe political groups and mainstream news media. The most effective method involved classifying individual sentences as fringe (as detected by part-of-speech occurrence) and then taking into account the frequency of such sentences within an article.
Key words: hate speech, natural language processing, propaganda, machine learning
1 Introduction
In cyberspace millions of people send out and review terabytes of data. At the
same time, countless political, religious and ideological agendas can be propa-
gated with nearly no limitations. This in turn leads to an increased risk of the
successful spread of various types of ideologies. Many tools can be introduced in order to combat this process and promote a critical approach to information available on-line. With the ever-growing plethora of websites and forums, an automated method of recognizing such material seems a viable solution. In this paper we propose an approach to this problem, which can lead to the development of such tools. By applying methods of natural language processing,
we aim to construct a classifier for identifying the language of political propaganda. For the sake of this research we decided to focus on written texts that constitute the language of entities from the political extremes. Applying methods that concentrate on the structure of the text rather than on word counts may allow for a more generalized method of automated text processing. In this paper we use a selection of verbal cues to distinguish the language of political propaganda. These characteristics include basic information about sentence length and structure, and are extracted with part-of-speech taggers already available for the English language. Unlike the bag-of-words approach, which focuses on word co-occurrence within a set of documents, our approach aims at a deeper level of analysis, one involving cues that describe the style of the text. By using this information, we plan to construct robust classifiers able to identify the language of fringe elements of the political spectrum.
In order to perform our research we collected two corpora representing the extremes of the political spectrum. The first consists of texts gathered from a set of websites connected with American groups promoting radical nationalistic ideologies (i.e. national socialism, white and black supremacy, racism, anti-immigrant combatants). The other corpus consists of texts from websites that promote communism. We decided to split the fringe material into two corpora in order to take account of potential differences between extreme ideologies. This collection is compared with the language of mainstream political news, from both national and local news agencies and newspapers.
If this model of linguistic cues based on part-of-speech occurrence provides viable results on such differing materials, it would indicate that it can also be used in more subtle scenarios. Analysis of the given material has been conducted on both the article and the sentence level. This involved the application of machine learning algorithms in order to identify sentences and articles belonging to political propaganda. Our main assumption is that since fringe ideologies are bent on changing the world in a radical way, materials that endorse such a sentiment would be mostly emotional and filled with statements that evaluate the current situation as well as the desired ideal world, the end game of fully implementing the tenets of the ideology. Therefore, in linguistic terms, texts belonging to a fringe ideology corpus would contain higher amounts of adjectives and adverbs than strictly informative sources. Hence, the following hypotheses regarding the role of each of the selected linguistic attributes have been proposed for fringe ideology detection:
Hypothesis 1 Texts belonging to fringe ideologies contain more adjectives and adverbs than strictly informative texts.
which can also be rephrased as:
Hypothesis 2 Sentences belonging to fringe ideologies contain more adjectives and adverbs than sentences from strictly informative texts.
We also propose an additional hypothesis regarding the average length of a sen-
tence in both fringe and informative sources.
Hypothesis 3 Texts that endorse fringe ideologies contain longer sen-
tences than strictly informative sources.
In our current research we focus both on the sentence and article level. We
aim to verify these hypotheses by using machine learning techniques. Extracted
features will serve as attributes used by the algorithm for training. This means
that standard procedures of machine learning evaluation will be applied.
We also propose a fourth hypothesis:
Hypothesis 4 When predicting whether a text belongs to a fringe ideology or to informative sources, information about the frequency of fringe sentences within it provides the highest classification performance (accuracy).
In order to test these hypotheses, two main tasks were conducted. The first involved using machine learning techniques to identify whether sentences or articles from our corpus belong to the fringe or informative class. Afterwards, the output of the models applied to this task was used on the set of articles from our corpora. The second task was to predict, based on the frequency of propaganda sentences in an article as given by the classifier, whether said article belongs to the fringe or informative category. After the performance of the algorithms used in each task is evaluated, attribute extraction methods are used to determine which attributes are crucial for deciding whether the source is fringe or informative.
2 Related work
Various fields of natural language processing can be connected with the problem of identifying the language of the political fringe. One of these fields is text genre detection, where NLP tools are used to identify the genre of a given text. Such tools involve a bag-of-words approach as well as part-of-speech n-grams, as used in [1]. Other applications of genre classification include the use of both linguistic cues and the HTML structure of the web page [2]. In our work, however, we focus not only on detecting text genres, but also on identifying a very specific type of narration, one that can be classified as manipulative or deceptive.
This leads to a second field of study that can be applied to the analysis of materials from the political fringe. Analysis of such sources on the Internet has been present in the field of computer science for some time. The research done so far can be divided into two main areas of investigation. The first can be described as studies of the behavioral patterns of propagandists on-line. This approach focuses mostly on the agent distributing the content rather than on the content itself, and it is strongly inspired by research into spam detection [3]. Such research was mostly focused on micro-blogging platforms such as Twitter [?] or open source projects such as Wikipedia [4]. What is notable is that in these papers the main source of traits that could identify propaganda is meta-information, such as the frequency of commenting on a material and the repeated posting of the same content. The other main area of research is deception detection, which deals with the problem of distinguishing whether the author of a text intended to fool the recipient. Researchers in this field sought to identify linguistic cues associated with deceptive behavior. Most of these cues, such as the ones used by [9] in experimental setups or by [6] for fraudulent financial claims, emphasized basic shallow characteristics, such as sentence or noun phrase length, that proved to be indicative of deceptive materials. A more sophisticated method was proposed by [8]. In their paper they introduced the concept of syntactic stylometry (analyzing larger syntactical structures) for deception detection and opinion spam. Their work focuses on the structure of syntax, the relations between larger elements of the sentence (i.e., noun phrases). A slightly different approach was tested by [5] in their essay experiment. With the aid of the LIWC tool [12] they aimed to identify the words that appear most often in deceptive materials. In our work we want to extend and test the findings of deception detection for the task of identifying the textual materials that form the bulk of fringe ideologies. It is also worth noting that most of the research in this field was done for the English language. The only instance of deception detection in another language that we managed to find was the study [7] done for Italian.
3 Hypothesis verification
3.1 Dataset
In order to test our hypotheses we gathered a corpus consisting of both radical ideological material and balanced informative text. All of the corpora represent modern American English. We prepared a collection of texts taken from three distinct sources:
– Nazi corpus: This contains articles from websites belonging to groups in the US that promote national socialism and ideas of racial and ethnic superiority. When collecting the websites, we used the list of hate groups provided by the Southern Poverty Law Center [18]. Notable sources include the American National Socialist Party, the National Socialist Movement, Aryan Nations, etc. In all, 100 web pages were extracted. An example of a text from this corpus is shown below:
To a National Socialist, things like PRIDE, HONOR, LOYALTY, COURAGE,
DISCIPLINE, and MORALITY - actually MEAN something. Like our fore-
fathers, we too are willing to SACRIFICE to build a better world for our
children whom we love deeply, and like them as well - we are willing to
DO ANYTHING NECESSARY to ACHIEVE THAT GOAL. We are your
brothers and your sisters, we are your fathers and your mothers, your friends
and your co-workers - WE ARE WHITE AMERICA, just like YOU! Your
enemies in control attempt with their constant ”anti-nazi” propaganda - to
persuade you that ”they” are your ”real friends” - and that WE, your own
kin, are your lifelong enemies. Yet, ask yourself THIS - ”WHO” have you
to ”THANK” for ALL the PROBLEMS FACING YOU? The creatures who
HAVE been in total CONTROL - or - those of us resisting the evil? The
TRUTH is right there in front of you - DON’T be afraid to understand it
and to ACT upon it! You hold the FUTURE - BRIGHT or DARK - in
YOUR hands, and TIME IS RUNNING OUT!
– Communist Corpus: This corpus contains pages from websites of American groups and parties that describe themselves as communist. These include, among others, the Progressive Labour Party and the Communist Party of the United States of America. As with the Nazi corpus, 100 web pages were extracted. An example is presented below:
Today’s action, organized by Good Jobs Nation, comes a year after it filed
a complaint with the Department of Labor that accused food franchises at
federal buildings of violating minimum-wage and overtime laws. They want
Obama to sign an executive order requiring federal agencies to contract only
with companies that engage in collective bargaining. The union leaders pulling
the strings behind Good Jobs Nation are the same people who got us into this
mess in the first place. Most contract jobs used to be full-time union jobs,
and the unions did nothing to stop the bosses from eliminating them. Now
the unions are trying to rebuild their ranks among low-wage workers who
replaced their former members. We need to abolish wage slavery with com-
munist revolution. And the struggle between reform and revolution must be
waged within struggles like this one.
– News Corpus: This corpus serves as a reference point. It contains political news and opinion sections from various American news sources (CNN, FOX News, CNBC, and local media outlets). In order to construct a balanced corpus, 100 web pages were likewise extracted for the task of training a machine learning algorithm. A typical article from this set is shown below:
President Barack Obama’s policy toward Syria – three years of red lines and
calls for regime change – culminated Monday in a barrage of air strikes on
terror targets there, marking a turning point for the conflict and thrusting
the President further into it. The U.S. said Saudi Arabia, the United Arab
Emirates, Qatar, Bahrain and Jordan had joined in the attack on ISIS tar-
gets near Raqqa in Syria. The U.S. also launched air strikes against another
terrorist organization, the Khorasan Group.
3.2 Dataset preprocessing
After the working corpus had been prepared, we divided each sentence into tokens and conducted part-of-speech tagging (using the default NLTK POS tagger). Sentence and word lengths were calculated as well, creating a database containing the following attributes for every sentence:
– Sentence Class: a variable determining whether the sentence belongs to the informative or fringe corpus
– Traits describing textual quantity: the number of tokens in a sentence and the average number of characters in the words constituting a sentence.
– Frequency of adjectives and adverbs
We also included the occurrence of other parts of speech in order to test whether they have any impact on distinguishing fringe language from informative sources.
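As an illustration of this preprocessing step, the sketch below extracts the listed attributes for a single sentence with NLTK, the toolkit named above. The helper name, the tag groupings (JJ* for adjectives, RB* for adverbs, NN/NNS for common nouns, NNP/NNPS for proper nouns), and the frequency normalization are our assumptions rather than details taken from the paper.

```python
# A minimal sketch of the per-sentence feature extraction, assuming the
# default NLTK tokenizer and POS tagger (requires the "punkt" and
# "averaged_perceptron_tagger" resources to be downloaded).
import nltk

def sentence_features(sentence, label):
    tokens = nltk.word_tokenize(sentence)
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    n = len(tokens) or 1  # guard against empty sentences
    return {
        "sentence_class": label,                                   # informative vs. fringe
        "n_tokens": len(tokens),                                   # sentence length
        "avg_word_len": sum(len(t) for t in tokens) / n,           # average word length
        "adj_freq": sum(t.startswith("JJ") for t in tags) / n,     # adjective frequency
        "adv_freq": sum(t.startswith("RB") for t in tags) / n,     # adverb frequency
        "noun_freq": sum(t in ("NN", "NNS") for t in tags) / n,    # common noun frequency
        "propn_freq": sum(t in ("NNP", "NNPS") for t in tags) / n  # proper noun frequency
    }

# Example: sentence_features("We need to abolish wage slavery.", "fringe")
```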
After the database was prepared, machine learning algorithms were applied. In order to evaluate the performance of these algorithms in both tasks, the following measures were used:
– Accuracy: the percentage of all true positives and true negatives produced by the machine learning algorithm.
– AUC (Area under the ROC Curve): a measure based on the true-positive versus false-positive rate at the various thresholds provided by the machine learning algorithm.
3.3 Machine learning setup
The task given to the algorithms was to identify whether a sentence or article belongs to the fringe or news corpus. A 5-fold cross-validation procedure was implemented in order to validate the algorithms' performance.
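The sketch below shows one way such an evaluation could be set up, assuming scikit-learn implementations of the three classifiers reported in the following subsections (Naive Bayes, K-NN with K = 7 and cosine distance, and a neural network). The placeholder data, hidden-layer size, and training settings are our assumptions.

```python
# A hedged sketch of 5-fold cross-validation with accuracy and AUC,
# assuming scikit-learn; X and y stand in for the extracted features
# and class labels.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_validate

X = np.random.rand(200, 7)          # placeholder feature matrix
y = np.random.randint(0, 2, 200)    # placeholder fringe/news labels

models = {
    "Naive Bayes": GaussianNB(),
    "K-NN": KNeighborsClassifier(n_neighbors=7, metric="cosine", algorithm="brute"),
    "Neural Net": MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000),
}

for name, model in models.items():
    scores = cross_validate(model, X, y, cv=5, scoring=("accuracy", "roc_auc"))
    print(name,
          "Acc: %.2f" % scores["test_accuracy"].mean(),
          "AUC: %.2f" % scores["test_roc_auc"].mean())
```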
3.4 Article prediction
In this section, we analyze the performance of the article prediction based on
the frequency of parts of speech. Since our corpora are balanced, the baseline performance is 50%, both for accuracy and AUC. Three algorithms turned out to be the most effective: Naive Bayes, K-NN (K = 7, cosine distance), and Neural Net.
Table 1. AUC and Accuracy of used algorithms
Algorithms applied Acc. Nazi AUC Nazi Acc. Comm AUC Comm
Naive Bayes 71% 83% 75% 85%
K-NN 60% 66% 59% 61%
Neural Net 70% 75% 77% 87%
As shown in Table 1, the best scores were produced by the Naive Bayes and Neural Net algorithms. The accuracy and AUC values for these were within the range of 70 to 80 per cent, which indicates moderately strong performance. It can also be pointed out that higher scores were achieved for the communist corpus. This may be in part due to the fact that the communist articles came mainly from two large sources, and therefore their style may be more cohesive. The national-socialist corpus, on the other hand, was more diverse, allowing for a broader spectrum of styles. Still, both types of fringe ideologies achieved similar results.
The applied methods of attribute extraction showed that the most important attributes in this task were the frequencies of adjectives, adverbs, and proper nouns. As shown in Table 2, which contains the standardized differences between the mean values of these attributes, the frequency of adjectives was higher in both fringe corpora relative to the informative sources. In the case of adverbs the effect was visible only for the Nazi materials. Interestingly, articles belonging to fringe ideologies contained fewer proper nouns than informative sources. This may be because ideological materials tend to be more vague and connected with more general processes and entities; however, such a hypothesis needs to be verified separately. A sketch of how the standardized differences can be computed is given after Table 2.
Table 2. Standardized differences between mean values for adjective, adverb and
proper noun frequency
Nazi/News Communist/News
Adjectives% 0.24 0.39
Adverbs% 0.2 -0.04
Proper Nouns% -0.38 -0.33
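The standardized differences reported in Tables 2 and 4 could be computed as sketched below. The paper does not state the exact normalization, so a pooled-standard-deviation (Cohen's d style) variant is assumed here.

```python
# Standardized difference between the mean attribute values of the fringe
# and news corpora; positive values mean the attribute is higher on
# average in the fringe corpus.
import numpy as np

def standardized_difference(fringe_values, news_values):
    fringe = np.asarray(fringe_values, dtype=float)
    news = np.asarray(news_values, dtype=float)
    pooled_sd = np.sqrt((fringe.var(ddof=1) + news.var(ddof=1)) / 2.0)
    return (fringe.mean() - news.mean()) / pooled_sd

# Example: standardized_difference(nazi_adjective_freqs, news_adjective_freqs)
```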
3.5 Sentence prediction
Table 3. Accuracy and AUC of used algorithms
Acc. nazi AUC nazi Acc. communist AUC communist
Naive Bayes 61% 65% 61% 65%
k-NN 62% 69% 64% 64%
Neural Net 65% 72% 63% 69%
As shown in Table 3, the results tend to fall between 60% and 70%. No large differences were observed when comparing the Nazi and communist corpora. It is also worth noting that classification conducted on the sentence level provided weaker scores than the one done for articles. This may be due to the fact that sentences are shorter and therefore provide less data that the machine learning algorithms can use.
As in the article prediction task, this analysis is complemented by attribute extraction scores from the chi-squared method; a sketch of such a ranking is given at the end of this subsection. According to the chi-squared method, the most important attributes include sentence length and the frequencies of adjectives, adverbs, and proper nouns. The relative importance of the attributes used is shown in Table 4, which contains the standardized differences between their mean values. Only the attributes with the highest differences are presented in this table. Positive values indicate that the mean value of a given attribute was higher for the fringe corpus; negative values indicate that it was higher for the news corpus.
Table 4. Standardized differences between mean values for sentence length and for adjective, adverb, noun and proper noun frequency
Nazi/News Communist/News
Sentence length 0.12 -0.14
Adjectives% 0.3 0.3
Adverbs% 0.2 0.04
Nouns% -0.18 0.36
Proper Nouns% -0.32 -0.4
The table shows a pattern similar to that observed when classifying articles. Fringe ideology sentences, both communist and Nazi, contained a higher amount of adjectives and adverbs. They also had fewer proper nouns than informative sources. What sets the sentence classification apart from the article one is the relatively high difference in the frequency of nouns between communist material and informative sources, as well as the fact that the two corpora varied with regard to average sentence length: the Nazi corpus tended to contain longer sentences, while the communist collection had shorter ones. In both cases the differences are not as pronounced as for the other attributes.
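The chi-squared attribute ranking mentioned above could be obtained as sketched below, assuming the scikit-learn implementation; the feature names and placeholder data are illustrative only. Chi-squared scoring requires non-negative features, which holds for the counts and frequencies used here.

```python
# A hedged sketch of chi-squared attribute ranking with scikit-learn.
import numpy as np
from sklearn.feature_selection import chi2

feature_names = ["sentence_length", "adj_freq", "adv_freq",
                 "noun_freq", "propn_freq"]
X = np.random.rand(500, len(feature_names))   # placeholder non-negative features
y = np.random.randint(0, 2, 500)              # placeholder fringe/news labels

scores, p_values = chi2(X, y)
ranking = sorted(zip(feature_names, scores, p_values), key=lambda t: -t[1])
for name, score, p in ranking:
    print(f"{name}: chi2={score:.2f}, p={p:.3f}")
```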
3.6 Sentence based article prediction
For this task, we decided to use the labels applied to sentences in the previ-
ous subsection in order to predict whether the entire articles belong to fringe
or informative class. Frequency of fringe sentences, as provided by the Neural
Net classifier has been calculated for each article from the corpora used in our
research. Afterwards, we calculated the optimal threshold for classification. For
Nazi materials the threshold value was 57% of fringe sentences per article, and
for communists it was 63%. The performance scores for these corpora are shown
in table 5.
Table 5. Performance
Accuracy AUC
Nazi 81% 80%
Communist 83% 81%
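The sketch below illustrates the sentence-based article prediction: each article is represented by the share of its sentences labelled as fringe by the sentence-level classifier, and a cut-off is chosen on the training data. The paper reports the resulting optimal thresholds (57% and 63%) but not the selection criterion, so maximizing accuracy is assumed here.

```python
# Threshold search for sentence-based article prediction
# (the accuracy-maximizing criterion is an assumption).
import numpy as np

def best_threshold(fringe_fractions, labels):
    """fringe_fractions: per-article share of fringe-labelled sentences (0..1);
    labels: 1 for fringe articles, 0 for news articles."""
    fractions = np.asarray(fringe_fractions, dtype=float)
    labels = np.asarray(labels)
    best_t, best_acc = 0.0, 0.0
    for t in np.unique(fractions):
        predictions = (fractions >= t).astype(int)
        acc = float((predictions == labels).mean())
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# Example: threshold, train_accuracy = best_threshold(article_fractions, article_labels)
```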
Compared with the article-level classification based directly on part-of-speech frequencies, the method used in this subsection provides higher accuracy (81% and 83% compared to 70% and 77%, respectively), with a relatively similar level of AUC. It is also worth noting that the performance of this method is high even though the scores for sentence prediction rarely exceeded the 70% threshold for either accuracy or AUC. This observation indicates that even though predictions on the sentence level can be faulty, when aggregated they provide a strong signal that allows for more successful identification of fringe sources. Moreover, this method proves to be slightly more effective than using part-of-speech information directly on the article level. Further work on the possibilities and limits of this approach will be pursued.
4 Conclusions and observations
Our research has led us to the following conclusions and observations:
– Article classification based on Part-of-Speech tagging has provided robust
scores, indicating that the chosen attributes can be used for identifying fringe
ideological sources. With a baseline of 50% for both accuracy and AUC, the
machine learning algorithms achieved scores exceeding 70% for accuracy, and
80% for AUC. The Naive Bayes and Neural Net algorithms produced the best performance.
– The same task conducted on the level of sentences provided weaker scores,
mostly within the 60% - 70% range both for accuracy and AUC. However,
when the information about the frequency of fringe sentences was applied to articles, the performance increased to 80% and above for both accuracy and AUC. Consequently, these scores provide a more robust classification than the one based on part-of-speech frequency at the article level. This may indicate that sentence-level classification is burdened with high noise, which may be countered by taking into account the frequency of fringe sentences in the article.
– In view of the collected data, Hypothesis 3 cannot be considered true. Sentence length proved not to be an important attribute for determining whether a sentence belongs to a fringe or informative source. The texts from the Nazi corpus, on average, contained longer sentences, while communist sources were composed of shorter ones; however, neither produced differences large enough to affect the machine learning algorithms' performance.
– Analysis on both the article and sentence level shows that the frequency of adjectives and adverbs plays an important role in identifying fringe sources. Moreover, the collected data allow us to consider Hypotheses 1 and 2 as verified.
– Analysis has also shown that fringe materials contain fewer named entities than informative sources. This may suggest that fringe texts tend to be more vague than those that are solely informative.
In summary, our research lends credence to the notion that basic linguistic cues such as the occurrence of parts of speech can be used to determine whether a text belongs to the language of information or of fringe ideology.
5 Future work
The results we obtained show that there is room for further research. Using basic stylistic cues to identify specific narratives may be extended to cover the language of political, religious or scientific discourse. In our future work we plan to gather text corpora related to such types of language (marketing, religion, science, etc.) and test them with the use of shallow linguistic characteristics. We also plan to include in our model such elements of propaganda as repetition and vagueness [19][15]. We plan to conduct a more detailed analysis of the two observations we made when researching the role of linguistic features: the differences in named entity frequency between fringe and informative sources, and the frequency of positive-class sentences as a method of article prediction. Devising a computational model may lead to the development of tools dedicated to the automatic detection of specific forms of language. So far we have worked with modern documents written in English, focused mostly on one ideology at a time. This is why in our future work we aim to extend our focus to historical texts written in other languages (Polish, German, etc.). We will thereby be able to show whether the language of propaganda follows a rigid, unchanging pattern or whether it is strongly culture-dependent. This is also important in order to identify linguistic cues that apply to fringe ideologies only. To this end we plan to apply the developed model to other fringe ideologies and to pseudoscience.
The results we obtained with shallow POS-tagging methods show great promise. The next natural step would be to use deeper NLP methods, such as syntax parsing or transcribing texts into logical formulas. In summary, our current work serves as preliminary research in the field of computational discourse analysis, from which we aim to obtain an effective model that can be extrapolated to various topical domains. This will prove useful for constructing a more general theory of computational models for recognizing the language of fringe ideologies (as well as other forms of written language, such as the language of religion, science, marketing or various social classes). This will be especially important for the classification method based on the frequency of fringe sentences within an article, as it provides a more general method of text corpus analysis and therefore allows for the design of more intricate automated systems that could detect propaganda or other forms of manipulative content. Focusing on structural aspects would also make such systems more resistant to deceptive strategies on the part of content producers.
6 Acknowledgments
This work was financially supported by the European Community from the
European Social Fund within the INTERKADRA project.
It was also supported by the grant Reconcile: Robust Online Credibility Eval-
uation of Web Content from Switzerland through the Swiss Contribution to the
enlarged European Union.
References
1. Sharoff, Serge. ”Classifying Web corpora into domain and genre using automatic
feature identification.” Proceedings of the 3rd Web as Corpus Workshop. 2007.
2. Santini, Marina, Richard Power, and Roger Evans. ”Implementing a characteriza-
tion of genre for automatic genre identification of web pages.” Proceedings of the
COLING/ACL on Main conference poster sessions. Association for Computational
Linguistics, 2006.
3. Metaxas, Panagiotis Takis. ”Web spam, social propaganda and the evolution of
search engine rankings.” Web Information Systems and Technologies. Springer
Berlin Heidelberg, 2010. 170-182.
4. Chandy, Rishi. ”Wikiganda: Identifying propaganda through text analysis.” Cal-
tech Undergraduate Research Journal. Winter 2009 (2008).
5. Mihalcea, Rada, and Carlo Strapparava. ”The lie detector: Explorations in the
automatic recognition of deceptive language.” Proceedings of the ACL-IJCNLP
2009 Conference Short Papers. Association for Computational Linguistics, 2009.
6. Humpherys, Sean L., et al. ”Identification of fraudulent financial statements using
linguistic credibility analysis.” Decision Support Systems 50.3 (2011): 585-594.
7. Fornaciari, Tommaso, and Massimo Poesio. ”Lexical vs. surface features in de-
ceptive language analysis.” Proceedings of the ICAIL 2011 Workshop Applying
Human Language Technology to the Law, AHLTL. 2011.
8. Feng, Song, Ritwik Banerjee, and Yejin Choi. ”Syntactic stylometry for deception
detection.” Proceedings of the 50th Annual Meeting of the Association for Compu-
tational Linguistics: Short Papers-Volume 2. Association for Computational Lin-
guistics, 2012.
9. Ott, Myle, et al. ”Finding deceptive opinion spam by any stretch of the imag-
ination.” Proceedings of the 49th Annual Meeting of the Association for Com-
putational Linguistics: Human Language Technologies-Volume 1. Association for
Computational Linguistics, 2011.
10. Lee, A. M., and Lee, E. B. (eds.). The Fine Art of Propaganda. The Institute for Propaganda Analysis. Harcourt, Brace and Co., 1939.
11. Bird, Steven, Ewan Klein, and Edward Loper. Natural language processing with
Python. O’Reilly Media, Inc., 2009.
12. Pennebaker, James W., Martha E. Francis, and Roger J. Booth. ”Linguistic inquiry
and word count: LIWC 2001.” Mahway: Lawrence Erlbaum Associates 71 (2001):
2001.
13. Progressive Labour Party website. http://www.plp.org/
14. Communist Party of the United States of America website. http://www.cpusa.org/
15. Gîfu, Daniela, and Dan Cristea. ”Towards an Automated Semiotic Analysis of the Romanian Political Discourse.” Computer Science 21.1 (2013): 61.
16. Paik, Woojin, et al. ”Applying natural language processing (nlp) based metadata
extraction to automatically acquire user preferences.” Proceedings of the 1st inter-
national conference on Knowledge capture. ACM, 2001.
17. Gîfu, Daniela, and Ioan Constantin Dima. ”An operational approach of communi-
cational propaganda.” International Letters of Social and Humanistic Sciences 23
(2014): 29-38.
18. Southern Poverty Law Center website. http://www.splcenter.org/
19. Lacoue-Labarthe, Philippe, and Jean-Luc Nancy. Le mythe nazi. Editions de
l’Aube, 1998.

More Related Content

PDF
International life Sciences
DOCX
Azucena_Manuscript.docx
PDF
Fake news Detection using Machine Learning
PDF
IRJET- Fake News Detection using Logistic Regression
PDF
QUANTUM CRITICISM: AN ANALYSIS OF POLITICAL NEWS REPORTING
PDF
Quantum Criticism: an Analysis of Political News Reporting
PDF
A Review Of Text Mining Techniques And Applications
PDF
NLP applicata a LIS
International life Sciences
Azucena_Manuscript.docx
Fake news Detection using Machine Learning
IRJET- Fake News Detection using Logistic Regression
QUANTUM CRITICISM: AN ANALYSIS OF POLITICAL NEWS REPORTING
Quantum Criticism: an Analysis of Political News Reporting
A Review Of Text Mining Techniques And Applications
NLP applicata a LIS

Similar to Application Of Linguistic Cues In The Analysis Of Language Of Hate Groups (20)

PPTX
Global Media Monitor - Marko Grobelnik
PDF
Big data analysis of news and social media content
PDF
News Reliability Evaluation using Latent Semantic Analysis
PDF
Fake News Detection
PPT
The role of linguistic information for shallow language processing
PDF
Era of Sociology News Rumors News Detection using Machine Learning
PDF
Aldo Gangemi - Meaning on the Web: An Empirical Design Perspective
PDF
The Process of Information extraction through Natural Language Processing
DOCX
Paper ThreeWe weren’t interested in doing a story about the.docx
DOC
text_mining.doc
PDF
A Web Interface For Analyzing Hate Speech
PDF
Natural Language Processing
PDF
call for papers, research paper publishing, where to publish research paper, ...
PDF
Assessing the quality of online news
PDF
Topic Tracking for Punjabi Language
PDF
Lesson 40
PDF
AI Lesson 40
PDF
Karlgren
PDF
Web classification of Digital Libraries using GATE Machine Learning  
PDF
Information extraction using discourse
Global Media Monitor - Marko Grobelnik
Big data analysis of news and social media content
News Reliability Evaluation using Latent Semantic Analysis
Fake News Detection
The role of linguistic information for shallow language processing
Era of Sociology News Rumors News Detection using Machine Learning
Aldo Gangemi - Meaning on the Web: An Empirical Design Perspective
The Process of Information extraction through Natural Language Processing
Paper ThreeWe weren’t interested in doing a story about the.docx
text_mining.doc
A Web Interface For Analyzing Hate Speech
Natural Language Processing
call for papers, research paper publishing, where to publish research paper, ...
Assessing the quality of online news
Topic Tracking for Punjabi Language
Lesson 40
AI Lesson 40
Karlgren
Web classification of Digital Libraries using GATE Machine Learning  
Information extraction using discourse

More from Leonard Goudy (20)

PDF
Full Page Printable Lined Paper - Printable World Ho
PDF
Concept Paper Examples Philippines Educational S
PDF
How To Improve An Essay In 7 Steps Smartessayrewrit
PDF
INTERESTING THESIS TOPICS FOR HIGH SCHO
PDF
German Essays
PDF
Persuasive Essay Site That Writes Essays For You Free
PDF
Money Cant Buy Happiness But Happi
PDF
Example Of Methodology In Research Paper - Free Ess
PDF
Persuasive Essays Examples And Samples Es
PDF
Thesis Statement Thesis Essay Sa
PDF
A Multimedia Visualization Tool For Solving Mechanics Dynamics Problem
PDF
A3 Methodology Going Beyond Process Improvement
PDF
Asexuality In Disability Narratives
PDF
A Short Essay Of Three Research Methods In Qualitative
PDF
An Interactive Educational Environment For Preschool Children
PDF
An Apology For Hermann Hesse S Siddhartha
PDF
Applied Math
PDF
A Survey Of Unstructured Outdoor Play Habits Among Irish Children A Parents ...
PDF
An Exploration Of Corporate Social Responsibility (CSR) As A Lever For Employ...
PDF
A Major Project Report On Quot VEHICLE TRACKING SYSTEM USING GPS AND GSM Q...
Full Page Printable Lined Paper - Printable World Ho
Concept Paper Examples Philippines Educational S
How To Improve An Essay In 7 Steps Smartessayrewrit
INTERESTING THESIS TOPICS FOR HIGH SCHO
German Essays
Persuasive Essay Site That Writes Essays For You Free
Money Cant Buy Happiness But Happi
Example Of Methodology In Research Paper - Free Ess
Persuasive Essays Examples And Samples Es
Thesis Statement Thesis Essay Sa
A Multimedia Visualization Tool For Solving Mechanics Dynamics Problem
A3 Methodology Going Beyond Process Improvement
Asexuality In Disability Narratives
A Short Essay Of Three Research Methods In Qualitative
An Interactive Educational Environment For Preschool Children
An Apology For Hermann Hesse S Siddhartha
Applied Math
A Survey Of Unstructured Outdoor Play Habits Among Irish Children A Parents ...
An Exploration Of Corporate Social Responsibility (CSR) As A Lever For Employ...
A Major Project Report On Quot VEHICLE TRACKING SYSTEM USING GPS AND GSM Q...

Recently uploaded (20)

PDF
Abdominal Access Techniques with Prof. Dr. R K Mishra
PPTX
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
PDF
STATICS OF THE RIGID BODIES Hibbelers.pdf
PDF
3rd Neelam Sanjeevareddy Memorial Lecture.pdf
PPTX
Final Presentation General Medicine 03-08-2024.pptx
PDF
Module 4: Burden of Disease Tutorial Slides S2 2025
PPTX
Microbial diseases, their pathogenesis and prophylaxis
PDF
Classroom Observation Tools for Teachers
PDF
Microbial disease of the cardiovascular and lymphatic systems
PPTX
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
PDF
GENETICS IN BIOLOGY IN SECONDARY LEVEL FORM 3
PPTX
Pharma ospi slides which help in ospi learning
PPTX
Cell Structure & Organelles in detailed.
PPTX
202450812 BayCHI UCSC-SV 20250812 v17.pptx
PPTX
Pharmacology of Heart Failure /Pharmacotherapy of CHF
PPTX
1st Inaugural Professorial Lecture held on 19th February 2020 (Governance and...
PPTX
Cell Types and Its function , kingdom of life
PDF
102 student loan defaulters named and shamed – Is someone you know on the list?
PDF
Complications of Minimal Access Surgery at WLH
PPTX
Presentation on HIE in infants and its manifestations
Abdominal Access Techniques with Prof. Dr. R K Mishra
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
STATICS OF THE RIGID BODIES Hibbelers.pdf
3rd Neelam Sanjeevareddy Memorial Lecture.pdf
Final Presentation General Medicine 03-08-2024.pptx
Module 4: Burden of Disease Tutorial Slides S2 2025
Microbial diseases, their pathogenesis and prophylaxis
Classroom Observation Tools for Teachers
Microbial disease of the cardiovascular and lymphatic systems
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
GENETICS IN BIOLOGY IN SECONDARY LEVEL FORM 3
Pharma ospi slides which help in ospi learning
Cell Structure & Organelles in detailed.
202450812 BayCHI UCSC-SV 20250812 v17.pptx
Pharmacology of Heart Failure /Pharmacotherapy of CHF
1st Inaugural Professorial Lecture held on 19th February 2020 (Governance and...
Cell Types and Its function , kingdom of life
102 student loan defaulters named and shamed – Is someone you know on the list?
Complications of Minimal Access Surgery at WLH
Presentation on HIE in infants and its manifestations

Application Of Linguistic Cues In The Analysis Of Language Of Hate Groups

  • 1. Application of linguistic cues in the analysis of language of hate groups Bartlomiej Balcerzak1 , Wojciech Jaworski1,2 , and Adam Wierzbicki1 1 Polish-Japanese Institute of Information Technology Koszykowa 86 02-008 Warsaw, Poland 2 Institute of Informatics, University of Warsaw Banacha 2 02-097 Warsaw, Poland Abstract. Hate speech and fringe ideologies are social phenomena that seem to thrive on-line. Various members of the political and religious fringe are able, via the Internet, to propagate the idea with less effort compared to more traditional media. In this article we attempt to use linguistic cues such as parts of speech occurrence in order to distinguish the language of fringe groups from strictly informative sources. The aim of this research is to provide a preliminary model for identifying decep- tive and persuasive materials on-line. Examples of such would include aggressive marketing and hate speech. For the sake of this paper we aim to focus on the political aspect. Our research has shown that informa- tion about sentence length and the occurrence of adjectives and adverbs can provide information for the identification of differences between the language of fringe political groups and mainstream news media. The most efficient method involved the classification of fringe sentences (as detected by Part of Speech occurrence) within an article, and taking into account their frequency within the article. Key words: hate speech, natural language processing, propa- ganda, machine learning 1 Introduction In cyberspace millions of people send out and review terabites of data. At the same time, countless political, religious and ideological agendas can be propa- gated with nearly no limitations. This in turn leads to an increased risk of the successful spread of various types of ideologies. Many tools can be introduced, in order to combat this process and promote a critical approach to informa- tion available on-line. With the ever growing plethora of websites and forums, an automated method of recognizing such material seems a viable solution. In this paper we propose an approach to this problem, which can lead to the de- velopment of such tools. By applying methods of natural language processing, we aim to construct a classifier for identifying the language of political propa- ganda. For the sake of this research we decided to focus on written text that
  • 2. constitute the language of entities from the political extremes. Applying meth- ods that concentrate on the structure of text rather than word count may allow for a more generalized method of automated text processing. In this paper we want to use a selection of verbal cues as a method of distinguishing the language of political propaganda. These characteristics include basic information about sentence length, and its’ structure, and will be extracted with the use of Part of Speech taggers already available for the English language. Unlike the bag of words approach, which focuses on word concurrence within a set of documents, our approach aims to focus on a deeper level of analysis, one involving cues that describe the style of the text. By using this information, we plan to construct robust classifiers able to identify the language of fringe elements in the political spectrum. In order to perform our research we collected two corpora representing the extremes of the political spectrum. One, are texts gathered from a set websites connected with American groups promoting radical nationalistic ideologies (i.e. national socialism, white and black supremacy, racism, anti-immigrant combat- ants). The other corpus consists of text from websites that promote communism. We decided to split the fringe material in two corpora, in order to take account of potential differences between extreme ideologies. This collection will be com- pared with the language of mainstream political news, both from national and local news agencies and newspapers. If this model of linguistic cues that are based on part of speech occurrence provides viable results with such differing materials it would indicate that it can also be used in more subtle scenarios. Analysis of the given material has been conducted both on the article and sentence level. This involved the application of machine learning algorithms in order to identify sentences and articles belonging to political propaganda. Our main assumption is that since fringe ideologies are bent on changing the world in a radical way, materials that endorse such a sentiment would be mostly emotional and filled with statements that evaluate the current situation as well as the desired ideal world, the end game of fully implementing the tenets of the ideology. Therefore, when speaking in linguistic term, texts belonging to a fringe ideology corpus would have higher amounts of adjectives and adverbs than strictly informative sources. Hence, the following hypotheses regarding the role each of the selected linguistic attributes have been proposed for fringe ideology detection: Hypothesis 1 Text belonging to fringe ideologies contain more adjec- tives and adverbs than strictly informative texts. which can also be rephrased as: Hypothesis 2 Sentences belonging to fringe ideologies contain more adjectives and adverbs than strictly informative texts. We also propose an additional hypothesis regarding the average length of a sen- tence in both fringe and informative sources.
  • 3. Hypothesis 3 Texts that endorse fringe ideologies contain longer sen- tences than strictly informative sources. In our current research we focus both on the sentence and article level. We aim to verify these hypotheses by using machine learning techniques. Extracted features will serve as attributes used by the algorithm for training. This means that standard procedures of machine learning evaluation will be applied. We also propose a fourth hypothesis: Hypothesis 4 When predicting whether a text belongs to a fringe ide- ology or informative sources, information about the frequency of fringe sentences within it provides the highest measures of performance (clas- sification accuracy) In order to test these hypotheses, two main tasks have been conducted. First involved using machine learning techniques to identify whether the sentences or articles from our corpus belong to the fringe or informative class. Afterwards the output from the models applied to this tasks will be used on the set of articles from our corpora. The second task would be predicting, based on the frequency of propaganda sentences in an article, as given by the classifier, whether said article belongs to the category of fringe or information. After the performance of algorithms used in each task is evaluated, attribute extraction methods are used in order to determine which attributes are crucial for determining whether the source is fringe or information. 2 Related work Various fields of natural language processing can be connected with the problem of identifying the language of political fringe. One of these fields would be text genre detection, where NLP tools are used in order to identify the genre of a given text. Such tools involve a bag-of-words approach as well as Part-of-speech n-grams as used in [1]. Other applications of genre classification include the use of both linguistic cues and html structure of the web page[2]. In our work, how- ever we focus not only on detecting text genres, but also on identifying a very specific type of narration, which can be classified as manipulative ore deceptive. This leads to a second field of study that can be applied to the study of ma- terials from the political fringe. Analysis of such sources on the Internet had been present in the field of computer science for some time. The research done so far can be divided into three main areas of investigation. First of them can be described as studies into behavioral patterns of propagandists on-line. This approach focuses mostly on the agent distributing the content, rather than the content itself, it is very strongly inspired by research into spam detection [3]. Such research was mostly focused on micro-blogging platforms such as Twitter [?]lumezano) or open source projects such as Wikipedia [4]. What is notable is the fact that in these papers the main source of traits that could identify pro- paganda are mostly meta information, referring to frequency of commenting on
  • 4. a material, and presenting the same content repeatedly. The other main type of research is the field of deception detection which deals with the problem of distin- guishing whether the author of a text intended to fool the recipient. Researchers working in this field wanted to identify linguistic cues related with deceptive behavior. Most of these cues such as the ones used by [9] in experimental setups or [6] for fraudulent financial claims emphasized basic shallow characteristics such as sentence or noun phrase length that showed to be indicative of deceptive materials. A more sophisticated method was proposed by [8]. In their paper they introduced a concept of stylometry (analyzing larger syntactical structures) for deception detection and opinion spam. Their work focuses on the structure of syntax, the relation between larger elements of the sentence (ie. Noun phrases). A slightly different approach was tested by [5] in their essay experiment. With the aid of LIWC tool[12] they aimed to identify words that appear most often in deceptive materials. In our work we want to extend and test the findings of deception detection for the task of identifying textual materials forming the bulk of fringe ideologies. It is also worth noting that most of the research in this field was done for the English language. The only instance of deception detection in another language that we managed to find was study [7] that was done for Italian. 3 Hypothesis verification 3.1 Dataset In order to test our hypothesis we gathered a corpus consisting of both radical ideological material and balanced informative text. All of the corpora represent modern American English. We prepared a collection of texts taken from three distinguished sources: – Nazi corpus: This contains articles from websites belonging to groups in the US that promote national socialism and ideas of racial and ethnic superi- ority. When collecting the websites, we used the list of hate groups provided by the Southern Poverty Law Center[18]. Noticeable sources include Amer- ican National Socialist Party, National Socialist movement, Aryan Nations etc. In all, 100 web pages were extracted. An example of a text from this corpus is shown below: To a National Socialist, things like PRIDE, HONOR, LOYALTY, COURAGE, DISCIPLINE, and MORALITY - actually MEAN something. Like our fore- fathers, we too are willing to SACRIFICE to build a better world for our children whom we love deeply, and like them as well - we are willing to DO ANYTHING NECESSARY to ACHIEVE THAT GOAL. We are your brothers and your sisters, we are your fathers and your mothers, your friends and your co-workers - WE ARE WHITE AMERICA, just like YOU! Your enemies in control attempt with their constant ”anti-nazi” propaganda - to persuade you that ”they” are your ”real friends” - and that WE, your own
  • 5. kin, are your lifelong enemies. Yet, ask yourself THIS - ”WHO” have you to ”THANK” for ALL the PROBLEMS FACING YOU? The creatures who HAVE been in total CONTROL - or - those of us resisting the evil? The TRUTH is right there in front of you - DON’T be afraid to understand it and to ACT upon it! You hold the FUTURE - BRIGHT or DARK - in YOUR hands, and TIME IS RUNNING OUT! – Communist Corpus: This corpus contains pages from websites of Ameri- can groups and parties that describe themselves as communist. These include among others: Progresive Labour Party and The Communist party of the United States of America. As with the Nazi corpus, 100 web pages were ex- tracted. Example is presented below: Today’s action, organized by Good Jobs Nation, comes a year after it filed a complaint with the Department of Labor that accused food franchises at federal buildings of violating minimum-wage and overtime laws. They want Obama to sign an executive order requiring federal agencies to contract only with companies that engage in collective bargaining. The union leaders pulling the strings behind Good Jobs Nation are the same people who got us into this mess in the first place. Most contract jobs used to be full-time union jobs, and the unions did nothing to stop the bosses from eliminating them. Now the unions are trying to rebuild their ranks among low-wage workers who replaced their former members. We need to abolish wage slavery with com- munist revolution. And the struggle between reform and revolution must be waged within struggles like this one. – News Corpus: This corpus will serve as a reference point. It contains polit- ical news and opinion sections from various American news sources (CNN, FOX news, CNBC, and local media outlets), in order to construct a balanced corpus for analysis also 100 web pages are extracted for the task of training a machine learning algorithm. A typical article from this set looks like one shown below: President Barack Obama’s policy toward Syria – three years of red lines and calls for regime change – culminated Monday in a barrage of air strikes on terror targets there, marking a turning point for the conflict and thrusting the President further into it.The U.S. said Saudi Arabia, the United Arab Emirates, Qatar, Bahrain and Jordan had joined in the attack on ISIS tar- gets near Raqqa in Syria. The U.S. also launched air strikes against another terrorist organization, the Khorasan Group.
  • 6. 3.2 Dataset preprocessing After the working corpus has been prepared we divided each sentence in it into tokens and conducted Part-of-Speech tagging (with the use of NLTK default POS tagger). Sentence and word length have been calculated as well, hence creating a database containing the following attributes of all sentences: – Sentence Class: a variable determining, whether the sentence belongs to the informative or fringe corpus – Traits describing textual quantity: number of tokens in a sentence and average number of characters in words constituting a sentence. – Frequency of adjectives and adverbs We also included occurrence of other parts of speech in order to test if they have any impact on distinguishing fringe language from informative sources. After the database was prepared, machine learning algorithms have been implemented. In order to evaluate the performance of said algorithms in both of the tasks, the following measures will be used: – Accuracy: the percentage of all true positives and true negatives produced by the machine learning algorithm. – AUC (Area under ROC Curve): the measure of True Positive to False Posi- tive ratio for various thresholds provided by the machine learning algorithm. 3.3 Machine learning: sentence prediction The task given to the algorithms was to identify whether a sentence or article belongs to the fringe or news corpus. 5-fold cross validation procedure has been implemented in order to validate the algorithms’ performance. 3.4 Article prediction In this section, we analyze the performance of the article prediction based on the frequency of Parts of Speech. Since our corpora are balanced, the baseline measuring performance is 50%, both for accuracy and AUC. Three algorithms turned put to be the most effective, these being Naive Bayes, K-NN (K=7, cosine distance) and Neural Net. Table 1. AUC and Accuracy of used algorithms Algorithms applied Acc. Nazi AUC Nazi Acc. Comm AUC Comm Naive Bayes 71% 83% 75% 85% K-NN 60% 66% 59% 61% Neural Net 70% 75% 77% 87%
  • 7. As it is shown in table 1, the best scores were produced by the Naive Bayes and Neural Net Algorithms. The accuracy and AUC values for these were within the range of 70 to 80 per cent, which indicates a moderately strong performance. It also can pointed out that higher scores were achieved for the communist corpus. This may be in part due to the fact, that communist articles came from mainly two large sources. Therefore their style maybe more cohesive. The national-socialist corpus on the other hand was more diverse, allowing for a more broader spectrum of styles. Still, both types of fringe ideologies achieved similar results. Applied methods of attribute extraction shown that the most important attributes in this task were the frequency of adjectives, adverbs, and proper nouns. As shown in table 2 containing the standardized differences between the mean values of these attributes, the frequency of adjectives was higher in both fringe corpora in relation to the informative sources. In case of adverbs the effect was visible only for Nazi materials. What is also interesting, articles belonging to fringe ideologies contained less proper nouns than informative sources. This maybe because ideological materials tend to be more vague, and connected with more general processes and entities. However such hypothesis needs to be verified separately. Table 2. Standardized differences between mean values for adjective, adverb and proper noun frequency Nazi/News Communist/News Adjectives% 0,24 0,39 Adverbs% 0,2 -0,04 Proper Nouns% -0,38 -0,33 3.5 Sentence prediction Table 3. AUC of used algorithms Acc. nazi AUC nazi Acc. communist AUC communist Naive Bayes 61% 65% 61% 65% k-NN 62% 69% 64% 64% Neural Net 65% 72% 63% 69% As shown in 3 the results tend to be between the threshold of 60 and 70%. No larger differences have been observed when analyzing the Nazi and communist
  • 8. corpora. It is also worth noting that classification conducted on the sentence level provided weaker scores than the one done for the articles. This may be due to the fact that sentences are shorter, therefor provide less data that the machine learning algorithms can use. Similarly to the article prediction task, this is also complemented by attribute extraction scores from chi-square method. According to chi-squared method the most important attributes include: sentence length, adjectives, adverbs, and proper nouns frequency. The relative importance of the attributes used is shown in table 4, containing the standardized differences between the their mean val- ues. Only the attributes with the highest differences are presented in this table. Positive values indicate that the mean value of a given attribute was higher for the fringe corpus. Negatives indicate that the value was higher for the news corpus. Table 4. Standardized differences between mean values for adjective, adverb, noun and proper noun frequency Nazi/News Communist/News Sentence length 0,12 - 0,14 Adjectives% 0,3 0,3 Adverbs% 0,2 0,04 Nouns% -0,18 0,36 Proper Nouns% -0,32 -0,4 The table show a pattern similar to that observed when classifying articles. Fringe ideology sentences, both communist and Nazi, contained a higher amount of adjectives and adverbs. They also had less proper nouns than informative sources. What sets the sentence classification aside from the article one is the relatively high difference in the frequency of nouns between communist material and informative sources, as well as the fact that both corpora varied in regards to average sentence length. The Nazi one tend to contain longer sentences, while communist collection has shorter ones. In both cases the differences are not as pronounced as in other attributes. 3.6 Sentence based article prediction For this task, we decided to use the labels applied to sentences in the previ- ous subsection in order to predict whether the entire articles belong to fringe or informative class. Frequency of fringe sentences, as provided by the Neural Net classifier has been calculated for each article from the corpora used in our research. Afterwards, we calculated the optimal threshold for classification. For Nazi materials the threshold value was 57% of fringe sentences per article, and for communists it was 63%. The performance scores for these corpora are shown in table 5.
  • 9. Table 5. Performance Accuracy AUC Nazi 81% 80% Communist 83% 81% When compared with the scores of machine learning from the method used in this subsection provides a higher accuracy (80% and 83% compared to 70% and 77% respectively), with a relatively similar level of AUC. What is also worth noting, the performance of this method is high even though the scores for sen- tence prediction rarely exceeded the 70% threshold for either accuracy and AUC. This observation indicates that even tough prediction on a sentence level can be faulty, but, when aggregated, they provide a strong signal that allows for a more successful identification of fringe sources. What is more this method proves to be slightly more efficient than using Part-of-speech information on the article level. Further work on the possibilities and limits of this approach will be pursued. 4 Conclusions and observations Our research has led as to the following conclusions and observations: – Article classification based on Part-of-Speech tagging has provided robust scores, indicating that the chosen attributes can be used for identifying fringe ideological sources. With a baseline of 50% for both accuracy and AUC, the machine learning algorithms achieved scores exceeding 70% for accuracy, and 80% for AUC. Neural Net and K Nearest Neighbors algorithms produced the best performance. – The same task conducted on the level of sentences provided weaker scores, mostly within the 60% - 70% range both for accuracy and AUC. However, when the information about the frequency of fringe sentences was applied to articles, the performance was increased to 80% for accuracy and AUC. Consequently these score provide a more robust classification than the one based on Part-of-Speech frequency on the article level. This may indicate that sentence level classification is burdened with high noise which may be countered by taking account of fringe sentences frequency in the article. – In view of collected data hypothesis 3 cannot be considered as true. The sentence length proved not to be an important attribute for determining whether the sentence belongs to a fringe or informative source. The texts from the Nazi corpus, on average, contained longer sentences. Conversely, communist sources where constructed of shorter ones. However neither pro- duced differences large enough to affect the machine learning algorithms performance. – Analysis on both the article and sentence level shows that adjectives and ad- verbs frequency plays an important role in identifying fringe sources. More- over in collected data allows us to consider hypotheses 1 and 2 as verified.
– Analysis has also shown that fringe materials contain fewer named entities than informative sources. This may suggest that fringe texts tend to be more vague than those that are solely informative.

In summary, our research lends credence to the notion that basic linguistic cues such as the occurrence of Parts-of-Speech can be used to determine whether a text belongs to the language of information or of fringe ideology.

5 Future work

The results we obtained show that there is room for further research. The use of basic stylistic cues for identifying specific narratives may be extended to include the language of political, religious or scientific discourse. In our future work we plan to gather text corpora related to such types of language (marketing, religion, science etc.) and test them with the use of shallow linguistic characteristics. We also plan to include in our model such elements of propaganda as repetition and vagueness [19][15]. We plan to conduct a more detailed analysis of the two observations we made when researching the role of linguistic features: the differences in named entity frequency between fringe and informative sources, and the frequency of positive-class sentences as a method of article prediction. Devising a computational model may lead to the development of tools dedicated to the automatic detection of specific forms of language. So far we have worked with modern documents written in English, focused mostly on one ideology. This is why in our future work we aim to extend our focus to historical texts written in different languages (Polish, German etc.). This will allow us to show whether the language of propaganda follows a rigid, unchanging pattern or whether it is strongly culture-based. It is also important for identifying linguistic cues that apply to fringe ideologies only. To this end we plan to apply the developed model to other fringe ideologies and to pseudoscience.
The results we obtained with shallow methods of POS tagging show great promise. The next natural step would be to use deeper NLP methods, such as syntax parsing or transcribing texts into logical formulas. In summary, our current work serves as preliminary research into the field of computational analysis of discourse, from which we hope to obtain an effective model that can be extrapolated to various topical domains. This will prove useful for constructing a more general theory of computational models for recognizing the language of fringe ideologies (as well as other forms of written language, such as the language of religion, science, marketing or various social classes). This will be especially important for the classification method based on the frequency of fringe sentences in an article, as it provides a more general method of text corpus analysis and thus allows for the design of more intricate automated systems that could detect propaganda or other forms of manipulative content. Focusing on structural aspects would also make such systems more resistant to deceptive strategies on the part of content producers.
6 Acknowledgments

This work was financially supported by the European Community from the European Social Fund within the INTERKADRA project. It was also supported by the grant Reconcile: Robust Online Credibility Evaluation of Web Content from Switzerland through the Swiss Contribution to the enlarged European Union.

References

1. Sharoff, Serge. "Classifying Web corpora into domain and genre using automatic feature identification." Proceedings of the 3rd Web as Corpus Workshop. 2007.
2. Santini, Marina, Richard Power, and Roger Evans. "Implementing a characterization of genre for automatic genre identification of web pages." Proceedings of the COLING/ACL on Main Conference Poster Sessions. Association for Computational Linguistics, 2006.
3. Metaxas, Panagiotis Takis. "Web spam, social propaganda and the evolution of search engine rankings." Web Information Systems and Technologies. Springer Berlin Heidelberg, 2010. 170-182.
4. Chandy, Rishi. "Wikiganda: Identifying propaganda through text analysis." Caltech Undergraduate Research Journal. Winter 2009 (2008).
5. Mihalcea, Rada, and Carlo Strapparava. "The lie detector: Explorations in the automatic recognition of deceptive language." Proceedings of the ACL-IJCNLP 2009 Conference Short Papers. Association for Computational Linguistics, 2009.
6. Humpherys, Sean L., et al. "Identification of fraudulent financial statements using linguistic credibility analysis." Decision Support Systems 50.3 (2011): 585-594.
7. Fornaciari, Tommaso, and Massimo Poesio. "Lexical vs. surface features in deceptive language analysis." Proceedings of the ICAIL 2011 Workshop on Applying Human Language Technology to the Law, AHLTL. 2011.
8. Feng, Song, Ritwik Banerjee, and Yejin Choi. "Syntactic stylometry for deception detection." Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2. Association for Computational Linguistics, 2012.
9. Ott, Myle, et al. "Finding deceptive opinion spam by any stretch of the imagination." Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1. Association for Computational Linguistics, 2011.
10. Lee, A. M., and Lee, E. B. (eds.). The Fine Art of Propaganda. The Institute for Propaganda Analysis. Harcourt, Brace and Co., 1939.
11. Bird, Steven, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O'Reilly Media, Inc., 2009.
12. Pennebaker, James W., Martha E. Francis, and Roger J. Booth. "Linguistic inquiry and word count: LIWC 2001." Mahway: Lawrence Erlbaum Associates 71 (2001): 2001.
13. Progressive Labour Party website. http://www.plp.org/
14. Communist Party of the United States of America website. www.cpusa.org
15. Gîfu, Daniela, and Dan Cristea. "Towards an Automated Semiotic Analysis of the Romanian Political Discourse." Computer Science 21.1 (2013): 61.
16. Paik, Woojin, et al. "Applying natural language processing (NLP) based metadata extraction to automatically acquire user preferences." Proceedings of the 1st International Conference on Knowledge Capture. ACM, 2001.
17. Gîfu, Daniela, and Ioan Constantin Dima. "An operational approach of communicational propaganda." International Letters of Social and Humanistic Sciences 23 (2014): 29-38.
18. http://www.splcenter.org/
19. Lacoue-Labarthe, Philippe, and Jean-Luc Nancy. Le mythe nazi. Editions de l'Aube, 1998.