Forensic Linguistics with Apache Spark
Kostas Perifanos
@k_perifanos
Idiolect, sociolect, intertextuality
What?
- Idiolect: individual’s distinctive and unique use of language
- Sociolect: variety of language associated with a social group (socioeconomic, ethnic, age)
- Intertextuality: the shaping of a text’s meaning by another text
Forensic Linguistics
“Forensic linguistics, legal linguistics, or language and the law, is the application of
linguistic knowledge, methods and insights to the forensic context of law,
language, crime investigation, trial, and judicial procedure. It is a branch of applied
linguistics.” [Wikipedia]
- Authorship Attribution
- Authorship Identification
- Gender/Age classification etc
Dataset
- 8m tweets between 18/06/2015 - 06/08/2015
- 92m words (white space tokenized)
- 190K users
- Key events during this period
  - Referendum Announcement
  - Capital Controls
  - Referendum voting
Toolset
- Apache Spark 1.6.1
- RDD
- DataFrames / Spark SQL
- Word2vec, KMeans
- Apache Zeppelin
- Gephi
Basic Data Exploration - Counting
Check for trends:
- Lowercase vs Uppercase ratios
- Relative frequencies of important (propaganda) words
- Average text length (per day)
- Average word length (per day)
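The per-day counts listed on this slide can be sketched in a few lines. This is a minimal plain-Python illustration, not the Spark job from the talk; the `(day, text)` input shape, the function name `daily_stats`, the keyword and the sample tweets are all assumptions made for the example.

```python
from collections import defaultdict

def daily_stats(tweets, keyword):
    """tweets: iterable of (day, text) pairs. Returns {day: metrics}."""
    acc = defaultdict(lambda: {"lower": 0, "upper": 0, "chars": 0,
                               "word_chars": 0, "words": 0,
                               "hits": 0, "tweets": 0})
    for day, text in tweets:
        words = text.split()  # whitespace tokenization
        a = acc[day]
        a["tweets"] += 1
        a["chars"] += len(text)
        a["words"] += len(words)
        a["word_chars"] += sum(len(w) for w in words)
        a["lower"] += sum(ch.islower() for ch in text)
        a["upper"] += sum(ch.isupper() for ch in text)
        a["hits"] += sum(w.lower() == keyword for w in words)

    out = {}
    for day, a in acc.items():
        out[day] = {
            # Ratio is undefined when a day has no uppercase characters.
            "lower_upper_ratio": a["lower"] / a["upper"] if a["upper"] else float("inf"),
            "avg_text_len": a["chars"] / a["tweets"],
            "avg_word_len": a["word_chars"] / a["words"] if a["words"] else 0.0,
            "keyword_rel_freq": a["hits"] / a["words"] if a["words"] else 0.0,
        }
    return out

# Made-up sample tweets, for illustration only.
sample = [
    ("2015-07-05", "NO means no"),
    ("2015-07-05", "vote YES"),
    ("2015-07-06", "no more news"),
]
stats = daily_stats(sample, keyword="no")
```

Each metric is a sum followed by a division, so the same logic should map onto a group-by-day aggregation in Spark.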
Counting - lowercase / uppercase ratio
Counting - Propaganda
Similarities & user interactions
- Build a word2vec model, treat @mentions as vocabulary words
- Find top-N “synonyms” using seed accounts, keep all starting with “@”
  - @handle1: @handle2, @handle3, ...
  - @handle32: @handle5, @handle3, ...
- Visualize the graph
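The neighbour lookup behind the mention graph can be sketched as follows. The tiny 2-d vectors stand in for a trained word2vec model and are hypothetical, as are the function names; a real model would supply the vectors and the nearest-neighbour query.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mention_edges(vectors, seeds, top_n):
    """For each seed handle, rank every other token by cosine similarity,
    take the top_n, and keep only the tokens that are @mentions."""
    edges = []
    for seed in seeds:
        ranked = sorted(
            (tok for tok in vectors if tok != seed),
            key=lambda tok: cosine(vectors[seed], vectors[tok]),
            reverse=True,
        )
        for tok in ranked[:top_n]:
            if tok.startswith("@"):
                weight = round(cosine(vectors[seed], vectors[tok]), 3)
                edges.append((seed, tok, weight))
    return edges

# Hypothetical embedding values, for illustration only.
vectors = {
    "@handle1": [1.0, 0.0],
    "@handle2": [0.9, 0.1],
    "@handle3": [0.8, 0.3],
    "referendum": [0.95, 0.05],
    "@handle5": [0.0, 1.0],
}
edges = mention_edges(vectors, seeds=["@handle1"], top_n=3)
```

The resulting `(source, target, weight)` triples are the kind of edge list a graph tool such as Gephi can import.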
Similarities & interactions graph [Gephi]
Gephi : Modularity analysis, 9 communities detected
Communities:
- “Yes”, black
- “No”, magenta
- media, red
- celebrities, dark green
- “Romantic twitter”, orange
- ....
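For readers unfamiliar with the modularity score that community detection tries to maximize, here is a minimal sketch that evaluates it for a given partition of an undirected, unweighted graph. It only scores a partition; it is not the community search that Gephi runs, and the toy graph is made up.

```python
def modularity(edges, community):
    """Q = sum over communities of (internal_edges / m) - (degree_sum / 2m)^2.

    edges: list of undirected (u, v) pairs, no self-loops.
    community: dict mapping node -> community label.
    """
    m = len(edges)
    degree = {}
    internal = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        if community[u] == community[v]:
            c = community[u]
            internal[c] = internal.get(c, 0) + 1

    degree_sum = {}
    for node, d in degree.items():
        c = community[node]
        degree_sum[c] = degree_sum.get(c, 0) + d

    return sum(internal.get(c, 0) / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)

# Toy graph: two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {n: "a" for n in range(6)}

q_split = modularity(edges, two_groups)
q_merged = modularity(edges, one_group)
```

Splitting the two triangles scores higher than lumping all nodes together, which is the signal a modularity-based method follows when it proposes communities.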
2. Idiolect: Style signatures
- Choose top N most frequent words [1]
- Build frequency vectors for all users
- Compare user signatures [e.g. cosine similarity]
- Identified a double-account user among 180K candidates (so much for anonymity)
[1] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3694980/
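The style-signature comparison can be sketched as below: pick the N most frequent words overall, turn each user's texts into a relative-frequency vector over those words, and compare vectors with cosine similarity. The user names, texts and N are invented for illustration, and the helper names are not from the talk.

```python
from collections import Counter
import math

def tokens(texts):
    return [w for t in texts for w in t.lower().split()]

def top_words(texts_by_user, n):
    counts = Counter()
    for texts in texts_by_user.values():
        counts.update(tokens(texts))
    return [w for w, _ in counts.most_common(n)]

def signature(texts, vocab):
    """Relative frequency of each vocabulary word in one user's texts."""
    counts = Counter(tokens(texts))
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def most_similar_pair(signatures):
    users = sorted(signatures)
    pairs = [(cosine(signatures[a], signatures[b]), a, b)
             for i, a in enumerate(users) for b in users[i + 1:]]
    return max(pairs)

# Made-up texts: user_a and user_a_alt share function-word habits.
texts_by_user = {
    "user_a": ["the vote is what it is", "it is the end of it"],
    "user_a_alt": ["it is what it is", "the end of the line it is"],
    "user_b": ["we will win and we will vote", "we want more and more"],
}
vocab = top_words(texts_by_user, n=5)
signatures = {u: signature(t, vocab) for u, t in texts_by_user.items()}
score, first, second = most_similar_pair(signatures)
```

The all-pairs loop is quadratic in the number of users, so at the scale mentioned on the slide it would need a distributed or approximate similarity search instead.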
Sociolect: Clustering
- Apply clustering on signature vectors
- KMeans on signatures
- KMeans on word2vec vectors:
- Transform words to vectors, sum and average
- Also works very well for metaphor detection
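The "sum and average, then KMeans" step can be sketched as follows. The word vectors are hypothetical 2-d stand-ins for word2vec output, the k-means loop is a bare-bones Lloyd iteration with hand-picked starting centroids, and none of the names come from the talk.

```python
def average_vector(words, vectors):
    """Average the vectors of the words that are in the vocabulary."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    return [sum(col) / len(known) for col in zip(*known)]

def kmeans(points, start_centroids, iterations=10):
    centroids = [list(c) for c in start_centroids]  # do not mutate the input
    labels = []
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(len(centroids)),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centroids[c])))
            for pt in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical embedding values and user texts, for illustration only.
vectors = {"bank": [1.0, 0.0], "euro": [0.9, 0.2],
           "love": [0.0, 1.0], "moon": [0.1, 0.9]}
user_texts = {"u1": "bank euro", "u2": "euro bank crisis",
              "u3": "love moon", "u4": "moon love love"}

users = sorted(user_texts)
points = [average_vector(user_texts[u].split(), vectors) for u in users]
labels, centroids = kmeans(points, start_centroids=[points[0], points[2]])
```

Words missing from the vocabulary ("crisis" here) are simply skipped, which is one common convention; other choices are possible.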
Intertextuality: LDA + signatures
- User generates texts by sampling a number of topics
- “Similar” users will tend to have similar topic distributions
- Given a subset of similar users, identify the most influential one, e.g. the user who enforces the writing style. [But that’s another presentation :)]
Challenges
- Noise
- “Random events”
- Opinion shifting: people change their opinions and their writing styles accordingly. Social media tends to amplify this behaviour [one more presentation :)]
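One way to make "similar topic distributions" concrete is to compare per-user topic vectors with the Jensen-Shannon divergence. This is an assumed choice of measure, not one named in the talk, and the distributions below are invented rather than produced by a fitted LDA model.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Symmetric, bounded in [0, 1] when using log base 2."""
    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)

# Hypothetical per-user topic distributions over three topics.
topics = {
    "u1": [0.7, 0.2, 0.1],
    "u2": [0.6, 0.3, 0.1],
    "u3": [0.1, 0.1, 0.8],
}
d_close = js_divergence(topics["u1"], topics["u2"])
d_far = js_divergence(topics["u1"], topics["u3"])
```

A lower value means the two users draw on topics in similar proportions; 0 means identical distributions and 1 means no overlap at all.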
Next steps
- User - Topic Classification
- Gender classification
- Age
- Personality, stress, anxiety etc
- Try Deep Learning approaches
Thank you!
Questions?
@k_perifanos - http://github.com/kperi

