A Multimodal Ensemble Model for Detecting
Unreliable Information on Vietnamese SNS
Phạm Quang Nhật Minh
AImesoft JSC, Vietnam
minhpham0902@gmail.com
Nguyễn Mạnh Đức Tuân
Toyo University, Japan
ductuan024@gmail.com
7th International Workshop on
Vietnamese Language and Speech Processing (VLSP 2020)
December 18, 2020
What is Fake News?
- “Fake news is a news article that is intentionally and
  verifiably false.” (Shu et al., 2017)
Why Fake News Detection?
- Fake news negatively affects society
- Fake news spreads like a real virus, especially via
  social media
  - https://engineering.stanford.edu/magazine/article/how-fake-news-spreads-real-virus
- Fake news detection helps increase the credibility of
  information in the media and prevent the spread of
  fake content
Why is Multimodality Important?
- In addition to text, images and videos are common on
  social media
  - Visual information is helpful in detecting rumors
- Other metadata is also useful: number of likes,
  shares, retweets, timestamps, etc.
Our Approach
[Figure: model overview. Inputs: text contents, images, and metadata features. Components shown: BERT + CNN, VGG-19, and a fully-connected layer. Output: classification.]
Main Findings
- The proposed attention mechanism, used to obtain the
  representation of images, is useful
- Adding residual connections in the blocks improves
  performance
- The proposed ensemble model further improves system
  performance (AUC)
Proposed Method in Detail
- Data processing
- Model architecture
- Experiments and results
Data Format
- Each piece of information includes 6 main
  attributes:
  - The anonymized id of the owner
  - Text contents
  - Timestamp
  - Number of likes
  - Number of comments
  - Number of shares
- Each news item may contain zero, one, or several
  images
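The record layout above could be modelled roughly as follows. This is only a sketch: the field names, the `Post` class, and the timestamp unit are assumptions made for illustration and are not the official dataset schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Post:
    """One item of the dataset; field names are illustrative, not the official schema."""
    user_id: str                        # anonymized id of the owner
    text: Optional[str] = None          # text contents (may be missing)
    timestamp: Optional[float] = None   # assumed to be Unix time in seconds
    num_likes: Optional[int] = None
    num_comments: Optional[int] = None
    num_shares: Optional[int] = None
    image_paths: List[str] = field(default_factory=list)  # zero or more images

    @property
    def has_image(self) -> bool:
        return len(self.image_paths) > 0


# A post with text and a like count, but no images and no share count.
post = Post(user_id="u_001", text="Tin mới về covid", num_likes=12)
```

Optional fields make missing values explicit, which matters because several attributes need imputation later.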
Text pre-processing
- Convert emojis (emoticons) such as =]] and :( into the
  Vietnamese sentiment words for "happy" or "sad".
- Shorten words and tokens that have been lengthened.
  - “coool” to “cool”
- Map the different terms for COVID-19 to a single
  term for consistency.
  - e.g. “covid”, “ncov”
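The three normalization steps could look roughly like the sketch below. The emoticon patterns, the Vietnamese replacement words ("vui", "buồn"), the list of COVID-19 variants, and the rule of collapsing three or more repeated letters to two (chosen so that “coool” becomes “cool”) are all assumptions; the slides do not give the actual mappings.

```python
import re

# Hypothetical emoticon patterns and replacement words.
EMOTICON_PATTERNS = [
    (re.compile(r"=\]+|:\)+"), "vui"),   # "happy"
    (re.compile(r":\(+"), "buồn"),       # "sad"
]
# Assumed list of variants that are mapped to the single term "covid".
COVID_VARIANTS = re.compile(r"\b(?:covid[- ]?19|covid|ncov)\b", re.IGNORECASE)
# A letter repeated three or more times is reduced to two (assumed rule).
ELONGATED = re.compile(r"([^\W\d_])\1{2,}")


def preprocess(text: str) -> str:
    """Apply emoticon mapping, de-elongation and COVID term unification."""
    for pattern, word in EMOTICON_PATTERNS:
        text = pattern.sub(f" {word} ", text)
    text = ELONGATED.sub(r"\1\1", text)
    text = COVID_VARIANTS.sub("covid", text)
    return " ".join(text.split())  # normalize whitespace


print(preprocess("coool =]] về nCoV"))  # cool vui về covid
```

Restricting the elongation rule to letters keeps numbers such as 10000 intact.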
Data Imputation
- Missing values are filled with mean values.
- For the timestamp, we applied the MICE (Multiple Imputation
  by Chained Equations) method (Azur et al., 2011)
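Mean imputation can be sketched as follows. The function name `impute_mean` and the use of `None` for a missing value are assumptions; MICE, which was used for the timestamp, is not reproduced here.

```python
from statistics import mean
from typing import List, Optional


def impute_mean(column: List[Optional[float]]) -> List[float]:
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = float(mean(observed))
    return [fill if v is None else float(v) for v in column]


likes = [10, None, 30, None, 20]
print(impute_mean(likes))  # [10.0, 20.0, 30.0, 20.0, 20.0]
```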
General Model
Given the representations of an image and a
text, the model learns which parts of the image
it should pay more attention to.
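The idea of text-guided attention over image regions can be illustrated with a small sketch. The dot-product scoring function and the helper names below are assumptions; the slides do not specify how the attention scores are computed in the actual model.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(text_vec: Vector, region_vecs: List[Vector]) -> Tuple[List[float], Vector]:
    """Weight image regions by their dot-product similarity to the text vector."""
    scores = [sum(t * r for t, r in zip(text_vec, region)) for region in region_vecs]
    weights = softmax(scores)
    dim = len(region_vecs[0])
    pooled = [sum(w * region[i] for w, region in zip(weights, region_vecs))
              for i in range(dim)]
    return weights, pooled


# Toy example: the first region is most similar to the text vector.
text = [1.0, 0.0]
regions = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
weights, image_repr = attend(text, regions)
```

The weighted sum `image_repr` is the image representation that would be passed on to the classifier; the region that matches the text best contributes most to it.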
Model 1
1D-CNN layers with filter sizes 2, 3, 4, and 5 follow the
BERT module; a fully connected layer with batch
normalization follows the 1D-CNN layers.
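What 1D convolutions with several filter sizes do to a sequence of token vectors can be shown with a toy sketch in plain Python. The all-ones kernels and the max-over-time pooling are assumptions for illustration; a real model learns many filters per size, and the slides do not state the pooling operation.

```python
from typing import List

Matrix = List[List[float]]  # shape: (sequence length, hidden size)


def conv1d_max(tokens: Matrix, kernel: Matrix) -> float:
    """Slide one filter over the token sequence and keep the largest response."""
    width = len(kernel)
    responses = []
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        responses.append(sum(
            x * w
            for row, k_row in zip(window, kernel)
            for x, w in zip(row, k_row)
        ))
    return max(responses)


def cnn_text_features(tokens: Matrix, filter_sizes=(2, 3, 4, 5)) -> List[float]:
    """One pooled feature per filter size, using toy all-ones kernels."""
    hidden = len(tokens[0])
    return [
        conv1d_max(tokens, [[1.0] * hidden for _ in range(size)])
        for size in filter_sizes
    ]


# Six "tokens" with hidden size 2, standing in for BERT outputs.
bert_output = [[0.1, 0.2], [0.3, 0.1], [0.0, 0.5],
               [0.2, 0.2], [0.4, 0.0], [0.1, 0.1]]
features = cnn_text_features(bert_output)
```

Each filter size looks at n-grams of a different length, so the sequence must contain at least as many tokens as the largest filter size.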
Models 2 & 3
Models 2 and 3 use three additional
1D-CNN layers.
Model 3 also uses residual connections
for the additional 1D-CNN layers.
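A residual connection adds a layer's input back to its output. The sketch below shows only that pattern; `toy_layer` is a hypothetical stand-in for a 1D-CNN layer, not the layer used in the actual models.

```python
from typing import Callable, List

Vector = List[float]


def residual_block(x: Vector, layer: Callable[[Vector], Vector]) -> Vector:
    """Return x + layer(x); the layer must preserve the vector length."""
    y = layer(x)
    if len(y) != len(x):
        raise ValueError("residual connection needs matching shapes")
    return [a + b for a, b in zip(x, y)]


def toy_layer(x: Vector) -> Vector:
    # Stand-in for a 1D-CNN layer with ReLU: scale, then clip negatives to zero.
    return [max(0.0, 0.5 * v) for v in x]


out = [1.0, -2.0, 4.0]
for _ in range(3):  # three additional layers, as in Models 2 and 3
    out = residual_block(out, toy_layer)
print(out)  # [3.375, -2.0, 13.5]
```

The negative component passes through unchanged because the skip path carries the input forward even when the layer outputs zero.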
Feature Design (1)
- The timestamp is converted into:
  - Day
  - Month
  - Year
  - Hour
  - Weekday
- Text-based features:
  - Number of hashtags
  - Number of URLs
  - Number of characters
  - Number of words
  - Number of question marks
  - Number of exclamation marks
  - A Boolean variable indicating whether the post contains images
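The timestamp and text-based features listed above could be extracted along these lines. The function names, the regular expressions, and the interpretation of the timestamp as Unix seconds in UTC are assumptions made for this sketch.

```python
import re
from datetime import datetime, timezone
from typing import Dict


def timestamp_features(ts: float) -> Dict[str, int]:
    """Split a Unix timestamp (seconds, read as UTC here) into calendar parts."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    return {"day": dt.day, "month": dt.month, "year": dt.year,
            "hour": dt.hour, "weekday": dt.weekday()}  # Monday is 0


def text_count_features(text: str, num_images: int) -> Dict[str, int]:
    """Simple surface counts of a post's text plus the image indicator."""
    return {
        "num_hashtags": len(re.findall(r"#\w+", text)),
        "num_urls": len(re.findall(r"https?://\S+", text)),
        "num_chars": len(text),
        "num_words": len(text.split()),
        "num_question_marks": text.count("?"),
        "num_exclamation_marks": text.count("!"),
        "has_image": int(num_images > 0),
    }


sample = "Tin nóng!!! Thật không? #covid https://example.com/a"
ts_feats = timestamp_features(1608249600)  # 2020-12-18 00:00:00 UTC, a Friday
txt_feats = text_count_features(sample, num_images=0)
```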
Feature Design (2)
- User-based features:
  - Number of unreliable news items
  - Number of reliable news items
  - The ratio between the two numbers, to indicate the user's sharing behavior
- All of the above features are standardized by subtracting the mean and
  scaling to unit variance, except for the Boolean feature.
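Standardization and the user ratio can be sketched as below. The exact definition of the ratio is not given on the slide, so `reliability_ratio` (the share of a user's posts that are unreliable) is only one possible reading; the use of the population standard deviation is likewise an assumption.

```python
from statistics import mean, pstdev
from typing import List


def reliability_ratio(num_unreliable: int, num_reliable: int) -> float:
    """Share of a user's posts labelled unreliable (one possible definition)."""
    total = num_unreliable + num_reliable
    return num_unreliable / total if total else 0.0


def standardize(values: List[float]) -> List[float]:
    """Subtract the mean and divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # a constant feature carries no information
    return [(v - mu) / sigma for v in values]


likes = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, standard deviation 2
z = standardize(likes)
print(z)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```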
Multi-Image Posts
- Some posts contain more than one image
- Two strategies:
  - Use one image as input
  - Use multiple images (at most 4) as input
Proposed Ensemble Model
- Choose the two best of the three models
- Average the probabilities returned by the two
  models
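The ensemble step amounts to picking two models and averaging their predicted probabilities. In the sketch below, ranking the models by a validation AUC is an assumption (the slide does not say how "best" was determined), and all scores and probabilities are made-up values for illustration.

```python
from typing import Dict, List


def ensemble_average(val_auc: Dict[str, float],
                     probs: Dict[str, List[float]],
                     top_k: int = 2) -> List[float]:
    """Average the predicted probabilities of the top_k models ranked by AUC."""
    best = sorted(val_auc, key=val_auc.get, reverse=True)[:top_k]
    n = len(probs[best[0]])
    return [sum(probs[name][i] for name in best) / top_k for i in range(n)]


# Made-up scores and probabilities, for illustration only.
val_auc = {"model_1": 0.93, "model_2": 0.91, "model_3": 0.94}
probs = {
    "model_1": [0.9, 0.2, 0.6],
    "model_2": [0.5, 0.5, 0.5],
    "model_3": [0.7, 0.4, 0.2],
}
averaged = ensemble_average(val_auc, probs)  # averages model_3 and model_1
```

Averaging probabilities rather than hard labels keeps the ranking information that ROC AUC depends on.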
Experiments & Results
- Evaluation measure: ROC AUC
- We conducted experiments to evaluate:
  - The effect of the pre-trained BERT model
  - Text preprocessing strategies
  - The effectiveness of the attention mechanism
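ROC AUC can be read as the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. A minimal sketch of that definition follows; the quadratic pairwise loop is chosen for clarity, not efficiency, and the labels and scores are toy values.

```python
from typing import List


def roc_auc(labels: List[int], scores: List[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.35, 0.1]
print(roc_auc(labels, scores))  # 0.75
```

Because the measure depends only on the ranking of scores, it is insensitive to the classification threshold.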
PhoBERT vs NlpHUST/vibert4news
- bert4news uses syllable-based tokenization
  - Trained on 20GB of news texts
- PhoBERT uses word-level/subword tokenization
  - Trained on 20GB of texts including Wikipedia and news
Pre-trained model    Result on private test (AUC)
PhoBERT              0.921
bert4news            0.928
Effectiveness of Attention Mechanism
- Using the attention mechanism improved the private-test
  AUC from 0.928 to 0.940
- Images and texts are correlated.
  - Images and texts of reliable news are often
    related
  - Some posts use images unrelated to the content
    of the news as clickbait
Models           Result on private test (AUC)
w/o attention    0.928
attention        0.940
Incorrect vs correct form words
- Deliberately misspelled forms, e.g. “sá.thại” instead of “sát hại” ("to kill")
  - They often carry violent content or extreme words.
  - The misspelling can bypass the social media platform’s filtering function.
- Keeping the incorrect forms works better:
  - They partly reflect the sentiment of the text.
  - Unreliable content tends to use more subjective or extreme words to
    convey a particular perspective.
Models (PhoBERT)           Result on private test (AUC)
Words in correct form      0.918
Words in incorrect form    0.921
Results
Results on the private test:
Run         Result on private test (AUC)
Model 1     0.939
Model 2     0.919
Model 3     0.940
Ensemble    0.945
Future work
- Use external data for fake news detection
- A natural way to judge whether a news item is fake
  is to compare it against different information
  sources and find relevant evidence.
Thank you very much for listening!

