Exploiting User Comments for Audio-visual
Content Indexing and Retrieval

Carsten Eickhoff, Wen Li and Arjen P. de Vries
March 25, 2013




Delft University of Technology
Challenge the future

Overview


• Introduction and statistics

• Harnessing user comments for content indexing

• Dealing with noise

• Retrieval experiments




                          User Comments for Content Indexing and Retrieval   2
Example




Content Annotation



• Audio-visual content retrieval relies on textual metadata

• Author-provided titles and descriptions are often not enough

• Collaborative tagging can provide more information




Available Annotation Sources

• Tagging content is a tedious task

• To make it more interesting, tagging is sometimes integrated in
  games and reputation schemes

• Still, 58% of the videos in a 10,000-video YouTube sample are
  annotated with fewer than 140 characters of text each

• At the same time, comment threads are massive…




Automatic term extraction
Example comments:

• "You will get kissed on the nearest possible Friday by the love of your
  life.Tomorrow will be the best day of your life.However,if you don't
  post this comment to at least 3 videos,you will die within 2
  days.Now uv started reading dis dunt stop…"

• "omg i luv that stuff"

• "lol luv it luv snoopy"

• "Cute"



Types of Noise

1. Uninformative comments
   Example: "omg i luv that stuff"

2. Unrelated comments (incl. spam)
   Example: "You will get kissed on the nearest possible Friday by the
   love of your life.Tomorrow will be the best day of your life.However,if
   you don't post this comment to at least 3 videos,you will die within 2
   days.Now uv started reading dis dunt stop…"

3. Misspellings and chat speak
   Example: "OMG YEAH LOL1!1!!! i luv that part u like robot chicken?"

4. Foreign language utterances
   Example: "Snoopy est si mignon!!" (French: "Snoopy is so cute!!")




LM-based Keyword extraction

• Find those terms that have a locally higher likelihood of
  occurrence than globally in the collection




• Similar in spirit to tf-idf, but within the language modeling (LM) framework
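
A minimal sketch of this idea, assuming maximum-likelihood unigram models over whitespace tokens and a log-ratio score weighted by the local probability. The function name `extract_keywords`, the tokenization, and the add-one smoothing are illustrative assumptions, not the scoring function used in the talk.

```python
import math
from collections import Counter

def extract_keywords(local_docs, collection_docs, top_k=5):
    """Rank terms by how much more likely they are locally than globally.

    local_docs: comments attached to one video (list of strings).
    collection_docs: comments from the whole collection (list of strings).
    The score P(t|local) * log(P(t|local) / P(t|collection)) is an assumption.
    """
    local = Counter(t for d in local_docs for t in d.lower().split())
    glob = Counter(t for d in collection_docs for t in d.lower().split())
    n_local = sum(local.values())
    n_glob = sum(glob.values())
    vocab = len(set(local) | set(glob))

    scores = {}
    for term, tf in local.items():
        p_local = tf / n_local
        # Add-one smoothing keeps the global probability non-zero.
        p_glob = (glob[term] + 1) / (n_glob + vocab)
        scores[term] = p_local * math.log(p_local / p_glob)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Terms that are frequent everywhere (such as "lol") receive low or negative scores, while terms concentrated in one video's thread rise to the top.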




Bursts

• Peaks in commenting activity may contain interesting information

  [External]: Actor wins an award
  [Internal]: Controversial comment


Generalized Burst Detection

• Kleinberg [1] measured bursts per term

• We need a more general representation of activity peaks
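
One simple way to represent activity peaks independently of individual terms is to bin comment timestamps and flag bins with unusually high volume. The sketch below is an assumption for illustration only (fixed-width bins, a mean-plus-k-standard-deviations threshold, and the hypothetical function name `detect_bursts`). It is neither Kleinberg's model nor necessarily the detector used in this work.

```python
from collections import Counter
from statistics import mean, pstdev

def detect_bursts(timestamps, bin_size=3600, k=2.0):
    """Return start times of bins whose comment count exceeds mean + k * std.

    timestamps: comment times in seconds (any epoch).
    bin_size and k are illustrative defaults, not values from the talk.
    """
    if not timestamps:
        return []
    start = int(min(timestamps))
    counts = Counter((int(t) - start) // bin_size for t in timestamps)
    n_bins = max(counts) + 1
    # Include empty bins so quiet periods pull the baseline down.
    series = [counts.get(i, 0) for i in range(n_bins)]
    threshold = mean(series) + k * pstdev(series)
    return [start + i * bin_size for i, c in enumerate(series) if c > threshold]
```

Because the detector looks only at comment volume, it captures both externally and internally triggered peaks without tracking any particular term.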




[1] Jon Kleinberg. Bursty and Hierarchical Structure in Streams, 2003

Burst and Cause

• Capturing bursts seems to help

• But we also need to capture each burst's cause

• A mixture of language models
  accounts for burst and pre-burst
  term likelihoods
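
As a rough sketch of how such a mixture might look, the code below linearly interpolates a unigram model estimated on the comments inside a burst with one estimated on the comments that precede it. The slide does not spell out the exact formulation, so the interpolation weight `lam`, the maximum-likelihood estimates, and the function names are all assumptions.

```python
from collections import Counter

def unigram_lm(comments):
    """Maximum-likelihood unigram model over whitespace tokens."""
    counts = Counter(t for c in comments for t in c.lower().split())
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}

def mixture_likelihood(term, burst_comments, pre_burst_comments, lam=0.7):
    """P(term) = lam * P(term | burst) + (1 - lam) * P(term | pre-burst).

    lam is a hypothetical interpolation weight, not a value from the talk.
    """
    p_burst = unigram_lm(burst_comments).get(term, 0.0)
    p_pre = unigram_lm(pre_burst_comments).get(term, 0.0)
    return lam * p_burst + (1 - lam) * p_pre
```

Including the pre-burst model lets terms from the comments that may have triggered the peak contribute to the score alongside terms from the peak itself.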




Vocabulary Regularization

• Currently: discriminative terms are treated as good candidates

• As a result: misspellings and non-English terms get recommended

• Wikipedia can help identify such cases:




     Snoopy

     Yeah!!1%: "Wait, that’s not a word…"
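
A minimal sketch of such a filter, assuming a pre-built set of known Wikipedia terms (for example, tokens taken from article titles). The set `wikipedia_vocab` and the function `regularize` are hypothetical, and a plain membership test is only one possible way to use Wikipedia for this purpose.

```python
def regularize(candidates, wikipedia_vocab):
    """Keep only candidate terms that appear in the Wikipedia-derived vocabulary.

    candidates: ranked list of extracted terms.
    wikipedia_vocab: set of lower-cased known terms (hypothetical resource).
    """
    return [t for t in candidates if t.lower() in wikipedia_vocab]

# Toy vocabulary standing in for a real Wikipedia term list.
wikipedia_vocab = {"snoopy", "peanuts", "beagle"}
kept = regularize(["Snoopy", "Yeah!!1%", "luv"], wikipedia_vocab)
```

In this toy run only "Snoopy" survives. The chat-speak and punctuation-laden candidates are dropped even though they may be highly discriminative.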



Data Set


• 10,000 YouTube videos crawled in 2009/10

• 20 seed queries, expanded by following “related videos” links

• 4.7 M user comments

• On average 360 comments per video (σ = 984)




Retrieval experiments

• TREC-style retrieval experiment

• 40 manually constructed topics

• Pooled top 10 results evaluated via crowdsourcing

• BM25F models with fields per source (title, description, etc.)
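
For readers unfamiliar with BM25F, the sketch below shows the general shape of a fielded BM25 score: term frequencies are length-normalized per field, combined with field weights, and only then saturated. The field names, weights, and the parameter values `b` and `k1` are illustrative assumptions and not the configuration used in these experiments.

```python
import math

def bm25f_score(query_terms, doc, avg_len, doc_freq, n_docs,
                weights, b=0.75, k1=1.2):
    """Score one document with a simple BM25F variant.

    doc: {field: list of tokens}; avg_len: {field: average field length};
    doc_freq: {term: number of documents containing it};
    weights: {field: boost}. All parameter values are illustrative.
    """
    score = 0.0
    for term in query_terms:
        pseudo_tf = 0.0
        for field, tokens in doc.items():
            tf = tokens.count(term)
            # Per-field length normalization, then field weighting.
            norm = 1 - b + b * len(tokens) / avg_len[field]
            pseudo_tf += weights.get(field, 1.0) * tf / norm
        df = doc_freq.get(term, 0)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Saturation is applied once, on the combined pseudo-frequency.
        score += idf * pseudo_tf / (k1 + pseudo_tf)
    return score
```

Adding comment-derived keywords as one more field lets them contribute to the ranking while a lower weight keeps noisy terms from dominating titles and descriptions.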




Retrieval performance




• 40% gain in mean average precision (MAP)
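
As a reminder of how the metric is computed, the sketch below implements average precision per topic, MAP over topics, and a relative gain between two systems. The helper names are hypothetical, and reading the reported figure as a relative (rather than absolute) improvement is an assumption.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision for one topic's ranked result list."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

def relative_gain(baseline_map, improved_map):
    """Relative improvement, e.g. 0.40 for a 40% gain."""
    return (improved_map - baseline_map) / baseline_map
```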


Experiments under Sparsity

• 58% of all video descriptions are shorter than 140 characters

• 50% of all titles are shorter than 35 characters

• We limit our corpus to videos with short titles and/or descriptions

• This affects 77% of all videos in our sample…
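
The restriction can be pictured as a simple filter. The sketch below assumes that the same cutoffs quoted above (35 characters for titles, 140 for descriptions) define "short", and that each video is a dictionary with hypothetical `title` and `description` keys.

```python
def is_sparse(video, max_title_chars=35, max_description_chars=140):
    """True if the title and/or the description falls below its length cutoff."""
    return (len(video["title"]) < max_title_chars
            or len(video["description"]) < max_description_chars)

# Toy records; real entries would come from the crawled sample.
videos = [
    {"title": "Snoopy dance", "description": "x" * 200},
    {"title": "A" * 60, "description": "B" * 300},
]
sparse_subset = [v for v in videos if is_sparse(v)]
```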




Retrieval performance (sparse)




• 54% gain in MAP



Closing the Circle




Conclusion

• User comments can enhance content annotation if we deal with
  the domain-inherent noise appropriately

• By modeling bursts in commenting activity, we can find informative
  on-topic comments

• Through the use of Wikipedia, misspellings and foreign-language
  utterances can be reliably identified




Future Directions

• Additional regularization resources (e.g., Delicious, WordNet)

• New domains (e.g., social media streams linked to TV)

• Content-aware term extraction

• Cold start problem

• Cross-language ability




Thank You!






Exploiting User Comments for Audio-visual Content Indexing and Retrieval (ECIR'13)
