Automatically Selecting Striking Images for Social Cards
Shawn M. Jones *† · Martin Klein † · Michele C. Weigle * · Michael L. Nelson *
* Old Dominion University, Web Science and Digital Libraries Research Group
† Los Alamos National Laboratory, Research Library Prototyping Team
@shawnmjones
This work is part of the Dark and Stormy Archives (DSA) project
Web archive collection of 1000s of documents → Automated solution → A story that conveys understanding at a glance
Social cards provide a visual summary of the content behind a URL
Long URL:
https://www.google.com/maps/dir/Old+Dominion+University,+Norfolk,+VA/Los+Alamos+National+Laboratory,+New+Mexico/@35.3644614,-109.356967,4z/data=!3m1!4b1!4m13!4m12!1m5!1m1!1s0x89ba99ad24ba3945:0xcd2bdc432c4e4bac!2m2!1d-76.3067676!2d36.8855515!1m5!1m1!1s0x87181246af22e765:0x7f5a90170c5df1b4!2m2!1d-106.287162!2d35.8440582
The same URL can be represented by a social card.
Social cards consist of different units
S. M. Jones, M. C. Weigle, M. Klein, and M. L. Nelson. 2021. Automatically Selecting Striking Images for Social Cards. In ACM WebSci '21. https://arxiv.org/pdf/2103.04899.pdf [to be published June 2021]
Social cards allow resources to compete for clicks.
In addition to summarizing the resource, social cards drive clicks to the resource, answering the question "What does the underlying page contain?"
Nature article shared on Twitter vs. disinformation source shared on Twitter: which of these is more appealing?
This is also a case of "The Truth Is Paywalled but the Lies Are Free".
Cards are generated based on the HTML metadata that authors provide
Title: og:title -or- twitter:title -or- <title>
Description: og:description -or- twitter:description -or- description
Image: og:image -or- twitter:image
Without twitter:card and og:title or twitter:title, Twitter typically gives up and does not generate a card.
Facebook parses the <title> and produces a card with just a title.
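The metadata lookup described on this slide can be sketched with Python's standard `html.parser`. This is a minimal illustration, not the platforms' actual card generators: the names `CardMetadataParser`, `card_fields`, and `FALLBACKS` are hypothetical, and treating the "-or-" lists as a strict left-to-right precedence order is an assumption.

```python
from html.parser import HTMLParser

# Fallback lists taken from the slide; reading them as a strict
# precedence order is an assumption of this sketch.
FALLBACKS = {
    "title": ["og:title", "twitter:title", "<title>"],
    "description": ["og:description", "twitter:description", "description"],
    "image": ["og:image", "twitter:image"],
}


class CardMetadataParser(HTMLParser):
    """Collects <meta> values and the <title> text from an HTML page."""

    def __init__(self):
        super().__init__()
        self.found = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Open Graph tags usually use property=..., the others name=...
            key = attrs.get("property") or attrs.get("name")
            content = attrs.get("content")
            if key and content:
                self.found.setdefault(key.lower(), content.strip())

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.found.setdefault("<title>", data.strip())


def card_fields(html):
    """Return the first available value for each card unit, or None."""
    parser = CardMetadataParser()
    parser.feed(html)
    parser.close()
    return {
        unit: next((parser.found[k] for k in keys if k in parser.found), None)
        for unit, keys in FALLBACKS.items()
    }


sample = """<html><head><title>Plain title</title>
<meta name="twitter:title" content="Twitter title">
<meta name="description" content="Plain description">
</head><body></body></html>"""
print(card_fields(sample))
```

A `None` value for the image unit is the case the rest of the talk addresses: the page offers no striking image, so one has to be selected from the document itself.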
RQ1: What are the distributions of HTML metadata elements (general and social card elements) in news articles (over time) and scholarly publications published on the web?
If the metadata is prevalent and of high quality, then we can rely on it. If not, then to create good cards, we need to develop methods to fill in the missing metadata.
We analyzed 198,523 news articles captured by the Internet Archive from 1998 to 2016, and found different rates of metadata adoption
Figure: adoption of HTML metadata fields over time, each field annotated with the year it was released, estimated to be released, or proposed (years shown range from 1995 to 2014). OGP = Open Graph Protocol, used for Facebook cards.
150 billion documents in the Internet Archive were captured before 2010
We evaluated the HTML pages of 110,900 scholarly articles from the PubMed Central dataset: 100 articles each from 1,109 journals
These are not archived pages, but how these articles were presented in 2020.
77.86% looks good, until we look at the images presented in these cards...
74% of scholarly publications use publisher and journal logos as striking images; 52% reuse the same image for all articles
Figure: examples of striking images shared by many articles, each labeled with the number of articles using it (101 to 2,034 articles per image; one image is blank).
For news articles, most striking images are of article content, and those that repeat across articles tend to be author photos
Figure: examples of striking images repeated across news articles, each labeled with the number of articles using it (3 to 15,823 articles per image) and grouped as publisher logo or author photos.
RQ2: What approaches and image features are best suited to automatically select striking images from news articles and scholarly publications, and do the approaches differ between the two resource types?
Good example for news
Good example for scholarly publication
If no metadata exists, we can select a striking image from the images available in the document
Which of the images outlined in red is the striking one chosen by the author?
How would a machine know which one to choose if there were no striking image specified in the metadata?
Our generic selection approach has 3 steps
1. Score each image in the document by some approach (e.g., ML probability, feature value)
2. Sort the list of images by descending score (e.g., highest ML probability is first, image with most colors is first)
3. Choose the image at the beginning of the list
Sorted by color count: 154,131 colors; 48,020 colors; 44,737 colors; 30,940 colors; 3,816 colors
Sorted by classifier probability: 0.3623; 0.1948; 0.1259; 0.1116; 0.11
Some candidate images are labeled as resized, cropped, or larger versions.
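The three steps can be sketched as a single function that accepts any scoring callable. This is a minimal illustration under stated assumptions: `select_striking_image` and the image names are hypothetical, and only the color counts come from the slide's example.

```python
from typing import Callable, Dict, List, Optional


def select_striking_image(
    images: List[Dict],
    score: Callable[[Dict], float],
) -> Optional[Dict]:
    """Score each image, sort by descending score, return the first one."""
    if not images:
        return None
    # Steps 1 and 2: score every image and sort with the highest score first.
    ranked = sorted(images, key=score, reverse=True)
    # Step 3: choose the image at the beginning of the list.
    return ranked[0]


# Hypothetical candidates; the names are invented, the color counts are
# the example values shown on the slide.
candidates = [
    {"name": "img-a", "colors": 44_737},
    {"name": "img-b", "colors": 154_131},
    {"name": "img-c", "colors": 3_816},
    {"name": "img-d", "colors": 48_020},
    {"name": "img-e", "colors": 30_940},
]

best = select_striking_image(candidates, score=lambda img: img["colors"])
print(best["name"])  # img-b, the image with the most colors
```

Passing the scorer as a parameter keeps the selection logic identical whether the score is a single feature value or a classifier's predicted probability.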
We sampled from two datasets to determine which approaches worked best for selecting striking images
News Articles: NEWSROOM dataset sample
• News articles tend to select images that represent their stories
• 37,522 news articles
Scholarly Publications: PLOS ONE dataset sample
• Submission guidelines encourage authors to choose their own striking images after acceptance
• 198,523 scholarly articles
In both, the metadata gives us the ground truth: the image that an author chose for their article.
A social card creation service needs to be able to select a striking image in close to real time, so we considered base features that are quickly calculated by image libraries
Base features
byte size: 71,934 bytes
width: 320 pixels
height: 242 pixels
size in pixels: 77,440 pixels
aspect ratio: 1.3223
number of colors: 13,891
negative space: 53 histogram columns with a value of 0
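As a rough sketch of how cheap these base features are to compute, the function below derives them from already-decoded RGB pixel rows using only the standard library. In practice an image library would decode the file first. The name `base_features` is hypothetical, and the definition of negative space (empty columns of a 256-bin grayscale histogram) is an assumption, since the slide does not define the histogram.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each 0-255


def base_features(rows: List[List[Pixel]], byte_size: int) -> Dict[str, float]:
    """Compute base features from decoded RGB pixel rows."""
    height = len(rows)
    width = len(rows[0]) if rows else 0
    pixels = [p for row in rows for p in row]

    # Assumption: "negative space" counts the empty columns of a 256-bin
    # grayscale histogram built from the mean of the three channels.
    histogram = [0] * 256
    for r, g, b in pixels:
        histogram[(r + g + b) // 3] += 1

    return {
        "byte_size": byte_size,
        "width": width,
        "height": height,
        "size_in_pixels": width * height,
        "aspect_ratio": round(width / height, 4) if height else 0.0,
        "number_of_colors": len(set(pixels)),
        "negative_space": sum(1 for count in histogram if count == 0),
    }


# Tiny 2x2 test image with three distinct colors.
tiny = [[(0, 0, 0), (255, 255, 255)], [(255, 0, 0), (0, 0, 0)]]
features = base_features(tiny, byte_size=12)
print(features)

# The derived values on the slide follow from its width and height:
print(320 * 242, round(320 / 242, 4))  # 77440 1.3223
```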
The consistent structure of scholarly publications allows us to quickly calculate additional features for each image
Figure position features
• figure position
• figure position (scaled)
Section features
• section index
• scaled section index
• character position in section
• word position in section
Caption features
• Caption TF rank
• Caption TF rank (scaled)
• Jaccard distance of title and caption
Example:
character position: 7,508
word position: 1,196
section index: 2
scaled section index: 0.333
figure position: 6
figure position (scaled): 0.857
Caption TF rank: 3
Caption TF rank (scaled): 0.429
Jaccard: 0.85
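Two of these features are simple enough to sketch directly. The code below is an illustration under assumptions, not the authors' implementation: `scaled`, `tokens`, and `jaccard_distance` are hypothetical names, word-level tokenization is an assumption, and the totals used for scaling (7 figures, 6 sections) are inferred because they reproduce the example values on the slide.

```python
import re


def scaled(value: int, total: int) -> float:
    """Scale a position or rank by the total count (assumed definition)."""
    return round(value / total, 3)


def tokens(text: str) -> set:
    """Lowercased word set; word-level tokenization is an assumption."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_distance(title: str, caption: str) -> float:
    """1 - |A & B| / |A | B| over the word sets of title and caption."""
    a, b = tokens(title), tokens(caption)
    if not a and not b:
        return 0.0
    return 1 - len(a & b) / len(a | b)


# Scaled values from the slide's example, assuming 7 figures and 6 sections.
print(scaled(6, 7))  # figure position (scaled): 0.857
print(scaled(3, 7))  # Caption TF rank (scaled): 0.429
print(scaled(2, 6))  # scaled section index: 0.333

# Hypothetical title and caption sharing 4 of 6 distinct words.
distance = jaccard_distance(
    "Striking images for social cards",
    "Social cards with striking images",
)
print(round(distance, 4))  # 0.3333
```

A distance near 1.0 means the caption shares few words with the title; a distance near 0 means they use almost the same vocabulary.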
We evaluate our approaches with P@1 and MRR
• Precision@1 (P@1): Does the prediction approach choose the right image?
− P@1 = 1.0 if yes, 0 if no
• Mean Reciprocal Rank (MRR): If it failed, how far off was it?
− the mean of the reciprocal ranks of all results
− e.g., if approach ranks the ground truth striking image as #5, then RR = 0.2
− MRR of 1.0 is desirable
• But how do we know what the correct image is?
− Did the image have the same URL as the one in the metadata?
− If not, was it perceptually the same?
Example, images sorted by color count: 154,131 colors; 48,020 colors; 44,737 colors; 30,940 colors; 3,816 colors
Image chosen by approach (most colors): rank 1
Image perceptually the same as the image chosen by author (ground truth), as determined by pHash: rank 2
P@1 = 0
RR = 1/2 = 0.5
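The evaluation can be sketched as follows. All names here are hypothetical, the perceptual hashes are made-up integers standing in for precomputed pHash values, and the `max_bits` Hamming-distance threshold is an assumption because the slides do not state one.

```python
from typing import Iterable, Sequence


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def is_match(candidate: dict, truth: dict, max_bits: int = 0) -> bool:
    """Same URL, or perceptual hashes within max_bits of each other."""
    if candidate["url"] == truth["url"]:
        return True
    return hamming(candidate["phash"], truth["phash"]) <= max_bits


def precision_at_1(ranked: Sequence[dict], truth: dict, max_bits: int = 0) -> float:
    """1.0 if the top-ranked image matches the ground truth, else 0.0."""
    return 1.0 if ranked and is_match(ranked[0], truth, max_bits) else 0.0


def reciprocal_rank(ranked: Sequence[dict], truth: dict, max_bits: int = 0) -> float:
    """1/rank of the first image matching the ground truth, 0.0 if none."""
    for rank, image in enumerate(ranked, start=1):
        if is_match(image, truth, max_bits):
            return 1.0 / rank
    return 0.0


def mean(values: Iterable[float]) -> float:
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# Hypothetical ranking: the ground truth has a different URL but the
# same perceptual hash as the image ranked second.
ranked = [
    {"url": "http://example.com/1.jpg", "phash": 0b1111000011110000},
    {"url": "http://example.com/2-resized.jpg", "phash": 0b1010101010101010},
    {"url": "http://example.com/3.jpg", "phash": 0b0000111100001111},
]
truth = {"url": "http://example.com/2.jpg", "phash": 0b1010101010101010}

p1 = precision_at_1(ranked, truth)
rr = reciprocal_rank(ranked, truth)
print(p1, rr)  # 0.0 0.5

# MRR over several documents is the mean of their reciprocal ranks.
print(mean([rr, 1.0]))  # 0.75
```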
Different features work best to predict the striking image for news articles vs. scholarly publications
37,522 news articles from NEWSROOM: P@1=0.83, MRR=0.88
198,523 scholarly articles from PLOS ONE: P@1=0.78, MRR=0.86
Conclusions
• News articles quickly adopted social cards
• Prior to 2010, there were no standards, corresponding to 150 billion documents in the Internet Archive that need automatic summarization
• News article metadata have striking images drawn from the article
• Scholarly publishers favor company or journal logos for their striking images, which do not summarize the document
• For predicting striking images based on the content of the document:
− Random Forest with base features performed best for news articles (P@1=0.83)
− Random Forest with base features and figure position performed best for scholarly publications (P@1=0.78)
• For more information, see the Dark and Stormy Archives Project: https://oduwsdl.github.io/dsa/

