Automatically Selecting Striking Images for Social Cards

Shawn M. Jones *† · Martin Klein † · Michele C. Weigle * · Michael L. Nelson *
* Old Dominion University, Web Science and Digital Libraries Research Group
† Los Alamos National Laboratory, Research Library Prototyping Team

@shawnmjones
This work is part of the Dark and Stormy Archives (DSA) project
Web archive collection of 1000s of documents -> Automated solution -> A story that conveys understanding at a glance

Social cards provide a visual summary of the
content behind a URL
Long URL:
https://www.google.com/maps/dir/Old+Dominion+University,+Norfolk,+VA/Los+Alamos+National+Laboratory,+New+Mexico/@35.3644614,-109.356967,4z/data=!3m1!4b1!4m13!4m12!1m5!1m1!1s0x89ba99ad24ba3945:0xcd2bdc432c4e4bac!2m2!1d-76.3067676!2d36.8855515!1m5!1m1!1s0x87181246af22e765:0x7f5a90170c5df1b4!2m2!1d-106.287162!2d35.8440582
The same URL represented by a social card:

Social cards consist of different units
S. M. Jones, M. C. Weigle, M. Klein, and M. L. Nelson. 2021. Automatically Selecting Striking Images for Social
Cards. In ACM WebSci '21. https://arxiv.org/pdf/2103.04899.pdf [to be published June 2021]

Social cards allow resources to
compete for clicks.
Nature article shared on Twitter
In addition to summarizing the resource, social cards drive clicks to the resource by answering the
question "What does the underlying page contain?"
Which of these is more appealing?
This is also a case of "The Truth Is Paywalled but the Lies Are Free"
disinformation source
shared on Twitter

Cards are generated based on the
HTML metadata that authors provide
og:title -or- twitter:title -or- <title>
og:description -or- twitter:description -or- description
og:image -or- twitter:image
Without twitter:card and og:title or twitter:title, Twitter typically gives up and does
not generate a card.
Facebook parses the <title> and produces a card with just a title.
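
The fallback order above can be sketched in Python with only the standard library. This is a minimal illustration of reading the fields in that order, not the logic that Twitter or Facebook actually run; the names `CardMetadataParser`, `first_present`, and `card_fields` are hypothetical.

```python
from html.parser import HTMLParser


class CardMetadataParser(HTMLParser):
    """Collects <meta> values and the <title> text from an HTML page."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            # Open Graph fields use property=..., Twitter and plain HTML use name=...
            key = attrs.get("property") or attrs.get("name")
            if key and attrs.get("content"):
                # Keep the first value seen for each key.
                self.meta.setdefault(key.lower(), attrs["content"])
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and self.title is None and data.strip():
            self.title = data.strip()


def first_present(*values):
    """Return the first non-empty value, or None if all are missing."""
    for value in values:
        if value:
            return value
    return None


def card_fields(html):
    """Resolve title, description, and image using the slide's fallback order."""
    parser = CardMetadataParser()
    parser.feed(html)
    m = parser.meta
    return {
        "title": first_present(m.get("og:title"), m.get("twitter:title"), parser.title),
        "description": first_present(
            m.get("og:description"), m.get("twitter:description"), m.get("description")
        ),
        "image": first_present(m.get("og:image"), m.get("twitter:image")),
    }
```

A page with no image metadata yields `"image": None`, which is the case the rest of this talk addresses.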

RQ1: What are the distributions of HTML
metadata elements (general and social card
elements) in news articles (over time) and
scholarly publications published on the web?
If the metadata is prevalent and of high quality, then we can rely on it.
If not, then to create good cards, we need to develop methods to fill in
the missing metadata.

We analyzed 198,523 news articles captured by
the Internet Archive from 1998 to 2016, and found
different rates
of metadata adoption
[Figure: metadata adoption rates over time, with each metadata standard annotated by its release year (released, estimated, or proposed; 1995 to 2014)]
OGP = Open Graph Protocol
Facebook Cards
150 billion documents in the Internet Archive were captured before 2010

We evaluated the HTML pages of 110,900
scholarly articles from the PubMed Central
dataset – 100 articles each from 1,109 journals
These are not archived pages, but how these articles were presented in 2020.
77.86% looks good, until we look at
the images presented in these cards...

74% of scholarly publications use publisher and
journal logos as striking images; 52% reuse the
same image for all articles
[Figure: example logos used as striking images, each labeled with the number of articles sharing it, from 101 to 2,034 articles; one image is blank]

For news articles, most striking images are of article
content, and those that repeat across articles tend
to be author photos
[Figure: example striking images from news articles, each labeled with the number of articles sharing it, from 3 to 15,823 articles; repeated images include a publisher logo and author photos]

RQ2: What approaches and image features
are best suited to automatically select striking
images from news articles and scholarly
publications, and do the approaches differ between
the two resource types?
Good example for news
Good example for scholarly publication

If no metadata
exists, we can
select a striking
image from the
images available
in the document
Which of the images
outlined in red is the striking
one chosen by the author?
How would a machine know
which one to choose if there
were no striking image
specified in the metadata?

Our generic
selection approach
has 3 steps
1. Score each image in the
document by some
approach (e.g., ML
probability, feature value)
2. Sort the list of images by
descending score (e.g.,
highest ML probability is
first, image with most
colors is first)
3. Choose the image at the
beginning of the list
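
The three steps can be sketched as a pair of small Python functions. This is a minimal sketch, assuming each image is a dictionary and the scoring approach is passed in as a function; the names `rank_images` and `select_striking_image` and the image URLs are hypothetical, while the color counts are the ones shown on this slide.

```python
def rank_images(images, score):
    """Steps 1 and 2: score every image, then sort by descending score."""
    return sorted(images, key=score, reverse=True)


def select_striking_image(images, score):
    """Step 3: choose the image at the beginning of the ranked list."""
    ranked = rank_images(images, score)
    return ranked[0] if ranked else None


# Hypothetical candidates from one document, scored by number of colors.
candidates = [
    {"url": "c.png", "colors": 44737},
    {"url": "a.png", "colors": 154131},
    {"url": "e.png", "colors": 3816},
    {"url": "b.png", "colors": 48020},
    {"url": "d.png", "colors": 30940},
]

best = select_striking_image(candidates, score=lambda image: image["colors"])
```

Because the score is just a function, the same code covers both cases on the slide: a single feature value such as color count, or the probability returned by a trained classifier.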
[Figure: candidate images ranked two ways; some thumbnails are marked as resized, cropped, or larger]
Sorted by color count: 154,131 colors; 48,020 colors; 44,737 colors; 30,940 colors; 3,816 colors
Sorted by classifier probability: 0.3623; 0.1948; 0.1259; 0.1116; 0.11

We sampled from two datasets to determine
which approaches worked best for selecting
striking images
News Articles: NEWSROOM dataset sample
• News articles tend to select images that
represent their stories
• 37,522 news articles
Scholarly Publications: PLOS ONE dataset sample
• Submission guidelines encourage authors to
choose their own striking images after
acceptance
• 198,523 scholarly articles
In both, the metadata gives us the ground truth:
the image that an author chose for their article.

A social card creation service needs to be able to select a striking
image in close to real time, so we considered base features that
are quickly calculated by image libraries
Base features
byte size: 71,934 bytes
width: 320 pixels
height: 242 pixels
size in pixels: 77,440 pixels
aspect ratio: 1.3223
number of colors: 13,891
negative space: 53 histogram columns = 0
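
As a rough illustration of how cheap these base features are, the sketch below computes them from already-decoded pixels using only the Python standard library. Several things here are assumptions rather than the authors' implementation: the function name `base_features` is hypothetical, the image is assumed to be a list of rows of (r, g, b) tuples (a real service would use an image library to decode the file), and negative space is assumed to be the number of empty columns in a 256-bin grayscale histogram.

```python
def base_features(pixels, byte_size):
    """Compute base features for one image.

    pixels: list of rows, each row a list of (r, g, b) tuples (assumed decoded).
    byte_size: size of the encoded image file in bytes.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    flat = [pixel for row in pixels for pixel in row]

    # Assumed definition of negative space: histogram columns with a count of 0.
    histogram = [0] * 256
    for r, g, b in flat:
        histogram[(r + g + b) // 3] += 1

    return {
        "byte size": byte_size,
        "width": width,
        "height": height,
        "size in pixels": width * height,
        "aspect ratio": round(width / height, 4) if height else 0.0,
        "number of colors": len(set(flat)),
        "negative space": sum(1 for count in histogram if count == 0),
    }
```

For an image with the slide's dimensions (320 by 242 pixels), this gives 77,440 pixels and an aspect ratio of 1.3223, matching the example values.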

The consistent structure of scholarly
publications allows us to quickly calculate
additional features for each image
Figure position features
• figure position
• figure position (scaled)
Section features
• section index
• scaled section index
• character position in section
• word position in section
Caption features
• Caption TF rank
• Caption TF rank (scaled)
• Jaccard distance of title and caption
Example values:
figure position: 6
figure position (scaled): 0.857
section index: 2
scaled section index: 0.333
character position: 7,508
word position: 1,196
Caption TF rank: 3
Caption TF rank (scaled): 0.429
Jaccard: 0.85
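
Two of these feature families are simple enough to sketch in a few lines of Python. The scaling rule is an inference, not something the slide states: the example values are consistent with dividing by a total count (6/7 ≈ 0.857, 3/7 ≈ 0.429, 2/6 ≈ 0.333), so the sketch assumes that. The Jaccard distance uses lowercase whitespace tokenization, which is also an assumption; the function names are hypothetical.

```python
def scaled(value, total):
    """Scale a position or rank by the total count (assumed scaling rule)."""
    return value / total


def jaccard_distance(title, caption):
    """1 minus the Jaccard similarity of the two word sets."""
    title_words = set(title.lower().split())
    caption_words = set(caption.lower().split())
    if not title_words and not caption_words:
        return 0.0
    shared = title_words & caption_words
    combined = title_words | caption_words
    return 1 - len(shared) / len(combined)


# Reproduce the scaled example values under the assumed totals.
figure_position_scaled = round(scaled(6, 7), 3)
caption_tf_rank_scaled = round(scaled(3, 7), 3)
section_index_scaled = round(scaled(2, 6), 3)
```

A distance of 0 means the title and caption use exactly the same words, and a distance of 1 means they share none.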

We evaluate our approaches with
P@1 and MRR
• Precision@1 (P@1): Does the prediction
approach choose the right image?
− P@1 = 1.0 if yes, 0 if no
• Mean Reciprocal Rank (MRR): If it failed,
how far off was it?
− the mean of the reciprocal ranks of all results
− e.g., if an approach ranks the ground truth
striking image as #5, then RR = 1/5 = 0.2
− MRR of 1.0 is desirable
• But how do we know what the correct
image is?
− Did the image have the same URL as the one
in the metadata?
− If not, was it perceptually the same?
[Figure: five candidate images sorted by color count: 154,131 colors; 48,020 colors; 44,737 colors; 30,940 colors; 3,816 colors]
Image chosen by approach: most colors
Image chosen by author (ground truth)
Image ranked second: perceptually the same as the image chosen by author, as determined by pHash
P@1 = 0
RR = 1/2 = 0.5
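
Both metrics can be sketched in Python as follows. This is a minimal sketch with hypothetical function names; plain equality stands in for the real match test (same URL, or perceptually the same according to pHash), which is not implemented here.

```python
def precision_at_1(ranked, truth):
    """1.0 if the top-ranked image is the ground truth, otherwise 0.0."""
    return 1.0 if ranked and ranked[0] == truth else 0.0


def reciprocal_rank(ranked, truth):
    """1 / rank of the ground truth image, or 0.0 if it was never ranked."""
    for position, image in enumerate(ranked, start=1):
        if image == truth:
            return 1 / position
    return 0.0


def evaluate(results):
    """Average P@1 and RR over (ranked_list, ground_truth) pairs, one per document."""
    if not results:
        return {"P@1": 0.0, "MRR": 0.0}
    p_at_1 = sum(precision_at_1(ranked, truth) for ranked, truth in results)
    rr = sum(reciprocal_rank(ranked, truth) for ranked, truth in results)
    return {"P@1": p_at_1 / len(results), "MRR": rr / len(results)}


# The slide's example: the author's image is ranked second by color count.
ranking = ["img1", "img2", "img3", "img4", "img5"]
example_p_at_1 = precision_at_1(ranking, "img2")
example_rr = reciprocal_rank(ranking, "img2")
```

Averaged over a whole dataset, P@1 is the fraction of documents where the approach picked the author's image, and MRR also gives partial credit when the author's image was ranked near the top.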

Different features work best to predict the striking
image for news articles vs. scholarly publications
37,522 news articles from NEWSROOM: P@1=0.83, MRR=0.88
198,523 scholarly articles from PLOS ONE: P@1=0.78, MRR=0.86

Conclusions
• News articles quickly adopted social cards
• Prior to 2010, there were no social card standards, so
the 150 billion documents captured by the Internet Archive
before then need automatic summarization
• News article metadata have striking images drawn from
the article
• Scholarly publishers favor company or journal logos for
their striking images, which do not summarize the document
• For predicting striking images based on the content of
the document:
− Random Forest with base features performed best for
news articles (P@1=0.83)
− Random Forest with base features and figure position
performed best for scholarly publications (P@1=0.78)
• For more information, see the
Dark and Stormy Archives Project:
https://oduwsdl.github.io/dsa/
