The document surveys the kinds of duplicate content found on websites: perfect duplicates, near duplicates, partial duplicates, and content inclusion. Search engines such as Google have developed techniques to detect each kind and handle it differently. Perfect duplicates are typically filtered out before indexing, while near duplicates and pages with Different URLs but Similar Text (DUST) may be indexed but crawled less frequently to conserve resources. The document also covers the challenges of detecting each type of duplicate and how, for a given query, a search engine aims to return the most relevant page from a cluster of near-duplicate pages.
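To make the distinction concrete, here is a minimal Python sketch of one standard textbook approach to telling perfect duplicates from near duplicates: a content checksum catches byte-identical pages cheaply, and word-shingle Jaccard similarity flags pages whose text overlaps heavily. The helper names (`shingles`, `jaccard`, `classify`) and the 0.9 threshold are illustrative assumptions, not the techniques or parameters any particular search engine actually uses.

```python
import hashlib

def shingles(text: str, k: int = 5) -> set[str]:
    """Split text into overlapping k-word shingles, a common
    building block for near-duplicate detection."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two shingle sets: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def classify(doc_a: str, doc_b: str, near_threshold: float = 0.9) -> str:
    """Classify a pair of pages as perfect duplicates, near duplicates,
    or distinct. The threshold is illustrative, not an industry value."""
    # Perfect duplicates: identical bytes, caught cheaply with a checksum
    # before any expensive text comparison.
    if (hashlib.sha256(doc_a.encode()).hexdigest()
            == hashlib.sha256(doc_b.encode()).hexdigest()):
        return "perfect duplicate"
    # Near duplicates: high shingle overlap despite small edits
    # (boilerplate changes, dates, minor rewording).
    if jaccard(shingles(doc_a), shingles(doc_b)) >= near_threshold:
        return "near duplicate"
    return "distinct"
```

In practice, pairwise comparison does not scale to a web-sized corpus, so production systems use sketching schemes such as MinHash or SimHash to find candidate near-duplicate clusters without comparing every pair; the sketch above only illustrates the underlying similarity notion.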