Web Pages Visual Similarity
Hi everyone!
I’m Giacomo, R&D director at Merj.
Merj blends digital marketing and engineering expertise to solve
complex challenges for enterprise companies.
We act as an embedded team for our customers, providing
transformative solutions that merge strategy, data, automation
and technology.
*We are hiring! More on that later…
What we are going to cover today
1. Understand the context of the research
2. Explain why visual similarity is useful
3. How to define visual similarity
4. Implementation overview
5. Additional use cases
The project
Scenario
A company we were working with wanted to consolidate its
brands onto a unified technical stack.
They wanted to know if this would be a problem for users and
search engines.
Can you define similarity?
Similarity means different things to different people.
We approach this concept by defining it in terms of text
similarity and visual similarity.
The text similarity part was already completed
I joined Merj when this project was already underway, with the
team in the final stages of completing the text similarity
component.
Text similarity is a well-established problem with numerous
documented approaches for tackling it effectively.
In the following slides, we will turn our attention to visual
similarity.
Visual Similarity
Why visual similarity?
Jakob's Law of Internet User Experience
“Users spend most of their time on other sites. This means that
users prefer your site to work the same way as all the other sites
they already know. Design for patterns for which users are
accustomed.”
- Jakob Nielsen
Source: https://www.nngroup.com/videos/jakobs-law-internet-ux/
When I’m familiar with it, I know how it works
Icon source: Icons from flaticon.com
When is being similar too much?
When websites are too similar, the advantage of having multiple
brands can be diminished. Users may perceive the websites as
identical, which could negatively impact business metrics.
By comparing multiple internal brands and competitors, and
incorporating business metrics, we wanted to define the
optimal threshold for visual similarity.
Looking for a different approach to Visual Similarity
The team had already explored several machine learning
approaches. However, implementing them at the scale we
required turned out to be both slow and costly.
We started looking at a different approach.
Project Status
What we had:
- List of domains to compare (internal brands and
competitors)
- List of categories for each domain (HTML templates)
- List of web pages for each category
- Web Crawler (utilising a Headless Browser to render web
pages)
Browser Rendering Process 101
Browser Rendering Process
Image source: https://web.dev/articles/howbrowserswork
1. Parsing HTML to construct the DOM tree
2. Render tree construction
3. Layout of the render tree
4. Painting the render tree
#1 Parsing
Image source: https://developers.google.com/web/updates/2018/09/inside-browser-part3
#2 Style calculation
Image source: https://developers.google.com/web/updates/2018/09/inside-browser-part3
#3 Layout
Image source: https://developers.google.com/web/updates/2018/09/inside-browser-part3
#4 Painting
Image source: https://developers.google.com/web/updates/2018/09/inside-browser-part3
How can we use this for the project?
Web Pages Visual Similarity - Search Central Live Zurich 2024
Hypothesis
“Two web pages can be compared for visual similarity by
evaluating the elements that share similar coordinates and
dimensions.”
Implementation overview
Headless Chrome
Headless Chrome was shipped in Chrome 59 in 2017.
Automated Browser Actions
Many libraries provide a high-level API for controlling Chrome
by abstracting the Chrome DevTools Protocol (CDP).
For lower-level tasks, we can use CDP directly. It is a protocol
designed to automate actions on Chromium, Chrome, and other
Blink-based browsers.
Source: https://chromedevtools.github.io/devtools-protocol/
CDP’s DOMSnapshot.captureSnapshot
Using the Chrome DevTools Protocol, we can get a snapshot of
all nodes (elements) rendered on the page, together with their
content, positions, and dimensions.
Source: https://chromedevtools.github.io/devtools-protocol/tot/DOMSnapshot/#method-captureSnapshot
Code example
// Requires Puppeteer; run as an ES module so that top-level await works
import puppeteer from 'puppeteer';
// Launch Puppeteer and open a new page
const browser = await puppeteer.launch({ headless: true });
const context = await browser.createBrowserContext();
const page = await context.newPage();
// Connect to the DevTools protocol
const CDPclient = await page.createCDPSession();
// Enable the DOMSnapshot domain
await CDPclient.send('DOMSnapshot.enable');
// Navigate to the target URL
const url = 'https://example.com';
await page.goto(url, { waitUntil: 'networkidle2' });
// Capture a DOM snapshot
const snapshot = await CDPclient.send('DOMSnapshot.captureSnapshot', {
  // Computed styles to return for each node
  computedStyles: ['visibility', 'display', 'z-index', 'color', 'background-color']
});
// Close the browser once the snapshot has been captured
await browser.close();
Parsing and normalising the DOMSnapshot output
INFORMATION ABOUT THE ELEMENT
nodeType:1
nodeName:"DIV"
attributes:[{"name":"class","value":"P6T2B6 _244GCM"}]
...
LAYOUT TREE INFORMATION
"x":868.390625,"y":81,"width":22,"height":22
Source: https://chromedevtools.github.io/devtools-protocol/tot/DOMSnapshot/#method-captureSnapshot
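As a minimal sketch of the normalising step, assuming the response shape documented for DOMSnapshot.captureSnapshot (a shared strings table plus, per document, nodes and layout arrays), the function below flattens the layout tree into one record per laid-out node. The name normaliseSnapshot, the output record shape, and the hand-built sample response are illustrative assumptions, not the project's actual code.

```javascript
// Flatten a DOMSnapshot.captureSnapshot response into plain box records.
// Assumes the documented shape: snapshot.strings is a shared string table,
// and each document has nodes.nodeName (indexes into that table) and
// layout.nodeIndex / layout.bounds ([x, y, width, height] per layout node).
function normaliseSnapshot(snapshot) {
  const boxes = [];
  for (const doc of snapshot.documents) {
    const { nodeIndex, bounds } = doc.layout;
    for (let i = 0; i < nodeIndex.length; i++) {
      const [x, y, width, height] = bounds[i];
      const nameIndex = doc.nodes.nodeName[nodeIndex[i]];
      boxes.push({ nodeName: snapshot.strings[nameIndex], x, y, width, height });
    }
  }
  return boxes;
}

// Hand-built miniature response standing in for real CDP output.
const sampleSnapshot = {
  strings: ['#document', 'DIV'],
  documents: [{
    nodes: { nodeName: [0, 1] },
    layout: { nodeIndex: [1], bounds: [[868.390625, 81, 22, 22]] },
  }],
};
const normalisedBoxes = normaliseSnapshot(sampleSnapshot);
console.log(normalisedBoxes);
```

Only nodes that appear in the layout tree produce a record, so nodes that were never laid out are skipped without extra filtering.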
Node coordinates and dimensions
[Diagram: a node's bounding box, labelled with x, y, width, and height]
Sets of elements (nodes) for two web pages
PAGE A
"x":832.375,"y":84.296875,"width":16,"height":16
"x":797.734375,"y":83.296875,"width":30.640625,"height":17
"x":860.390625,"y":75,"width":78.890625,"height":36
"x":868.390625,"y":81,"width":22,"height":22
…
PAGE B
"x":868.390625,"y":81,"width":22,"height":22
"x":898.390625,"y":83.28125,"width":32.890625,"height":17
"x":943.28125,"y":75,"width":60.71875,"height":36
"x":24,"y":181,"width":976,"height":128
…
Algorithm v0.1
ThresholdWidth = PageWidth * X%
ThresholdHeight = PageHeight * X%
BoxA = "x":832.375,"y":84.296875,"width":16,"height":16
BoxB = "x":868.390625,"y":81,"width":22,"height":22
if BoxB coordinates are within BoxA coordinates ± thresholds
   and BoxB dimensions are within BoxA dimensions ± thresholds
then
   Boxes are similar (add to the list of similar boxes)
else
   Boxes are not similar
…
Continue comparing BoxA with …
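The comparison step above could be sketched as follows. This is one possible reading of the pseudocode, not the project's implementation: each coordinate and dimension of BoxB must lie within the corresponding threshold of BoxA. The function name boxesAreSimilar, the 1024x768 page size, and the 5% and 1% tolerances are assumptions chosen for illustration.

```javascript
// Sketch of the v0.1 box comparison. tolerancePercent plays the role of
// "X%" on the slide; the page size and tolerance values are assumptions.
function boxesAreSimilar(boxA, boxB, pageWidth, pageHeight, tolerancePercent) {
  const thresholdWidth = pageWidth * (tolerancePercent / 100);
  const thresholdHeight = pageHeight * (tolerancePercent / 100);
  // Horizontal values are compared against the width threshold,
  // vertical values against the height threshold.
  const closeCoordinates =
    Math.abs(boxA.x - boxB.x) <= thresholdWidth &&
    Math.abs(boxA.y - boxB.y) <= thresholdHeight;
  const closeDimensions =
    Math.abs(boxA.width - boxB.width) <= thresholdWidth &&
    Math.abs(boxA.height - boxB.height) <= thresholdHeight;
  return closeCoordinates && closeDimensions;
}

// The two boxes from the slide.
const boxA = { x: 832.375, y: 84.296875, width: 16, height: 16 };
const boxB = { x: 868.390625, y: 81, width: 22, height: 22 };
const looseMatch = boxesAreSimilar(boxA, boxB, 1024, 768, 5);  // thresholds 51.2 and 38.4
const strictMatch = boxesAreSimilar(boxA, boxB, 1024, 768, 1); // thresholds 10.24 and 7.68
console.log(looseMatch, strictMatch); // true false
```

The two boxes sit about 36 pixels apart horizontally, so they match at a 5% tolerance but not at 1%, which shows how sensitive the result is to the chosen percentage.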
Jaccard Index
Finally, to calculate the similarity between two pages, we can
use the Jaccard Index:
J(A, B) = |A ∩ B| / |A ∪ B|
where A ∩ B (the intersection) is the set of similar elements on
the two pages, and A ∪ B (the union) is the set of all unique
elements from both pages combined.
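A minimal sketch of the page-level score, using the PAGE A and PAGE B boxes shown earlier. It assumes the intersection is counted by greedy one-to-one matching and that the union size is |A| + |B| minus the matches; both are assumptions, as are the names jaccardSimilarity and sameBox. The exact-equality predicate is used only to keep the example small; any box-level similarity test can be passed in instead.

```javascript
// Jaccard index over two sets of boxes: |A ∩ B| / |A ∪ B|.
// The intersection is counted via greedy one-to-one matching (an assumption);
// isSimilar is whatever box-level predicate is in use.
function jaccardSimilarity(boxesA, boxesB, isSimilar) {
  const unmatchedB = [...boxesB];
  let intersection = 0;
  for (const boxA of boxesA) {
    const matchIndex = unmatchedB.findIndex((boxB) => isSimilar(boxA, boxB));
    if (matchIndex !== -1) {
      unmatchedB.splice(matchIndex, 1); // each box can be matched only once
      intersection++;
    }
  }
  const union = boxesA.length + boxesB.length - intersection;
  // Two empty pages are treated as identical (a convention, not from the talk).
  return union === 0 ? 1 : intersection / union;
}

// Exact-equality predicate, used here only to keep the example small.
const sameBox = (a, b) =>
  a.x === b.x && a.y === b.y && a.width === b.width && a.height === b.height;

const pageA = [
  { x: 832.375, y: 84.296875, width: 16, height: 16 },
  { x: 797.734375, y: 83.296875, width: 30.640625, height: 17 },
  { x: 860.390625, y: 75, width: 78.890625, height: 36 },
  { x: 868.390625, y: 81, width: 22, height: 22 },
];
const pageB = [
  { x: 868.390625, y: 81, width: 22, height: 22 },
  { x: 898.390625, y: 83.28125, width: 32.890625, height: 17 },
  { x: 943.28125, y: 75, width: 60.71875, height: 36 },
  { x: 24, y: 181, width: 976, height: 128 },
];
const pageSimilarity = jaccardSimilarity(pageA, pageB, sameBox);
console.log(pageSimilarity); // 1 shared box out of 7 distinct boxes, about 0.143
```

Greedy matching is order-dependent when a tolerant predicate lets one box match several candidates, so a production version might need a more careful assignment step.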
Example #1
Visual Similarity = 1.0
Example #2
Visual Similarity = 0.4
Example #3
Visual Similarity = 0
Optimisations
After the v0.1 version, we improved the process by adding
multiple optimisations:
- Including only visible nodes
- Taking background colours into account
- Merging overlapping nodes
- Considering the z-index of nodes
- Using more performant data structures
…and many more.
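To illustrate the first optimisation in the list, here is a hedged sketch of a visibility filter. It assumes each box record carries a styles object built from the computed styles requested in the snapshot (display and visibility); that record shape and the name isVisible are assumptions, not the project's code.

```javascript
// Keep only boxes that would actually be painted. Each box is assumed to carry
// a `styles` object built from the computedStyles requested in the snapshot.
function isVisible(box) {
  const { display, visibility } = box.styles;
  return (
    display !== 'none' &&
    visibility !== 'hidden' &&
    visibility !== 'collapse' &&
    box.width > 0 &&
    box.height > 0
  );
}

// Illustrative boxes: one visible, one hidden via CSS, one with no area.
const candidateBoxes = [
  { x: 0, y: 0, width: 100, height: 40, styles: { display: 'block', visibility: 'visible' } },
  { x: 0, y: 40, width: 100, height: 40, styles: { display: 'block', visibility: 'hidden' } },
  { x: 0, y: 80, width: 0, height: 0, styles: { display: 'inline', visibility: 'visible' } },
];
const visibleBoxes = candidateBoxes.filter(isVisible);
console.log(visibleBoxes.length); // 1
```

Filtering before comparison also shrinks both sets, which reduces the number of box-to-box comparisons.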
We delivered this and…
While we can't share many details, the company was highly
satisfied with both the process and the outcomes.
The resulting text and visual similarity metrics were integrated
into the annual business goals as control metrics.
The maximum threshold of visual similarity in this case was
around 40%, but this may vary depending on the websites, the
type of pages, and the goals.
Other people came up with similar ideas!
While updating these slides, I found that a team of researchers
from the Harbin Institute of Technology (Harbin, China) and the
Cyberspace Security Research Center at Peng Cheng Laboratory
(Shenzhen, China) came up with a similar method for detecting
phishing websites.
Source: https://www.researchgate.net/publication/336377602_Algorithm_of_web_page_similarity_comparison_based_on_visual_block
Layout Tree, what else?
Additional use cases (1)
1:1 website migrations: Ensure pages are migrated correctly,
with visual similarity at 100% across the domains.
Additional use cases (2)
Above-the-Fold content analysis:
Verify that key content is visible and
prioritised above the fold.
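One way to approach this check with the same box data is to measure how much of an element falls inside the first viewport. The sketch below is an assumption-laden illustration: the name fractionAboveFold, the 1024x768 viewport, and the sample boxes other than the PAGE B hero-sized box are hypothetical.

```javascript
// Fraction of a box's area that lies within the first viewport ("above the fold").
function fractionAboveFold(box, viewportWidth, viewportHeight) {
  const visibleWidth = Math.max(
    0,
    Math.min(box.x + box.width, viewportWidth) - Math.max(box.x, 0)
  );
  const visibleHeight = Math.max(
    0,
    Math.min(box.y + box.height, viewportHeight) - Math.max(box.y, 0)
  );
  const area = box.width * box.height;
  return area === 0 ? 0 : (visibleWidth * visibleHeight) / area;
}

const heroBox = { x: 24, y: 181, width: 976, height: 128 };       // fully above the fold
const footerBox = { x: 0, y: 2400, width: 1024, height: 200 };    // fully below the fold
const straddlingBox = { x: 0, y: 700, width: 1024, height: 136 }; // cut in half by the fold

const heroFraction = fractionAboveFold(heroBox, 1024, 768);
const footerFraction = fractionAboveFold(footerBox, 1024, 768);
const straddlingFraction = fractionAboveFold(straddlingBox, 1024, 768);
console.log(heroFraction, footerFraction, straddlingFraction); // 1 0 0.5
```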
Additional use cases (3)
Intrusive interstitials and dialogs:
Identify whether web pages have intrusive
interstitials and dialogs that may
interfere with search engines'
understanding of the content.
Image source: https://developers.google.com/search/docs/appearance/avoid-intrusive-interstitials
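A rough heuristic for this use case, sketched under stated assumptions: flag boxes that are stacked above the page (positive z-index) and cover a large share of the viewport. The zIndex field, the function names, and the 50% coverage cut-off are all hypothetical; the source gives no numeric threshold.

```javascript
// Share of the viewport covered by a box, as a rough signal for overlays.
function viewportCoverage(box, viewportWidth, viewportHeight) {
  const overlapWidth = Math.max(
    0,
    Math.min(box.x + box.width, viewportWidth) - Math.max(box.x, 0)
  );
  const overlapHeight = Math.max(
    0,
    Math.min(box.y + box.height, viewportHeight) - Math.max(box.y, 0)
  );
  return (overlapWidth * overlapHeight) / (viewportWidth * viewportHeight);
}

// zIndex is assumed to be parsed from the computed 'z-index' style;
// minCoverage is a hypothetical cut-off, not a documented limit.
function findLikelyInterstitials(boxes, viewportWidth, viewportHeight, minCoverage) {
  return boxes.filter(
    (box) =>
      box.zIndex > 0 &&
      viewportCoverage(box, viewportWidth, viewportHeight) >= minCoverage
  );
}

const overlayCandidates = [
  { name: 'modal', x: 0, y: 0, width: 1024, height: 768, zIndex: 1000 },
  { name: 'cookie-banner', x: 0, y: 700, width: 1024, height: 68, zIndex: 10 },
  { name: 'article', x: 24, y: 181, width: 976, height: 128, zIndex: 0 },
];
const flaggedOverlays = findLikelyInterstitials(overlayCandidates, 1024, 768, 0.5);
console.log(flaggedOverlays.map((box) => box.name)); // [ 'modal' ]
```

A small banner at the bottom of the viewport stays below the cut-off, while a full-screen modal is flagged.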
Additional use cases (4)
Web Page element rendering verification: Confirm that web
pages rendered by search engines or other systems
successfully handle and position specific elements as intended,
detecting misalignments or unexpected behaviours.
We’re looking for experienced technical SEO
consultants.
If you’d like to discuss coming to work at Merj…
https://www.linkedin.com/in/ryansiddle/
MERJ Ltd,
7 Pancras Square,
London, N1C 4AG
+44 (0) 20 3322 2660 contact@merj.com merj.com
Thank you for your time and attention!
