Effective Unsupervised Matching of
Product Titles with k-Combinations
and Permutations
Leonidas Akritidis, Panayiotis Bozanis
Department of Electrical and Computer Engineering
University of Thessaly, Greece
L. Akritidis, P. Bozanis, IEEE INISTA 2018
The problem (1)
• We are given a set F = {f1, f2, …, fN} of product
feeds (usually in XML format).
• Each feed fi originates from an electronic store ei
and contains product records.
• Each product record p may contain multiple fields
(title, description, price, brand, category, etc).
• A product cannot appear more than once in the
same feed.
• But it may appear in multiple feeds.
The problem (2)
• A product may be described differently in these
feeds (i.e. it appears under different titles).
• E.g. “Apple iPhone 7” and “iPhone 7” are different
titles which refer to the same product.
• The problem: Match the product titles and
identify if they describe the same product.
• Useful for:
– Price comparison applications & platforms.
– Reviews merging & aggregation.
– Users who desire to compare characteristics & prices.
Similarity/Distance Metrics
• “Apple iPhone 7” and “iPhone 7” are different
titles which refer to the same product.
– Even though a whole word is missing from the second
title (low similarity, large distance).
• “Apple iPhone 7” and “Apple iPhone 6” are titles
which DO NOT refer to the same product.
– Even though they differ by only a single character
(high similarity, small distance).
• Similarity/Distance metrics (cosine, Jaccard, edit
distance, etc.) do not work well in this problem.
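This failure mode is easy to reproduce with edit distance. Below is a minimal Levenshtein implementation, shown purely for illustration (the function name and example titles are taken from the slide):

```python
def edit_distance(a, b):
    """Levenshtein distance via a single-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

# Same product, yet far apart:
print(edit_distance("Apple iPhone 7", "iPhone 7"))        # 6
# Different products, yet almost identical:
print(edit_distance("Apple iPhone 7", "Apple iPhone 6"))  # 1
```

The metric ranks the different-product pair as far more similar than the same-product pair, which is exactly why plain distance thresholds break down here.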
Supervised Clustering
• For the same reason, supervised machine
learning approaches (k-NN, naïve Bayes,
linear/logistic regression) also do not work well.
– Smaller distances/higher probabilities do not
necessarily mean two titles belong to the same entity.
– Larger distances/smaller probabilities do not
necessarily mean they belong to different entities.
State-of-the-art (1)
• V. Gopalakrishnan, S.P. Iyengar, A. Madaan, R.
Rastogi, S. Sengamedu. Matching product titles
using web-based enrichment. In Proceedings of
the 21st ACM International Conference on
Information and Knowledge Management (CIKM),
pp. 605-614, 2012.
• N. Londhe, V. Gopalakrishnan, A. Zhang, H.Q. Ngo,
R. Srihari. Matching titles with cross title web-
search enrichment and community detection. In
Proceedings of the VLDB Endowment, pp. 1167-
1178, 2014.
State-of-the-art (2)
• These approaches are similar:
– They enrich each product title by injecting several
missing words.
– They treat each word in the products’ titles differently,
i.e. each word is assigned an importance score.
– After these two preprocessing phases, they apply the
cosine similarity measure (with an overly simplistic
blocking method).
– They create clusters which consist of the same
products.
State-of-the-art - Disadvantages
• One query is submitted to a Web search engine per product:
– this approach is infeasible for large-scale datasets.
• In their experiments they use only 2 feeds.
– Most platforms include thousands of electronic stores
(i.e. product feeds).
• They employ the cosine similarity metric.
– which does not perform well in this problem.
Our approach is…
• Standalone: It does not rely on external data
sources (e.g. Web search engines, Web sites, etc.).
• Unsupervised: No requirement to manually train
a classifier, or to split the dataset into training and
testing subsets.
• Efficient: Faster than the competing approaches; it
makes use of in-memory data structures.
• Flexible: It facilitates product classification into
multiple clusters.
Overview (1)
• Our proposed method operates in 2 phases:
• Phase 1: construction of two primary data
structures:
– A lexicon which consists of all the k-
combinations of the titles’ words, along with a
frequency value and some statistics.
• Each k-combination is a candidate product cluster.
– A forward index: An array which stores for each
product, a list of pointers to the respective title
k-combinations (we use pointers to avoid saving
the same data twice).
Overview (2)
• Phase 2: We employ these
two data structures to
assign scores to each k-
combination of each
product.
• The k-combinations are
then sorted by decreasing
score value and the highest
scoring combination
represents the cluster.
k-combinations
• k-combinations are combinations of the
words of the product title.
• Length (number of words) = k.
• Without repetition.
• Without regard to word ordering.
• We compute the k-combinations of each
product title.
• Number of combinations for a title which
consists of n words:
K(n,k) = n! / (k! (n-k)!), 2 ≤ k ≤ n
• E.g. for a title of n = 6 words, K(6,2) = 15.
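These combinations can be enumerated directly in Python; the small helper below is illustrative (its name and the sample title are mine, not the paper's):

```python
from itertools import combinations
from math import comb

def k_combinations(title, k):
    """All k-word combinations of a title: no repetition, order ignored."""
    return list(combinations(title.split(), k))

pairs = k_combinations("Apple iPhone 7 32GB Black EU", 2)
print(len(pairs))   # 15 combinations, matching 6! / (2! * 4!)
```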
Phase 1 (1)
Data Structures - Lexicon
• We employ a lexicon structure L to store the
combinations. We also store two statistics:
• A frequency value which represents the number
of documents which contain this combination.
– Frequent combinations are more likely to be declared
cluster labels.
• A distance value which stores the average
distance of the combination from the beginning
of the titles.
– The most important terms in a product description
appear early in the titles.
Data Structures – Forward Index
• We also employ a forward index I which, for each
product p, stores a pointer to each of its combinations.
• We assign a score value to each combination in I.
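A rough sketch of the two structures is given below. The dict layout, field names, and the use of tuple keys in place of the slides' pointers are my assumptions, not the paper's exact implementation:

```python
from itertools import combinations
from collections import defaultdict

# Lexicon L: one entry per k-combination, holding its frequency and the
# running average of its distance from the beginning of the titles.
lexicon = defaultdict(lambda: {"freq": 0, "avg_dist": 0.0})
# Forward index I: for each product, the keys of its combinations in L
# (dict keys play the role of the pointers, so no data is stored twice).
forward_index = []

def index_title(title, k=2):
    words = title.split()
    keys = []
    for idx in combinations(range(len(words)), k):
        key = tuple(words[i] for i in idx)
        dist = sum(idx) / k      # mean word position within this title
        entry = lexicon[key]
        entry["avg_dist"] = (entry["avg_dist"] * entry["freq"] + dist) / (entry["freq"] + 1)
        entry["freq"] += 1
        keys.append(key)
    forward_index.append(keys)

for t in ["Apple iPhone 7", "Apple iPhone 7 EU"]:
    index_title(t)
```

After indexing, combinations shared by many titles accumulate high frequencies, while combinations built from late words accumulate large average distances.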
Distance
• Some frequent terms in the titles have no
informational value (i.e. they do not describe the
product itself, but refer to offers, specs, etc.).
– E.g. many products have in their titles the terms “EU”, “OEM”,
“Retail”, etc.
– Therefore, in some cases we get wrong cluster labels, e.g.
“Apple iPhone EU”.
– Similar problems can also be caused by other words: colors
(black, white, red, etc), sizes (large, small, etc) and others.
• Key observation: These terms usually appear
late in the title (i.e. at high positions).
Phase 1 (2)
Permutations (3)
• In case a combination is not found in the lexicon,
we compute all its permutations.
• We search for each permutation in the lexicon.
• In case it is found, we increase the frequency of
the corresponding combination and we stop
searching.
• In case no permutation is found, we do not insert it
immediately.
• We insert the original combination instead, after
all of its permutations have been examined.
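The permutation-merging step above can be sketched as follows; the dict-based lexicon with plain counts is a simplification of the full structure, introduced only for illustration:

```python
from itertools import permutations

def add_combination(lexicon, combo):
    """Insert a combination, merging it with any previously stored
    ordering of the same words (the permutation search above)."""
    if combo in lexicon:                  # exact ordering already stored
        lexicon[combo] += 1
        return combo
    for perm in permutations(combo):      # try every other word ordering
        if perm in lexicon:
            lexicon[perm] += 1            # found one: count it there, stop
            return perm
    lexicon[combo] = 1                    # no ordering found: insert as new
    return combo

lex = {}
add_combination(lex, ("iPhone", "Apple"))
merged = add_combination(lex, ("Apple", "iPhone"))  # merges into first entry
```

This keeps one lexicon entry per unordered word set, so differently ordered titles still contribute to the same candidate cluster.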
Phase 1 (3)
Phase 2
• In phase 2 we compute the scores of each k-
combination of each product.
• To achieve this goal we use the forward index.
• We sort each product's list in the forward index in
decreasing score order.
• The first element of the sorted list becomes the
product's cluster label.
An indicative score function
• Score function:
S(c) = ( l(c) / (a · d(c,t)) ) · log N(c)
where l(c) is the length of the combination/label, N(c)
is the frequency, d(c,t) is the average distance of
the combination from the beginning of the string,
and a is a tuning constant.
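Phase 2 with this score can be sketched as below. The value of the constant a and the lexicon entry layout are my assumptions; the example frequencies are made up to mirror the "Apple iPhone EU" anecdote:

```python
from math import log

def score(combo, lexicon, a=2.0):
    """S(c) = l(c) / (a * d(c,t)) * log N(c): longer, more frequent
    combinations that appear early in the titles score higher."""
    entry = lexicon[combo]
    return len(combo) / (a * entry["avg_dist"]) * log(entry["freq"])

def cluster_label(combos, lexicon):
    """Phase 2: rank a product's combinations by decreasing score and
    return the top-scoring one as the cluster label."""
    return max(combos, key=lambda c: score(c, lexicon))

lex = {
    ("Apple", "iPhone"): {"freq": 50, "avg_dist": 0.5},
    ("iPhone", "EU"):    {"freq": 10, "avg_dist": 2.5},
}
label = cluster_label(list(lex), lex)
print(label)   # ('Apple', 'iPhone')
```

Even though "EU" is frequent, its large average distance from the start of the titles suppresses its score, so the informative combination wins.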
Results
• We deployed a focused crawler on skroutz.gr and we
collected 16,208 products (mobile phones) classified
into 922 clusters.
• Vendors: 320
• Average number of words in a title: 9
• We consider the classification of skroutz.gr as the
ground truth and we compare the effectiveness of
our algorithm (UMaP) against this.
Effectiveness – F1 measure
Efficiency