Effective Unsupervised Matching of
Product Titles with k-Combinations
and Permutations
Leonidas Akritidis, Panayiotis Bozanis
Department of Electrical and Computer Engineering
University of Thessaly, Greece
L. Akritidis, P. Bozanis, IEEE INISTA 2018
The problem (1)
• We are given a set F = {f1, f2, …, fN} of product
feeds (usually in XML format).
• Each feed fi originates from an electronic store ei
and contains product records.
• Each product record p may contain multiple fields
(title, description, price, brand, category, etc).
• A product cannot appear more than once in the
same feed.
• But it may appear in multiple feeds.
The problem (2)
• A product may be described differently in these
feeds (i.e. it appears under different titles).
• E.g. “Apple iPhone 7” and “iPhone 7” are different
titles which refer to the same product.
• The problem: Match the product titles and
identify if they describe the same product.
• Useful for:
– Price comparison applications & platforms.
– Reviews merging & aggregation.
– Users who desire to compare characteristics & prices.
Similarity/Distance Metrics
• “Apple iPhone 7” and “iPhone 7” are different
titles which refer to the same product.
– Even though a whole word is missing from the second
title (low similarity, large distance).
• “Apple iPhone 7” and “Apple iPhone 6” are titles
which DO NOT refer to the same product.
– Even though they differ by only a single character
(high similarity, small distance).
• Similarity/Distance metrics (cosine, Jaccard, edit
distance, etc.) do not work well in this problem.
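This failure mode is easy to reproduce with edit distance. Below is a minimal Levenshtein implementation, shown purely for illustration (the function name and example titles are taken from the slide):

```python
def edit_distance(a, b):
    """Levenshtein distance via a single-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

# Same product, yet far apart:
print(edit_distance("Apple iPhone 7", "iPhone 7"))        # 6
# Different products, yet almost identical:
print(edit_distance("Apple iPhone 7", "Apple iPhone 6"))  # 1
```

The metric ranks the different-product pair as far more similar than the same-product pair, which is exactly why plain distance thresholds break down here.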
Supervised Clustering
• For the same reason, supervised machine
learning approaches (k-NN, naïve Bayes,
linear/logistic regression) also do not work well.
– Smaller distances/higher probabilities do not
necessarily mean two titles belong to the same entity.
– Larger distances/smaller probabilities do not
necessarily mean they belong to different entities.
State-of-the-art (1)
• V. Gopalakrishnan, S.P. Iyengar, A. Madaan, R.
Rastogi, S. Sengamedu. Matching product titles
using web-based enrichment. In Proceedings of
the 21st ACM International Conference on
Information and Knowledge Management (CIKM),
pp. 605-614, 2012.
• N. Londhe, V. Gopalakrishnan, A. Zhang, H.Q. Ngo,
R. Srihari. Matching titles with cross title web-
search enrichment and community detection. In
Proceedings of the VLDB Endowment, pp. 1167-
1178, 2014.
State-of-the-art (2)
• These approaches are similar:
– They enrich each product title by injecting several
missing words.
– They treat each word in the products’ titles differently,
i.e. each word is assigned an importance score.
– After these two preprocessing phases, they apply the
cosine similarity measure (with an overly simplistic
blocking method).
– They create clusters which consist of the same
products.
State-of-the-art - Disadvantages
• One query is submitted to a Web search engine per product:
– this approach is infeasible for large-scale datasets.
• In their experiments they use only 2 feeds.
– Most platforms include thousands of electronic stores
(i.e. product feeds).
• They employ the cosine similarity metric.
– which does not perform well in this problem.
Our approach is…
• Standalone: It does not rely on external data
sources (e.g. Web search engines, Web sites, etc.).
• Unsupervised: No requirement to manually train
a classifier, or to split the dataset into training and
testing subsets.
• Efficient: Faster than the competing approaches; it
makes use of in-memory data structures.
• Flexible: It facilitates product classification into
multiple clusters.
Overview (1)
• Our proposed method operates in 2 phases:
• Phase 1: construction of two primary data
structures:
– A lexicon which consists of all the k-
combinations of the titles’ words, along with a
frequency value and some statistics.
• Each k-combination is a candidate product cluster.
– A forward index: An array which stores for each
product, a list of pointers to the respective title
k-combinations (we use pointers to avoid saving
the same data twice).
Overview (2)
• Phase 2: We employ these
two data structures to
assign scores to each k-
combination of each
product.
• The k-combinations are
then sorted by decreasing
score value and the highest
scoring combination
represents the cluster.
k-combinations
• k-combinations are combinations of the
words of the product title.
• Length (number of words) = k.
• Without repetition.
• Without regard to word ordering.
• We compute the k-combinations of each
product title.
• Number of combinations for a title which
consists of n words:
K(n,k) = n! / (k! (n-k)!), 2 ≤ k ≤ n
• E.g. for a title of n = 6 words, K(6,2) = 15.
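These combinations can be enumerated directly in Python; the small helper below is illustrative (its name and the sample title are mine, not the paper's):

```python
from itertools import combinations
from math import comb

def k_combinations(title, k):
    """All k-word combinations of a title: no repetition, order ignored."""
    return list(combinations(title.split(), k))

pairs = k_combinations("Apple iPhone 7 32GB Black EU", 2)
print(len(pairs))   # 15 combinations, matching 6! / (2! * 4!)
```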
Phase 1 (1)
Data Structures - Lexicon
• We employ a lexicon structure L to store the
combinations. We also store two statistics:
• A frequency value which represents the number
of documents which contain this combination.
– Frequent combinations are more likely to be declared
cluster labels.
• A distance value which stores the average
distance of the combination from the beginning
of the titles.
– The most important terms in a product description
appear early in the titles.
Data Structures – Forward Index
• We also employ a forward index I which, for each
product p, stores a pointer to each of its combinations.
• We assign a score value to each combination in I.
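A rough sketch of the two structures is given below. The dict layout, field names, and the use of tuple keys in place of the slides' pointers are my assumptions, not the paper's exact implementation:

```python
from itertools import combinations
from collections import defaultdict

# Lexicon L: one entry per k-combination, holding its frequency and the
# running average of its distance from the beginning of the titles.
lexicon = defaultdict(lambda: {"freq": 0, "avg_dist": 0.0})
# Forward index I: for each product, the keys of its combinations in L
# (dict keys play the role of the pointers, so no data is stored twice).
forward_index = []

def index_title(title, k=2):
    words = title.split()
    keys = []
    for idx in combinations(range(len(words)), k):
        key = tuple(words[i] for i in idx)
        dist = sum(idx) / k      # mean word position within this title
        entry = lexicon[key]
        entry["avg_dist"] = (entry["avg_dist"] * entry["freq"] + dist) / (entry["freq"] + 1)
        entry["freq"] += 1
        keys.append(key)
    forward_index.append(keys)

for t in ["Apple iPhone 7", "Apple iPhone 7 EU"]:
    index_title(t)
```

After indexing, combinations shared by many titles accumulate high frequencies, while combinations built from late words accumulate large average distances.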
Distance
• Some frequent terms in the titles have no
informational value (i.e. they do not describe the
product itself, but refer to offers, specs, etc.).
– E.g. many products have in their titles the terms “EU”, “OEM”,
“Retail”, etc.
– Therefore, in some cases we get wrong cluster labels, e.g.
“Apple iPhone EU”.
– Similar problems can also be caused by other words: colors
(black, white, red, etc), sizes (large, small, etc) and others.
• Key observation: These terms usually appear
late in the title (i.e. at high positions).
Phase 1 (2)
Permutations (3)
• In case a combination is not found in the lexicon,
we compute all its permutations.
• We search for each permutation in the lexicon.
• In case it is found, we increase the frequency of
the corresponding combination and we stop
searching.
• In case no permutation is found, we do not insert it
immediately.
• We insert the original combination instead, after
all of its permutations have been examined.
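The permutation-merging step above can be sketched as follows; the dict-based lexicon with plain counts is a simplification of the full structure, introduced only for illustration:

```python
from itertools import permutations

def add_combination(lexicon, combo):
    """Insert a combination, merging it with any previously stored
    ordering of the same words (the permutation search above)."""
    if combo in lexicon:                  # exact ordering already stored
        lexicon[combo] += 1
        return combo
    for perm in permutations(combo):      # try every other word ordering
        if perm in lexicon:
            lexicon[perm] += 1            # found one: count it there, stop
            return perm
    lexicon[combo] = 1                    # no ordering found: insert as new
    return combo

lex = {}
add_combination(lex, ("iPhone", "Apple"))
merged = add_combination(lex, ("Apple", "iPhone"))  # merges into first entry
```

This keeps one lexicon entry per unordered word set, so differently ordered titles still contribute to the same candidate cluster.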
Phase 1 (3)
Phase 2
• In phase 2 we compute the scores of each k-
combination of each product.
• To achieve this goal we use the forward index.
• We sort each product's list in the forward index in
decreasing score order.
• The first element of the sorted list becomes the
product's cluster label.
An indicative score function
• Score function:
S(c) = ( l(c) / (a · d(c,t)) ) · log N(c)
where l(c) is the length of the combination/label, N(c)
is the frequency, d(c,t) is the average distance of
the combination from the beginning of the string,
and a is a tuning constant.
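Phase 2 with this score can be sketched as below. The value of the constant a and the lexicon entry layout are my assumptions; the example frequencies are made up to mirror the "Apple iPhone EU" anecdote:

```python
from math import log

def score(combo, lexicon, a=2.0):
    """S(c) = l(c) / (a * d(c,t)) * log N(c): longer, more frequent
    combinations that appear early in the titles score higher."""
    entry = lexicon[combo]
    return len(combo) / (a * entry["avg_dist"]) * log(entry["freq"])

def cluster_label(combos, lexicon):
    """Phase 2: rank a product's combinations by decreasing score and
    return the top-scoring one as the cluster label."""
    return max(combos, key=lambda c: score(c, lexicon))

lex = {
    ("Apple", "iPhone"): {"freq": 50, "avg_dist": 0.5},
    ("iPhone", "EU"):    {"freq": 10, "avg_dist": 2.5},
}
label = cluster_label(list(lex), lex)
print(label)   # ('Apple', 'iPhone')
```

Even though "EU" is frequent, its large average distance from the start of the titles suppresses its score, so the informative combination wins.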
Results
• We deployed a focused crawler on skroutz.gr and we
collected 16,208 products (mobile phones) classified
into 922 clusters.
• Vendors: 320
• Average number of words in a title: 9
• We consider the classification of skroutz.gr as the
ground truth and we compare the effectiveness of
our algorithm (UMaP) against this.
Effectiveness – F1 measure
Efficiency