Mining competitors from large unstructured data

Presented by: Tejas Salunkhe
Guided by: Mrs. Sushila Aghav
MIT WPU, Department of Computer Science and Engineering, Pune
7/24/2020

Contents
• Introduction
• Motivation
• Objectives
• Literature review
• Comparative study
• Research gap
• Problem Statement
• Data-sets
• System Architecture
• Algorithms
• Implementation Results
• Conclusion and Future scope
• References
• Publications

Introduction
• A large number of competitors compete for market share
• Users are often unsure what to buy, where to buy it, and which service to choose
• Mining competitors gives users an immediate answer about which service to choose
• It links what the user wants to what the company offers

Motivation
• Users often struggle to choose a product from a large group of products
• Businesses strive to deliver the right product to the right set of customers
• The main motivation is to find the right balance between the user and the business, so that the user gets the right set of services or products and the business gets the right set of customers

Literature Review
1. Trust-aware recommendation in social networks
• Authors: Shuiguang Deng, Longtao Huang, Guandong Xu, Xindong Wu
• Description: The researchers implemented a trust-aware recommendation approach called TRA.
• Advantage: TRA can identify trusted reviews, which makes it useful for finding the right reviews from verified sources.
• Limitation: Obtaining the data is cumbersome and requires numerous computations, so the approach fails to provide solid solutions on large datasets.
2. Mining competitors
• Authors: George Valkanas, Theodoros Lappas, Dimitrios Gunopulos
• Description: The researchers described various ways of finding the top competitors.
• Advantage: Describes various ways of mining competitors across various datasets.
• Limitation: They naively compute the competitiveness of every single item in the corpus with respect to the target item.

Literature Review
3. User-service rating prediction by exploring social users' rating behaviors
• Authors: Guoshuai Zhao, Xueming Qian, Xing Xie
• Description: The paper proposes a user-service rating prediction approach that explores users' rating behaviors using four social network factors: user personal interest (related to the user and the item's topics), interpersonal interest similarity (related to user interest), interpersonal rating behavior similarity (related to users' rating habits), and interpersonal rating behavior diffusion (related to users' behavior diffusions).
• Advantage: Designed an approach that automatically provides user-service rating predictions.
• Limitation: More factors could be considered in the analysis.
4. Understanding short texts by harvesting and analyzing semantic knowledge
• Authors: Wen Hua, Zhongyuan Wang, Haixun Wang, Kai Zheng, Xiaofang Zhou
• Description: The authors propose a generalized framework to understand short texts effectively and efficiently. They divide the task of short text understanding into three subtasks: text segmentation, type detection, and concept labeling.
• Advantage: An algorithm that automatically learns the best way to understand short texts.
• Limitation: Fails to analyze text that does not appear in the data dictionary.

Research Gap
• A formal definition of competitiveness between two items had not previously been addressed and validated, either quantitatively or qualitatively
• The results obtained via data mining are very hard to understand
• Existing formalizations are not applicable across domains

Problem Statement
• In any business, success depends on the ability to make an item more appealing to customers than the competitors' items. Customers, in turn, struggle to find the right set of products for their requirements, and often end up buying a product that they do not need or that does not fulfill their needs. Data mining techniques can be used to improve the user experience and may also be beneficial from a business point of view.

Data Set
1. We use a hotel dataset with about 13 parameters:
• Name, Address, Latitude, Longitude, Facility, Hotel Stars, Vendor Id, Reviews, User Id, User Email, User Rating, Vendor Email
2. How did we extract this dataset?
• The dataset was scraped from websites such as TripAdvisor, MakeMyTrip, and Trivago

Data Set
• Tools such as Octoparse and Data Miner were used for data scraping
• We scraped 1200 hotel records containing the fields listed above (except Vendor Id, Vendor Email, User Id, User Email, and User Rating)
• The captured latitude and longitude were used to show each hotel's location relative to the user's current location
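The slides do not say how a hotel's position was related to the user's position. One common option is the great-circle (haversine) distance between the two latitude/longitude pairs. The Java sketch below is an illustration under that assumption, not the project's code; the class name GeoDistance and the method haversineKm are hypothetical.

```java
// Hypothetical helper: the slides do not show how distance was computed.
class GeoDistance {
    // Mean Earth radius in kilometres.
    static final double EARTH_RADIUS_KM = 6371.0;

    /** Great-circle distance in kilometres between two points given in degrees. */
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }
}
```

For example, GeoDistance.haversineKm(18.5204, 73.8567, 19.0760, 72.8777), using approximate coordinates for Pune and Mumbai, gives roughly 120 km. Hotels could then be sorted by this value when the user asks for recommendations by location.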

System Architecture
[Figure: System architecture for mining competitors]

Algorithm
Input: I = user requirements
Output: P = top K competitors for the given user requirements
Begin
1. Get the requirements from the user, along with the keyset of requirements
2. Get the value of K from the user, to know exactly how many recommendations the user needs for the given set of data
3. After receiving K, ask how the recommendations should be ranked: by ratings, reviews, or location
4. Take the requirements and apply the CMiner++ algorithm to the database

Algorithm
5. Map the requirements given by the user to the records in the database using CMiner++
6. Return the list of top K competitors
End
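The final step, selecting the top K items, can be sketched as follows. This is the naive baseline that scores every candidate, not CMiner++ itself, whose pruning the slides do not spell out. The class name NaiveTopK and the score function are hypothetical; any competitiveness measure could be passed in as score.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.ToDoubleFunction;

// Hypothetical naive baseline: scores every candidate, keeps the best k.
class NaiveTopK {
    /** Returns the k candidates with the highest score, best first. */
    static <T> List<T> topK(List<T> candidates, int k, ToDoubleFunction<T> score) {
        if (k <= 0) {
            return new ArrayList<>();
        }
        Comparator<T> byScore = Comparator.comparingDouble(score);
        // Min-heap on score: the weakest of the current top k sits at the head.
        PriorityQueue<T> heap = new PriorityQueue<>(byScore);
        for (T candidate : candidates) {
            heap.add(candidate);
            if (heap.size() > k) {
                heap.poll(); // evict the current weakest
            }
        }
        List<T> result = new ArrayList<>(heap);
        result.sort(byScore.reversed());
        return result;
    }
}
```

Keeping a heap of at most k entries means each candidate costs O(log k) to process, but every candidate is still scored once, which is exactly the cost that pruning-based algorithms try to avoid.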

Hardware and Software Requirements
1. Software requirements:
• Technology: Java
• Tools: Eclipse Luna, Octoparse, Data Miner
• Operating system: Windows 10
• Server: Apache Tomcat 8.0
• Database: MySQL 5.0
2. Hardware requirements:
• Hard disk: 1 TB
• RAM: 8 GB
• Processor: Intel Core i5 or above

Defining Competitiveness
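Hotels and the features they offer:
Name | Bar | Breakfast | Gym | Parking | Pool | Wifi
Hilton | Yes | No | Yes | Yes | Yes | Yes
Marriott | Yes | Yes | No | Yes | Yes | Yes
Westin | No | Yes | Yes | Yes | No | Yes

Customer segments (queries), their sizes, and the features they require:
ID | Size | Features
q1 | 100 | (parking, wifi)
q2 | 50 | (parking)
q3 | 60 | (wifi)
q4 | 120 | (gym, wifi)
q5 | 250 | (breakfast, parking)
q6 | 80 | (gym, bar, breakfast)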

Defining Competitiveness
Hotel Pairs | Common Segments | Common %
Hilton, Marriott | (q1, q2, q3) | 32%
Hilton, Westin | (q1, q2, q3, q4) | 50%
Marriott, Westin | (q1, q2, q3, q5) | 70%
Observations:
• Hilton and Marriott show the lowest competitiveness, even though they are the most similar pair by features (they agree on four of the six)
• This shows that similarity is not a good proxy for competitiveness
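The Common % column can be reproduced from the hotel and segment tables on the previous slide. The following is a minimal Java sketch, not the project's code; the class name SegmentOverlap and the method sharedFraction are hypothetical, and it assumes a segment counts as common when both hotels offer every feature the segment requires.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration using the hotel and segment tables from the slides.
class SegmentOverlap {
    // Features offered by each hotel.
    static final Map<String, Set<String>> HOTELS = new LinkedHashMap<>();
    // Required features and sizes of segments q1..q6, in order.
    static final String[][] QUERY_FEATURES = {
        {"parking", "wifi"}, {"parking"}, {"wifi"},
        {"gym", "wifi"}, {"breakfast", "parking"}, {"gym", "bar", "breakfast"}
    };
    static final int[] QUERY_SIZES = {100, 50, 60, 120, 250, 80};

    static {
        HOTELS.put("Hilton", setOf("bar", "gym", "parking", "pool", "wifi"));
        HOTELS.put("Marriott", setOf("bar", "breakfast", "parking", "pool", "wifi"));
        HOTELS.put("Westin", setOf("breakfast", "gym", "parking", "wifi"));
    }

    static Set<String> setOf(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    /** Fraction of all users whose segment is fully covered by both hotels. */
    static double sharedFraction(String hotelA, String hotelB) {
        Set<String> a = HOTELS.get(hotelA);
        Set<String> b = HOTELS.get(hotelB);
        int total = 0;
        int shared = 0;
        for (int q = 0; q < QUERY_SIZES.length; q++) {
            total += QUERY_SIZES[q];
            List<String> required = Arrays.asList(QUERY_FEATURES[q]);
            if (a.containsAll(required) && b.containsAll(required)) {
                shared += QUERY_SIZES[q];
            }
        }
        return (double) shared / total;
    }
}
```

With the 660 users in the six segments, the three pairs share 210, 330, and 460 users respectively, which rounds to the 32%, 50%, and 70% shown in the table.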

Defining Competitiveness
We define the competitiveness between items i and j in a market with feature set F as follows:

$$C_F(i,j) = \sum_{q \in 2^F} p(q) \times V^q_{i,j}$$

• $C_F(i,j)$: the probability that the two items are both included in the consideration set of a random user
• $p(q)$: the percentage of users represented by query q
• $V^q_{i,j}$: the pairwise coverage of query q by items i and j
• If a random user shows interest in item i, then that user is also most likely to be interested in the items j with the highest $C_F(i,j)$ values
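As a worked instance, assume all features are binary, so that the pairwise coverage of a query is 1 when both hotels offer every feature in the query and 0 otherwise. Under that assumption the definition reproduces the Common % column of the hotel example, where the six segments total 100 + 50 + 60 + 120 + 250 + 80 = 660 users:

```latex
\begin{align*}
C_F(\text{Hilton}, \text{Marriott}) &= p(q_1) + p(q_2) + p(q_3) = \frac{100 + 50 + 60}{660} \approx 0.32 \\
C_F(\text{Hilton}, \text{Westin}) &= p(q_1) + p(q_2) + p(q_3) + p(q_4) = \frac{330}{660} = 0.50 \\
C_F(\text{Marriott}, \text{Westin}) &= p(q_1) + p(q_2) + p(q_3) + p(q_5) = \frac{460}{660} \approx 0.70
\end{align*}
```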

Pairwise Coverage
• The pairwise coverage $V^f_{i,j}$ of a feature f is the percentage of all possible values of f that are covered by both i and j
• The pairwise coverage $V^q_{i,j}$ of a query q is built from the per-feature coverages of the features in q
• Let us consider the pairwise coverage for:
1. Binary and categorical features
2. Numeric features
3. Ordinal features

Binary and Categorical Features
• A categorical feature takes one or more values from a finite space
• Single-value features include, for example, the brand of a camera or the location of a hotel
• Multi-value features include, for example, the amenities offered by a hotel
• Any categorical feature can be encoded as a set of binary features, with each binary feature indicating the coverage, or lack of coverage, of one of the original feature's values
• A binary feature is either fully covered by both items, when f[i] = f[j] = 1 (equivalently f[i] * f[j] = 1), or not covered at all

Binary and Categorical Features
• Binary features equation:

$$V^f_{i,j} = f[i] \times f[j]$$


Numeric Features
• A numeric feature takes values from a predefined range
• Here numeric features take values in the range [0,1], with higher values being preferred
• Example: consider two hotels i and j with values 0.5 and 0.8 for the feature food quality. Their pairwise coverage is 0.5. Conceptually, they compete for all customers who accept a food quality of at most 0.5, while customers with a higher food quality requirement would ignore i as an option.

Numeric Features
• Numeric features equation:

$$V^f_{i,j} = \min(f[i], f[j])$$


Ordinal Features
• An ordinal feature takes values from a finite ordered list
• A characteristic example is the popular 5-star scale used to evaluate the quality of a service or product
• Consider two hotels rated 2 and 3 stars. A customer who demands at least 4 stars would consider neither hotel, while a customer who requires at least 3 stars would consider only the second hotel
• As with numeric features, the pairwise coverage is based on the lowest value among the competing hotels

Ordinal Features
• If two items compete for 2 out of the 5 levels of the ordinal scale, their competitiveness is proportional to 2/5 = 0.4
• The pairwise coverage for ordinal features is given by:

$$V^f_{i,j} = \frac{\min(f[i], f[j])}{|V^f|}$$

where $|V^f|$ is the number of levels on the ordinal scale of f.
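The three pairwise coverage rules (binary, numeric, and ordinal) can be written side by side. This is a small Java sketch for illustration only; the class name PairwiseCoverage and its method names are hypothetical, binary values are assumed to be 0 or 1, and ordinal values are assumed to be level numbers starting at 1.

```java
// Hypothetical illustration of the three per-feature coverage rules.
class PairwiseCoverage {
    /** Binary feature: 1 only when both items offer the feature (values 0 or 1). */
    static double binary(int fi, int fj) {
        return fi * fj;
    }

    /** Numeric feature in [0, 1]: the pair covers values up to the weaker item. */
    static double numeric(double fi, double fj) {
        return Math.min(fi, fj);
    }

    /** Ordinal feature with the given number of ordered levels (5 for a 5-star scale). */
    static double ordinal(int fi, int fj, int levels) {
        if (levels <= 0) {
            throw new IllegalArgumentException("levels must be positive");
        }
        return (double) Math.min(fi, fj) / levels;
    }
}
```

With the values used on the slides, numeric(0.5, 0.8) gives 0.5 for the food quality example, and ordinal(2, 3, 5) gives 0.4 for the two hotels rated 2 and 3 stars.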

Extending the Competitiveness Definition
• Feature uniformity:
1. The coverage definitions treat all values of a feature alike, but the users demanding quality in [0, 0.1] might differ from those demanding a value in [0.4, 0.5]
• Feature importance:
1. A common assumption in our research is that all features in a query have equal importance unless specified otherwise
2. However, a user who submits the query q = (f1, f2) may care more about f1 than about f2

Computational Analysis
1. Naive
• Advantage: Provides a consistent computation time regardless of k
• Disadvantage: Naively computes the competitiveness of every single item in the corpus with respect to the target item
2. GMiner
• Advantage: Performs well for datasets with distinct queries
• Disadvantage: Time-consuming, although it performs better than Naive
3. CMiner
• Advantage: Performs well for datasets with popular queries
• Disadvantage: Requires a large number of computations for larger values of k
Computational Analysis
4. CMiner++
• Advantage: Provides improved results as k increases, owing to the pruning feature of CMiner++
• Disadvantage: Results could improve further if the discarding and evaluation of candidates were improved

Summary of Comparative Study
• Naive does not match the computational performance of GMiner, CMiner, and CMiner++
• Ranking by performance: Naive < GMiner < CMiner < CMiner++
• CMiner++ provides improved pruning, which yields better-quality results as k increases

Computations
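Algorithm | Execution Time | Number of Competitors
CMiner | 0.004 | 3
CMiner | 0.04 | 10
CMiner | 0.6 | 150
CMiner | 1.2 | 300
CMiner++ | 0.003 | 3
CMiner++ | 0.03 | 10
CMiner++ | 0.45 | 150
CMiner++ | 0.9 | 300
GMiner | 0.005 | 3
GMiner | 0.05 | 10
GMiner | 0.75 | 150
GMiner | 1.5 | 300
Naive | 0.25 | 463
Sum of Number of Competitors: 463 per algorithm, grand total 1852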

Accuracy
Algorithm | Accuracy
Naive | 60%
GMiner | 72%
CMiner | 83%
CMiner++ | 87%

Accuracy
[Figure: accuracy chart for CMiner, CMiner++, GMiner, and Naive, with the accuracy axis running from 0 to 1]

Contributions
• Pruning efficiency
• Reduced the number of considered queries
• Used tools such as Octoparse and Data Miner for data scraping
• Boosted the CMiner algorithm by adding query ordering and the UPDATETOPK() and GETSLAVES() routines

Conclusion and Future Scope
• A formal definition of competitiveness between two items is validated
• The formalization is applicable across a large number of domains
• We addressed the computationally challenging problem of finding the top K competitors of a given item
• The evaluation revealed that even a small number of reviews is sufficient to estimate the given market and to yield interesting observations

Conclusion and Future Scope
• The evaluations carried out can be extended to other domains, which is considered future scope

References
• Valkanas, George, Theodoros Lappas and Dimitrios Gunopulos. “Mining Competitors from Large Unstructured Datasets.” IEEE Transactions on Knowledge and Data Engineering, 2017.
• Deng, Shuiguang, Longtao Huang, Guandong Xu, Xindong Wu and Zhaohui Wu. “On Deep Learning for Trust-Aware Recommendations in Social Networks.” IEEE Transactions on Neural Networks and Learning Systems 28 (2017): 1164-1177.
• Kong, Qingchao, Wenji Mao, Guandan Chen and Daniel Zeng. “Exploring Trends and Patterns of Popularity Stage Evolution in Social Media.” IEEE Transactions on Knowledge and Data Engineering, 2018.

References
• Hua, Wen, Zhongyuan Wang, Haixun Wang, Kai Zheng and Xiaofang Zhou. “Understand Short Texts by Harvesting and Analyzing Semantic Knowledge.” IEEE Transactions on Knowledge and Data Engineering 29 (2017): 499-512.
• Zhao, Guoshuai, Xueming Qian and Xing Xie. “User-Service Rating Prediction by Exploring Social Users' Rating Behaviors.” IEEE Transactions on Multimedia 18 (2016): 496-506.

Publications
• Survey paper:
Paper title: MINING SOCIAL NETWORKS FOR BUSINESS COMPETITION ANALYSIS
Journal: Asian Journal For Convergence In Technology (AJCT), 2019
• Implementation paper:
Paper title: MINING COMPETITORS FROM STRUCTURED/UNSTRUCTURED DATA
Journal: International Journal of Scientific and Engineering Research (IJSER) (ISSN 2229-5518)

THANK YOU