Ability Study of Proximity Measure for Big Data Mining Context on Clustering
Kamlesh Kumar Pandey
Research Scholar
Dept. of Computer Science & Applications
Dr. HariSingh Gour Vishwavidyalaya (A Central University), Sagar, M.P.
E-mail: kamleshamk@gmail.com
2nd International Conference on Communication and Computational Technologies
(Paper ID: 16)
Paper Presentation
Content
• Objectives
• Big Data
• Big Data Mining
• Proximity Measures Enabled Clustering Taxonomy
• Proximity Measure Taxonomy
• Analysis of Proximity Measure for Big Data Mining
Objectives
• The objective of this study is to identify proximity measures for big data
clustering with respect to volume, variety, and velocity, and to present how
clusters are created with the help of a proximity measure under the partitioning,
hierarchical, density, grid, model, fuzzy, and graph-based clustering taxonomies.
Big Data
• At present, technology is growing very fast. Organizations, industries, and individuals
are moving toward the Internet of Things, cloud computing, wireless sensor networks, social
media, and the Internet. These sources generate data that grow by terabytes or petabytes
every second, minute, or hour.
• Diebold (2000) was among the first authors to discuss the term Big Data in a research
paper. These early authors took Big Data to mean any data set larger than a gigabyte.
• Doug Laney (2001) was the first to give a proper definition of Big Data. He identified
three characteristics, Volume, Variety, and Velocity, known as the 3 V's of Big Data
management. If traditional data meet at least two of these basic characteristics at a
time, the data come under Big Data.
• Gartner (2012): "Big data is high-volume, high-velocity and high-variety information
assets that demand cost-effective, innovative forms of information processing for
enhanced insight and decision making."
Big Data V's
• At present, seven V's are used to characterize Big Data. The first three, Volume,
Variety, and Velocity, are the main characteristics of big data. The remaining four,
Veracity, Variability, Value, and Visualization, depend on the organization and application.
Big Data Mining
• Big Data Mining is the process of fetching requested information, uncovering
hidden relationships or patterns, and extracting needed information or
knowledge from datasets that meet the three V's of Big Data with higher
complexity.
Clustering
• Clustering is one of the approaches for analyzing and discovering complex
relations, patterns, and structure in the form of underlying groups of
unlabeled objects. From the Big Data perspective, a clustering algorithm must
deal with high volume, high variety, and high velocity in a scalable way.
Clustering Taxonomy
• Partitioning-based Clustering: These methods construct clusters around centers for a chosen
number k of clusters. A proximity measure is used to find the nearest cluster center during
cluster creation.
• Hierarchical-based Clustering: In this approach, large data are organized in a hierarchical
manner through proximity, which makes relationships between data points easy to detect.
• Density-based Clustering: These methods scan spatial databases and use probability-
distribution-based distance measures and distance measurements to label points as core,
border, or noise for density clusters.
• Grid-based Clustering: These algorithms split the data space into a grid structure,
calculate the cell densities, and merge grid cells into clusters with the help of a
distance measure.
• Model-based Clustering: These methods fit the dataset to a mathematical model based on a
mixture of probability distributions; distance measures are also used to estimate the
parameters of the selected model.
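To make the role of the proximity measure concrete, the following minimal Python sketch (not from the paper; helper names such as `assign_to_centers` are illustrative) shows a partitioning-style assignment step that uses the Euclidean distance to attach each point to its nearest cluster center:

```python
import math

def euclidean(a, b):
    # Square root of the summed squared coordinate differences (Eq. 3 below)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign_to_centers(points, centers):
    # Partitioning-style step: each point joins the cluster of its nearest center
    return [min(range(len(centers)), key=lambda k: euclidean(p, centers[k]))
            for p in points]

labels = assign_to_centers([(0, 0), (1, 1), (9, 9)], [(0, 0), (10, 10)])
# labels -> [0, 0, 1]
```

Hierarchical, density, and grid methods use the same kind of pairwise distance call, only embedded in a different control structure.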
Proximity Measure Taxonomy
At present, various proximity measures are available for cluster construction.
These proximity measures fall into the following categories:
• Minkowski
• L(1)
• L(2)
• Inner product
• Shannon’s entropy
• Combination
• Intersection
Minkowski family Proximity Measures
• $dis(A,B) = \left( \sum_{i=1}^{n} |A_i - B_i|^p \right)^{1/p}$ (Eq. 1)
• $dis_{manhattan}(A,B) = \sum_{i=1}^{n} |A_i - B_i|$ (Eq. 2) p = 1; constructs hyper-rectangular-shaped clusters.
• $dis_{euclidean}(A,B) = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2}$ (Eq. 3) p = 2; constructs compact or isolated clusters.
• $dis_{minkowski}(A,B) = \left( \sum_{i=1}^{n} |A_i - B_i|^p \right)^{1/p}$ (Eq. 4) p = 2 to ∞; constructs isolated or compact clusters.
• $dis_{chebyshev}(A,B) = \lim_{p \to \infty} \left( \sum_{i=1}^{n} |A_i - B_i|^p \right)^{1/p} = \max_{i=1,\dots,n} |A_i - B_i|$ (Eq. 5)
p → ∞. This distance measure is used when the distance between two data points is dominated by
their greatest absolute difference along any single data dimension.
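The Minkowski-family formulas above can be sketched in Python (a minimal sketch, not taken from the paper; function names are illustrative):

```python
import math

def minkowski(a, b, p):
    # Eq. 1 / Eq. 4: (sum |Ai - Bi|^p)^(1/p); p=1 is Manhattan, p=2 is Euclidean
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def chebyshev(a, b):
    # Eq. 5: the limit p -> infinity reduces to the maximum absolute difference
    return max(abs(x - y) for x, y in zip(a, b))

A, B = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
manhattan = minkowski(A, B, 1)   # 3 + 4 + 0 = 7.0
euclidean = minkowski(A, B, 2)   # sqrt(9 + 16 + 0) = 5.0
cheby = chebyshev(A, B)          # max(3, 4, 0) = 4.0
```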
L(1) (Manhattan) family Proximity Measures
The Manhattan distance measure faces two difficulties with respect to the distance value:
the first is normalization of the distance value, and the second is distinguishing small
from large distances.
• $dis_{sorensen}(A,B) = \frac{\sum_{i=1}^{n} |A_i - B_i|}{\sum_{i=1}^{n} (A_i + B_i)}$ (Eq. 6) normalizes the distance value to between 0 and 1.
• $dis_{soergel}(A,B) = \frac{\sum_{i=1}^{n} |A_i - B_i|}{\sum_{i=1}^{n} \max(A_i, B_i)}$ (Eq. 7) chooses the max coefficient of each data pair.
• $dis_{kulczynski}(A,B) = \frac{\sum_{i=1}^{n} |A_i - B_i|}{\sum_{i=1}^{n} \min(A_i, B_i)}$ (Eq. 8) chooses the min coefficient of each data pair.
• $dis_{motyka}(A,B) = \frac{\sum_{i=1}^{n} \max(A_i, B_i)}{\sum_{i=1}^{n} (A_i + B_i)}$ (Eq. 9) takes the max data point of each pair.
• $dis_{canberra}(A,B) = \sum_{i=1}^{n} \frac{|A_i - B_i|}{A_i + B_i}$ (Eq. 10) absolute difference of the individual data pairs, normalized per dimension.
• $dis_{lorentzian}(A,B) = \sum_{i=1}^{n} \ln(1 + |A_i - B_i|)$ (Eq. 11) normalizes the distance value via the natural logarithm.
• $dis_{wavehedge}(A,B) = \sum_{i=1}^{n} \frac{|A_i - B_i|}{\max(A_i, B_i)}$ (Eq. 12) normalizes the difference of each data
pair by its max value.
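Two representative L(1)-family measures can be sketched as follows (a minimal sketch, not from the paper; function names are illustrative, and non-negative inputs are assumed so that the denominators stay positive):

```python
def sorensen(a, b):
    # Eq. 6: sum of absolute differences over sum of pairwise sums;
    # for non-negative data the result lies between 0 and 1
    return sum(abs(x - y) for x, y in zip(a, b)) / sum(x + y for x, y in zip(a, b))

def canberra(a, b):
    # Eq. 10: per-dimension absolute difference normalized by the pair's sum;
    # zero-sum pairs are skipped to avoid division by zero
    return sum(abs(x - y) / (x + y) for x, y in zip(a, b) if x + y != 0)

A, B = [1.0, 3.0], [2.0, 3.0]
# sorensen(A, B) = (1 + 0) / (3 + 6) = 1/9
# canberra(A, B) = 1/3 + 0 = 1/3
```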
L(2) or χ² (Euclidean) family Proximity Measures
The L(2) family members are based on the Euclidean distance and give the distance value
after normalization.
• $dis_{matusita}(A,B) = \sqrt{\sum_{i=1}^{n} (\sqrt{A_i} - \sqrt{B_i})^2}$ (Eq. 13)
• $dis_{clark}(A,B) = \sqrt{\sum_{i=1}^{n} \left( \frac{|A_i - B_i|}{A_i + B_i} \right)^2}$ (Eq. 14)
• $dis_{divergence}(A,B) = 2 \sum_{i=1}^{n} \frac{(A_i - B_i)^2}{(A_i + B_i)^2}$ (Eq. 15)
• $dis_{squared\_euclidean}(A,B) = \sum_{i=1}^{n} (A_i - B_i)^2$ (Eq. 16)
• $dis_{squared\_chi}(A,B) = \sum_{i=1}^{n} \frac{(A_i - B_i)^2}{A_i + B_i}$ (Eq. 17)
• $dis_{pearson\_chi}(A,B) = \sum_{i=1}^{n} \frac{(A_i - B_i)^2}{B_i}$ (Eq. 18)
• $dis_{neyman\_chi}(A,B) = \sum_{i=1}^{n} \frac{(A_i - B_i)^2}{A_i}$ (Eq. 19)
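Two L(2)-family members can be sketched in Python (a minimal sketch, not from the paper; function names are illustrative, and non-negative inputs are assumed for the square roots):

```python
import math

def matusita(a, b):
    # Eq. 13: Euclidean distance between the square-rooted vectors
    return math.sqrt(sum((math.sqrt(x) - math.sqrt(y)) ** 2 for x, y in zip(a, b)))

def squared_chi(a, b):
    # Eq. 17: squared difference normalized by the pair's sum;
    # zero-sum pairs are skipped to avoid division by zero
    return sum((x - y) ** 2 / (x + y) for x, y in zip(a, b) if x + y != 0)

A, B = [1.0, 4.0], [4.0, 1.0]
# matusita(A, B) = sqrt((1-2)^2 + (2-1)^2) = sqrt(2)
# squared_chi(A, B) = 9/5 + 9/5 = 3.6
```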
Inner product family Proximity Measures
Inner-product measures compute the distance on the basis of products of paired real-valued
data. The base formulation is shown as Eq. 20, and its normalized variants are shown as
Eq. 21 to Eq. 24.
• $dis_{inner\_product}(A,B) = \sum_{i=1}^{n} A_i B_i$ (Eq. 20)
• $dis_{cosine}(A,B) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}}$ (Eq. 21)
• $dis_{jaccard}(A,B) = 1 - \frac{\sum_{i=1}^{n} A_i B_i}{\sum_{i=1}^{n} A_i^2 + \sum_{i=1}^{n} B_i^2 - \sum_{i=1}^{n} A_i B_i}$ (Eq. 22)
• $dis_{dice}(A,B) = 1 - \frac{2 \sum_{i=1}^{n} A_i B_i}{\sum_{i=1}^{n} A_i^2 + \sum_{i=1}^{n} B_i^2}$ (Eq. 23)
• $dis_{harmonic\_mean}(A,B) = 2 \sum_{i=1}^{n} \frac{A_i B_i}{A_i + B_i}$ (Eq. 24)
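The cosine and Jaccard measures above can be sketched as follows (a minimal sketch, not from the paper; function names are illustrative, and nonzero vectors are assumed so the denominators stay positive):

```python
import math

def cosine_similarity(a, b):
    # Eq. 21: inner product normalized by the vectors' Euclidean norms
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def jaccard_distance(a, b):
    # Eq. 22: 1 - (sum AiBi) / (sum Ai^2 + sum Bi^2 - sum AiBi)
    ab = sum(x * y for x, y in zip(a, b))
    return 1 - ab / (sum(x * x for x in a) + sum(y * y for y in b) - ab)

# Identical vectors: cosine similarity 1, Jaccard distance 0
assert abs(cosine_similarity([1.0, 2.0], [1.0, 2.0]) - 1.0) < 1e-9
assert abs(jaccard_distance([1.0, 2.0], [1.0, 2.0])) < 1e-9
```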
Shannon's entropy family Proximity Measures
This family is based on probabilistic uncertainty (entropy). All data points $A_i$ and
$B_i$ must be non-negative and each vector must sum to 1, i.e., A and B are probability
distributions. The natural logarithm is used in the formulas below; taking logarithms in
base 2 instead expresses the distances in bits.
• $dis_{kullback\_leibler}(A,B) = \sum_{i=1}^{n} A_i \ln\frac{A_i}{B_i}$ (Eq. 25)
• $dis_{jeffreys}(A,B) = \sum_{i=1}^{n} (A_i - B_i) \ln\frac{A_i}{B_i}$ (Eq. 26)
• $dis_{k\_divergence}(A,B) = \sum_{i=1}^{n} A_i \ln\frac{2A_i}{A_i + B_i}$ (Eq. 27)
• $dis_{topsoe}(A,B) = \sum_{i=1}^{n} \left( A_i \ln\frac{2A_i}{A_i + B_i} + B_i \ln\frac{2B_i}{A_i + B_i} \right)$ (Eq. 28)
• $dis_{jensen\_shannon}(A,B) = \frac{1}{2} \left[ \sum_{i=1}^{n} A_i \ln\frac{2A_i}{A_i + B_i} + \sum_{i=1}^{n} B_i \ln\frac{2B_i}{A_i + B_i} \right]$ (Eq. 29)
• $dis_{jensen\_difference}(A,B) = \sum_{i=1}^{n} \left[ \frac{A_i \ln A_i + B_i \ln B_i}{2} - \frac{A_i + B_i}{2} \ln\frac{A_i + B_i}{2} \right]$ (Eq. 30)
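The Kullback-Leibler and Jensen-Shannon measures can be sketched as follows (a minimal sketch, not from the paper; function names are illustrative, and both inputs are assumed to be probability vectors as the family requires):

```python
import math

def kullback_leibler(a, b):
    # Eq. 25: sum Ai ln(Ai/Bi); zero Ai terms contribute nothing, so skip them
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

def jensen_shannon(a, b):
    # Eq. 29: half the sum of both directed divergences toward the midpoint (Ai+Bi)/2
    return 0.5 * (sum(x * math.log(2 * x / (x + y)) for x, y in zip(a, b) if x > 0)
                  + sum(y * math.log(2 * y / (x + y)) for x, y in zip(a, b) if y > 0))

P = [0.5, 0.5]
assert kullback_leibler(P, P) == 0.0   # identical distributions are at distance 0
assert jensen_shannon(P, P) == 0.0
```

Unlike KL divergence, the Jensen-Shannon measure is symmetric and stays finite even when some $B_i$ are zero, which is one reason it is often preferred for clustering probability data.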
Combination family Proximity Measures
Members of this family define a distance measure on the basis of a combination of two or
more distance measures.
• $dis_{taneja}(A,B) = \sum_{i=1}^{n} \frac{A_i + B_i}{2} \ln\frac{A_i + B_i}{2\sqrt{A_i B_i}}$ (Eq. 31)
• $dis_{kumar\_johnson}(A,B) = \sum_{i=1}^{n} \frac{(A_i^2 - B_i^2)^2}{2 (A_i B_i)^{3/2}}$ (Eq. 32)
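The Taneja measure (Eq. 31), which combines the arithmetic and geometric means of each data pair, can be sketched as follows (a minimal sketch, not from the paper; the function name is illustrative, and strictly positive inputs are assumed for the logarithm and square root):

```python
import math

def taneja(a, b):
    # Eq. 31: ((Ai+Bi)/2) * ln((Ai+Bi) / (2*sqrt(Ai*Bi))), summed over dimensions;
    # the log argument is the ratio of the pair's arithmetic to geometric mean
    return sum(((x + y) / 2) * math.log((x + y) / (2 * math.sqrt(x * y)))
               for x, y in zip(a, b) if x > 0 and y > 0)

# Identical vectors give 0; differing vectors give a positive value
assert taneja([0.5, 0.5], [0.5, 0.5]) == 0.0
assert taneja([0.7, 0.3], [0.3, 0.7]) > 0.0
```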
Intersection family Proximity Measures
Members of this family define a distance measure on the basis of the intersection
(overlap) between data points. The base intersection measure is shown as Eq. 33 and
its variants as Eq. 34 to Eq. 36.
• $dis_{intersection}(A,B) = \sum_{i=1}^{n} \min(A_i, B_i)$ (Eq. 33)
• $dis_{wavehedges}(A,B) = \sum_{i=1}^{n} \frac{|A_i - B_i|}{\max(A_i, B_i)}$ (Eq. 34)
• $dis_{ruzicka}(A,B) = \frac{\sum_{i=1}^{n} \min(A_i, B_i)}{\sum_{i=1}^{n} \max(A_i, B_i)}$ (Eq. 35)
• $dis_{tanimoto}(A,B) = \frac{\sum_{i=1}^{n} \max(A_i, B_i) - \sum_{i=1}^{n} \min(A_i, B_i)}{\sum_{i=1}^{n} \max(A_i, B_i)}$ (Eq. 36)
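The Ruzicka similarity and its complementary Tanimoto distance can be sketched as follows (a minimal sketch, not from the paper; function names are illustrative, and non-negative inputs with at least one positive component are assumed):

```python
def ruzicka(a, b):
    # Eq. 35 similarity: sum of per-pair minima over sum of per-pair maxima
    return sum(min(x, y) for x, y in zip(a, b)) / sum(max(x, y) for x, y in zip(a, b))

def tanimoto(a, b):
    # Eq. 36 distance: (sum max - sum min) / sum max, i.e. 1 - ruzicka
    mx = sum(max(x, y) for x, y in zip(a, b))
    mn = sum(min(x, y) for x, y in zip(a, b))
    return (mx - mn) / mx

A, B = [1.0, 2.0], [2.0, 1.0]
# min sum = 2, max sum = 4, so ruzicka = 0.5 and tanimoto = 0.5
```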
Analysis of Proximity Measure for Big Data Mining
Fahad et al. (2014) and Pandove et al. (2015) describe clustering-algorithm criteria on
the basis of the Volume, Velocity, and Variety dimensions of big data.
• Volume-related criteria: the clustering must deal with huge dataset sizes, high
dimensionality, and noisy data.
• Variety-related criteria: the clustering must recognize the dataset's data types and
the clusters' shapes.
• Velocity-related criteria: the complexity, scalability, and performance of the
clustering algorithm during execution on real datasets.
This paper takes Volume as the requirement that a clustering algorithm handle large
datasets, Variety as the data types handled, continuous (numerical) and categorical
(nominal and binary), and Velocity as time complexity, in order to identify big-data-
capable distance measures.
Distance measure | Type of function | Volume (High Data Set) | Variety (Type of Data) | Velocity (Time Complexity)
Minkowski family
Eq. 2  | Distance    | Large  | Continuous  | O(n)
Eq. 3  | Distance    | Large  | Continuous  | O(n)
Eq. 4  | Distance    | Large  | Continuous  | O(n)
Eq. 5  | Distance    | Large  | Continuous  | O(n)
L(1) family
Eq. 6  | Semi-metric | Large  | Categorical | O(2n)
Eq. 7  | Distance    | Large  | Continuous  | O(n)
Eq. 8  | Semi-metric | Medium | Categorical | O(n)
Eq. 9  | Similarity  | Medium | Categorical | O(n)
Eq. 10 | Distance    | Large  | Continuous  | O(n)
Eq. 11 | Distance    | Large  | Continuous  | O(n)
Eq. 12 | Distance    | Large  | Continuous  | O(n)
L(2) or χ² family
Eq. 13 | Distance    | Large  | Continuous  | O(n)
Eq. 14 | Distance    | Large  | Continuous  | O(n)
Eq. 15 | Semi-metric | Large  | Categorical | O(n)
Eq. 16 | Semi-metric | Large  | Categorical | O(n)
Eq. 17 | Semi-metric | Large  | Categorical | O(2n)
Eq. 18 | Semi-metric | Large  | Categorical | O(2n)
Eq. 19 | Semi-metric | Large  | Categorical | O(2n)
Inner product family
Eq. 20 | Similarity  | Medium | Categorical | O(3n)
Eq. 21 | Similarity  | Medium | Categorical | O(3n)
Eq. 22 | Semi-metric | Large  | Categorical | O(2n)
Eq. 23 | Semi-metric | Large  | Categorical | O(2n)
Shannon's entropy family
Eq. 25 | Similarity  | Medium | Categorical | O(n)
Eq. 26 | Semi-metric | Large  | Categorical | O(n)
Eq. 27 | Similarity  | Medium | Categorical | O(n)
Eq. 28 | Semi-metric | Large  | Categorical | O(n)
Eq. 29 | Semi-metric | Large  | Categorical | O(n)
Eq. 30 | Semi-metric | Large  | Categorical | O(n)
Combination family
Eq. 31 | Semi-metric | Large  | Categorical | O(2n)
Eq. 32 | Semi-metric | Large  | Categorical | O(2n)
Intersection family
Eq. 33 | Similarity  | Medium | Categorical | O(2n)
Eq. 34 | Distance    | Large  | Continuous  | O(2n)
Conclusions
This paper analyzed the 34 studied proximity measures against the big data mining
characteristics of Volume (dataset size), Variety (dataset type), and Velocity (time
complexity), and finds that Eq. 2 Manhattan, Eq. 3 Euclidean, Eq. 4 Minkowski, Eq. 5
Chebyshev, Eq. 7 Soergel, Eq. 10 Canberra, Eq. 11 Lorentzian, Eq. 12 Wave Hedges,
Eq. 13 Matusita, Eq. 14 Clark, and Eq. 34 Wave Hedges are the most scalable for big
data from the theoretical, practical, and existing-research perspectives.
References
1. Rouhani, S., Rotbei, S., & Hamidi, H. (2017). What do we know about the big data researches? A systematic review from 2011 to 2017.
Journal of Decision Systems, 26(4), 368-393. doi:10.1080/12460125.2018.1437654
2. Jin, X., Wah, B. W., Cheng, X., & Wang, Y. (2015). Significance and Challenges of Big Data Research. Big Data Research, 2(2), 59-64.
doi:10.1016/j.bdr.2015.01.006
3. Chen, M., Mao, S., & Liu, Y. (2014). Big Data: A Survey. Mobile Networks and Applications, 19(2), 171-209. doi:10.1007/s11036-013-
0489-0
4. Chen, W., Oliverio, J., Kim, J. H., & Shen, J. (2018). The Modeling and Simulation of Data Clustering Algorithms in Data Mining with
Big Data. Journal of Industrial Integration and Management, 1850017. doi:10.1142/s2424862218500173
5. Zhao, X., Liang, J., & Dang, C. (2019). A stratified sampling based clustering algorithm for large-scale data. Knowledge-Based Systems,
163, 416-428. doi:10.1016/j.knosys.2018.09.007.
6. Pandove, D., & Goel, S. (2015). A comprehensive study on clustering approaches for big data mining. In Proceedings of IEEE 2nd
International Conference on Electronics and Communication Systems (pp. 1333-1338). IEEE Xplore Digital Library.
doi:10.1109/ecs.2015.7124801
7. Chen, C. P., & Zhang, C. (2014). Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information
Sciences, 275, 314-347. doi:10.1016/j.ins.2014.01.015
8. Amado, A., Cortez, P., Rita, P., & Moro, S. (2018). Research trends on Big Data in Marketing: A text mining and topic modeling based
literature analysis. European Research on Management and Business Economics, 24(1), 1-7. doi:10.1016/j.iedeen.2017.06.002
9. Lee, I. (2017). Big data: Dimensions, evolution, impacts, and challenges. Business Horizons, 60(3), 293-303.
doi:10.1016/j.bushor.2017.01.004
10. Gandomi, A., & Haider, M. (2015). Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information
Management, 35(2), 137-144. doi:10.1016/j.ijinfomgt.2014.10.007
11. Sivarajah, U., Kamal, M. M., Irani, Z., & Weerakkody, V. (2017). Critical analysis of Big Data challenges and analytical methods.
Journal of Business Research, 70, 263-286. doi:10.1016/j.jbusres.2016.08.001
12. Bendechache, M., Tari, A., & Kechadi, M. (2018). Parallel and distributed clustering framework for big spatial data mining. International
Journal of Parallel, Emergent and Distributed Systems, 1-19. doi:10.1080/17445760.2018.1446210
13. Chen, M., Mao, S., & Liu, Y. (2014). Big Data: A Survey. Mobile Networks and Applications, 19(2), 171-209.
doi:10.1007/s11036-013-0489-0
14. Gole, S., & Tidke, B. (2015). A survey of Big Data in social media using data mining techniques. In Proceedings of IEEE ICACCS.
doi:10.1109/ICACCS.2015.7324059
15. Elgendy, N., & Elragal, A. (2014). Big Data Analytics: A Literature Review Paper. LNAI, 8557, 214-227.
doi:10.1007/978-3-319-08976-8_16
16. Cha, S. (2007). Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions. International
Journal of Mathematical Models and Methods in Applied Sciences, 1(4), 300-307.
17. Lin, Y., Jiang, J., & Lee, S. (2014). A Similarity Measure for Text Classification and Clustering. IEEE Transactions on Knowledge and
Data Engineering, 26(7), 1575-1590. doi:10.1109/tkde.2013.19
18. Tavakkol, B., Jeong, M. K., & Albin, S. L. (2017). Object-to-group probabilistic distance measure for uncertain data classification.
Neurocomputing, 230, 143-151. doi:10.1016/j.neucom.2016.12.007
19. Liu, H., Zhang, X., Zhang, X., & Cui, Y. (2017). Self-adapted mixture distance measure for clustering uncertain data. Knowledge-Based
Systems, 126, 33-47. doi:10.1016/j.knosys.2017.04.002
20. Weller-Fahy, D. J., Borghetti, B. J., & Sodemann, A. A. (2015). A Survey of Distance and Similarity Measures Used Within Network
Intrusion Anomaly Detection. IEEE Communications Surveys & Tutorials, 17(1), 70-91. doi:10.1109/comst.2014.2336610
21. Grant, J., & Hunter, A. (2017). Analysing inconsistent information using distance-based measures. International Journal of Approximate
Reasoning, 89, 3-26. doi:10.1016/j.ijar.2016.04.004
22. Merigó, J. M., Casanovas, M., & Zeng, S. (2014). Distance measures with heavy aggregation operators. Applied Mathematical Modelling,
38(13), 3142-3153. doi:10.1016/j.apm.2013.11.036
23. Ikonomakis, E. K., Spyrou, G. M., & Vrahatis, M. N. (2019). Content driven clustering algorithm combining density and distance
functions. Pattern Recognition, 87, 190-202. doi:10.1016/j.patcog.2018.10.007
24. Marcon, E., & Puech, F. (2017). A typology of distance-based measures of spatial concentration. Regional Science and Urban Economics,
62, 56-67. doi:10.1016/j.regsciurbeco.2016.10.004
25. Kocher, M., & Savoy, J. (2017). Distance measures in author profiling. Information Processing & Management, 53(5), 1103-1119.
doi:10.1016/j.ipm.2017.04.004
26. Moghtadaiee, V., & Dempster, A. G. (2015). Determining the best vector distance measure for use in location fingerprinting. Pervasive
and Mobile Computing, 23, 59-79. doi:10.1016/j.pmcj.2014.11.002
27. Chim, H., & Deng, X. (2008). Efficient Phrase-Based Document Similarity for Clustering. IEEE Transactions on Knowledge and Data
Engineering, 20(9), 1217-1229. doi:10.1109/tkde.2008.50
28. Wang, X., Yu, F., & Pedrycz, W. (2016). An area-based shape distance measure of time series. Applied Soft Computing, 48, 650-659.
doi:10.1016/j.asoc.2016.06.033
29. Ramya, R., & Sasikala, T. (2018). A comparative analysis of similarity distance measure functions for biocryptic authentication in cloud
databases. Cluster Computing. doi:10.1007/s10586-017-1568-y
30. Abudalfa, S. I., & Mikki, M. (2013). K-means algorithm with a novel distance measure. Turkish Journal Of Electrical Engineering &
Computer Sciences, 21, 1665-1684. doi:10.3906/elk-1010-869
31. Nadler, M., & Smith, E. P. (1993). Pattern recognition engineering. New York: John Wiley & Sons. ISBN-13: 978-0471622932
32. Gan, G., Ma, C., & Wu, J. (2007). Data clustering: Theory, algorithms, and applications. Philadelphia, PA: SIAM, Society for Industrial
and Applied Mathematics.
33. Everitt, B. S. (2011). Cluster Analysis (5th ed., Wiley Series in Probability and Statistics). Chichester, West Sussex, United
Kingdom: John Wiley & Sons. ISBN 978-0-470-74991-3
34. Aggarwal, C. C., & Reddy, C. (2014). Data Clustering Algorithms and Applications. CRC Press Taylor & Francis Group.ISBN 978-1-
4665-5822-9
35. Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge: Cambridge University Press.
36. Fahad, A., Alshatri, N., Tari, Z., Alamri, A., Khalil, I., Zomaya, A. Y., . . . Bouras, A. (2014). A Survey of Clustering Algorithms for Big
Data: Taxonomy and Empirical Analysis. IEEE Transactions on Emerging Topics in Computing, 2(3), 267-279.
doi:10.1109/tetc.2014.2330519
40. Shirkhorshidi, A. S., Aghabozorgi, S., & Wah, T. Y. (2015). A Comparison Study on Similarity and Dissimilarity Measures in Clustering
Continuous Data. PLoS ONE, 10(12), doi:10.1371/journal.pone.0144059
41. Kumar, V., Chhabra, J. K., & Kumar, D. (2013). Impact of Distance Measures on the Performance of Clustering Algorithms. Intelligent
Computing, Networking, and Informatics Advances in Intelligent Systems and Computing, 183-190. doi:10.1007/978-81-322-1665-0_17
42. Selvi, C., & Sivasankar, E. (2018). A novel similarity measure towards effective recommendation using Matusita coefficient for
Collaborative Filtering in a sparse dataset. Sādhanā, 43(12). doi:10.1007/s12046-018-0970-3
Ability Study of Proximity Measure for Big Data Mining Context on Clustering

More Related Content

PPTX
Types of clustering and different types of clustering algorithms
PDF
PPTX
Handling noisy data
DOC
K-MEDOIDS CLUSTERING USING PARTITIONING AROUND MEDOIDS FOR PERFORMING FACE R...
PDF
The International Journal of Engineering and Science (The IJES)
PDF
Finding Relationships between the Our-NIR Cluster Results
PPTX
Quantum persistent k cores for community detection
PPTX
Cluster Validation
Types of clustering and different types of clustering algorithms
Handling noisy data
K-MEDOIDS CLUSTERING USING PARTITIONING AROUND MEDOIDS FOR PERFORMING FACE R...
The International Journal of Engineering and Science (The IJES)
Finding Relationships between the Our-NIR Cluster Results
Quantum persistent k cores for community detection
Cluster Validation

What's hot (20)

PPT
DATA MINING:Clustering Types
PPTX
"Principal Component Analysis - the original paper" presentation @ Papers We ...
PDF
Hierarchical Clustering
PDF
An_Accelerated_Nearest_Neighbor_Search_Method_for_the_K-Means_Clustering_Algo...
PPTX
Cluster Analysis
PDF
Paper id 26201483
PPTX
Graph based approaches to Gene Expression Clustering
PPTX
Cluster analysis
 
PPTX
Introduction to Linear Discriminant Analysis
PPTX
master defense hyun-wong choi_2019_05_14_rev19
PPTX
defense hyun-wong choi_2019_05_14_rev18
PPTX
master defense hyun-wong choi_2019_05_14_rev19
PDF
A Correlative Information-Theoretic Measure for Image Similarity
PPTX
Final edited master defense-hyun_wong choi_2019_05_23_rev21
PDF
Fractal Image Compression By Range Block Classification
PDF
08 distributed optimization
PPTX
L4 cluster analysis NWU 4.3 Graphics Course
PDF
MK-Prototypes: A Novel Algorithm for Clustering Mixed Type Data
PDF
Unsupervised learning clustering
PDF
07 dimensionality reduction
DATA MINING:Clustering Types
"Principal Component Analysis - the original paper" presentation @ Papers We ...
Hierarchical Clustering
An_Accelerated_Nearest_Neighbor_Search_Method_for_the_K-Means_Clustering_Algo...
Cluster Analysis
Paper id 26201483
Graph based approaches to Gene Expression Clustering
Cluster analysis
 
Introduction to Linear Discriminant Analysis
master defense hyun-wong choi_2019_05_14_rev19
defense hyun-wong choi_2019_05_14_rev18
master defense hyun-wong choi_2019_05_14_rev19
A Correlative Information-Theoretic Measure for Image Similarity
Final edited master defense-hyun_wong choi_2019_05_23_rev21
Fractal Image Compression By Range Block Classification
08 distributed optimization
L4 cluster analysis NWU 4.3 Graphics Course
MK-Prototypes: A Novel Algorithm for Clustering Mixed Type Data
Unsupervised learning clustering
07 dimensionality reduction
Ad

Similar to Ability Study of Proximity Measure for Big Data Mining Context on Clustering (20)

PPTX
03 Data Mining Techniques
PPTX
Cluster analysis
PPT
4 DM Clustering ifor computerscience.ppt
PPTX
Cluster Analysis
PPTX
A Comprehensive Study of Clustering Algorithms for Big Data Mining with MapRe...
PPTX
PDF
Ir3116271633
PPTX
Lect4 principal component analysis-I
PDF
Module - 5 Machine Learning-22ISE62.pdf
PPT
26-Clustering MTech-2017.ppt
PPT
DM UNIT_4 PPT for btech final year students
PDF
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
PDF
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
PPTX
machine learning - Clustering in R
PDF
Ensemble based Distributed K-Modes Clustering
PDF
CSA 3702 machine learning module 3
PPTX
MODULE 4_ CLUSTERING.pptx
PDF
SindyAutoEncoder: Interpretable Latent Dynamics via Sparse Identification
PDF
Histogram-Based Method for Effective Initialization of the K-Means Clustering...
PPTX
Could a Data Science Program use Data Science Insights?
03 Data Mining Techniques
Cluster analysis
4 DM Clustering ifor computerscience.ppt
Cluster Analysis
A Comprehensive Study of Clustering Algorithms for Big Data Mining with MapRe...
Ir3116271633
Lect4 principal component analysis-I
Module - 5 Machine Learning-22ISE62.pdf
26-Clustering MTech-2017.ppt
DM UNIT_4 PPT for btech final year students
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
machine learning - Clustering in R
Ensemble based Distributed K-Modes Clustering
CSA 3702 machine learning module 3
MODULE 4_ CLUSTERING.pptx
SindyAutoEncoder: Interpretable Latent Dynamics via Sparse Identification
Histogram-Based Method for Effective Initialization of the K-Means Clustering...
Could a Data Science Program use Data Science Insights?
Ad

Recently uploaded (20)

PDF
Fluorescence-microscope_Botany_detailed content
PPTX
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
PPTX
IBA_Chapter_11_Slides_Final_Accessible.pptx
PDF
Business Analytics and business intelligence.pdf
PPTX
Business Acumen Training GuidePresentation.pptx
PDF
Mega Projects Data Mega Projects Data
PDF
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
PDF
Foundation of Data Science unit number two notes
PDF
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf
PPTX
oil_refinery_comprehensive_20250804084928 (1).pptx
PDF
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
PDF
.pdf is not working space design for the following data for the following dat...
PPTX
Introduction to machine learning and Linear Models
PPTX
1_Introduction to advance data techniques.pptx
PPTX
Supervised vs unsupervised machine learning algorithms
PPTX
mbdjdhjjodule 5-1 rhfhhfjtjjhafbrhfnfbbfnb
PPTX
Computer network topology notes for revision
PPTX
Data_Analytics_and_PowerBI_Presentation.pptx
PPTX
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
PPTX
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
Fluorescence-microscope_Botany_detailed content
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
IBA_Chapter_11_Slides_Final_Accessible.pptx
Business Analytics and business intelligence.pdf
Business Acumen Training GuidePresentation.pptx
Mega Projects Data Mega Projects Data
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
Foundation of Data Science unit number two notes
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf
oil_refinery_comprehensive_20250804084928 (1).pptx
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
.pdf is not working space design for the following data for the following dat...
Introduction to machine learning and Linear Models
1_Introduction to advance data techniques.pptx
Supervised vs unsupervised machine learning algorithms
mbdjdhjjodule 5-1 rhfhhfjtjjhafbrhfnfbbfnb
Computer network topology notes for revision
Data_Analytics_and_PowerBI_Presentation.pptx
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg

Ability Study of Proximity Measure for Big Data Mining Context on Clustering

  • 1. Ability Study of Proximity Measure for Big Data Mining Context on Clustering Kamlesh Kumar Pandey Research Scholar Dept. of Computer Science & Applications Dr. HariSingh Gour Vishwavidyalaya (A Central University), Sagar, M.P. E-mail: kamleshamk@gmail.com 2nd International Conference on Communication and Computational Technologies (Paper ID : 16) Paper Presentation on
  • 2. Content • Objectives • Big Data • Big Data Mining • Proximity Measures Enabled Clustering Taxonomy • Proximity Measure Taxonomy • Analysis of Proximity Measure for Big Data Mining
  • 3. Objectives • The objective of this study is identifying a proximity measures for big data clustering respect to volume, variety, and velocity and presents how to create a cluster with the help of a proximity measure under the partition, hierarchical, density, grid, model, fuzzy and graph based cluster taxonomy.
  • 4. Big Data • Present time technology is growing very fast. Every originations, industries or person moving towards Internet of things, cloud computing, warless sensor networks, social media, internet. These sources generated a data growing fast in per second, minutes or per hour in size of Terabytes or Petabytes . • Diebold et Al. (2000) is a first writer who discussed the word Big Data in his research paper. All of these authors define Big Data there means if the data set is large then gigabyte then these type of data set is known as Big Data. • Doug Laney et al (2001) was the first person who gave a proper definition for Big Data. He gave three characteristics Volume, Variety, and Velocity of Big Data and these characteristics known as 3 V’s of Big Data Management. If traditional data have met two basic characteristic at a time these data are come to under Big data. • Gartner (2012), “Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making”
  • 5. Big Data V’s • In present time seven V’s used for Big Data where the first three V’s Volume, Variety, and Velocity are the main characteristics of big data. In addition to Veracity, Variability, Value, and Visualization are depending on the organization.
  • 6. Big Data Mining • Big Data Mining fetching on the requested information, uncovering hidden relationship or patterns or extracting for the needed information or knowledge from a dataset these datasets have to meet three V’s of Big Data with higher complexity.
  • 7. Clustering • Clustering is the one of the approaches for analysis and discovering the complex relation, pattern, and data in the form of underlying groups for the unlabeled object and Big Data perspective, the clustering algorithm must be deal high volume, high variety and high velocity with scalability.
  • 8. Clustering Taxonomy • Partitioning based Clustering: These clustering methods constructs the clusters on the bases of center in the choice of k number of clusters. This clustering method used proximity measures as finding out the center of the cluster creation. • Hierarchical based Clustering: In this approach, large data are organized in a hierarchical manner based on the medium of proximity and its detect on easily relationship between data points. • Density Based Clustering: This clustering method scans the spatial databases and used for probability distribution based distance measure and distance measurements for creating core, border and noise point for density cluster. • Grid-Based Clustering: This clustering algorithm splits the data space into a grid structure, calculate the cell density, calculate the grid structure with the help of distance measure. • Model-Based Clustering: This clustering method optimizes the dataset into the mathematical model based on the mixing of probability distributions based distance measure and distance measure are also measuring the parameter of the selected model.
  • 9. Proximity Measure Taxonomy In present time various proximity measure is available for cluster construction and these proximity measure categories under • Minkowski • L(1) • L(2) • Inner product • Shannon’s entropy • Combination • Intersection
  • 10. Minkowski family Proximity Measures • 𝑑𝑖𝑠 𝐴, 𝐵 = 𝑖=1 𝑛 |𝐴𝑖 − 𝐵𝑖| 𝑃 1 𝑃 (Eq.1) • 𝑑𝑖𝑠manhattan 𝐴, 𝐵 = 𝑖=1 𝑛 |𝐴𝑖 − 𝐵𝑖| 𝑃 (Eq.2) p=1, construct hyper-rectangular shape cluster. • 𝑑𝑖𝑠euclidean 𝐴, 𝐵 = 2 𝑖=1 𝑛 |𝐴𝑖 − 𝐵𝑖|2 (Eq.3) p=2, construct compact or isolated cluster. • 𝑑𝑖𝑠minkowski 𝐴, 𝐵 = 𝑝 𝑖=1 𝑛 |𝐴𝑖 − 𝐵𝑖| 𝑝 (Eq.4) p= 2 to ∞, construct isolated or compacted cluster. • 𝑑𝑖𝑠chebyshev 𝐴, 𝐵 = ∞ 𝑖=1 𝑛 |𝐴𝑖 − 𝐵𝑖|∞ = 𝑚𝑎𝑥𝑖=1 𝑛 |𝐴𝑖 − 𝐵𝑖| 𝑃 (Eq.5) p= ∞ to maximum. This distance measure is used for when two data points are greatest of their absolute magnitude along with data dimension.
  • 11. L(1) family, Manhattan family Proximity Measures The Manhattan distance measure faces two difficulties in the respects of distance value. First one is normalization of a distance value and second is related to figure out of small and large distance. • 𝑑𝑖𝑠sorensen 𝐴, 𝐵 = 𝑖=1 𝑛 |𝐴 𝑖−𝐵 𝑖| 𝑖=1 𝑛 |𝐴 𝑖+𝐵 𝑖| (Eq.6) Normalized dis value between 0 and 1. • 𝑑𝑖𝑠soergel 𝐴, 𝐵 = 𝑖=1 𝑛 |𝐴 𝑖−𝐵 𝑖| 𝑖=1 𝑛 max 𝐴 𝑖 𝐵 𝑖 (Eq.7) choosing the max coefficient data point. • 𝑑𝑖𝑠kulczynski 𝐴, 𝐵 = 𝑖=1 𝑛 |𝐴 𝑖−𝐵 𝑖| 𝑖=1 𝑛 min 𝐴 𝑖 𝐵 𝑖 (Eq.8) choosing the min coefficient data point. • 𝑑𝑖𝑠motyka 𝐴, 𝐵 = 𝑖=1 𝑛 max 𝐴 𝑖 𝐵 𝑖 𝑖=1 𝑛 |𝐴 𝑖+𝐵 𝑖| (Eq.9) takes the max data point of the data set. • 𝑑𝑖𝑠canberra 𝐴, 𝐵 = 𝑖=1 𝑛 |𝐴 𝑖−𝐵𝑖| |𝐴 𝑖+𝐵𝑖| (Eq.10) absolute difference of the individual data. • 𝑑𝑖𝑠lorentzian 𝐴, 𝐵 = i=1 n ln(1 + |𝐴𝑖 − 𝐵𝑖| (Eq.11) Normalized dis value natural logarithm • 𝑑𝑖𝑠wavehedge 𝐴, 𝐵 = 𝑖=1 𝑛 |𝐴 𝑖−𝐵 𝑖| max 𝐴 𝑖 𝐵 𝑖 (Eq.12) normalizes the difference of each data pair with its max value
  • 12. L(2) or χ 2, Euclidian family Proximity Measures The L(2) family member based on the Euclidian distance and gives the distance value after normalization. • 𝑑𝑖𝑠matusita 𝐴, 𝐵 = 𝑖=1 𝑛 ( 𝐴𝑖 − 𝐵𝑖)2 (Eq. 13) • 𝑑𝑖𝑠clark 𝐴, 𝐵 = 𝑖=1 𝑛 ( |𝐴 𝑖−𝐵 𝑖| (𝐴 𝑖+𝐵 𝑖) )2 (Eq. 14) • 𝑑𝑖𝑠divergence 𝐴, 𝐵 = 2 𝑖=1 𝑛 (𝐴 𝑖−𝐵 𝑖)2 (𝐴 𝑖+𝐵 𝑖)2 (Eq. 15) • 𝑑𝑖𝑠squared_euclidean 𝐴, 𝐵 = 𝑖=1 𝑛 (𝐴𝑖 − 𝐵𝑖)2 (Eq.16) • 𝑑𝑖𝑠squared_chi 𝐴, 𝐵 = 𝑖=1 𝑛 (𝐴 𝑖−𝐵 𝑖)2 (𝐴 𝑖+𝐵 𝑖) (Eq. 17) • 𝑑𝑖𝑠pearson_chi 𝐴, 𝐵 = 𝑖=1 𝑛 (𝐴 𝑖−𝐵 𝑖)2 𝐵 𝑖 (Eq. 18) • 𝑑𝑖𝑠neyman_chi 𝐴, 𝐵 = 𝑖=1 𝑛 (𝐴 𝑖−𝐵 𝑖)2 𝐴 𝑖 (Eq. 19)
  • 13. Inner Product Family Proximity Measures
The inner product distance measures give a distance based on the products of paired coordinates. The base formulation is shown as Eq. 20, and its normalised variants as Eq. 21 to Eq. 24.
• dis_{inner\_product}(A, B) = \sum_{i=1}^{n} A_i B_i (Eq. 20)
• dis_{cosine}(A, B) = \sum_{i=1}^{n} A_i B_i / \left( \sqrt{ \sum_{i=1}^{n} A_i^2 } \sqrt{ \sum_{i=1}^{n} B_i^2 } \right) (Eq. 21)
• dis_{jaccard}(A, B) = 1 - \sum_{i=1}^{n} A_i B_i / \left( \sum_{i=1}^{n} A_i^2 + \sum_{i=1}^{n} B_i^2 - \sum_{i=1}^{n} A_i B_i \right) (Eq. 22)
• dis_{dice}(A, B) = 1 - 2 \sum_{i=1}^{n} A_i B_i / \left( \sum_{i=1}^{n} A_i^2 + \sum_{i=1}^{n} B_i^2 \right) (Eq. 23)
• dis_{harmonic\_mean}(A, B) = 2 \sum_{i=1}^{n} A_i B_i / (A_i + B_i) (Eq. 24)
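A minimal sketch of two normalised members of this family (illustrative only; function names and sample vectors are assumptions):

```python
import math

def cosine(a, b):
    # Cosine similarity (Eq. 21): inner product normalised by both vector lengths.
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def dice(a, b):
    # Dice dissimilarity (Eq. 23): 1 minus twice the inner product over the summed norms.
    num = 2 * sum(x * y for x, y in zip(a, b))
    den = sum(x * x for x in a) + sum(y * y for y in b)
    return 1 - num / den

print(cosine([1.0, 2.0], [2.0, 4.0]))  # parallel vectors: 1.0
print(dice([1.0, 2.0], [1.0, 2.0]))    # identical vectors: 0.0
```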
  • 14. Shannon’s Entropy Family Proximity Measures
This family is based on probabilistic uncertainty (entropy). All data points A_i must be non-negative, their sum must equal 1, and a fixed logarithm base is used throughout the measure.
• dis_{kullback\_leibler}(A, B) = \sum_{i=1}^{n} A_i \ln(A_i / B_i) (Eq. 25)
• dis_{jeffreys}(A, B) = \sum_{i=1}^{n} (A_i - B_i) \ln(A_i / B_i) (Eq. 26)
• dis_{k\_divergence}(A, B) = \sum_{i=1}^{n} A_i \ln\left( 2A_i / (A_i + B_i) \right) (Eq. 27)
• dis_{topsoe}(A, B) = \sum_{i=1}^{n} \left[ A_i \ln\left( 2A_i / (A_i + B_i) \right) + B_i \ln\left( 2B_i / (A_i + B_i) \right) \right] (Eq. 28)
• dis_{jensen\_shannon}(A, B) = \frac{1}{2} \left[ \sum_{i=1}^{n} A_i \ln\left( 2A_i / (A_i + B_i) \right) + \sum_{i=1}^{n} B_i \ln\left( 2B_i / (A_i + B_i) \right) \right] (Eq. 29)
• dis_{jensen\_difference}(A, B) = \sum_{i=1}^{n} \left[ \frac{A_i \ln A_i + B_i \ln B_i}{2} - \frac{A_i + B_i}{2} \ln\left( \frac{A_i + B_i}{2} \right) \right] (Eq. 30)
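A plain-Python sketch of two members of this family, assuming the inputs are probability vectors as the slide requires (function names and sample vectors are my own, not from the paper):

```python
import math

def kullback_leibler(a, b):
    # Kullback-Leibler divergence (Eq. 25); a and b are probability vectors.
    # Terms with a_i = 0 contribute 0 by convention, hence the guard.
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

def jensen_shannon(a, b):
    # Jensen-Shannon divergence (Eq. 29): a symmetric, bounded relative of KL.
    left = sum(x * math.log(2 * x / (x + y)) for x, y in zip(a, b) if x > 0)
    right = sum(y * math.log(2 * y / (x + y)) for x, y in zip(a, b) if y > 0)
    return 0.5 * (left + right)

P, Q = [0.5, 0.5], [0.9, 0.1]
print(kullback_leibler(P, Q))
print(jensen_shannon(P, Q))
```

Unlike KL, the Jensen-Shannon divergence is symmetric in its arguments, which matters for clustering algorithms that assume symmetric proximity.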
  • 15. Combination Family Proximity Measures
The members of this family define a distance measure as a combination of two or more distance measures.
• dis_{taneja}(A, B) = \sum_{i=1}^{n} \frac{A_i + B_i}{2} \ln\left( \frac{A_i + B_i}{2 \sqrt{A_i B_i}} \right) (Eq. 31)
• dis_{kumar\_johnson}(A, B) = \sum_{i=1}^{n} \frac{(A_i^2 - B_i^2)^2}{2 (A_i B_i)^{3/2}} (Eq. 32)
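Both combination measures can be sketched directly from the formulas (an illustrative sketch under the assumption of strictly positive inputs; the function names and vectors are my own):

```python
import math

def taneja(a, b):
    # Taneja distance (Eq. 31): compares the arithmetic and geometric means per coordinate.
    # Zero whenever a_i == b_i, since AM == GM there and log(1) == 0.
    return sum(((x + y) / 2) * math.log((x + y) / (2 * math.sqrt(x * y)))
               for x, y in zip(a, b) if x > 0 and y > 0)

def kumar_johnson(a, b):
    # Kumar-Johnson distance (Eq. 32).
    return sum((x * x - y * y) ** 2 / (2 * (x * y) ** 1.5)
               for x, y in zip(a, b) if x > 0 and y > 0)

P, Q = [0.5, 0.5], [0.9, 0.1]
print(taneja(P, Q))
print(kumar_johnson(P, Q))
```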
  • 16. Intersection Family Proximity Measures
The members of this family define a distance measure based on the intersection between data points. The base formulation is shown as Eq. 33, and its variants as Eq. 34 to Eq. 36.
• dis_{intersection}(A, B) = \sum_{i=1}^{n} \min(A_i, B_i) (Eq. 33)
• dis_{wavehedges}(A, B) = \sum_{i=1}^{n} |A_i - B_i| / \max(A_i, B_i) (Eq. 34)
• sim_{ruzicka}(A, B) = \sum_{i=1}^{n} \min(A_i, B_i) / \sum_{i=1}^{n} \max(A_i, B_i) (Eq. 35)
• dis_{tanimoto}(A, B) = \left( \sum_{i=1}^{n} \max(A_i, B_i) - \sum_{i=1}^{n} \min(A_i, B_i) \right) / \sum_{i=1}^{n} \max(A_i, B_i) (Eq. 36)
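The Ruzicka similarity and Tanimoto distance sum to 1 by construction, which a short sketch makes visible (illustrative only; function names and sample vectors are assumptions):

```python
def ruzicka(a, b):
    # Ruzicka similarity (Eq. 35): coordinate-wise minima over maxima.
    return sum(min(x, y) for x, y in zip(a, b)) / sum(max(x, y) for x, y in zip(a, b))

def tanimoto(a, b):
    # Tanimoto distance (Eq. 36): the dissimilarity complement of Ruzicka.
    s_min = sum(min(x, y) for x, y in zip(a, b))
    s_max = sum(max(x, y) for x, y in zip(a, b))
    return (s_max - s_min) / s_max

A, B = [1.0, 3.0, 5.0], [2.0, 1.0, 5.0]
print(ruzicka(A, B))   # 7/10 = 0.7
print(tanimoto(A, B))  # 3/10 = 0.3
```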
  • 17. Analysis of Proximity Measures for Big Data Mining
Fahad et al. (2014) and Pandove and Goel (2015) describe clustering-algorithm criteria based on the Volume, Velocity, and Variety dimensions of big data.
• Volume-related criteria: the clustering algorithm must cope with huge, high-dimensional, and noisy datasets.
• Variety-related criteria: the clustering algorithm must handle different dataset categories (data types) and cluster shapes.
• Velocity-related criteria: the complexity, scalability, and performance of the clustering algorithm when executed on real datasets.
To identify big-data-ready distance measures, this paper takes Volume as the ability to handle large-scale datasets, Variety as the data types handled, namely continuous (numerical) and categorical (nominal and binary), and Velocity as time complexity.
  • 18. Analysis of Proximity Measures for Big Data Mining

Distance measure | Type of function | Volume (High Data Set) | Variety (Type of Data) | Velocity (Time Complexity)
Minkowski family:
Eq. 2 | Distance | Large | Continuous | O(n)
Eq. 3 | Distance | Large | Continuous | O(n)
Eq. 4 | Distance | Large | Continuous | O(n)
Eq. 5 | Distance | Large | Continuous | O(n)
L(1) family:
Eq. 6 | Semi-metric | Large | Categorical | O(2n)
Eq. 7 | Distance | Large | Continuous | O(n)
Eq. 8 | Semi-metric | Medium | Categorical | O(n)
Eq. 9 | Similarity | Medium | Categorical | O(n)
Eq. 10 | Distance | Large | Continuous | O(n)
Eq. 11 | Distance | Large | Continuous | O(n)
Eq. 12 | Distance | Large | Continuous | O(n)
  • 19. Analysis of Proximity Measures for Big Data Mining

Distance measure | Type of function | Volume (High Data Set) | Variety (Type of Data) | Velocity (Time Complexity)
L(2) or χ² family:
Eq. 13 | Distance | Large | Continuous | O(n)
Eq. 14 | Distance | Large | Continuous | O(n)
Eq. 15 | Semi-metric | Large | Categorical | O(n)
Eq. 16 | Semi-metric | Large | Categorical | O(n)
Eq. 17 | Semi-metric | Large | Categorical | O(2n)
Eq. 18 | Semi-metric | Large | Categorical | O(2n)
Eq. 19 | Semi-metric | Large | Categorical | O(2n)
Inner product family:
Eq. 20 | Similarity | Medium | Categorical | O(3n)
Eq. 21 | Similarity | Medium | Categorical | O(3n)
Eq. 22 | Semi-metric | Large | Categorical | O(2n)
Eq. 23 | Semi-metric | Large | Categorical | O(2n)
  • 20. Analysis of Proximity Measures for Big Data Mining

Distance measure | Type of function | Volume (High Data Set) | Variety (Type of Data) | Velocity (Time Complexity)
Shannon’s entropy family:
Eq. 25 | Similarity | Medium | Categorical | O(n)
Eq. 26 | Semi-metric | Large | Categorical | O(n)
Eq. 27 | Similarity | Medium | Categorical | O(n)
Eq. 28 | Semi-metric | Large | Categorical | O(n)
Eq. 29 | Semi-metric | Large | Categorical | O(n)
Eq. 30 | Semi-metric | Large | Categorical | O(n)
Combination family:
Eq. 31 | Semi-metric | Large | Categorical | O(2n)
Eq. 32 | Semi-metric | Large | Categorical | O(2n)
Intersection family:
Eq. 33 | Similarity | Medium | Categorical | O(2n)
Eq. 34 | Distance | Large | Continuous | O(2n)
  • 21. Conclusions
This paper analysed all 34 studied proximity measures against the big data characteristics Volume (dataset size), Variety (dataset type), and Velocity (time complexity). It finds that the Manhattan (Eq. 2), Euclidean (Eq. 3), Minkowski (Eq. 4), Chebyshev (Eq. 5), Soergel (Eq. 7), Canberra (Eq. 10), Lorentzian (Eq. 11), Wave Hedges (Eq. 12), Matusita (Eq. 13), Clark (Eq. 14), and Wave Hedges (Eq. 34) distances are the most scalable for big data from theoretical, practical, and existing-research perspectives.
  • 22. References
1. Rouhani, S., Rotbei, S., & Hamidi, H. (2017). What do we know about the big data researches? A systematic review from 2011 to 2017. Journal of Decision Systems, 26(4), 368-393. doi:10.1080/12460125.2018.1437654
2. Jin, X., Wah, B. W., Cheng, X., & Wang, Y. (2015). Significance and Challenges of Big Data Research. Big Data Research, 2(2), 59-64. doi:10.1016/j.bdr.2015.01.006
3. Chen, M., Mao, S., & Liu, Y. (2014). Big Data: A Survey. Mobile Networks and Applications, 19(2), 171-209. doi:10.1007/s11036-013-0489-0
4. Chen, W., Oliverio, J., Kim, J. H., & Shen, J. (2018). The Modeling and Simulation of Data Clustering Algorithms in Data Mining with Big Data. Journal of Industrial Integration and Management, 1850017. doi:10.1142/s2424862218500173
5. Zhao, X., Liang, J., & Dang, C. (2019). A stratified sampling based clustering algorithm for large-scale data. Knowledge-Based Systems, 163, 416-428. doi:10.1016/j.knosys.2018.09.007
6. Pandove, D., & Goel, S. (2015). A comprehensive study on clustering approaches for big data mining. In Proceedings of IEEE 2nd International Conference on Electronics and Communication Systems (pp. 1333-1338). IEEE Xplore Digital Library. doi:10.1109/ecs.2015.7124801
7. Chen, C. P., & Zhang, C. (2014). Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information Sciences, 275, 314-347. doi:10.1016/j.ins.2014.01.015
8. Amado, A., Cortez, P., Rita, P., & Moro, S. (2018). Research trends on Big Data in Marketing: A text mining and topic modeling based literature analysis. European Research on Management and Business Economics, 24(1), 1-7. doi:10.1016/j.iedeen.2017.06.002
9. Lee, I. (2017). Big data: Dimensions, evolution, impacts, and challenges. Business Horizons, 60(3), 293-303. doi:10.1016/j.bushor.2017.01.004
10. Gandomi, A., & Haider, M. (2015). Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35(2), 137-144. doi:10.1016/j.ijinfomgt.2014.10.007
  • 23. References
11. Sivarajah, U., Kamal, M. M., Irani, Z., & Weerakkody, V. (2017). Critical analysis of Big Data challenges and analytical methods. Journal of Business Research, 70, 263-286. doi:10.1016/j.jbusres.2016.08.001
12. Bendechache, M., Tari, A., & Kechadi, M. (2018). Parallel and distributed clustering framework for big spatial data mining. International Journal of Parallel, Emergent and Distributed Systems, 1-19. doi:10.1080/17445760.2018.1446210
13. Chen, M., Mao, S., & Liu, Y. (2014). Big Data: A Survey. Mobile Networks and Applications, 19(2), 171-209. doi:10.1007/s11036-013-0489-0
14. Gole, S., & Tidke, B. (2015). A survey of Big Data in social media using data mining techniques. In Proceedings of IEEE ICACCS. doi:10.1109/ICACCS.2015.7324059
15. Elgendy, N., & E. A. (2014). Big Data Analytics: A Literature Review Paper. LNAI, 8557, 214-227. doi:10.1007/978-3-319-08976-8_16
16. Cha, S. (2007). Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions. International Journal of Mathematical Models and Methods in Applied Sciences, 4(1), 300-307.
17. Lin, Y., Jiang, J., & Lee, S. (2014). A Similarity Measure for Text Classification and Clustering. IEEE Transactions on Knowledge and Data Engineering, 26(7), 1575-1590. doi:10.1109/tkde.2013.19
18. Tavakkol, B., Jeong, M. K., & Albin, S. L. (2017). Object-to-group probabilistic distance measure for uncertain data classification. Neurocomputing, 230, 143-151. doi:10.1016/j.neucom.2016.12.007
19. Liu, H., Zhang, X., Zhang, X., & Cui, Y. (2017). Self-adapted mixture distance measure for clustering uncertain data. Knowledge-Based Systems, 126, 33-47. doi:10.1016/j.knosys.2017.04.002
20. Weller-Fahy, D. J., Borghetti, B. J., & Sodemann, A. A. (2015). A Survey of Distance and Similarity Measures Used Within Network Intrusion Anomaly Detection. IEEE Communications Surveys & Tutorials, 17(1), 70-91. doi:10.1109/comst.2014.2336610
21. Grant, J., & Hunter, A. (2017). Analysing inconsistent information using distance-based measures. International Journal of Approximate Reasoning, 89, 3-26. doi:10.1016/j.ijar.2016.04.004
  • 24. References
22. Merigó, J. M., Casanovas, M., & Zeng, S. (2014). Distance measures with heavy aggregation operators. Applied Mathematical Modelling, 38(13), 3142-3153. doi:10.1016/j.apm.2013.11.036
23. Ikonomakis, E. K., Spyrou, G. M., & Vrahatis, M. N. (2019). Content driven clustering algorithm combining density and distance functions. Pattern Recognition, 87, 190-202. doi:10.1016/j.patcog.2018.10.007
24. Marcon, E., & Puech, F. (2017). A typology of distance-based measures of spatial concentration. Regional Science and Urban Economics, 62, 56-67. doi:10.1016/j.regsciurbeco.2016.10.004
25. Kocher, M., & Savoy, J. (2017). Distance measures in author profiling. Information Processing & Management, 53(5), 1103-1119. doi:10.1016/j.ipm.2017.04.004
26. Moghtadaiee, V., & Dempster, A. G. (2015). Determining the best vector distance measure for use in location fingerprinting. Pervasive and Mobile Computing, 23, 59-79. doi:10.1016/j.pmcj.2014.11.002
27. Chim, H., & Deng, X. (2008). Efficient Phrase-Based Document Similarity for Clustering. IEEE Transactions on Knowledge and Data Engineering, 20(9), 1217-1229. doi:10.1109/tkde.2008.50
28. Wang, X., Yu, F., & Pedrycz, W. (2016). An area-based shape distance measure of time series. Applied Soft Computing, 48, 650-659. doi:10.1016/j.asoc.2016.06.033
29. Ramya, R., & Sasikala, T. (2018). A comparative analysis of similarity distance measure functions for biocryptic authentication in cloud databases. Cluster Computing. doi:10.1007/s10586-017-1568-y
30. Abudalfa, S. I., & Mikki, M. (2013). K-means algorithm with a novel distance measure. Turkish Journal of Electrical Engineering & Computer Sciences, 21, 1665-1684. doi:10.3906/elk-1010-869
31. Nadler, M., & Smith, E. P. (1993). Pattern Recognition Engineering. New York: John Wiley & Sons. ISBN-13: 978-0471622932
32. Gan, G., Ma, C., & Wu, J. (2007). Data Clustering: Theory, Algorithms, and Applications. Philadelphia, PA: SIAM, Society for Industrial and Applied Mathematics.
  • 25. References
33. Everitt, B. S. (2011). Cluster Analysis (5th ed., Wiley Series in Probability and Statistics). Chichester, West Sussex, United Kingdom: John Wiley & Sons. ISBN: 978-0-470-74991-3
34. Aggarwal, C. C., & Reddy, C. (2014). Data Clustering: Algorithms and Applications. CRC Press, Taylor & Francis Group. ISBN: 978-1-4665-5822-9
35. Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge: Cambridge University Press.
36. Fahad, A., Alshatri, N., Tari, Z., Alamri, A., Khalil, I., Zomaya, A. Y., . . . Bouras, A. (2014). A Survey of Clustering Algorithms for Big Data: Taxonomy and Empirical Analysis. IEEE Transactions on Emerging Topics in Computing, 2(3), 267-279. doi:10.1109/tetc.2014.2330519
40. Shirkhorshidi, A. S., Aghabozorgi, S., & Wah, T. Y. (2015). A Comparison Study on Similarity and Dissimilarity Measures in Clustering Continuous Data. PLoS ONE, 10(12). doi:10.1371/journal.pone.0144059
41. Kumar, V., Chhabra, J. K., & Kumar, D. (2013). Impact of Distance Measures on the Performance of Clustering Algorithms. In Intelligent Computing, Networking, and Informatics (Advances in Intelligent Systems and Computing, pp. 183-190). doi:10.1007/978-81-322-1665-0_17
42. Selvi, C., & Sivasankar, E. (2018). A novel similarity measure towards effective recommendation using Matusita coefficient for Collaborative Filtering in a sparse dataset. Sādhanā, 43(12). doi:10.1007/s12046-018-0970-3