Ability Study of Proximity Measure for Big Data Mining Context on Clustering
Kamlesh Kumar Pandey
Research Scholar
Dept. of Computer Science & Applications
Dr. HariSingh Gour Vishwavidyalaya (A Central University), Sagar, M.P.
E-mail: kamleshamk@gmail.com
2nd International Conference on Communication and Computational Technologies
(Paper ID: 16)
Paper Presentation
Content
• Objectives
• Big Data
• Big Data Mining
• Proximity Measures Enabled Clustering Taxonomy
• Proximity Measure Taxonomy
• Analysis of Proximity Measure for Big Data Mining
Objectives
• The objective of this study is to identify proximity measures for big data
clustering with respect to volume, variety, and velocity, and to present how
clusters are created with the help of a proximity measure under the partitioning,
hierarchical, density, grid, model, fuzzy, and graph-based clustering taxonomies.
Big Data
• At present, technology is growing very fast. Organizations, industries, and individuals
are moving toward the Internet of Things, cloud computing, wireless sensor networks, social
media, and the Internet. These sources generate data that grow by terabytes or petabytes
every second, minute, or hour.
• Diebold (2000) was among the first authors to discuss the term Big Data in a research
paper. These early authors took Big Data to mean any data set larger than a gigabyte.
• Doug Laney (2001) was the first to give a proper definition of Big Data. He identified
three characteristics, Volume, Variety, and Velocity, known as the 3 V's of Big Data
management. If traditional data meet at least two of these basic characteristics at a
time, the data come under Big Data.
• Gartner (2012): "Big data is high-volume, high-velocity and high-variety information
assets that demand cost-effective, innovative forms of information processing for
enhanced insight and decision making."
Big Data V's
• At present, seven V's are used to characterize Big Data. The first three, Volume,
Variety, and Velocity, are the main characteristics of big data. The remaining four,
Veracity, Variability, Value, and Visualization, depend on the organization and application.
Big Data Mining
• Big Data Mining is the process of fetching requested information, uncovering
hidden relationships or patterns, and extracting needed information or
knowledge from datasets that meet the three V's of Big Data with higher
complexity.
Clustering
• Clustering is one of the approaches for analyzing and discovering complex
relations, patterns, and structure in the form of underlying groups of
unlabeled objects. From the Big Data perspective, a clustering algorithm must
deal with high volume, high variety, and high velocity in a scalable way.
Clustering Taxonomy
• Partitioning-based Clustering: These methods construct clusters around centers for a chosen
number k of clusters. A proximity measure is used to find the nearest cluster center during
cluster creation.
• Hierarchical-based Clustering: In this approach, large data are organized in a hierarchical
manner through proximity, which makes relationships between data points easy to detect.
• Density-based Clustering: These methods scan spatial databases and use probability-
distribution-based distance measures and distance measurements to label points as core,
border, or noise for density clusters.
• Grid-based Clustering: These algorithms split the data space into a grid structure,
calculate the cell densities, and merge grid cells into clusters with the help of a
distance measure.
• Model-based Clustering: These methods fit the dataset to a mathematical model based on a
mixture of probability distributions; distance measures are also used to estimate the
parameters of the selected model.
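To make the role of the proximity measure concrete, the following minimal Python sketch (not from the paper; helper names such as `assign_to_centers` are illustrative) shows a partitioning-style assignment step that uses the Euclidean distance to attach each point to its nearest cluster center:

```python
import math

def euclidean(a, b):
    # Square root of the summed squared coordinate differences (Eq. 3 below)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign_to_centers(points, centers):
    # Partitioning-style step: each point joins the cluster of its nearest center
    return [min(range(len(centers)), key=lambda k: euclidean(p, centers[k]))
            for p in points]

labels = assign_to_centers([(0, 0), (1, 1), (9, 9)], [(0, 0), (10, 10)])
# labels -> [0, 0, 1]
```

Hierarchical, density, and grid methods use the same kind of pairwise distance call, only embedded in a different control structure.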
Proximity Measure Taxonomy
At present, various proximity measures are available for cluster construction.
These proximity measures fall into the following categories:
• Minkowski
• L(1)
• L(2)
• Inner product
• Shannon’s entropy
• Combination
• Intersection
Minkowski family Proximity Measures
• $dis(A,B) = \left( \sum_{i=1}^{n} |A_i - B_i|^p \right)^{1/p}$ (Eq. 1)
• $dis_{manhattan}(A,B) = \sum_{i=1}^{n} |A_i - B_i|$ (Eq. 2) p = 1; constructs hyper-rectangular-shaped clusters.
• $dis_{euclidean}(A,B) = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2}$ (Eq. 3) p = 2; constructs compact or isolated clusters.
• $dis_{minkowski}(A,B) = \left( \sum_{i=1}^{n} |A_i - B_i|^p \right)^{1/p}$ (Eq. 4) p = 2 to ∞; constructs isolated or compact clusters.
• $dis_{chebyshev}(A,B) = \lim_{p \to \infty} \left( \sum_{i=1}^{n} |A_i - B_i|^p \right)^{1/p} = \max_{i=1,\dots,n} |A_i - B_i|$ (Eq. 5)
p → ∞. This distance measure is used when the distance between two data points is dominated by
their greatest absolute difference along any single data dimension.
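The Minkowski-family formulas above can be sketched in Python (a minimal sketch, not taken from the paper; function names are illustrative):

```python
import math

def minkowski(a, b, p):
    # Eq. 1 / Eq. 4: (sum |Ai - Bi|^p)^(1/p); p=1 is Manhattan, p=2 is Euclidean
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def chebyshev(a, b):
    # Eq. 5: the limit p -> infinity reduces to the maximum absolute difference
    return max(abs(x - y) for x, y in zip(a, b))

A, B = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
manhattan = minkowski(A, B, 1)   # 3 + 4 + 0 = 7.0
euclidean = minkowski(A, B, 2)   # sqrt(9 + 16 + 0) = 5.0
cheby = chebyshev(A, B)          # max(3, 4, 0) = 4.0
```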
L(1) (Manhattan) family Proximity Measures
The Manhattan distance measure faces two difficulties with respect to the distance value:
the first is normalization of the distance value, and the second is distinguishing small
from large distances.
• $dis_{sorensen}(A,B) = \frac{\sum_{i=1}^{n} |A_i - B_i|}{\sum_{i=1}^{n} (A_i + B_i)}$ (Eq. 6) normalizes the distance value to between 0 and 1.
• $dis_{soergel}(A,B) = \frac{\sum_{i=1}^{n} |A_i - B_i|}{\sum_{i=1}^{n} \max(A_i, B_i)}$ (Eq. 7) chooses the max coefficient of each data pair.
• $dis_{kulczynski}(A,B) = \frac{\sum_{i=1}^{n} |A_i - B_i|}{\sum_{i=1}^{n} \min(A_i, B_i)}$ (Eq. 8) chooses the min coefficient of each data pair.
• $dis_{motyka}(A,B) = \frac{\sum_{i=1}^{n} \max(A_i, B_i)}{\sum_{i=1}^{n} (A_i + B_i)}$ (Eq. 9) takes the max data point of each pair.
• $dis_{canberra}(A,B) = \sum_{i=1}^{n} \frac{|A_i - B_i|}{A_i + B_i}$ (Eq. 10) absolute difference of the individual data pairs, normalized per dimension.
• $dis_{lorentzian}(A,B) = \sum_{i=1}^{n} \ln(1 + |A_i - B_i|)$ (Eq. 11) normalizes the distance value via the natural logarithm.
• $dis_{wavehedge}(A,B) = \sum_{i=1}^{n} \frac{|A_i - B_i|}{\max(A_i, B_i)}$ (Eq. 12) normalizes the difference of each data
pair by its max value.
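Two representative L(1)-family measures can be sketched as follows (a minimal sketch, not from the paper; function names are illustrative, and non-negative inputs are assumed so that the denominators stay positive):

```python
def sorensen(a, b):
    # Eq. 6: sum of absolute differences over sum of pairwise sums;
    # for non-negative data the result lies between 0 and 1
    return sum(abs(x - y) for x, y in zip(a, b)) / sum(x + y for x, y in zip(a, b))

def canberra(a, b):
    # Eq. 10: per-dimension absolute difference normalized by the pair's sum;
    # zero-sum pairs are skipped to avoid division by zero
    return sum(abs(x - y) / (x + y) for x, y in zip(a, b) if x + y != 0)

A, B = [1.0, 3.0], [2.0, 3.0]
# sorensen(A, B) = (1 + 0) / (3 + 6) = 1/9
# canberra(A, B) = 1/3 + 0 = 1/3
```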
L(2) or χ² (Euclidean) family Proximity Measures
The L(2) family members are based on the Euclidean distance and give the distance value
after normalization.
• $dis_{matusita}(A,B) = \sqrt{\sum_{i=1}^{n} (\sqrt{A_i} - \sqrt{B_i})^2}$ (Eq. 13)
• $dis_{clark}(A,B) = \sqrt{\sum_{i=1}^{n} \left( \frac{|A_i - B_i|}{A_i + B_i} \right)^2}$ (Eq. 14)
• $dis_{divergence}(A,B) = 2 \sum_{i=1}^{n} \frac{(A_i - B_i)^2}{(A_i + B_i)^2}$ (Eq. 15)
• $dis_{squared\_euclidean}(A,B) = \sum_{i=1}^{n} (A_i - B_i)^2$ (Eq. 16)
• $dis_{squared\_chi}(A,B) = \sum_{i=1}^{n} \frac{(A_i - B_i)^2}{A_i + B_i}$ (Eq. 17)
• $dis_{pearson\_chi}(A,B) = \sum_{i=1}^{n} \frac{(A_i - B_i)^2}{B_i}$ (Eq. 18)
• $dis_{neyman\_chi}(A,B) = \sum_{i=1}^{n} \frac{(A_i - B_i)^2}{A_i}$ (Eq. 19)
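Two L(2)-family members can be sketched in Python (a minimal sketch, not from the paper; function names are illustrative, and non-negative inputs are assumed for the square roots):

```python
import math

def matusita(a, b):
    # Eq. 13: Euclidean distance between the square-rooted vectors
    return math.sqrt(sum((math.sqrt(x) - math.sqrt(y)) ** 2 for x, y in zip(a, b)))

def squared_chi(a, b):
    # Eq. 17: squared difference normalized by the pair's sum;
    # zero-sum pairs are skipped to avoid division by zero
    return sum((x - y) ** 2 / (x + y) for x, y in zip(a, b) if x + y != 0)

A, B = [1.0, 4.0], [4.0, 1.0]
# matusita(A, B) = sqrt((1-2)^2 + (2-1)^2) = sqrt(2)
# squared_chi(A, B) = 9/5 + 9/5 = 3.6
```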
Inner product family Proximity Measures
Inner-product measures compute the distance on the basis of products of paired real-valued
data. The base formulation is shown as Eq. 20, and its normalized variants are shown as
Eq. 21 to Eq. 24.
• $dis_{inner\_product}(A,B) = \sum_{i=1}^{n} A_i B_i$ (Eq. 20)
• $dis_{cosine}(A,B) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}}$ (Eq. 21)
• $dis_{jaccard}(A,B) = 1 - \frac{\sum_{i=1}^{n} A_i B_i}{\sum_{i=1}^{n} A_i^2 + \sum_{i=1}^{n} B_i^2 - \sum_{i=1}^{n} A_i B_i}$ (Eq. 22)
• $dis_{dice}(A,B) = 1 - \frac{2 \sum_{i=1}^{n} A_i B_i}{\sum_{i=1}^{n} A_i^2 + \sum_{i=1}^{n} B_i^2}$ (Eq. 23)
• $dis_{harmonic\_mean}(A,B) = 2 \sum_{i=1}^{n} \frac{A_i B_i}{A_i + B_i}$ (Eq. 24)
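The cosine and Jaccard measures above can be sketched as follows (a minimal sketch, not from the paper; function names are illustrative, and nonzero vectors are assumed so the denominators stay positive):

```python
import math

def cosine_similarity(a, b):
    # Eq. 21: inner product normalized by the vectors' Euclidean norms
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def jaccard_distance(a, b):
    # Eq. 22: 1 - (sum AiBi) / (sum Ai^2 + sum Bi^2 - sum AiBi)
    ab = sum(x * y for x, y in zip(a, b))
    return 1 - ab / (sum(x * x for x in a) + sum(y * y for y in b) - ab)

# Identical vectors: cosine similarity 1, Jaccard distance 0
assert abs(cosine_similarity([1.0, 2.0], [1.0, 2.0]) - 1.0) < 1e-9
assert abs(jaccard_distance([1.0, 2.0], [1.0, 2.0])) < 1e-9
```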
Shannon's entropy family Proximity Measures
This family is based on probabilistic uncertainty (entropy). All data points $A_i$ and
$B_i$ must be non-negative and each vector must sum to 1, i.e., A and B are probability
distributions. The natural logarithm is used in the formulas below; taking logarithms in
base 2 instead expresses the distances in bits.
• $dis_{kullback\_leibler}(A,B) = \sum_{i=1}^{n} A_i \ln\frac{A_i}{B_i}$ (Eq. 25)
• $dis_{jeffreys}(A,B) = \sum_{i=1}^{n} (A_i - B_i) \ln\frac{A_i}{B_i}$ (Eq. 26)
• $dis_{k\_divergence}(A,B) = \sum_{i=1}^{n} A_i \ln\frac{2A_i}{A_i + B_i}$ (Eq. 27)
• $dis_{topsoe}(A,B) = \sum_{i=1}^{n} \left( A_i \ln\frac{2A_i}{A_i + B_i} + B_i \ln\frac{2B_i}{A_i + B_i} \right)$ (Eq. 28)
• $dis_{jensen\_shannon}(A,B) = \frac{1}{2} \left[ \sum_{i=1}^{n} A_i \ln\frac{2A_i}{A_i + B_i} + \sum_{i=1}^{n} B_i \ln\frac{2B_i}{A_i + B_i} \right]$ (Eq. 29)
• $dis_{jensen\_difference}(A,B) = \sum_{i=1}^{n} \left[ \frac{A_i \ln A_i + B_i \ln B_i}{2} - \frac{A_i + B_i}{2} \ln\frac{A_i + B_i}{2} \right]$ (Eq. 30)
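The Kullback-Leibler and Jensen-Shannon measures can be sketched as follows (a minimal sketch, not from the paper; function names are illustrative, and both inputs are assumed to be probability vectors as the family requires):

```python
import math

def kullback_leibler(a, b):
    # Eq. 25: sum Ai ln(Ai/Bi); zero Ai terms contribute nothing, so skip them
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

def jensen_shannon(a, b):
    # Eq. 29: half the sum of both directed divergences toward the midpoint (Ai+Bi)/2
    return 0.5 * (sum(x * math.log(2 * x / (x + y)) for x, y in zip(a, b) if x > 0)
                  + sum(y * math.log(2 * y / (x + y)) for x, y in zip(a, b) if y > 0))

P = [0.5, 0.5]
assert kullback_leibler(P, P) == 0.0   # identical distributions are at distance 0
assert jensen_shannon(P, P) == 0.0
```

Unlike KL divergence, the Jensen-Shannon measure is symmetric and stays finite even when some $B_i$ are zero, which is one reason it is often preferred for clustering probability data.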
Combination family Proximity Measures
Members of this family define a distance measure on the basis of a combination of two or
more distance measures.
• $dis_{taneja}(A,B) = \sum_{i=1}^{n} \frac{A_i + B_i}{2} \ln\frac{A_i + B_i}{2\sqrt{A_i B_i}}$ (Eq. 31)
• $dis_{kumar\_johnson}(A,B) = \sum_{i=1}^{n} \frac{(A_i^2 - B_i^2)^2}{2 (A_i B_i)^{3/2}}$ (Eq. 32)
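The Taneja measure (Eq. 31), which combines the arithmetic and geometric means of each data pair, can be sketched as follows (a minimal sketch, not from the paper; the function name is illustrative, and strictly positive inputs are assumed for the logarithm and square root):

```python
import math

def taneja(a, b):
    # Eq. 31: ((Ai+Bi)/2) * ln((Ai+Bi) / (2*sqrt(Ai*Bi))), summed over dimensions;
    # the log argument is the ratio of the pair's arithmetic to geometric mean
    return sum(((x + y) / 2) * math.log((x + y) / (2 * math.sqrt(x * y)))
               for x, y in zip(a, b) if x > 0 and y > 0)

# Identical vectors give 0; differing vectors give a positive value
assert taneja([0.5, 0.5], [0.5, 0.5]) == 0.0
assert taneja([0.7, 0.3], [0.3, 0.7]) > 0.0
```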
Intersection family Proximity Measures
Members of this family define a distance measure on the basis of the intersection
(overlap) between data points. The base intersection measure is shown as Eq. 33 and
its variants as Eq. 34 to Eq. 36.
• $dis_{intersection}(A,B) = \sum_{i=1}^{n} \min(A_i, B_i)$ (Eq. 33)
• $dis_{wavehedges}(A,B) = \sum_{i=1}^{n} \frac{|A_i - B_i|}{\max(A_i, B_i)}$ (Eq. 34)
• $dis_{ruzicka}(A,B) = \frac{\sum_{i=1}^{n} \min(A_i, B_i)}{\sum_{i=1}^{n} \max(A_i, B_i)}$ (Eq. 35)
• $dis_{tanimoto}(A,B) = \frac{\sum_{i=1}^{n} \max(A_i, B_i) - \sum_{i=1}^{n} \min(A_i, B_i)}{\sum_{i=1}^{n} \max(A_i, B_i)}$ (Eq. 36)
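The Ruzicka similarity and its complementary Tanimoto distance can be sketched as follows (a minimal sketch, not from the paper; function names are illustrative, and non-negative inputs with at least one positive component are assumed):

```python
def ruzicka(a, b):
    # Eq. 35 similarity: sum of per-pair minima over sum of per-pair maxima
    return sum(min(x, y) for x, y in zip(a, b)) / sum(max(x, y) for x, y in zip(a, b))

def tanimoto(a, b):
    # Eq. 36 distance: (sum max - sum min) / sum max, i.e. 1 - ruzicka
    mx = sum(max(x, y) for x, y in zip(a, b))
    mn = sum(min(x, y) for x, y in zip(a, b))
    return (mx - mn) / mx

A, B = [1.0, 2.0], [2.0, 1.0]
# min sum = 2, max sum = 4, so ruzicka = 0.5 and tanimoto = 0.5
```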
Analysis of Proximity Measure for Big Data Mining
Fahad et al. (2014) and Pandove et al. (2015) describe clustering-algorithm criteria on
the basis of the Volume, Velocity, and Variety dimensions of big data.
• Volume-related criteria: the clustering must deal with huge dataset sizes, high
dimensionality, and noisy data.
• Variety-related criteria: the clustering must recognize the dataset's data types and
the clusters' shapes.
• Velocity-related criteria: the complexity, scalability, and performance of the
clustering algorithm during execution on real datasets.
This paper takes Volume as the requirement that a clustering algorithm handle large
datasets, Variety as the data types handled, continuous (numerical) and categorical
(nominal and binary), and Velocity as time complexity, in order to identify big-data-
capable distance measures.
Distance measure | Type of function | Volume (High Data Set) | Variety (Type of Data) | Velocity (Time Complexity)
Minkowski family
Eq. 2  | Distance    | Large  | Continuous  | O(n)
Eq. 3  | Distance    | Large  | Continuous  | O(n)
Eq. 4  | Distance    | Large  | Continuous  | O(n)
Eq. 5  | Distance    | Large  | Continuous  | O(n)
L(1) family
Eq. 6  | Semi-metric | Large  | Categorical | O(2n)
Eq. 7  | Distance    | Large  | Continuous  | O(n)
Eq. 8  | Semi-metric | Medium | Categorical | O(n)
Eq. 9  | Similarity  | Medium | Categorical | O(n)
Eq. 10 | Distance    | Large  | Continuous  | O(n)
Eq. 11 | Distance    | Large  | Continuous  | O(n)
Eq. 12 | Distance    | Large  | Continuous  | O(n)
L(2) or χ² family
Eq. 13 | Distance    | Large  | Continuous  | O(n)
Eq. 14 | Distance    | Large  | Continuous  | O(n)
Eq. 15 | Semi-metric | Large  | Categorical | O(n)
Eq. 16 | Semi-metric | Large  | Categorical | O(n)
Eq. 17 | Semi-metric | Large  | Categorical | O(2n)
Eq. 18 | Semi-metric | Large  | Categorical | O(2n)
Eq. 19 | Semi-metric | Large  | Categorical | O(2n)
Inner product family
Eq. 20 | Similarity  | Medium | Categorical | O(3n)
Eq. 21 | Similarity  | Medium | Categorical | O(3n)
Eq. 22 | Semi-metric | Large  | Categorical | O(2n)
Eq. 23 | Semi-metric | Large  | Categorical | O(2n)
Shannon's entropy family
Eq. 25 | Similarity  | Medium | Categorical | O(n)
Eq. 26 | Semi-metric | Large  | Categorical | O(n)
Eq. 27 | Similarity  | Medium | Categorical | O(n)
Eq. 28 | Semi-metric | Large  | Categorical | O(n)
Eq. 29 | Semi-metric | Large  | Categorical | O(n)
Eq. 30 | Semi-metric | Large  | Categorical | O(n)
Combination family
Eq. 31 | Semi-metric | Large  | Categorical | O(2n)
Eq. 32 | Semi-metric | Large  | Categorical | O(2n)
Intersection family
Eq. 33 | Similarity  | Medium | Categorical | O(2n)
Eq. 34 | Distance    | Large  | Continuous  | O(2n)
Conclusions
This paper analyzed the 34 studied proximity measures against the big data mining
characteristics of Volume (dataset size), Variety (dataset type), and Velocity (time
complexity), and finds that Eq. 2 Manhattan, Eq. 3 Euclidean, Eq. 4 Minkowski, Eq. 5
Chebyshev, Eq. 7 Soergel, Eq. 10 Canberra, Eq. 11 Lorentzian, Eq. 12 Wave Hedges,
Eq. 13 Matusita, Eq. 14 Clark, and Eq. 34 Wave Hedges are the most scalable for big
data from the theoretical, practical, and existing-research perspectives.
References
1. Rouhani, S., Rotbei, S., & Hamidi, H. (2017). What do we know about the big data researches? A systematic review from 2011 to 2017.
Journal of Decision Systems, 26(4), 368-393. doi:10.1080/12460125.2018.1437654
2. Jin, X., Wah, B. W., Cheng, X., & Wang, Y. (2015). Significance and Challenges of Big Data Research. Big Data Research, 2(2), 59-64.
doi:10.1016/j.bdr.2015.01.006
3. Chen, M., Mao, S., & Liu, Y. (2014). Big Data: A Survey. Mobile Networks and Applications, 19(2), 171-209. doi:10.1007/s11036-013-
0489-0
4. Chen, W., Oliverio, J., Kim, J. H., & Shen, J. (2018). The Modeling and Simulation of Data Clustering Algorithms in Data Mining with
Big Data. Journal of Industrial Integration and Management, 1850017. doi:10.1142/s2424862218500173
5. Zhao, X., Liang, J., & Dang, C. (2019). A stratified sampling based clustering algorithm for large-scale data. Knowledge-Based Systems,
163, 416-428. doi:10.1016/j.knosys.2018.09.007.
6. Pandove, D., & Goel, S. (2015). A comprehensive study on clustering approaches for big data mining. In Proceedings of IEEE 2nd
International Conference on Electronics and Communication Systems (pp. 1333-1338). IEEE Xplore Digital Library.
doi:10.1109/ecs.2015.7124801
7. Chen, C. P., & Zhang, C. (2014). Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information
Sciences, 275, 314-347. doi:10.1016/j.ins.2014.01.015
8. Amado, A., Cortez, P., Rita, P., & Moro, S. (2018). Research trends on Big Data in Marketing: A text mining and topic modeling based
literature analysis. European Research on Management and Business Economics, 24(1), 1-7. doi:10.1016/j.iedeen.2017.06.002
9. Lee, I. (2017). Big data: Dimensions, evolution, impacts, and challenges. Business Horizons, 60(3), 293-303.
doi:10.1016/j.bushor.2017.01.004
10. Gandomi, A., & Haider, M. (2015). Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information
Management, 35(2), 137-144. doi:10.1016/j.ijinfomgt.2014.10.007
11. Sivarajah, U., Kamal, M. M., Irani, Z., & Weerakkody, V. (2017). Critical analysis of Big Data challenges and analytical methods.
Journal of Business Research, 70, 263-286. doi:10.1016/j.jbusres.2016.08.001
12. Bendechache, M., Tari, A., & Kechadi, M. (2018). Parallel and distributed clustering framework for big spatial data mining. International
Journal of Parallel, Emergent and Distributed Systems, 1-19. doi:10.1080/17445760.2018.1446210
13. Chen, M., Mao, S., & Liu, Y. (2014). Big Data: A Survey. Mobile Networks and Applications, 19(2), 171-209.
doi:10.1007/s11036-013-0489-0
14. Gole, S., & Tidke, B. (2015). A survey of Big Data in social media using data mining techniques. In Proceedings of IEEE ICACCS.
doi:10.1109/ICACCS.2015.7324059
15. Elgendy, N., & Elragal, A. (2014). Big Data Analytics: A Literature Review Paper. LNAI, 8557, 214-227.
doi:10.1007/978-3-319-08976-8_16
16. Cha, S. (2007). Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions. International
Journal of Mathematical Models and Methods in Applied Sciences, 1(4), 300-307.
17. Lin, Y., Jiang, J., & Lee, S. (2014). A Similarity Measure for Text Classification and Clustering. IEEE Transactions on Knowledge and
Data Engineering, 26(7), 1575-1590. doi:10.1109/tkde.2013.19
18. Tavakkol, B., Jeong, M. K., & Albin, S. L. (2017). Object-to-group probabilistic distance measure for uncertain data classification.
Neurocomputing, 230, 143-151. doi:10.1016/j.neucom.2016.12.007
19. Liu, H., Zhang, X., Zhang, X., & Cui, Y. (2017). Self-adapted mixture distance measure for clustering uncertain data. Knowledge-Based
Systems, 126, 33-47. doi:10.1016/j.knosys.2017.04.002
20. Weller-Fahy, D. J., Borghetti, B. J., & Sodemann, A. A. (2015). A Survey of Distance and Similarity Measures Used Within Network
Intrusion Anomaly Detection. IEEE Communications Surveys & Tutorials, 17(1), 70-91. doi:10.1109/comst.2014.2336610
21. Grant, J., & Hunter, A. (2017). Analysing inconsistent information using distance-based measures. International Journal of Approximate
Reasoning, 89, 3-26. doi:10.1016/j.ijar.2016.04.004
22. Merigó, J. M., Casanovas, M., & Zeng, S. (2014). Distance measures with heavy aggregation operators. Applied Mathematical Modelling,
38(13), 3142-3153. doi:10.1016/j.apm.2013.11.036
23. Ikonomakis, E. K., Spyrou, G. M., & Vrahatis, M. N. (2019). Content driven clustering algorithm combining density and distance
functions. Pattern Recognition, 87, 190-202. doi:10.1016/j.patcog.2018.10.007
24. Marcon, E., & Puech, F. (2017). A typology of distance-based measures of spatial concentration. Regional Science and Urban Economics,
62, 56-67. doi:10.1016/j.regsciurbeco.2016.10.004
25. Kocher, M., & Savoy, J. (2017). Distance measures in author profiling. Information Processing & Management, 53(5), 1103-1119.
doi:10.1016/j.ipm.2017.04.004
26. Moghtadaiee, V., & Dempster, A. G. (2015). Determining the best vector distance measure for use in location fingerprinting. Pervasive
and Mobile Computing, 23, 59-79. doi:10.1016/j.pmcj.2014.11.002
27. Chim, H., & Deng, X. (2008). Efficient Phrase-Based Document Similarity for Clustering. IEEE Transactions on Knowledge and Data
Engineering, 20(9), 1217-1229. doi:10.1109/tkde.2008.50
28. Wang, X., Yu, F., & Pedrycz, W. (2016). An area-based shape distance measure of time series. Applied Soft Computing, 48, 650-659.
doi:10.1016/j.asoc.2016.06.033
29. Ramya, R., & Sasikala, T. (2018). A comparative analysis of similarity distance measure functions for biocryptic authentication in cloud
databases. Cluster Computing. doi:10.1007/s10586-017-1568-y
30. Abudalfa, S. I., & Mikki, M. (2013). K-means algorithm with a novel distance measure. Turkish Journal Of Electrical Engineering &
Computer Sciences, 21, 1665-1684. doi:10.3906/elk-1010-869
31. Nadler, M., & Smith, E. P. (1993). Pattern recognition engineering. New York: John Wiley & Sons. ISBN-13: 978-0471622932
32. Gan, G., Ma, C., & Wu, J. (2007). Data clustering: Theory, algorithms, and applications. Philadelphia, PA: SIAM, Society for Industrial
and Applied Mathematics.
33. Everitt, B. S. (2011). Cluster Analysis (5th ed., Wiley Series in Probability and Statistics). Chichester, West Sussex, United
Kingdom: John Wiley & Sons. ISBN 978-0-470-74991-3
34. Aggarwal, C. C., & Reddy, C. (2014). Data Clustering Algorithms and Applications. CRC Press Taylor & Francis Group.ISBN 978-1-
4665-5822-9
35. Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge: Cambridge University Press.
36. Fahad, A., Alshatri, N., Tari, Z., Alamri, A., Khalil, I., Zomaya, A. Y., . . . Bouras, A. (2014). A Survey of Clustering Algorithms for Big
Data: Taxonomy and Empirical Analysis. IEEE Transactions on Emerging Topics in Computing, 2(3), 267-279.
doi:10.1109/tetc.2014.2330519
40. Shirkhorshidi, A. S., Aghabozorgi, S., & Wah, T. Y. (2015). A Comparison Study on Similarity and Dissimilarity Measures in Clustering
Continuous Data. PLoS ONE, 10(12), doi:10.1371/journal.pone.0144059
41. Kumar, V., Chhabra, J. K., & Kumar, D. (2013). Impact of Distance Measures on the Performance of Clustering Algorithms. Intelligent
Computing, Networking, and Informatics Advances in Intelligent Systems and Computing, 183-190. doi:10.1007/978-81-322-1665-0_17
42. Selvi, C., & Sivasankar, E. (2018). A novel similarity measure towards effective recommendation using Matusita coefficient for
Collaborative Filtering in a sparse dataset. Sādhanā, 43(12). doi:10.1007/s12046-018-0970-3
Ability Study of Proximity Measure for Big Data Mining Context on Clustering

More Related Content

PPTX
Types of clustering and different types of clustering algorithms
PDF
PPTX
Handling noisy data
DOC
K-MEDOIDS CLUSTERING USING PARTITIONING AROUND MEDOIDS FOR PERFORMING FACE R...
PDF
The International Journal of Engineering and Science (The IJES)
PDF
Finding Relationships between the Our-NIR Cluster Results
PPTX
Quantum persistent k cores for community detection
PPTX
Cluster Validation
Types of clustering and different types of clustering algorithms
Handling noisy data
K-MEDOIDS CLUSTERING USING PARTITIONING AROUND MEDOIDS FOR PERFORMING FACE R...
The International Journal of Engineering and Science (The IJES)
Finding Relationships between the Our-NIR Cluster Results
Quantum persistent k cores for community detection
Cluster Validation

What's hot (20)

PPT
DATA MINING:Clustering Types
PPTX
"Principal Component Analysis - the original paper" presentation @ Papers We ...
PDF
Hierarchical Clustering
PDF
An_Accelerated_Nearest_Neighbor_Search_Method_for_the_K-Means_Clustering_Algo...
PPTX
Cluster Analysis
PDF
Paper id 26201483
PPTX
Graph based approaches to Gene Expression Clustering
PPTX
Cluster analysis
 
PPTX
Introduction to Linear Discriminant Analysis
PPTX
master defense hyun-wong choi_2019_05_14_rev19
PPTX
defense hyun-wong choi_2019_05_14_rev18
PPTX
master defense hyun-wong choi_2019_05_14_rev19
PDF
A Correlative Information-Theoretic Measure for Image Similarity
PPTX
Final edited master defense-hyun_wong choi_2019_05_23_rev21
PDF
Fractal Image Compression By Range Block Classification
PDF
08 distributed optimization
PPTX
L4 cluster analysis NWU 4.3 Graphics Course
PDF
MK-Prototypes: A Novel Algorithm for Clustering Mixed Type Data
PDF
Unsupervised learning clustering
PDF
07 dimensionality reduction
DATA MINING:Clustering Types
"Principal Component Analysis - the original paper" presentation @ Papers We ...
Hierarchical Clustering
An_Accelerated_Nearest_Neighbor_Search_Method_for_the_K-Means_Clustering_Algo...
Cluster Analysis
Paper id 26201483
Graph based approaches to Gene Expression Clustering
Cluster analysis
 
Introduction to Linear Discriminant Analysis
master defense hyun-wong choi_2019_05_14_rev19
defense hyun-wong choi_2019_05_14_rev18
master defense hyun-wong choi_2019_05_14_rev19
A Correlative Information-Theoretic Measure for Image Similarity
Final edited master defense-hyun_wong choi_2019_05_23_rev21
Fractal Image Compression By Range Block Classification
08 distributed optimization
L4 cluster analysis NWU 4.3 Graphics Course
MK-Prototypes: A Novel Algorithm for Clustering Mixed Type Data
Unsupervised learning clustering
07 dimensionality reduction
Ad

Similar to Ability Study of Proximity Measure for Big Data Mining Context on Clustering (20)

PPTX
03 Data Mining Techniques
PPTX
Cluster analysis
PPT
4 DM Clustering ifor computerscience.ppt
PPTX
Cluster Analysis
PPTX
A Comprehensive Study of Clustering Algorithms for Big Data Mining with MapRe...
PPTX
PDF
Ir3116271633
PPTX
Lect4 principal component analysis-I
PDF
Module - 5 Machine Learning-22ISE62.pdf
PPT
26-Clustering MTech-2017.ppt
PPT
DM UNIT_4 PPT for btech final year students
PDF
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
PDF
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
PPTX
machine learning - Clustering in R
PDF
Ensemble based Distributed K-Modes Clustering
PDF
CSA 3702 machine learning module 3
PPTX
MODULE 4_ CLUSTERING.pptx
PDF
SindyAutoEncoder: Interpretable Latent Dynamics via Sparse Identification
PDF
Histogram-Based Method for Effective Initialization of the K-Means Clustering...
PPTX
Could a Data Science Program use Data Science Insights?
03 Data Mining Techniques
Cluster analysis
4 DM Clustering ifor computerscience.ppt
Cluster Analysis
A Comprehensive Study of Clustering Algorithms for Big Data Mining with MapRe...
Ir3116271633
Lect4 principal component analysis-I
Module - 5 Machine Learning-22ISE62.pdf
26-Clustering MTech-2017.ppt
DM UNIT_4 PPT for btech final year students
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
machine learning - Clustering in R
Ensemble based Distributed K-Modes Clustering
CSA 3702 machine learning module 3
MODULE 4_ CLUSTERING.pptx
SindyAutoEncoder: Interpretable Latent Dynamics via Sparse Identification
Histogram-Based Method for Effective Initialization of the K-Means Clustering...
Could a Data Science Program use Data Science Insights?
Ad

Recently uploaded (20)

PDF
Fluorescence-microscope_Botany_detailed content
PPTX
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
PPTX
IBA_Chapter_11_Slides_Final_Accessible.pptx
PDF
Business Analytics and business intelligence.pdf
PPTX
Business Acumen Training GuidePresentation.pptx
PDF
Mega Projects Data Mega Projects Data
PDF
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
PDF
Foundation of Data Science unit number two notes
PDF
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf
PPTX
oil_refinery_comprehensive_20250804084928 (1).pptx
PDF
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
PDF
.pdf is not working space design for the following data for the following dat...
PPTX
Introduction to machine learning and Linear Models
PPTX
1_Introduction to advance data techniques.pptx
PPTX
Supervised vs unsupervised machine learning algorithms
PPTX
mbdjdhjjodule 5-1 rhfhhfjtjjhafbrhfnfbbfnb
PPTX
Computer network topology notes for revision
PPTX
Data_Analytics_and_PowerBI_Presentation.pptx
PPTX
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
PPTX
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
Fluorescence-microscope_Botany_detailed content
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
IBA_Chapter_11_Slides_Final_Accessible.pptx
Business Analytics and business intelligence.pdf
Business Acumen Training GuidePresentation.pptx
Mega Projects Data Mega Projects Data
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
Foundation of Data Science unit number two notes
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf
oil_refinery_comprehensive_20250804084928 (1).pptx
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
.pdf is not working space design for the following data for the following dat...
Introduction to machine learning and Linear Models
1_Introduction to advance data techniques.pptx
Supervised vs unsupervised machine learning algorithms
mbdjdhjjodule 5-1 rhfhhfjtjjhafbrhfnfbbfnb
Computer network topology notes for revision
Data_Analytics_and_PowerBI_Presentation.pptx
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg

Ability Study of Proximity Measure for Big Data Mining Context on Clustering

  • 1. Ability Study of Proximity Measure for Big Data Mining Context on Clustering Kamlesh Kumar Pandey Research Scholar Dept. of Computer Science & Applications Dr. HariSingh Gour Vishwavidyalaya (A Central University), Sagar, M.P. E-mail: kamleshamk@gmail.com 2nd International Conference on Communication and Computational Technologies (Paper ID : 16) Paper Presentation on
  • 2. Content • Objectives • Big Data • Big Data Mining • Proximity Measures Enabled Clustering Taxonomy • Proximity Measure Taxonomy • Analysis of Proximity Measure for Big Data Mining
  • 3. Objectives • The objective of this study is identifying a proximity measures for big data clustering respect to volume, variety, and velocity and presents how to create a cluster with the help of a proximity measure under the partition, hierarchical, density, grid, model, fuzzy and graph based cluster taxonomy.
  • 4. Big Data • Present time technology is growing very fast. Every originations, industries or person moving towards Internet of things, cloud computing, warless sensor networks, social media, internet. These sources generated a data growing fast in per second, minutes or per hour in size of Terabytes or Petabytes . • Diebold et Al. (2000) is a first writer who discussed the word Big Data in his research paper. All of these authors define Big Data there means if the data set is large then gigabyte then these type of data set is known as Big Data. • Doug Laney et al (2001) was the first person who gave a proper definition for Big Data. He gave three characteristics Volume, Variety, and Velocity of Big Data and these characteristics known as 3 V’s of Big Data Management. If traditional data have met two basic characteristic at a time these data are come to under Big data. • Gartner (2012), “Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making”
  • 5. Big Data V’s • In present time seven V’s used for Big Data where the first three V’s Volume, Variety, and Velocity are the main characteristics of big data. In addition to Veracity, Variability, Value, and Visualization are depending on the organization.
  • 6. Big Data Mining • Big Data Mining fetching on the requested information, uncovering hidden relationship or patterns or extracting for the needed information or knowledge from a dataset these datasets have to meet three V’s of Big Data with higher complexity.
  • 7. Clustering • Clustering is the one of the approaches for analysis and discovering the complex relation, pattern, and data in the form of underlying groups for the unlabeled object and Big Data perspective, the clustering algorithm must be deal high volume, high variety and high velocity with scalability.
  • 8. Clustering Taxonomy • Partitioning based Clustering: These clustering methods constructs the clusters on the bases of center in the choice of k number of clusters. This clustering method used proximity measures as finding out the center of the cluster creation. • Hierarchical based Clustering: In this approach, large data are organized in a hierarchical manner based on the medium of proximity and its detect on easily relationship between data points. • Density Based Clustering: This clustering method scans the spatial databases and used for probability distribution based distance measure and distance measurements for creating core, border and noise point for density cluster. • Grid-Based Clustering: This clustering algorithm splits the data space into a grid structure, calculate the cell density, calculate the grid structure with the help of distance measure. • Model-Based Clustering: This clustering method optimizes the dataset into the mathematical model based on the mixing of probability distributions based distance measure and distance measure are also measuring the parameter of the selected model.
  • 9. Proximity Measure Taxonomy In present time various proximity measure is available for cluster construction and these proximity measure categories under • Minkowski • L(1) • L(2) • Inner product • Shannon’s entropy • Combination • Intersection
  • 10. Minkowski family Proximity Measures • 𝑑𝑖𝑠 𝐴, 𝐵 = 𝑖=1 𝑛 |𝐴𝑖 − 𝐵𝑖| 𝑃 1 𝑃 (Eq.1) • 𝑑𝑖𝑠manhattan 𝐴, 𝐵 = 𝑖=1 𝑛 |𝐴𝑖 − 𝐵𝑖| 𝑃 (Eq.2) p=1, construct hyper-rectangular shape cluster. • 𝑑𝑖𝑠euclidean 𝐴, 𝐵 = 2 𝑖=1 𝑛 |𝐴𝑖 − 𝐵𝑖|2 (Eq.3) p=2, construct compact or isolated cluster. • 𝑑𝑖𝑠minkowski 𝐴, 𝐵 = 𝑝 𝑖=1 𝑛 |𝐴𝑖 − 𝐵𝑖| 𝑝 (Eq.4) p= 2 to ∞, construct isolated or compacted cluster. • 𝑑𝑖𝑠chebyshev 𝐴, 𝐵 = ∞ 𝑖=1 𝑛 |𝐴𝑖 − 𝐵𝑖|∞ = 𝑚𝑎𝑥𝑖=1 𝑛 |𝐴𝑖 − 𝐵𝑖| 𝑃 (Eq.5) p= ∞ to maximum. This distance measure is used for when two data points are greatest of their absolute magnitude along with data dimension.
  • 11. L(1) family, Manhattan family Proximity Measures The Manhattan distance measure faces two difficulties in the respects of distance value. First one is normalization of a distance value and second is related to figure out of small and large distance. • 𝑑𝑖𝑠sorensen 𝐴, 𝐵 = 𝑖=1 𝑛 |𝐴 𝑖−𝐵 𝑖| 𝑖=1 𝑛 |𝐴 𝑖+𝐵 𝑖| (Eq.6) Normalized dis value between 0 and 1. • 𝑑𝑖𝑠soergel 𝐴, 𝐵 = 𝑖=1 𝑛 |𝐴 𝑖−𝐵 𝑖| 𝑖=1 𝑛 max 𝐴 𝑖 𝐵 𝑖 (Eq.7) choosing the max coefficient data point. • 𝑑𝑖𝑠kulczynski 𝐴, 𝐵 = 𝑖=1 𝑛 |𝐴 𝑖−𝐵 𝑖| 𝑖=1 𝑛 min 𝐴 𝑖 𝐵 𝑖 (Eq.8) choosing the min coefficient data point. • 𝑑𝑖𝑠motyka 𝐴, 𝐵 = 𝑖=1 𝑛 max 𝐴 𝑖 𝐵 𝑖 𝑖=1 𝑛 |𝐴 𝑖+𝐵 𝑖| (Eq.9) takes the max data point of the data set. • 𝑑𝑖𝑠canberra 𝐴, 𝐵 = 𝑖=1 𝑛 |𝐴 𝑖−𝐵𝑖| |𝐴 𝑖+𝐵𝑖| (Eq.10) absolute difference of the individual data. • 𝑑𝑖𝑠lorentzian 𝐴, 𝐵 = i=1 n ln(1 + |𝐴𝑖 − 𝐵𝑖| (Eq.11) Normalized dis value natural logarithm • 𝑑𝑖𝑠wavehedge 𝐴, 𝐵 = 𝑖=1 𝑛 |𝐴 𝑖−𝐵 𝑖| max 𝐴 𝑖 𝐵 𝑖 (Eq.12) normalizes the difference of each data pair with its max value
  • 12. L(2) or χ 2, Euclidian family Proximity Measures The L(2) family member based on the Euclidian distance and gives the distance value after normalization. • 𝑑𝑖𝑠matusita 𝐴, 𝐵 = 𝑖=1 𝑛 ( 𝐴𝑖 − 𝐵𝑖)2 (Eq. 13) • 𝑑𝑖𝑠clark 𝐴, 𝐵 = 𝑖=1 𝑛 ( |𝐴 𝑖−𝐵 𝑖| (𝐴 𝑖+𝐵 𝑖) )2 (Eq. 14) • 𝑑𝑖𝑠divergence 𝐴, 𝐵 = 2 𝑖=1 𝑛 (𝐴 𝑖−𝐵 𝑖)2 (𝐴 𝑖+𝐵 𝑖)2 (Eq. 15) • 𝑑𝑖𝑠squared_euclidean 𝐴, 𝐵 = 𝑖=1 𝑛 (𝐴𝑖 − 𝐵𝑖)2 (Eq.16) • 𝑑𝑖𝑠squared_chi 𝐴, 𝐵 = 𝑖=1 𝑛 (𝐴 𝑖−𝐵 𝑖)2 (𝐴 𝑖+𝐵 𝑖) (Eq. 17) • 𝑑𝑖𝑠pearson_chi 𝐴, 𝐵 = 𝑖=1 𝑛 (𝐴 𝑖−𝐵 𝑖)2 𝐵 𝑖 (Eq. 18) • 𝑑𝑖𝑠neyman_chi 𝐴, 𝐵 = 𝑖=1 𝑛 (𝐴 𝑖−𝐵 𝑖)2 𝐴 𝑖 (Eq. 19)
  • 13. Inner Product Family Proximity Measures
The inner product distance measures give a distance based on the products of paired coordinates. The base formulation is shown as Eq. 20, and its normalised variants as Eq. 21 to Eq. 24.
• dis_{inner\_product}(A, B) = \sum_{i=1}^{n} A_i B_i (Eq. 20)
• dis_{cosine}(A, B) = \sum_{i=1}^{n} A_i B_i / \left( \sqrt{ \sum_{i=1}^{n} A_i^2 } \sqrt{ \sum_{i=1}^{n} B_i^2 } \right) (Eq. 21)
• dis_{jaccard}(A, B) = 1 - \sum_{i=1}^{n} A_i B_i / \left( \sum_{i=1}^{n} A_i^2 + \sum_{i=1}^{n} B_i^2 - \sum_{i=1}^{n} A_i B_i \right) (Eq. 22)
• dis_{dice}(A, B) = 1 - 2 \sum_{i=1}^{n} A_i B_i / \left( \sum_{i=1}^{n} A_i^2 + \sum_{i=1}^{n} B_i^2 \right) (Eq. 23)
• dis_{harmonic\_mean}(A, B) = 2 \sum_{i=1}^{n} A_i B_i / (A_i + B_i) (Eq. 24)
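A minimal sketch of two normalised members of this family (illustrative only; function names and sample vectors are assumptions):

```python
import math

def cosine(a, b):
    # Cosine similarity (Eq. 21): inner product normalised by both vector lengths.
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def dice(a, b):
    # Dice dissimilarity (Eq. 23): 1 minus twice the inner product over the summed norms.
    num = 2 * sum(x * y for x, y in zip(a, b))
    den = sum(x * x for x in a) + sum(y * y for y in b)
    return 1 - num / den

print(cosine([1.0, 2.0], [2.0, 4.0]))  # parallel vectors: 1.0
print(dice([1.0, 2.0], [1.0, 2.0]))    # identical vectors: 0.0
```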
  • 14. Shannon’s Entropy Family Proximity Measures
This family is based on probabilistic uncertainty (entropy). All data points A_i must be non-negative, their sum must equal 1, and a fixed logarithm base is used throughout the measure.
• dis_{kullback\_leibler}(A, B) = \sum_{i=1}^{n} A_i \ln(A_i / B_i) (Eq. 25)
• dis_{jeffreys}(A, B) = \sum_{i=1}^{n} (A_i - B_i) \ln(A_i / B_i) (Eq. 26)
• dis_{k\_divergence}(A, B) = \sum_{i=1}^{n} A_i \ln\left( 2A_i / (A_i + B_i) \right) (Eq. 27)
• dis_{topsoe}(A, B) = \sum_{i=1}^{n} \left[ A_i \ln\left( 2A_i / (A_i + B_i) \right) + B_i \ln\left( 2B_i / (A_i + B_i) \right) \right] (Eq. 28)
• dis_{jensen\_shannon}(A, B) = \frac{1}{2} \left[ \sum_{i=1}^{n} A_i \ln\left( 2A_i / (A_i + B_i) \right) + \sum_{i=1}^{n} B_i \ln\left( 2B_i / (A_i + B_i) \right) \right] (Eq. 29)
• dis_{jensen\_difference}(A, B) = \sum_{i=1}^{n} \left[ \frac{A_i \ln A_i + B_i \ln B_i}{2} - \frac{A_i + B_i}{2} \ln\left( \frac{A_i + B_i}{2} \right) \right] (Eq. 30)
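A plain-Python sketch of two members of this family, assuming the inputs are probability vectors as the slide requires (function names and sample vectors are my own, not from the paper):

```python
import math

def kullback_leibler(a, b):
    # Kullback-Leibler divergence (Eq. 25); a and b are probability vectors.
    # Terms with a_i = 0 contribute 0 by convention, hence the guard.
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

def jensen_shannon(a, b):
    # Jensen-Shannon divergence (Eq. 29): a symmetric, bounded relative of KL.
    left = sum(x * math.log(2 * x / (x + y)) for x, y in zip(a, b) if x > 0)
    right = sum(y * math.log(2 * y / (x + y)) for x, y in zip(a, b) if y > 0)
    return 0.5 * (left + right)

P, Q = [0.5, 0.5], [0.9, 0.1]
print(kullback_leibler(P, Q))
print(jensen_shannon(P, Q))
```

Unlike KL, the Jensen-Shannon divergence is symmetric in its arguments, which matters for clustering algorithms that assume symmetric proximity.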
  • 15. Combination Family Proximity Measures
The members of this family define a distance measure as a combination of two or more distance measures.
• dis_{taneja}(A, B) = \sum_{i=1}^{n} \frac{A_i + B_i}{2} \ln\left( \frac{A_i + B_i}{2 \sqrt{A_i B_i}} \right) (Eq. 31)
• dis_{kumar\_johnson}(A, B) = \sum_{i=1}^{n} \frac{(A_i^2 - B_i^2)^2}{2 (A_i B_i)^{3/2}} (Eq. 32)
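Both combination measures can be sketched directly from the formulas (an illustrative sketch under the assumption of strictly positive inputs; the function names and vectors are my own):

```python
import math

def taneja(a, b):
    # Taneja distance (Eq. 31): compares the arithmetic and geometric means per coordinate.
    # Zero whenever a_i == b_i, since AM == GM there and log(1) == 0.
    return sum(((x + y) / 2) * math.log((x + y) / (2 * math.sqrt(x * y)))
               for x, y in zip(a, b) if x > 0 and y > 0)

def kumar_johnson(a, b):
    # Kumar-Johnson distance (Eq. 32).
    return sum((x * x - y * y) ** 2 / (2 * (x * y) ** 1.5)
               for x, y in zip(a, b) if x > 0 and y > 0)

P, Q = [0.5, 0.5], [0.9, 0.1]
print(taneja(P, Q))
print(kumar_johnson(P, Q))
```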
  • 16. Intersection Family Proximity Measures
The members of this family define a distance measure based on the intersection between data points. The base formulation is shown as Eq. 33, and its variants as Eq. 34 to Eq. 36.
• dis_{intersection}(A, B) = \sum_{i=1}^{n} \min(A_i, B_i) (Eq. 33)
• dis_{wavehedges}(A, B) = \sum_{i=1}^{n} |A_i - B_i| / \max(A_i, B_i) (Eq. 34)
• sim_{ruzicka}(A, B) = \sum_{i=1}^{n} \min(A_i, B_i) / \sum_{i=1}^{n} \max(A_i, B_i) (Eq. 35)
• dis_{tanimoto}(A, B) = \left( \sum_{i=1}^{n} \max(A_i, B_i) - \sum_{i=1}^{n} \min(A_i, B_i) \right) / \sum_{i=1}^{n} \max(A_i, B_i) (Eq. 36)
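The Ruzicka similarity and Tanimoto distance sum to 1 by construction, which a short sketch makes visible (illustrative only; function names and sample vectors are assumptions):

```python
def ruzicka(a, b):
    # Ruzicka similarity (Eq. 35): coordinate-wise minima over maxima.
    return sum(min(x, y) for x, y in zip(a, b)) / sum(max(x, y) for x, y in zip(a, b))

def tanimoto(a, b):
    # Tanimoto distance (Eq. 36): the dissimilarity complement of Ruzicka.
    s_min = sum(min(x, y) for x, y in zip(a, b))
    s_max = sum(max(x, y) for x, y in zip(a, b))
    return (s_max - s_min) / s_max

A, B = [1.0, 3.0, 5.0], [2.0, 1.0, 5.0]
print(ruzicka(A, B))   # 7/10 = 0.7
print(tanimoto(A, B))  # 3/10 = 0.3
```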
  • 17. Analysis of Proximity Measures for Big Data Mining
Fahad et al. (2014) and Pandove and Goel (2015) describe clustering-algorithm criteria based on the Volume, Velocity, and Variety dimensions of big data.
• Volume-related criteria: the clustering algorithm must cope with huge, high-dimensional, and noisy datasets.
• Variety-related criteria: the clustering algorithm must handle different dataset categories (data types) and cluster shapes.
• Velocity-related criteria: the complexity, scalability, and performance of the clustering algorithm when executed on real datasets.
To identify big-data-ready distance measures, this paper takes Volume as the ability to handle large-scale datasets, Variety as the data types handled, namely continuous (numerical) and categorical (nominal and binary), and Velocity as time complexity.
  • 18. Analysis of Proximity Measures for Big Data Mining

Distance measure | Type of function | Volume (High Data Set) | Variety (Type of Data) | Velocity (Time Complexity)
Minkowski family:
Eq. 2 | Distance | Large | Continuous | O(n)
Eq. 3 | Distance | Large | Continuous | O(n)
Eq. 4 | Distance | Large | Continuous | O(n)
Eq. 5 | Distance | Large | Continuous | O(n)
L(1) family:
Eq. 6 | Semi-metric | Large | Categorical | O(2n)
Eq. 7 | Distance | Large | Continuous | O(n)
Eq. 8 | Semi-metric | Medium | Categorical | O(n)
Eq. 9 | Similarity | Medium | Categorical | O(n)
Eq. 10 | Distance | Large | Continuous | O(n)
Eq. 11 | Distance | Large | Continuous | O(n)
Eq. 12 | Distance | Large | Continuous | O(n)
  • 19. Analysis of Proximity Measures for Big Data Mining

Distance measure | Type of function | Volume (High Data Set) | Variety (Type of Data) | Velocity (Time Complexity)
L(2) or χ² family:
Eq. 13 | Distance | Large | Continuous | O(n)
Eq. 14 | Distance | Large | Continuous | O(n)
Eq. 15 | Semi-metric | Large | Categorical | O(n)
Eq. 16 | Semi-metric | Large | Categorical | O(n)
Eq. 17 | Semi-metric | Large | Categorical | O(2n)
Eq. 18 | Semi-metric | Large | Categorical | O(2n)
Eq. 19 | Semi-metric | Large | Categorical | O(2n)
Inner product family:
Eq. 20 | Similarity | Medium | Categorical | O(3n)
Eq. 21 | Similarity | Medium | Categorical | O(3n)
Eq. 22 | Semi-metric | Large | Categorical | O(2n)
Eq. 23 | Semi-metric | Large | Categorical | O(2n)
  • 20. Analysis of Proximity Measures for Big Data Mining

Distance measure | Type of function | Volume (High Data Set) | Variety (Type of Data) | Velocity (Time Complexity)
Shannon’s entropy family:
Eq. 25 | Similarity | Medium | Categorical | O(n)
Eq. 26 | Semi-metric | Large | Categorical | O(n)
Eq. 27 | Similarity | Medium | Categorical | O(n)
Eq. 28 | Semi-metric | Large | Categorical | O(n)
Eq. 29 | Semi-metric | Large | Categorical | O(n)
Eq. 30 | Semi-metric | Large | Categorical | O(n)
Combination family:
Eq. 31 | Semi-metric | Large | Categorical | O(2n)
Eq. 32 | Semi-metric | Large | Categorical | O(2n)
Intersection family:
Eq. 33 | Similarity | Medium | Categorical | O(2n)
Eq. 34 | Distance | Large | Continuous | O(2n)
  • 21. Conclusions
This paper analysed all 34 studied proximity measures against the big data characteristics Volume (dataset size), Variety (dataset type), and Velocity (time complexity). It finds that the Manhattan (Eq. 2), Euclidean (Eq. 3), Minkowski (Eq. 4), Chebyshev (Eq. 5), Soergel (Eq. 7), Canberra (Eq. 10), Lorentzian (Eq. 11), Wave Hedges (Eq. 12), Matusita (Eq. 13), Clark (Eq. 14), and Wave Hedges (Eq. 34) distances are the most scalable for big data from theoretical, practical, and existing-research perspectives.
  • 22. References
1. Rouhani, S., Rotbei, S., & Hamidi, H. (2017). What do we know about the big data researches? A systematic review from 2011 to 2017. Journal of Decision Systems, 26(4), 368-393. doi:10.1080/12460125.2018.1437654
2. Jin, X., Wah, B. W., Cheng, X., & Wang, Y. (2015). Significance and Challenges of Big Data Research. Big Data Research, 2(2), 59-64. doi:10.1016/j.bdr.2015.01.006
3. Chen, M., Mao, S., & Liu, Y. (2014). Big Data: A Survey. Mobile Networks and Applications, 19(2), 171-209. doi:10.1007/s11036-013-0489-0
4. Chen, W., Oliverio, J., Kim, J. H., & Shen, J. (2018). The Modeling and Simulation of Data Clustering Algorithms in Data Mining with Big Data. Journal of Industrial Integration and Management, 1850017. doi:10.1142/s2424862218500173
5. Zhao, X., Liang, J., & Dang, C. (2019). A stratified sampling based clustering algorithm for large-scale data. Knowledge-Based Systems, 163, 416-428. doi:10.1016/j.knosys.2018.09.007
6. Pandove, D., & Goel, S. (2015). A comprehensive study on clustering approaches for big data mining. In Proceedings of IEEE 2nd International Conference on Electronics and Communication Systems (pp. 1333-1338). IEEE Xplore Digital Library. doi:10.1109/ecs.2015.7124801
7. Chen, C. P., & Zhang, C. (2014). Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information Sciences, 275, 314-347. doi:10.1016/j.ins.2014.01.015
8. Amado, A., Cortez, P., Rita, P., & Moro, S. (2018). Research trends on Big Data in Marketing: A text mining and topic modeling based literature analysis. European Research on Management and Business Economics, 24(1), 1-7. doi:10.1016/j.iedeen.2017.06.002
9. Lee, I. (2017). Big data: Dimensions, evolution, impacts, and challenges. Business Horizons, 60(3), 293-303. doi:10.1016/j.bushor.2017.01.004
10. Gandomi, A., & Haider, M. (2015). Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35(2), 137-144. doi:10.1016/j.ijinfomgt.2014.10.007
  • 23. References
11. Sivarajah, U., Kamal, M. M., Irani, Z., & Weerakkody, V. (2017). Critical analysis of Big Data challenges and analytical methods. Journal of Business Research, 70, 263-286. doi:10.1016/j.jbusres.2016.08.001
12. Bendechache, M., Tari, A., & Kechadi, M. (2018). Parallel and distributed clustering framework for big spatial data mining. International Journal of Parallel, Emergent and Distributed Systems, 1-19. doi:10.1080/17445760.2018.1446210
13. Chen, M., Mao, S., & Liu, Y. (2014). Big Data: A Survey. Mobile Networks and Applications, 19(2), 171-209. doi:10.1007/s11036-013-0489-0
14. Gole, S., & Tidke, B. (2015). A survey of Big Data in social media using data mining techniques. In Proceedings of IEEE ICACCS. doi:10.1109/ICACCS.2015.7324059
15. Elgendy, N., & E. A. (2014). Big Data Analytics: A Literature Review Paper. LNAI, 8557, 214-227. doi:10.1007/978-3-319-08976-8_16
16. Cha, S. (2007). Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions. International Journal of Mathematical Models and Methods in Applied Sciences, 4(1), 300-307.
17. Lin, Y., Jiang, J., & Lee, S. (2014). A Similarity Measure for Text Classification and Clustering. IEEE Transactions on Knowledge and Data Engineering, 26(7), 1575-1590. doi:10.1109/tkde.2013.19
18. Tavakkol, B., Jeong, M. K., & Albin, S. L. (2017). Object-to-group probabilistic distance measure for uncertain data classification. Neurocomputing, 230, 143-151. doi:10.1016/j.neucom.2016.12.007
19. Liu, H., Zhang, X., Zhang, X., & Cui, Y. (2017). Self-adapted mixture distance measure for clustering uncertain data. Knowledge-Based Systems, 126, 33-47. doi:10.1016/j.knosys.2017.04.002
20. Weller-Fahy, D. J., Borghetti, B. J., & Sodemann, A. A. (2015). A Survey of Distance and Similarity Measures Used Within Network Intrusion Anomaly Detection. IEEE Communications Surveys & Tutorials, 17(1), 70-91. doi:10.1109/comst.2014.2336610
21. Grant, J., & Hunter, A. (2017). Analysing inconsistent information using distance-based measures. International Journal of Approximate Reasoning, 89, 3-26. doi:10.1016/j.ijar.2016.04.004
  • 24. References
22. Merigó, J. M., Casanovas, M., & Zeng, S. (2014). Distance measures with heavy aggregation operators. Applied Mathematical Modelling, 38(13), 3142-3153. doi:10.1016/j.apm.2013.11.036
23. Ikonomakis, E. K., Spyrou, G. M., & Vrahatis, M. N. (2019). Content driven clustering algorithm combining density and distance functions. Pattern Recognition, 87, 190-202. doi:10.1016/j.patcog.2018.10.007
24. Marcon, E., & Puech, F. (2017). A typology of distance-based measures of spatial concentration. Regional Science and Urban Economics, 62, 56-67. doi:10.1016/j.regsciurbeco.2016.10.004
25. Kocher, M., & Savoy, J. (2017). Distance measures in author profiling. Information Processing & Management, 53(5), 1103-1119. doi:10.1016/j.ipm.2017.04.004
26. Moghtadaiee, V., & Dempster, A. G. (2015). Determining the best vector distance measure for use in location fingerprinting. Pervasive and Mobile Computing, 23, 59-79. doi:10.1016/j.pmcj.2014.11.002
27. Chim, H., & Deng, X. (2008). Efficient Phrase-Based Document Similarity for Clustering. IEEE Transactions on Knowledge and Data Engineering, 20(9), 1217-1229. doi:10.1109/tkde.2008.50
28. Wang, X., Yu, F., & Pedrycz, W. (2016). An area-based shape distance measure of time series. Applied Soft Computing, 48, 650-659. doi:10.1016/j.asoc.2016.06.033
29. Ramya, R., & Sasikala, T. (2018). A comparative analysis of similarity distance measure functions for biocryptic authentication in cloud databases. Cluster Computing. doi:10.1007/s10586-017-1568-y
30. Abudalfa, S. I., & Mikki, M. (2013). K-means algorithm with a novel distance measure. Turkish Journal of Electrical Engineering & Computer Sciences, 21, 1665-1684. doi:10.3906/elk-1010-869
31. Nadler, M., & Smith, E. P. (1993). Pattern Recognition Engineering. New York: John Wiley & Sons. ISBN-13: 978-0471622932
32. Gan, G., Ma, C., & Wu, J. (2007). Data Clustering: Theory, Algorithms, and Applications. Philadelphia, PA: SIAM, Society for Industrial and Applied Mathematics.
  • 25. References
33. Everitt, B. S. (2011). Cluster Analysis (5th ed., Wiley Series in Probability and Statistics). Chichester, West Sussex, United Kingdom: John Wiley & Sons. ISBN: 978-0-470-74991-3
34. Aggarwal, C. C., & Reddy, C. (2014). Data Clustering: Algorithms and Applications. CRC Press, Taylor & Francis Group. ISBN: 978-1-4665-5822-9
35. Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge: Cambridge University Press.
36. Fahad, A., Alshatri, N., Tari, Z., Alamri, A., Khalil, I., Zomaya, A. Y., . . . Bouras, A. (2014). A Survey of Clustering Algorithms for Big Data: Taxonomy and Empirical Analysis. IEEE Transactions on Emerging Topics in Computing, 2(3), 267-279. doi:10.1109/tetc.2014.2330519
40. Shirkhorshidi, A. S., Aghabozorgi, S., & Wah, T. Y. (2015). A Comparison Study on Similarity and Dissimilarity Measures in Clustering Continuous Data. PLoS ONE, 10(12). doi:10.1371/journal.pone.0144059
41. Kumar, V., Chhabra, J. K., & Kumar, D. (2013). Impact of Distance Measures on the Performance of Clustering Algorithms. In Intelligent Computing, Networking, and Informatics (Advances in Intelligent Systems and Computing, pp. 183-190). doi:10.1007/978-81-322-1665-0_17
42. Selvi, C., & Sivasankar, E. (2018). A novel similarity measure towards effective recommendation using Matusita coefficient for Collaborative Filtering in a sparse dataset. Sādhanā, 43(12). doi:10.1007/s12046-018-0970-3