K NEAREST NEIGHBOUR JOINS FOR BIG DATA ON MAPREDUCE: A THEORETICAL AND EXPERIMENTAL ANALYSIS

ABSTRACT:
Given a point p and a set of points S, the kNN operation finds the k closest
points to p in S. It is a computationally intensive task with a wide range of
applications, such as knowledge discovery and data mining. However, as the
volume and the dimension of data increase, only distributed approaches can
perform such a costly operation in a reasonable time. Recent works have
focused on implementing efficient solutions using the MapReduce
programming model because it is suitable for distributed large-scale data
processing. Although these works provide different solutions to the same
problem, each one has particular constraints and properties. In this paper, we
compare the different existing approaches for computing kNN on MapReduce,
first theoretically, and then by performing an extensive experimental
evaluation. To be able to compare solutions, we identify three generic steps
for kNN computation on MapReduce: data pre-processing, data partitioning
and computation. We then analyze each step in terms of load balancing,
accuracy and complexity. The experiments in this paper use a variety of
datasets and analyze the impact of data volume, data dimension and the value
of k from several perspectives, such as time and space complexity and
accuracy. The experimental part reveals new advantages and shortcomings,
which are discussed for each algorithm. To the best of our knowledge, this is
the first paper that compares kNN computing methods on MapReduce both
theoretically and experimentally under the same settings. Overall, this paper
can be used as a guide to tackle kNN-based practical problems in the context
of big data.
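As an illustration of the kNN operation defined at the start of the abstract, here is a minimal single-machine sketch in Python. It is not one of the distributed algorithms the paper compares: the function name knn and the choice of Euclidean distance are assumptions made for this example only.

```python
import heapq
import math

def knn(p, S, k):
    """Return the k points of S closest to p, nearest first.

    Assumption: points are tuples of floats and closeness is
    measured with Euclidean distance.
    """
    return heapq.nsmallest(k, S, key=lambda s: math.dist(p, s))

S = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (5.0, 5.0)]
print(knn((0.9, 0.9), S, k=2))  # prints [(1.0, 1.0), (0.0, 0.0)]
```

This brute-force version scans all of S for every query point, which is the cost that becomes prohibitive as data volume and dimension grow.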
CONCLUSION
In this paper, we have studied existing solutions to perform the kNN operation
in the context of MapReduce. We first approached this problem from a
workflow point of view. We pointed out that all solutions follow three
main steps to compute kNN over MapReduce, namely pre-processing of data,
partitioning and actual computation. We listed and explained the
different algorithms that can be chosen for each step, and discussed their
pros and cons in terms of load balancing, accuracy of results, and overall
complexity. In the second part, we performed extensive experiments to
compare the performance, disk usage and accuracy of all these algorithms in
the same environment. We mainly used two real datasets: a geographic
coordinates one (2 dimensions) and an image-based one (SURF descriptors,
128 dimensions). For all algorithms, it was the first published experiment on
such high dimensions. Moreover, we performed a detailed analysis, outlining,
for each algorithm, the importance and difficulty of fine-tuning some
parameters to obtain the best performance.
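The partitioning and computation steps named above can be sketched as a small in-memory simulation in Python. This is an illustrative assumption, not a specific algorithm from the paper: the pre-processing step is omitted, points are split into blocks round-robin, every pair of blocks is compared exhaustively (so the result is exact), and the names partition, local_knn and knn_join are hypothetical.

```python
import heapq
import math
from collections import defaultdict

def partition(points, n_blocks):
    """Partitioning step: split points round-robin into blocks (assumed strategy)."""
    blocks = [[] for _ in range(n_blocks)]
    for i, pt in enumerate(points):
        blocks[i % n_blocks].append(pt)
    return blocks

def local_knn(r_block, s_block, k):
    """Computation step, first pass: k nearest candidates within one block of S."""
    for r in r_block:
        yield r, heapq.nsmallest(k, ((math.dist(r, s), s) for s in s_block))

def knn_join(R, S, k, n_blocks=2):
    """For each point of R, return its k nearest neighbours in S."""
    partial = defaultdict(list)
    # Each (R block, S block) pair stands in for the work of one reduce task.
    for r_block in partition(R, n_blocks):
        for s_block in partition(S, n_blocks):
            for r, candidates in local_knn(r_block, s_block, k):
                partial[r].extend(candidates)
    # Second pass: merge the per-block candidates into the final k neighbours.
    return {r: [s for _, s in heapq.nsmallest(k, cands)]
            for r, cands in partial.items()}

R = [(0.0, 0.0), (10.0, 10.0)]
S = [(1.0, 0.0), (0.0, 2.0), (9.0, 9.0), (20.0, 20.0)]
print(knn_join(R, S, k=2))
```

In this sketch n_blocks plays the role of a tuning parameter: more blocks mean smaller tasks but more candidate lists to merge in the second pass.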
