ANALYTIC QUERIES OVER GEOSPATIAL TIME-SERIES DATA USING DISTRIBUTED HASH TABLES

Abstract
As remote sensing equipment and networked observational devices continue
to proliferate, their corresponding data volumes have surpassed the storage
and processing capabilities of commodity computing hardware. This trend has
led to the development of distributed storage frameworks that incrementally
scale out by assimilating resources as necessary. While challenging in its own
right, storing and managing voluminous datasets is only the precursor to a
broader field of research: extracting insights, relationships, and models from
the underlying datasets. The focus of this study is twofold: exploratory and
predictive analytics over voluminous, multidimensional datasets in a
distributed environment. Both of these types of analysis represent a higher-
level abstraction over standard query semantics; rather than indexing every
discrete value for subsequent retrieval, our framework autonomously learns
the relationships and interactions between dimensions in the dataset and
makes the information readily available to users. This functionality includes
statistical synopses, correlation analysis, hypothesis testing, probabilistic
structures, and predictive models that not only enable the discovery of
nuanced relationships between dimensions, but also allow future events and
trends to be predicted. The algorithms presented in this work were evaluated
empirically on a real-world geospatial time-series dataset in a production
environment, and are broadly applicable across other storage frameworks.
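
The statistical synopses and correlation analysis listed above can be maintained incrementally, one observation at a time, without storing the raw values. The following is a minimal sketch using Welford-style online updates, not the framework's actual implementation; the class name `RunningSynopsis` and its methods are hypothetical and chosen only for illustration.

```python
import math


class RunningSynopsis:
    """Hypothetical online summary of two dimensions (x, y) of a stream.

    Uses Welford-style updates, so each observation is processed once
    in O(1) time and memory and no raw values are retained.
    """

    def __init__(self):
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.m2_x = 0.0   # sum of squared deviations of x
        self.m2_y = 0.0   # sum of squared deviations of y
        self.c_xy = 0.0   # sum of co-deviations of x and y

    def update(self, x, y):
        self.n += 1
        dx = x - self.mean_x          # deviation from the old mean of x
        dy = y - self.mean_y          # deviation from the old mean of y
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n
        self.m2_x += dx * (x - self.mean_x)
        self.m2_y += dy * (y - self.mean_y)
        self.c_xy += dx * (y - self.mean_y)

    def variance_x(self):
        """Sample variance of x (0.0 until two observations are seen)."""
        return self.m2_x / (self.n - 1) if self.n > 1 else 0.0

    def correlation(self):
        """Pearson correlation of x and y (0.0 if either is constant)."""
        if self.m2_x == 0.0 or self.m2_y == 0.0:
            return 0.0
        return self.c_xy / math.sqrt(self.m2_x * self.m2_y)


# Example stream: y is always twice x, so the two are perfectly correlated.
synopsis = RunningSynopsis()
for x, y in [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]:
    synopsis.update(x, y)
```

Because the state is a handful of numbers per pair of dimensions, a summary of this kind stays small enough to remain memory resident however many observations arrive.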

CONCLUSION:
Support for analytic queries over voluminous datasets entails accounting for:
(1) the speed differential between memory accesses and disk I/O,
(2) how metadata is organized and managed,
(3) the performance impact of the data structures,
(4) dispersion of query loads, and
(5) the avoidance of I/O hotspots.
Accounting for these factors lets us provide rich exploratory analysis
functionality as well as predictive models that yield insights beyond the
trends present in the dataset. One key aspect of our approach is minimizing
disk accesses. This is achieved by carefully maintaining metadata graphs that
retain expressiveness for query evaluations but preserve compactness to
ensure memory residency while avoiding page faults and thrashing. The graphs
remain compact even in situations where individual nodes store hundreds of
millions of files. Further, statistical synopses ensure the knowledge base is
continually updated as live streams occur. We achieve this via the use and
adaptation of online algorithms, compact data structures, and lightweight
models. This also allows us to perform query evaluations at multiple
geographic scales. We avoid query hotspots by propagating the queries to
nodes likely to satisfy them, performing in-memory evaluations and avoiding
disk accesses. This reduces the likelihood of queries building up and
overflowing request queues at individual nodes. By targeting only a specific
subset of the nodes, we minimize cases where queries are evaluated that
produce no results. Our use of Geohashes also allows us to localize queries
efficiently. Hotspot avoidance ensures faster overall turnaround times for
individual queries. Combined with efficient pipelining, this allows multiple
queries to be evaluated concurrently at a high rate, which is validated by our
empirical results.
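
To illustrate how Geohashes can localize a query to a subset of nodes, the following sketch encodes a coordinate with the standard Geohash algorithm and then selects only those nodes whose assigned prefix overlaps the queried region. The node names, the prefix-to-node assignment, and the function `nodes_for_query` are assumptions made for this example and are not taken from the system described here.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard Geohash alphabet


def geohash_encode(lat, lon, precision=6):
    """Encode a latitude/longitude pair as a Geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    value = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                value = (value << 1) | 1
                lon_lo = mid
            else:
                value <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                value = (value << 1) | 1
                lat_lo = mid
            else:
                value <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[value])
            bits = 0
            value = 0
    return "".join(chars)


def nodes_for_query(query_prefix, node_prefixes):
    """Return the nodes whose Geohash prefix overlaps the query region.

    Two Geohash cells overlap exactly when one string is a prefix of
    the other, so a shorter query prefix (coarser scale) matches more nodes.
    """
    return sorted(
        node
        for node, prefix in node_prefixes.items()
        if prefix.startswith(query_prefix) or query_prefix.startswith(prefix)
    )


# Hypothetical assignment of Geohash prefixes to storage nodes.
node_prefixes = {"node-a": "ez", "node-b": "u4", "node-c": "9x"}

point = geohash_encode(42.6, -5.6, precision=5)
targets = nodes_for_query(point[:4], node_prefixes)
```

Truncating the hash, as in `point[:4]`, widens the queried cell, which is one simple way to evaluate the same query at a coarser geographic scale while still contacting only the nodes that can hold matching data.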