[FOSS4G 2015 SEOUL] Spatial Tajo: Supporting Spatial Queries on Apache Tajo

Spatial Tajo
Supporting Spatial Queries on Apache Tajo

Contents
• What is Spatial Tajo?
• Motive for development
• Why did I choose Apache Tajo?
• Plan for the implementation of the plug-in
• Current status
◦ Parts implemented
◦ Parts not yet implemented
• Conclusion
• References

What is Spatial Tajo?
A plug-in that provides spatial queries for Tajo.
In detail, it is a plug-in that makes it possible to issue and execute queries about spatial relations and spatial analysis on data sets stored in the distributed data warehouse system.
◦ Providing spatial functions for spatial queries
◦ Supporting spatial data types
◦ Supporting an index structure for spatial data sets
◦ Supporting raster data
Overall Architecture of Apache Tajo
[Figure: architecture diagram. Components shown: Client (JDBC, SQL Shell, Web UI); Tajo Master with CatalogStore (labels: "Allocate a query", "Manage metadata"); Tajo Worker with QueryMaster, Local Query Engine, StorageManager and Spatial Tajo; storage: Local File System, HDFS, Amazon S3]

Motive for development
• The volume of the spatial data sets to be analyzed is approaching big data scale, and I'd like to analyze them using SQL.
• I'd like to use a system that works without batch processing.
• I'd like to use free software or a free solution (provided that I contribute my experience back to the communities).

Motive for development
• Of course, there are good options among the existing software and solutions. But…
◦ Relational databases and DBMS
  ◦ Oracle Spatial and Graph (Oracle Database + plug-in)
  ◦ MySQL DBMS
  ◦ PostGIS with PostgreSQL
◦ NoSQL
  ◦ Document-oriented databases: MongoDB, CouchDB (plug-in), RethinkDB
  ◦ HBase, Hive
◦ Cluster and cloud
  ◦ GeoMesa, ESRI GIS Tools for Hadoop, SpatialHadoop (http://spatialhadoop.cs.umn.edu) (on Hadoop)
  ◦ CartoDB (SaaS on top of PostgreSQL and PostGIS)
• Conclusion: FOSS + Spatial → Apache Tajo + Spatial Plug-in!

Why did I choose Apache Tajo?
• Apache Tajo
◦ A robust big data relational and distributed data warehouse system for Apache Hadoop
◦ Designed for low-latency and scalable ad-hoc queries, online aggregation, and ETL on large data sets stored on HDFS and other data sources
• Why did I choose Apache Tajo? Features
◦ Distributing and storing the data is delegated to Hadoop or Amazon S3
◦ It supports inserting data, but does not support updating data
◦ Compatibility: queries can be written in ANSI/ISO SQL
◦ It is faster than processing with MapReduce and guarantees fault tolerance
◦ It is easy to set up and manage
◦ Users can implement and plug in their own extensions if necessary

Plan for the implementation of the plug-in
• Spatial functions for spatial queries
◦ Distance, Equals, Disjoint, Intersects, Touches, Crosses, Overlaps, Contains, Length, Area and Centroid
◦ Conversion functions for spatial types (e.g. from_OOOO or to_OOOO)
• Adding spatial data types
• Enabling kNN (k-nearest-neighbor) queries
• Supporting an index for spatial data
◦ R-tree, Quad-tree, KD-tree, etc.
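
To illustrate why a distance function alone is enough to express a kNN query, here is a minimal Python sketch. It is not Spatial Tajo code: the names `distance` and `knn` and the `(id, (x, y))` tuple layout are assumptions made for this example, and plain Euclidean distance stands in for a real geometry distance. It corresponds roughly to a query that orders rows by distance to a query point and keeps the first k.

```python
import heapq
import math


def distance(a, b):
    """Euclidean distance between two (x, y) points (stand-in for a spatial Distance function)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def knn(points, query, k):
    """Return the k entries of `points` closest to `query`, nearest first.

    `points` is a list of (id, (x, y)) tuples; this layout is hypothetical.
    Every point is examined (a full scan); no index is used.
    """
    return heapq.nsmallest(k, points, key=lambda item: distance(item[1], query))


points = [("a", (0, 0)), ("b", (3, 4)), ("c", (1, 1)), ("d", (10, 10))]
nearest = knn(points, (0, 0), 2)
print([pid for pid, _ in nearest])
```

Because this version scans every row, its cost grows with the table size, which is one reason the plan also lists a spatial index.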

Current status – Parts implemented
• Most of the primary spatial functions
◦ Implemented using JTS
◦ Distance, Equals, Disjoint, Intersects, Touches, Crosses, Overlaps and Contains
• Running kNN queries
◦ kNN queries can be run using the implemented spatial functions.
• Indexing for spatial data
◦ R-tree indexing using Sort-Tile-Recursive (STR)
[Figure: global index and local indexes]
The process of reading data using the index:
1. Read the global index and find the search keys.
2. Find the local indexes corresponding to the search keys.
3. Find the search keys in the local indexes.
4. Read the tuples directly.
Sources: Wikipedia (R-tree, STR), the SpatialHadoop paper and the Tajo documentation
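
The following Python sketch shows STR leaf packing and a two-level lookup that loosely mirrors the four reading steps above. It is an in-memory illustration under stated assumptions, not the actual Tajo or Spatial Tajo implementation: rectangles are assumed to be `(min_x, min_y, max_x, max_y)` tuples, the "global index" is simply a list of leaf bounding boxes, each "local index" is the list of rectangles in one leaf, and all function names are hypothetical.

```python
import math


def center(r):
    """Center point of a rectangle (min_x, min_y, max_x, max_y)."""
    return ((r[0] + r[2]) / 2, (r[1] + r[3]) / 2)


def mbr(rects):
    """Minimum bounding rectangle of a non-empty list of rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def intersects(a, b):
    """True if rectangles a and b overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def str_pack(rects, capacity):
    """Group rectangles into leaves of at most `capacity` entries using STR.

    Sort by x-center, cut into vertical slices, then sort each slice by
    y-center and cut it into leaves.
    """
    if not rects:
        return []
    leaf_count = math.ceil(len(rects) / capacity)
    slice_count = math.ceil(math.sqrt(leaf_count))
    slice_size = slice_count * capacity
    by_x = sorted(rects, key=lambda r: center(r)[0])
    leaves = []
    for i in range(0, len(by_x), slice_size):
        by_y = sorted(by_x[i:i + slice_size], key=lambda r: center(r)[1])
        for j in range(0, len(by_y), capacity):
            leaves.append(by_y[j:j + capacity])
    return leaves


def build_index(rects, capacity):
    """Build a list of (leaf MBR, leaf entries): a toy global/local index pair."""
    return [(mbr(leaf), leaf) for leaf in str_pack(rects, capacity)]


def search(index, query):
    """Return all indexed rectangles that intersect `query`."""
    hits = []
    for leaf_mbr, leaf in index:            # steps 1-2: pick candidate leaves
        if intersects(leaf_mbr, query):
            # steps 3-4: look inside the leaf and collect matching entries
            hits.extend(r for r in leaf if intersects(r, query))
    return hits


rects = [(x, y, x + 0.5, y + 0.5) for x in range(4) for y in range(4)]
index = build_index(rects, capacity=4)
print(len(index), sorted(search(index, (0.1, 0.1, 1.2, 1.2))))
```

A full STR R-tree would apply the same packing recursively to the leaf bounding boxes to build the upper levels; this sketch stops at one level to keep the global/local split visible.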

Current status – Parts not yet implemented
• Adding spatial data types
◦ Without them, the parameters of the spatial functions are imprecise and inconvenient to use.
• Spatial functions not yet implemented
◦ Length, Area and Centroid
◦ Conversion functions (e.g. from_OOOO and to_OOOO)
• Optimizing functions and queries
◦ Optimizing spatial functions
◦ Optimizing kNN queries
• Indexing spatial data
◦ Quad-tree (with GeoHash) and KD-tree
• Modularization
◦ Currently the code is not separated from Apache Tajo, so it cannot yet be installed as a plug-in.
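
As background for the planned quad-tree with GeoHash, the sketch below shows the standard GeoHash encoding in Python. It is only an illustration of the encoding itself, not a design for Spatial Tajo's index; the function name `geohash_encode` is an assumption of this example. Bits alternate between longitude and latitude, so each pair of bits selects one quadrant of the current cell, which is why a GeoHash prefix can serve as a key for a quad-tree-like cell.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # GeoHash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=8):
    """Encode a latitude/longitude pair as a GeoHash string of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1   # upper half of the current range
            rng[0] = mid
        else:
            bits <<= 1               # lower half of the current range
            rng[1] = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:           # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


print(geohash_encode(42.6, -5.6, 5))
```

Longer hashes of the same point extend shorter ones, so truncating a hash yields the enclosing, coarser cell.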

Conclusion
• What is Spatial Tajo?
• Motive for development
• Why did I choose Apache Tajo?
• Plan for the implementation of the plug-in
• Current status
◦ Parts implemented
◦ Parts not yet implemented

References
• Apache Tajo
◦ Official website, source code, user documentation
◦ Efficient In-situ Processing of Various Storage Types on Apache Tajo
◦ Apache Tajo Enters the SQL-on-Hadoop Space
◦ SQL-on-Hadoop: What Does "100 Times Faster than Hive" Actually Mean?
◦ Setting up an Apache Tajo Cluster on Amazon EMR
• PostgreSQL and PostGIS documentation
• SpatialHadoop
◦ Official website, source code
◦ A Demonstration of SpatialHadoop: An Efficient MapReduce Framework for Spatial Data
◦ SpatialHadoop: Towards Flexible and Scalable Spatial Processing Using MapReduce

References
• Spatial Databases: With Application to GIS
• Indexing
◦ STR: A Simple and Efficient Algorithm for R-Tree Packing
◦ SpatialHadoop: Towards Flexible and Scalable Spatial Processing Using MapReduce
◦ Tajo: A Distributed Data Warehouse System on Large Clusters
◦ Apache Tajo documentation: Index types
• Wikipedia
◦ Spatial database
◦ Spatial query
◦ R-tree

Thank you for listening
Do you have any questions?
Please e-mail me (pseudojo.1989@gmail.com) with your questions, and I'll answer them in detail.
