Spatial Tajo
Supporting Spatial Queries on Apache Tajo
Contents
 What is Spatial Tajo?
 Motive for Development
 Why did I choose Apache Tajo?
 Plan for the implementation of the plug-in
 Current status
◦ Parts implemented
◦ Parts not yet implemented
 Conclusion
 References
What is Spatial Tajo?
A plug-in to provide spatial queries for Tajo
In detail, it is a plug-in for running queries on
spatial relations and spatial analysis over data sets
stored in the distributed data warehouse system.
◦ Providing spatial functions for spatial queries
◦ Supporting spatial data types
◦ Supporting an index structure for spatial data sets
◦ Supporting raster data
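Overall Architecture of Apache Tajo
[Figure: a Client (JDBC, SQL Shell, Web UI) talks to the Tajo Master, which manages metadata through the CatalogStore and allocates queries to Tajo Workers. Each Tajo Worker contains a QueryMaster, a Local Query Engine, a StorageManager and Spatial Tajo, on top of the local file system, HDFS or Amazon S3.]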
Motive for development
 The volume of the spatial data sets to be analyzed approaches big data, and I'd like
to analyze them using SQL.
 I'd like to use a system that works without batch processing.
 I'd like to use free software or a free solution (provided that I contribute my
experience back to the communities).
Motive for development
 Of course, there are good ones among the known software and solutions. But…
◦ Relational Database and DBMS
  ◦ Oracle Spatial and Graph (Oracle Database + plug-in)
  ◦ MySQL DBMS
  ◦ PostGIS with PostgreSQL
◦ NoSQL
  ◦ Document-oriented databases: MongoDB, CouchDB (plug-in), RethinkDB
  ◦ HBase, Hive
◦ Cluster and Cloud
  ◦ GeoMesa, ESRI GIS Tools for Hadoop, SpatialHadoop (http://spatialhadoop.cs.umn.edu) (on Hadoop)
  ◦ CartoDB (SaaS on top of PostgreSQL and PostGIS)
 Conclusion: FOSS + Spatial → Apache Tajo + Spatial Plug-in!
Why did I choose Apache Tajo?
 Apache Tajo
◦ A robust big data relational and distributed data warehouse system for Apache Hadoop
◦ Designed for low-latency and scalable ad-hoc queries, online aggregation, and ETL on large
data sets stored on HDFS and other data sources.
 Why did I choose Apache Tajo?: Features
◦ Data distribution and storage are delegated to Hadoop or Amazon S3
◦ It supports inserting data, but does not support updating it
◦ Compatibility: queries can be written in ANSI/ISO SQL
◦ It is faster than processing with MapReduce and provides fault tolerance
◦ It is easy to set up and manage
◦ Users can implement missing features themselves and plug them in if necessary
Plan for the implementation of the plug-in
 Spatial functions for spatial queries
◦ Distance, Equals, Disjoint, Intersects, Touches,
Crosses, Overlaps, Contains, Length, Area and Centroid
◦ Conversion functions for spatial types (such as from_OOOO or to_OOOO)
 Adding spatial data types
 Enabling kNN queries
 Supporting an index for spatial data
◦ R-tree, Quad-tree and KD-tree, etc.
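As a rough illustration of how a kNN query can be built from a distance function alone, the Python sketch below ranks points by planar Euclidean distance and keeps the k nearest. The names `Point`, `distance` and `knn` are hypothetical and are not Spatial Tajo's API; the plug-in itself is implemented on top of JTS, and its actual function signatures are not shown in these slides.

```python
import heapq
import math
from typing import List, NamedTuple


class Point(NamedTuple):
    """Hypothetical 2-D point type, used only for this illustration."""
    x: float
    y: float


def distance(a: Point, b: Point) -> float:
    """Planar Euclidean distance between two points."""
    return math.hypot(a.x - b.x, a.y - b.y)


def knn(query: Point, points: List[Point], k: int) -> List[Point]:
    """Return the k points closest to `query`, nearest first.

    This is a brute-force scan: every point is compared with the query,
    which is what a kNN query reduces to when no spatial index is used.
    """
    return heapq.nsmallest(k, points, key=lambda p: distance(query, p))


if __name__ == "__main__":
    data = [Point(3, 4), Point(1, 0), Point(10, 10), Point(0, 2)]
    print(knn(Point(0, 0), data, k=2))
```

In SQL terms this corresponds to ordering rows by a distance function and limiting the result to k rows; a spatial index would be needed to avoid scanning every row.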
Current status – Parts implemented
 Most primary spatial functions
◦ Most primary spatial functions are implemented using JTS
◦ Distance, Equals, Disjoint, Intersects, Touches, Crosses, Overlaps and Contains
 Running kNN queries
◦ kNN queries can be run using the implemented spatial functions.
 Indexing for spatial data
◦ R-tree indexing using Sort-Tile-Recursive (STR)
[Figure: a global index pointing to several local indexes]
The process of reading data using the index
1. Read the global index and find the search keys.
2. Find the local indexes corresponding to the search keys.
3. Find the search keys in the local indexes.
4. Read the tuples directly.
Wikipedia: R-tree - STR, SpatialHadoop Paper and Tajo Document
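The two-level idea can be sketched in Python under simplifying assumptions: the data are points, each local index is a flat list of points rather than a real R-tree, and the global index is a list of tile bounding boxes. All function names here are hypothetical; this is not the code used in Spatial Tajo, only an illustration of STR tiling followed by a global-then-local lookup.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def str_partition(points: List[Point], capacity: int) -> List[List[Point]]:
    """Group points into tiles of at most `capacity` points using STR.

    Sort by x, cut into vertical slices, then sort each slice by y and
    cut it into tiles.
    """
    if not points:
        return []
    leaf_count = math.ceil(len(points) / capacity)
    slice_count = math.ceil(math.sqrt(leaf_count))
    slice_size = slice_count * capacity
    by_x = sorted(points, key=lambda p: p[0])
    tiles = []
    for i in range(0, len(by_x), slice_size):
        by_y = sorted(by_x[i:i + slice_size], key=lambda p: p[1])
        for j in range(0, len(by_y), capacity):
            tiles.append(by_y[j:j + capacity])
    return tiles


def mbr(points: List[Point]) -> Rect:
    """Minimum bounding rectangle of a non-empty list of points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def intersects(a: Rect, b: Rect) -> bool:
    """True if two rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def build_two_level_index(points: List[Point], capacity: int):
    """Return (global_index, local_indexes) for the given points."""
    local_indexes: Dict[int, List[Point]] = dict(
        enumerate(str_partition(points, capacity)))
    global_index = [(mbr(tile), tile_id)
                    for tile_id, tile in local_indexes.items()]
    return global_index, local_indexes


def range_query(global_index, local_indexes, window: Rect) -> List[Point]:
    """Find all points inside `window` via the global, then local, index."""
    hits = []
    for tile_mbr, tile_id in global_index:        # steps 1-2: pick tiles
        if intersects(tile_mbr, window):
            for x, y in local_indexes[tile_id]:   # steps 3-4: read matches
                if window[0] <= x <= window[2] and window[1] <= y <= window[3]:
                    hits.append((x, y))
    return hits
```

Tiles whose bounding box does not intersect the query window are skipped entirely, which is where the saving over a full scan comes from.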
Current status – Parts not yet implemented
 Adding spatial data types
◦ Spatial functions currently take primitive types as parameters, which is imprecise and inconvenient to use.
 Spatial functions not yet implemented
◦ Length, Area and Centroid
◦ Conversion functions (e.g. from_OOOO and to_OOOO)
 Optimizing functions and queries
◦ Optimizing spatial functions
◦ Optimizing kNN queries
 Indexing spatial data
◦ Quad-tree (with GeoHash) and KD-tree
 Modularization
◦ Currently, the code is not separated from Apache Tajo, so it cannot yet be installed as a plug-in.
Conclusion
 What is Spatial Tajo?
 Motive for Development
 Why did I choose Apache Tajo?
 Plan for the implementation of the plug-in
 Current status
◦ Parts implemented
◦ Parts not yet implemented
References
 Apache Tajo
◦ Official website, source code, user documentation
◦ Efficient In-situ processing of various storage types on Apache Tajo
◦ Apache Tajo Enters the SQL-on-Hadoop Space
◦ SQL-on-Hadoop: What does “100 times faster than Hive” actually mean?
◦ Setting up an Apache Tajo Cluster on Amazon EMR
 PostgreSQL and PostGIS documentation
 SpatialHadoop
◦ Official website, source code
◦ A demonstration of SpatialHadoop: an efficient MapReduce framework for spatial data
◦ SpatialHadoop: towards flexible and scalable spatial processing using MapReduce
References (continued)
 Spatial Databases: With Application to GIS
 Indexing
◦ STR: A simple and efficient algorithm for R-tree packing
◦ SpatialHadoop: towards flexible and scalable spatial processing using MapReduce
◦ Tajo: A distributed data warehouse system on large clusters
◦ Apache Tajo documentation: Index types
 Wikipedia
◦ Spatial Database
◦ Spatial Query
◦ R-tree
Thank you for listening
Do you have any questions?
Please e-mail me (pseudojo.1989@gmail.com) with your questions,
and I’ll answer them in detail.

[FOSS4G 2015 SEOUL] Spatial tajo supporting spatial queries on Apache Tajo


Editor's Notes

  • #2: Hello, everyone. My name is Hyungu Cho, and I am pursuing a master’s degree in Computer Science at Kunsan National University. The title of my presentation is Spatial Tajo: Supporting Spatial Queries on Apache Tajo. I would appreciate your understanding that I will give the presentation by reading from a script, because I am not very fluent in English.
  • #3: The presentation proceeds in the following order: “What is Spatial Tajo?”, “Motive for development”, “Why did I choose Apache Tajo?”, “Plan for the implementation of the plug-in”, “Parts implemented and parts not yet implemented” and “Conclusion.”
  • #4: What is Spatial Tajo? Briefly, it is a plug-in that provides spatial queries for Tajo. In more detail, it is a plug-in for querying data sets with spatial queries written in SQL in Tajo, a distributed data warehouse system. It provides spatial functions, supports spatial data types for spatial queries, supports indexing of spatial data and allows the use of raster data.
  • #5: Then, what was the motive for developing this plug-in? As interest in big data has grown over the last decade, attempts to analyze spatial big data have naturally increased as well. I also thought that the volume of the data containing spatial information that I wanted to analyze could approach big data. The data held by my lab is a collection of tweets gathered from Twitter in real time; it does not consist only of spatial information, but I attempted a few analyses with it. At the time, analysis using Hadoop was the trend, so I used Hadoop as well. However, analyzing with Hadoop meant using MapReduce, and I often thought it would be convenient to use SQL whenever I ran an experiment. MapReduce is somewhat difficult for an analyst to use freely, and because it is essentially batch processing, latency grows with the volume of data. So I looked into existing software and solutions. Of course, there are good ones among them.
  • #6: Traditional spatial databases and DBMSs include Oracle Spatial and Graph, which is installed into Oracle Database as a plug-in, and MySQL. Both are satisfactory solutions, but they are commercial products and it is somewhat difficult to build a cluster with them. My school did not provide them, and solutions with recurring costs were not an appropriate option. PostGIS, which is free, seemed like a good choice; however, since it is not itself software for analyzing big data, I set it aside. As for NoSQL, there are document-oriented databases such as MongoDB, CouchDB and RethinkDB. I could have used one of these, but my data was already structured, so there was no need. HBase was modeled after Google’s BigTable, but it is somewhat difficult for an analyst to use. Hive is convenient because it uses HiveQL, which is similar to SQL, but I did not consider Hive an appropriate choice because it runs on MapReduce. Other solutions that use Hadoop include GeoMesa, ESRI GIS Tools for Hadoop and SpatialHadoop. ESRI GIS Tools for Hadoop is closer to a set of tools or libraries for analysis with Hadoop, and SpatialHadoop is likewise closer to Hadoop combined with a spatial plug-in. Overall, since they are free, performing spatial queries with Hive or with GeoMesa might be the most appropriate option, even though it takes some effort. In the end, however, I decided to write a plug-in myself that lets Tajo, a free and open-source system with low latency, perform spatial queries.
  • #7: Before explaining why I chose Tajo, I will first introduce it. Apache Tajo is a robust big data relational and distributed data warehouse system for Apache Hadoop. Tajo is designed for low-latency and scalable ad-hoc queries, online aggregation, and Extract-Transform-Load (ETL) processes on large data sets stored on the Hadoop Distributed File System (HDFS) and other data sources. By supporting SQL standards and leveraging advanced database techniques, Tajo allows direct control of distributed execution and data flow across a variety of query evaluation strategies and optimization opportunities. Then, why did I choose Tajo? A few of its features were the reasons for my choice. Since Tajo is designed to run on Hadoop, Hadoop takes care of data distribution for me, and Tajo also supports Amazon S3. Tajo has external tables, so it can query existing files in place. It does not support update syntax, like HBase or Hive, but it can overwrite data. Tajo supports the SQL standard and does not use MapReduce, so it is faster than batch processing with MapReduce; it also has fault tolerance and supports dynamic scheduling for long-running queries. Tajo is convenient to install, set up and operate. Of course, it is still a growing project, so problems can be difficult to solve when it fails. And if you need a feature, you can implement it yourself and attach it as a plug-in. As described above, Tajo has merits and demerits compared with the other solutions and software mentioned, but it is free and open-source software, and given the characteristics introduced here I judged it appropriate and decided to implement the plug-in on Tajo.
  • #8: The plan for implementing the plug-in is broadly divided into four parts. First, spatial functions for spatial queries: I decided to implement 11 basic functions, including Distance and Equals, together with functions for converting spatial data types. Second, spatial data types: custom types that contain spatial information, such as Point and LineString. Third, kNN queries: making kNN queries run smoothly once the spatial functions are implemented. Fourth, spatial data indexing: an index that allows the necessary data to be retrieved when running spatial queries.
  • #9: The parts implemented so far are the spatial functions, kNN queries and spatial data indexing. For the spatial functions, I implemented most of the primary functions using JTS. I ran kNN queries using the implemented spatial functions, and they worked smoothly. For spatial data indexing, I borrowed the way indexes operate in Tajo and SpatialHadoop and implemented a two-level R-tree using Sort-Tile-Recursive (STR). The two-level R-tree consists of a global index and local indexes, as shown in the picture. The two-level R-tree index is built as follows. First, Tajo’s workers divide the stored data sets into areas using STR. Second, they build a local index for each area. Third, they extract only the entries at a certain upper level of each local index and build a global index from them. The two-level R-tree index is read as follows. First, Tajo’s workers read the global index and find the search keys. Second, they find the local indexes corresponding to the search keys. Third, they find the search keys in the local indexes. Lastly, they read the data directly from storage and construct the tuples.
  • #10: The parts not yet implemented are spatial data types, the remaining spatial functions, optimization of functions and kNN queries, additional spatial indexes and modularization. Spatial data types are not implemented yet, so spatial functions inconveniently have to take primitive types. I will resolve this inconvenience once spatial data types are implemented. As for spatial functions, I will first implement the missing ones, such as Length, Area and Centroid, and then optimize each spatial function and the kNN queries. Next, as for indexing spatial data, I will implement a Quad-tree or KD-tree using GeoHash in addition to the R-tree. Lastly, I will modularize the plug-in. It is currently bundled with Tajo because it is difficult to separate, but after the final modularization it will be distributed in the form of a plug-in.
  • #11: Conclusion. What is Spatial Tajo? It is a plug-in that provides spatial queries for Tajo. What was the motive for development? I began developing it because I wanted to analyze data using SQL in a distributed data warehouse system, instead of using MapReduce or similar batch processing. Why did I choose Apache Tajo? Because I can use it without much concern about distributed storage, it supports SQL syntax, it is faster than MapReduce, it provides fault tolerance and, above all, I can implement a feature myself and attach it as a plug-in if necessary. The overall plan for implementation is to implement spatial functions for spatial queries, add spatial data types, enable kNN queries and support indexing of spatial data. The current status is as follows. The parts implemented so far are most of the spatial functions, kNN queries and indexing through a two-level R-tree. The parts not yet implemented are spatial data types, the remaining spatial functions, optimization of spatial functions and spatial queries, other methods of indexing spatial data and modularization of the plug-in.
  • #12, #13: For today’s presentation, I mainly referred to the documentation and source code of Tajo, SpatialHadoop, PostGIS and others, and I also drew on books and websites for background on the content.
  • #14: For Q&A, please send your questions to the e-mail address shown here, and I will answer them in detail. Thank you for listening to my presentation.