The Problem 
A MapReduce Algorithm to Create Contiguity Weights for Spatial Analysis of Big Data
Xun Li, Wenwen Li, Luc Anselin, Sergio Rey, Julia Koschinsky
Nov 4, 2014
BIGSPATIAL 2014
1
Big Spatial Data Challenge 
Cyber-Framework: CyberGIS, Spatial Hadoop 
2 
Big Spatial Data Domain
• Components: Spatial Data Management, Spatial Analysis, Visualization, Spatial Process Modeling, Spatial Pattern Detection
• Computing resources: Computing Grids, Super Computers, HPC, Cloud Computing Platform
Spatial Analysis on Big Data 
3 
• Spatial Analysis: Spatial Data Preprocessing, Spatial Data Exploration, Spatial Model Specification, Spatial Model Estimation, Spatial Model Validation
• Spatial Statistics, for example: Spatial Clustering/Autocorrelation, Spatial Lag Model, Spatial Error Model
• Spatial Weights: W
Spatial Weights
• Spatial weights are an essential component of spatial analysis wherever a representation of spatial structure is needed.
• Tobler: “Everything is related to everything else, but near things are more related than distant things.”
Create Spatial Weights (W) 
• Extract spatial structure: 
• Spatial neighboring information (contiguity based weights) 
• Spatial distance information (distance based weights) 
4 
Contiguity based Weights:
A B C D E
A 0 1 0 0 0
B 1 0 1 1 0
C 0 1 0 1 1
D 0 1 1 0 0
E 0 0 1 0 0
Distance based Weights:
[Figure: 5 x 5 distance matrix over A to E with a zero diagonal; legible entries: A-B 1.2, B-C 2.3, B-D 0.7, C-D 1.1]
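As a minimal sketch of what the contiguity matrix encodes, the neighbor lists below are read off the 5 x 5 contiguity matrix on this slide and expanded back into a binary matrix. The function name `contiguity_matrix` and the dictionary layout are illustrative assumptions, not part of the original deck.

```python
# Neighbor lists read off the contiguity matrix shown on the slide.
neighbors = {
    "A": ["B"],
    "B": ["A", "C", "D"],
    "C": ["B", "D", "E"],
    "D": ["B", "C"],
    "E": ["C"],
}


def contiguity_matrix(neighbors):
    """Expand neighbor lists into a binary weights matrix (hypothetical helper)."""
    ids = sorted(neighbors)
    index = {pid: i for i, pid in enumerate(ids)}
    n = len(ids)
    w = [[0] * n for _ in range(n)]
    for pid, nbrs in neighbors.items():
        for other in nbrs:
            w[index[pid]][index[other]] = 1  # 1 means "contiguous"
    return ids, w


ids, W = contiguity_matrix(neighbors)
for pid, row in zip(ids, W):
    print(pid, *row)
```

Storing neighbor lists rather than the full matrix is the usual choice for large data, since each polygon has only a handful of neighbors and most matrix cells are zero.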
Contiguity Spatial Weights: how to find neighbors 
5 
Classic Algorithms:
• Brute-force search:
• Test A against B,C,D,E | B against C,D,E | C against D,E | D against E
• O(n²)
• Spatial Index:
• Binning algorithm
• r-tree index
• O(n log n)
• Rook Contiguity: neighbors share borders
• Queen Contiguity: neighbors share borders or vertices
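The brute-force search can be sketched as a test of every pair of polygons, which is where the O(n²) cost comes from. This is an illustrative sketch only: polygons are represented as lists of vertex IDs (A, B and C follow the vertex numbering used on the later counting slides), and the helper name `brute_force_queen` is an assumption.

```python
from itertools import combinations

# Polygons as lists of vertex IDs, following the numbering on the counting slides.
polygons = {
    "A": [1, 2, 3, 4, 5, 6],
    "B": [5, 6, 7, 8, 9, 10],
    "C": [3, 4, 13, 14, 15, 16],
}


def brute_force_queen(polygons):
    """Test every pair once; queen neighbors share at least one vertex."""
    pairs = []
    comparisons = 0
    for (p, p_vertices), (q, q_vertices) in combinations(sorted(polygons.items()), 2):
        comparisons += 1
        if set(p_vertices) & set(q_vertices):
            pairs.append((p, q))
    return pairs, comparisons


pairs, comparisons = brute_force_queen(polygons)
print(pairs, comparisons)
```

With n polygons the loop runs n(n-1)/2 times, so the number of pair tests quadruples whenever the data doubles.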
Parallelize Spatial Weights Creation for big data? 
6 
Split data with a buffer zone 
A B C D E 
A 0 1 1 1 0 
B 1 0 0 1 0 
C 1 0 0 1 0 
D 1 1 1 0 1 
E 0 0 0 1 0
Counting Algorithm for Contiguity Weights Creation 
7 
Counting Algorithms: 
• Inspired by TopoJSON:
• Shared vertices are stored only once.
• Count how many polygons share a point (Queen Weights): O(n)
[Figure: polygons with vertices numbered 1 to 20]
Count A: {1:[A], 2:[A], 3:[A], 4:[A], 5:[A], 6:[A]}
Count B: {1:[A], 2:[A], 3:[A], 4:[A], 5:[A,B], 6:[A,B], 7:[B], 8:[B], 9:[B], 10:[B]}
Count C: {1:[A], 2:[A], 3:[A,C], 4:[A,C], 5:[A,B], 6:[A,B], 7:[B], 8:[B], 9:[B], 10:[B], 13:[C], 14:[C], 15:[C], 16:[C]}
Neighbors: [A,C], [A,B]
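The counting steps above can be sketched as a single pass that records, for each point, which polygons use it. This is a minimal sketch under the assumption that polygons arrive as lists of vertex IDs; the function name `queen_neighbors` is hypothetical and this is not taken from the authors' source code.

```python
from collections import defaultdict
from itertools import combinations

polygons = {
    "A": [1, 2, 3, 4, 5, 6],
    "B": [5, 6, 7, 8, 9, 10],
    "C": [3, 4, 13, 14, 15, 16],
}


def queen_neighbors(polygons):
    """One pass over all vertices: point -> polygons that use it."""
    point_owners = defaultdict(list)
    for pid, vertices in polygons.items():
        for v in set(vertices):  # ignore a vertex repeated within one ring
            point_owners[v].append(pid)
    # Any point owned by two or more polygons makes them queen neighbors.
    pairs = set()
    for owners in point_owners.values():
        for p, q in combinations(sorted(owners), 2):
            pairs.add((p, q))
    return dict(point_owners), sorted(pairs)


point_owners, pairs = queen_neighbors(polygons)
print(point_owners[5], point_owners[3])
print(pairs)
```

The pass over vertices is linear in the total number of points; the pair expansion per point stays small as long as only a few polygons meet at any one vertex.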
Counting Algorithm for Contiguity Weights Creation 
8 
Counting Algorithms: 
• Counting how many polygons share an edge (Rook Weights): O(n) 
[Figure: polygons with vertices numbered 1 to 20]
Count A: {(1,2):[A], (2,3):[A], (3,4):[A], (4,5):[A], (5,6):[A], (6,1):[A]}
Count B: {(1,2):[A], (2,3):[A], (3,4):[A], (4,5):[A], (5,6):[A,B], (6,1):[A], (6,7):[B], (7,8):[B], (8,9):[B], (9,10):[B]}
Neighbors: [A,B]
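The rook variant can be sketched the same way, with edges as dictionary keys instead of points. One assumption goes beyond the slide: each edge key is stored with its endpoints sorted, so that an edge traversed as (5,6) by one polygon and (6,5) by another still matches. The name `rook_neighbors` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Vertex IDs listed in ring order, so consecutive IDs form the polygon's edges.
polygons = {
    "A": [1, 2, 3, 4, 5, 6],
    "B": [5, 6, 7, 8, 9, 10],
}


def rook_neighbors(polygons):
    """One pass over all edges: edge -> polygons that use it."""
    edge_owners = defaultdict(list)
    for pid, ring in polygons.items():
        closed = ring[1:] + ring[:1]  # pair the last vertex with the first
        for a, b in zip(ring, closed):
            edge = (min(a, b), max(a, b))  # assumption: direction-independent key
            edge_owners[edge].append(pid)
    pairs = set()
    for owners in edge_owners.values():
        for p, q in combinations(sorted(set(owners)), 2):
            pairs.add((p, q))
    return dict(edge_owners), sorted(pairs)


edge_owners, pairs = rook_neighbors(polygons)
print(edge_owners[(5, 6)])
print(pairs)
```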
Parallel Counting Algorithm? 
9 
[Figure: the polygons split into two subsets, with vertices numbered 1 to 20]
Count Results:
{1:[A], 2:[A], 3:[A,C], 4:[A,C], 5:[A], 6:[A], 13:[C], 14:[C], …}
Count Results:
{5:[B,D], 6:[B], …, 9:[B], 10:[B,D], 11:[D,E], 12:[D,E], 13:[D], …}
Parallel Counting Algorithm? –Cont.
10 
Print line by line 
1:[A] 
2:[A] 
3:[A,C] 
4:[A,C] 
5:[A] 
6:[A] 
13:[C] 
14:[C] 
… 
Print line by line 
5:[B,D] 
6:[B] 
… 
9:[B] 
10:[B,D] 
11:[D,E] 
12:[D,E] 
13:[D] 
… 
Merge & Sort Two Results:
1:[A] 
2:[A] 
3:[A,C] 
4:[A,C] 
4:[A] 
4:[D] 
5:[A] 
5:[B,D] 
6:[A] 
6:[B] 
7:[B] 
11:[D,E] 
12:[D,E] 
13:[C] 
13:[D] 
14:[C] 
… 
{3:[A,C]} 
{4:[A,C,D]} 
{5:[A,B,D]} 
{6:[A,B]} 
{11:[D,E]} 
{12:[D,E]} 
{13:[C,D]} 
A B C D E 
A 0 1 1 1 0 
B 1 0 0 1 0 
C 1 0 0 1 0 
D 1 1 1 0 1 
E 0 0 0 1 0
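The merge-and-sort step can be sketched in plain Python: each subset is counted independently, the sorted per-subset records are merged by vertex, and the polygons that meet at the same vertex become neighbors. This is a hedged sketch, not the authors' code. The vertex lists for D and E are assumptions chosen to be consistent with the counts shown on the slides, and the helper names `count_split` and `merge_and_reduce` are hypothetical.

```python
import heapq
from collections import defaultdict
from itertools import combinations, groupby

# Two independent subsets of the data (vertex lists for D and E are assumed).
split1 = {"A": [1, 2, 3, 4, 5, 6], "C": [3, 4, 13, 14, 15, 16]}
split2 = {
    "B": [5, 6, 7, 8, 9, 10],
    "D": [4, 5, 10, 11, 12, 13],
    "E": [11, 12, 17, 18, 19, 20],
}


def count_split(polygons):
    """Per-subset step: sorted (vertex, polygon) records, one per line."""
    return sorted((v, pid) for pid, verts in polygons.items() for v in set(verts))


def merge_and_reduce(*sorted_streams):
    """Merge sorted streams, group by vertex, and collect neighbor sets."""
    neighbors = defaultdict(set)
    merged = heapq.merge(*sorted_streams)
    for _, records in groupby(merged, key=lambda rec: rec[0]):
        owners = sorted({pid for _, pid in records})
        for p, q in combinations(owners, 2):
            neighbors[p].add(q)
            neighbors[q].add(p)
    return {pid: sorted(nbrs) for pid, nbrs in neighbors.items()}


result = merge_and_reduce(count_split(split1), count_split(split2))
for pid in sorted(result):
    print(pid, result[pid])
```

Because the merge only needs records with equal keys to end up adjacent, the subsets never have to be loaded together, which is what makes the counting approach a natural fit for a sort-based shuffle.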
MapReduce Contiguity Weights Creation 
11 
[Figure: Input HDFS: Data divided into split1 to split4 -> one map task per split -> Sorted results1 and Sorted results2 -> two reduce tasks -> W.part0 and W.part1 -> DistCP -> W in Output HDFS]
MapReduce Contiguity Weights Creation –Cont. 
12 
Other Details:
• Input data (one polygon per line: polygon ID followed by its points), e.g.
A, 1,2,3,4,5,6
• Output data, *.gal file (two lines per polygon: ID and number of neighbors, then the neighbor IDs), e.g.
A 3
B C D
• Source code:
https://github.com/lixun910/mrweights
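The two record formats above can be sketched with a pair of small helpers. The function names `parse_input_line` and `gal_records` are illustrative assumptions, and the sketch covers only the per-polygon records shown on the slide, not any file header.

```python
def parse_input_line(line):
    """Parse 'A, 1,2,3,4,5,6' into ('A', [1, 2, 3, 4, 5, 6])."""
    fields = [f.strip() for f in line.split(",")]
    return fields[0], [int(f) for f in fields[1:] if f]


def gal_records(neighbors):
    """Yield two lines per polygon: 'id count', then the neighbor IDs."""
    for pid in sorted(neighbors):
        nbrs = sorted(neighbors[pid])
        yield f"{pid} {len(nbrs)}"
        yield " ".join(nbrs)


pid, vertices = parse_input_line("A, 1,2,3,4,5,6")
lines = list(gal_records({"A": ["D", "B", "C"]}))
print(pid, vertices)
print("\n".join(lines))
```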
Experiments 
13 
Original Data:
• Parcel data for the city of Chicago, United States
• 592,521 polygons
Artificial Big Data:
• Duplicate the original data several times, side by side
• For example: the 4x data set has 2,370,084 polygons
• The largest test data set is 32x the original data
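The sizes of the replicated data sets follow directly from the original polygon count. The short calculation below is only arithmetic on the figure given above; the replication factors are the ones used in the experiments.

```python
ORIGINAL_POLYGONS = 592_521  # Chicago parcel data, from the slide


def replicated_size(factor):
    """Polygon count after duplicating the original data `factor` times."""
    return ORIGINAL_POLYGONS * factor


sizes = {f"{k}x": replicated_size(k) for k in (1, 2, 4, 8, 16, 32)}
for label, count in sizes.items():
    print(f"{label:>3}: {count:,} polygons")
```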
Experiment 
14 
Test System
• Desktop Computer
• 2.93 GHz 8-core CPU, 16 GB memory, 100 GB HD and 64-bit Operating System
• Hadoop System
• Amazon Elastic MapReduce (EMR)
• 1 to 18 nodes of “C3 Extra Large” compute instances (7.5 GB memory, 14 EC2 Compute Units (4 cores x 3.5 units), 80 GB storage (2 x 40 GB SSD), 64-bit Operating System and 500 Mbps moderate network speed)
Experiment 
15 
Code/Application 
• Desktop version (Python)
• No parallelism
• Hadoop version (Python)
• Executed via the Hadoop Streaming pipeline
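Hadoop Streaming passes records between stages as text lines, by default with a tab separating key and value. A map and reduce stage for the queen count could look roughly like the sketch below. This is an assumption-laden illustration rather than the authors' scripts: the function names are hypothetical, and the sort step is simulated in memory with `sorted`.

```python
from itertools import groupby


def map_lines(lines):
    """Map stage: 'A, 1,2,3' -> 'vertex<TAB>polygon' lines."""
    for line in lines:
        fields = [f.strip() for f in line.split(",")]
        pid, vertices = fields[0], [f for f in fields[1:] if f]
        for v in vertices:
            yield f"{v}\t{pid}"


def reduce_lines(sorted_lines):
    """Reduce stage: group adjacent lines by vertex, keep shared vertices."""
    keyed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for vertex, group in groupby(keyed, key=lambda kv: kv[0]):
        owners = sorted({pid for _, pid in group})
        if len(owners) > 1:
            yield f"{vertex}\t{','.join(owners)}"


# Stand-in for the framework's shuffle: sort the mapper output by key.
shuffled = sorted(map_lines(["A, 1,2,3,4,5,6", "B, 5,6,7,8,9,10"]))
shared = list(reduce_lines(shuffled))
print(shared)
```

In an actual streaming job each stage would read from standard input and print to standard output; the reducer only relies on equal keys arriving next to each other, not on numeric ordering.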
Experiment-1 
16 
PC vs. Hadoop
• Data: 1x, 2x, 4x, 8x, 16x and 32x data
• Hadoop setup: 6 nodes of C3.xlarge
Experiment-2 
17 
Hadoop with different numbers of nodes on 32x data
• Hadoop setup: 6, 12, 14, 18 nodes of C3.xlarge
Integrate to Weights Creation Web Service 
18 
HPC Pool & Hadoop
Threshold to trigger Hadoop Weights Creation: 2 million polygons
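The dispatch rule can be sketched as a single comparison against the threshold. Everything beyond the 2 million figure is an assumption here: the slide does not say whether the threshold is inclusive, and the backend labels and function name are hypothetical.

```python
HADOOP_THRESHOLD = 2_000_000  # polygons, from the slide


def choose_backend(n_polygons, threshold=HADOOP_THRESHOLD):
    """Pick where to create weights (assumption: threshold is inclusive)."""
    return "hadoop" if n_polygons >= threshold else "hpc_pool"


print(choose_backend(592_521))    # original Chicago data
print(choose_backend(2_370_084))  # 4x data
```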
Issues 
19 
• This algorithm won’t work when spatial neighbors do not share points or edges (it requires shared points to be exactly the same)
• This algorithm can’t generate distance based weights
• Potential solution
• Use MapReduce r-tree (SpatialHadoop)
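The exact-match requirement can be illustrated with coordinates used as dictionary keys. In this sketch (hypothetical data and function name), two squares share a border at x = 0.3; when one polygon stores that coordinate as the result of 0.1 + 0.2, the floating-point value differs in the last digits and the shared points are no longer found.

```python
from collections import defaultdict


def shared_points(polygons):
    """Return the points used by more than one polygon (exact key match)."""
    owners = defaultdict(set)
    for pid, coords in polygons.items():
        for xy in coords:
            owners[xy].add(pid)
    return {xy: sorted(p) for xy, p in owners.items() if len(p) > 1}


exact = {
    "A": [(0.0, 0.0), (0.3, 0.0), (0.3, 1.0), (0.0, 1.0)],
    "B": [(0.3, 0.0), (1.0, 0.0), (1.0, 1.0), (0.3, 1.0)],
}
drifted = {
    "A": exact["A"],
    # Same border, but x was computed as 0.1 + 0.2, which is not exactly 0.3.
    "B": [(0.1 + 0.2, 0.0), (1.0, 0.0), (1.0, 1.0), (0.1 + 0.2, 1.0)],
}

print(shared_points(exact))
print(shared_points(drifted))
```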
Conclusion 
• Contribution: a MapReduce algorithm to create 
contiguity weights matrix for big spatial data 
• Ongoing work: use existing MapReduce r-tree to solve 
the potential issues of this algorithm 
20
Thanks! 
21


Editor's Notes

  • #3: Hot topic. Much research has focused on creating a cyber-framework. Computing resources include computing grids, supercomputers, HPC, cloud computing platforms, etc. 5 important components. SA provides scientists the ability to analyze big data statistically.
  • #4: Is a process of.. Spatial weights are an essential part of spatial analysis since they represent the geographic structure of spatial objects. For example,.. However, current data structures and algorithms are based on a single desktop computer architecture. Some research has tried to parallelize spatial analysis, but it is still not capable of dealing with big data. And no one talks about creating spatial weights, which is the first step to solve this problem.
  • #5: Spatial Weights. Create Spatial Weights. What is W? W is most often represented using a matrix, called the weights matrix. Each cell value represents the spatial relationship between objects i and j. If the cell value is zero, the two objects have no spatial relationship in this weights matrix. A contiguity weights matrix is a binary matrix: a value of 1 means two objects are contiguous, i.e. they are neighbors. A distance weights matrix uses the actual distance between two objects.
  • #6: An r-tree works by grouping nearby objects using their bounding boxes at different hierarchical levels for a fast search. For each spatial object, it takes O(log n) time to find candidate neighbors. An r-tree has a faster search time than the binning algorithm, but it takes longer to create an r-tree index. So the binning algorithm is more practical than the r-tree.
  • #7: However, finding a buffer zone takes extra time, and since the geometries have irregular shapes, most of the time it’s hard to find a proper buffer zone. Another solution, which we are trying now, is using the MapReduce r-tree; we can talk about it later.
  • #12: HDFS: Hadoop Distributed File System
  • #17: Since Hadoop spends extra time delivering the program and communicating with running nodes, it is actually slower than running the same program on the desktop computer for datasets smaller than 4 times the raw data (2 million). However, the bigger the data, the better the performance this algorithm can achieve on the Hadoop system. For example, for the 8x data, the algorithm on Hadoop took 167 seconds to complete, much faster than on the desktop computer (482.67 seconds). The PC can’t handle 16x data 8 million. The running time increases linearly, which means this algorithm can be scaled up with growing data size.
  • #18: The best performance across all tests came from using 18 compute nodes in Hadoop to create the contiguity weights file from the 32x data in 163 seconds. The running time does not decline linearly with the increasing number of computing nodes. This is reasonable, since a larger number of computing nodes needs extra time to communicate inside the Hadoop system.
  • #19: Web Processing Service (WPS)
  • #21: We demonstrate the capability and efficiency of this algorithm by generating the weights file for big spatial data using Amazon’s Hadoop system.
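The 8x comparison quoted in note #17 can be expressed as a speedup ratio. The two timings are the ones given in that note; the calculation itself is just a division.

```python
desktop_seconds = 482.67  # 8x data on the desktop computer (note #17)
hadoop_seconds = 167.0    # 8x data on the 6-node Hadoop setup (note #17)

speedup = desktop_seconds / hadoop_seconds
print(f"Speedup on 8x data: {speedup:.2f}x")
```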