Geospatial Open Data and Urban Growth Modelling for
Evidence-based Decision Making in perspective of Smart
Cities
PIYUSH YADAV
17/01/2020 © Lero 2015 2
About me
❑ Researcher at Insight Center for Data
Analytics and Lero Software Research Centre at NUI
Galway (NUIG)
❑ Researcher- CTO at Tata Research Development and
Design Centre (TRDDC) which is part of TCS
Innovation Lab , Member of project in collaboration
with IIT Bombay.
❑ M.Tech. (CSE) with specialization in information
security at IIIT Delhi in 2013, Research Assistant
McGill Univ. Canada.
❑ Research Interest : Complex Event Processing,
Video Analytics, Distributed Systems, Machine
Learning, Smart Cities, GIS and Remote Sensing
❑ Publications : 17 Conference Papers, 1 Journal, 1
Book Chapter, 6 Posters, 2 Patents Filed , 1 Industry
Report (Dell)
Outline
• Learning Outcomes
• Geospatial Data
• Classification for Satellite Images
• Case Study: Urban Growth Modelling
• Multi-source Open Data Management
• Quality Issues in Multi-source Open Data
• Techniques for data preparation and cleaning
• Assignments
Learning Outcomes
You will learn:
• The importance of geospatial data and Land Use Land Cover in the development of Smart Cities
• Fundamentals of satellite image classification
• How to model urban growth and predict the future growth of a city
• The importance of Open Data in Smart Cities
• The nature and types of data issues in (Open) Data
• Techniques for identifying data quality issues
• Data preparation and cleaning strategies (e.g., data clustering, filtering, etc.)
Copernicus Hackathon Ireland 2019
• Last year, 3 teams from this class participated.
• 2 of the teams won prizes.
Air Quality
Aftab Alam, Nikhil Nambiar, Vignesh Kamath
https://prezi.com/view/iZEygJaFnxqAJH7lR9TM/
Smart Agriculture
Geospatial Data
Geospatial data, or spatial data as it is sometimes known, is information that has a geographic aspect to it:
➢ Coordinates: Lat Long
➢ Postal Address
➢ Physical Features
Types
Vector - This form uses points, lines, and polygons to represent spatial features such as cities, roads, and streams.
Raster - This form uses cells (computers often use dots or pixels) to represent spatial features (our focus in this lecture).
https://www.bolton-menk.com/books/lindsey/Lindsey.html
Satellite Imagery: Basics
How we see colour Electromagnetic Spectrum
• The electromagnetic (EM) spectrum describes the continuous range of energy from high-energy gamma rays and X-rays to very low-energy microwaves and radio waves.
• Visible light, or light that our eyes can detect, is just a small portion of the EM spectrum.
• Satellites collect data by passing the energy reflected from the Earth through filters that separate it into small windows of the EM spectrum, producing discrete spectral bands (a raster image).
Satellite Imaging
https://landsat.usgs.gov/atmospheric-transmittance-information
Image Bands/Channels
An image consists of multiple bands from this electromagnetic spectrum.
• Normal image: 3 bands (Red, Green, Blue)
• Multispectral: 3-10 bands
• Hyperspectral: 100-1000 bands (nm)
http://www.splibtarang.com/index.php
Stack of Bands ~ Tensor
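The "stack of bands ~ tensor" idea can be sketched in a few lines. This is only an illustration in plain Python: the three 2x2 bands and their values are made up, and the helper names `stack_bands` and `pixel_vector` are hypothetical.

```python
# Three tiny 2x2 single-band rasters (hypothetical reflectance values).
red = [[0.10, 0.12], [0.30, 0.28]]
green = [[0.20, 0.22], [0.25, 0.27]]
nir = [[0.60, 0.58], [0.15, 0.14]]


def stack_bands(*bands):
    """Stack equally sized 2D bands into a (bands, rows, cols) nested list."""
    rows, cols = len(bands[0]), len(bands[0][0])
    for band in bands:
        if len(band) != rows or any(len(r) != cols for r in band):
            raise ValueError("all bands must have the same shape")
    return [[list(r) for r in band] for band in bands]


def pixel_vector(stack, row, col):
    """Return the spectral feature vector of one pixel: one value per band."""
    return [band[row][col] for band in stack]


stack = stack_bands(red, green, nir)
shape = (len(stack), len(stack[0]), len(stack[0][0]))
print(shape)                      # (3, 2, 2)
print(pixel_vector(stack, 0, 0))  # [0.1, 0.2, 0.6]
```

Each pixel then carries one value per band, which is the feature vector a classifier works on.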
LANDSAT Satellite Images
• The Landsat program is the longest-running enterprise for acquisition of satellite imagery of Earth by NASA
• 8 satellites so far
• Landsat 1 launched in 1972, Landsat 7 in 1999, Landsat 8 in 2013
• Data can be downloaded from: https://earthexplorer.usgs.gov/
[Figures: Landsat 7 bands; Landsat 8 bands; scan line correction issue in Landsat 7 (2003); other Earth observation satellites]
Pre-processing of Landsat Image
Atmospheric Correction
• Electromagnetic radiation captured by the satellite sensors is affected by atmospheric interference such as scattering, dispersion, etc.
• Subtract the digital number (DN) of water pixels in band 4 (infrared band), as it has very low water-leaving radiance (Cracknell 2007).
• DN values were then converted to spectral radiance (Kaufman 1989).
$$L = L_{min} + \left(\frac{L_{max}}{254} - \frac{L_{min}}{255}\right) \times DN$$
where $L$ is the spectral radiance and $L_{min}$, $L_{max}$ are the minimum and maximum spectral radiance of the band.
Solar Correction
• For clear Landsat images, solar correction of the images was done by converting spectral radiance to exoatmospheric reflectance (Kaufman 1989).
$$\rho_p = \frac{\pi \cdot L_{\lambda} \cdot d^{2}}{ESUN_{\lambda} \cdot \cos\theta_s}$$
where $\rho_p$ is the exoatmospheric reflectance, $L_{\lambda}$ the spectral radiance, $d$ the Earth-Sun distance in astronomical units, $ESUN_{\lambda}$ the mean solar exoatmospheric irradiance, and $\theta_s$ the solar zenith angle.
References
Cracknell, A. (2007). Atmospheric Corrections to Passive Satellite Remote Sensing Data. In A. Cracknell, Introduction To Remote Sensing, Second Edition (p. 196). CRC Press. Retrieved September 1, 2015
Kaufman, Y. J. (1989). The atmospheric effect on remote sensing and its correction. In Theory and applications of optical remote sensing (pp. 336-428).
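The two conversions on this slide (DN to spectral radiance, then radiance to exoatmospheric reflectance) can be sketched as below. This is a rough illustration, not the processing chain used in the study: the calibration constants `L_MIN`, `L_MAX`, `ESUN` and the sun elevation are made-up values, and the function names are hypothetical. Real values come from the scene metadata.

```python
import math


def dn_to_radiance(dn, l_min, l_max):
    """Spectral radiance: L = L_min + (L_max/254 - L_min/255) * DN."""
    return l_min + (l_max / 254.0 - l_min / 255.0) * dn


def radiance_to_reflectance(radiance, esun, sun_elevation_deg, earth_sun_dist_au=1.0):
    """Reflectance: rho = pi * L * d^2 / (ESUN * cos(theta_s)).

    theta_s is the solar zenith angle, i.e. 90 degrees minus the sun elevation.
    """
    zenith = math.radians(90.0 - sun_elevation_deg)
    return math.pi * radiance * earth_sun_dist_au ** 2 / (esun * math.cos(zenith))


# Hypothetical calibration values for a single band (assumptions, not scene data).
L_MIN, L_MAX, ESUN = -5.0, 241.1, 1047.0

radiance = dn_to_radiance(100, L_MIN, L_MAX)
reflectance = radiance_to_reflectance(radiance, ESUN, sun_elevation_deg=60.0)
print(round(radiance, 2), round(reflectance, 3))
```

In practice the same two functions would be applied to every pixel of every band before classification.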
Pre-processing of Landsat Image
Band 1, Band 2, Band 3, … (R, G, B, …) converted to reflectance.
View using KML on Google Earth. Download the file from the link below:
https://drive.google.com/drive/folders/1KGQmkZ7bN2M-ED31sDNWVtX29VntfzWs
Classify Landsat Image (Supervised Learning)
Create Training Data
Class ID   Class Name                      Location (x, y)
1          Vegetation
2          Impervious Surface (Built Up)
3          Soil
4          Water
Train Model
• Maximum Likelihood
• SVM
• DNN
[Figure: Spectral signature for different classes]
Classify
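The create-training-data, train, classify loop can be sketched with a small Gaussian maximum likelihood classifier, the first of the model options listed. This is a simplified illustration: it assumes the bands are independent (diagonal covariance), uses made-up two-band training samples, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def fit_gaussian_ml(samples):
    """Fit per-class band means and variances from (class_name, [band values]) pairs."""
    by_class = defaultdict(list)
    for label, vector in samples:
        by_class[label].append(vector)
    model = {}
    for label, vectors in by_class.items():
        n = len(vectors)
        columns = list(zip(*vectors))
        means = [sum(col) / n for col in columns]
        # Small floor keeps the variance strictly positive.
        variances = [
            max(sum((v - m) ** 2 for v in col) / n, 1e-6)
            for col, m in zip(columns, means)
        ]
        model[label] = (means, variances)
    return model


def log_likelihood(vector, means, variances):
    """Log-likelihood of a pixel under independent per-band Gaussians."""
    return sum(
        -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
        for x, mu, var in zip(vector, means, variances)
    )


def classify(model, vector):
    """Assign the class whose Gaussian gives the pixel the highest likelihood."""
    return max(model, key=lambda label: log_likelihood(vector, *model[label]))


# Hypothetical training pixels: (class name, [red, near-infrared] reflectance).
training = [
    ("Vegetation", [0.05, 0.45]), ("Vegetation", [0.06, 0.50]), ("Vegetation", [0.04, 0.48]),
    ("Water", [0.03, 0.02]), ("Water", [0.04, 0.01]), ("Water", [0.02, 0.03]),
    ("Soil", [0.25, 0.30]), ("Soil", [0.28, 0.33]), ("Soil", [0.22, 0.31]),
]
model = fit_gaussian_ml(training)
print(classify(model, [0.05, 0.47]))
```

SVM or DNN classifiers follow the same pattern: labelled pixel vectors in, one class label per pixel out.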
Classified Image
Case Study: Urban Growth Modelling
World population is growing
Increased economic activities
Increased urban growth rate
Urban growth
Change in Land Use Land Cover
[Figure: An aerial view of urban growth in 2006 and 2014]
Land Use Land Cover Change (LULCC)
• A key aspect of urban growth is its effect on land use land cover change.
• Land cover indicates the physical land type, such as forest or open water.
• Land use documents how people are using the land, such as for agriculture.
Factors Affecting Land Use Land Cover
Spatial Factors
• Predominantly change over space but remain relatively static with respect to time.
• Example: Digital Elevation Model (DEM)
Spatio-temporal Factors
• Change over both time and space.
• Example: proximity to the primary roads
Temporal Factors
• Change over time but are spatially static for a given study area.
• Example: national Gross Domestic Product (GDP)
[Diagram: Direct Factors and Indirect Factors driving Land Use Land Cover Change]
Urban Growth Models
Urban growth models are used to predict land use land cover (LULC) changes. LULC modelling is extremely difficult due to complex interactions between multi-scale factors.
Lattice-based spatio-temporal models, e.g. Cellular Automata (CA) and Logistic Regression (LR), are therefore effectively used to model spatial geographic processes.
In a Markov Chain (MC) model, LULC images from two distinct time instances are taken, and the frequency of change from one LULC class to another is used to compute the transition probability matrix.
[Figure: Schematic of an integrated Markov Chain model]
Limitation: Persistent Growth Rate
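Computing a transition probability matrix from two classified images can be sketched as follows. The two 3x3 label grids are made-up examples using the V, I, S class codes, and `transition_matrix` is a hypothetical helper, not the tool used in the study.

```python
from collections import Counter

CLASSES = ["V", "I", "S"]  # Vegetation, Impervious Surface, Soil


def transition_matrix(image_t1, image_t2, classes=CLASSES):
    """Estimate P(class at t2 | class at t1) from pixel-wise change frequencies."""
    counts = Counter()
    for row1, row2 in zip(image_t1, image_t2):
        for before, after in zip(row1, row2):
            counts[(before, after)] += 1
    matrix = {}
    for before in classes:
        total = sum(counts[(before, after)] for after in classes)
        matrix[before] = {
            after: (counts[(before, after)] / total if total else 0.0)
            for after in classes
        }
    return matrix


# Hypothetical classified images of the same area at two time instances.
lulc_t1 = [["V", "V", "S"], ["S", "I", "V"], ["S", "S", "I"]]
lulc_t2 = [["V", "S", "S"], ["I", "I", "V"], ["S", "I", "I"]]

P = transition_matrix(lulc_t1, lulc_t2)
for before in CLASSES:
    print(before, P[before])
```

Because the matrix is computed once from a fixed pair of years, the model carries the same rates forward, which is the persistent growth rate limitation noted above.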
Our Contribution
• Hidden Markov Model: Introduction of Hidden Markov Model (HMM).
• Temporal Factors: Incorporate temporal factors in LULC change modelling using HMM. Model the underlying temporal factors as Gaussian distributions, conditioned on the hidden states, to learn land cover type transition probabilities.
• Integrate: Integrate our model with other spatio-temporal models such as Logistic Regression (LR) to yield richer integrated models than the corresponding MC based integrated models.
[Figure: An urban growth model with multi-scale direct and indirect factors impacting LULC changes]
Our Model
A Hidden Markov Model with hidden states
(V, I, S) and sample emissions (GDP and
Liquidity)
Proposed urban growth model: HMM
integrated with Logistic Regression model
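To make the HMM structure concrete, here is a hand-rolled sketch of the forward algorithm for an HMM with hidden states V, I, S and one Gaussian emission (think of a single temporal factor such as GDP growth). All parameter values are invented for illustration and the function names are hypothetical. The study itself used a library implementation and more than one temporal factor.

```python
import math

STATES = ["V", "I", "S"]


def gaussian_pdf(x, mean, std):
    """Density of a 1D normal distribution."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def forward_likelihood(observations, start, trans, emissions):
    """Forward algorithm: likelihood of the observations under the HMM."""
    alpha = {s: start[s] * gaussian_pdf(observations[0], *emissions[s]) for s in STATES}
    for x in observations[1:]:
        alpha = {
            s: gaussian_pdf(x, *emissions[s]) * sum(alpha[r] * trans[r][s] for r in STATES)
            for s in STATES
        }
    return sum(alpha.values())


# Hypothetical parameters (assumptions, not values learned in the study).
start = {"V": 0.5, "I": 0.2, "S": 0.3}
trans = {
    "V": {"V": 0.7, "I": 0.1, "S": 0.2},
    "I": {"V": 0.0, "I": 0.9, "S": 0.1},
    "S": {"V": 0.1, "I": 0.3, "S": 0.6},
}
emissions = {"V": (4.0, 1.0), "I": (8.0, 1.0), "S": (6.0, 1.0)}  # (mean, std) per state

gdp_growth = [4.2, 5.9, 7.8]  # made-up yearly observations
likelihood = forward_likelihood(gdp_growth, start, trans, emissions)
print(likelihood)
```

Training an HMM means adjusting `trans` and `emissions` so that this likelihood of the observed temporal factors becomes as large as possible.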
Study Area: Pune
• Tier-A city situated in the state of Maharashtra, India.
• Located 560 m above sea level.
• Famous for its Information Technology and Automobile industries and various research institutes.
• A 45 sq. km area of the city that has undergone rapid urbanisation was considered.
Temporal Growth Factors
• Gross Domestic Product: National. Amount of goods and services produced within the borders of a country in a specific time interval.
• Interest Rate Cycle: National. Revised bimonthly. A tight monetary policy affects overall investment, which leads to a slowdown, and vice versa.
• Consumer Price Index: National. Low inflation creates an environment for developmental investment.
• Gross Fixed Capital Formation: National. Amount that the government spends on capital formation (such as infrastructure building, land improvements) in the country. The greater the GFCF investment, the higher the rate of urbanization.
• Urban Population Growth Rate: National. In order to accommodate a higher influx of people, cities are expanding along their outskirts, leading to growth in the urban agglomerate.
• Electricity Consumption: Regional. Typically, regions with higher electricity demand grow faster than those with lower demand.
• Road Length Added: Regional. Better connectivity of a region helps in better transportation and thus provides impetus to growth by allowing the setup of new industrial complexes and other infrastructure services.
Temporal Growth Factors Data
[Charts: GDP growth rate (%); absolute average CPI inflation (%); gross fixed capital formation (% GDP); urban population growth rate (%); bimonthly interest (repo) rate (%); per capita electricity consumption in kilowatt-hours]
Land Use Land Cover (LULC) Data
LULC data is required as input for the HMM hidden states and the LR models.
Landsat-7 Specifications
Time period   Yearly, 2001 to 2014 (between March and April)
Latitude      18.38847838°N - 18.79279909°N
Longitude     73.64552005°E - 74.07494971°E
Bands         1 to 7
Resolution    30 m
Pixels        1500 x 1500
LULC Data Pre-processing
• Atmospheric Correction: explained earlier
• Solar Correction: explained earlier
• Scan Line Correction (SLC): in 2003 the Scan Line Corrector in Landsat 7's ETM+ instrument developed a fault, creating black lines in the captured images. These were handled by image smoothing using windowing.
LULC Data Classification
Classes
• Classified into seven broad LULC classes on the basis of the nature of the landscape.
• Forest Canopy, Agriculture Area, Residential Area, Industrial Area, Common Open Area, Burnt Grass, Bright Soil, and Water Body.
SVM Classification
• For classification, a labeled set of pixels for each class of interest was collected (500 to 3000 samples per class). The feature vector for each pixel consisted of all seven band values.
• Support Vector Machines
• Manual Correction (Concrete and Quarry)
VIS Classes
• Vegetation, Impervious Surface, and Soil
A Quick Recap
LULC Data
Spatio-Temporal Factors
• Digital Elevation Model (DEM) and Slope (CARTOSAT 1)
• Proximity to primary roads
• Mask: water bodies were masked out from the LULC image
[Figures: 3D view; DEM image; primary road layers]
Results
HMM Experiments
• Used the Gaussian HMM library in scikit-learn
• We designed an HMM with three hidden states (V, I, and S) and the temporal factors as emissions
• The HMM was initialized with the MC transition probabilities for 2001 to 2002
• A stable model was obtained empirically after 50000 iterations with a convergence threshold of less than 0.01
[Figures: Computed MC transition probabilities for 2001-2002; learned HMM transition probabilities for 2014; computed MC transition probabilities for 2014]
Results
Land Change Modelling Experiments
• Used TerrSet's Land Change Modeler.
• Transition sub-models were defined for four LC change types: V to S, V to I, S to V, and S to I.
• Slope gradient and the primary roads layer were used as the primary driver variables.
$$\text{suitability} = \frac{1}{(\text{slope gradient})^{0.1}}$$
• Suitability map: the greater the value, the higher the suitability, and vice versa.
• Suitability for urbanization is high in areas such as roads, the low-lying river basin, and around urbanized areas where the slope gradient is low.
• Towards the south end the suitability drops significantly, as the area has hills and valleys.
• The four sub-models were built using Logistic Regression.
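A suitability map of this kind can be sketched as below, reading the slide's formula as 1 / (slope gradient)^0.1. That reading, the clamp for perfectly flat cells, and the slope values are all assumptions made for illustration.

```python
def suitability(slope_gradient, exponent=0.1, min_slope=1e-3):
    """suitability = 1 / slope_gradient ** exponent, so flatter cells score higher."""
    # Assumption: clamp very small slopes to avoid dividing by zero on flat cells.
    slope = max(slope_gradient, min_slope)
    return 1.0 / slope ** exponent


# Hypothetical slope gradients for a 2x2 patch of the study area.
slope_grid = [[0.5, 2.0], [10.0, 30.0]]
suitability_map = [[suitability(s) for s in row] for row in slope_grid]

for row in suitability_map:
    print([round(v, 3) for v in row])
```

The small exponent flattens the curve, so suitability falls off gently as the terrain gets steeper.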
Results
[Figure: Heat maps depicting transition probabilities from one state to another: Soil to Impervious, Soil to Vegetation, Vegetation to Impervious, Vegetation to Soil]
Results
• The two models were then used to predict changes for the year 2014.
[Figures: Actual land cover image of 2014 obtained from classification; predicted land cover image of 2014 (HMM-LR); predicted land cover image of 2014 (MC-LR)]
• Visually, the HMM based predicted image is significantly more similar to the actual classified LC image than the MC based predicted image.
Results
Precision and Recall for integrated models
            HMM-LR              MC-LR
            V     I     S       V     I     S
Precision   0.48  0.49  0.60    0.54  0.38  0.34
Recall      0.48  0.52  0.59    0.54  0.32  0.39
• Blob analysis of urban and non-urban regions. Blobs denote concentrated urban regions.
• Green blobs are true positives, blue blobs are false negatives, and red blobs are false positives.
• HMM-LR false positives are smaller in size and less dense than those of MC-LR. The HMM output is well balanced and resembles the actual output better.
• An increase of 11 percentage points (0.38 to 0.49) in the precision of the persistence of Impervious Surface (I) is observed.
• Precision of the Soil (S) class type increased by 26 percentage points (0.34 to 0.60).
• Precision of the Vegetation (V) class type dropped by a marginal 6 percentage points (0.54 to 0.48). This is because vegetation cover is the outcome of a relatively easy process compared with S and I.
[Figure: Blob analysis of urban areas. Left to right: (i) Actual, (ii) MC-LR, (iii) HMM-LR]
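Per-class precision and recall of the kind reported on this slide can be computed by comparing the actual and predicted class of every pixel. The sketch below uses two short made-up label sequences, not the study's images, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(actual, predicted, label):
    """Precision and recall of one class from paired actual/predicted labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == label and p == label)
    fp = sum(1 for a, p in pairs if a != label and p == label)
    fn = sum(1 for a, p in pairs if a == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical flattened land cover images (one label per pixel).
actual = ["V", "V", "I", "I", "S", "S", "S", "I"]
predicted = ["V", "S", "I", "I", "S", "I", "S", "V"]

for label in ("V", "I", "S"):
    p, r = precision_recall(actual, predicted, label)
    print(label, round(p, 2), round(r, 2))
```

Precision penalises false positives (red blobs), while recall penalises false negatives (blue blobs).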
Conclusion
• Markov Chain (MC) models are limited in their urban prediction capabilities because they assume a constant rate of persistence of land cover class types and cannot model temporal factors.
• We have proposed a new temporal model using a Hidden Markov Model.
• We have demonstrated the usefulness of our model over MC by predicting urban growth for an upcoming city of India (Pune).
• Precision improved by 11 and 26 percentage points for the Impervious Surface and Soil classes respectively.
• We believe that this inquiry into HMM based models provides yet another tool that will equip urban modelers, planners and decision makers to better design sustainable urban environments.
https://www.researchgate.net/publication/327745849_Computational_Model_for_Urban_Growth_Using_Socioeconomic_Latent_Parameters
Open Data
10030 112
https://data.gov.ie/stats
How is Open Data being used?
Engagement/Innovation
https://www.mapalerter.com/
Data Modelling / Decision-Making
http://exceedence.com/monetising-metocean-data-an-open-data-project/
Monitoring / Planning
Quality and Qualifications Ireland
http://infographics.qqi.ie/
Sustainability / Mobility
https://citybik.es/
Open Data Management Challenge
From Data to Smart Data
[Diagram labels: Data Sources; Predictive Analytics; User Awareness; Recommendations; Smart Apps; Open Data Management; Data Modeling]
Open Data Management steps: Collection, Aggregation, Enrichment, Linking, Classification, Cleaning, Integration, Storing, Querying
Is this data good enough for creating accurate and reliable apps?
Open Data Management Challenge
Open Data quality can make designing apps and decision support models very challenging
Open Data can have multiple issues: missing values, different formats, irregular timestamps, abnormal values, etc.
Data preparation, such as filtering and classification, is an important step before further analysis
Data is often incomplete and requires combining multiple data sources
Case Specifics
Data Preparation for Building a map of Playing Pitches around Dublin
The data is available on https://data.gov.ie
Different
Formats
And even more challenges!
Different
Formats
Different
Attributes
Missing
Values
Objective: Create a good quality dataset from these resources!
What is good quality data?
A Conventional Definition of Data Quality
Good quality data are: Accurate, Complete, Unique, Up-to-date, and Consistent; meaning …
Accurate means …
Are we storing correct values?
➔ Values in the data entries should be consistent: same form or value representation
Example: What issues can you identify from this table?
Sensor   Timestamp             Value   Location
M1n      12/01/2018T10:03:59   12.3    Galway
M3n      1452592980000         9.5     GA
M5n      01/12/2018 10:03      1.55    NUIG
Accurate means …
Possible solution
Create a Unified Data Model
Do you have access to the data source?
• Yes: Adjust sources to send data using your model
• No: Convert your data before further processing
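The "convert your data before further processing" branch can be sketched for the three timestamp styles in the example table. This is only an illustration: it assumes day-before-month ordering for the textual dates, treats the long number as epoch milliseconds in UTC, and `normalize_timestamp` is a hypothetical helper.

```python
from datetime import datetime, timezone

ISO_FORMAT = "%Y-%m-%dT%H:%M:%S"


def normalize_timestamp(raw):
    """Convert one of the three timestamp styles in the table to a single ISO 8601 form."""
    raw = raw.strip()
    if raw.isdigit():
        # Epoch milliseconds, e.g. 1452592980000 (assumed to be UTC).
        return datetime.fromtimestamp(int(raw) / 1000, tz=timezone.utc).strftime(ISO_FORMAT)
    # Assumption: textual dates are day/month/year.
    for fmt in ("%d/%m/%YT%H:%M:%S", "%d/%m/%Y %H:%M"):
        try:
            return datetime.strptime(raw, fmt).strftime(ISO_FORMAT)
        except ValueError:
            continue
    raise ValueError(f"unrecognised timestamp format: {raw!r}")


for raw in ("12/01/2018T10:03:59", "1452592980000", "01/12/2018 10:03"):
    print(raw, "->", normalize_timestamp(raw))
```

Note that "01/12/2018" is ambiguous (1 December or 12 January), which is exactly why a unified data model should fix one representation.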
Complete means …
Does the data contain everything it is supposed to contain?
Example: What issues can you identify from this table?
Sensor   Timestamp             Value   Location
M1n      08/01/2018T00:00:00   32.5    NEB, NUIG
M1n      09/01/2018T00:00:00   21.2
M1n      10/01/2018T00:00:00   26.1    NEB, NUIG
M1n      12/01/2018T00:00:00   23.5    NEB, NUIG
M1n      13/01/2018T00:00:00           NEB, NUIG
M1n      14/01/2018T00:00:00   26.1    NEB, NUIG
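A completeness check for a daily sensor series like the one in the table can be sketched as follows. The readings mirror the table, with `None` standing in for the blank cells, and the helper names are hypothetical.

```python
from datetime import date, timedelta

readings = [
    {"day": date(2018, 1, 8), "value": 32.5, "location": "NEB, NUIG"},
    {"day": date(2018, 1, 9), "value": 21.2, "location": None},
    {"day": date(2018, 1, 10), "value": 26.1, "location": "NEB, NUIG"},
    {"day": date(2018, 1, 12), "value": 23.5, "location": "NEB, NUIG"},
    {"day": date(2018, 1, 13), "value": None, "location": "NEB, NUIG"},
    {"day": date(2018, 1, 14), "value": 26.1, "location": "NEB, NUIG"},
]


def missing_days(rows):
    """Days between the first and last reading that have no row at all."""
    days = sorted(r["day"] for r in rows)
    present = set(days)
    span = (days[-1] - days[0]).days
    candidates = (days[0] + timedelta(days=n) for n in range(span + 1))
    return [d for d in candidates if d not in present]


def missing_fields(rows, fields=("value", "location")):
    """(day, field) pairs where a row exists but a field is empty."""
    return [(r["day"], f) for r in rows for f in fields if r[f] is None]


print(missing_days(readings))
print(missing_fields(readings))
```

The two checks catch different problems: a whole record that never arrived, and a record that arrived with gaps.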
Unique means …
Do the data entries appear only once?
➔ This issue generally appears when manual entries are allowed in the dataset
Example: What issues can you identify from this table?
Surname   Firstname   DoB        Driving test passed
Smith     J.          17/12/85   17/12/05
Smith     Jack        17/12/85   17/12/2005
Smith     Jock        17/12/95   17/12/2005
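One simple way to flag likely duplicates in a table like this is to normalise each record to a comparison key. The sketch below is a deliberately naive rule (same surname, same first initial, same dates after parsing). The rule and the helper names are assumptions for illustration, not a general record-linkage method.

```python
from datetime import datetime
from itertools import combinations


def parse_date(text):
    """Parse dd/mm/yyyy or dd/mm/yy into a date."""
    for fmt in ("%d/%m/%Y", "%d/%m/%y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {text!r}")


def record_key(rec):
    """Normalised key: surname, first initial, and the two parsed dates."""
    return (
        rec["surname"].lower(),
        rec["firstname"][0].lower(),
        parse_date(rec["dob"]),
        parse_date(rec["test_passed"]),
    )


def likely_duplicates(records):
    """Index pairs of records that share the same normalised key."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if record_key(a) == record_key(b)
    ]


people = [
    {"surname": "Smith", "firstname": "J.", "dob": "17/12/85", "test_passed": "17/12/05"},
    {"surname": "Smith", "firstname": "Jack", "dob": "17/12/85", "test_passed": "17/12/2005"},
    {"surname": "Smith", "firstname": "Jock", "dob": "17/12/95", "test_passed": "17/12/2005"},
]
print(likely_duplicates(people))  # [(0, 1)]
```

The third row is not flagged because its date of birth differs. Whether "Jock, 17/12/95" is a typo of the same person or a different driver is a judgement that needs more context.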
Consistent means …
Does the data contain any logical errors or impossibilities?
Example: What issues can you identify from this table?
Sensor   Timestamp             Value   Location
M1n      08/01/2018T00:00:00   32.5    NEB, NUIG
M1n      09/01/2018T00:00:00   21.2    NEB, NUIG
M1n      10/01/2018T00:00:00   0       NEB, NUIG
M1n      11/01/2018T00:00:00   23.5    NEB, NUIG
M1n      12/01/2018T00:00:00   -1.23   NEB, NUIG
M1n      13/01/2018T00:00:00   26.1    NEB, NUIG
Are these errors? How can we identify them?
➔ Possible solutions: filtering and outlier detection.
Up-to-Date means …
Is the data updated regularly?
A sensor moved to a new location. What implications can this have?
Can you think of a case where it doesn’t matter whether or not the data are kept up to date?
Techniques for Data Preparation
Minimal Data Preparation Pipeline
Observation: Understanding the format of the data and its elements
Modeling: Identify relevant attributes and representation format
Quality Enhancement: Classification, Aggregation, Filtering, Enrichment, etc.
Step 1: Observation
This step involves the descriptive analysis (auditing) of individual data resources
Data observations can be:
– Highly structured: by having a predefined checklist of observational attributes (e.g., format, attributes, frequency, volume, language, etc.)
– Semi-structured: by having an ad hoc checklist of observational attributes
• Pros:
– Defines contextual information about the data
– Provides good and early insights into data quality issues
• Cons:
– Can be time consuming
Step 2: Modeling
This step involves the use of formal techniques for creating a data model
Examples of techniques: Object-Relational mapping, Relational model etc.
Methodologies:
– Top-down: predefined information about the data
– Bottom-up: results from a reengineering effort
Step 3.1: Classification
Data classification is the process of organizing data by categories
for refined and targeted analysis
➔Example: Water or Energy consumption for working days vs. non
working days
➔Categories depend on the intended use of the data
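The working day versus non-working day example can be sketched as below. The consumption figures are made up, weekends are assumed to be the non-working days, and `day_category` is a hypothetical helper that also accepts an optional set of holidays.

```python
from datetime import date


def day_category(day, holidays=frozenset()):
    """Label a date as 'working' or 'non-working' (weekends plus any given holidays)."""
    return "non-working" if day.weekday() >= 5 or day in holidays else "working"


# Hypothetical daily consumption readings.
consumption = [
    (date(2018, 1, 12), 23.5),  # Friday
    (date(2018, 1, 13), 8.1),   # Saturday
    (date(2018, 1, 14), 7.4),   # Sunday
    (date(2018, 1, 15), 25.0),  # Monday
]

by_category = {}
for day, value in consumption:
    by_category.setdefault(day_category(day), []).append(value)

print(by_category)
```

Once the readings are split this way, each category can be analysed or aggregated separately.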
Step 3.2: Aggregation
Data aggregation is a data mining process that summarizes the data with respect to certain criteria/dimensions.
Data aggregation helps increase search performance and facilitates data reporting and analysis
Types of aggregations: Sum, Count, Min/Max, Avg, etc.
Aggregation strategies and levels: temporal (hourly, daily, etc.), source-based (resources hierarchy), location-based (outlet, room, area, building), etc.
The level of aggregation depends on the available data and its intended use
Examples of useful aggregations: hourly traffic congestion level per road; quarterly price inflation
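The "hourly traffic congestion level per road" aggregation can be sketched as follows. The road names and congestion levels are made-up sample data, and `hourly_average` is a hypothetical helper.

```python
from collections import defaultdict
from datetime import datetime


def hourly_average(observations):
    """Average congestion level per (road, hour) from (road, timestamp, level) tuples."""
    buckets = defaultdict(list)
    for road, ts, level in observations:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[(road, hour)].append(level)
    return {key: sum(levels) / len(levels) for key, levels in buckets.items()}


# Hypothetical observations: (road, timestamp, congestion level between 0 and 1).
observations = [
    ("N6", datetime(2018, 1, 12, 8, 5), 0.5),
    ("N6", datetime(2018, 1, 12, 8, 40), 1.0),
    ("N6", datetime(2018, 1, 12, 9, 10), 0.25),
    ("R336", datetime(2018, 1, 12, 8, 15), 0.75),
]

hourly = hourly_average(observations)
for (road, hour), level in sorted(hourly.items()):
    print(road, hour.isoformat(), level)
```

Swapping the bucket key (day instead of hour, building instead of road) gives the other temporal and location-based levels listed above.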
Step 3.3: Filtering
Data filtering is the process of refining data sets by removing data items that do not comply with certain criteria
Example: Keep data with positive water consumption values
Filters depend on the context of the observations (negative values may be meaningful in installations where water flows in both directions on a pipe)
Filtering Types
Content-based Filtering
– Selecting data items based on their values (e.g., keep only positive values)
Policy-based Filtering
– Filtering rules are defined as constraints similar to access control mechanisms (e.g., for security reasons)
Statistical Filtering
– Identify a baseline for a content-based filtering
– Baselines are determined from historical data analysis
– Outliers detection
Hybrid Filtering
– Combination of filtering options
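Content-based, statistical, and hybrid filtering can be sketched in a few lines. The `history` and `incoming` values are invented, and the baseline rule (keep values within k standard deviations of the historical mean) is one possible choice, not the only one.

```python
from statistics import mean, stdev


def content_filter(values, predicate=lambda v: v > 0):
    """Content-based filtering: keep items whose value satisfies the predicate."""
    return [v for v in values if predicate(v)]


def statistical_filter(values, history, k=3.0):
    """Statistical filtering: keep values within k standard deviations of the historical mean."""
    mu, sigma = mean(history), stdev(history)
    return [v for v in values if abs(v - mu) <= k * sigma]


# Hypothetical historical readings used as the baseline, and a new batch to clean.
history = [21.2, 23.5, 26.1, 24.0, 22.8, 25.3]
incoming = [24.1, 0, 23.5, -1.23, 95.0]

positive_only = content_filter(incoming)
hybrid = statistical_filter(positive_only, history)
print(positive_only)  # [24.1, 23.5, 95.0]
print(hybrid)         # [24.1, 23.5]
```

Chaining the two filters is a small instance of hybrid filtering: the content rule drops impossible values, and the statistical rule drops values that are possible but far from the baseline.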
Step 3.3: Filtering (Outlier Detection)
Global outlier: a value inconsistent with the rest of the dataset
Local outlier (special outliers):
• Observations inconsistent with their neighborhoods
• A local instability or discontinuity
Causes of Outliers
➢ Low quality measurements: faulty collectors, manual errors, wrong calibrations of devices
➢ Network issues: problems with data transmission from data sources to the data management platform
➢ Missing values or redundant values: can create wrong aggregations
➢ Correct but exceptional data!
Outlier Detection Approaches
Deviation-based outlier detection
– Sequential exception
Distance-based outlier detection
– Index-based, nested-loop, cell-based, local-outliers
Statistical-based outlier detection
– Distribution-based, depth-based
Distance-based Outlier Detection
• General idea:
– Judge a point based on the distance to its neighbors
– Several variants proposed
• Basic Assumption:
– Normal data objects have a dense neighborhood
– Outliers are far apart from their neighbors
• Basic Model:
– Given a radius ε and a percentage π
– A point p is considered an outlier if at most π percent of all other points have a distance to p less than ε
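The basic model can be sketched with a brute-force (nested-loop) check, taking it in its usual form: a point is an outlier if at most π percent of the other points lie within radius ε of it. The sample points and parameter values are made up, and `db_outliers` is a hypothetical helper. It uses `math.dist`, which needs Python 3.8 or later.

```python
import math


def db_outliers(points, eps, pi_percent):
    """Return points for which at most pi_percent of the other points lie within eps."""
    outliers = []
    for i, p in enumerate(points):
        others = [q for j, q in enumerate(points) if j != i]
        close = sum(1 for q in others if math.dist(p, q) < eps)
        if others and 100.0 * close / len(others) <= pi_percent:
            outliers.append(p)
    return outliers


# Hypothetical 2D observations: a dense cluster plus one far-away point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(db_outliers(points, eps=1.0, pi_percent=10.0))  # [(5.0, 5.0)]
```

The nested loop costs O(n²) distance computations, which is why index-based and cell-based variants exist for larger datasets.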
Step 3.4: Enrichment
This step supplements/adds additional information to the data.
Possible techniques:
– Additional information can be accessed from other resources
– Use of services such as translation, value conversion, adding a zip code, etc.
– [In case of semantic linked data] Linking to other concepts through new predicates.
Summary
Discussed Land Use Land Cover
Discussed Satellite Imaging and Classification
Discussed Case study on Urban Growth Modelling
Discussed the challenges of developing decision support systems with Open Data (e.g., need for
accurate trusted information)
Explained the nature and types of data issues in (Open) Data: different formats, missing values, etc.
Discussed techniques for identifying data quality issues
Discussed data preparation and cleaning strategies (e.g., data clustering, filtering, etc.)
Identified a minimal data preparation pipeline
Assigned Reading
Rahm, Erhard, and Hong Hai Do. "Data cleaning: Problems and current approaches." IEEE Data Eng. Bull. 23.4 (2000): 3-13.
https://landsat.gsfc.nasa.gov/pdf_archive/How2make.pdf
Acknowledgments
I created this material from several resources:
– https://study.com/academy/lesson/geospatial-data-definition-example.html
– Data from USGS
– http://www.splibtarang.com/index.php
– Yadav, Piyush, Shamsuddin Ladha, Shailesh Deshpande, and Edward Curry. "Computational Model for Urban Growth Using Socioeconomic Latent Parameters", In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 65-78. Springer, Cham, 2018
– NASA, Landsat Website
– Data from https://data.gov.ie
– A ppt by David Corn, "Data Quality and Data Cleaning1"
– A ppt by Eric Poulin and Colin Yu, "Outlier Detection and Analysis"
– A paper by Erhard Rahm and Hong Hai Do, "Data Cleaning: Problems and Current Approaches"
– A ppt by Cameron Brooks, "Lets Build a Smarter Planet: IBM Smarter Water Management"
Further Reading
For further readings I recommend the following books
Book Link
Assignments
Group Assignment
Total 100 marks
Two Sections
Section 1 - (30 marks)
– Objective: Classify the given Landsat images of a Dublin region from two different years using QGIS software and identify one major change that you can see between the two images
– Marking scheme: Report 100% (30 marks)
Section 2 - (70 marks)
– Objective: Create a complete and clean dataset by merging three datasets
– Dataset: Real world data from https://data.gov.ie
• Playing pitches around Dublin
• Multiple formats (minimum 2 are required)
• Data completion using other sources
– Tools: Python or Java
– Marking scheme:
• Report 50% (35 marks)
• Code/Analytics 50% (35 marks)
Guidelines For Group
Two people in each group
Fill in the group information by 21st Jan, 5pm (link given below)
Those who do not fill it in will be assigned to random groups.
For any questions, you can email me at piyush.yadav@insight-centre.org
Assignment Due: Jan 30th Midnight
https://docs.google.com/spreadsheets/d/1eTwNF6-OqvSGKZtv0WWREgRjnKt8w_OATEH6b18unJQ/edit?usp=sharing
THANK YOU
QUESTIONS

More Related Content

PDF
Comoros Subject-NDC workshop 2014
PDF
New information system for enhancing climate & water governance
DOCX
Topic 19
PDF
Topographic Information System as a Tool for Environmental Management, a Case...
PPT
Applications of GIS to Logistics and Transportation
PDF
Flood modeling and mapping system for emergency responders
PDF
A New Dimension
PPT
Who? What? Why we better care?
Comoros Subject-NDC workshop 2014
New information system for enhancing climate & water governance
Topic 19
Topographic Information System as a Tool for Environmental Management, a Case...
Applications of GIS to Logistics and Transportation
Flood modeling and mapping system for emergency responders
A New Dimension
Who? What? Why we better care?

What's hot (20)

PPTX
Moreno_EUEC2016_ICF_final
PPT
Geographic Information Systems in the Oil & Gas Industry
PPT
GIS for Infrastructure Management
PPT
Urban planing & gis
PPTX
Application of gis in urban traffic air quality
PPT
civil engineer
PDF
Tom Martlev - detailed geological modelling in urban areas focused on structu...
PDF
Spring 2013
PPTX
Application of gis and gps in civil engineering
PDF
Icelandic Bathy model
PPT
Massachgusetts, USGS, and Fugro/Earthdata
PDF
2018 GIS in Development: Developing a National Map of Subsurface Infrastructure
PPTX
INSPIRE and Land Use - The need for real harmonised data about urban plans
PPTX
Crowd-Sourcing Approach of Building Ground Truth Database for Global Urban Ar...
PPTX
Gis powerpoint
PPTX
Digital Elevation Models
PPTX
Gis in transportation
PPTX
Introduction and Application of GIS
PPT
FLOOD MAP MODERNIZATION
Moreno_EUEC2016_ICF_final
Geographic Information Systems in the Oil & Gas Industry
GIS for Infrastructure Management
Urban planing & gis
Application of gis in urban traffic air quality
civil engineer
Tom Martlev - detailed geological modelling in urban areas focused on structu...
Spring 2013
Application of gis and gps in civil engineering
Icelandic Bathy model
Massachgusetts, USGS, and Fugro/Earthdata
2018 GIS in Development: Developing a National Map of Subsurface Infrastructure
INSPIRE and Land Use - The need for real harmonised data about urban plans
Crowd-Sourcing Approach of Building Ground Truth Database for Global Urban Ar...
Gis powerpoint
Digital Elevation Models
Gis in transportation
Introduction and Application of GIS
FLOOD MAP MODERNIZATION
Ad

Similar to Geospatial Open Data and Urban Growth Modelling for Evidence-based Decision Making in perspective of Smart Cities (20)

PDF
Computational Model for Urban Growth Using Socioeconomic Latent Parameters
PDF
Using satellite imagery to track economic change
PDF
Using satellite imagery to track economic change
PDF
IRJET- Land Use & Land Cover Change Detection using G.I.S. & Remote Sensing
PDF
Land_Use_-_Land_Cover_Change_Analysis_of_Karur_Tow.pdf
PDF
IRJET- Landuse, Landcover and Urban Development of Coimbatore North Zone for ...
PDF
ANALYSIS OF LAND USE AND LAND COVER CHANGE OF BANGALORE URBAN USING REMOTE SE...
PDF
Landmap CETIS 2012
PPSX
SPATIO-TEMPORAL URBAN CHANGE DETECTION, ANALYSIS AND PREDICTION OF KATHMANDU ...
PPTX
SPECTRAL INDICES FOR URBAN AREA (GEOINFORMATICS AND REMOTE SENSING)
PPTX
Monitoring Land Use and Land Cover through Remote Sensing and GIS
PPT
Intro to GIS and Remote Sensing
PDF
Urban Growth Patterns In India Spatial Analysis For Sustainable Development 1...
PPTX
Land use cover pptx.
PPT
Spatial temporal urban change extraction and modeling of Kathmandu Valley
PDF
Geospatial Open Data and Urban Growth Modelling for Evidence-based Decision Making in perspective of Smart Cities

  • 1. Geospatial Open Data and Urban Growth Modelling for Evidence-based Decision Making in perspective of Smart Cities PIYUSH YADAV
  • 2. 17/01/2020 © Lero 2015. About me
      ❑ Researcher at the Insight Centre for Data Analytics and the Lero Software Research Centre at NUI Galway (NUIG)
      ❑ Researcher-CTO at Tata Research Development and Design Centre (TRDDC), which is part of TCS Innovation Lab; member of a project in collaboration with IIT Bombay
      ❑ M.Tech. (CSE) with specialization in information security at IIIT Delhi in 2013; Research Assistant at McGill Univ., Canada
      ❑ Research Interests: Complex Event Processing, Video Analytics, Distributed Systems, Machine Learning, Smart Cities, GIS and Remote Sensing
      ❑ Publications: 17 Conference Papers, 1 Journal, 1 Book Chapter, 6 Posters, 2 Patents Filed, 1 Industry Report (Dell)
      Contact: Twitter, LinkedIn, Website
  • 3. Outline
      • Learning Outcomes
      • Geospatial Data
      • Classification for Satellite Images
      • Case Study: Urban Growth Modelling
      • Multi-source Open Data Management
      • Quality Issues in Multi-source Open Data
      • Techniques for data preparation and cleaning
      • Assignments
  • 4. Learning Outcomes
      You will learn:
      • The importance of Geospatial Data and Land Use Land Cover in the development of Smart Cities
      • Fundamentals of Satellite Image Classification
      • How to model urban growth and predict the future growth of a city
      • The importance of Open Data in Smart Cities
      • Explain the nature and types of data issues in (Open) Data
      • Discuss techniques for identifying data quality issues
      • Demonstrate data preparation and cleaning strategies (e.g., data clustering, filtering, etc.)
  • 5. Copernicus Hackathon Ireland 2019
      • Last year, 3 teams from this class participated.
      • 2 teams won a prize.
      Air Quality: Aftab Alam, Nikhil Nambiar, Vignesh Kamath https://prezi.com/view/iZEygJaFnxqAJH7lR9TM/
      Smart Agriculture
  • 6. Geospatial Data
      Geospatial data, or spatial data as it is sometimes known, is information that has a geographic aspect to it:
      ➢ Coordinates: Lat Long
      ➢ Postal Address
      ➢ Physical Features
      Types:
      Vector - This form uses points, lines, and polygons to represent spatial features such as cities, roads, and streams.
      Raster - This form uses cells (computers often use dots or pixels) to represent spatial features (our focus in this lecture).
      https://www.bolton-menk.com/books/lindsey/Lindsey.html
  • 7. Satellite Imagery: Basics
      • The electromagnetic (EM) spectrum describes the continuous spectrum of energy from high-energy gamma rays and x-rays to very low-energy microwaves and radio waves.
      • Visible light, or light that our eyes can detect, is just a small portion of the EM spectrum.
      • Satellites collect data by passing the energy reflected from the Earth through filters that separate it into small windows of the EM spectrum, i.e. discrete spectral bands (raster image).
      Figures: How we see colour; Electromagnetic Spectrum; Satellite Imaging
      https://landsat.usgs.gov/atmospheric-transmittance-information
  • 8. Image Bands/Channels
      An image consists of multiple bands from the electromagnetic spectrum.
      Normal image: 3 bands (Red, Green, Blue)
      Multispectral: 3-10 bands
      Hyperspectral: 100-1000 bands (nm)
      Stack of Bands ~ Tensor
      http://www.splibtarang.com/index.php
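The "stack of bands ~ tensor" idea can be sketched with NumPy. This is only an illustration: the band arrays below are random placeholders rather than real satellite data, and the names `red`, `green`, `blue`, `nir` and `cube` are hypothetical.

```python
import numpy as np

# Hypothetical 4 x 5 pixel scene with four single-band rasters (random placeholder values).
rows, cols = 4, 5
rng = np.random.default_rng(0)
red, green, blue, nir = (
    rng.integers(0, 256, size=(rows, cols), dtype=np.uint8) for _ in range(4)
)

# Stacking the bands along a new last axis gives a (rows, cols, bands) tensor.
cube = np.stack([red, green, blue, nir], axis=-1)

# The values of one pixel across all bands form its spectral signature.
pixel_signature = cube[2, 3, :]
print(cube.shape, pixel_signature)
```

Each pixel then carries one value per band, which is the feature vector used later for classification.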
  • 9. LANDSAT Satellite Images
      • The Landsat program by NASA is the longest-running enterprise for acquisition of satellite imagery of Earth.
      • 8 satellites so far
      • Landsat 1 launched in 1972, Landsat 7 in 1999, Landsat 8 in 2013
      • Data can be downloaded from: https://earthexplorer.usgs.gov/
      Figures: Landsat 7 Bands; Landsat 8 Bands; Scan Line Correction Issue in Landsat 7 (2003); Other Earth Observation Satellites
  • 10. Pre-processing of Landsat Image
      Atmospheric Correction
      • Electromagnetic radiation captured by the satellite sensors is affected by atmospheric interference such as scattering, dispersion, etc.
      • Subtract the digital number (DN) of water pixels in band 4 (infrared band), as it has very low water-leaving radiance (Cracknell 2007).
      • DN values were then converted to spectral radiance (Kaufman 1989):
        $L = L_{min} + \left(\frac{L_{max}}{254} - \frac{L_{min}}{255}\right) \times DN$
      Solar Correction
      • For clear Landsat images, solar correction of the images was done by converting spectral radiance to exoatmospheric reflectance (Kaufman 1989):
        $\rho_p = \frac{\pi \cdot L_\lambda \cdot d^2}{ESUN_\lambda \cdot \cos\theta_s}$
      References:
      Cracknell, A. (2007). Atmospheric Corrections to Passive Satellite Remote Sensing Data. In A. Cracknell, Introduction To Remote Sensing, Second Edition (p. 196). CRC Press. Retrieved September 1, 2015
      Kaufman, Y. J. (1989). The atmospheric effect on remote sensing and its correction. In Theory and applications of optical remote sensing (pp. 336-428).
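The two conversions on this slide can be sketched in plain Python. The function and parameter names are hypothetical, the calibration numbers in the usage lines are made-up placeholders rather than real Landsat metadata, and the sketch assumes that d is the Earth-Sun distance in astronomical units and θ_s is the solar zenith angle, which is the usual reading of these symbols.

```python
import math

def dn_to_radiance(dn, l_min, l_max):
    """Spectral radiance from a digital number, using the slide's formula."""
    return l_min + (l_max / 254.0 - l_min / 255.0) * dn

def radiance_to_reflectance(radiance, esun, earth_sun_distance_au, solar_zenith_deg):
    """Exoatmospheric reflectance from spectral radiance.

    Assumes solar_zenith_deg is the solar zenith angle in degrees and
    earth_sun_distance_au is the Earth-Sun distance in astronomical units.
    """
    cos_theta = math.cos(math.radians(solar_zenith_deg))
    return (math.pi * radiance * earth_sun_distance_au ** 2) / (esun * cos_theta)

# Placeholder calibration values, chosen only to exercise the functions.
radiance = dn_to_radiance(dn=120, l_min=-5.0, l_max=241.0)
reflectance = radiance_to_reflectance(
    radiance, esun=1500.0, earth_sun_distance_au=1.0, solar_zenith_deg=30.0
)
print(radiance, reflectance)
```

In practice the per-band calibration constants and the sun angle would come from the metadata file shipped with each scene.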
  • 12. Pre-processing of Landsat Image
      Band 1, Band 2, Band 3, ... converted to reflectance (R, G, B, ...)
      View using KML on Google Earth. Download the file from:
      https://drive.google.com/drive/folders/1KGQmkZ7bN2M-ED31sDNWVtX29VntfzWs
  • 13. Classify Landsat Image (Supervised Learning)
      1. Create Training Data
         Class ID   Class Name                      Location (x,y)
         1          Vegetation
         2          Impervious Surface (Built Up)
         3          Soil
         4          Water
      2. Train Model
         • Maximum Likelihood
         • SVM
         • DNN
      3. Classify
      Figure: Spectral Signature for Different Classes
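The create-training-data, train, classify workflow can be sketched with a Gaussian maximum likelihood classifier, the first of the models listed. Everything here is an illustrative assumption: the three-band class means are invented, the training pixels are synthetic, and `fit_gaussian_ml` and `classify` are hypothetical helper names.

```python
import numpy as np

def fit_gaussian_ml(features, labels):
    """Estimate one mean vector and covariance matrix per class."""
    model = {}
    for cls in np.unique(labels):
        samples = features[labels == cls]
        mean = samples.mean(axis=0)
        # A small ridge keeps the covariance matrix invertible.
        cov = np.cov(samples, rowvar=False) + 1e-6 * np.eye(features.shape[1])
        model[int(cls)] = (mean, np.linalg.inv(cov), np.linalg.slogdet(cov)[1])
    return model

def classify(features, model):
    """Assign each pixel to the class with the highest Gaussian log-likelihood."""
    classes = sorted(model)
    scores = []
    for cls in classes:
        mean, inv_cov, log_det = model[cls]
        diff = features - mean
        mahalanobis = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
        scores.append(-0.5 * (mahalanobis + log_det))
    return np.array(classes)[np.argmax(np.stack(scores, axis=1), axis=1)]

# Invented mean reflectances in three bands for the four classes on the slide.
class_means = {
    1: [0.04, 0.08, 0.45],  # Vegetation
    2: [0.20, 0.20, 0.22],  # Impervious Surface (Built Up)
    3: [0.25, 0.30, 0.38],  # Soil
    4: [0.06, 0.04, 0.02],  # Water
}
rng = np.random.default_rng(42)
train_x = np.vstack([rng.normal(m, 0.01, size=(50, 3)) for m in class_means.values()])
train_y = np.repeat(list(class_means), 50)

model = fit_gaussian_ml(train_x, train_y)
predicted = classify(np.array(list(class_means.values())), model)
print(predicted)
```

An SVM or a neural network would slot into the same workflow; only the train and classify steps change.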
  • 14. Classified Image
  • 15. Case Study: Urban Growth Modelling
      World population is growing; increased economic activities; increased urban growth rate
      Urban growth: change in Land Use Land Cover
      Figure: An aerial view of urban growth in 2006 and 2014
  • 16. Land Use Land Cover Change (LULCC)
      A key aspect of urban growth is its effect on land use land cover change.
      Land cover indicates the physical land type, such as forest or open water.
      Land use documents how people are using the land, such as for agriculture.
  • 17. Factors Affecting Land Use Land Cover
      Spatial Factors
      • Predominantly change over space but remain relatively static with respect to time.
      • Example: Digital Elevation Model (DEM)
      Spatio-temporal Factors
      • Change over both time and space.
      • Example: Proximity to the primary roads
      Temporal Factors
      • Change over time but are spatially static for a given study area.
      • Example: National Gross Domestic Product (GDP)
      Diagram labels: Direct Factors; Indirect Factors; Land Use Land Cover Change
  • 18. Urban Growth Models
      Urban growth models are used for prediction of land use land cover (LULC) changes.
      LULC modeling is extremely difficult due to complex interactions between multi-scale factors.
      Thus lattice-based spatio-temporal models, e.g. Cellular Automata (CA) and Logistic Regression (LR), are effectively used to model spatial geographic processes.
      LULC images of two distinct time instances are taken, and probabilities are computed from the frequency of change from one LULC class to another to generate a transition probability matrix.
      Figure: Schematic of an integrated Markov Chain model
      Limitation: Persistent Growth Rate
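The transition probability matrix described above can be sketched as follows. The two tiny class rasters are invented for illustration, the class codes (0 = V, 1 = I, 2 = S) are an assumption, and `transition_matrix` is a hypothetical helper name.

```python
import numpy as np

def transition_matrix(lulc_t0, lulc_t1, n_classes):
    """Row-normalised frequency of change from each class at t0 to each class at t1."""
    counts = np.zeros((n_classes, n_classes), dtype=float)
    # Count every (class at t0, class at t1) pair, pixel by pixel.
    np.add.at(counts, (lulc_t0.ravel(), lulc_t1.ravel()), 1)
    row_sums = counts.sum(axis=1, keepdims=True)
    # Classes absent at t0 keep an all-zero row instead of dividing by zero.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

# Invented 2 x 3 classified images at two time instances (0 = V, 1 = I, 2 = S).
lulc_2001 = np.array([[0, 0, 1],
                      [1, 2, 2]])
lulc_2002 = np.array([[0, 1, 1],
                      [1, 1, 2]])

probabilities = transition_matrix(lulc_2001, lulc_2002, n_classes=3)
print(probabilities)
```

Row i of the result gives the probabilities of a pixel in class i moving to each class, so the diagonal holds the persistence of each class.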
  • 19. Our Contribution
      Hidden Markov Model: Introduction of Hidden Markov Model (HMM).
      Temporal Factors: Incorporate temporal factors in LULC change modelling using HMM. Model the underlying temporal factors as Gaussian distributions, conditioned on the hidden states, to learn land cover type transition probabilities.
      Integrate: Integrate our model with other spatio-temporal models such as Logistic Regression (LR) to yield richer integrated models than the corresponding MC-based integrated models.
      Figure: An urban growth model with multi-scale direct and indirect factors impacting LULC changes
  • 20. Our Model
      Figure: A Hidden Markov Model with hidden states (V, I, S) and sample emissions (GDP and Liquidity)
      Figure: Proposed urban growth model: HMM integrated with Logistic Regression model
  • 21. Study Area: Pune
      • Tier-A city situated in the state of Maharashtra, India.
      • Located 560 m above sea level.
      • Famous for its Information Technology and Automobile industries and various research institutes.
      • Considered 45 sq. km of the city area that has undergone rapid urbanisation.
  • 22. Temporal Growth Factors
      Gross Domestic Product (National): Amount of goods and services produced within the border of a country in a specific time interval.
      Interest Rate Cycle (National): Revised bimonthly. A tight monetary policy affects the overall investment policy, which leads to a slowdown, and vice versa.
      Consumer Price Index (National): Low inflation creates a developmental investment environment.
      Gross Fixed Capital Formation (National): Amount that the government spends on capital formation (such as infrastructure building and land improvements) in the country. The greater the GFCF investment, the higher the rate of urbanization.
      Urban Population Growth Rate (National): To accommodate a higher influx of people, cities expand along their outskirts, leading to growth of the urban agglomerate.
      Electricity Consumption (Regional): Typically, regions with higher electricity demand grow faster than those with lower demand.
      Road Length Added (Regional): Better connectivity of a region helps transportation and thus provides impetus to growth by allowing the setup of new industrial complexes and other infrastructure services.
  • 23. Temporal Growth Factors Data
      Plots: GDP growth rate (%); Absolute average CPI inflation (%); Gross fixed capital formation (% GDP); Urban population growth rate (%); Bimonthly interest (repo) rate (%); Per capita electricity consumption in kilowatt-hours
  • 24. Land Use Land Cover (LULC) Data
      LULC data is required as an input for the HMM hidden states and the LR models.
      Landsat-7 Specifications
         Time period   Yearly, 2001 to 2014 (between March and April)
         Latitude      18.38847838°N - 18.79279909°N
         Longitude     73.64552005°E - 74.07494971°E
         Bands         1 to 7
         Resolution    30 m
         Pixels        1500 x 1500
      LULC Data Pre-processing
      Scan Line Correction (SLC)
      • In 2003 the SLC in Landsat-7's ETM+ instrument developed a fault, creating black lines in the captured images.
      • Image smoothing using windowing.
      Atmospheric Correction: explained earlier
      Solar Correction: explained earlier
  • 25. LULC Data Classification
      Classes
      • Classified into seven broad LULC classes on the basis of the nature of the landscape.
      • Forest Canopy, Agriculture Area, Residential Area, Industrial Area, Common Open Area, Burnt Grass, Bright Soil, and Water Body.
      SVM Classification
      • For classification, a labeled set of pixels for each class of interest was collected (500 to 3000 samples per class). The feature vector for each pixel consisted of all seven band values.
      • Support Vector Machines
      • Manual Correction (Concrete and Quarry)
      VIS Classes
      • Vegetation, Impervious Surface, and Soil
  • 26. A Quick Recap: LULC Data
  • 27. Spatio-Temporal Factors
      Digital Elevation Model (DEM) and Slope: CARTOSAT 1
      Proximity to primary roads: Primary Road Layers
      Mask: Water bodies were masked out from the LULC image
      Figures: 3D View; DEM Image
  • 28. Results: HMM Experiments
      • Used the Gaussian HMM library in Scikit Learn
      • We designed an HMM with three hidden states (V, I, and S) and temporal factors
      • The HMM was initialized with MC transition probabilities for the years 2001 to 2002
      • A stable model was obtained empirically after 50000 iterations with a threshold of less than 0.01
      Figures: Computed MC transition probabilities for 2001-2002; Learned HMM transition probabilities for 2014; Computed MC transition probabilities for 2014
  • 29. Results: Land Change Modelling Experiments
      • Terrset's Land Change Modeler.
      • Transition sub-models were defined for four LC change types, i.e., V to S, V to I, S to V, and S to I.
      • Slope gradient and the primary roads layer were used as the primary driver variables.
        $suitability = \frac{1}{(slope\ gradient)^{0.1}}$
      • Suitability map: the greater the value, the higher the suitability, and vice versa.
      • Suitability for urbanization is high in areas such as roads, the low-lying river basin, and around the urbanized areas where the slope gradient is low.
      • Towards the south end the suitability drops significantly, as the area has hills and valleys.
      • The four sub-models were built using Logistic Regression.
  • 30. Results
      Heat maps depicting transition probabilities from one state to another: Soil to Impervious; Soil to Vegetation; Vegetation to Impervious; Vegetation to Soil
  • 31. Results
      • The two models were then used to predict changes for the year 2014.
      • Visually it is evident that the HMM-based predicted image is significantly more similar to the actual classified LC image than the MC-based predicted image.
      Figures: Actual land cover image of 2014 obtained from classification; Predicted land cover image of 2014 (HMM-LR); Predicted land cover image of 2014 (MC-LR)
  • 32. Results
      Precision and Recall for integrated models
                     HMM-LR                MC-LR
                     V      I      S       V      I      S
         Precision   0.48   0.49   0.60    0.54   0.38   0.34
         Recall      0.48   0.52   0.59    0.54   0.32   0.39
      • Blob analysis of urban and non-urban regions. Blobs denote concentrated urban regions.
      • Green blobs are true positives, blue blobs are false negatives, and red blobs are false positives.
      • HMM-LR false positives are smaller in size and less dense than those of MC-LR. The HMM output is well balanced and resembles the actual output better.
      • An 11% increment in precision of the persistence of Impervious Surface (I) is observed.
      • Precision of the Soil (S) class type has jumped by 26%.
      • Precision of the Vegetation (V) class type drops by a marginal 6%. This is because vegetation cover is an outcome of a relatively easy process compared to S and I.
      Figure: Blob analysis of urban areas. Left to right: (i) Actual, (ii) MC-LR, (iii) HMM-LR
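Per-class precision and recall of the kind tabulated above can be computed by comparing the predicted and the actual class of every pixel. The sketch below uses two short invented label sequences rather than the study's images, and `precision_recall` is a hypothetical helper name.

```python
def precision_recall(actual, predicted, classes):
    """Return {class: (precision, recall)} from paired per-pixel labels."""
    results = {}
    for cls in classes:
        pairs = list(zip(actual, predicted))
        tp = sum(1 for a, p in pairs if p == cls and a == cls)  # true positives
        fp = sum(1 for a, p in pairs if p == cls and a != cls)  # false positives
        fn = sum(1 for a, p in pairs if p != cls and a == cls)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        results[cls] = (precision, recall)
    return results

# Invented labels for eight pixels (V = Vegetation, I = Impervious, S = Soil).
actual = list("VVVIISSS")
predicted = list("VVIIISSV")

scores = precision_recall(actual, predicted, classes="VIS")
for cls, (precision, recall) in scores.items():
    print(cls, round(precision, 2), round(recall, 2))
```

Precision asks how many pixels predicted as a class really belong to it, while recall asks how many pixels of that class were found.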
  • 33. Conclusion
      • Markov Chain (MC) models are limited in their urban prediction capabilities due to the assumption of a constant rate of persistence of land cover class types and their inability to model temporal factors.
      • We have proposed a new temporal model using a Hidden Markov Model.
      • We have demonstrated the usefulness of our model over MC by predicting urban growth for an upcoming city of India (Pune).
      • We believe that this inquiry into HMM-based models provides yet another tool that will equip urban modelers, planners and decision makers to better design sustainable urban environments.
      • 11% and 26% increments of precision in the Impervious Surface and Soil classes respectively.
      https://www.researchgate.net/publication/327745849_Computational_Model_for_Urban_Growth_Using_Socioeconomic_Latent_Parameters
  • 34. Open Data
  • 35. Open Data
  • 36. 10030 112 https://data.gov.ie/stats
  • 37. How is Open Data being used?
      Engagement/Innovation: https://www.mapalerter.com/
      Data Modelling / Decision-Making: http://exceedence.com/monetising-metocean-data-an-open-data-project/
  • 38. Monitoring / Planning: Quality and Qualifications Ireland http://infographics.qqi.ie/
      Sustainability / Mobility: https://citybik.es/
  • 39. Open Data Management Challenge
  • 40. From Data to Smart Data
      Data Sources
      Open Data Management: Data Modeling, Collection, Aggregation, Enrichment, Linking, Classification, Cleaning, Integration, Storing, Querying
      Predictive Analytics, User Awareness, Recommendations, Smart Apps
      Is this data good enough for creating accurate and reliable apps?
  • 41. Open Data Management Challenge
      Open Data quality can be very challenging for designing apps and decision support models.
      Open Data can have multiple issues: missing values, different formats, irregular timestamps, abnormal values, etc.
      Data preparation such as filtering and classification is an important step for further analysis.
      Data is not complete and requires combining multiple data sources.
  • 42. Case Specifics
  • 43. Data Preparation for Building a map of Playing Pitches around Dublin
      The data is available on https://data.gov.ie
      Different Formats
  • 44. And even more challenges!
      Different Formats; Different Attributes; Missing Values
  • 45. And even more challenges!
      Different Formats; Different Attributes; Missing Values
      Objective: Create a good quality dataset from these resources!
  • 46. What is good quality data?
      A Conventional Definition of Data Quality
      Good quality data are: Accurate, Complete, Unique, Up-to-date, and Consistent; meaning …
  • 47. Accurate means …
      Are we storing correct values?
      ➔ Values in the data entries should be consistent: same form or value representation
      Example: What issues can you identify from this table?
         Sensor   Timestamp             Value   Location
         M1n      12/01/2018T10:03:59   12.3    Galway
         M3n      1452592980000         9.5     GA
         M5n      01/12/2018 10:03      1.55    NUIG
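One way to bring the three timestamp styles in the table into a single representation is sketched below. The sketch makes assumptions that the slide does not state: all-digit values are treated as Unix epoch milliseconds in UTC, and slash-separated dates are read day-first. `KNOWN_FORMATS` and `to_iso8601` are hypothetical names.

```python
from datetime import datetime, timezone

# Assumed day-first layouts for the two textual timestamp styles in the table.
KNOWN_FORMATS = ("%d/%m/%YT%H:%M:%S", "%d/%m/%Y %H:%M")

def to_iso8601(raw):
    """Convert a raw timestamp string to an ISO 8601 string without timezone."""
    raw = raw.strip()
    if raw.isdigit():
        # Assumption: an all-digit value is Unix epoch time in milliseconds (UTC).
        moment = datetime.fromtimestamp(int(raw) / 1000, tz=timezone.utc)
        return moment.replace(tzinfo=None).isoformat()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw, fmt).isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised timestamp format: {raw!r}")

for raw in ("12/01/2018T10:03:59", "1452592980000", "01/12/2018 10:03"):
    print(raw, "->", to_iso8601(raw))
```

Note that the day-first reading is itself a guess: "01/12/2018" could equally be 12 January, which is exactly the kind of ambiguity a unified data model is meant to remove.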
  • 48. Accurate means …
      Possible solution: Create a Unified Data Model
      Do you have access to the data source?
      Yes: Adjust sources to send data using your model
      No: Convert your data before further processing
  • 49. Complete means …
      Does the data contain everything it is supposed to contain?
      Example: What issues can you identify from this table?
         Sensor   Timestamp             Value   Location
         M1n      08/01/2018T00:00:00   32.5    NEB, NUIG
         M1n      09/01/2018T00:00:00   21.2
         M1n      10/01/2018T00:00:00   26.1    NEB, NUIG
         M1n      12/01/2018T00:00:00   23.5    NEB, NUIG
         M1n      13/01/2018T00:00:00           NEB, NUIG
         M1n      14/01/2018T00:00:00   26.1    NEB, NUIG
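A completeness check over the table above might look like the following sketch. It assumes one reading per day is expected and represents missing cells as `None`; `completeness_report` is a hypothetical helper name.

```python
from datetime import date, timedelta

# Rows from the table as (day, value, location); None marks an empty cell.
readings = [
    (date(2018, 1, 8), 32.5, "NEB, NUIG"),
    (date(2018, 1, 9), 21.2, None),
    (date(2018, 1, 10), 26.1, "NEB, NUIG"),
    (date(2018, 1, 12), 23.5, "NEB, NUIG"),
    (date(2018, 1, 13), None, "NEB, NUIG"),
    (date(2018, 1, 14), 26.1, "NEB, NUIG"),
]

def completeness_report(rows):
    """Return (days with no row at all, rows that have empty fields)."""
    days = [day for day, _, _ in rows]
    span = (max(days) - min(days)).days
    # Assumption: exactly one reading is expected for every day in the range.
    expected = {min(days) + timedelta(days=offset) for offset in range(span + 1)}
    missing_days = sorted(expected - set(days))
    incomplete_rows = []
    for day, value, location in rows:
        empty = [name for name, cell in (("value", value), ("location", location)) if cell is None]
        if empty:
            incomplete_rows.append((day, empty))
    return missing_days, incomplete_rows

missing_days, incomplete_rows = completeness_report(readings)
print(missing_days)
print(incomplete_rows)
```

The check separates two kinds of incompleteness: a whole observation that never arrived and an observation that arrived with empty attributes.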
  • 50. Unique means …
      Do the data entries appear only once?
      ➔ This issue generally appears when manual entries are allowed in the dataset
      Example: What issues can you identify from this table?
         Surname   Firstname   DoB        Driving test passed
         Smith     J.          17/12/85   17/12/05
         Smith     Jack        17/12/85   17/12/2005
         Smith     Jock        17/12/95   17/12/2005
  • 51. Consistent means …
      Does the data contain any logical errors or impossibilities?
      Example: What issues can you identify from this table?
         Sensor   Timestamp             Value   Location
         M1n      08/01/2018T00:00:00   32.5    NEB, NUIG
         M1n      09/01/2018T00:00:00   21.2    NEB, NUIG
         M1n      10/01/2018T00:00:00   0       NEB, NUIG
         M1n      11/01/2018T00:00:00   23.5    NEB, NUIG
         M1n      12/01/2018T00:00:00   -1.23   NEB, NUIG
         M1n      13/01/2018T00:00:00   26.1    NEB, NUIG
      Are these errors? How can we identify them?
      ➔ Possible solutions: Filtering and outlier detection.
  • 52. Up-to-Date means …
      Is the data updated regularly?
      A sensor moved to a new location. What implications can this have?
      Can you think of a case where it doesn't matter whether or not the data are kept up to date?
  • 53. Techniques for Data Preparation
  • 54. Minimal Data Preparation Pipeline
      Observation: Understanding the format of the data and its elements
      Modeling: Identify relevant attributes and representation format
      Quality Enhancement: Classification, Aggregation, Filtering, Enrichment, etc.
  • 55. Step 1: Observation
      This step involves the descriptive analysis (auditing) of individual data resources.
      Data observations can be:
      – Highly structured: by having a predefined checklist of observational attributes (e.g., format, attributes, frequency, volume, language, etc.)
      – Semi-structured: by having an ad hoc checklist of observational attributes
      • Pros:
      – Defines contextual information about the data
      – Provides good and early insights into data quality issues
      • Cons:
      – Can be time consuming
  • 56. Step 2: Modeling
      This step involves the use of formal techniques for creating a data model.
      Examples of techniques: Object-Relational mapping, Relational model, etc.
      Methodologies:
      – Top-down: predefined information about the data
      – Bottom-up: results from a reengineering effort
  • 57. Step 3.1: Classification
      Data classification is the process of organizing data by categories for refined and targeted analysis.
      ➔ Example: Water or energy consumption for working days vs. non-working days
      ➔ Categories depend on the intended use of the data
  • 58. Step 3.2: Aggregation
      Data aggregation is a data mining process that summarizes the data with respect to certain criteria/dimensions.
      Data aggregations help increase search performance and facilitate data reporting and analysis.
      Types of aggregations: Sum, Count, Min/Max, AVG, etc.
      Aggregation strategies and levels: temporal (hourly, daily, etc.), source-based (resources hierarchy), location-based (outlet, room, area, building), etc.
      The level of aggregation depends on the available data and its intended use.
      Examples of useful aggregations: hourly traffic congestion level per road; quarterly inflation price.
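A temporal aggregation at the hourly level can be sketched with the standard library alone. The readings are invented and `aggregate_hourly` is a hypothetical helper; the same bucketing idea extends to daily, source-based or location-based levels by changing the grouping key.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

def aggregate_hourly(readings):
    """Group (timestamp, value) pairs by hour and summarise each group."""
    buckets = defaultdict(list)
    for timestamp, value in readings:
        # Truncating the timestamp to the hour gives the grouping key.
        buckets[timestamp.replace(minute=0, second=0, microsecond=0)].append(value)
    return {
        hour: {"count": len(values), "min": min(values), "max": max(values), "avg": mean(values)}
        for hour, values in sorted(buckets.items())
    }

# Invented sensor readings.
readings = [
    (datetime(2018, 1, 8, 10, 5), 2.0),
    (datetime(2018, 1, 8, 10, 40), 4.0),
    (datetime(2018, 1, 8, 11, 15), 6.0),
]

hourly = aggregate_hourly(readings)
for hour, summary in hourly.items():
    print(hour.isoformat(), summary)
```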
  • 59. Step 3.3: Filtering
      Data filtering is the process of refining data sets by removing data items that do not comply with certain criteria.
      Example: Keep data with positive water consumption values.
      Filters depend on the context of the observations (negative values may be meaningful in installations where water flows in both directions in a pipe).
      Filtering Types
      Content-based Filtering
      – Selecting data items based on their values (e.g., keep only positive values)
      Policy-based Filtering
      – Filtering rules are defined as constraints similar to access control mechanisms (e.g., for security reasons)
      Statistical Filtering
      – Identify a baseline for a content-based filtering
      – Baselines are determined from historical data analysis
      – Outlier detection
      Hybrid Filtering
      – Combination of filtering options
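Content-based and statistical filtering can be contrasted in a short sketch. The values are the sensor readings from the "Consistent means …" table; the historical baseline, the mean ± k standard deviations rule and the helper names are assumptions made for illustration, not something the slides prescribe.

```python
from statistics import mean, stdev

def content_filter(values, predicate):
    """Content-based filtering: keep the items whose value satisfies a rule."""
    return [value for value in values if predicate(value)]

def statistical_filter(values, history, k=3.0):
    """Statistical filtering: keep items within mean +/- k standard deviations of history."""
    centre, spread = mean(history), stdev(history)
    low, high = centre - k * spread, centre + k * spread
    return [value for value in values if low <= value <= high]

# Values from the "Consistent means" table.
values = [32.5, 21.2, 0, 23.5, -1.23, 26.1]
# Hypothetical historical readings used to derive the baseline.
history = [22.0, 24.5, 26.0, 23.0, 25.5, 27.0, 21.5]

positive_only = content_filter(values, lambda value: value > 0)
within_baseline = statistical_filter(values, history)
print(positive_only)
print(within_baseline)
```

The two filters disagree on 32.5: it passes the positive-value rule but falls outside the historical baseline, which illustrates why filters depend on the context of the observations.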
  • 60. Step 3.3: Filtering: Outliers Detection
      Global Outlier
      – Value inconsistent with the rest of the dataset
      Local Outlier (special outliers)
      • Observations inconsistent with their neighborhoods
      • A local instability or discontinuity
      Causes of Outliers
      ➢ Low quality measurements: faulty collectors, manual errors, wrong calibrations of devices
      ➢ Network issues: problems with data transmission from data sources to the data management platform
      ➢ Missing values or redundant values: can create wrong aggregations
      ➢ Correct but exceptional data!
  • 61. Outlier Detection Approaches
      Deviation-based outlier detection
      – Sequential exception
      Distance-based outlier detection
      – Index-based, nested-loop, cell-based, local-outliers
      Statistical-based outlier detection
      – Distribution-based, depth-based
  • 62. Distance-based Outlier Detection
      • General idea:
      – Judge a point based on the distance to its neighbors
      – Several variants proposed
      • Basic Assumption:
      – Normal data objects have a dense neighborhood
      – Outliers are far apart from their neighbors
      • Basic Model:
      – Given a radius ε and a percentage π
      – A point p is considered an outlier if at most π percent of all other points have a distance to p less than ε
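The basic distance-based model can be sketched with a nested loop, assuming Euclidean distance and the usual reading that a point is an outlier when at most a fraction π of the other points lie within radius ε. The points and the helper name `distance_based_outliers` are invented for illustration.

```python
import math

def distance_based_outliers(points, radius, max_fraction):
    """Return points with at most max_fraction of the other points closer than radius."""
    outliers = []
    for i, point in enumerate(points):
        others = [other for j, other in enumerate(points) if j != i]
        # Count the neighbours that fall strictly inside the radius.
        near = sum(1 for other in others if math.dist(point, other) < radius)
        if near / len(others) <= max_fraction:
            outliers.append(point)
    return outliers

# Invented 2-D observations: a dense cluster and one far-away point.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.0), (1.0, 1.2), (8.0, 8.0)]

outliers = distance_based_outliers(points, radius=1.0, max_fraction=0.1)
print(outliers)
```

This nested-loop form compares every pair of points, so it is only practical for small datasets; the index-based and cell-based variants listed earlier exist to avoid that cost.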
  • 63. 17/01/2020 © Lero 2015 63 Step 3.4: Enrichment This step supplements/adds additional information to the data. Possible techniques: – Additional information can be accessed from other resources – Use of services such as translation, value conversion, adding a zip code, etc. – [In case of semantic linked data] Linking to other concepts through new predicates. 63
  • 64. 17/01/2020 © Lero 2015 64 Summary 64 Discussed Land Use Land Cover Discussed Satellite Imaging and Classification Discussed Case study on Urban Growth Modelling Discussed the challenges of developing decision support systems with Open Data (e.g., need for accurate trusted information) Explained the nature and types of data issues in (Open) Data: different formats, missing values, Discussed techniques for identifying data quality issues Discussed data preparation and cleaning strategies (e.g., data clustering, filtering, etc.) Identified a minimal data preparation pipeline
  • 65. 17/01/2020 © Lero 2015 6565 Rahm, Erhard, and Hong Hai Do. "Data cleaning: Problems and current approaches." IEEE Data Eng. Bull. 23.4 (2000): 3-13. Assigned Reading https://landsat.gsfc.nasa.gov/pdf_archive/How2make.pdf
  • 66. 17/01/2020 © Lero 2015 66 Acknowledgments I created this material from several resources: – https://study.com/academy/lesson/geospatial-data-definition-example.html – Data from USGS – http://www.splibtarang.com/index.php – Yadav, Piyush, Shamsuddin Ladha, Shailesh Deshpande, and Edward Curry. " Computational Model for Urban Growth Using Socioeconomic Latent Parameters ", In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 65-78. Springer, Cham, 2018 – NASA, Landsat Website – Data from https://data.gov.ie , – A ppt by David Corn, “Data Quality and Data Cleaning1” – A ppt by Eric Poulin and Colin Yu, “Outlier Detection and Analysis” – A paper by Erhard Rahm and Hong Hai Do, ” Data Cleaning: Problems and Current Approaches” – A ppt by Cameron Brooks, “Lets Build a Smarter Planet: IBM Smarter Water Management” 66
  • 67. 17/01/2020 © Lero 2015 67 Further Reading For further readings I recommend the following books 67 Book Link
  • 68. 17/01/2020 © Lero 2015 68 Assignments Group Assignment Total 100 marks Two Sections Section 1- (30 marks) – Objective- Classify a given Landsat Images of a Dublin region of two years using QGIS software and find one major change that you can see between two images – Marking Scheme: Report 100% (30 marks) Section2- (70 marks) – Objective: Create a complete and clean dataset by merging three datasets – Dataset: Real world data from https://data.gov.ie • Playing pitches around Dublin • Multiple formats (minimum 2 are required) • Data completion using other sources – Tools: Python or Java – Marking scheme: • Report 50% (35 marks) • Code/Analytics 50% (35 marks)
  • 69. 17/01/2020 © Lero 2015 69 Guidelines For Group Two people in each group Fill the group information by 21st Jan , 5pm. (Link Given Below) Those who will not fill will be assigned random groups. For any doubt you can mail me on piyush.yadav@insight-centre.org Assignment Due: Jan 30th Midnight https://docs.google.com/spreadsheets/d/1eTwNF6-OqvSGKZtv0WWREgRjnKt8w_OATEH6b18unJQ/edit?usp=sharing