Data Quality
In Real Estate
Dimitris Kontokostas, Andy van der Hoeven, Samur Araujo
Amsterdam, Sep 14th 2017, LDQ Workshop, SEMANTiCS Conference
About Geophy
● Goal to map all buildings in the world
● Provide a quality score for each building
○ Based on location, building status, history, environmental metrics, etc.
● Semantic platform
○ RDF eases the data integration process
● Team of 45, aiming to double by next year
Real Estate is a very complex domain
Really!
Possible constraints on addresses?
● An address will start with, or at least include, a building number.
● When there is a building number, it will be all-numeric.
● No buildings are numbered zero
● Well, at the very least no buildings have negative numbers
● A building number will only be used once per street
● A building will only have one number
● A building name won't also be a number
● [...] https://www.mjt.me.uk/posts/falsehoods-programmers-believe-about-addresses
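A minimal sketch of why these assumptions break when enforced as hard constraints: the first few bullets are encoded as a single naive rule and run against a handful of address strings. The function name `naive_building_number` and the sample addresses are illustrative assumptions, not Geophy code or data.

```python
import re

# Naive rule combining the first bullets: an address starts with an
# all-numeric, non-zero building number followed by whitespace.
NAIVE_NUMBER = re.compile(r"^([1-9][0-9]*)\s")

def naive_building_number(address):
    """Return the leading building number, or None if the naive rule rejects the address."""
    match = NAIVE_NUMBER.match(address)
    return int(match.group(1)) if match else None

# Hypothetical samples, chosen to hit one assumption each.
samples = [
    "10 Downing Street",        # accepted by the naive rule
    "221B Baker Street",        # letter suffix: not all-numeric
    "Rose Cottage, Mill Lane",  # named building, no number at all
    "0 Example Road",           # numbered zero
    "12-14 High Street",        # one building, a range of numbers
]

for address in samples:
    print(f"{address!r}: {naive_building_number(address)}")
```

Only the first sample passes; the other four are plausible address shapes that a strict constraint would wrongly report as errors.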
Geophy [set of] ontologies
● 13 ontologies (+ 9 external)
● 125 Classes
○ Buildings
○ Addresses
○ Companies
○ [...]
● 720 properties
○ 500 datatype
○ 160 relation properties
● Growing...
Quality is expensive
● Quality of source data
○ Free, open, closed data sources, etc.
● Data clean up process
○ Violations, deduplication, precision, etc.
○ How much time and effort can one afford?
How much quality is good enough?
Fitness for use
Quality of ...
● Source data
○ Accuracy of the source
● Translation of source data
○ RDF mappings, RML, D2RQ, scripts, etc.
● Model design
○ Modelling quality
○ Data fitting on schema
● Model definition
○ Mapping of the model onto RDFS, OWL, ShEx|SHACL shapes, etc.
○ Semantics, e.g. RDFS, OWL DL/RL/Full, etc.
Evolution & quality
Data evolves
so do ontologies
so do RDF mappings
so does code
so do SPARQL queries
so do constraints
http://aligned-project.eu
Scaling quality ...
● Thousands of triples
● Millions of triples
● Billions of triples
● ?
Try to move validation into the K range, i.e. thousands of triples (when possible)
Validate closer to the source
Validate the model
Validate the RDF mappings
Validate RDF mapping excerpts
Validate instance data
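One way to read "validate RDF mapping excerpts" and "move validation into the K range" is to run the mapping on a small sample of source rows and check the output before generating the full dataset. The sketch below assumes a hypothetical mapping function `map_row`, a made-up `http://example.org/` namespace and a single datatype check; it stands in for whatever mapping language and validator are actually used.

```python
import re
from itertools import islice

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"
EX = "http://example.org/"  # hypothetical namespace

def map_row(row):
    """Hypothetical mapping: one source row becomes two triples.

    A triple is (subject IRI, predicate IRI, object); a literal object
    is a (lexical form, datatype IRI) pair.
    """
    subject = f"{EX}building/{row['id']}"
    return [
        (subject, RDF_TYPE, f"{EX}Building"),
        (subject, f"{EX}floors", (str(row["floors"]), XSD_INTEGER)),
    ]

def check_triple(triple):
    """Return a list of violation messages for one triple."""
    subject, predicate, obj = triple
    if isinstance(obj, tuple) and obj[1] == XSD_INTEGER:
        if re.fullmatch(r"[+-]?[0-9]+", obj[0]) is None:
            return [f"{subject} {predicate}: {obj[0]!r} is not a valid xsd:integer"]
    return []

def validate_excerpt(rows, sample_size=1000):
    """Map and validate only the first `sample_size` source rows."""
    violations = []
    for row in islice(rows, sample_size):
        for triple in map_row(row):
            violations.extend(check_triple(triple))
    return violations

# Hypothetical source rows; the second has a bad floor count.
rows = [{"id": 1, "floors": 12}, {"id": 2, "floors": "n/a"}]
for message in validate_excerpt(rows):
    print(message)
```

A mapping bug usually shows up in the first rows it touches, so a check over a few thousand triples can catch it long before a run over billions would.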
Automate, automate & automate
Can you spot the error?
rdfs:label ⇒ rdf:langString
:foo rdfs:label "foo @en" .
:foo rdfs:label "foo"@en .
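This kind of slip is hard to spot by eye but easy to test for. A minimal sketch, assuming literals are already available as (lexical form, language tag) pairs from whatever RDF parser is in use; the function name and the regular expression are illustrative heuristics, not part of any RDF library.

```python
import re

# Heuristic: text ending in whitespace, "@", and something shaped like a
# language tag (for example "foo @en").
TAG_INSIDE_TEXT = re.compile(r"\s+@[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*$")

def misplaced_language_tag(lexical_form, language=None):
    """True if an untagged literal's text ends in what looks like a language tag."""
    return language is None and TAG_INSIDE_TEXT.search(lexical_form) is not None

# The two literals from the slide, as (lexical form, language tag) pairs.
wrong = ("foo @en", None)  # tag typed inside the quotes: a plain string
right = ("foo", "en")      # tag attached to the literal: rdf:langString

print(misplaced_language_tag(*wrong))
print(misplaced_language_tag(*right))
```

The first literal is a plain string whose text happens to contain "@en", so it violates the expectation that rdfs:label values are rdf:langString; the second is a language-tagged string and passes.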
CI/CD is your buddy
● Integrate validation with your CI/CD
○ Choose tools & technologies wisely
○ Jenkins, Travis, GitLab, TeamCity
● Fail the build until data issues are fixed
● Data integration validation checks
○ Standalone datasets can pass CI on their own, so also validate the integrated data
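"Fail the build until data issues are fixed" comes down to turning the validation result into a process exit status. A minimal sketch, assuming the validator's findings are available as a list of messages; `report`, `ci_exit_code` and the sample violation are hypothetical names and data.

```python
import sys

def report(violations, stream=sys.stderr):
    """Print one line per violation so the CI log shows what to fix."""
    for message in violations:
        print(f"VIOLATION: {message}", file=stream)

def ci_exit_code(violations, max_allowed=0):
    """Return 0 (build passes) or 1 (build fails) for a list of violations."""
    return 0 if len(violations) <= max_allowed else 1

# Hypothetical validator output; a real job would read the validator's report.
violations = ["<http://example.org/building/2> has a non-integer floor count"]
report(violations)
exit_code = ci_exit_code(violations)
print(f"exit code: {exit_code}")
```

In a CI job the script would end with `sys.exit(exit_code)`; a non-zero exit status is what makes the step, and with it the build, fail. The `max_allowed` parameter is one way to tolerate known issues temporarily while keeping the default strict.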
Thank you for your attention
Questions?
