Integrating Dynamically-Computed Data and Web APIs
into “Virtual” Databases and Knowledge Graphs
Enabling transparent SQL/SPARQL access to both static and dynamically-computed data
Francesco Corcoglioniti
2022-11-11
postdoc @ KRDB, Free University of Bolzano,
supported by HIVE Fusion Grant project (2021-2022), OntoCRM project (2022-2024), and Ontopic s.r.l
Background
Data is increasingly available via Web APIs
• access to 3rd-party and/or dynamically-computed data
• access to data-related services, e.g., text search
Some API statistics (a)
• 83% of all Internet traffic belongs to API-based services
• 2M+ API repositories on GitHub
• 90% of developers use APIs
• 30% of development time spent on coding APIs
Complex data access problem for applications operating on data from both databases and APIs
(a) https://nordicapis.com/20-impressive-api-economy-statistics/
[Figure: an application accesses RDB sources via SQL and API sources via calls, which is a complex data access problem]
Simplify API Access via “Virtual” Databases (VDBs) or “Virtual” Knowledge Graphs (VKGs)
[Figure: two architectures. VDB setting: the application sends SQL to a Virtual Database (VDB), which accesses RDB sources via SQL and API sources via calls. VKG setting: the application sends SPARQL to a Virtual Knowledge Graph (VKG), which accesses RDB sources via SQL and API sources via calls.]
• unified data access: applications operate on a single DB/KG data source via a declarative data manipulation language (DML)
• virtual DB/KG: its data is (mostly) kept in the original sources (no ETL)
• data federation setting: VDB/VKG queries run by orchestrating source sub-queries and API calls
Example Scenario – Extend Open Data Hub (ODH) with Semantic Search
Answer hybrid queries like:
• get (plot) IRI, description, rating & location of accommodations ...
• whose rating is 3 stars or more (structured constraint) and ...
• whose EN description matches the search string “horse riding” (text constraint)
Semantic search: improved text search that aims at capturing and leveraging text meaning (vs term matching only)
• e.g., via BERT-based model from Sentence Transformers library
VDB Specification – SQL/MED
SQL/MED allows federating multiple sources in a virtual database (VDB)
• standardized SQL extension supported by some data federation systems like Teiid
• VDB as a set of schemas mapped to foreign data sources accessed via wrappers/translators
• we extend Teiid with a new service translator for accessing APIs
Example using Teiid with our extensions:
CREATE DATABASE vdb_example OPTIONS ( "... connection options for federated sources ..." );
USE DATABASE vdb_example;
CREATE SERVER db_source FOREIGN DATA WRAPPER postgresql; -- define RDB source with schema 'db'
CREATE SCHEMA db SERVER db_source; -- using 'postgresql' translator to access it
CREATE SERVER srv_source FOREIGN DATA WRAPPER service; -- define API source with schema 'srv'
CREATE SCHEMA srv SERVER srv_source; -- using 'service' translator to access it
IMPORT FOREIGN SCHEMA public FROM SERVER db_source INTO db OPTIONS ( importer.catalog 'public' );
SET SCHEMA srv;
-- CREATE FOREIGN TABLE / PROCEDURE statements mapped to API operations (API bindings)
VDB Specification – API Bindings
API operations as SQL/MED procedures
• input tuple → 0..n output tuples
• URL, method, request/response templates
CREATE FOREIGN PROCEDURE api_semsearch_query (
query VARCHAR
) RETURNS TABLE (
query VARCHAR,
id VARCHAR,
score DOUBLE,
excerpt VARCHAR
) OPTIONS (
"method" 'post',
"url" 'http://semsearch:8080/query',
"requestBody" '{"query": "{query}", "n": 100}',
"responseBody" '{"matches": [{
"id": "{id}",
"score": "{score}",
"excerpt": "{excerpt}" }] }'
);
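One possible reading of the request/response templates, sketched in Python. This is an assumption about the semantics, not the service translator's actual code: "{name}" values in the request template are replaced by input values, and "{name}" values in the response template bind output columns, with a one-element list in the template matching every element of the returned list. The helpers fill_request and extract_rows are hypothetical names.

```python
import json

REQUEST_TEMPLATE = '{"query": "{query}", "n": 100}'
RESPONSE_TEMPLATE = ('{"matches": [{"id": "{id}", "score": "{score}", '
                     '"excerpt": "{excerpt}"}]}')


def _placeholder(value):
    """Return the column name if value looks like "{name}", else None."""
    if isinstance(value, str) and value.startswith("{") and value.endswith("}"):
        return value[1:-1]
    return None


def fill_request(template, inputs):
    """Build the request body by substituting input values for placeholders."""
    def fill(node):
        if isinstance(node, dict):
            return {key: fill(sub) for key, sub in node.items()}
        if isinstance(node, list):
            return [fill(sub) for sub in node]
        name = _placeholder(node)
        return inputs[name] if name is not None else node
    return fill(json.loads(template))


def extract_rows(template, response):
    """Match a parsed JSON response against the template: 0..n output tuples."""
    def match(tmpl, data, row):
        name = _placeholder(tmpl)
        if name is not None:
            yield {**row, name: data}
        elif isinstance(tmpl, dict):
            rows = [row]
            for key, sub in tmpl.items():
                value = (data or {}).get(key)
                rows = [new for old in rows for new in match(sub, value, old)]
            yield from rows
        elif isinstance(tmpl, list):
            # One output tuple per element of the returned list.
            for item in data or []:
                yield from match(tmpl[0], item, row)
        else:
            yield row
    return list(match(json.loads(template), response, {}))
```

With this reading, an empty "matches" list yields zero output tuples, consistent with "input tuple → 0..n output tuples".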
API data as SQL/MED virtual tables
• linked to API operations/procedures
• each procedure defines an access pattern
CREATE FOREIGN TABLE vt_semsearch_match (
query VARCHAR NOT NULL,
id VARCHAR NOT NULL,
score DOUBLE NOT NULL,
excerpt VARCHAR NOT NULL,
PRIMARY KEY (query, id)
) OPTIONS ( "select" 'api_semsearch_query' );
CREATE FOREIGN TABLE vt_semsearch_index (
id VARCHAR PRIMARY KEY,
text VARCHAR NOT NULL
) OPTIONS (
"UPDATABLE" 'true',
"upsert" 'api_semsearch_store',
"delete" 'api_semsearch_clear'
);
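The access-pattern restriction can be illustrated with a small Python sketch. It assumes that a select on vt_semsearch_match is answerable only when the procedure input (query) is bound by an equality, and that conditions on other columns are applied after the call. VirtualTable, AccessPatternError and fake_semsearch_query are hypothetical stand-ins, and the returned row is made up.

```python
class AccessPatternError(Exception):
    """Raised when a query does not bind the inputs an API operation needs."""


class VirtualTable:
    def __init__(self, name, select_proc, required_inputs):
        self.name = name
        self.select_proc = select_proc            # callable standing in for the API call
        self.required_inputs = set(required_inputs)

    def select(self, **equalities):
        missing = self.required_inputs - set(equalities)
        if missing:
            raise AccessPatternError(
                f"{self.name}: no access pattern without {sorted(missing)}")
        # Only the procedure inputs are sent to the API operation.
        inputs = {key: equalities[key] for key in self.required_inputs}
        rows = self.select_proc(**inputs)
        # Remaining equality conditions are checked on the returned rows.
        return [row for row in rows
                if all(row.get(key) == value for key, value in equalities.items())]


def fake_semsearch_query(query):
    # Stand-in for api_semsearch_query; the row below is illustrative only.
    return [{"query": query, "id": "acc-1", "score": 0.87, "excerpt": "..."}]


vt_semsearch_match = VirtualTable(
    "vt_semsearch_match", fake_semsearch_query, required_inputs=["query"])
```

Calling vt_semsearch_match.select(id="acc-1") raises AccessPatternError in this sketch, since no bound procedure can enumerate matches without a query string.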
Query Translation & Execution
Given a VDB defined using SQL/MED + API Bindings and an input query over the VDB
• Teiid splits the query into sub-queries based on translator capabilities and cost heuristics
• sub-queries are sent to translators & Teiid handles remaining operations (e.g., federated joins)
Example SQL query
SELECT s.score,
s.excerpt,
a."AccoCategoryId",
a."AccoDetail-en-Name",
a."AccoDetail-en-City"
FROM srv.vt_semsearch_match AS s
JOIN db.v_accommodationsopen AS a
ON s.id = a."Id"
WHERE s.query = 'horse riding'
ORDER BY s.score DESC
LIMIT 10
Execution plan
LimitNode (limit = 10)
  SortNode (s.score DESC)
    ProjectNode (s.score, ... a."AccoDetail-en-City")
      JoinNode (s.id = a."Id", merge join strategy)
        AccessNode (API)
          SELECT id, excerpt, score
          FROM vt_semsearch_match
          WHERE query = 'horse riding'
        AccessNode (RDB)
          SELECT "Id", "AccoDetail-en-Name",
                 "AccoDetail-en-City"
          FROM v_accommodationsopen
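The plan can be mimicked bottom-up in Python. This is a sketch with in-memory stand-ins for the two access nodes and made-up rows. A hash join is used for brevity instead of the merge join strategy shown in the plan, and the output keys name and city are shortened for readability.

```python
def access_api(query):
    """Stand-in for AccessNode (API): rows of vt_semsearch_match for a query."""
    return [
        {"id": "A", "score": 0.62, "excerpt": "riding school nearby"},
        {"id": "B", "score": 0.91, "excerpt": "horse riding tours"},
        {"id": "Z", "score": 0.99, "excerpt": "not an accommodation"},
    ]


def access_rdb():
    """Stand-in for AccessNode (RDB): rows of v_accommodationsopen."""
    return [
        {"Id": "A", "AccoDetail-en-Name": "Farm A", "AccoDetail-en-City": "Bolzano"},
        {"Id": "B", "AccoDetail-en-Name": "Hotel B", "AccoDetail-en-City": "Merano"},
        {"Id": "C", "AccoDetail-en-Name": "Hostel C", "AccoDetail-en-City": "Brunico"},
    ]


def execute(query, limit=10):
    # JoinNode: hash join on s.id = a."Id" (build side: RDB rows).
    rdb_by_id = {a["Id"]: a for a in access_rdb()}
    joined = [(s, rdb_by_id[s["id"]])
              for s in access_api(query) if s["id"] in rdb_by_id]
    # ProjectNode: keep only the output columns.
    projected = [{"score": s["score"], "excerpt": s["excerpt"],
                  "name": a["AccoDetail-en-Name"],
                  "city": a["AccoDetail-en-City"]} for s, a in joined]
    # SortNode: ORDER BY s.score DESC.
    projected.sort(key=lambda row: row["score"], reverse=True)
    # LimitNode.
    return projected[:limit]
```

Sorting and slicing happen in the federation engine here because they follow the join; only the per-source selections are delegated to the sources.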
Query Translation & Execution – Push-down of Projection, Filtering, Sorting, Slicing
Special input attributes map API capabilities related to standard relational operators
• filtering: return/process only objects matching some criteria (e.g., attribute = or ≥ constant)
• projection: include/exclude certain attributes in returned results
• sorting: sort results according to a certain attribute and direction (ascending/descending)
• slicing: return only a given page of all possible results
CREATE FOREIGN PROCEDURE api_station_data_from_to (
stype VARCHAR NOT NULL,
sname VARCHAR NOT NULL,
tname VARCHAR NOT NULL,
__min_inclusive__mvaliddate DATE NOT NULL, -- filter push down (conditions min <= mvaliddate <= max)
__max_inclusive__mvaliddate DATE NOT NULL,
__limit__ INTEGER -- slicing push down
) RETURNS TABLE ( ... )
OPTIONS ( ... );
Partial/complete push down of these operators whenever possible
• allows offloading computation to the API (e.g., sorting)
• allows reducing costs by manipulating & transferring less data
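How the special input attributes might drive push-down can be sketched in Python. The attribute naming (__min_inclusive__<column>, __max_inclusive__<column>, __limit__) comes from the procedure above; the plan_pushdown function and the (column, operator, value) predicate representation are assumptions for illustration. Sorting and mandatory inputs are not modeled.

```python
PROC_INPUTS = [
    "stype", "sname", "tname",
    "__min_inclusive__mvaliddate", "__max_inclusive__mvaliddate",
    "__limit__",
]


def plan_pushdown(proc_inputs, predicates, limit=None):
    """Split a query into API call arguments and work left to the engine.

    predicates: list of (column, op, value) with op in "=", ">=", "<=", ">", "<".
    Returns (api_arguments, residual_predicates, residual_limit).
    """
    args, residual = {}, []
    for column, op, value in predicates:
        if op == "=" and column in proc_inputs:
            args[column] = value
        elif op == ">=" and f"__min_inclusive__{column}" in proc_inputs:
            args[f"__min_inclusive__{column}"] = value
        elif op == "<=" and f"__max_inclusive__{column}" in proc_inputs:
            args[f"__max_inclusive__{column}"] = value
        else:
            residual.append((column, op, value))
    # Slicing is delegated only if no filtering remains to be done after
    # the call; otherwise the API could cut off rows that are still needed.
    if limit is not None and "__limit__" in proc_inputs and not residual:
        args["__limit__"] = limit
        limit = None
    return args, residual, limit
```

The second return value is what "partial" push-down leaves behind: predicates the API cannot evaluate stay with the federation engine, and in that case the limit stays there too.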
Query Translation & Execution – Exploiting Bulk API Operations
Bulk API operations operate on multiple input tuples, such as lookup by set of IDs or bulk store
• their use enables better performance due to fewer API calls
• useful to speed up dependent joins (using IN operator) between RDBMS and API data
[Figure: dependent join R ⨝ S on R.A = S.A between RDBMS table R and virtual table S, which is backed by a bulk API operation (A input attribute) through API bindings]
1. SELECT A, … FROM R WHERE …
2. Extract values of join attribute A: a1, a2, …
3. SELECT A, … FROM S WHERE A IN (a1, a2, …) AND …
4. Bulk API calls with multiple input tuples for different values of A: a1, a2, …
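The four steps can be sketched in Python as a dependent join that batches join values into bulk calls. The batch size, the bulk_lookup stand-in and its made-up results are assumptions for illustration.

```python
def bulk_lookup(values):
    """Stand-in for a bulk API operation: one call, many input values of A."""
    bulk_lookup.calls += 1
    # Made-up behaviour: only even values of A have a match in S.
    return [{"A": a, "label": f"item-{a}"} for a in values if a % 2 == 0]


bulk_lookup.calls = 0


def dependent_join(r_rows, lookup, batch_size=100):
    """Join rows of R (step 1 result) with S, reached through bulk API calls."""
    # Step 2: distinct values of the join attribute, in first-seen order.
    values = list(dict.fromkeys(r["A"] for r in r_rows))
    s_by_a = {}
    # Steps 3-4: one bulk call per batch of values (the IN list).
    for start in range(0, len(values), batch_size):
        batch = values[start:start + batch_size]
        for s in lookup(batch):
            s_by_a.setdefault(s["A"], []).append(s)
    # Final join R.A = S.A.
    return [{**r, **s} for r in r_rows for s in s_by_a.get(r["A"], [])]
```

With 10 distinct values and batch_size=4, this sketch issues 3 bulk calls instead of 10 single-value calls.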
Data Materialization
Data materialization: required by API operations that cannot be invoked at query time
• operations too expensive to call at query time (e.g., align API and DB identifiers)
• operations instrumental to the use of external APIs (e.g., text indexing in a search engine)
Solution #1: materialized views in Teiid (or other data federation system used)
Solution #2: dedicated materialization engine for flexibly executing arbitrary materialization rules:
• identifier – for documentation & diagnostics
• target – the system-managed computed table (possibly virtual) where data is stored
• source – arbitrary SQL query (over any tables) that produces the data to store
rules:
  - id: index_accommodation_texts
    target: vt_semsearch_index
    source: |-
      SELECT "Id" AS id,
             "AccoDetail-en-Longdesc" AS text
      FROM v_accommodationsopen
      WHERE "AccoDetail-en-Longdesc" IS NOT NULL
  - ... other rules ...
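Executing one such rule amounts to running the source query and storing each result row in the target. A minimal Python sketch follows, assuming an in-memory SQLite database as a stand-in for the VDB and a dictionary-backed FakeIndex as a stand-in for the API-backed target table; run_rule, FakeIndex and demo_connection are hypothetical names and the sample rows are made up.

```python
import sqlite3

RULE = {
    "id": "index_accommodation_texts",
    "target": "vt_semsearch_index",
    "source": '''SELECT "Id" AS id, "AccoDetail-en-Longdesc" AS text
                 FROM v_accommodationsopen
                 WHERE "AccoDetail-en-Longdesc" IS NOT NULL''',
}


class FakeIndex:
    """Stand-in for the target table: upsert keyed by id."""
    def __init__(self):
        self.docs = {}

    def upsert(self, row):
        self.docs[row["id"]] = row["text"]


def run_rule(rule, conn, targets):
    """Run the rule's source query and upsert every row into its target."""
    target = targets[rule["target"]]
    cursor = conn.execute(rule["source"])
    columns = [description[0] for description in cursor.description]
    count = 0
    for values in cursor:
        target.upsert(dict(zip(columns, values)))
        count += 1
    return count


def demo_connection():
    """In-memory table with two made-up accommodations, one without text."""
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE v_accommodationsopen '
                 '("Id" TEXT PRIMARY KEY, "AccoDetail-en-Longdesc" TEXT)')
    conn.executemany("INSERT INTO v_accommodationsopen VALUES (?, ?)",
                     [("A", "Farm with horse riding"), ("B", None)])
    return conn
```

Because the target is written with upserts, re-running the rule in this sketch refreshes existing entries rather than duplicating them.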
Data Materialization (cont’d)
Rules (their SQL source queries) are analyzed to derive a rule dependency graph, which is mapped to an execution plan using fixpoint rule evaluation for strongly connected components
[Figure: three panels. Rule / Table Dependencies and Rule Dependencies (graphs over rules R1 to R5), and the resulting Execution Plan:]
sequence (
  parallel (
    R1,
    sequence (
      R2,
      fixpoint (
        parallel (
          R3,
          R4
        )
      )
    )
  ),
  R5
)
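A sketch of how a plan of this kind could be derived, in Python, using Tarjan's algorithm for strongly connected components. The dependency edges in DEPS are an assumption chosen to be compatible with the plan above (the graph itself is not recoverable from the slide text), and the sketch emits a purely sequential plan, omitting the parallel steps.

```python
# rule -> rules whose output it reads (assumed edges, for illustration only)
DEPS = {
    "R1": [],
    "R2": [],
    "R3": ["R2", "R4"],
    "R4": ["R3"],
    "R5": ["R1", "R4"],
}


def strongly_connected_components(deps):
    """Tarjan's algorithm; components are emitted dependencies-first."""
    index, low, on_stack, stack, result = {}, {}, set(), [], []

    def visit(node):
        index[node] = low[node] = len(index)
        stack.append(node)
        on_stack.add(node)
        for dep in deps[node]:
            if dep not in index:
                visit(dep)
                low[node] = min(low[node], low[dep])
            elif dep in on_stack:
                low[node] = min(low[node], index[dep])
        if low[node] == index[node]:
            # node is the root of a component: pop its members off the stack.
            component = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.append(member)
                if member == node:
                    break
            result.append(sorted(component))

    for node in deps:
        if node not in index:
            visit(node)
    return result


def execution_plan(deps):
    """Sequential plan: cyclic components run to fixpoint, others run once."""
    plan = []
    for component in strongly_connected_components(deps):
        cyclic = len(component) > 1 or component[0] in deps[component[0]]
        plan.append(("fixpoint", component) if cyclic else ("run", component[0]))
    return plan
```

Since edges point from a rule to its dependencies, each component is emitted only after everything it reads from, so the emission order is already a valid execution order.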
VKG over APIs – Ontology-Based Data Access (OBDA) & Ontop
OBDA builds a VKG on an RDB source
• an ontology defines the VKG classes and properties (TBox)
• mappings define how to populate each class/property with RDB data (ABox)
• query rewriting maps VKG queries (SPARQL) into native queries (SQL) over the source
• Ontop open-source system
Idea: build a VDB over APIs, then apply OBDA to convert it into a VKG
• Ontop + Teiid/service translator
[Figure: query answering by query rewriting. An ontological query q is rewritten using the ontology (Rewriting), unfolded via the mappings into SQL (Unfolding), evaluated over the data sources to obtain a relational answer (Evaluation), and translated into an ontological answer (Result Translation). Source: Diego Calvanese, Francesco Corcoglioniti, Guohui Xiao (unibz), “VKGs for Data Access and Integration”, Huawei – 03/08/202]
VKG over APIs – Ontology & Mappings Example
Ontology
schema:Accommodation a owl:Class ;
    rdfs:subClassOf schema:Place ;
    rdfs:label "Accommodation"@en ;
    ...
schema:name a owl:DatatypeProperty ;
    ...
hive:Match a owl:Class ...
Current ontology formalism (OWL 2 QL) reused as is, but now also models data from APIs
Mappings
mappingId Semantic Search
target    data:match/accommodation/{id}/{query}
              a hive:Match;
              hive:query {query}^^xsd:string;
              hive:resource data:accommodation/{id};
              hive:excerpt {excerpt}@en;
              hive:score {score}^^xsd:decimal.
source    SELECT *
          FROM hiveodh.srv.vt_semsearch_match
Current VKG mapping formalism reused as is, but data may now come from API virtual tables
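What the mapping asserts for one row of vt_semsearch_match can be sketched in Python. Ontop itself does not materialize triples this way (it uses the mapping for query rewriting); the sketch only illustrates the template semantics. Percent-encoding of values inside IRI templates via urllib.parse.quote is an approximation of IRI-safe encoding, and the function names and the sample row are hypothetical.

```python
from urllib.parse import quote

SUBJECT = "data:match/accommodation/{id}/{query}"
PROPERTIES = [
    ("hive:query", '"{query}"^^xsd:string'),
    ("hive:resource", "data:accommodation/{id}"),
    ("hive:excerpt", '"{excerpt}"@en'),
    ("hive:score", '"{score}"^^xsd:decimal'),
]


def iri(template, row):
    """Fill an IRI template, percent-encoding each column value."""
    return template.format(**{k: quote(str(v), safe="") for k, v in row.items()})


def literal(template, row):
    """Fill a literal template, escaping backslashes and double quotes."""
    escaped = {k: str(v).replace("\\", "\\\\").replace('"', '\\"')
               for k, v in row.items()}
    return template.format(**escaped)


def triples(row):
    """Triples asserted by the Semantic Search mapping for one source row."""
    subject = iri(SUBJECT, row)
    result = [(subject, "a", "hive:Match")]
    for prop, template in PROPERTIES:
        is_literal = template.startswith('"')
        obj = literal(template, row) if is_literal else iri(template, row)
        result.append((subject, prop, obj))
    return result
```

For a row with id A1 and query "horse riding", the subject becomes data:match/accommodation/A1/horse%20riding under this encoding.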
VKG over APIs – Query Rewriting & Evaluation Example
User-supplied SPARQL query
SELECT ?h ?posLabel ?rating ?pos {
  [] a hive:Match ;
     hive:query "horse riding"^^xsd:string ;
     hive:resource ?h ;
     hive:excerpt ?excerpt ;
     hive:score ?score .
  ?h a schema:LodgingBusiness ;
     geo:defaultGeometry/geo:asWKT ?pos ;
     schema:name ?name ;
     schema:description ?description ;
     schema:starRating/schema:ratingValue ?rating.
  FILTER (?rating >= 3 && lang(?name) = 'en' &&
          lang(?description) = 'en')
  BIND (CONCAT(?name, " <br><br>...", ?excerpt,
               "...<br><br>", ?description) AS ?posLabel)
}
ORDER BY DESC(?score) LIMIT 10
SQL query rewritten by Ontop
SELECT
  v1.id,
  v1.excerpt,                   -- fields used
  v2."AccoDetail-en-Name",      -- for deriving
  v2."AccoDetail-en-Longdesc",  -- ?posLabel
  ... complex expression computing rating ...,
  ST_ASTEXT(v2."Geometry")
FROM
  hiveodh.srv.vt_semsearch_match v1,
  hiveodh.db.v_accommodationsopen v2
WHERE
  v1."id" = v2."Id" AND
  CAST(v1."query" AS TEXT) = 'horse riding' AND
  ... complex condition on rating >= 3 ... AND
  ... nonnull conditions for output columns ...
ORDER BY CAST(v1."score" AS DECIMAL) DESC
LIMIT 10
SQL query evaluated on the VDB by Teiid
VKG over APIs – ODH with Semantic Search Demo
Data sources
DB with ODH tourism data + Semantic search API to index & query accommodations texts
System
Ontop embedding Teiid + materialization engine
Demo
https://hive.inf.unibz.it/odh/vkg/
Overall Framework & Ongoing Work
[Figure: overall framework. VDB-based applications send SQL to the Virtual DB (VDB, Teiid + service translator), which accesses RDB sources via SQL and API sources via calls. VKG-based applications send SPARQL to the Virtual Knowledge Graph (VKG, Ontop), which sends SQL to the VDB.]
• VKG Ontology: formalizes the classes/properties (the “schema”) of the VKG, enabling reasoning
• VKG Mappings: including virtual tables, used for query rewriting
• Materialization Rules: pre-compute results of expensive API calls → VDB/VKG no longer fully “virtual”
• API Bindings: define how to query/update a virtual table via API calls, if possible → limited access patterns
Ongoing work:
1. query rewriting tuned to VDB + APIs
2. service translator improvements
3. change data capture tools (e.g. Debezium) for incremental materialization
4. application to analysis of static + dynamic data in the domain of climate risk management (OntoCRM project)
Thanks for attending!
Ontop: https://ontop-vkg.org/
Teiid: https://teiid.io/
our extensions: https://hive.inf.unibz.it/

SFScon22 - Francesco Corcoglioniti - Integrating Dynamically-Computed Data and Web APIs into Virtual Databases Knowledge Graph.pdf
