SlideShare a Scribd company logo
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/1
Outline
• Introduction
• Background
• Distributed Database Design
• Database Integration
• Semantic Data Control
• Distributed Query Processing
• Multimedia Query Processing
• Distributed Transaction Management
• Data Replication
• Parallel Database Systems
• Distributed Object DBMS
• Peer-to-Peer Data Management
• Web Data Management
➡ Web Models
➡ Web Search and Querying
➡ Distributed XML Processing
• Current Issues
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/2
Web Overview
• Publicly indexable web:
➡ More than 25 billion static HTML pages.
➡ Over 53 billion pages in dynamic web
• Deep web (hidden web)
➡ Over 500 billion documents
• Most Internet users gain access to the web using search engines.
• Very dynamic
➡ 23% of web pages change daily.
➡ 40% of commercial pages change daily.
• Static versus dynamic web pages
• Publicly indexable web (PIW) versus hidden web (or deep web)
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/3
Properties of Web Data
• Lack of a schema
➡ Data is at best “semi-structured”
➡ Missing data, additional attributes, “similar” data but not identical
• Volatility
➡ Changes frequently
➡ May conform to one schema now, but not later
• Scale
➡ Does it make sense to talk about a schema for Web?
➡ How do you capture “everything”?
• Querying difficulty
➡ What is the user language?
➡ What are the primitives?
➡ Aren’t search engines or metasearch engines sufficient?
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/4
Web Graph
• Nodes: pages; edges: hyperlinks
• Properties
➡ Volatile
➡ Sparse
➡ Self-organizing
➡ Small-world network
➡ Power law network
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/5
Structure of the Web
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/6
Web Data Modeling
• Can’t depend on a strict schema to structure the data
• Data are self-descriptive
{name: {first:“Tamer”, last: “Ozsu”}, institution:
“University of Waterloo”, salary: 300000}
• Usually represented as an edge-labeled graph
➡ XML can also be modeled this way
name
institution
salary
first last
“Tamer” “Ozsu”
“UW” 300000
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/7
Search Engine Architecture
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/8
Web Crawling
• What is a crawler?
• Crawlers cannot crawl the whole Web. It should try to visit the “most
important” pages first.
• Importance metrics:
➡ Measure the importance of a Web page
➡ Ranking
✦ Static
✦ Dynamic
• Ordering metric
➡ How to choose the next page to crawl
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/9
Static Ranking - PageRank
• Quality of a page determined by the number of incoming links and the
importance of the pages of those links
• PageRank of page pi:
➡ Bpi
: backlink pages of pi (i.e., pages that point to pi)
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/10
Ordering Metrics
• Breadth-first search
➡ Visit URLs in the order they were discovered
• Random
➡ Randomly choose one of the unvisited pages in the queue
• Incorporate importance metric
➡ Model a random surfer: when on page P, choose one of the URLs on that page
with equal probability d or jump to a random page with probability (1-d)
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/11
Web Crawler Types
• Many Web pages change frequently, so the crawler has to revisit already
crawled pages  incremental crawlers
• Some search engines specialize in searching pages belonging to a
particular topic  focused crawlers
• Search engines use multiple crawlers sitting on different machines and
running in parallel. It is important to coordinate these parallel crawlers to
prevent overlapping  parallel crawlers
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/12
Indexing
• Structure index
➡ Link structure
• Text index
➡ Indexing the content
➡ Suffix arrays, inverted index, signature files
➡ Inverted index most common
• Difficulties of inverted index
➡ The huge size of the Web
➡ The rapid change makes it hard to maintain
➡ Storage vs. performance efficiency
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/13
Web Querying
• Why Web Querying?
➡ It is not always easy to express information requests using keywords.
➡ Search engines do not make use of Web topology and document structure in
queries.
• Early Web Query Approaches
➡ Structured (Similar to DBMSs): Data model + Query Language
➡ Semi-structured: e.g. Object Exchange Model (OEM)
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/14
Web Querying
• Question Answering (QA) Systems
➡ Finding answers to natural language questions, e.g. What is Computer?
➡ Analyze the question and try to guess what type of information that is
required.
➡ Not only locate relevant documents but also extract answers from them.
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/15
Approaches to Web Querying
• Search engines and metasearchers
➡ Keyword-based
➡ Category-based
• Semistructured data querying
• Special Web query languages
• Question-Answering
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/16
Semistructured Data Querying
• Basic principle: Consider Web as a collection of semistructured data and
use those techniques
• Uses an edge-labeled graph model of data
• Example systems & languages:
➡ Lore/Lorel
➡ UnQL
➡ StruQL
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/17
OEM Model
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/18
Lorel Example
Find the titles of documents written by Patrick Valduriez
SELECT D.title
FROM bib.doc D
WHERE bib.doc(.authors)?.author = “Patrick Valduriez”
Find the authors of all books whose price is under $100
SELECT D(.authors)?.author
FROM bib.doc D
WHERE D.what = “Books” AND D.price < 100
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/19
Evaluation
• Advantages
➡ Simple and flexible
➡ Fits the natural link structure of Web pages
• Disadvantages
➡ Data model too simple (no record construct or ordered lists)
➡ Graph can become very complicated
✦ Aggregation and typing combined
✦ DataGuides
➡ No differentiation between connection between documents and subpart
relationships
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/20
DataGuide
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/21
Web Query Languages
• Basic principle: Take into account the documents’ content and internal
structure as well as external links
• The graph structures are more complex
• First generation
➡ Model the web as interconnected collection of atomic objects
➡ WebSQL
➡ W3QS
➡ WebLog
• Second generation
➡ Model the web as a linked collection of structured objects
➡ WebOQL
➡ StruQL
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/22
WebSQL Examples
• Simple search for all documents about “hypertext”
SELECT D.URL, D.TITLE
FROM DOCUMENT D
SUCH THAT D MENTIONS “hypertext”
WHERE D.TYPE = “text/html”
• Find all links to applets from documents about “Java”
SELECT A.LABEL, A.HREF
FROM DOCUMENT D
SUCH THAT D MENTIONS “Java”
ANCHOR A
SUCH THAT BASE = X
WHERE A.LABEL = “applet”
Demonstrates two scoping methods and a search for links.
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/23
WebSQL Examples (cont’d)
• Find documents that have string “database” in their title that are reachable
from the ACM Digital Library home page through paths of length ≤ 2
containing only local links.
SELECT D.URL, D.TITLE
FROM DOCUMENT D
SUCH THAT ”http://www.acm.org/dl”=|->|->-> D
WHERE D.TITLE CONTAINS “database”
Demonstrates the use of different link types.
• Find documents mentioning “Computer Science” and all documents that
are linked to them through paths of length ≤ 2 containing only local links
SELECT D1.URL, D1.TITLE, D2.URL, D2.TITLE
FROM DOCUMENT D1
SUCH THAT D1 MENTIONS “Computer Science”
DOCUMENT D2
SUCH THAT D1=|->|->-> D2
Demonstrates combination of content and structure specification in a query.
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/24
WebOQL
➡ Prime
➡ Peek
➡ Hang
➡ Concatenate
➡ Head
➡ Tail
➡ String pattern match
(~)
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/25
WebOQL Examples
Find the titles and abstracts of all documents authored by
“Ozsu”
SELECT [y.title, y‟.URL]
FROM x IN dbDocuments, y in x‟
WHERE y.authors ~ “Ozsu”
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/26
Evaluation
• Advantages
➡ More powerful data model - Hypertree
✦ Ordered edge-labeled tree
✦ Internal and external arcs
➡ Language can exploit different arc types (structure of the Web pages can be
accessed)
➡ Languages can construct new complex structures.
• Disadvantages
➡ You still need to know the graph structure
➡ Complexity issue
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/27
Question-Answer Approach
• Basic principle: Web pages that could contain the answer to the user query
are retrieved and the answer extracted from them.
• NLP and information extraction techniques
• Used within IR in a closed corpus; extensions to Web
• Examples
➡ QASM
➡ Ask Jeeves
➡ Mulder
➡ WebQA
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/28
QA Systems
• Analyze and classify the question, depending on the expected answer type.
• Using IR techniques, retrieve documents which are expected to contain the
answer to the question.
• Analyze the retrieved documents and decide on the answer.
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/29
Question-Answer System
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/30
Searching The Hidden Web
• Publicly indexable web (PIW) vs. hidden web
• Why is Hidden Web important?
➡ Size: huge amount of data
➡ Data quality
• Challenges:
➡ Ordinary crawlers cannot be used.
➡ The data in hidden databases can only be accessed through a search interface.
➡ Usually, the underlying structure of the database is unknown.
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/31
Searching The Hidden Web
• Crawling the Hidden Web
➡ Submit queries to the search interface of the database
✦ By analyzing the search interface, trying to fill in the fields for all
possible values from a repository
✦ By using agents that find search forms, learn to fill them, and retrieve
the result pages
➡ Analyze the returned result pages
✦ Determine whether they contain results or not
✦ Use templates to extract information
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/32
Searching The Hidden Web
• Metasearching
➡ Database selection – Query Translation – Result Merging
➡ Database selection is based on Content Summaries.
➡ Content Summary Extraction:
✦ RS-Ord and RS-Lrd
✦ Focused Probing with Database Categorization
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/33
Searching The Hidden Web
• Metasearching
➡ Database Selection:
✦ Find the best databases to evaluate a given query.
✦ bGlOSS
✦ Selection from categorized databases
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/34
Distributed XML
• XML increasingly used for encoding web data  data representation
language
➡ Web 2.0
• Data exchange language
➡ Web services
• Used to encode or annotate non-web semistructured or unstructured data
• Data repository sizes growing  use distribution
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/35
XML Overview
• Similarities to semistructured models discussed earlier
• Data is divided into pieces called elements
• Elements can be nested, but not overlapped
➡ Nesting represents hierarchical relationships between the elements
• Elements have attributes
• Elements can also have relationships to other elements (ID-IDREF)
• Can be represented as a graph, but usually simplified as a tree
➡ Root element
➡ Zero or more child elements representing nested subelements
➡ Document order over elements
➡ Attributes are also shown as nodes
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/36
XML Schema
• Can be represented by a Document Type Definition (DTD) or XMLSchema
• Simpler definition using a schema graph: 5-tuple , , s, m,
➡ is an alphabet of XML document node types
➡ × is a set of edges between node types
✦ e=( 1, 2) ∈ denotes item of type 1 may contain an item of type 2
➡ s: {ONCE, OPT, MULT}
✦ ONCE: item of type 1 must contain exactly one item of type 2
✦ OPT: item of type 1 may or may not contain an item of type 2
✦ MULT: item of type 1 may contain multiple items of type 2
➡ m: {string}
✦ m( ) denotes the domain of text content of an item of type
➡ is the root node type
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/37
Example
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/38
Path Expressions
• A list of steps
• Each step consists of
➡ An axis (13 of them)
➡ A name test
➡ Zero or more qualifiers
➡ Last step is called a return step
• Syntax
➡ Path::= Step(“/”Step)*
➡ Step::= axis”::”NameTest(Qualifier)*
➡ NameTest::=
ElementName|AttributeName|”*”
➡ Qualifier::=“[”Expre”]”
➡ Expr::=Path(Comp Atomic)?
➡ Comp:==“=”|“>”|“<”|“>=”|“<=”|“!=”
➡ Atomic::==“`”String”’”
child (/)
descendent
descendent-or-self (//)
parent
attribute (/@)
self (.)
ancestor
ancestor-or-self
following-sibling
following
preceding-sibling
preceding
namespace
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/39
Path Expression Example
• /author[.//last = “Valduriez”]//book[price < 100]
• Name constraints
➡ Correspond to name tests
• Structural constraints
➡ Correspond to axes
• Value constraints
➡ Correspond to value comparisons
• Query pattern tree (QTP)
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/40
XQuery
• FLWOR expression
➡ “for”, “let”, “where”, “order by”, “return” clauses
➡ Each clause can reference path expressions or other FLWOR expressions
➡ Similar to SQL select-from-where-aggregate but operating on a list of XML
document tree nodes
• Example: Return a list of books with their title and price ordered by their
attribute names
let $col := collection(“bib”)
for $author in $col/author
order by $author.name
for $b in $author/pubs/book
let $title := $b/title
let $price := $b/price
return $title, $price
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/41
XML Query Processing
Techniques
• Large object (LOB) approach
➡ Store original XML documents as-is in a LOB column
➡ Similar to storing documents in a file system
➡ Simple to implement and provides byte-level fidelity
➡ Slow in processing queries due to XML parsing at query execution time
• Extended relational approach
➡ Shred XML documents into object-relational tables and columns and stored in
relational or object-relational systems
➡ If properly designed, can perform very well (uses well established relational
technology)
➡ Insertion, fragment extraction, structural update, and document
reconstruction require considerable effort
• Native approach
➡ Use a tree-structured data model and introduce operators that are optimized
for tree navigation, insertion, deletion, and update
➡ Usually well balanced tradeoff among criteria
➡ Requires specialized processing and optimization techniques
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/42
Path Expression Evaluation
• Navigational approach
➡ Built on top of native XML storage systems
➡ Match the QTP by traversing the XML document tree
➡ Query-driven: each location step is translated into an algebraic operator
which performs the navigation
➡ Data-driven: builds an automaton for a path expression and executes the
automaton by navigating the XML document tree
• Join-based approach
➡ Usually based on extended relational storage systems
➡ Each location step is associated with an input list of elements whose names
match with the name test of the step
➡ Two lists of adjacent location steps are joined based on their structural
relationship (structural join operation)
• Evaluation
➡ Join-based efficient in evaluating dependent axes, but not as efficient as
navigational in queries that have only child axes
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/43
• Ad hoc fragmentation
➡ No explicit, schema-based fragmentation specification
➡ Arbitrarily cut edges in XML document graphs
➡ Example: Active XML
✦ Static part of document: XML data
✦ Dynamic part of document: function call to web services
• Original document
<author>
<name>J. Doe</name>
…
<call fun=“getPub(„J.
Doe‟)” />
</author>
• After service call
<author>
<name>J. Doe</name>
…
<pubs>
<book> … </book>
…
</pubs>
</author>
Fragmenting XML Data
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/44
Fragmenting XML Data (cont’d)
• Structure-based fragmentation
➡ Fragmentation is based on characteristics of schema or data
➡ Similar to relational fragmentation
• Horizontal fragmentation
➡ Based on selection
➡ Specified by set of predicates on data
• Vertical fragmentation
➡ Based on projection
➡ Specified by fragmenting the set of node types in the schema
• Hybrid fragmentation
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/45
Horizontal Fragmentation
Let D = {d1, d2, … , dn} be a collection of document trees such that each di ∈ D
follows the same schema. Define a set of predicates P = {p0, p1, …, pl−1} such
that d ∈ D: unique pi ∈ P where pi(d). Then F = {{d ∈ D| pi(d)} | pi ∈ P} is a
horizontal fragmentation of D.
• Fragmentation tree patterns (FTP)
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/46
Horizontal Fragmentation
Example – Instance
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/47
Vertical Fragmentation
Let 〈 , , s, m, 〉 be a schema graph. Then we can define a vertical
fragmentation function :  F where F is a partitioning of .
• Fragment that has the root element is called the root fragment
• Child fragment and parent fragment are defined
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/48
Vertical Fragmentation Example
– Schema
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/49
Vertical Fragmentation Example
– Instance
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/50
Proxy Nodes
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/51
Optimizing Distributed XML
• Data shipping versus query shipping
✦ Data shipping moves data from where it is to where the query is posed and
processes it there
✦ XQuery built-in functionality: fn:doc(URI)
✦ Simple to implement
✦ Provides only inter-query parallelism (no intra-query parallelism)
✦ Requires sufficient storage space at each site to hold data
✦ Large data moving overhead
➡ Query shipping (or function shipping) decomposes the query such that each
subquery is evaluated at a site where the data reside
✦ Better parallelization
✦ Difficult in the context of XML – both the function and its parameters need to be
shipped
✦ Packaging required to implement call-by-value semantics
✓ Serialization of the subtree rooted at the parameter node is packaged and shipped
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/52
Data Shipping in XML
• Difficulties with packaging:
➡ There may be some axes (e.g., parent, preceding-sibling) that are now
“downward” from the parameter node requiring access to data that may not
be available in the subtree of the parameter node.
➡ Node identity problems: Two identical nodes passed as parameters and
returned as results are represented by call-by-value semantics as two different
nodes.
➡ Document order of nodes – serialization may not obey these
➡ Interaction between different subqueries that access the same document on a
given peer
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/53
Query Shipping Approach
• Partial function evaluation
➡ Given f(x,y), partial evaluation would compute f on one of the inputs and
generate a partial answer, which would be another function f’ that is only
dependent on the second input.
➡ Query is considered a function and data fragments its inputs.
➡ Processing outline:
✦ Phase 1:
✓ Coordinating site assigns the query to those sites that hold a relevant fragment.
✓ Each site evaluates the query (in parallel) on its local data.
✓ For some data nodes, the value of each query qualifier is known; for others, the value of some
qualifiers is a Boolean formula whose value is not yet fully determined
✦ Phase 2: The selection part of the query is (partially) evaluated  for each node
of each fragment, either it is part of the query’s answer, or it is not even a
candidate to be part of the answer
✦ Phase 3: Check candidate nodes again to determine which ones are indeed in the
result and send them to the coordinator
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/54
Query Shipping Approach
(cont’d)
• XRPC
➡ Introduce a remote procedure call functionality through
execute at {Expr}{FunApp(ParamList)}
where Expr is the explicit or computed URI of the peer where FunApp() is
to be applied.
➡ Call-by-projection
✦ A runtime projection technique to minimize message sizes.
✦ Preserves node identities and structural properties of XML node parameters as
well.
✦ A node parameter is analyzed to see how it is used by a remote function.
✦ Only those descendants of the node parameter that are actually used by the
remote function are serialized into the request message.
✦ Nodes outside the subtree of nth node parameter are added to the request
message if they are needed by the remote function.
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/55
Query Shipping Approach
(cont’d)
• Use the fragmentation specification and the query specification
• The following considers vertical fragmentation
GQTP Subqueries

More Related Content

PPTX
Database , 15 Object DBMS
PPTX
Database , 5 Semantic
PPTX
Database ,16 P2P
PPTX
Database ,18 Current Issues
PPTX
Database , 12 Reliability
PPTX
Database ,14 Parallel DBMS
PPTX
Database , 6 Query Introduction
PPTX
Database ,10 Transactions
Database , 15 Object DBMS
Database , 5 Semantic
Database ,16 P2P
Database ,18 Current Issues
Database , 12 Reliability
Database ,14 Parallel DBMS
Database , 6 Query Introduction
Database ,10 Transactions

What's hot (19)

PPTX
Database , 13 Replication
PDF
9 virtual memory management
PPTX
NoSQL Evolution
PPT
40 demand paging
PPTX
Hbase hivepig
PPT
IWMW 1997: WWW Caching
PDF
Week 17 slides 1 7 multidimensional, parallel, and distributed database
PDF
Operating Systems - Virtual Memory
PDF
MySQL Cluster page management (2014)
PPT
IWMW 1997: Database-WWW Integration
PPTX
SHADOW PAGING and BUFFER MANAGEMENT
ODP
Gluster 3.3 deep dive
PPT
Virtual Memory
PDF
Scalability
PPTX
Appache Cassandra
PDF
NoSQL-Database-Concepts
PPT
PPTX
Caching in drupal
PPTX
Chapter 4 terminolgy of keyvalue databses from nosql for mere mortals
Database , 13 Replication
9 virtual memory management
NoSQL Evolution
40 demand paging
Hbase hivepig
IWMW 1997: WWW Caching
Week 17 slides 1 7 multidimensional, parallel, and distributed database
Operating Systems - Virtual Memory
MySQL Cluster page management (2014)
IWMW 1997: Database-WWW Integration
SHADOW PAGING and BUFFER MANAGEMENT
Gluster 3.3 deep dive
Virtual Memory
Scalability
Appache Cassandra
NoSQL-Database-Concepts
Caching in drupal
Chapter 4 terminolgy of keyvalue databses from nosql for mere mortals
Ad

Viewers also liked (20)

PPTX
Database ,2 Background
PDF
Ethernet Technology
PDF
Anwar e-sabiri(complete)
PPTX
Hank Iving Media Plan
PDF
Coyaima ie. juan xxiii manual de convivencia
PDF
Virgen de Chiquinquirá en Colombia
PDF
Mariquita iet francisco nuñez pedrozo manual convivencia antiguo
PPTX
College Students
PPT
Prezentacja.1
PPT
Gsm (Part 3)
PDF
Muslim Contributions in Mathematics
PPTX
Local Search Hawaii Michael Dorausch PubCon SEO
PPT
[DSBW Spring 2010] Unit 10: XML and Web And beyond
PPT
A Data Fusion System for Spatial Data Mining, Analysis and Improvement Silvij...
PDF
8 ontology integration and interoperability (onto i op)
PPTX
Ontology integration - Heterogeneity, Techniques and more
PDF
Enterprise and Data Mining Ontology Integration to Extract Actionable Knowled...
PPT
[ABDO] Data Integration
PDF
Pal gov.tutorial2.session13 2.gav and lav integration
Database ,2 Background
Ethernet Technology
Anwar e-sabiri(complete)
Hank Iving Media Plan
Coyaima ie. juan xxiii manual de convivencia
Virgen de Chiquinquirá en Colombia
Mariquita iet francisco nuñez pedrozo manual convivencia antiguo
College Students
Prezentacja.1
Gsm (Part 3)
Muslim Contributions in Mathematics
Local Search Hawaii Michael Dorausch PubCon SEO
[DSBW Spring 2010] Unit 10: XML and Web And beyond
A Data Fusion System for Spatial Data Mining, Analysis and Improvement Silvij...
8 ontology integration and interoperability (onto i op)
Ontology integration - Heterogeneity, Techniques and more
Enterprise and Data Mining Ontology Integration to Extract Actionable Knowled...
[ABDO] Data Integration
Pal gov.tutorial2.session13 2.gav and lav integration
Ad

Similar to Database , 17 Web (20)

PPTX
DBMS Notes for BSC Students for all batch
PPTX
1 introduction
PPTX
Database , 4 Data Integration
PPTX
Query_Introduction and query optimization
PPTX
AUERY.pptxHDSOILDKCJSIDVCBIDCSDCJNSOIDCNSOD
PDF
Getting to Real-Time in a Multi-Model Architecture
PPTX
dbms introduction.pptx
PDF
Database Creation and Management with ML
PPTX
Database Integration in Distributed Database
PPTX
PPTX
INDUSTRIAL TRAINING Presentation on Web Development. (2).pptx
PDF
6-Query_Intro (5).pdf
PPT
DBMS - Introduction.ppt
PPTX
Database , 1 Introduction
PPTX
1 introduction DDBS
PPTX
NoSQL: An Analysis
PPTX
NoSQLDatabases
PPTX
1 introduction ddbms
PPTX
History and Introduction to NoSQL over Traditional Rdbms
PPTX
Relational databases store data in tables
DBMS Notes for BSC Students for all batch
1 introduction
Database , 4 Data Integration
Query_Introduction and query optimization
AUERY.pptxHDSOILDKCJSIDVCBIDCSDCJNSOIDCNSOD
Getting to Real-Time in a Multi-Model Architecture
dbms introduction.pptx
Database Creation and Management with ML
Database Integration in Distributed Database
INDUSTRIAL TRAINING Presentation on Web Development. (2).pptx
6-Query_Intro (5).pdf
DBMS - Introduction.ppt
Database , 1 Introduction
1 introduction DDBS
NoSQL: An Analysis
NoSQLDatabases
1 introduction ddbms
History and Introduction to NoSQL over Traditional Rdbms
Relational databases store data in tables

More from Ali Usman (17)

PPT
Cisco Packet Tracer Overview
PDF
Islamic Arts and Architecture
PPTX
Database ,11 Concurrency Control
PPTX
Database , 8 Query Optimization
PPTX
Database ,7 query localization
PPTX
Database, 3 Distribution Design
DOCX
Processor Specifications
PDF
Fifty Year Of Microprocessor
PDF
Discrete Structures lecture 2
PDF
Discrete Structures. Lecture 1
PDF
Muslim Contributions in Medicine-Geography-Astronomy
PDF
Muslim Contributions in Geography
PDF
Muslim Contributions in Astronomy
DOCX
Processor Specifications
PDF
Ptcl modem (user manual)
PDF
Nimat-ul-ALLAH shah wali
PDF
Osi protocols
Cisco Packet Tracer Overview
Islamic Arts and Architecture
Database ,11 Concurrency Control
Database , 8 Query Optimization
Database ,7 query localization
Database, 3 Distribution Design
Processor Specifications
Fifty Year Of Microprocessor
Discrete Structures lecture 2
Discrete Structures. Lecture 1
Muslim Contributions in Medicine-Geography-Astronomy
Muslim Contributions in Geography
Muslim Contributions in Astronomy
Processor Specifications
Ptcl modem (user manual)
Nimat-ul-ALLAH shah wali
Osi protocols

Recently uploaded (20)

PPTX
Spectroscopy.pptx food analysis technology
PDF
A comparative analysis of optical character recognition models for extracting...
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PPT
Teaching material agriculture food technology
PDF
Approach and Philosophy of On baking technology
PDF
MIND Revenue Release Quarter 2 2025 Press Release
PPTX
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PDF
Review of recent advances in non-invasive hemoglobin estimation
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PDF
Network Security Unit 5.pdf for BCA BBA.
PPTX
Big Data Technologies - Introduction.pptx
PPTX
Programs and apps: productivity, graphics, security and other tools
PDF
gpt5_lecture_notes_comprehensive_20250812015547.pdf
PPTX
sap open course for s4hana steps from ECC to s4
PDF
cuic standard and advanced reporting.pdf
PPTX
A Presentation on Artificial Intelligence
PDF
Chapter 3 Spatial Domain Image Processing.pdf
Spectroscopy.pptx food analysis technology
A comparative analysis of optical character recognition models for extracting...
Dropbox Q2 2025 Financial Results & Investor Presentation
Teaching material agriculture food technology
Approach and Philosophy of On baking technology
MIND Revenue Release Quarter 2 2025 Press Release
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
Per capita expenditure prediction using model stacking based on satellite ima...
Review of recent advances in non-invasive hemoglobin estimation
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
Reach Out and Touch Someone: Haptics and Empathic Computing
Network Security Unit 5.pdf for BCA BBA.
Big Data Technologies - Introduction.pptx
Programs and apps: productivity, graphics, security and other tools
gpt5_lecture_notes_comprehensive_20250812015547.pdf
sap open course for s4hana steps from ECC to s4
cuic standard and advanced reporting.pdf
A Presentation on Artificial Intelligence
Chapter 3 Spatial Domain Image Processing.pdf

Database , 17 Web

  • 1. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/1 Outline • Introduction • Background • Distributed Database Design • Database Integration • Semantic Data Control • Distributed Query Processing • Multimedia Query Processing • Distributed Transaction Management • Data Replication • Parallel Database Systems • Distributed Object DBMS • Peer-to-Peer Data Management • Web Data Management ➡ Web Models ➡ Web Search and Querying ➡ Distributed XML Processing • Current Issues
  • 2. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/2 Web Overview • Publicly indexable web: ➡ More than 25 billion static HTML pages. ➡ Over 53 billion pages in dynamic web • Deep web (hidden web) ➡ Over 500 billion documents • Most Internet users gain access to the web using search engines. • Very dynamic ➡ 23% of web pages change daily. ➡ 40% of commercial pages change daily. • Static versus dynamic web pages • Publicly indexable web (PIW) versus hidden web (or deep web)
  • 3. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/3 Properties of Web Data • Lack of a schema ➡ Data is at best “semi-structured” ➡ Missing data, additional attributes, “similar” data but not identical • Volatility ➡ Changes frequently ➡ May conform to one schema now, but not later • Scale ➡ Does it make sense to talk about a schema for Web? ➡ How do you capture “everything”? • Querying difficulty ➡ What is the user language? ➡ What are the primitives? ➡ Aren’t search engines or metasearch engines sufficient?
  • 4. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/4 Web Graph • Nodes: pages; edges: hyperlinks • Properties ➡ Volatile ➡ Sparse ➡ Self-organizing ➡ Small-world network ➡ Power law network
  • 5. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/5 Structure of the Web
  • 6. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/6 Web Data Modeling • Can’t depend on a strict schema to structure the data • Data are self-descriptive {name: {first:“Tamer”, last: “Ozsu”}, institution: “University of Waterloo”, salary: 300000} • Usually represented as an edge-labeled graph ➡ XML can also be modeled this way name institution salary first last “Tamer” “Ozsu” “UW” 300000
  • 7. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/7 Search Engine Architecture
  • 8. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/8 Web Crawling • What is a crawler? • Crawlers cannot crawl the whole Web. It should try to visit the “most important” pages first. • Importance metrics: ➡ Measure the importance of a Web page ➡ Ranking ✦ Static ✦ Dynamic • Ordering metric ➡ How to choose the next page to crawl
  • 9. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/9 Static Ranking - PageRank • Quality of a page determined by the number of incoming links and the importance of the pages of those links • PageRank of page pi: ➡ Bpi : backlink pages of pi (i.e., pages that point to pi)
  • 10. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/10 Ordering Metrics • Breadth-first search ➡ Visit URLs in the order they were discovered • Random ➡ Randomly choose one of the unvisited pages in the queue • Incorporate importance metric ➡ Model a random surfer: when on page P, choose one of the URLs on that page with equal probability d or jump to a random page with probability (1-d)
  • 11. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/11 Web Crawler Types • Many Web pages change frequently, so the crawler has to revisit already crawled pages  incremental crawlers • Some search engines specialize in searching pages belonging to a particular topic  focused crawlers • Search engines use multiple crawlers sitting on different machines and running in parallel. It is important to coordinate these parallel crawlers to prevent overlapping  parallel crawlers
  • 12. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/12 Indexing • Structure index ➡ Link structure • Text index ➡ Indexing the content ➡ Suffix arrays, inverted index, signature files ➡ Inverted index most common • Difficulties of inverted index ➡ The huge size of the Web ➡ The rapid change makes it hard to maintain ➡ Storage vs. performance efficiency
  • 13. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/13 Web Querying • Why Web Querying? ➡ It is not always easy to express information requests using keywords. ➡ Search engines do not make use of Web topology and document structure in queries. • Early Web Query Approaches ➡ Structured (Similar to DBMSs): Data model + Query Language ➡ Semi-structured: e.g. Object Exchange Model (OEM)
  • 14. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/14 Web Querying • Question Answering (QA) Systems ➡ Finding answers to natural language questions, e.g. What is Computer? ➡ Analyze the question and try to guess what type of information that is required. ➡ Not only locate relevant documents but also extract answers from them.
  • 15. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/15 Approaches to Web Querying • Search engines and metasearchers ➡ Keyword-based ➡ Category-based • Semistructured data querying • Special Web query languages • Question-Answering
  • 16. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/16 Semistructured Data Querying • Basic principle: Consider Web as a collection of semistructured data and use those techniques • Uses an edge-labeled graph model of data • Example systems & languages: ➡ Lore/Lorel ➡ UnQL ➡ StruQL
  • 17. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/17 OEM Model
  • 18. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/18 Lorel Example Find the titles of documents written by Patrick Valduriez SELECT D.title FROM bib.doc D WHERE bib.doc(.authors)?.author = “Patrick Valduriez” Find the authors of all books whose price is under $100 SELECT D(.authors)?.author FROM bib.doc D WHERE D.what = “Books” AND D.price < 100
  • 19. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/19 Evaluation • Advantages ➡ Simple and flexible ➡ Fits the natural link structure of Web pages • Disadvantages ➡ Data model too simple (no record construct or ordered lists) ➡ Graph can become very complicated ✦ Aggregation and typing combined ✦ DataGuides ➡ No differentiation between connection between documents and subpart relationships
  • 20. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/20 DataGuide
  • 21. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/21 Web Query Languages • Basic principle: Take into account the documents’ content and internal structure as well as external links • The graph structures are more complex • First generation ➡ Model the web as interconnected collection of atomic objects ➡ WebSQL ➡ W3QS ➡ WebLog • Second generation ➡ Model the web as a linked collection of structured objects ➡ WebOQL ➡ StruQL
  • 22. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/22 WebSQL Examples • Simple search for all documents about “hypertext” SELECT D.URL, D.TITLE FROM DOCUMENT D SUCH THAT D MENTIONS “hypertext” WHERE D.TYPE = “text/html” • Find all links to applets from documents about “Java” SELECT A.LABEL, A.HREF FROM DOCUMENT D SUCH THAT D MENTIONS “Java” ANCHOR A SUCH THAT BASE = X WHERE A.LABEL = “applet” Demonstrates two scoping methods and a search for links.
  • 23. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/23 WebSQL Examples (cont’d) • Find documents that have string “database” in their title that are reachable from the ACM Digital Library home page through paths of length ≤ 2 containing only local links. SELECT D.URL, D.TITLE FROM DOCUMENT D SUCH THAT ”http://www.acm.org/dl”=|->|->-> D WHERE D.TITLE CONTAINS “database” Demonstrates the use of different link types. • Find documents mentioning “Computer Science” and all documents that are linked to them through paths of length ≤ 2 containing only local links SELECT D1.URL, D1.TITLE, D2.URL, D2.TITLE FROM DOCUMENT D1 SUCH THAT D1 MENTIONS “Computer Science” DOCUMENT D2 SUCH THAT D1=|->|->-> D2 Demonstrates combination of content and structure specification in a query.
  • 24. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/24 WebOQL ➡ Prime ➡ Peek ➡ Hang ➡ Concatenate ➡ Head ➡ Tail ➡ String pattern match (~)
  • 25. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/25 WebOQL Examples Find the titles and abstracts of all documents authored by “Ozsu” SELECT [y.title, y‟.URL] FROM x IN dbDocuments, y in x‟ WHERE y.authors ~ “Ozsu”
  • 26. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/26 Evaluation • Advantages ➡ More powerful data model - Hypertree ✦ Ordered edge-labeled tree ✦ Internal and external arcs ➡ Language can exploit different arc types (structure of the Web pages can be accessed) ➡ Languages can construct new complex structures. • Disadvantages ➡ You still need to know the graph structure ➡ Complexity issue
  • 27. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/27 Question-Answer Approach • Basic principle: Web pages that could contain the answer to the user query are retrieved and the answer extracted from them. • NLP and information extraction techniques • Used within IR in a closed corpus; extensions to Web • Examples ➡ QASM ➡ Ask Jeeves ➡ Mulder ➡ WebQA
  • 28. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/28 QA Systems • Analyze and classify the question, depending on the expected answer type. • Using IR techniques, retrieve documents which are expected to contain the answer to the question. • Analyze the retrieved documents and decide on the answer.
  • 29. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/29 Question-Answer System
  • 30. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/30 Searching The Hidden Web • Publicly indexable web (PIW) vs. hidden web • Why is Hidden Web important? ➡ Size: huge amount of data ➡ Data quality • Challenges: ➡ Ordinary crawlers cannot be used. ➡ The data in hidden databases can only be accessed through a search interface. ➡ Usually, the underlying structure of the database is unknown.
  • 31. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/31 Searching The Hidden Web • Crawling the Hidden Web ➡ Submit queries to the search interface of the database ✦ By analyzing the search interface, trying to fill in the fields for all possible values from a repository ✦ By using agents that find search forms, learn to fill them, and retrieve the result pages ➡ Analyze the returned result pages ✦ Determine whether they contain results or not ✦ Use templates to extract information
  • 32. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/32 Searching The Hidden Web • Metasearching ➡ Database selection – Query Translation – Result Merging ➡ Database selection is based on Content Summaries. ➡ Content Summary Extraction: ✦ RS-Ord and RS-Lrd ✦ Focused Probing with Database Categorization
  • 33. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/33 Searching The Hidden Web • Metasearching ➡ Database Selection: ✦ Find the best databases to evaluate a given query. ✦ bGlOSS ✦ Selection from categorized databases
  • 34. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/34 Distributed XML • XML increasingly used for encoding web data  data representation language ➡ Web 2.0 • Data exchange language ➡ Web services • Used to encode or annotate non-web semistructured or unstructured data • Data repository sizes growing  use distribution
  • 35. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/35 XML Overview • Similarities to semistructured models discussed earlier • Data is divided into pieces called elements • Elements can be nested, but not overlapped ➡ Nesting represents hierarchical relationships between the elements • Elements have attributes • Elements can also have relationships to other elements (ID-IDREF) • Can be represented as a graph, but usually simplified as a tree ➡ Root element ➡ Zero or more child elements representing nested subelements ➡ Document order over elements ➡ Attributes are also shown as nodes
  • 36. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/36 XML Schema • Can be represented by a Document Type Definition (DTD) or XMLSchema • Simpler definition using a schema graph: 5-tuple , , s, m, ➡ is an alphabet of XML document node types ➡ × is a set of edges between node types ✦ e=( 1, 2) ∈ denotes item of type 1 may contain an item of type 2 ➡ s: {ONCE, OPT, MULT} ✦ ONCE: item of type 1 must contain exactly one item of type 2 ✦ OPT: item of type 1 may or may not contain an item of type 2 ✦ MULT: item of type 1 may contain multiple items of type 2 ➡ m: {string} ✦ m( ) denotes the domain of text content of an item of type ➡ is the root node type
  • 37. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/37 Example
  • 38. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/38 Path Expressions • A list of steps • Each step consists of ➡ An axis (13 of them) ➡ A name test ➡ Zero or more qualifiers ➡ Last step is called a return step • Syntax ➡ Path::= Step(“/”Step)* ➡ Step::= axis”::”NameTest(Qualifier)* ➡ NameTest::= ElementName|AttributeName|”*” ➡ Qualifier::=“[”Expre”]” ➡ Expr::=Path(Comp Atomic)? ➡ Comp:==“=”|“>”|“<”|“>=”|“<=”|“!=” ➡ Atomic::==“`”String”’” child (/) descendent descendent-or-self (//) parent attribute (/@) self (.) ancestor ancestor-or-self following-sibling following preceding-sibling preceding namespace
  • 39. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/39 Path Expression Example • /author[.//last = “Valduriez”]//book[price < 100] • Name constraints ➡ Correspond to name tests • Structural constraints ➡ Correspond to axes • Value constraints ➡ Correspond to value comparisons • Query pattern tree (QTP)
  • 40. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/40 XQuery • FLWOR expression ➡ “for”, “let”, “where”, “order by”, “return” clauses ➡ Each clause can reference path expressions or other FLWOR expressions ➡ Similar to SQL select-from-where-aggregate but operating on a list of XML document tree nodes • Example: Return a list of books with their title and price ordered by their attribute names let $col := collection(“bib”) for $author in $col/author order by $author.name for $b in $author/pubs/book let $title := $b/title let $price := $b/price return $title, $price
  • 41. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/41 XML Query Processing Techniques • Large object (LOB) approach ➡ Store original XML documents as-is in a LOB column ➡ Similar to storing documents in a file system ➡ Simple to implement and provides byte-level fidelity ➡ Slow in processing queries due to XML parsing at query execution time • Extended relational approach ➡ Shred XML documents into object-relational tables and columns and stored in relational or object-relational systems ➡ If properly designed, can perform very well (uses well established relational technology) ➡ Insertion, fragment extraction, structural update, and document reconstruction require considerable effort • Native approach ➡ Use a tree-structured data model and introduce operators that are optimized for tree navigation, insertion, deletion, and update ➡ Usually well balanced tradeoff among criteria ➡ Requires specialized processing and optimization techniques
  • 42. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/42 Path Expression Evaluation • Navigational approach ➡ Built on top of native XML storage systems ➡ Match the QTP by traversing the XML document tree ➡ Query-driven: each location step is translated into an algebraic operator which performs the navigation ➡ Data-driven: builds an automaton for a path expression and executes the automaton by navigating the XML document tree • Join-based approach ➡ Usually based on extended relational storage systems ➡ Each location step is associated with an input list of elements whose names match with the name test of the step ➡ Two lists of adjacent location steps are joined based on their structural relationship (structural join operation) • Evaluation ➡ Join-based efficient in evaluating dependent axes, but not as efficient as navigational in queries that have only child axes
  • 43. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/43 • Ad hoc fragmentation ➡ No explicit, schema-based fragmentation specification ➡ Arbitrarily cut edges in XML document graphs ➡ Example: Active XML ✦ Static part of document: XML data ✦ Dynamic part of document: function call to web services • Original document <author> <name>J. Doe</name> … <call fun=“getPub(„J. Doe‟)” /> </author> • After service call <author> <name>J. Doe</name> … <pubs> <book> … </book> … </pubs> </author> Fragmenting XML Data
  • 44. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/44 Fragmenting XML Data (cont’d) • Structure-based fragmentation ➡ Fragmentation is based on characteristics of schema or data ➡ Similar to relational fragmentation • Horizontal fragmentation ➡ Based on selection ➡ Specified by set of predicates on data • Vertical fragmentation ➡ Based on projection ➡ Specified by fragmenting the set of node types in the schema • Hybrid fragmentation
  • 45. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/45 Horizontal Fragmentation Let D = {d1, d2, … , dn} be a collection of document trees such that each di ∈ D follows the same schema. Define a set of predicates P = {p0, p1, …, pl−1} such that d ∈ D: unique pi ∈ P where pi(d). Then F = {{d ∈ D| pi(d)} | pi ∈ P} is a horizontal fragmentation of D. • Fragmentation tree patterns (FTP)
  • 46. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/46 Horizontal Fragmentation Example – Instance
  • 47. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/47 Vertical Fragmentation Let 〈 , , s, m, 〉 be a schema graph. Then we can define a vertical fragmentation function :  F where F is a partitioning of . • Fragment that has the root element is called the root fragment • Child fragment and parent fragment are defined
  • 48. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/48 Vertical Fragmentation Example – Schema
  • 49. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/49 Vertical Fragmentation Example – Instance
  • 50. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/50 Proxy Nodes
  • 51. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/51 Optimizing Distributed XML • Data shipping versus query shipping ✦ Data shipping moves data from where it is to where the query is posed and processes it there ✦ XQuery built-in functionality: fn:doc(URI) ✦ Simple to implement ✦ Provides only inter-query parallelism (no intra-query parallelism) ✦ Requires sufficient storage space at each site to hold data ✦ Large data moving overhead ➡ Query shipping (or function shipping) decomposes the query such that each subquery is evaluated at a site where the data reside ✦ Better parallelization ✦ Difficult in the context of XML – both the function and its parameters need to be shipped ✦ Packaging required to implement call-by-value semantics ✓ Serialization of the subtree rooted at the parameter node is packaged and shipped
  • 52. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/52 Data Shipping in XML • Difficulties with packaging: ➡ There may be some axes (e.g., parent, preceding-sibling) that are now “downward” from the parameter node requiring access to data that may not be available in the subtree of the parameter node. ➡ Node identity problems: Two identical nodes passed as parameters and returned as results are represented by call-by-value semantics as two different nodes. ➡ Document order of nodes – serialization may not obey these ➡ Interaction between different subqueries that access the same document on a given peer
  • 53. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/53 Query Shipping Approach • Partial function evaluation ➡ Given f(x,y), partial evaluation would compute f on one of the inputs and generate a partial answer, which would be another function f’ that is only dependent on the second input. ➡ Query is considered a function and data fragments its inputs. ➡ Processing outline: ✦ Phase 1: ✓ Coordinating site assigns the query to those sites that hold a relevant fragment. ✓ Each site evaluates the query (in parallel) on its local data. ✓ For some data nodes, the value of each query qualifier is known; for others, the value of some qualifiers is a Boolean formula whose value is not yet fully determined ✦ Phase 2: The selection part of the query is (partially) evaluated  for each node of each fragment, either it is part of the query’s answer, or it is not even a candidate to be part of the answer ✦ Phase 3: Check candidate nodes again to determine which ones are indeed in the result and send them to the coordinator
  • 54. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/54 Query Shipping Approach (cont’d) • XRPC ➡ Introduce a remote procedure call functionality through execute at {Expr}{FunApp(ParamList)} where Expr is the explicit or computed URI of the peer where FunApp() is to be applied. ➡ Call-by-projection ✦ A runtime projection technique to minimize message sizes. ✦ Preserves node identities and structural properties of XML node parameters as well. ✦ A node parameter is analyzed to see how it is used by a remote function. ✦ Only those descendants of the node parameter that are actually used by the remote function are serialized into the request message. ✦ Nodes outside the subtree of nth node parameter are added to the request message if they are needed by the remote function.
  • 55. Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/55 Query Shipping Approach (cont’d) • Use the fragmentation specification and the query specification • The following considers vertical fragmentation GQTP Subqueries