Database , 17 Web

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.17/1
Outline
• Introduction
• Background
• Distributed Database Design
• Database Integration
• Semantic Data Control
• Distributed Query Processing
• Multimedia Query Processing
• Distributed Transaction Management
• Data Replication
• Parallel Database Systems
• Distributed Object DBMS
• Peer-to-Peer Data Management
• Web Data Management
➡ Web Models
➡ Web Search and Querying
➡ Distributed XML Processing
• Current Issues

Web Overview
• Publicly indexable web:
➡ More than 25 billion static HTML pages.
➡ Over 53 billion pages in dynamic web
• Deep web (hidden web)
➡ Over 500 billion documents
• Most Internet users gain access to the web using search engines.
• Very dynamic
➡ 23% of web pages change daily.
➡ 40% of commercial pages change daily.
• Static versus dynamic web pages
• Publicly indexable web (PIW) versus hidden web (or deep web)

Properties of Web Data
• Lack of a schema
➡ Data is at best “semi-structured”
➡ Missing data, additional attributes, “similar” data but not identical
• Volatility
➡ Changes frequently
➡ May conform to one schema now, but not later
• Scale
➡ Does it make sense to talk about a schema for Web?
➡ How do you capture “everything”?
• Querying difficulty
➡ What is the user language?
➡ What are the primitives?
➡ Aren’t search engines or metasearch engines sufficient?

Web Graph
• Nodes: pages; edges: hyperlinks
• Properties
➡ Volatile
➡ Sparse
➡ Self-organizing
➡ Small-world network
➡ Power law network

Structure of the Web

Web Data Modeling
• Can’t depend on a strict schema to structure the data
• Data are self-descriptive
{name: {first:“Tamer”, last: “Ozsu”}, institution:
“University of Waterloo”, salary: 300000}
• Usually represented as an edge-labeled graph
➡ XML can also be modeled this way
name
institution
salary
first last
“Tamer” “Ozsu”
“UW” 300000

Search Engine Architecture

Web Crawling
• What is a crawler?
• Crawlers cannot crawl the whole Web. It should try to visit the “most
important” pages first.
• Importance metrics:
➡ Measure the importance of a Web page
➡ Ranking
✦ Static
✦ Dynamic
• Ordering metric
➡ How to choose the next page to crawl

Static Ranking - PageRank
• Quality of a page determined by the number of incoming links and the
importance of the pages of those links
• PageRank of page pi:
➡ Bpi
: backlink pages of pi (i.e., pages that point to pi)

Ordering Metrics
• Breadth-first search
➡ Visit URLs in the order they were discovered
• Random
➡ Randomly choose one of the unvisited pages in the queue
• Incorporate importance metric
➡ Model a random surfer: when on page P, choose one of the URLs on that page
with equal probability d or jump to a random page with probability (1-d)

Web Crawler Types
• Many Web pages change frequently, so the crawler has to revisit already
crawled pages  incremental crawlers
• Some search engines specialize in searching pages belonging to a
particular topic  focused crawlers
• Search engines use multiple crawlers sitting on different machines and
running in parallel. It is important to coordinate these parallel crawlers to
prevent overlapping  parallel crawlers

Indexing
• Structure index
➡ Link structure
• Text index
➡ Indexing the content
➡ Suffix arrays, inverted index, signature files
➡ Inverted index most common
• Difficulties of inverted index
➡ The huge size of the Web
➡ The rapid change makes it hard to maintain
➡ Storage vs. performance efficiency

Web Querying
• Why Web Querying?
➡ It is not always easy to express information requests using keywords.
➡ Search engines do not make use of Web topology and document structure in
queries.
• Early Web Query Approaches
➡ Structured (Similar to DBMSs): Data model + Query Language
➡ Semi-structured: e.g. Object Exchange Model (OEM)

Web Querying
• Question Answering (QA) Systems
➡ Finding answers to natural language questions, e.g. What is Computer?
➡ Analyze the question and try to guess what type of information that is
required.
➡ Not only locate relevant documents but also extract answers from them.

Approaches to Web Querying
• Search engines and metasearchers
➡ Keyword-based
➡ Category-based
• Semistructured data querying
• Special Web query languages
• Question-Answering

Semistructured Data Querying
• Basic principle: Consider Web as a collection of semistructured data and
use those techniques
• Uses an edge-labeled graph model of data
• Example systems & languages:
➡ Lore/Lorel
➡ UnQL
➡ StruQL

OEM Model

Lorel Example
Find the titles of documents written by Patrick Valduriez
SELECT D.title
FROM bib.doc D
WHERE bib.doc(.authors)?.author = “Patrick Valduriez”
Find the authors of all books whose price is under $100
SELECT D(.authors)?.author
FROM bib.doc D
WHERE D.what = “Books” AND D.price < 100

Evaluation
• Advantages
➡ Simple and flexible
➡ Fits the natural link structure of Web pages
• Disadvantages
➡ Data model too simple (no record construct or ordered lists)
➡ Graph can become very complicated
✦ Aggregation and typing combined
✦ DataGuides
➡ No differentiation between connection between documents and subpart
relationships

DataGuide

Web Query Languages
• Basic principle: Take into account the documents’ content and internal
structure as well as external links
• The graph structures are more complex
• First generation
➡ Model the web as interconnected collection of atomic objects
➡ WebSQL
➡ W3QS
➡ WebLog
• Second generation
➡ Model the web as a linked collection of structured objects
➡ WebOQL
➡ StruQL

WebSQL Examples
• Simple search for all documents about “hypertext”
SELECT D.URL, D.TITLE
FROM DOCUMENT D
SUCH THAT D MENTIONS “hypertext”
WHERE D.TYPE = “text/html”
• Find all links to applets from documents about “Java”
SELECT A.LABEL, A.HREF
FROM DOCUMENT D
SUCH THAT D MENTIONS “Java”
ANCHOR A
SUCH THAT BASE = X
WHERE A.LABEL = “applet”
Demonstrates two scoping methods and a search for links.

WebSQL Examples (cont’d)
• Find documents that have string “database” in their title that are reachable
from the ACM Digital Library home page through paths of length ≤ 2
containing only local links.
SELECT D.URL, D.TITLE
FROM DOCUMENT D
SUCH THAT ”http://www.acm.org/dl”=|->|->-> D
WHERE D.TITLE CONTAINS “database”
Demonstrates the use of different link types.
• Find documents mentioning “Computer Science” and all documents that
are linked to them through paths of length ≤ 2 containing only local links
SELECT D1.URL, D1.TITLE, D2.URL, D2.TITLE
FROM DOCUMENT D1
SUCH THAT D1 MENTIONS “Computer Science”
DOCUMENT D2
SUCH THAT D1=|->|->-> D2
Demonstrates combination of content and structure specification in a query.

WebOQL
➡ Prime
➡ Peek
➡ Hang
➡ Concatenate
➡ Head
➡ Tail
➡ String pattern match
(~)

WebOQL Examples
Find the titles and abstracts of all documents authored by
“Ozsu”
SELECT [y.title, y‟.URL]
FROM x IN dbDocuments, y in x‟
WHERE y.authors ~ “Ozsu”

Evaluation
• Advantages
➡ More powerful data model - Hypertree
✦ Ordered edge-labeled tree
✦ Internal and external arcs
➡ Language can exploit different arc types (structure of the Web pages can be
accessed)
➡ Languages can construct new complex structures.
• Disadvantages
➡ You still need to know the graph structure
➡ Complexity issue

Question-Answer Approach
• Basic principle: Web pages that could contain the answer to the user query
are retrieved and the answer extracted from them.
• NLP and information extraction techniques
• Used within IR in a closed corpus; extensions to Web
• Examples
➡ QASM
➡ Ask Jeeves
➡ Mulder
➡ WebQA

QA Systems
• Analyze and classify the question, depending on the expected answer type.
• Using IR techniques, retrieve documents which are expected to contain the
answer to the question.
• Analyze the retrieved documents and decide on the answer.

Question-Answer System

Searching The Hidden Web
• Publicly indexable web (PIW) vs. hidden web
• Why is Hidden Web important?
➡ Size: huge amount of data
➡ Data quality
• Challenges:
➡ Ordinary crawlers cannot be used.
➡ The data in hidden databases can only be accessed through a search interface.
➡ Usually, the underlying structure of the database is unknown.

• Crawling the Hidden Web
➡ Submit queries to the search interface of the database
✦ By analyzing the search interface, trying to fill in the fields for all
possible values from a repository
✦ By using agents that find search forms, learn to fill them, and retrieve
the result pages
➡ Analyze the returned result pages
✦ Determine whether they contain results or not
✦ Use templates to extract information

• Metasearching
➡ Database selection – Query Translation – Result Merging
➡ Database selection is based on Content Summaries.
➡ Content Summary Extraction:
✦ RS-Ord and RS-Lrd
✦ Focused Probing with Database Categorization

• Metasearching
➡ Database Selection:
✦ Find the best databases to evaluate a given query.
✦ bGlOSS
✦ Selection from categorized databases

Distributed XML
• XML increasingly used for encoding web data  data representation
language
➡ Web 2.0
• Data exchange language
➡ Web services
• Used to encode or annotate non-web semistructured or unstructured data
• Data repository sizes growing  use distribution

XML Overview
• Similarities to semistructured models discussed earlier
• Data is divided into pieces called elements
• Elements can be nested, but not overlapped
➡ Nesting represents hierarchical relationships between the elements
• Elements have attributes
• Elements can also have relationships to other elements (ID-IDREF)
• Can be represented as a graph, but usually simplified as a tree
➡ Root element
➡ Zero or more child elements representing nested subelements
➡ Document order over elements
➡ Attributes are also shown as nodes

XML Schema
• Can be represented by a Document Type Definition (DTD) or XMLSchema
• Simpler definition using a schema graph: 5-tuple , , s, m,
➡ is an alphabet of XML document node types
➡ × is a set of edges between node types
✦ e=( 1, 2) ∈ denotes item of type 1 may contain an item of type 2
➡ s: {ONCE, OPT, MULT}
✦ ONCE: item of type 1 must contain exactly one item of type 2
✦ OPT: item of type 1 may or may not contain an item of type 2
✦ MULT: item of type 1 may contain multiple items of type 2
➡ m: {string}
✦ m( ) denotes the domain of text content of an item of type
➡ is the root node type

Example

Path Expressions
• A list of steps
• Each step consists of
➡ An axis (13 of them)
➡ A name test
➡ Zero or more qualifiers
➡ Last step is called a return step
• Syntax
➡ Path::= Step(“/”Step)*
➡ Step::= axis”::”NameTest(Qualifier)*
➡ NameTest::=
ElementName|AttributeName|”*”
➡ Qualifier::=“[”Expre”]”
➡ Expr::=Path(Comp Atomic)?
➡ Comp:==“=”|“>”|“<”|“>=”|“<=”|“!=”
➡ Atomic::==“`”String”’”
child (/)
descendent
descendent-or-self (//)
parent
attribute (/@)
self (.)
ancestor
ancestor-or-self
following-sibling
following
preceding-sibling
preceding
namespace

Path Expression Example
• /author[.//last = “Valduriez”]//book[price < 100]
• Name constraints
➡ Correspond to name tests
• Structural constraints
➡ Correspond to axes
• Value constraints
➡ Correspond to value comparisons
• Query pattern tree (QTP)

XQuery
• FLWOR expression
➡ “for”, “let”, “where”, “order by”, “return” clauses
➡ Each clause can reference path expressions or other FLWOR expressions
➡ Similar to SQL select-from-where-aggregate but operating on a list of XML
document tree nodes
• Example: Return a list of books with their title and price ordered by their
attribute names
let $col := collection(“bib”)
for $author in $col/author
order by $author.name
for $b in $author/pubs/book
let $title := $b/title
let $price := $b/price
return $title, $price

XML Query Processing
Techniques
• Large object (LOB) approach
➡ Store original XML documents as-is in a LOB column
➡ Similar to storing documents in a file system
➡ Simple to implement and provides byte-level fidelity
➡ Slow in processing queries due to XML parsing at query execution time
• Extended relational approach
➡ Shred XML documents into object-relational tables and columns and stored in
relational or object-relational systems
➡ If properly designed, can perform very well (uses well established relational
technology)
➡ Insertion, fragment extraction, structural update, and document
reconstruction require considerable effort
• Native approach
➡ Use a tree-structured data model and introduce operators that are optimized
for tree navigation, insertion, deletion, and update
➡ Usually well balanced tradeoff among criteria
➡ Requires specialized processing and optimization techniques

Path Expression Evaluation
• Navigational approach
➡ Built on top of native XML storage systems
➡ Match the QTP by traversing the XML document tree
➡ Query-driven: each location step is translated into an algebraic operator
which performs the navigation
➡ Data-driven: builds an automaton for a path expression and executes the
automaton by navigating the XML document tree
• Join-based approach
➡ Usually based on extended relational storage systems
➡ Each location step is associated with an input list of elements whose names
match with the name test of the step
➡ Two lists of adjacent location steps are joined based on their structural
relationship (structural join operation)
• Evaluation
➡ Join-based efficient in evaluating dependent axes, but not as efficient as
navigational in queries that have only child axes

• Ad hoc fragmentation
➡ No explicit, schema-based fragmentation specification
➡ Arbitrarily cut edges in XML document graphs
➡ Example: Active XML
✦ Static part of document: XML data
✦ Dynamic part of document: function call to web services
• Original document
<author>
<name>J. Doe</name>
…
<call fun=“getPub(„J.
Doe‟)” />
</author>
• After service call
<author>
<name>J. Doe</name>
…
<pubs>
<book> … </book>
…
</pubs>
</author>
Fragmenting XML Data

Fragmenting XML Data (cont’d)
• Structure-based fragmentation
➡ Fragmentation is based on characteristics of schema or data
➡ Similar to relational fragmentation
• Horizontal fragmentation
➡ Based on selection
➡ Specified by set of predicates on data
• Vertical fragmentation
➡ Based on projection
➡ Specified by fragmenting the set of node types in the schema
• Hybrid fragmentation

Horizontal Fragmentation
Let D = {d1, d2, … , dn} be a collection of document trees such that each di ∈ D
follows the same schema. Define a set of predicates P = {p0, p1, …, pl−1} such
that d ∈ D: unique pi ∈ P where pi(d). Then F = {{d ∈ D| pi(d)} | pi ∈ P} is a
horizontal fragmentation of D.
• Fragmentation tree patterns (FTP)

Horizontal Fragmentation
Example – Instance

Vertical Fragmentation
Let 〈 , , s, m, 〉 be a schema graph. Then we can define a vertical
fragmentation function :  F where F is a partitioning of .
• Fragment that has the root element is called the root fragment
• Child fragment and parent fragment are defined

Vertical Fragmentation Example
– Schema

Vertical Fragmentation Example
– Instance

Proxy Nodes

Optimizing Distributed XML
• Data shipping versus query shipping
✦ Data shipping moves data from where it is to where the query is posed and
processes it there
✦ XQuery built-in functionality: fn:doc(URI)
✦ Simple to implement
✦ Provides only inter-query parallelism (no intra-query parallelism)
✦ Requires sufficient storage space at each site to hold data
✦ Large data moving overhead
➡ Query shipping (or function shipping) decomposes the query such that each
subquery is evaluated at a site where the data reside
✦ Better parallelization
✦ Difficult in the context of XML – both the function and its parameters need to be
shipped
✦ Packaging required to implement call-by-value semantics
✓ Serialization of the subtree rooted at the parameter node is packaged and shipped

Data Shipping in XML
• Difficulties with packaging:
➡ There may be some axes (e.g., parent, preceding-sibling) that are now
“downward” from the parameter node requiring access to data that may not
be available in the subtree of the parameter node.
➡ Node identity problems: Two identical nodes passed as parameters and
returned as results are represented by call-by-value semantics as two different
nodes.
➡ Document order of nodes – serialization may not obey these
➡ Interaction between different subqueries that access the same document on a
given peer

Query Shipping Approach
• Partial function evaluation
➡ Given f(x,y), partial evaluation would compute f on one of the inputs and
generate a partial answer, which would be another function f’ that is only
dependent on the second input.
➡ Query is considered a function and data fragments its inputs.
➡ Processing outline:
✦ Phase 1:
✓ Coordinating site assigns the query to those sites that hold a relevant fragment.
✓ Each site evaluates the query (in parallel) on its local data.
✓ For some data nodes, the value of each query qualifier is known; for others, the value of some
qualifiers is a Boolean formula whose value is not yet fully determined
✦ Phase 2: The selection part of the query is (partially) evaluated  for each node
of each fragment, either it is part of the query’s answer, or it is not even a
candidate to be part of the answer
✦ Phase 3: Check candidate nodes again to determine which ones are indeed in the
result and send them to the coordinator

(cont’d)
• XRPC
➡ Introduce a remote procedure call functionality through
execute at {Expr}{FunApp(ParamList)}
where Expr is the explicit or computed URI of the peer where FunApp() is
to be applied.
➡ Call-by-projection
✦ A runtime projection technique to minimize message sizes.
✦ Preserves node identities and structural properties of XML node parameters as
well.
✦ A node parameter is analyzed to see how it is used by a remote function.
✦ Only those descendants of the node parameter that are actually used by the
remote function are serialized into the request message.
✦ Nodes outside the subtree of nth node parameter are added to the request
message if they are needed by the remote function.

(cont’d)
• Use the fragmentation specification and the query specification
• The following considers vertical fragmentation
GQTP Subqueries

Database , 17 Web

More Related Content

What's hot (19)

Viewers also liked (20)

Similar to Database , 17 Web (20)

More from Ali Usman (17)

Recently uploaded (20)

Database , 17 Web