Aggregate Estimation Over Dynamic 
Hidden Web Databases 
Presenter: Weimo Liu (The George Washington University) 
Joint work with Saravanan Thirumuruganathan (University 
of Texas at Arlington), Nan Zhang (The George Washington 
University), and Gautam Das (University of Texas at Arlington) 
Outline
• Background and Motivation
• REISSUE-ESTIMATOR
• RS-ESTIMATOR
• SYSTEM DESIGN
• Experimental Results
• Conclusion
Hidden Databases: Used Car Inventory
• Form-like search interface
• Returns only the top-k matching tuples
Search Queries vs Aggregate Queries
• Search queries
  • SELECT * FROM D WHERE a_c1 = v_c1 AND … AND a_cu = v_cu
  • e.g., list every 2006 Ford F-150 with 4WD and a 5.4L engine in Cargiant’s inventory
  • Answered by the hidden database, subject to the top-k restriction
• Aggregate queries
  • SELECT AGGR(*) FROM D WHERE a_c1 = v_c1 AND … AND a_cu = v_cu
  • e.g., how many vehicles in Cargiant’s inventory have MPG > 30?
  • Cannot be answered through the public web interface
[Diagram labels: search query, aggregate query, web interface, hidden database]
Challenges
• Prior work assumes a static hidden database. The simple approach to the dynamic case, re-running an existing “static” algorithm at fixed time intervals, has two problems:
  • The web interface limits the number of search queries each IP address may issue per day
  • Repeated executions waste many search queries
Outline of Technical Results
• Baseline
  • Repeated executions of an existing “static” algorithm [DJJ+10]
• Two algorithms
  • REISSUE-ESTIMATOR
    • Infers whether and how the search query answers received in the last round have changed in the current round.
  • RS-ESTIMATOR
    • Automatically maintains a sample of the database according to how the database changes.
Model of Dynamic Hidden Web Databases
• Hidden web database and query interface
  • A hidden database D has m attributes A_1, …, A_m. Let U_i be the domain of attribute A_i. For a tuple t ∈ D, t[A_i] ∈ U_i denotes the value of A_i for t.
  • A search query q has the form SELECT * FROM D WHERE A_i1 = u_i1 AND … AND A_is = u_is, where i1, …, is ∈ [1, m] and u_ij ∈ U_ij. Let Sel(q) ⊆ D be the set of tuples matching q.
• Dynamic hidden databases
  • In most of the paper, we consider a round-update model in which modifications occur at the beginning of each round.
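The model above can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions: the class and method names (`HiddenDatabase`, `search`, `start_round`) are hypothetical, the statuses "underflow", "valid", and "overflow" follow the terminology of the later slide titles, and the real interface ranks results with an unknown function, for which list order stands in here.

```python
from dataclasses import dataclass


@dataclass
class QueryResult:
    status: str   # "underflow", "valid", or "overflow"
    tuples: list  # at most k tuples are ever returned


class HiddenDatabase:
    """Toy hidden database behind a top-k conjunctive search interface."""

    def __init__(self, tuples, k):
        self.tuples = list(tuples)  # each tuple is a dict: attribute -> value
        self.k = k
        self.queries_issued = 0     # stands in for the per-IP query budget

    def search(self, predicates):
        """Answer SELECT * WHERE a1 = v1 AND ... for a dict of predicates."""
        self.queries_issued += 1
        matches = [t for t in self.tuples
                   if all(t[a] == v for a, v in predicates.items())]
        if not matches:
            return QueryResult("underflow", [])
        if len(matches) > self.k:
            # Assumption: list order stands in for the site's ranking function.
            return QueryResult("overflow", matches[:self.k])
        return QueryResult("valid", matches)

    def start_round(self, inserted=(), deleted=()):
        """Round-update model: all changes are applied when a round begins."""
        for t in deleted:
            self.tuples.remove(t)
        self.tuples.extend(inserted)
```

A caller only ever sees `QueryResult` objects, so an aggregate such as COUNT has to be estimated from a limited number of `search` calls rather than read off directly.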
Objectives of Aggregate Estimation
• In this paper, we consider two types of aggregate estimation tasks over a dynamic hidden database:
  • Single-round aggregates
    • Computed over the database in one round
    • AVG, COUNT, SUM
  • Trans-round aggregates
    • Computed across the current round and the previous round
    • e.g., |D_i| − |D_{i−1}|, the change in database size, where D_i is the database in round i
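To make the two objectives concrete, the sketch below computes the exact values that the estimators are meant to approximate, given full snapshots of two consecutive rounds. A real client never has such snapshots; the function names and the `price` attribute in the usage are hypothetical.

```python
def single_round_aggregates(snapshot, attr):
    """Exact COUNT, SUM and AVG of a numeric attribute over one round."""
    values = [t[attr] for t in snapshot]
    count = len(values)
    total = sum(values)
    return {"COUNT": count, "SUM": total,
            "AVG": total / count if count else None}


def size_change(previous, current):
    """Exact trans-round aggregate |D_i| - |D_{i-1}|."""
    return len(current) - len(previous)


# Hypothetical snapshots of rounds i-1 and i.
round_prev = [{"price": 10}, {"price": 20}]
round_curr = [{"price": 10}, {"price": 20}, {"price": 30}]
truth = single_round_aggregates(round_curr, "price")
delta = size_change(round_prev, round_curr)
```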
Query Reissuing for Multiple Rounds
The Initial Round
Valid to Overflow
Valid to Underflow (1)
Valid to Underflow (2)
Key Question: Reissue or Restart?
• Example 1 (no change)
  • The queries issued by REISSUE-ESTIMATOR are always a subset of those issued by RESTART-ESTIMATOR.
Key Question: Reissue or Restart?
• Example 2 (total change)
  • REISSUE-ESTIMATOR might end up performing worse than RESTART-ESTIMATOR.
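The trade-off in the two examples can be made concrete with a toy drill-down. The sketch below is a simplified illustration, not the algorithm from the talk. It assumes a drill-down in the style of [DJJ+10]: predicates are added one attribute at a time along a randomly drawn value sequence (called `signature` here) until the query stops overflowing, and the returned count is scaled by the product of the domain sizes along the path. It also assumes that a fully specified query never overflows. All function names are hypothetical.

```python
import math


def make_interface(table, k):
    """Toy top-k interface over `table` (a list of dicts) with a query counter."""
    stats = {"queries": 0}

    def search(predicates):
        stats["queries"] += 1
        matches = [t for t in table if all(t[a] == v for a, v in predicates)]
        if not matches:
            return "underflow", 0
        if len(matches) > k:
            return "overflow", k  # only k tuples come back
        return "valid", len(matches)

    return search, stats


def prefix_query(attrs, signature, depth):
    """Conjunctive query over the first `depth` attributes of the signature."""
    return list(zip(attrs[:depth], signature[:depth]))


def drill_down(search, attrs, signature):
    """Restart-style drill-down: walk from the root until no overflow."""
    for depth in range(len(attrs) + 1):
        status, n = search(prefix_query(attrs, signature, depth))
        if status != "overflow":
            return depth, n
    raise ValueError("a fully specified query still overflows")


def reissue(search, attrs, signature, old_depth):
    """Reissue-style update: start from last round's terminal query."""
    depth = old_depth
    status, n = search(prefix_query(attrs, signature, depth))
    while status == "overflow":  # valid -> overflow: drill further down
        if depth == len(attrs):
            raise ValueError("a fully specified query still overflows")
        depth += 1
        status, n = search(prefix_query(attrs, signature, depth))
    if depth > old_depth:
        return depth, n
    while depth > 0:  # still not overflowing: check whether the parent overflows
        parent_status, parent_n = search(prefix_query(attrs, signature, depth - 1))
        if parent_status == "overflow":
            break
        depth, n = depth - 1, parent_n  # roll up to the parent
    return depth, n


def count_estimate(depth, n, domain_sizes):
    """Scale the returned count by the inverse probability of the path."""
    return n * math.prod(domain_sizes[:depth])
```

In this sketch, an unchanged database costs `reissue` two queries when `old_depth >= 1` (the old terminal query and its parent), both of which `drill_down` with the same signature would also issue. When the data under the old terminal query changes heavily, `reissue` may have to walk several levels up or down from a starting point that is no longer useful, which is the situation Example 2 warns about.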
Algorithm REISSUE-ESTIMATOR
Problem of REISSUE-ESTIMATOR
• Example (no change)
  • One does not need to issue many queries before realizing that the database has changed little; the remaining query budget can then be reallocated to initiate new drill-downs.
• Reservoir sampling [V85]
  • How much the maintained sample should change depends on how much new data has been inserted into the database.
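For reference, classical reservoir sampling over an insert-only stream can be sketched as follows. This is the basic textbook variant (commonly called Algorithm R), not RS-ESTIMATOR itself, which has to work through a top-k search interface; the function name and parameters are illustrative.

```python
import random


def reservoir_sample(stream, size, rng=None):
    """Keep a uniform random sample of `size` items from a stream."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < size:
            # The first `size` items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Item number i + 1 enters with probability size / (i + 1),
            # replacing a uniformly chosen current member.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < size:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1000), 10, random.Random(0))
```

The replacement probability shrinks as more items arrive, so the number of changes made to the sample is governed by how many items are inserted, which is the property the slide refers to.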
Key Ideas of RS-ESTIMATOR
Algorithm RS-ESTIMATOR
Experimental Results
CONCLUSION AND FUTURE WORK
A study of estimating aggregates over dynamic hidden web databases
• Query reissuing
• Bootstrapping-based query-plan adjustment
Future Work
• How metadata such as COUNT can be used to guide the design of drill-downs in future rounds
• Given a workload of aggregate queries, how to minimize the total query cost of estimating all of them
• How to leverage both the keyword search and the form-like search interfaces provided by many web databases to further improve the performance of aggregate estimation
References
• [DJJ+10] Arjun Dasgupta, Xin Jin, Bradley Jewell, Nan Zhang, and Gautam Das. Unbiased Estimation of Size and Other Aggregates Over Hidden Web Databases. In SIGMOD, 2010.
• [V85] J. S. Vitter. Random Sampling with a Reservoir. ACM Trans. Math. Softw., 11(1):37–57, Mar. 1985.
THANK YOU 
