SlideShare a Scribd company logo
QUERY AWARE DETERMINIZATION OF UNCERTAIN
OBJECTS
ABSTRACT
This paper considers the problem of determinizing probabilistic data to enable such data
to be stored in legacy systems that accept only deterministic input. Probabilistic data may be
generated by automated data analysis/enrichment techniques such as entity resolution,
information extraction, and speech processing. The legacy system may correspond to pre-
existing web applications such as Flickr, Picasa, etc. The goal is to generate a deterministic
representation of probabilistic data that optimizes the quality of the end-application built on
deterministic data. We explore such a determinization problem in the context of two different
data processing tasks—triggers and selection queries. We show that approaches such as
thresholding or top-1 selection traditionally used for determinization lead to suboptimal
performance for such applications. Instead, we develop a query-aware strategy and show its
advantages over existing solutions through a comprehensive empirical evaluation over real and
synthetic datasets.
EXISTING SYSTEM:
Determinizing Probabilistic Data. While we are not aware of any prior work that directly
addresses the issue of determinizing probabilistic data as studied in this paper, the works that are
very related to ours are . They explore how to determinize answers to a query over a
probabilistic database. In contrast, we are interested in best deterministic representation of data
(and not that of a answer to a query) so as to continue to use existing end-applications that take
only deterministic input. The differences in the two problem settings lead to different challenges.
Authors in address a problem that chooses the set of uncertain objects to be cleaned, in order to
achieve the best improvement in the quality of query answers. However, their goal is to improve
quality of single query, while ours is to optimize quality of overall query workload. Also, they
focus on how to select the best set of objects and each selected object is cleaned by human
clarification, whereas we determinize all objects automatically. These differences essentially lead
to different optimization challenges. Another related area is MAP inference in graphical model,
which aims to find the assignment to each variable that jointly maximizes the probability defined
by the model. The determinization problem for the cost-based metric can be potentially viewed
as an instance of MAP inference problem. If we view the problem that way, the challenge
becomes that of developing fast and high-quality approximate algorithm for solving the
corresponding NP-hard problem. Section 3.3 exactly provides such algorithms, heavily
optimized and tuned to specifically our problem setting. Probabilistic Data Models. A variety
of advanced probabilistic data models have been proposed in the past. Our focus however was
determinizing probabilistic objects, such as image tags and speech output, for which the
probabilistic attribute model suffices. We note that determining probabilistic data stored in more
advanced probabilistic models such as And/Xor tree might also be interesting. Extending our
work to deal with data of such complexity remains an interesting future direction of work.
PROPOSED SYSTEM:
Overall, the main contributions of this paper are:
• We introduce the problem of determinizing probabilistic data. Given a workload of
triggers/queries, the main challenge is to find the deterministic representation of the data which
would optimize certain quality metrics of the answer to these triggers/queries.
• We propose a framework that solves the problem of determinization by minimizing the
expected cost of the answer to queries. We develop a branchand- bound algorithm that finds an
approximate near-optimal solution to the resulting NP-hard problem.
• We address the problem of determinizing a collection of objects to optimize set-based quality
metrics, such as F-measure. We develop an efficient algorithm that reaches near-optimal quality.
• We extend the solutions to handle a data model where mutual exclusion exists among tags. We
show that correlations among tags can be leveraged in our solutions to get better results. We also
demonstrate that our solutions are designed to handle various types of queries.
• We empirically demonstrate that the proposed algorithms are very efficient and reach high-
quality results that are very close to those of the optimal solution. We also demonstrate that they
are robust to small changes in the original query workload.
Module 1
Data Quality
Data quality refers to the level of quality of Data. There are many definitions of data quality but
data is generally considered high quality if, "they are fit for their intended uses
in operations, decision making and planning." (J. M. Juran). Alternatively, data is deemed of
high quality if it correctly represents the real-world construct to which it refers. Furthermore,
apart from these definitions, as data volume increases, the question of internal consistency within
data becomes significant, regardless of fitness for use for any particular external purpose. The
people's views on data quality can often be in disagreement, even when discussing the same set
of data used for the same purpose. A considerable amount of data quality research involves
investigating and describing various categories of desirable attributes (or dimensions) of data.
These lists commonly include accuracy, correctness, currency, completeness andrelevance.
Nearly 200 such terms have been identified and there is little agreement in their nature (are these
concepts, goals or criteria?), their definitions or measures (Wang et al., 1993). Software
engineers may recognize this as a similar problem to "ilities".
Data quality control is the process of controlling the usage of data with known quality
measurements for an application or a process. This process is usually done after a Data Quality
Assurance (QA) process, which consists of discovery of data inconsistency and correction.
Data QA processes provides following information to Data Quality Control (QC):
 Severity of inconsistency
 Incompleteness
 Accuracy
 Precision
 Missing / Unknown
The Data QC process uses the information from the QA process to decide to use the data for
analysis or in an application or business process. For example, if a Data QC process finds that
the data contains too many errors or inconsistencies, then it prevents that data from being used
for its intended process which could cause disruption. For example, providing invalid
measurements from several sensors to the automatic pilot feature on an aircraft could cause it to
crash. Thus, establishing data QC process provides the protection of usage of data control and
establishes safe information usage.
Module 2
Determinization Problem
Having defined the notation, we now can define the determinization problem:
Definition 1 (Determinization). Given a set of uncertain objects O = {O1,O2, . . . ,O|O|}, a
query workload Q and a quality metric F, the goal of the deteriminization problem is for each
object O ∈ O to select from WO a set of tags AO ⊆ WO as the deterministic representation of O,
such that F(O,Q) is optimized.
The ground truth tags G are not associated with uncertain objects and thus not known to any
approach to the determinization problem. Therefore, such algorithms cannot directly measure the
quality F(O,Q) during their execution. Thus, in the following section, we will present our
solution that is based on maximizing the expected quality measure E(F(O,Q)). Before we
describe our solution to the determinization problem, we note that a baseline algorithm for
determinization is a thresholding (threshold-cut) approach. The thresholding approach employs a
pre-defined threshold τ . For each object O it composes its answer set A by choosing wi tags from
W such that P(wi) ≥ τ . The advantage of this approach is that it is computationally efficient and
potentially can be tuned to a particular dataset O and workload Q by changing τ . The main
drawback of this approach is that it is unaware of the query workload (―query-unaware") and
thus does not necessarily optimize the given quality metrics, which leads to lower quality.
Module 3
Branch and Bound Algorithm
Instead of performing a brute-force enumeration, we can employ a faster branch and bound (BB)
technique. The approach discovers answer sets in a greedy fashion so that answer sets with lower
cost tend to be discovered first.
Outline of the BB algorithm
The algorithm uses a branch and bound strategy to explore a search tree. Each node v in the tree
corresponds to a partial tag selection where decisions of whether to include certain tags in the
answer set or not have been made. We capture the partial selection of tags using the concept of
sequence. The performance of the algorithm depends upon the choice of the node (amongst all
the non-leaf nodes) to branch, the computation of upper and lower bounds of a node, and the
choice of the tag used to expand the node.We next describe these components of the algorithm in
detail.
Module 4
Node Selection
The algorithm maintains a priority queue H for picking a node v that contains the most promising
sequence Sv to continue with. Among the nodes in the priority queue, which one should we
choose to branch next? Let us consider sequence Sv that corresponds to a node v ∈ H. For v we
define Av as the set of answer sequences that correspond to the leaf nodes derivable from node v
via the branching procedure described above. That is, Av corresponds to the leaf nodes of the
subtree rooted at node v. For example, or node v2 of the tree in Fig. 3, Av2 = {Sv8 , Sv9 , Sv10 ,
Sv11 }. Then for node v let mv = min A∈Av E(cost(A,G,Q)) be the value of the minimum
expected cost among these sequences. Notice that if we knew the exact value of mv, it would
have been an ideal key for the priority queue H, since it would lead to the quickest way to find
the best sequences. The problem is that the exact value of mv is unknown when v is branched,
since the subtree rooted at v is not yet constructed at that moment. Even though mv is unknown it
is possible to quickly determine good lower and upper bounds on its value _v ≤ mv ≤ hv, without
comparing scores of each sequence in Av. For any node v, if Sv is an answer sequence then the
bounds are equal to the cost of the sequence itself, otherwise the bounds are computed as
explained next.
Module 5
Query-Level Optimization
The performance of the BB algorithm can be significantly improved further by employing query-
level optimizations. For a given sequence Sv of node v, we might be able to exactly determine
the expected cost of certain queries (e.g., Cases 1 and 2 above), even if Sv is not an answer
sequence. In other words, for these queries we can get their cost without expanding Sv into the
corresponding answer sequences A ∈ Av. We refer to such cost as fixed cost of v, and to such
queries as fixed or decided queries with respect to O. Thus, each node v is also associated with
the set of undecided queries Qv, its fixed cost ˆcv, as well as its lower and upper bounds ˆlv and
ˆhv on non-fixed cost. Note that lv = ˆcv +ˆlv and hv = ˆcv + ˆhv. Fixed cost and fixed queries give
us an opportunity to make the BB algorithm even more efficient by reducing the number of
queries involved in computing the expected cost. When the algorithm performs a branching at a
node, it creates two new nodes vyes and vno, by making yes/no decision for one undecided tag w.
Since one additional tag is now decided for each new node v, it is possible that expected cost for
some query Q ∈ Qv can be computed (Case 1 and 2). If so, Q becomes a fixed query for the new
node and will not be considered in future computations derived from that node. Based on the
relation between Sv and Q, there are three different cases for updating ˆcv, ˆlv and ˆhv
CONCLUSION:
In this paper we have considered the problem of determinizing uncertain objects to enable such
data to be stored in pre-existing systems, such as Flickr, that take only deterministic input. The
goal is to generate deterministic representations that optimize the quality of answers to
queries/triggers that execute over the deterministic data representation. We have proposed
efficient determinization algorithms that are orders of magnitude faster than the enumeration
based optimal solution but achieves almost the same quality as the optimal solution. As future
work, we plan to explore determinization techniques in the context of applications, wherein users
are also interested in retrieving objects in a ranked order.
REFERENCES
[1] D. V. Kalashnikov, S. Mehrotra, J. Xu, and N. Venkatasubramanian, ―A semantics-based
approach for speech annotation of images,‖ IEEE Trans. Knowl. Data Eng., vol. 23, no. 9, pp.
1373–1387, Sept. 2011.
[2] J. Li and J. Wang, ―Automatic linguistic indexing of pictures by a statistical modeling
approach,‖ IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 9, pp. 1075–1088, Sept. 2003.
[3] C. Wangand, F. Jing, L. Zhang, and H. Zhang, ―Image annotation refinement using random
walk with restarts,‖ in Proc. 14th Annu. ACM Int. Conf. Multimedia, New York, NY, USA,
2006.
[4] B. Minescu, G. Damnati, F. Bechet, and R. de Mori, ―Conditional use of word lattices,
confusion networks and 1-best string hypotheses in a sequential interpretation strategy,‖ in Proc.
ICASSP, 2007.
[5] R. Nuray-Turan, D. V. Kalashnikov, S. Mehrotra, and Y. Yu, ―Attribute and object selection
queries on objects with probabilistic attributes,‖ ACM Trans. Database Syst., vol. 37, no. 1,
Article 3, Feb. 2012.
[6] J. Li and A. Deshpande, ―Consensus answers for queries over probabilistic databases,‖ in
Proc. 28th ACM SIGMOD-SIGACTSIGART Symp. PODS, New York, NY, USA, 2009.
[7] M. B. Ebarhimi and A. A. Ghorbani, ―A novel approach for frequent phrase mining in web
search engine query streams,‖ in Proc. 5th Annu. Conf. CNSR, Frederlcton, NB, Canada, 2007.
[8] S. Bhatia, D. Majumdar, and P. Mitra, ―Query suggestions in the absence of query logs,‖ in
Proc. 34th Int. ACM SIGIR, Beijing, China, 2011.
[9] C. Manning and H. Schutze, Foundations of Statistical Natural Language Processing,
Cambridge, MA, USA: MIT Press, 1999.
[10] D. V. Kalashnikov and S. Mehrotra, ―Domain-independent data cleaning via analysis of
entity-relationship graph,‖ ACM Trans. Database Syst., vol. 31, no. 2, pp. 716–767, Jun. 2006.
[11] K. Schnaitter, S. Abiteboul, T. Milo, and N. Polyzotis, ―On-line index selection for shifting
workloads,‖ in Proc. IEEE 23rd Int. Conf. Data Eng. Workshop, Istanbul, Turkey, 2007.
[12] P. Unterbrunner, G. Giannikis, G. Alonso, D. Fauser, and D. Kossmann, ―Predictable
performance for unpredictable workloads,‖ in Proc. VLDB, Lyon, France, 2009.
[13] R. Cheng, J. Chen, and X. Xie, ―Cleaning uncertain data with quality guarantees,‖ in Proc.
VLDB, Auckland, New Zealand, 2008.

More Related Content

DOCX
QUERY AWARE DETERMINIZATION OF UNCERTAIN OBJECTS
PDF
Analysis of Textual Data Classification with a Reddit Comments Dataset
PDF
Explanations in Data Systems
PDF
E1802023741
PDF
Survey on Various Classification Techniques in Data Mining
PDF
A Decision Tree Based Classifier for Classification & Prediction of Diseases
PDF
Are we really including all relevant evidence
PPTX
Data mining: Classification and prediction
QUERY AWARE DETERMINIZATION OF UNCERTAIN OBJECTS
Analysis of Textual Data Classification with a Reddit Comments Dataset
Explanations in Data Systems
E1802023741
Survey on Various Classification Techniques in Data Mining
A Decision Tree Based Classifier for Classification & Prediction of Diseases
Are we really including all relevant evidence
Data mining: Classification and prediction

What's hot (20)

PDF
Enhancing Keyword Query Results Over Database for Improving User Satisfaction
PPTX
04 Classification in Data Mining
PDF
An integrated mechanism for feature selection
PDF
sigir2020
PDF
Incremental learning from unbalanced data with concept class, concept drift a...
PPT
Classification and prediction
PDF
Hypothesis on Different Data Mining Algorithms
PPT
Strategies for Metabolomics Data Analysis
PDF
Classification By Clustering Based On Adjusted Cluster
PPT
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
PDF
IRJET- Study and Evaluation of Classification Algorithms in Data Mining
PDF
Relationships Among Classical Test Theory and Item Response Theory Frameworks...
PPT
Data.Mining.C.6(II).classification and prediction
PDF
Study and Analysis of K-Means Clustering Algorithm Using Rapidminer
PDF
My experiment
PPTX
Deep Attention Model for Triage of Emergency Department Patients - Djordje Gl...
PDF
An enhanced pairwise search approach for generating
PDF
Vinayaka : A Semi-Supervised Projected Clustering Method Using Differential E...
PDF
When deep learners change their mind learning dynamics for active learning
Enhancing Keyword Query Results Over Database for Improving User Satisfaction
04 Classification in Data Mining
An integrated mechanism for feature selection
sigir2020
Incremental learning from unbalanced data with concept class, concept drift a...
Classification and prediction
Hypothesis on Different Data Mining Algorithms
Strategies for Metabolomics Data Analysis
Classification By Clustering Based On Adjusted Cluster
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
IRJET- Study and Evaluation of Classification Algorithms in Data Mining
Relationships Among Classical Test Theory and Item Response Theory Frameworks...
Data.Mining.C.6(II).classification and prediction
Study and Analysis of K-Means Clustering Algorithm Using Rapidminer
My experiment
Deep Attention Model for Triage of Emergency Department Patients - Djordje Gl...
An enhanced pairwise search approach for generating
Vinayaka : A Semi-Supervised Projected Clustering Method Using Differential E...
When deep learners change their mind learning dynamics for active learning
Ad

Viewers also liked (14)

PDF
Protecting location privacy in sensor networks against a global eavesdropper
PDF
Webservicerecommendationviaexploitinglocationandqosinformation
DOCX
PANDA: PUBLIC AUDITING FOR SHARED DATA WITH EFFICIENT USER REVOCATION IN THE ...
DOCX
Mobile data gathering with load balanced clustering and dual data uploading i...
DOCX
Malware propagation in large scale networks
DOCX
Detection and rectification of distorted fingerprints
DOC
A secure and dynamic multi keyword ranked search scheme over encrypted cloud ...
DOCX
A scalable and reliable matching service for
PDF
Scalable and secure sharing of personal health
DOCX
Audit free cloud storage via deniable attribute-based encryption
DOCX
Tweet segmentation and its application to named entity recognition
DOCX
Control cloud data access privilege and anonymity with fully anonymous attrib...
PDF
IEEE Projects, Non-IEEE Projects, Data Mining, Cloud computing, Main Projects...
DOCX
Interpreting the public sentiment variations ons on twitter
Protecting location privacy in sensor networks against a global eavesdropper
Webservicerecommendationviaexploitinglocationandqosinformation
PANDA: PUBLIC AUDITING FOR SHARED DATA WITH EFFICIENT USER REVOCATION IN THE ...
Mobile data gathering with load balanced clustering and dual data uploading i...
Malware propagation in large scale networks
Detection and rectification of distorted fingerprints
A secure and dynamic multi keyword ranked search scheme over encrypted cloud ...
A scalable and reliable matching service for
Scalable and secure sharing of personal health
Audit free cloud storage via deniable attribute-based encryption
Tweet segmentation and its application to named entity recognition
Control cloud data access privilege and anonymity with fully anonymous attrib...
IEEE Projects, Non-IEEE Projects, Data Mining, Cloud computing, Main Projects...
Interpreting the public sentiment variations ons on twitter
Ad

Similar to Query aware determinization of uncertain (20)

PDF
Query Aware Determinization of Uncertain Objects
DOCX
IEEE Projects 2015 | Query aware determinization of uncertain objects
DOC
Query aware determinization of uncertain objects
DOCX
Query aware determinization of uncertain
PDF
Query aware determinization of uncertain objects
PDF
Query aware determinization of uncertain objects
PDF
Crowdsourcing Pareto-Optimal Object Finding by Pairwise Comparisons
PDF
[Slides] Crowdsourcing Pareto-Optimal Object Finding By Pairwise Comparisons
PDF
Efficient Query Processing Using Machine Learning
PPTX
Blinkdb
PPTX
Wrokflow programming and provenance query model
PDF
Query aware determinization of uncertain objects .
PDF
Understanding the Value of Software Engineering Technologies
PDF
OPTIMAL CHOICE: NEW MACHINE LEARNING PROBLEM AND ITS SOLUTION
PDF
QUALITY-AWARE SUBGRAPH MATCHING OVER INCONSISTENT PROBABILISTIC GRAPH DATABASES
PPT
tutorial.ppt
PDF
Workload-aware materialization for efficient variable elimination on Bayesian...
PDF
An Robust Outsourcing of Multi Party Dataset by Utilizing Super-Modularity an...
ODP
Evolutionary and Swarm Computing for scaling up the Semantic Web
Query Aware Determinization of Uncertain Objects
IEEE Projects 2015 | Query aware determinization of uncertain objects
Query aware determinization of uncertain objects
Query aware determinization of uncertain
Query aware determinization of uncertain objects
Query aware determinization of uncertain objects
Crowdsourcing Pareto-Optimal Object Finding by Pairwise Comparisons
[Slides] Crowdsourcing Pareto-Optimal Object Finding By Pairwise Comparisons
Efficient Query Processing Using Machine Learning
Blinkdb
Wrokflow programming and provenance query model
Query aware determinization of uncertain objects .
Understanding the Value of Software Engineering Technologies
OPTIMAL CHOICE: NEW MACHINE LEARNING PROBLEM AND ITS SOLUTION
QUALITY-AWARE SUBGRAPH MATCHING OVER INCONSISTENT PROBABILISTIC GRAPH DATABASES
tutorial.ppt
Workload-aware materialization for efficient variable elimination on Bayesian...
An Robust Outsourcing of Multi Party Dataset by Utilizing Super-Modularity an...
Evolutionary and Swarm Computing for scaling up the Semantic Web

More from Shakas Technologies (20)

DOCX
A Review on Deep-Learning-Based Cyberbullying Detection
DOCX
A Personal Privacy Data Protection Scheme for Encryption and Revocation of Hi...
DOCX
A Novel Framework for Credit Card.
DOCX
A Comparative Analysis of Sampling Techniques for Click-Through Rate Predicti...
DOCX
NS2 Final Year Project Titles 2023- 2024
DOCX
MATLAB Final Year IEEE Project Titles 2023-2024
DOCX
Latest Python IEEE Project Titles 2023-2024
DOCX
EMOTION RECOGNITION BY TEXTUAL TWEETS CLASSIFICATION USING VOTING CLASSIFIER ...
DOCX
CYBER THREAT INTELLIGENCE MINING FOR PROACTIVE CYBERSECURITY DEFENSE
DOCX
Detecting Mental Disorders in social Media through Emotional patterns-The cas...
DOCX
COMMERCE FAKE PRODUCT REVIEWS MONITORING AND DETECTION
DOCX
CO2 EMISSION RATING BY VEHICLES USING DATA SCIENCE
DOCX
Toward Effective Evaluation of Cyber Defense Threat Based Adversary Emulation...
DOCX
Optimizing Numerical Weather Prediction Model Performance Using Machine Learn...
DOCX
Nature-Based Prediction Model of Bug Reports Based on Ensemble Machine Learni...
DOCX
Multi-Class Stress Detection Through Heart Rate Variability A Deep Neural Net...
DOCX
Identifying Hot Topic Trends in Streaming Text Data Using News Sequential Evo...
DOCX
Fighting Money Laundering With Statistics and Machine Learning.docx
DOCX
Explainable Artificial Intelligence for Patient Safety A Review of Applicatio...
DOCX
Ensemble Deep Learning-Based Prediction of Fraudulent Cryptocurrency Transact...
A Review on Deep-Learning-Based Cyberbullying Detection
A Personal Privacy Data Protection Scheme for Encryption and Revocation of Hi...
A Novel Framework for Credit Card.
A Comparative Analysis of Sampling Techniques for Click-Through Rate Predicti...
NS2 Final Year Project Titles 2023- 2024
MATLAB Final Year IEEE Project Titles 2023-2024
Latest Python IEEE Project Titles 2023-2024
EMOTION RECOGNITION BY TEXTUAL TWEETS CLASSIFICATION USING VOTING CLASSIFIER ...
CYBER THREAT INTELLIGENCE MINING FOR PROACTIVE CYBERSECURITY DEFENSE
Detecting Mental Disorders in social Media through Emotional patterns-The cas...
COMMERCE FAKE PRODUCT REVIEWS MONITORING AND DETECTION
CO2 EMISSION RATING BY VEHICLES USING DATA SCIENCE
Toward Effective Evaluation of Cyber Defense Threat Based Adversary Emulation...
Optimizing Numerical Weather Prediction Model Performance Using Machine Learn...
Nature-Based Prediction Model of Bug Reports Based on Ensemble Machine Learni...
Multi-Class Stress Detection Through Heart Rate Variability A Deep Neural Net...
Identifying Hot Topic Trends in Streaming Text Data Using News Sequential Evo...
Fighting Money Laundering With Statistics and Machine Learning.docx
Explainable Artificial Intelligence for Patient Safety A Review of Applicatio...
Ensemble Deep Learning-Based Prediction of Fraudulent Cryptocurrency Transact...

Recently uploaded (20)

PDF
Module 4: Burden of Disease Tutorial Slides S2 2025
PDF
O7-L3 Supply Chain Operations - ICLT Program
PDF
STATICS OF THE RIGID BODIES Hibbelers.pdf
PDF
Sports Quiz easy sports quiz sports quiz
PPTX
Renaissance Architecture: A Journey from Faith to Humanism
PPTX
Final Presentation General Medicine 03-08-2024.pptx
PDF
Complications of Minimal Access Surgery at WLH
PPTX
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
PDF
2.FourierTransform-ShortQuestionswithAnswers.pdf
PPTX
Institutional Correction lecture only . . .
PPTX
PPH.pptx obstetrics and gynecology in nursing
PPTX
master seminar digital applications in india
PDF
Supply Chain Operations Speaking Notes -ICLT Program
PPTX
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
PDF
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
PDF
RMMM.pdf make it easy to upload and study
PDF
Basic Mud Logging Guide for educational purpose
PPTX
human mycosis Human fungal infections are called human mycosis..pptx
PDF
Pre independence Education in Inndia.pdf
PDF
Abdominal Access Techniques with Prof. Dr. R K Mishra
Module 4: Burden of Disease Tutorial Slides S2 2025
O7-L3 Supply Chain Operations - ICLT Program
STATICS OF THE RIGID BODIES Hibbelers.pdf
Sports Quiz easy sports quiz sports quiz
Renaissance Architecture: A Journey from Faith to Humanism
Final Presentation General Medicine 03-08-2024.pptx
Complications of Minimal Access Surgery at WLH
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
2.FourierTransform-ShortQuestionswithAnswers.pdf
Institutional Correction lecture only . . .
PPH.pptx obstetrics and gynecology in nursing
master seminar digital applications in india
Supply Chain Operations Speaking Notes -ICLT Program
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
RMMM.pdf make it easy to upload and study
Basic Mud Logging Guide for educational purpose
human mycosis Human fungal infections are called human mycosis..pptx
Pre independence Education in Inndia.pdf
Abdominal Access Techniques with Prof. Dr. R K Mishra

Query aware determinization of uncertain

  • 1. QUERY AWARE DETERMINIZATION OF UNCERTAIN OBJECTS ABSTRACT This paper considers the problem of determinizing probabilistic data to enable such data to be stored in legacy systems that accept only deterministic input. Probabilistic data may be generated by automated data analysis/enrichment techniques such as entity resolution, information extraction, and speech processing. The legacy system may correspond to pre- existing web applications such as Flickr, Picasa, etc. The goal is to generate a deterministic representation of probabilistic data that optimizes the quality of the end-application built on deterministic data. We explore such a determinization problem in the context of two different data processing tasks—triggers and selection queries. We show that approaches such as thresholding or top-1 selection traditionally used for determinization lead to suboptimal performance for such applications. Instead, we develop a query-aware strategy and show its advantages over existing solutions through a comprehensive empirical evaluation over real and synthetic datasets. EXISTING SYSTEM: Determinizing Probabilistic Data. While we are not aware of any prior work that directly addresses the issue of determinizing probabilistic data as studied in this paper, the works that are very related to ours are . They explore how to determinize answers to a query over a probabilistic database. In contrast, we are interested in best deterministic representation of data (and not that of a answer to a query) so as to continue to use existing end-applications that take only deterministic input. The differences in the two problem settings lead to different challenges. Authors in address a problem that chooses the set of uncertain objects to be cleaned, in order to achieve the best improvement in the quality of query answers. However, their goal is to improve quality of single query, while ours is to optimize quality of overall query workload. Also, they focus on how to select the best set of objects and each selected object is cleaned by human clarification, whereas we determinize all objects automatically. These differences essentially lead to different optimization challenges. Another related area is MAP inference in graphical model,
  • 2. which aims to find the assignment to each variable that jointly maximizes the probability defined by the model. The determinization problem for the cost-based metric can be potentially viewed as an instance of MAP inference problem. If we view the problem that way, the challenge becomes that of developing fast and high-quality approximate algorithm for solving the corresponding NP-hard problem. Section 3.3 exactly provides such algorithms, heavily optimized and tuned to specifically our problem setting. Probabilistic Data Models. A variety of advanced probabilistic data models have been proposed in the past. Our focus however was determinizing probabilistic objects, such as image tags and speech output, for which the probabilistic attribute model suffices. We note that determining probabilistic data stored in more advanced probabilistic models such as And/Xor tree might also be interesting. Extending our work to deal with data of such complexity remains an interesting future direction of work. PROPOSED SYSTEM: Overall, the main contributions of this paper are: • We introduce the problem of determinizing probabilistic data. Given a workload of triggers/queries, the main challenge is to find the deterministic representation of the data which would optimize certain quality metrics of the answer to these triggers/queries. • We propose a framework that solves the problem of determinization by minimizing the expected cost of the answer to queries. We develop a branchand- bound algorithm that finds an approximate near-optimal solution to the resulting NP-hard problem. • We address the problem of determinizing a collection of objects to optimize set-based quality metrics, such as F-measure. We develop an efficient algorithm that reaches near-optimal quality. • We extend the solutions to handle a data model where mutual exclusion exists among tags. We show that correlations among tags can be leveraged in our solutions to get better results. We also demonstrate that our solutions are designed to handle various types of queries. • We empirically demonstrate that the proposed algorithms are very efficient and reach high- quality results that are very close to those of the optimal solution. We also demonstrate that they are robust to small changes in the original query workload.
  • 3. Module 1 Data Quality Data quality refers to the level of quality of Data. There are many definitions of data quality but data is generally considered high quality if, "they are fit for their intended uses in operations, decision making and planning." (J. M. Juran). Alternatively, data is deemed of high quality if it correctly represents the real-world construct to which it refers. Furthermore, apart from these definitions, as data volume increases, the question of internal consistency within data becomes significant, regardless of fitness for use for any particular external purpose. The people's views on data quality can often be in disagreement, even when discussing the same set of data used for the same purpose. A considerable amount of data quality research involves investigating and describing various categories of desirable attributes (or dimensions) of data. These lists commonly include accuracy, correctness, currency, completeness andrelevance. Nearly 200 such terms have been identified and there is little agreement in their nature (are these concepts, goals or criteria?), their definitions or measures (Wang et al., 1993). Software engineers may recognize this as a similar problem to "ilities". Data quality control is the process of controlling the usage of data with known quality measurements for an application or a process. This process is usually done after a Data Quality Assurance (QA) process, which consists of discovery of data inconsistency and correction. Data QA processes provides following information to Data Quality Control (QC):  Severity of inconsistency  Incompleteness  Accuracy  Precision  Missing / Unknown The Data QC process uses the information from the QA process to decide to use the data for analysis or in an application or business process. For example, if a Data QC process finds that the data contains too many errors or inconsistencies, then it prevents that data from being used for its intended process which could cause disruption. For example, providing invalid measurements from several sensors to the automatic pilot feature on an aircraft could cause it to
  • 4. crash. Thus, establishing data QC process provides the protection of usage of data control and establishes safe information usage. Module 2 Determinization Problem Having defined the notation, we now can define the determinization problem: Definition 1 (Determinization). Given a set of uncertain objects O = {O1,O2, . . . ,O|O|}, a query workload Q and a quality metric F, the goal of the deteriminization problem is for each object O ∈ O to select from WO a set of tags AO ⊆ WO as the deterministic representation of O, such that F(O,Q) is optimized. The ground truth tags G are not associated with uncertain objects and thus not known to any approach to the determinization problem. Therefore, such algorithms cannot directly measure the quality F(O,Q) during their execution. Thus, in the following section, we will present our solution that is based on maximizing the expected quality measure E(F(O,Q)). Before we describe our solution to the determinization problem, we note that a baseline algorithm for determinization is a thresholding (threshold-cut) approach. The thresholding approach employs a pre-defined threshold τ . For each object O it composes its answer set A by choosing wi tags from W such that P(wi) ≥ τ . The advantage of this approach is that it is computationally efficient and potentially can be tuned to a particular dataset O and workload Q by changing τ . The main drawback of this approach is that it is unaware of the query workload (―query-unaware") and thus does not necessarily optimize the given quality metrics, which leads to lower quality. Module 3 Branch and Bound Algorithm
  • 5. Instead of performing a brute-force enumeration, we can employ a faster branch and bound (BB) technique. The approach discovers answer sets in a greedy fashion so that answer sets with lower cost tend to be discovered first. Outline of the BB algorithm The algorithm uses a branch and bound strategy to explore a search tree. Each node v in the tree corresponds to a partial tag selection where decisions of whether to include certain tags in the answer set or not have been made. We capture the partial selection of tags using the concept of sequence. The performance of the algorithm depends upon the choice of the node (amongst all the non-leaf nodes) to branch, the computation of upper and lower bounds of a node, and the choice of the tag used to expand the node.We next describe these components of the algorithm in detail. Module 4 Node Selection The algorithm maintains a priority queue H for picking a node v that contains the most promising sequence Sv to continue with. Among the nodes in the priority queue, which one should we choose to branch next? Let us consider sequence Sv that corresponds to a node v ∈ H. For v we define Av as the set of answer sequences that correspond to the leaf nodes derivable from node v via the branching procedure described above. That is, Av corresponds to the leaf nodes of the subtree rooted at node v. For example, or node v2 of the tree in Fig. 3, Av2 = {Sv8 , Sv9 , Sv10 , Sv11 }. Then for node v let mv = min A∈Av E(cost(A,G,Q)) be the value of the minimum expected cost among these sequences. Notice that if we knew the exact value of mv, it would have been an ideal key for the priority queue H, since it would lead to the quickest way to find the best sequences. The problem is that the exact value of mv is unknown when v is branched, since the subtree rooted at v is not yet constructed at that moment. Even though mv is unknown it is possible to quickly determine good lower and upper bounds on its value _v ≤ mv ≤ hv, without comparing scores of each sequence in Av. For any node v, if Sv is an answer sequence then the
  • 6. bounds are equal to the cost of the sequence itself, otherwise the bounds are computed as explained next. Module 5 Query-Level Optimization The performance of the BB algorithm can be significantly improved further by employing query- level optimizations. For a given sequence Sv of node v, we might be able to exactly determine the expected cost of certain queries (e.g., Cases 1 and 2 above), even if Sv is not an answer sequence. In other words, for these queries we can get their cost without expanding Sv into the corresponding answer sequences A ∈ Av. We refer to such cost as fixed cost of v, and to such queries as fixed or decided queries with respect to O. Thus, each node v is also associated with the set of undecided queries Qv, its fixed cost ˆcv, as well as its lower and upper bounds ˆlv and ˆhv on non-fixed cost. Note that lv = ˆcv +ˆlv and hv = ˆcv + ˆhv. Fixed cost and fixed queries give us an opportunity to make the BB algorithm even more efficient by reducing the number of queries involved in computing the expected cost. When the algorithm performs a branching at a node, it creates two new nodes vyes and vno, by making yes/no decision for one undecided tag w. Since one additional tag is now decided for each new node v, it is possible that expected cost for some query Q ∈ Qv can be computed (Case 1 and 2). If so, Q becomes a fixed query for the new node and will not be considered in future computations derived from that node. Based on the relation between Sv and Q, there are three different cases for updating ˆcv, ˆlv and ˆhv CONCLUSION: In this paper we have considered the problem of determinizing uncertain objects to enable such data to be stored in pre-existing systems, such as Flickr, that take only deterministic input. The goal is to generate deterministic representations that optimize the quality of answers to queries/triggers that execute over the deterministic data representation. We have proposed
  • 7. efficient determinization algorithms that are orders of magnitude faster than the enumeration based optimal solution but achieves almost the same quality as the optimal solution. As future work, we plan to explore determinization techniques in the context of applications, wherein users are also interested in retrieving objects in a ranked order. REFERENCES [1] D. V. Kalashnikov, S. Mehrotra, J. Xu, and N. Venkatasubramanian, ―A semantics-based approach for speech annotation of images,‖ IEEE Trans. Knowl. Data Eng., vol. 23, no. 9, pp. 1373–1387, Sept. 2011. [2] J. Li and J. Wang, ―Automatic linguistic indexing of pictures by a statistical modeling approach,‖ IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 9, pp. 1075–1088, Sept. 2003. [3] C. Wangand, F. Jing, L. Zhang, and H. Zhang, ―Image annotation refinement using random walk with restarts,‖ in Proc. 14th Annu. ACM Int. Conf. Multimedia, New York, NY, USA, 2006. [4] B. Minescu, G. Damnati, F. Bechet, and R. de Mori, ―Conditional use of word lattices, confusion networks and 1-best string hypotheses in a sequential interpretation strategy,‖ in Proc. ICASSP, 2007. [5] R. Nuray-Turan, D. V. Kalashnikov, S. Mehrotra, and Y. Yu, ―Attribute and object selection queries on objects with probabilistic attributes,‖ ACM Trans. Database Syst., vol. 37, no. 1, Article 3, Feb. 2012. [6] J. Li and A. Deshpande, ―Consensus answers for queries over probabilistic databases,‖ in Proc. 28th ACM SIGMOD-SIGACTSIGART Symp. PODS, New York, NY, USA, 2009. [7] M. B. Ebarhimi and A. A. Ghorbani, ―A novel approach for frequent phrase mining in web search engine query streams,‖ in Proc. 5th Annu. Conf. CNSR, Frederlcton, NB, Canada, 2007. [8] S. Bhatia, D. Majumdar, and P. Mitra, ―Query suggestions in the absence of query logs,‖ in Proc. 34th Int. ACM SIGIR, Beijing, China, 2011. [9] C. Manning and H. Schutze, Foundations of Statistical Natural Language Processing, Cambridge, MA, USA: MIT Press, 1999.
  • 8. [10] D. V. Kalashnikov and S. Mehrotra, ―Domain-independent data cleaning via analysis of entity-relationship graph,‖ ACM Trans. Database Syst., vol. 31, no. 2, pp. 716–767, Jun. 2006. [11] K. Schnaitter, S. Abiteboul, T. Milo, and N. Polyzotis, ―On-line index selection for shifting workloads,‖ in Proc. IEEE 23rd Int. Conf. Data Eng. Workshop, Istanbul, Turkey, 2007. [12] P. Unterbrunner, G. Giannikis, G. Alonso, D. Fauser, and D. Kossmann, ―Predictable performance for unpredictable workloads,‖ in Proc. VLDB, Lyon, France, 2009. [13] R. Cheng, J. Chen, and X. Xie, ―Cleaning uncertain data with quality guarantees,‖ in Proc. VLDB, Auckland, New Zealand, 2008.