CS 542 Database Management Systems
Query Execution
J Singh, March 21, 2011
This meeting
- Data Models for NoSQL Databases
- Preliminaries
  - What are we shooting for?
  - Reference material for benchmarks posted in blog
  - Some slides from TPC-C SIGMOD '97 presentation
- Query Execution
  - Sort: Chapter 15
  - Join: Sections 16.1 – 16.4
Data Models for NoSQL Databases
Class Discussion at Next Meeting. How would you represent many-to-many relationships? Also many-to-one and one-to-one.
- Cassandra: Brian Card
- MongoDB: Annies Ductan
- Redis: Jonathan Glumac
- Google App Engine: Sahel Mastoureshgh
- Amazon SimpleDB: Zahid Mian
- CouchDB: Robert Van Reenen
3-minute presentation (on 3/21) for 20 bonus points.
What are we shooting for?
Good benchmarks:
- Define the playing field
- Set the performance agenda
  - Measure release-to-release progress
  - Set goals (e.g., 10,000 tpmC, < 50 $/tpmC)
- Something managers can understand (!)
Benchmark abuse:
- Benchmarketing
- Benchmark wars: more $ on ads than development
To keep abuses to a minimum, benchmarks are defined with precision and read like legal documents (example). Some companies include specific prohibitions against publishing benchmark results in their license agreements.
Benchmarks have a Lifetime
- Good benchmarks drive industry and technology forward.
- At some point, all reasonable advances have been made.
- Benchmarks can become counterproductive by encouraging artificial optimizations.
- So, even good benchmarks become obsolete over time.
Database Benchmarks
Relational Database (OLTP) Benchmarks:
- TPC = Transaction Processing Performance Council
  - De facto industry standards body for OLTP performance
  - Most TPC specs, info, and results are on the web page: http://www.tpc.org
- TPC-C has been the workhorse of the industry; more in a minute
- TPC-E is more comprehensive
Different problem spaces require different benchmarks:
- Other benchmarks for analytics / decision support systems
- Two papers referenced on the course website on NoSQL / MapReduce
Benchmarks define the problem set, not the technology. E.g., if managing documents, create and use a document management benchmark, not one that was created to show off the capabilities of your DB.
TPC-C's Five Transactions
Workload definition: transactions operate against a database of nine tables.
Transactions:
- New-order: enter a new order from a customer
- Payment: update customer balance to reflect a payment
- Delivery: deliver orders (done as a batch transaction)
- Order-status: retrieve status of customer's most recent order
- Stock-level: monitor warehouse inventory
The spec also:
- Specifies the size of each table
- Specifies # of users and workflow (next slide)
- Specifies configuration requirements: must be ACID, failure tolerant, distributed, …
- Response time requirement: 90% of each type of transaction must have a response time <= 5 seconds, except stock-level, which is <= 20 seconds.
Result:
- How many TPC-C transactions can be supported?
- What is the $/tpm cost?
TPC-C Workflow
1. Select a txn from the menu:
   New-Order 45%, Payment 43%, Order-Status 4%, Delivery 4%, Stock-Level 4%
2. Measure menu Response Time. Input screen; keying time.
3. Measure txn Response Time. Output screen; think time. Go back to 1.
Cycle Time Decomposition (typical values, in seconds, for the weighted-average txn):
Menu = 0.3, Keying = 9.6, Txn RT = 2.1, Think = 11.4
Average cycle time = 23.4
TPC-C Results (by DBMS, as of 5/9/97)
Stating the obvious…
- These results are not a comparison of databases.
- They are a comparison of databases for the specific problem specified by the TPC-C benchmark.
- Ensuring a level playing field is essential when defining a benchmark and conducting measurements. Witness the Pavlo/Dean debate.
Benchmarks for Other Databases
Class Discussion at Next Meeting. What benchmarks are appropriate for:
- Key-value stores?
- Document databases?
- Network databases?
- Geospatial databases?
- Genomic databases?
- Time series databases?
- Other?
General discussion, no bonus points. Please let me know if I may call on you, and for which.
Overview of Query Execution
An example to work with
But first we must revisit Relational Algebra…
Database: City, Country, CountryLanguage.
Example query: all cities in Finland with a population at least double that of Aruba.
SELECT [xyz]
FROM City, Country
WHERE City.CountryCode = 'fin' AND
      Country.Code = 'abw' AND
      City.population > 2*Country.population;
Relational Operators
- Selection basics: idempotent, commutative
- Selection conjunctions: useful when pruning
- Selection disjunctions: equivalent to UNIONs
Selection and Cross Product
When a selection is followed by a cross product, σ_A(R × S), break A into three conditions such that A = a_r ∧ a_s ∧ a_rs, where:
- a_r has only attributes of R
- a_s has only attributes of S
- a_rs has attributes of both R and S
Then the following holds:
σ_A(R × S) = σ_{a_r ∧ a_s ∧ a_rs}(R × S) = σ_{a_rs}(σ_{a_r}(R) × σ_{a_s}(S))
In case you forgot: R ⋈_A S = σ_A(R × S). This result helps us compute theta-joins!
Review Chapter 2 of the textbook for more; back to the example…
An example to work with
Database: City, Country, CountryLanguage.
Example query: all cities in Finland with a population at least double that of Aruba.
SELECT [xyz]
FROM City, Country
WHERE City.CountryCode = 'fin' AND
      Country.Code = 'abw' AND
      City.population > 2*Country.population;
Algebra representation (T = City, Y = Country):
π_{xyz}(σ_{T.cc = 'fin' ∧ Y.cc = 'abw' ∧ T.pop > 2*Y.pop}(T × Y)), or
continued…
Example: Algebra Manipulation
Algebra representation:
π_{xyz}(σ_{T.cc = 'fin' ∧ Y.cc = 'abw' ∧ T.pop > 2*Y.pop}(T × Y)), or
π_{xyz}(σ_{T.pop > 2*Y.pop}(σ_{T.cc = 'fin'}(T) × σ_{Y.cc = 'abw'}(Y)))
(Graphical representation of the plan: figure omitted.)
Visualizing Plan Execution
- The plan is a set of 'operators'.
- The operators operate in parallel. On different machines? On different processors? In different processes? In different threads? Yes, depending on the architecture.
- Each operator feeds its output to the next operator.
- The "parallel operators" visualization allows for pipelining: the output of one operator is the input to the next. An operator can block if its inputs are not ready.
- The design goal is for the operators to pipeline (if possible): we would like to start operating with partial data, taking advantage of as much parallelism as the problem allows.
Common Elements
Key metrics of each component:
- How much RAM does it consume?
- How much disk I/O does it require?
Each component is implemented as an Iterator. The base class for each operator has three methods (see the sketch below):
- Open(). May block if input is not ready, or if it cannot proceed until all input has been received.
- GetNext(). Returns the next tuple. May block if the next tuple is not ready; returns NotFound when exhausted.
- Close(). Performs any cleanup and terminates.
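To make the interface concrete, here is a minimal Python sketch of the iterator protocol the slide describes. The method names follow the slide (Open/GetNext/Close); the NotFound sentinel object and the base-class shape are illustrative assumptions, not part of the original deck.

  # Sentinel returned by get_next() when the operator is exhausted.
  NotFound = object()

  class Operator:
      """Base class for a plan operator (the slide's Open/GetNext/Close)."""
      def open(self):
          """Prepare inputs; may block until input is ready."""
          raise NotImplementedError

      def get_next(self):
          """Return the next tuple, or NotFound when exhausted."""
          raise NotImplementedError

      def close(self):
          """Release any resources held by the operator."""
          raise NotImplementedError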
Example: Table-scan operator
Open():
    pass
GetNext():
    for b in blocks:
        for t in tuples of b:
            if t is valid: return t
    return NotFound
Close():
    pass
Key metrics:
- RAM: 1 block
- Disk I/O: number of blocks
Notes:
- Represents the operations T (= City) and Y (= Country)
- Used only if appropriate indexes don't exist
- Can use prefetching (not shown here)
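A runnable version of the same operator, building on the Operator base class sketched above. Treating None entries as invalid tuples and using a generator to hold the scan position are my own assumptions for illustration.

  class TableScan(Operator):
      """Table-scan: streams valid tuples block by block (RAM: 1 block)."""
      def __init__(self, blocks):
          self.blocks = blocks  # list of blocks; each block is a list of tuples

      def open(self):
          # The generator frame stands in for the one-block buffer.
          self._stream = (t for b in self.blocks for t in b if t is not None)

      def get_next(self):
          return next(self._stream, NotFound)

      def close(self):
          self._stream = None

  # Usage: pull tuples until NotFound, the iterator convention above.
  scan = TableScan([[("Helsinki",), None], [("Espoo",)]])
  scan.open()
  t = scan.get_next()
  while t is not NotFound:
      print(t)
      t = scan.get_next()
  scan.close()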
Summary so far
- Benchmarks are critical for defining the performance goals of the database.
  - TPC-C is a widely used benchmark; TPC-E is broader in scope but less widespread.
  - Need to choose benchmarks to fit the problem at hand.
- A query can be parsed into primitives for execution.
- Parallelism & pipelining are essential for performance.
CS-542 Database Management Systems
Query Execution Algorithms
One-pass Algorithms
- Lend themselves nicely to pipelining (with minimum blocking)
- Good for:
  - Table-scans (as seen)
  - Tuple-at-a-time operations (selection and projection)
  - Full-relation binary operations (∪, ∩, −, ⋈, ×) as long as one of the operands can fit in memory
- Considering JOIN next; read the others from the book
Example: JOIN(R,S)
Open():
    read S into memory
GetNext():
    for b in blocks of R:
        for t in tuples of b:
            for s in tuples of S:
                if t matches s: return join(t,s)
    return NotFound
Close():
    pass
Key metrics (see the sketch below):
- RAM: Blocks(S) + 1 block
- Disk I/O: Blocks(R) + Blocks(S)
Notes:
- Can use prefetching for R (not shown here)
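A runnable sketch of this one-pass join in Python, under the assumption that tuples are dicts and the join is a natural join on a single attribute. The hash index built over the in-memory copy of S is an implementation choice for the "t matches s" test, not something the slide prescribes.

  from collections import defaultdict

  def one_pass_join(R, S, attr):
      """One-pass join: S fits in memory; R is streamed past it.
      R and S are iterables of dicts; attr is the join attribute."""
      # Open(): read S into memory, indexed by the join attribute.
      index = defaultdict(list)
      for s in S:
          index[s[attr]].append(s)
      # GetNext() loop: stream R, probing the in-memory index.
      for t in R:
          for s in index.get(t[attr], []):
              yield {**t, **s}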
Nested-Loop Joins
What if all of S won't fit into memory? We can do it chunk by chunk; a 'chunk' is as many blocks of S as will fit.
Algorithm sketch (the chunk and block reads are the I/O operations):
GetNext():
    for c in chunks of S:
        for b in blocks of R:
            for t in tuples of b:
                for s in tuples of c:
                    if t matches s: return join(t,s)
    return NotFound
Key metrics:
- RAM: M
- Disk I/O: Blocks(S) + k * Blocks(R), where k = the number of chunks ≈ Blocks(S)/M
Note how quickly performance deteriorates! We can do better.
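A Python sketch of the chunked loop, assuming relations arrive as lists of blocks (lists of dict tuples), M >= 2 buffers are available, and M-1 of them hold the current chunk of S while one holds the current block of R. The block/chunk representation is illustrative only.

  def nested_loop_join(R_blocks, S_blocks, attr, M):
      """Chunked nested-loop join: for each (M-1)-block chunk of S,
      rescan all of R. Cost: Blocks(S) + (#chunks) * Blocks(R) reads."""
      chunk_size = M - 1  # one buffer is reserved for the current R block
      for i in range(0, len(S_blocks), chunk_size):
          # Load the next chunk of S into memory.
          chunk = [s for blk in S_blocks[i:i + chunk_size] for s in blk]
          for b in R_blocks:          # one block read per R block, per chunk
              for t in b:
                  for s in chunk:
                      if t[attr] == s[attr]:
                          yield {**t, **s}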
Two-pass algorithms
- Sort-based two-pass algorithms: the first pass does a sort on some parameter(s) of each operand; the second pass relies on the sort results and can be pipelined.
- Hash-based two-pass algorithms: do a prep pass and write the result back to disk; compute the result in the second pass.
Two-pass idea: sort example
- For each of C chunks of M blocks, sort each chunk and write it back. In the example, we have 4 chunks, each 6 blocks.
- Merge the result.
Key metrics:
- For the first pass: RAM: M; Disk I/O: 2 * Blocks(R)
- For the 2nd pass: RAM: C; Disk I/O: Blocks(R)
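A compact Python sketch of the two-pass idea, assuming the relation fits in C*M blocks so a single merge pass suffices. Blocks are simulated as lists, and heapq.merge plays the role of the C-way merge that needs one in-memory block per run.

  import heapq

  def two_pass_sort(blocks, M, key=None):
      """External sort sketch: sort M-block chunks (pass 1),
      then C-way merge the sorted runs (pass 2)."""
      # Pass 1: sort each chunk of M blocks and "write it back".
      runs = []
      for i in range(0, len(blocks), M):
          chunk = [t for blk in blocks[i:i + M] for t in blk]
          runs.append(sorted(chunk, key=key))  # one sorted run per chunk
      # Pass 2: merge the C runs; RAM needed is one block per run.
      return heapq.merge(*runs, key=key)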
Naïve two-pass JOIN
- Sort R and S on the common attributes of the JOIN.
- Merge the sorted R and S on the common attributes. See section 15.4.9 of the book for more details.
- Also known as Sort-Join.
Key metrics:
- Sort: RAM: M; Disk I/O: 4 * (Blocks(R) + Blocks(S)) (4, not 3, because we wrote the sort results back)
- Join: RAM: 2; Disk I/O: Blocks(R) + Blocks(S)
- Total operation: RAM: M; Disk I/O: 5 * (Blocks(R) + Blocks(S))
Efficient two-pass JOIN
Main idea: combine pass 2 of the sort with the join.
Key metrics:
- Sort (only pass 1): RAM: M; Disk I/O: 2 * (Blocks(R) + Blocks(S))
- Join: RAM: 2; Disk I/O: Blocks(R) + Blocks(S) (none additional)
- Total operation: RAM: M; Disk I/O: 3 * (Blocks(R) + Blocks(S))
Hash Join
Main idea (see the sketch below):
Pass 1: divide the tuples in R and S into m hash buckets.
- Read a block of R (or S).
- For each tuple in that block, find its hash i and move it to hash bucket i.
- Keep one block for each hash bucket in memory; write it out to disk when full.
Pass 2: for each i, read buckets Ri and Si and do their join.
Key metrics:
- RAM: M
- Disk I/O: 3 * (Blocks(R) + Blocks(S))
Disk I/O can be less if:
- Hash the bigger relation first
- Expect that many of the buckets will still be in memory
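A minimal Python sketch of the two-pass hash join, with partitions kept as in-memory lists standing in for on-disk bucket files. The bucket count m and the dict-tuple representation are assumptions for illustration.

  def hash_join(R, S, attr, m):
      """Two-pass (partition) hash join sketch.
      Pass 1: partition R and S into m buckets by hash of the join key.
      Pass 2: join each matching bucket pair with a one-pass join."""
      def partition(rel):
          buckets = [[] for _ in range(m)]
          for t in rel:                      # in a real system: spill to disk
              buckets[hash(t[attr]) % m].append(t)
          return buckets

      R_buckets, S_buckets = partition(R), partition(S)
      for Ri, Si in zip(R_buckets, S_buckets):
          # Build an in-memory index on one side of the bucket pair...
          index = {}
          for s in Si:
              index.setdefault(s[attr], []).append(s)
          # ...and probe it with the other side.
          for t in Ri:
              for s in index.get(t[attr], []):
                  yield {**t, **s}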
Index-based Algorithms
Refresher course on indexes and clustering.
The basic idea: use the index to locate records and thus cut down on I/O.
Index-based Selection
Consider the selection σ_{T.cc = 'fin'}(T).
If the relation T has a clustering index on cc:
- All matching tuples will be contiguous.
- Disk I/O: Blocks(T)/V(T,cc), where V(T,cc) is the number of distinct values of cc, so this estimates the blocks holding cc = 'fin'. Sort of…
If the relation T does not have a clustering index on cc:
- Matching tuples could be scattered.
- Disk I/O: Tuples(T)/V(T,cc)
Big difference!
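To make the "big difference" concrete, a back-of-the-envelope sketch; all the numbers here are invented for illustration only.

  # Hypothetical statistics for T, chosen only to show the gap.
  tuples = 1_000_000                    # Tuples(T)
  tuples_per_block = 100
  blocks = tuples // tuples_per_block   # Blocks(T) = 10,000
  distinct_cc = 200                     # V(T, cc)

  clustered_io = blocks / distinct_cc   # 50 block reads (contiguous)
  unclustered_io = tuples / distinct_cc # 5,000 reads, one per matching tuple
  print(clustered_io, unclustered_io)   # 50.0 vs 5000.0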
Index-based JOIN
Consider the JOIN R(X,Y) ⋈ S(Y,Z), where Y is the common set of attributes of R and S.
If, say, R has an index on Y:
- Same as a two-pass JOIN, except that we don't have to first sort/hash R.
- If a clustering index, Disk I/O: Blocks(R)/V(R,Y) + 3 * Blocks(S)
- Otherwise, Disk I/O: Tuples(R)/V(R,Y) + 3 * Blocks(S)
If both R and S are indexed, Disk I/O is reduced even further.
Summary
- Execution primitives are designed for pipelining.
- One-pass algorithms should be used wherever possible.
- Two-pass algorithms can usually be used no matter how big the problem.
- Indexes help and should be taken advantage of where possible.
Query Optimization
Based on slides from Prof. Garcia-Molina
Desired Endpoint
Example physical query plans for σ_{x=1 AND y=2 AND z<5}(R) and R ⋈ S ⋈ U.
(Figure omitted: plan trees built from IndexScan(R, y=2), Filter(x=1 AND z<5), TableScan(R), TableScan(S), TableScan(U), a materialize step, and two-pass hash-joins using 101 buffers.)
Outline
- Convert SQL query to a parse tree
  - Semantic checking: attributes, relation names, types
- Convert to a logical query plan (relational algebra expression)
  - Deal with subqueries
- Improve the logical query plan
  - Use algebraic transformations
  - Group together certain operators
  - Evaluate the logical plan based on estimated size of relations
- Convert to a physical query plan
  - Search the space of physical plans
  - Choose the order of operations
  - Complete the physical query plan
Improving the Logical Query Plan
- There are numerous algebraic laws concerning relational algebra operations.
- By applying them to a logical query plan judiciously, we can get an equivalent query plan that can be executed more efficiently.
- Next we'll survey some of these laws.
Relational Operators (revisited)
- Selection basics: idempotent, commutative
- Selection conjunctions: useful when pruning
- Selection disjunctions: equivalent to UNIONs
Laws Involving Selection
- Selections usually reduce the size of the relation.
- Usually good to do selections early, i.e., "push them down the tree".
- It can also be helpful to break up a complex selection into parts.
Selection and Binary Operators
- Must push selection to both arguments:
  σ_C(R ∪ S) = σ_C(R) ∪ σ_C(S)
- Must push to the first argument, optional for the second:
  σ_C(R − S) = σ_C(R) − S
  σ_C(R − S) = σ_C(R) − σ_C(S)
- Push to at least one argument having all attributes mentioned in C: product, natural join, theta join, intersection.
  E.g., σ_C(R × S) = σ_C(R) × S, if R has all the attributes in C.
Pushing Selection Up the Tree
Suppose we have relations
StarsIn(title, year, starName)
Movie(title, year, len, inColor, studioName)
and a view
CREATE VIEW MoviesOf1996 AS
    SELECT *
    FROM Movie
    WHERE year = 1996;
and the query
SELECT starName, studioName
FROM MoviesOf1996 NATURAL JOIN StarsIn;
The Straightforward Tree
Remember the rule σ_C(R ⋈ S) = σ_C(R) ⋈ S?
Tree: π_{starName,studioName} over (σ_{year=1996}(Movie) ⋈ StarsIn).
The Improved Logical Query Plan
- Start: π_{starName,studioName} over (σ_{year=1996}(Movie) ⋈ StarsIn)
- Push the selection up the tree: π_{starName,studioName} over σ_{year=1996}(Movie ⋈ StarsIn)
- Then push the selection down the tree to both arguments: π_{starName,studioName} over (σ_{year=1996}(Movie) ⋈ σ_{year=1996}(StarsIn))
Laws Involving Projections
- Adding a projection lower in the tree can improve performance, since tuple size is often reduced.
- Usually not as helpful as pushing selections down.
- Consult the textbook for details; will not be on the exam.
Joins and Products
Recall from the definitions of relational algebra:
R ⋈_C S = σ_C(R × S) (theta join), where C equates same-name attributes in R and S.
To improve a logical query plan, replace a product followed by a selection with a join. Join algorithms are usually faster than doing a product followed by a selection.
Summary of LQP Improvements
- Selections:
  - Push down the tree as far as possible.
  - If the condition is an AND, split and push separately.
  - Sometimes need to push up before pushing down.
- Projections: can be pushed down (sometimes; read the book).
- Selection/product combinations: can sometimes be replaced with a join.
Outline
- Convert SQL query to a parse tree
  - Semantic checking: attributes, relation names, types
- Convert to a logical query plan (relational algebra expression)
  - Deal with subqueries
- Improve the logical query plan
  - Use algebraic transformations
  - Group together certain operators
  - Evaluate the logical plan based on estimated size of relations
- Convert to a physical query plan
  - Search the space of physical plans
  - Choose the order of operations
  - Complete the physical query plan
Grouping Assoc/Comm Operators
- Group together adjacent joins, adjacent unions, and adjacent intersections as siblings in the tree.
- This sets up the logical QP for future optimization when the physical QP is constructed: determine the best order for doing a sequence of joins (or unions or intersections).
(Figure omitted: a cascade of binary unions over A, B, C, D, E, F regrouped as a single union node with all six children.)
Evaluating Logical Query Plans
- The transformations discussed so far intuitively seem like good ideas.
- But how can we evaluate them more scientifically?
- Estimate the size of relations; also helpful in evaluating physical query plans.
- Coming up next…
CS-542 Database Management Systems
Plan Estimation, based on slides from Prof. Garcia-Molina
Estimating Sizes of Relations
Used in two places:
- To help decide between competing logical query plans
- To help decide between competing physical query plans
Notation review:
- T(R): number of tuples in relation R
- B(R): minimum number of blocks needed to store R (so far, we've spelled it out as Blocks(R))
- V(R,a): number of distinct values in R of attribute a
Requirements for Estimation Rules
- Give accurate estimates
- Are easy (fast) to compute
- Are logically consistent: the estimated size should not depend on how the relation is computed
Here we describe some simple heuristics. All we really need is a scheme that properly ranks competing plans.
Estimating Size of Selection (p1)
Suppose the selection condition is A = c, where A is an attribute and c is a constant.
A reasonable estimate of the number of tuples in the result is T(R)/V(R,A), i.e., the original number of tuples divided by the number of different values of A.
- A good approximation if values of A are evenly distributed
- Also a good approximation in some other, common, situations (see textbook)
Estimating Size of Selection (p2)
- If the condition is A < c: a good estimate is T(R)/3; the intuition is that usually you ask about something that is true of less than half the tuples.
- If the condition is A ≠ c: a good estimate is T(R).
- If the condition is the AND of several equalities and inequalities, estimate in series.
Example
Consider relation R(a,b,c) with 10,000 tuples and 50 different values for attribute a.
Consider selecting all tuples from R with a = 10 and b < 20.
The estimate of the number of resulting tuples is 10,000 * (1/50) * (1/3) ≈ 67.
Estimating Size of Selection (p3)
If the condition has the form C1 OR C2, use:
- the sum of the estimates for C1 and C2, unless that sum exceeds T(R), in which case cap it at T(R); or
- assuming C1 and C2 are independent, T(R) * (1 - (1 - f1) * (1 - f2)), where f1 is the fraction of R satisfying C1 and f2 is the fraction of R satisfying C2.
Example
Consider relation R(a,b) with 10,000 tuples and 50 different values for a.
Consider selecting all tuples from R with a = 10 or b < 20.
Estimates:
- Estimate for a = 10 is 10,000/50 = 200
- Estimate for b < 20 is 10,000/3 = 3333
- Estimate for the combined condition is 200 + 3333 = 3533, or 10,000 * (1 - (1 - 1/50) * (1 - 1/3)) = 3466
Different, but not really.
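These rules are easy to put into code. The following Python sketch reproduces the slide's numbers; the rule constants (1/3 for a range predicate) come straight from the slides, while the function names are my own.

  def est_eq(T, V):        # A = c  ->  T(R)/V(R,A)
      return T / V

  def est_lt(T):           # A < c  ->  T(R)/3
      return T / 3

  def est_or(T, e1, e2):   # C1 OR C2, independence formula
      f1, f2 = e1 / T, e2 / T
      return T * (1 - (1 - f1) * (1 - f2))

  T = 10_000
  print(est_eq(T, 50))                        # 200.0
  print(est_lt(T))                            # 3333.3...
  print(min(est_eq(T, 50) + est_lt(T), T))    # capped sum: 3533.3...
  print(est_or(T, est_eq(T, 50), est_lt(T)))  # 3466.6...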
Estimating Size of Natural Join
Assume the join is on a single attribute Y. Some possibilities:
- R and S have disjoint sets of Y values, so the size of the join is 0.
- Y is the key of S and a foreign key of R, so the size of the join is T(R).
- All the tuples of R and S have the same Y value, so the size of the join is T(R) * T(S).
We need some assumptions…
Join Estimation Rule
Expected number of tuples in the result: T(R) * T(S) / max(V(R,Y), V(S,Y))
Why? Suppose V(R,Y) ≤ V(S,Y). There are T(R) tuples in R. Each of them has a 1/V(S,Y) chance of joining with a given tuple of S, creating T(S)/V(S,Y) new tuples.
Example
Suppose we have:
- R(a,b) with T(R) = 1000 and V(R,b) = 20
- S(b,c) with T(S) = 2000, V(S,b) = 50, and V(S,c) = 100
- U(c,d) with T(U) = 5000 and V(U,c) = 500
What is the estimated size of R ⋈ S ⋈ U?
- First join R and S (on attribute b): the estimated size of the result, X, is T(R)*T(S)/max(V(R,b),V(S,b)) = 40,000. The number of values of c in X is the same as in S, namely 100.
- Then join X with U (on attribute c): the estimated size of the result is T(X)*T(U)/max(V(X,c),V(U,c)) = 400,000.
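A few lines of Python reproduce the chain of estimates. Passing the statistics as bare numbers is my own bookkeeping, not the slides'.

  def est_join(Tr, Vr, Ts, Vs):
      """Size estimate for a join on one attribute Y:
      T(R)*T(S) / max(V(R,Y), V(S,Y))."""
      return Tr * Ts / max(Vr, Vs)

  # R(a,b), S(b,c), U(c,d) with the slide's statistics
  Tx = est_join(1000, 20, 2000, 50)      # R join S on b -> 40,000
  Vx_c = 100                             # X inherits V(S,c) = 100
  print(est_join(Tx, Vx_c, 5000, 500))   # X join U on c -> 400,000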
Summary of Estimation Rules
- Projection: exactly computable
- Product: exactly computable
- Selection: reasonable heuristics
- Join: reasonable heuristics
The other operators are harder to estimate…
Estimating Size Parameters
- Estimating the size of a relation depended on knowing T(R) and the V(R,a)'s.
- Estimating the cost of a physical algorithm also depends on knowing B(R).
How can the query compiler learn them?
- Scan the relation to learn T and the V's, then calculate B.
- Can also keep a histogram of the values of attributes; this makes estimating join results more accurate.
- Recompute periodically: after some time or some number of updates, or if the DB administrator thinks the optimizer isn't choosing good plans.
Heuristics to Reduce Cost of LQP
- For each transformation of the tree being considered, estimate the "cost" before and after doing the transformation.
- At this point, "cost" only refers to the sizes of intermediate relations (we don't yet know the number of disk I/Os).
- The sum of the sizes of all intermediate relations is the heuristic: if this sum is smaller after the transformation, then incorporate it.
Why couldn't we… A few questions to explore
- NoSQL has also been described as NoJOIN. Could we use the techniques discussed here to implement JOINs on a NoSQL database?
- Could we implement the parallel operators as MapReduce jobs?
Suitable topics in case you have not yet chosen a project.
Update on Projects
- Consider including benchmark results in your presentation.
- There is no need to submit your code.
  - Key fragments can be included in your report, as seen in numerous papers.
  - Do include the design of the code in your report.
  - Do not submit code; it will not be evaluated.
- Pace yourself:
  - Plan to finish your project coding in 2 weeks (by 4/4).
  - Plan to write and perfect your report and PPT after that.
  - Budget your presentation time carefully.
How is it going?
Next week
- Query Optimization
- Suggested topic? We have half a lecture open to cover any topics of interest to everyone.
