Does Table Column Order Affect UPDATE Performance?

Both you and I have heard this before: "When creating an Oracle table, place the most commonly accessed column at the beginning of the CREATE TABLE statement, because performance will be better."

My previous blog post addressed SELECT statements. It was pretty obvious that performance began to degrade immediately, starting with the second column. By the 100th column, SELECT statement elapsed time had increased roughly three times!

But what about UPDATE statements? The answer to that question is the focus of this article. In my next article, I will compare and contrast the SELECT and UPDATE results and also make some inferences regarding why the performance changes so dramatically and how to apply this in your DBA work.

How I Designed The Experiment

With an experiment like this, it is easy to get overwhelmed with all the possible experimental variations. I decided on one variation, with the objective of learning whether there is a statistically significant performance change. That's it. Not how much of a difference, and not at what column number performance significantly changes. I'm also not trying to determine whether the column type or the column length makes a difference. I just want to know two things about my experiment:

  • Does column order make a difference?
  • How can I apply the results to my work?

That is it. Otherwise, I would never get the experiment completed and posted.

However, I have made everything available to you. This way, you can alter the experiment and do some research yourself.

Here is what I chose for this experiment, which is focused on UPDATE statements (a sketch of the resulting table definition follows the list):

  • 5 million row table
  • 180 columns
  • Alternating column types: NUMBER and VARCHAR2
  • Numeric columns contain a uniformly distributed random number with a length from 1 to 20 digits.
  • Varchar2 columns contain uniformly distributed random printable character strings with lengths from 1 to 20 characters.
  • Gathered 35 samples for each column, but did not use the first four to reduce the impact of any potential buffering.
  • The posted results contain the final 31 samples averaged for each column accessed.
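
The exact creation and load script is in the Analysis Pack linked below. Purely for illustration, a table like this could be created dynamically along the following lines; the VARCHAR2(20) sizing and the choice of which columns are NUMBER versus VARCHAR2 are illustrative assumptions, not necessarily what the actual script uses.

  -- Sketch only: the actual creation and load script is in the Analysis Pack.
  -- Assumption: odd-numbered columns are NUMBER, even-numbered are VARCHAR2(20).
  declare
    ddl_v varchar2(32767) := 'create table colorder (';
  begin
    for i in 1 .. 180 loop
      if mod(i, 2) = 1 then
        ddl_v := ddl_v || 'c' || i || ' number';
      else
        ddl_v := ddl_v || 'c' || i || ' varchar2(20)';
      end if;
      if i < 180 then
        ddl_v := ddl_v || ', ';
      end if;
    end loop;
    ddl_v := ddl_v || ')';
    execute immediate ddl_v;
  end;
  /

Loading the 5 million rows with random values can then be done with dbms_random.value for the NUMBER columns and dbms_random.string for the VARCHAR2 columns.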

Below is the UPDATE PL/SQL that I dynamically generated and executed 34 times for each of the 180 columns. The key is that I wanted to ensure only the specific column (1 to 180) was named in the SQL statement; the column number is held in the variable column_loop_v. For details, download the Analysis Pack from the link below.

the_sql_v := 'update colorder set c'|| column_loop_v ||' = '|| equals_v ||' where rownum between '|| row_low_v ||' and ' || row_high_v ;
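
The full driver script is in the Analysis Pack. Purely as an illustration, a timing loop around that statement could look something like the sketch below; the bind values, row range, timing mechanism, and result storage shown here are illustrative assumptions, not necessarily what the actual script does.

  -- Illustrative sketch only; the real script is in the Analysis Pack.
  declare
    the_sql_v    varchar2(400);
    equals_v     varchar2(30) := '1';   -- value assigned to the column (assumption)
    row_low_v    number := 1;           -- row range bounds (assumption)
    row_high_v   number := 10000;
    start_cs_v   number;
    elapsed_cs_v number;                -- elapsed time in centiseconds
  begin
    for column_loop_v in 1 .. 180 loop
      the_sql_v := 'update colorder set c'|| column_loop_v ||' = '|| equals_v ||
                   ' where rownum between '|| row_low_v ||' and '|| row_high_v;
      start_cs_v := dbms_utility.get_time;
      execute immediate the_sql_v;
      elapsed_cs_v := dbms_utility.get_time - start_cs_v;
      -- record elapsed_cs_v for this column, e.g. insert into a results table
      rollback;   -- leave the table unchanged between samples (assumption)
    end loop;
  end;
  /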

I created an "analysis pack" you can download HERE containing the:

  • Experiment script, so you can see it, run it, and modify and run it yourself.
  • Plotting SQL*Plus script, which pulls the experimental results and formats them into an R script, which you can then simply copy and paste into R to create the chart you see below.
  • Plot you see below.
  • Table export (COLORDER_RESULTS_UPD) of the experimental results, so you can crunch the numbers yourself.
  • Histograms I present below and the ones I do not discuss.
  • Statistical analysis step-by-step, using R, that I summarize below.

Column Order Does Make A Difference

The picture says it all... it makes a difference. And you'll forgive me if I say, "A big difference." In my specific experiment, updating the 100th column takes about 2.8 times as long (6050/2128 = 2.84) as updating the first column.

Clearly and immediately, performance degrades as we UPDATE a column other than the first column. But to satisfy any naysayers, below you can find the statistical significance test between column one and column two, plus other sweet details.

Statistical Analysis Detail

For those of you who care about the underlying statistical analysis, here are the results in all their glory.

If you click HERE you will see a text file containing exactly what I typed into R (the free statistics package) and the R results. In summary, the column one elapsed-time data is normally distributed (p-value = 0.08394) but the column two data is not (p-value = 0.02829). Therefore, to test whether the data sets are statistically different, I do not use a t-test but instead a Wilcoxon rank sum test. The resulting Wilcoxon p-value is 0.00000, which means the c1 and c2 samples very likely come from different populations, and the difference is very unlikely to be due to random sampling (that is, we did not just happen to pick convenient values).
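
If you want to run this style of analysis on your own elapsed-time samples, a minimal R sketch looks like the following. The file and vector names are placeholders, and the use of shapiro.test for the normality check is an illustrative choice; the exact commands I used are in the linked text file.

  # Minimal sketch only -- the actual commands are in the linked text file.
  # c1_et and c2_et are placeholder vectors of per-sample elapsed times
  # for column one and column two.
  c1_et <- scan("c1_elapsed.txt")   # placeholder file names
  c2_et <- scan("c2_elapsed.txt")

  shapiro.test(c1_et)   # p > 0.05: consistent with a normal distribution
  shapiro.test(c2_et)   # p < 0.05: not normally distributed

  # At least one sample is non-normal, so use a non-parametric test
  # rather than a t-test.
  wilcox.test(c1_et, c2_et)   # Wilcoxon rank sum (Mann-Whitney) test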

If that did not make sense, the histograms below may help. They show the column one (red) and column two (blue) elapsed time data sets as "smooth histograms" (density plots). There is quite a bit of overlap, much more than I expected. Visually they do appear to be two different data sets, but the separation is not as dramatic as I like to see!


Not impressed? Me neither! So I ran the same analysis comparing columns one and five. Shown below are the column one (red) and column five (blue) elapsed time data sets as "smooth histograms." There is no overlap at all; even visually, these are clearly two very different data sets.


What does this mean? It means the SQL UPDATE statement elapsed times for column one are statistically faster than those for column two. And the column two elapsed times are closer to the column one elapsed times than those of ANY other column, which are visually much different in the plot in the previous section above.

Looking at the above two graphics, it may seem a waste of time to go through the statistical analysis. But over the years, working with lots of data sets, I have found it is always good to have a visual representation, a numeric representation, and a statistical representation. If all three line up, then we're good!
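
As a side note, overlapping "smooth histograms" like the ones above are easy to produce in R. Here is a minimal sketch, reusing the same placeholder vector names as before (c1_et for column one, c5_et for column five):

  # Overlay two kernel density estimates ("smooth histograms").
  # c1_et and c5_et are placeholder elapsed-time vectors.
  d1 <- density(c1_et)
  d5 <- density(c5_et)
  plot(d1, col = "red", lwd = 2,
       xlim = range(d1$x, d5$x), ylim = range(d1$y, d5$y),
       main = "Column 1 vs Column 5 elapsed times", xlab = "Elapsed time")
  lines(d5, col = "blue", lwd = 2)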

Should I Reorganize My Tables?

As I wrote in my previous article: please, NO! Do NOT schedule table reorganizations! Before you seriously consider doing anything that extreme, run some simple performance tests.

For example, try running some tests using your real data and real production SQL UPDATE statements (a rough sketch follows below). And of course, any expected benefit should directly improve the user and/or business experience.
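
As a rough starting point, something like the following SQL*Plus snippet can show whether the effect is even measurable on your data. All table and column names are placeholders; use a copy of one of your own tables, an early column, and a column far down the column list.

  -- Hypothetical sanity check on a COPY of one of your own tables.
  -- first_col is an early column, late_col is one far down the column list.
  set timing on

  update my_table_copy set first_col = first_col where id <= 100000;
  rollback;

  update my_table_copy set late_col = late_col where id <= 100000;
  rollback;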

It Seems Pretty Obvious To Me

For me, the chart is pretty convincing. Clearly, table column order makes a difference for UPDATE statements. But the key question is whether the effort involved is worth the true benefit to your organization. That is something I can't answer. But if you have a chance to influence the order of table columns, then definitely remember this research project!

But best of all, now when someone starts talking about column ordering, you will have something definitive to say, along with a place anyone can go to check out the results for themselves. For me, that's valuable.

What's Next?

In my next article, I will compare the SELECT and UPDATE experimental results. More importantly, I think, I will infer WHY there is a performance decrease and bring this research project to a practical close.

Thanks for reading and all the best in your Oracle tuning work!

Craig.

If you have any questions or comments, feel free to email me directly at craig@orapub.com.





Comments

Krzysztof Lechowski:

Very interesting. I wonder if we should think of designing new database tables in a sort of reverse-natural order, a little like how Yoda conditions emerged. Normally we would have something like [id, amount]. I wonder if making it [amount, id] would give any performance benefit, since the id would be indexed / hash-mapped anyway, while being virtually never updated.

Frank B.:

Great that you did the experiments. I never thought about it in detail, but Oracle needs to calculate the offset of each column as the sum of the lengths of all the previous columns in the record. For fixed-length column types a constant could be added; for variable-length ones the stored length has to be read (also taking care of the length < 250 case). What a mess for in-memory databases, when I/O time cannot be blamed for everything. :-)


(Unattributed comment): Err... any reason you don't post AWR details? Don't you think you're getting chained rows and therefore different I/O patterns for some updates? If a row spans pages, which it surely will given your test, then...

Thomas B. Cox:

I bet I can challenge this...

Craig Shallahamer (author):

That’s an excellent question Ravinder Bahadur. I designed the experiment to not show the delay reason. I only wanted to see if there was a delay. However, I don’t think the delay is because of a column length difference because the elapsed time increases as the column number increases. If the cause was only because of column length, then I would expect the elapsed time to be pretty much the same. Interestingly, I am running a slightly different experiment to ensure a row is always changed by the update statement and have included in this updated experiment a random column value length to reduce the likelihood of any column length impact. That should be posted either next week or the week after. I’m looking forward to the results!
