As you may recall, in previous parts of this series we agreed that reusing execution plans can have a tremendous impact on SQL Server performance, and we saw how to avoid unnecessary recompilations by proper use of temporary tables and by not changing plan-cache-key SET options.
If you missed those articles, they can be found here:
In fact, this performance boost is getting larger and larger because of SQL Server's increasing complexity. From version to version you will see an increase in the time it takes to compile a query but, in most cases, a decrease in query execution time. In other words, the first time a new query is run it may actually take longer than on a previous SQL Server version, but subsequent executions of the cached plan will be considerably faster.
In this post we are going to investigate plan optimality-related recompilation reasons, and what we can do to avoid them where appropriate. But first, we have to get familiar with how different SQL Server versions create and maintain statistics.
Statistics creation and maintenance
SQL Server collects statistics about individual columns (single-column statistics) or sets of columns (multicolumn statistics, which from SQL Server 2008 onwards can be based on up to 32 columns). Both kinds of statistics are used by the Query Optimizer to estimate the selectivity of expressions, and thus the size of intermediate and final query results. The metainformation about a single statistics object is stored in a row in the sysindexes table, and the statistic itself is stored as a statistics binary large object (statblob) in an internal table.
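From SQL Server 2005 onwards the statistics objects defined on a table are also exposed through the sys.stats catalog view. As a minimal sketch (run against AdventureWorks2008R2):

```sql
-- List all statistics objects on Sales.SalesOrderHeader,
-- including those auto-created by the optimizer (names starting with _WA_Sys)
SELECT s.name, s.stats_id, s.auto_created, s.user_created
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID('Sales.SalesOrderHeader');
```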
SQL Server collects, among others, the following statistics about table columns:
- The average key length.
- A single-column histogram, including up to 200 steps describing the distribution of values in a given column (multicolumn statistics have to be created manually, and only their leading column gets a histogram).
- The estimated number of rows matching the filter (for filtered statistics — filtered statistics as well as filtered indexes were introduced in SQL Server 2008), or all rows in the table (for regular statistics).
By default, statistics are created by sampling the data, both when the CREATE STATISTICS command is executed and when statistics are created automatically. CREATE INDEX, on the other hand, scans the whole table anyway; therefore, index statistics are initially created with fullscan.
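The difference can be sketched as follows (the statistic name st_CustomerID is made up for this example):

```sql
-- Created with default sampling:
CREATE STATISTICS st_CustomerID ON Sales.SalesOrderHeader (CustomerID);
-- The same quality as index statistics can be forced explicitly:
UPDATE STATISTICS Sales.SalesOrderHeader st_CustomerID WITH FULLSCAN;
```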
Statistics can be not only automatically created but also kept up to date. SQL Server determines whether to update statistics based on changes to column modification counters (colmodctrs):
- If the statistics object is defined on a regular table, it will be considered stale when:
- The first row is inserted into the empty table.
- The table contained 500 or fewer rows and the colmodctr of the leading column of the statistics object has changed by more than 500.
- The table contained more than 500 rows and the colmodctr of the leading column of the statistics object has changed by more than 500 + 20% of the number of rows.
- For filtered statistics, the colmodctr is first adjusted by the selectivity of the filter.
- If the statistics object is defined on a temporary table, it is invalidated under the same rules as above, plus there is an additional recomputation threshold at 6 rows.
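On SQL Server 2008 R2 SP2 and later, the modification counters behind these rules can be inspected with the sys.dm_db_stats_properties function. A sketch:

```sql
-- Show last update time, sampled rows and the modification counter
-- for every statistics object on the table
SELECT s.name, sp.last_updated, sp.rows, sp.rows_sampled, sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID('Sales.SalesOrderHeader');
```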
Please note that statistics are not created for table variables at all. Because SQL Server has to assume something, it has to guess their cardinality. In practice it assumes 1 row most of the time, unless the query is compiled at run time (for example, when the OPTION (RECOMPILE) hint is used). Since this blog is about avoiding recompilations to achieve better performance, use table variables only if they will store no more than a hundred or so rows.
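The 1-row guess is easy to observe. In this illustrative batch (not taken from the article's examples), the first SELECT is estimated at 1 row regardless of the actual content, while OPTION (RECOMPILE) lets the optimizer see the real cardinality:

```sql
DECLARE @t TABLE (col int);
INSERT INTO @t
SELECT SalesOrderID FROM Sales.SalesOrderHeader; -- tens of thousands of rows

SELECT COUNT(*) FROM @t;                    -- table scan estimated at 1 row
SELECT COUNT(*) FROM @t OPTION (RECOMPILE); -- estimate reflects the actual row count
```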
As I stated initially, I want to clear up any confusion surrounding automatic statistics updates. The first thing that needs clarification is the difference between invalidating and updating. Even if a statistic becomes outdated as the result of a data modification, it will not be automatically updated as soon as the modification completes. The statistic is updated the next time a query plan needs it.
Statistics are really useful
Up-to-date statistics, along with schema metadata (such as column types and constraints), information about server configuration, and the current workload (such as the amount of available memory), are required by the Query Optimizer to find good enough execution plans.
To see why statistics are so important, turn on Actual Execution Plan, and run this simple query:
USE AdventureWorks2008R2
GO
SELECT h.[SalesOrderID]
FROM Sales.SalesOrderHeader h
JOIN Sales.SalesOrderDetail d ON h.SalesOrderID = d.SalesOrderId
WHERE h.[CustomerID] = 11000;
Then hover your mouse over the Index Seek (NonClustered) operator. A window with additional information about this operator will pop up. Please take note of the predicate used for this seek (CustomerID = 11000). Based on the statistics, SQL Server assumed that there would be three rows in the SalesOrderHeader table for this customer (estimated number of rows equals 3), and it was right (the actual number of rows also equals 3).
To see relevant statistics, execute the following query:
DBCC SHOW_STATISTICS ('Sales.SalesOrderHeader','IX_SalesOrderHeader_CustomerID');
Besides other information about the chosen statistics (all indexes have corresponding statistics) you will find one hundred fifty something distribution steps (the maximum number of distribution steps is 200). The first one has the value 11000 in the RANGE_HI_KEY column and 3 in the EQ_ROWS column, which means that there are exactly three rows with CustomerID = 11000.
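If only the distribution steps are of interest, DBCC SHOW_STATISTICS can be asked for the histogram alone:

```sql
DBCC SHOW_STATISTICS ('Sales.SalesOrderHeader', 'IX_SalesOrderHeader_CustomerID')
    WITH HISTOGRAM;
```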
What about values not stored in a histogram? To answer this question, please change CustomerID to 11002 and run this query again:
SELECT h.[SalesOrderID]
FROM Sales.SalesOrderHeader h
JOIN Sales.SalesOrderDetail d ON h.SalesOrderID = d.SalesOrderId
WHERE h.[CustomerID] = 11002;
This time hover your mouse over the arrow connecting the Index Seek (NonClustered) operator with the Nested Loops one. As you can see, SQL Server was more or less correct about the actual number of rows with CustomerID equal to 11002.
If you have a look at the second distribution step of the same statistics, you will find that:
- There are 47 rows with CustomerID values between 11000 and 11018, endpoints excluded (based on the RANGE_ROWS value).
- There are 18 distinct CustomerID values in this range (the DISTINCT_RANGE_ROWS column holds this information).
- Each CustomerID value within this range occurs 2.61111 times on average (AVG_RANGE_ROWS tells the story; note that 47 / 18 ≈ 2.61111).
Knowledge about how statistics are stored and used allows us to explain why SQL Server invalidates a statistic when the first row is inserted into a table. If the statistic is created along with the empty table, the histogram is initially empty. Under the "500 rows" rule alone, the statistic would become stale only after 500 rows had been inserted, so the recompilation of queries using that table would be postponed until the table contained at least 500 rows. In the meantime a potentially sub-optimal execution plan would be used.
Recompilations due to statistics changes
Without statistics the Query Optimizer would have to guess (using built-in heuristics) the number of processed rows. Stale statistics also lead to estimation errors. To avoid those problems, all databases have the AUTO_CREATE_STATISTICS and AUTO_UPDATE_STATISTICS options set to on by default. But as far as stored procedures are concerned, this default mechanism may lead to unnecessary recompilations: most of the time when we create tables inside our stored procedures, we can predict how big they will be.
Note for administrators. In SQL Server 2005/2008/2008 R2, updating statistics when the database option 'auto update statistics' is on always causes plan invalidation, even if the data and the histogram itself do not change. Therefore, be careful with executing UPDATE STATISTICS for all tables as part of your maintenance plans. The preferred solution is the sp_updatestats stored procedure, which updates only the statistics that require updating based on their colmodctr counters.
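For example (sp_MSforeachtable is an undocumented but commonly used helper procedure):

```sql
-- Rebuilds every statistic and invalidates all dependent plans:
EXEC sp_MSforeachtable 'UPDATE STATISTICS ?';
-- Preferred: skips statistics whose modification counters are zero:
EXEC sys.sp_updatestats;
```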
To illustrate this concept, let’s create (in AdventureWorks2008R2 database) the following stored procedure:
CREATE PROCEDURE uspStatisticsChangeRecompliationv1
AS
BEGIN
    CREATE TABLE tab1 (col int);
    DECLARE @i int = 0;
    WHILE (@i < 10)
    BEGIN
        INSERT INTO tab1 VALUES (@i);
        SET @i += 1;
        SELECT col FROM tab1 GROUP BY col;
    END
    DROP TABLE tab1;
END
GO
Then start Profiler, capture the SP:Recompile event, and execute our stored procedure for the first time:
EXEC uspStatisticsChangeRecompliationv1; GO
Great, no additional recompilations (the deferred compile concept was explained in the previous part of this series).
But we already know that there is no way to avoid unnecessary recompilations when permanent tables are created inside stored procedures. If you missed the previous part of this series, please execute the stored procedure a couple more times and observe that each and every execution triggers two recompilations due to schema changes.
Let me create a second version of this stored procedure, but this time we will use a temporary table:
CREATE PROCEDURE uspStatisticsChangeRecompliationv2
AS
BEGIN
    CREATE TABLE #tab1 (col int);
    DECLARE @i int = 0;
    WHILE (@i < 10)
    BEGIN
        INSERT INTO #tab1 VALUES (@i);
        SET @i += 1;
        SELECT col FROM #tab1 GROUP BY col;
    END
    TRUNCATE TABLE #tab1;
END
GO
After clearing the trace window, execute this version of our stored procedure:
EXEC uspStatisticsChangeRecompliationv2; GO
It seems that using temporary tables makes things worse. Nonetheless, subsequent executions will not trigger any recompilations, as expected. So the lesson learned from the previous articles (the one about using temporary tables) still applies; the only thing left to do is to eliminate this additional recompilation.
This is where hints come in. The first one, KEEP PLAN, disables the "6 rows" rule, which means that, as far as statistics invalidation is concerned, temporary tables will be treated exactly like permanent ones. The second one, KEEPFIXED PLAN, disables recompilations due to statistics changes completely.
To see them in action, alter our stored procedure:
ALTER PROCEDURE uspStatisticsChangeRecompliationv2
AS
BEGIN
    CREATE TABLE #tab1 (col int);
    DECLARE @i int = 0;
    WHILE (@i < 10)
    BEGIN
        INSERT INTO #tab1 VALUES (@i);
        SET @i += 1;
        SELECT col FROM #tab1 GROUP BY col
        OPTION (KEEP PLAN); -- or KEEPFIXED PLAN if appropriate
    END
END
GO
Then execute it three times. The first execution will trigger two necessary deferred compilations, but the following ones will not:
Lesson learned: when temporary tables are being created inside stored procedures, you will probably want to prevent additional recompilations. To do this, use the KEEP PLAN or even KEEPFIXED PLAN hints.
What about Triggers?
All of the discussed plan optimality-related reasons for recompilation apply to triggers as well. In addition, triggers are recompiled when the number of rows in the inserted or deleted virtual tables changes significantly from one execution to the next. In this context "significantly" is determined as follows:
- If the count of rows in one of those virtual tables for the current trigger execution (let's call it c for current) is bigger than the row count of the corresponding table in the cached execution plan (p), then the log10(c) – log10(p) > 1 formula is used. For example, a plan cached for 10 rows triggers a recompilation once the current execution affects more than 100 rows.
- Otherwise, the log10(p) – log10(c) > 2.1 formula is applied.
Recompilation is not always a good idea, especially when the recompiled plan is the same as the original one. Nevertheless, disabling the AUTO_CREATE_STATISTICS and/or AUTO_UPDATE_STATISTICS options is not the way to deal with unnecessary recompilations; most of the time setting those options to off does more harm than good. Using table variables instead of temporary tables will eliminate recompilations, but the lack of statistics (please remember that our goal is to eliminate additional recompilations, and table variable cardinality is only available during a recompilation) often leads to horrible execution plans. Therefore, table variables should be used mostly in audit scenarios, where you need to keep information after a transaction rollback has occurred.
If query performance suffers from excessive statistics updating, you should:
1. Consider using KEEP PLAN or KEEPFIXED PLAN query hints, especially inside your stored procedures.
2. Think of turning on the AUTO_UPDATE_STATISTICS_ASYNC database option. This way stale statistics will still be updated automatically, but queries that trigger the update will not wait for it to finish. Instead, they will be executed against the existing statistics, and SQL Server will update the stale statistics in the background. As a result, the current execution plan may be suboptimal, but the following ones should be fine. This option is especially useful in OLTP environments.
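Switching the option on for our sample database could look like this:

```sql
-- Has effect only when AUTO_UPDATE_STATISTICS is also ON
ALTER DATABASE AdventureWorks2008R2 SET AUTO_UPDATE_STATISTICS_ASYNC ON;
```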
Note. You should always write robust code, which means that your code should not depend on any database or SQL Server settings. When those settings change, well-written stored procedures will behave exactly the same way as before. Hence using the KEEP PLAN or KEEPFIXED PLAN query hints is the preferred solution.