The Case of VARCHAR(4), or When You Should Not Use a Variable-Length Column

Have you ever considered using VARCHAR(4) or a similar data type? If so, you really should read this post. If not, please keep reading and find out why you were right.

Don’t be cheap on data types

SQL Server, at least up to and including SQL Server 2014, reads and writes whole pages (8-KB blocks of data), never individual rows or tables. This is why the average number of rows that fit on a page is one of the most important factors in table and index design.

At first glance, choosing a variable-length type for a column allows you to spare some space — for example, if somebody's name is 8 characters long, it takes 8 bytes in a VARCHAR(25) column, not the 25 that would have been taken had the column been declared as CHAR(25). Right, but this does not mean that you should use variable-length data types whenever possible.
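The per-value difference is easy to see with DATALENGTH. A minimal sketch (the sample name is made up):

```sql
-- DATALENGTH returns the stored size in bytes:
-- 8 characters take 8 bytes in VARCHAR(25), but CHAR(25) is always padded to 25.
DECLARE @v VARCHAR(25) = 'Kowalski',
        @c CHAR(25)    = 'Kowalski';
SELECT DATALENGTH(@v) AS VarBytes,    -- 8
       DATALENGTH(@c) AS FixedBytes;  -- 25
```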

Take this example: in a newly created database I will create two simple tables:

CREATE DATABASE VarCharTest;
GO

USE VarCharTest;
GO

CREATE TABLE tabVarLength (col1 INT IDENTITY PRIMARY KEY,
                           col2 VARCHAR(4) NULL,
                           col3 CHAR(100) DEFAULT 'Filler' NOT NULL);
GO

CREATE TABLE tabFixedLength (col1 INT IDENTITY PRIMARY KEY,
                             col2 CHAR(4) NULL,
                             col3 CHAR(100) DEFAULT 'Filler' NOT NULL);
GO


In the first table col2 is defined as VARCHAR(4), in the second one — as CHAR(4). Other columns are identical in both tables.

Now let me insert 10,000 identical rows into both tables. Please note that the majority (roughly 80%) of the values stored in col2 are 2 bytes long, about 10% are 3 bytes, and the remaining 10% or so are only 1 byte long:

SET NOCOUNT ON;
GO
DECLARE @counter INT = 0,
        @val INT;
WHILE @counter < 10000
BEGIN
     SET @val = RAND() * 110;
     INSERT INTO tabVarLength (col2)
     VALUES (@val);
     INSERT INTO tabFixedLength (col2)
     VALUES (@val);
     SET @counter += 1;
END


Since most col2 values are 2 bytes long, summing up the data length for both tables gives a seemingly straightforward result — the variable-length column saved about 50% of the necessary space, right?

SELECT SUM(DATALENGTH (COL2)) FROM tabVarLength
UNION
SELECT SUM(DATALENGTH (COL2)) FROM tabFixedLength;
GO
------
19956
40000


Unfortunately, this is not the case. If we check how many leaf-level pages are used by both tables, and what the record sizes are, we will find that the fixed-length column actually takes less space:

SELECT page_count, avg_record_size_in_bytes, min_record_size_in_bytes, max_record_size_in_bytes, avg_page_space_used_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('tabVarLength'), NULL, NULL, 'DETAILED')
WHERE index_level = 0
UNION
SELECT page_count, avg_record_size_in_bytes, min_record_size_in_bytes, max_record_size_in_bytes, avg_page_space_used_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('tabFixedLength'), NULL, NULL, 'DETAILED')
WHERE index_level = 0;
GO
------
page_count  avg_record_size_in_bytes  min_record_size_in_bytes  max_record_size_in_bytes  avg_page_space_used_in_percent
145         115                       115                       115                       99.6659995058068
148         116.995                   116                       118                       99.3111440573264


Because all the other columns are identical in both tables, the difference in row size must come from using a variable-length rather than a fixed-length data type for col2.

Row structure

To understand where those results come from, we will have to take a look at the picture below (taken from an old MOC course — the one about SQL Server 2000, if I am not mistaken).

[Image: SQL Server row structure — record header, fixed-length data block, NULL bitmap, variable-length block, variable-length data]

As you can see, after the record header comes the fixed-length data block, then the NULL bitmap and, if required, the variable-length block. The variable-length data itself is stored at the end of the slot. The most important structure here is the variable-length block, which contains the number of variable-length columns (2 bytes) and a 2-byte pointer to the end of each variable-length value. So each variable-length column adds 2 bytes of overhead, except the very first one, which introduces 4 bytes of overhead (the 2-byte column count plus its own 2-byte pointer).
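With this layout, the numbers reported by sys.dm_db_index_physical_stats can be reproduced by hand. A back-of-the-envelope sketch (my own arithmetic, based on the row structure above): both rows share the 4-byte record header, the 2-byte column count, and the 1-byte NULL bitmap, but the variable-length row trades the 4-byte CHAR(4) for a typical 2-byte value plus 4 bytes of variable-length bookkeeping:

```sql
-- Fixed:    header (4) + INT (4) + CHAR(4) (4) + CHAR(100) (100)
--           + column count (2) + NULL bitmap (1)              = 115 bytes.
-- Variable: the CHAR(4) becomes a typical 2-byte value, plus 2 bytes
--           for the variable-column count and 2 bytes for its offset.
SELECT 4 + 4 + 4 + 100 + 2 + 1         AS FixedRowBytes, -- 115
       4 + 4 + 100 + 2 + 1 + 2 + 2 + 2 AS VarRowBytes;   -- 117
```

This matches the DMV output: a constant 115 bytes for every fixed-length row, and 116 to 118 bytes for the variable-length rows (1- to 3-byte values).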

And this can easily be seen by using the undocumented, but perfectly safe, DBCC IND and DBCC PAGE commands.

The first one takes a database name, a table name, and an index ID, and returns the list of pages on which the data is stored:

DBCC IND ('VarCharTest','tabVarLength',1);
GO
-----
PageFID    PagePID    IAMFID    IAMPID    ObjectID    IndexID    PartitionNumber    PartitionID    iam_chain_type    PageType    IndexLevel    NextPageFID    NextPagePID    PrevPageFID    PrevPagePID
1    145    NULL     NULL    245575913    1          1            72057594039042048    In-row data    10    NULL    0    0    0    0
1    144    1        145     245575913    1          1            72057594039042048    In-row data    1    ...
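As a side note: on SQL Server 2012 and later, a similar page list can be obtained from the (also undocumented) sys.dm_db_database_page_allocations function. A sketch, assuming that DMF is available in your version:

```sql
-- Undocumented alternative to DBCC IND (SQL Server 2012+).
SELECT allocated_page_file_id, allocated_page_page_id, page_type_desc
FROM sys.dm_db_database_page_allocations(DB_ID(), OBJECT_ID('tabVarLength'), 1, NULL, 'DETAILED')
WHERE is_allocated = 1;
```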


Note that each page is identified by a file ID (PageFID) and a page ID (PagePID). These two numbers can be used to display the data actually stored on a given page (the last parameter determines the output format):

DBCC TRACEON (3604);
DBCC PAGE ('VarCharTest',1,144,3);
GO


Trace flag 3604 was switched on to redirect the DBCC PAGE output to the client console.

Just after the buffer header and the page header you will find the row offset data plus its interpretation. Take a look at some of these rows:

Slot 0 Offset 0x60 Length 117

Record Type = PRIMARY_RECORD Record Attributes = NULL_BITMAP VARIABLE_COLUMNS

Record Size = 117

Memory Dump @0x000000001DD7A060

0000000000000000: 30006c00 01000000 46696c6c 65722020 20202020 0.l…..Filler

0000000000000014: 20202020 20202020 20202020 20202020 20202020

0000000000000028: 20202020 20202020 20202020 20202020 20202020

000000000000003C: 20202020 20202020 20202020 20202020 20202020

0000000000000050: 20202020 20202020 20202020 20202020 20202020

0000000000000064: 20202020 20202020 03000001 00750035 33 …..u.53

Slot 0 Column 1 Offset 0x4 Length 4 Length (physical) 4

col1 = 1

Slot 0 Column 2 Offset 0x73 Length 2 Length (physical) 2

col2 = 53

Slot 0 Column 3 Offset 0x8 Length 100 Length (physical) 100

col3 = Filler

Slot 0 Offset 0x0 Length 0 Length (physical) 0

KeyHashValue = (8194443284a0)

Now compare it with the same row stored in the fixed-length table:

DBCC IND ('VarCharTest','tabFixedLength',1);
GO
DBCC PAGE ('VarCharTest',1,146,3);
GO


Slot 0 Offset 0x60 Length 115

Record Type = PRIMARY_RECORD Record Attributes = NULL_BITMAP

Record Size = 115

Memory Dump @0x000000002128A060

0000000000000000: 10007000 01000000 35332020 46696c6c 65722020 ..p…..53 Filler

0000000000000014: 20202020 20202020 20202020 20202020 20202020

0000000000000028: 20202020 20202020 20202020 20202020 20202020

000000000000003C: 20202020 20202020 20202020 20202020 20202020

0000000000000050: 20202020 20202020 20202020 20202020 20202020

0000000000000064: 20202020 20202020 20202020 030000 …

Slot 0 Column 1 Offset 0x4 Length 4 Length (physical) 4

col1 = 1

Slot 0 Column 2 Offset 0x8 Length 4 Length (physical) 4

col2 = 53

Slot 0 Column 3 Offset 0xc Length 100 Length (physical) 100

col3 = Filler

Slot 0 Offset 0x0 Length 0 Length (physical) 0

KeyHashValue = (8194443284a0)

Index key structure

Apart from type 1 pages (data pages), SQL Server uses other page types, including index pages (type 2). But data pages and index pages are almost identical (there are only small differences in their headers). As a result, indices on variable-length columns carry the same overhead:

CREATE NONCLUSTERED INDEX IDXtabVarLength
ON tabVarLength(col2);
CREATE NONCLUSTERED INDEX IDXtabFixedLength
ON tabFixedLength(col2);
GO

SELECT page_count, avg_record_size_in_bytes, min_record_size_in_bytes, max_record_size_in_bytes, avg_page_space_used_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('tabVarLength'), 2, NULL, 'DETAILED')
WHERE index_level = 0
UNION
SELECT page_count, avg_record_size_in_bytes, min_record_size_in_bytes, max_record_size_in_bytes, avg_page_space_used_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('tabFixedLength'), 2, NULL, 'DETAILED')
WHERE index_level = 0;
GO
------
page_count  avg_record_size_in_bytes  min_record_size_in_bytes  max_record_size_in_bytes  avg_page_space_used_in_percent
18          12                        12                        12                        96.0684210526316
20          13.995                    13                        15                        98.786755621448


Because this index is based on a single column, the overall difference was about 20% in favor of the fixed-length column.
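Again, these record sizes can be reconstructed by hand (my own arithmetic, not taken from any DMV). The index row holds only the key column and the clustering key, so the relative cost of the variable-length bookkeeping is much higher than in the base table:

```sql
-- Relative to the 12-byte fixed-length index row, the variable-length row
-- swaps the 4-byte CHAR(4) for a 1- to 3-byte value and adds 4 bytes of
-- bookkeeping (2-byte variable-column count + 2-byte offset pointer).
SELECT 12             AS FixedIndexRowBytes,
       12 - 4 + 4 + 1 AS VarIndexRowBytesMin, -- 13
       12 - 4 + 4 + 3 AS VarIndexRowBytesMax; -- 15
```

This agrees with the min (13) and max (15) record sizes reported above for the variable-length index.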

What about additional differences?

As you can imagine, the overhead of the variable-length block is not the only difference between fixed- and variable-length columns. In fact, these differences are so fundamental that you can find them at almost every level of query processing. For example, data modifications can be logged differently. To see what I mean, let me switch our sample database to the simple recovery model and clear the log:

ALTER DATABASE VarCharTest SET RECOVERY SIMPLE;
GO
CHECKPOINT;
GO


Then let’s update one row in both tables:

UPDATE [dbo].[tabFixedLength]
SET [col2] = 999, [col3] = 'Changed'
WHERE [col1] = 1;

UPDATE [dbo].[tabVarLength]
SET [col2] = 999, [col3] = 'Changed'
WHERE [col1] = 1;


Remember, both commands updated exactly the same row, and in both cases the same two columns were updated to the same values, yet the log records that describe those modifications are different:

SELECT Operation, [Log Record Length], AllocUnitName
FROM fn_dblog(NULL, NULL)
WHERE Operation LIKE 'LOP_MODIFY_%';
-----
Operation            Log Record Length    AllocUnitName
LOP_MODIFY_ROW        136                 dbo.tabFixedLength.PK__tabFixed__357D0D3ED55860A1
LOP_MODIFY_COLUMNS    156                 dbo.tabVarLength.PK__tabVarLe__357D0D3EDB944B78


The first update was logged as LOP_MODIFY_ROW with a record length of 136 bytes, but the second one had to be logged as LOP_MODIFY_COLUMNS and took 156 bytes.

Finally, probably the best-known drawback of using variable-length columns was popularized by Ramesh Meyyappan — if you haven’t seen his videos, visit http://www.sqlworkshops.com/.
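In short, the issue he demonstrates comes from the fact that the query processor estimates the size of a variable-length column at roughly half of its declared maximum when reserving memory for sorts and hash operations. A hypothetical repro (my example, not taken from his videos — compare the memory grants in the actual execution plans):

```sql
-- Both queries sort the same short strings, but the second one declares a much
-- larger maximum length, so its estimated row size - and therefore its memory
-- grant - is vastly bigger. (CAST to VARCHAR(20) may truncate longer names;
-- this is just a demo.)
SELECT CAST(name AS VARCHAR(20))   AS n FROM sys.objects ORDER BY n;
SELECT CAST(name AS VARCHAR(4000)) AS n FROM sys.objects ORDER BY n;
```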

Lessons learned:

· Variable-length columns should be used when the data length varies greatly, e.g. between 5 and 50 bytes.

· Think twice before introducing the first variable-length column to a table.

· And think it over again if this column is going to be part of any nonclustered index.

See you


This entry was posted in Internals and Performance by Marcin Szeliga. Bookmark the permalink.
About Marcin Szeliga

Since 2006, invariably awarded the Microsoft Most Valuable Professional title in the SQL category. A consultant, lecturer, authorized Microsoft trainer with 15 years’ experience, and a database systems architect. He prepared Microsoft partners for the upgrade to SQL Server 2008 and 2012 within the Train the Trainers program. A speaker at numerous conferences, including Microsoft Technology Summit, SQL Saturday, SQL Day, Microsoft Security Summit, and Heroes Happen {Here}, as well as at user group meetings. The author of many books and articles devoted to SQL Server.
