http://www.sqlskills.com/blogs/paul/post/Inside-the-Storage-Engine-Anatomy-of-a-record.aspx
http://www.sqlskills.com/BLOGS/PAUL/category/Inside-the-Storage-Engine.aspx
Record structure
All records have the same structure, regardless of their type and use,
but the number and type of columns will be different. For instance, a
data record from a table with a complex schema may have hundreds of
columns of various types whereas an allocation bitmap record will have
a single column, filling up the whole page.
The record structure is as follows (a small decoding sketch in Python follows the list):
- record header
  - 4 bytes long
  - two bytes of record metadata (record type)
  - two bytes pointing forward in the record to the NULL bitmap
- fixed-length portion of the record, containing the columns storing data types that have fixed lengths (e.g. bigint, char(10), datetime)
- NULL bitmap
  - two bytes for the count of columns in the record
  - variable number of bytes to store one bit per column in the record, regardless of whether the column is nullable or not (this is different and simpler than SQL Server 2000, which had one bit per nullable column only)
  - this allows an optimization when reading columns that are NULL
- variable-length column offset array
  - two bytes for the count of variable-length columns
  - two bytes per variable-length column, giving the offset to the end of the column value
- versioning tag
  - this is in SQL Server 2005 only and is a 14-byte structure that contains a timestamp plus a pointer into the version store in tempdb
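To make the layout concrete, here is a minimal Python sketch that decodes a record following the description above. It is only a sketch under assumptions noted in the comments (little-endian two-byte fields, least-significant-bit-first ordering within the NULL bitmap bytes), not a reimplementation of the engine's actual code.

```python
import struct

def decode_record(rec: bytes):
    """Decode a record laid out as described above (illustrative sketch only)."""
    # Record header: two bytes of record metadata (including the record type),
    # then two bytes pointing forward to the NULL bitmap.
    metadata, null_bitmap_offset = struct.unpack_from('<HH', rec, 0)

    # The fixed-length portion sits between the 4-byte header and the NULL bitmap.
    fixed_portion = rec[4:null_bitmap_offset]

    # NULL bitmap: a two-byte column count, then one bit per column
    # (nullable or not).
    (column_count,) = struct.unpack_from('<H', rec, null_bitmap_offset)
    bitmap_len = (column_count + 7) // 8
    bitmap = rec[null_bitmap_offset + 2:null_bitmap_offset + 2 + bitmap_len]
    # Assumption: bit 0 of byte 0 corresponds to the first column.
    is_null = [bool(bitmap[i // 8] & (1 << (i % 8))) for i in range(column_count)]

    # Variable-length column offset array: a two-byte count, then two bytes
    # per variable-length column giving the offset to the END of its value.
    var_array_start = null_bitmap_offset + 2 + bitmap_len
    (var_count,) = struct.unpack_from('<H', rec, var_array_start)
    end_offsets = struct.unpack_from('<%dH' % var_count, rec, var_array_start + 2)

    # Each value runs from the end of the previous one (or from the end of
    # the offset array, for the first value) up to its recorded end offset.
    var_values = []
    start = var_array_start + 2 + 2 * var_count
    for end in end_offsets:
        var_values.append(rec[start:end])
        start = end

    return metadata, fixed_portion, is_null, var_values
```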
NULL bitmap optimization
So why is the NULL bitmap an optimization?
Firstly, having a null bitmap removes the need for storing special
'NULL' values for fixed-length datatypes. Without the null bitmap, how
can you tell whether a column is NULL? For fixed-length columns you'd
need to define a special 'NULL' value, which limits the effective range
of the datatype being stored. For varchar columns, the value could be a
zero-length empty string, so just checking the length doesn't work -
you'd need the special value again. For all other variable-length data
types you can just check the length. So we need the NULL bitmap.
Secondly, it saves CPU cycles. If there were no NULL bitmap, extra instructions would have to be executed for fixed-length and variable-length columns.
For fixed-length:
1. read in the stored column value (possibly taking a CPU data cache miss)
2. load the pre-defined NULL value for that datatype (possibly taking a CPU data cache miss, but only for the first read in the case of a multiple-row select)
3. do a comparison between the two values
For variable-length:
1. calculate the offset of the variable-length offset array
2. read the number of variable-length columns (possibly taking a CPU data cache miss)
3. calculate the position in the variable-length offset array to read
4. read the column offset from it (possibly taking a CPU data cache miss)
5. read the next one too (possibly taking another CPU data cache miss, if the offset in step 4 was on a cache line boundary)
6. compare them to see if they're the same
But with a NULL bitmap, all you have to do is:
1. read the NULL bitmap offset (possibly taking a CPU data cache miss)
2. calculate the additional offset of the NULL bit you want to read
3. read it (possibly taking a CPU data cache miss)
So, it's about even for a lookup of a single fixed-length column, but for variable-length columns, and for multiple-row selects, there's a clear advantage to having the NULL bitmap.
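The difference is easier to see side by side. Here is a hypothetical Python contrast, reusing the record layout from the earlier sketch; the sentinel value is invented purely for illustration and is not what any real datatype uses.

```python
import struct

NULL_SENTINEL = -2 ** 63   # hypothetical 'NULL' marker for a bigint column

def is_null_without_bitmap(stored_value: int) -> bool:
    # Without a NULL bitmap: read the stored fixed-length value, load the
    # pre-defined sentinel, and compare - two reads plus a comparison, and
    # the sentinel steals one value from the datatype's usable range.
    return stored_value == NULL_SENTINEL

def is_null_with_bitmap(rec: bytes, column_index: int) -> bool:
    # With a NULL bitmap: read the bitmap offset from the record header,
    # work out which byte and bit to look at, and test it - no sentinel
    # needed and no need to touch the column value at all.
    (null_bitmap_offset,) = struct.unpack_from('<H', rec, 2)
    bitmap_byte = rec[null_bitmap_offset + 2 + column_index // 8]
    return bool(bitmap_byte & (1 << (column_index % 8)))
```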
Storage Engine
- m_pageId
- m_headerVersion
- m_type
- m_typeFlagBits
- m_level
  - This is the level that the page is part of in the b-tree.
  - Levels are numbered from 0 at the leaf level and increase to the single-page root level (i.e. the top of the b-tree).
  - In SQL Server 2000, the leaf level of a clustered index (with data pages) was level 0, and the next level up (with index pages) was also level 0. The level then increased to the root. So to determine whether a page was truly at the leaf level in SQL Server 2000, you need to look at the m_type as well as the m_level.
  - For all page types apart from index pages, the level is always 0.
- m_flagBits
  - This stores a number of different flags that describe the page. For example, 0x200 means that the page has a page checksum on it (as our example page does) and 0x100 means the page has torn-page protection on it.
  - Some bits are no longer used in SQL Server 2005.
- m_objId
- m_indexId
  - In SQL Server 2000, these identified the actual relational object and index IDs to which the page is allocated. In SQL Server 2005 this is no longer the case. The allocation metadata totally changed, so these instead identify what's called the allocation unit that the page belongs to (I'll do another post that describes these later today).
- m_prevPage
- m_nextPage
  - These are pointers to the previous and next pages at this level of the b-tree and store 6-byte page IDs.
  - The pages in each level of an index are joined in a doubly-linked list according to the logical order (as defined by the index keys) of the index. The pointers do not necessarily point to the immediately adjacent physical pages in the file (because of fragmentation); see the traversal sketch after this field list.
  - The pages on the left-hand side of a b-tree level will have the m_prevPage pointer be NULL, and those on the right-hand side will have the m_nextPage be NULL.
  - In a heap, or if an index only has a single page, these pointers will both be NULL for all pages.
- pminlen
- m_slotCnt
- m_freeCnt
- m_freeData
- m_reservedCnt
- m_lsn
- m_xactReserved
- m_xdesId
- m_ghostRecCnt
- m_tornBits
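As a small illustration of the m_prevPage/m_nextPage linkage, here is a Python sketch that walks one level of an index in logical order. The page IDs and the in-memory dictionary are invented for the example; only the doubly-linked-list behaviour comes from the description above.

```python
# Hypothetical in-memory view of one level of an index: each page ID maps to
# the m_prevPage / m_nextPage pointers from its header, with (0, 0) standing
# in for NULL. The physical page numbers are deliberately out of order to show
# that logical order comes from the linked list, not from position in the file.
NULL_PAGE = (0, 0)

level_pages = {
    (1, 200): {'m_prevPage': NULL_PAGE, 'm_nextPage': (1, 118)},
    (1, 118): {'m_prevPage': (1, 200),  'm_nextPage': (1, 305)},
    (1, 305): {'m_prevPage': (1, 118),  'm_nextPage': NULL_PAGE},
}

def walk_level(pages):
    """Yield page IDs in logical (index key) order by following m_nextPage."""
    # Start at the left-hand side of the level: the page whose m_prevPage is NULL.
    current = next(pid for pid, hdr in pages.items()
                   if hdr['m_prevPage'] == NULL_PAGE)
    while current != NULL_PAGE:
        yield current
        current = pages[current]['m_nextPage']

print(list(walk_level(level_pages)))   # [(1, 200), (1, 118), (1, 305)]
```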
Note that I didn't include the fields starting with Metadata:. That's because they're not part of the page header. During SQL Server 2005 development I did some major work rewriting the guts of DBCC PAGE, and to save everyone using it from having to do all the system table lookups to determine what the actual object and index IDs are, I changed DBCC PAGE to do those lookups internally and output the results.
Slot Array
It's a very common misconception that records within a page are always stored in logical order. This is not true. There is another misconception that all the free space in a page is always maintained in one contiguous chunk. This also is not true. (Yes, the image above shows the free space in one chunk, and that very often is the case for pages that are being filled gradually.)
If a record is deleted from a page, everything remaining on the page is not suddenly compacted - inserters pay the cost of compaction when it's necessary, not deleters.
Consider a completely full page - this means that record deletions cause free-space holes within the page. If a new record needs to be inserted onto the page, and one of the holes is big enough to squeeze the record into, why go to the bother of compacting the page? Just stick the record in and carry on. What if the record should logically have come at the end of all other records on the page, but we've just inserted it in the middle - doesn't that screw things up somewhat?
No, because the slot array is ordered and gets reshuffled as records are inserted and deleted from pages. As long as the first slot array entry points to the logically first record on the page, everything's fine. Each slot entry is just a two-byte pointer into the page - so it's far more efficient to manipulate the slot array than it is to manipulate a bunch of records on the page. Only when we know there's enough free space contained within the page to fit a record in, but it's spread about the page, do we compact the records on the page to make the free space into a contiguous chunk.
One interesting fact is that the slot array grows backwards from the end of the page, so the free space is squeezed from the top by new rows, and from the bottom by the slot array.
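Here is a toy Python model of the slot array idea, under assumed simplifications (no free-space holes or compaction, fixed page and header sizes, and a Python list standing in for the two-byte slot entries that physically grow backwards from the end of the page): records land wherever the next free byte is, and only the slot array is spliced to keep logical order.

```python
PAGE_SIZE = 8192
HEADER_SIZE = 96            # the page header occupies the first 96 bytes

class TinyPage:
    """Toy model of a page: record data grows from the top, while the slot
    array (two-byte offsets, modelled here as a list) conceptually grows
    backwards from the end of the page."""

    def __init__(self):
        self.data = bytearray(PAGE_SIZE)
        self.free_data = HEADER_SIZE   # next byte available for record data
        self.slots = []                # slot array: offsets in logical order

    def insert(self, record: bytes, logical_pos: int):
        # Physically append the record after the last one...
        offset = self.free_data
        self.data[offset:offset + len(record)] = record
        self.free_data += len(record)
        # ...but splice its offset into the slot array at the logical
        # position, so logical order never depends on physical placement.
        self.slots.insert(logical_pos, offset)

    def records_in_logical_order(self, length: int):
        return [bytes(self.data[o:o + length]) for o in self.slots]

page = TinyPage()
page.insert(b'key-B', 0)            # first record on the page
page.insert(b'key-C', 1)            # logically after B
page.insert(b'key-A', 0)            # physically last, logically first
print(page.records_in_logical_order(5))   # [b'key-A', b'key-B', b'key-C']
```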