
Knowing the Internals, Who Needs SQL Server Anyway?
Mark S. Rasmussen, improve.dk

Whoami
Tech Lead @ iPaper
Developer/DBA/Sysadmin/Project manager/*
Comp.Sci @ Aarhus University
Blogging at improve.dk
@improvedk
Author of the OrcaMDF project

Disclaimer
Level 650: meant to inspire, not teach!
Based on 2008 R2
I have no idea...
Most of what I say is incorrect

Background
Presentation at Miracle Open World
Formally started OrcaMDF

Old School Querying

private static void oldschool()
{
    using (var conn = new SqlConnection("Data Source=.;Initial Catalog=QFD;"))
    {
        conn.Open();
        var cmd = new SqlCommand("SELECT * FROM Persons", conn);
        var reader = cmd.ExecuteReader();
        while (reader.Read())
            Console.WriteLine(reader["ID"] + ": " + reader["Name"] + " (" + reader["A
    }
}

OrcaMDF Querying

using (var mdf = new MdfFile(mdfPath))
{
    var scanner = new DataScanner(mdf);
    var rows = scanner.ScanTable("Persons");
    EntityPrinter.Print(rows);
}

using (var mdf = new MdfFile(mdfPath))
{
    var scanner = new DataScanner(mdf);
    var rows = scanner.ScanTable("Persons")
        .Where(x => x.Field<short>("Age") < 40);
    EntityPrinter.Print(rows);
}

Pages
The foundation of SQL Server storage

What Is a Page?
8192 bytes
Everything stored as pages
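
Since everything is stored as 8192-byte pages, locating a page in a data file is plain arithmetic. The sketch below is only an illustration: the function names are made up for this example, and the 96-byte header size is an assumption derived from the 8096-byte page body quoted later in the deck.

```python
PAGE_SIZE = 8192    # from the slide: every page is 8192 bytes
HEADER_SIZE = 96    # assumption: 8192 minus the 8096-byte page body


def page_offset(page_id):
    """Byte offset of a page within a data file, counting pages from 0."""
    return page_id * PAGE_SIZE


def read_page(data, page_id):
    """Return the 8192 raw bytes of one page from an in-memory file image."""
    start = page_offset(page_id)
    page = data[start:start + PAGE_SIZE]
    if len(page) != PAGE_SIZE:
        raise ValueError("page %d is outside the file" % page_id)
    return page


def split_page(page):
    """Split a page into its header and its body (records plus slot array)."""
    return page[:HEADER_SIZE], page[HEADER_SIZE:]


# Fake three-page file image: each page is filled with its own page number.
image = b"".join(bytes([n]) * PAGE_SIZE for n in range(3))
header, body = split_page(read_page(image, 2))
print(page_offset(9), len(header), len(body))
```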

Undocumented DBCC Commands & Flags
DBCC IND
DBCC PAGE
DBCC TRACEON (3604)
Documented in 6.5 & 7.0
Unofficially documented

Page Header
Absolutely no documentation
Absolutely necessary for parsing

Reverse Engineering the Header
DEMO
OrcaMDF: PageHeader

Slot Array
Points to beginning of records in body
Defines logical order of records
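
A minimal sketch of reading a slot array, to make the "logical order" point concrete. It assumes the slot array sits at the very end of the page and grows backwards, with each entry a 2-byte little-endian record offset; the slot count is passed in by the caller (in a real page it would come from the header). The function name is hypothetical.

```python
import struct

PAGE_SIZE = 8192


def read_slot_array(page, slot_count):
    """Return record offsets in logical (slot) order.

    Assumes slot 0 occupies the last two bytes of the page and later
    slots are stored in front of it, each as a little-endian uint16.
    """
    offsets = []
    for slot in range(slot_count):
        position = PAGE_SIZE - 2 * (slot + 1)
        (offset,) = struct.unpack_from("<H", page, position)
        offsets.append(offset)
    return offsets


# Hypothetical page with three slots: slot 0 -> 300, slot 1 -> 96, slot 2 -> 200.
# The logical order (300, 96, 200) differs from the physical order in the body.
page = bytearray(PAGE_SIZE)
struct.pack_into("<3H", page, PAGE_SIZE - 6, 200, 96, 300)
print(read_slot_array(bytes(page), 3))
```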

Records
Data records
  Stores table data
Index records
  Stores nonclustered index data, as well as non-leaf level clustered index data
Stored in the FixedVar format

FixedVar Record Format

Status Bits A

Status Bits B

NULL Bitmap
Bitmap tracking whether columns are NULL
CEIL(#Cols / 8) bytes
Always present on data pages, except when it's not
Only trust defined bits, rest may be garbage
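
The sizing rule and the "only trust defined bits" advice can be sketched as follows. The bit order used here (least significant bit of the first byte belongs to the first column) is an assumption of this example, as are the function names.

```python
import math


def null_bitmap_size(column_count):
    """Number of bitmap bytes: CEIL(#Cols / 8)."""
    return math.ceil(column_count / 8)


def parse_null_bitmap(bitmap, column_count):
    """Return one boolean per column, True meaning the column is NULL.

    Only the first column_count bits are read; any remaining bits in
    the last byte are ignored because they may contain garbage.
    """
    flags = []
    for column in range(column_count):
        byte = bitmap[column // 8]
        flags.append(bool(byte & (1 << (column % 8))))
    return flags


# 0xF4 = 0b11110100: for a 5-column table only the low five bits count,
# so columns 2 and 4 (zero-based) are NULL and the top three bits are noise.
print(parse_null_bitmap(bytes([0xF4]), 5))
```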

Variable Length Offset Array

Example Record

CREATE TABLE RecordTest
(
    A int,
    B int,
    C char(5),
    D varchar(10),
    E varchar(20)
)
INSERT INTO RecordTest VALUES (25, 38, 'ABCD', 'Mark', 'Denmark')
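
To tie the previous slides together, here is a rough sketch of how a FixedVar record for the row above could be split into its parts. The layout it assumes (two status bytes, a 2-byte offset to the end of the fixed length data, the fixed data, a 2-byte column count, the NULL bitmap, a 2-byte variable column count, the variable length offset array of end offsets, then the variable data) and the two status bit flags are assumptions of this example, since the slides show them only as pictures. The record bytes are built by hand rather than taken from a real page, and complex columns are not handled.

```python
import struct

HAS_NULL_BITMAP = 0x10   # assumed flag in status bits A
HAS_VAR_COLUMNS = 0x20   # assumed flag in status bits A


def parse_fixedvar(record, fixed_sizes):
    """Split a record into raw fixed values, NULL bitmap and variable values."""
    status_a = record[0]
    (fixed_end,) = struct.unpack_from("<H", record, 2)

    # Fixed length columns start right after the 4-byte record header.
    fixed, pos = [], 4
    for size in fixed_sizes:
        fixed.append(record[pos:pos + size])
        pos += size
    if pos != fixed_end:
        raise ValueError("schema does not match the fixed length portion")

    null_bitmap = b""
    if status_a & HAS_NULL_BITMAP:
        (column_count,) = struct.unpack_from("<H", record, pos)
        pos += 2
        bitmap_size = (column_count + 7) // 8
        null_bitmap = record[pos:pos + bitmap_size]
        pos += bitmap_size

    variable = []
    if status_a & HAS_VAR_COLUMNS:
        (var_count,) = struct.unpack_from("<H", record, pos)
        ends = struct.unpack_from("<%dH" % var_count, record, pos + 2)
        start = pos + 2 + 2 * var_count
        for end in ends:
            variable.append(record[start:end])
            start = end
    return fixed, null_bitmap, variable


# Hand-built record for (25, 38, 'ABCD', 'Mark', 'Denmark').
record = (
    bytes([0x30, 0x00])              # status bits A and B
    + struct.pack("<H", 17)          # end of fixed length data
    + struct.pack("<ii", 25, 38)     # A, B
    + b"ABCD "                       # C, padded to char(5)
    + struct.pack("<H", 5)           # column count
    + b"\x00"                        # NULL bitmap, nothing is NULL
    + struct.pack("<3H", 2, 30, 37)  # variable column count, end offsets
    + b"MarkDenmark"                 # D, E
)
fixed, null_bitmap, variable = parse_fixedvar(record, [4, 4, 5])
print(fixed, null_bitmap, variable)
```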

When Is Data Present?
Fixed length data always present
  Even if null
  Though not necessarily tail columns!
Variable length data only present when not null
Adding nullable columns is a metadata op
  In Denali, adding default value columns is a metadata op too!

Data Types
How are data types stored within a record?

Classifying Data Types
Fixed length data types
  bit, char, int, decimal, date, datetime, float, etc.
Variable length data types
  (n)varchar, varbinary, varchar(MAX), text, etc.
sql_variant
  Please just stay away from it

Variable Length Data Types
SLOBs
  varchar(x), nvarchar(x), varbinary(x)
LOBs
  text, ntext, image, varchar(MAX), nvarchar(MAX), varbinary(MAX), xml
vardecimal

In-row (n)varchar(x) Storage

CREATE TABLE VarcharTest
(
    A varchar(4)
)
INSERT INTO VarcharTest VALUES

Complex Columns
DEMO
Identified using the sign bit
  0b1001001110010101 = 37,781
  0b0001001110010101 = 5,013
Use cases
  Row-overflow/LOB pointers
  Sparse vectors
  Back pointers
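
The sign bit trick above amounts to masking the high bit of a 16-bit offset array entry. A small sketch, with a hypothetical function name:

```python
COMPLEX_FLAG = 0x8000   # the sign bit of a 16-bit offset array entry
OFFSET_MASK = 0x7FFF    # the remaining 15 bits hold the actual offset


def decode_offset_entry(raw):
    """Return (end_offset, is_complex) for one offset array entry."""
    return raw & OFFSET_MASK, bool(raw & COMPLEX_FLAG)


print(decode_offset_entry(0b1001001110010101))
print(decode_offset_entry(0b0001001110010101))
```

Both entries decode to the same offset, 5,013; only the first is flagged as a complex column.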

Off-row SLOB Storage
Varchar, nvarchar, varbinary
DEMO

Off-row SLOB Storage
Column data moved to new page, pointer left behind

Off-row SLOB Storage


B = [BLOB Inline Root] Slot 0 Column 2 Offset 0x11a1 Length (physical) 24
Level = 0
Unused = 0
UpdateSeq = 1
TimeStamp = 1298595840
Link 0
Size = 4500
RowId = (1:21:0)

timestamp = BitConverter.ToInt64(data, 8) << 16;

BLOB_FRAGMENT Record
Blob row at: Page (1:21) Slot 0 Length: 4514 Type: 3 (DATA)
Blob Id:469368832

Stored on shared (obj-level) TextMix pages

Off-row SLOB Storage
Always stored in-row if < 24 bytes
24 byte [BLOB Inline Root] pointer
  Data stored in BLOB_FRAGMENT on TextMix page
  Timestamp == Blob ID
Performance prediction is tough

(MAX) LOB Storage
varchar(MAX), nvarchar(MAX), varbinary(MAX)
The LOB that wanted to be a SLOB
Three scenarios
  [BLOB Inline Data]
  [BLOB Inline Root]
  [Textpointer]
DEMO

BLOB Inline Data
Used when data fits in record
Not an official LOB structure

Slot 0 Column 1 Offset 0x0 Length 0 Length (physical) 0
A = [NULL]
B = [BLOB Inline Data] Slot 0 Column 2 Offset 0x1393 Length 4 Length (physical) 4
B = 0x41424344

BLOB Inline Root
Can reference up to 5 pages: data, roots, trees, etc.
12 byte header
Array of 12 byte references
Only used by SLOBs & (MAX) LOBs
Also not a LOB structure (by my definition)
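
The numbers on this slide explain the 24-byte pointer seen earlier: one 12-byte header plus one 12-byte reference. A tiny sketch of the size arithmetic, using a hypothetical helper name:

```python
HEADER_SIZE = 12       # from the slide: 12 byte header
REFERENCE_SIZE = 12    # from the slide: array of 12 byte references
MAX_REFERENCES = 5     # from the slide: can reference up to 5 pages


def inline_root_size(reference_count):
    """In-row size in bytes of a BLOB Inline Root with the given references."""
    if not 1 <= reference_count <= MAX_REFERENCES:
        raise ValueError("an inline root holds between 1 and 5 references")
    return HEADER_SIZE + reference_count * REFERENCE_SIZE


print([inline_root_size(n) for n in range(1, MAX_REFERENCES + 1)])
```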

LOB Structure Records
Wrapped in a single-column meta record

DATA
Blob row at: Page (1:176) Slot 0 Length: 8054 Type: 3 (DATA)
Blob Id:1210253312

Type = 3
Where data is actually stored
Size always > 64 bytes (SMALL_ROOT)

DATA
How much data can we store in a DATA record?
8096         Page body size
8080 (8094)  Theoretical max
8040 (8054)  Reality

INTERNAL
Blob row at: Page (1:55) Slot 0 Length: 324 Type: 2 (INTERNAL)
Blob Id: 1210253312 Level: 0 MaxLinks: 501 CurLinks: 19
Child 0 at Page (1:176) Slot 0 Size: 8040 Offset: 8040
Child 1 at Page (1:177) Slot 0 Size: 8040 Offset: 16080
Type = 2
CurLinks = number of references
MaxLinks = ?
Level = tree level
Size = computed

INTERNAL

Total record size = 20 + X * 16
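
The formula can be checked against the DBCC output on the previous slide, where a record with CurLinks: 19 has Length: 324. A minimal sketch with hypothetical helper names:

```python
INTERNAL_HEADER_SIZE = 20   # from the slide
LINK_SIZE = 16              # from the slide


def internal_record_size(link_count):
    """Total INTERNAL record size: 20 + X * 16."""
    return INTERNAL_HEADER_SIZE + link_count * LINK_SIZE


def internal_link_count(record_size):
    """Inverse of the formula: number of links for a given record size."""
    links, remainder = divmod(record_size - INTERNAL_HEADER_SIZE, LINK_SIZE)
    if links < 0 or remainder:
        raise ValueError("not a valid INTERNAL record size")
    return links


print(internal_record_size(19), internal_link_count(324))
```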



Connecting the Dots


The Tree Grows
In theory (INTERNAL)... floor((8096 - 20 - 2) / 16) = 504
In reality... 500
1 x 500 x 8040 = 4,020,000
2 x 500 x 8040 = 8,040,000
5 x 500 x 8040 = 20,100,000
6 x 500 x 8040 = 24,120,000
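
The arithmetic on this slide can be reproduced in a few lines. Reading the first factor of each row as the number of INTERNAL records in the tree is an interpretation made for this sketch, not something the slide states, and the names are hypothetical.

```python
PAGE_BODY_SIZE = 8096       # page body size
INTERNAL_HEADER_SIZE = 20   # INTERNAL record header
SLOT_ENTRY_SIZE = 2         # one slot array entry
LINK_SIZE = 16              # one child reference
LINKS_IN_PRACTICE = 500     # the slide's "in reality" figure
DATA_PAYLOAD = 8040         # bytes of data per DATA record in practice


def theoretical_max_links():
    """How many 16-byte links fit in one page body next to the header."""
    usable = PAGE_BODY_SIZE - INTERNAL_HEADER_SIZE - SLOT_ENTRY_SIZE
    return usable // LINK_SIZE


def tree_capacity(internal_records):
    """Bytes reachable through the given number of INTERNAL records."""
    return internal_records * LINKS_IN_PRACTICE * DATA_PAYLOAD


print(theoretical_max_links())
for count in (1, 2, 5, 6):
    print(count, format(tree_capacity(count), ","))
```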

Connecting the Dots

Two Levels Is All It Takes
8,040,000,000 bytes / 7.48 GB
(MAX) limit is 2^31-1
Many permutations

Large Value Types Out of Row
sp_tableoption MyTable, Option, ON/OFF
Even more permutations
text in row 24-7000, default 256

Textpointer
Used for classic LOB types & MAX LOB types with large value types out of row ON
  text, ntext, image
Complex column

Classic LOB Structures
You thought (MAX) was complex?
Textpointer = evil

SMALL_ROOT
Type = 0
Used when data <= 64 bytes
Min size = 84
Data > length = garbage

LARGE_ROOT_YUKON
Type = 5
Min size = 84
Part of LOB tree

Connecting the Dots

LOB Storage Overview

Data      Varchar(X)   Varchar(MAX)             Text
NULL
0-64      0-64         0-64                     100 (16 + 84)
65-8000   65-8000      65-8000 * (+ 24 + 14)    100 + 14 + 65-8000
8kb+      N/A          24 + X                   100 + X

Extreme impact on small data
The more data, the less of a diff
Performance differences
http://sqlblog.com/blogs/paul_white/archive/2011/02/23/Advanced-TSQL-Tuning-Why-Internals-Knowledge-Matters.aspx

Mind the Gap

Type  Name
0     SMALL_ROOT
2     INTERNAL
3     DATA
5     LARGE_ROOT_YUKON
9+

Bending the Will of DBCC Page. Again.

Archeology 101

Type  Name
0     SMALL_ROOT
1     LARGE_ROOT
2     INTERNAL
3     DATA
4     LARGE_ROOT_SHILOH
5     LARGE_ROOT_YUKON
6     SUPER_LARGE_ROOT
7
8     NULL
9+    INVALID

LOB Summary
< 8000 => (MAX) = (X)
> 8000 => Tree is built
Text/ntext/image horribly inefficient
  Lots of legacy details

Indices & Heaps


How pages are organized

Clustered Index vs Heap
Defines how data is *physically* stored
Clustered index
  Guarantees physical order of data
  Row identified by clustered key
Heap
  Data stored wherever SQL Server wants to
  Row identified by RID

B+-tree Scanning


Heaps
Relies on IAM pages
Leaf pages not linked
  Except...

Page & Extent Allocation
Extents, pages & objects

Extents
All pages allocated as part of an extent
Mixed extents
Uniform extents
First 8 = mixed, rest uniform

GAM Pages
Global Allocation Map
1 = Free, 0 = Allocated
Bitmap tracks 63,904 extents, almost 4 GB
Present every 511,232 pages
  GAM interval
  Page 2 / 511232, every 511232 pages

SGAM Pages
Shared Global Allocation Map
1 = Mixed & > 0 free pages
0 = Either uniform or mixed w/no free pages
Structure identical to GAM
Page 3 / 511233, every 511232 pages
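
Given the fixed positions above, finding the GAM and SGAM page that cover any page is simple arithmetic. The sketch below uses hypothetical function names and assumes that each bit in the bitmap represents one 8-page extent, which matches 63,904 x 8 = 511,232.

```python
GAM_INTERVAL = 511232   # pages covered by one GAM/SGAM page
PAGES_PER_EXTENT = 8    # assumption: one bitmap bit per 8-page extent


def gam_page_for(page_id):
    """Page ID of the GAM page covering page_id (page 2, then 511232, ...)."""
    interval = page_id // GAM_INTERVAL
    return 2 if interval == 0 else interval * GAM_INTERVAL


def sgam_page_for(page_id):
    """Page ID of the SGAM page covering page_id (page 3, then 511233, ...)."""
    interval = page_id // GAM_INTERVAL
    return 3 if interval == 0 else interval * GAM_INTERVAL + 1


def extent_bit_index(page_id):
    """Index of the bit that represents page_id's extent in its interval."""
    return (page_id % GAM_INTERVAL) // PAGES_PER_EXTENT


print(gam_page_for(600000), sgam_page_for(600000), extent_bit_index(600000))
```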

IAM Pages
Index Allocation Map
1 = Uniformly allocated to IAM chain / allocation unit
0 = Not owned by IAM chain / AU
No fixed positioning!
Tracks a GAM interval
Structure (almost) identical to GAM

IAM Page Header


Storage


Extent Allocation Status

GAM  SGAM  Any IAM  Status
0    0     0        Mixed, all pages allocated
0    1     0        Mixed, > 0 free pages
0    0     1        Uniform
1    0     0        Unallocated
0    1     1        Invalid
1    0     1        Invalid
1    1     0        Invalid
1    1     1        Invalid
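
The table translates directly into a lookup. A small sketch, with a hypothetical function name, that classifies an extent from its GAM bit, its SGAM bit, and whether any IAM page claims it:

```python
# Valid (GAM, SGAM, any IAM) combinations from the table; all others are invalid.
EXTENT_STATUS = {
    (0, 0, 0): "mixed, all pages allocated",
    (0, 1, 0): "mixed, > 0 free pages",
    (0, 0, 1): "uniform",
    (1, 0, 0): "unallocated",
}


def extent_status(gam_bit, sgam_bit, iam_bit):
    """Return the allocation status for one extent's three bits."""
    return EXTENT_STATUS.get((gam_bit, sgam_bit, iam_bit), "invalid")


print(extent_status(0, 0, 1))
print(extent_status(1, 1, 0))
```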

PFS Pages
Page Free Space
Bytemap
Page 1 / 8088, every 8088 pages
  PFS interval
Only tracks fullness where necessary
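
Because the PFS page is a bytemap with one byte per page, locating the byte for a given page takes two small calculations. A sketch with hypothetical names, assuming the byte index is simply the page's position within its PFS interval:

```python
PFS_INTERVAL = 8088   # from the slide: a PFS page every 8088 pages


def pfs_page_for(page_id):
    """Page ID of the PFS page covering page_id (page 1, then 8088, ...)."""
    interval = page_id // PFS_INTERVAL
    return 1 if interval == 0 else interval * PFS_INTERVAL


def pfs_byte_index(page_id):
    """Index of page_id's byte within the bytemap of its PFS page."""
    return page_id % PFS_INTERVAL


print(pfs_page_for(20000), pfs_byte_index(20000))
```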

MDF File at a Glance

Allocation Units

The Boot Page
Page 9 in primary data file
DBCC PAGE == DBCC DBINFO
Lots of interesting info
  Physical version
  Log rebuild count
  Last OK CHECKDB
  Last LOG backup
  Name + ID
  FirstSysIndexes

System Views & Base Tables
The source of our parsing metadata

Needed Metadata for Parsing
Schema
  sys.tables + sys.columns / sys.indexes + sys.index_columns
Indexes
  Root page
Heaps
  IAM chain root
DEMO

Allocation Metadata Overview

DMVs Are Useless!
Just views, no physical storage
Chicken or the egg
How about we take a look at those views?
DEMO

Base Tables
The basis for DMV data
Can only be queried through the DAC
Here be dragons!
  Confusing column names
  Utilizes internal functions

The Holy Grail of Metadata
SELECT * FROM sys.sysschobjs
  sysschobjs
  syscolpars
  sysrowsets
  sysallocunits
DEMO

Follow the Rabbit
Boot page points to sysallocunits (FirstSysIndexes)
Constant partition ID leads us to sysrowsets
Constant object ID leads us to sysschobjs
Using the above we can find syscolpars
DEMO

OPENROWSET(TABLE RSCPROP)

-- sys.system_internals_partition_columns
SELECT
    CASE c.maxinrowlen
        WHEN 0 THEN p.length
        ELSE c.maxinrowlen
    END AS max_inrow_length,
    p.xtype AS system_type_id,
    p.length AS max_length,
    p.prec AS precision,
    p.scale AS scale
FROM
    sys.sysrscols c
OUTER APPLY
    OPENROWSET(TABLE RSCPROP, c.ti) p

OPENROWSET(TABLE RSCPROP)

CREATE TABLE TITest
(
    A binary(50),
    B char(10),
    C datetime2(5),
    D decimal(12, 5),
    E float,
    F int,
    G numeric(11, 4),
    H nvarchar(50),
    I nvarchar(max),
    J time(3),
    K tinyint,
    L varbinary(max),
    M varchar(75),
    N text
)

OPENROWSET(TABLE RSCPROP)

SELECT
    t.name,
    r.ti,
    p.scale,
    p.precision,
    p.max_length,
    p.system_type_id,
    p.max_inrow_length
FROM
    sys.system_internals_partition_columns p
INNER JOIN
    sys.sysrscols r ON
        r.rscolid = p.partition_column_id AND
        r.rsid = p.partition_id
INNER JOIN
    sys.types t ON
        t.system_type_id = p.system_type_id AND
        t.user_type_id = p.system_type_id
WHERE
    partition_id = 72057594040614912

OPENROWSET(TABLE RSCPROP)
12973 = 0x000032AD
173 = 0xAD == 12973 & 0xFF
50 = 0x32 == (12973 & 0xFFFF00) >> 8

OPENROWSET(TABLE RSCPROP)
1322 = 0x0000052A
42 = 0x2A == 1322 & 0xFF
5 = 0x5 == (1322 & 0xFF00) >> 8
25 = 20 + scale

OPENROWSET(TABLE RSCPROP)
330858 = 0x00050C6A
106 = 0x6A == 330858 & 0xFF
12 = 0x0C == (330858 & 0xFF00) >> 8
5 = 0x05 == (330858 & 0xFF0000) >> 16

Creating a Type Aware TI Parser

[Test]
public void Decimal()
{
    var parser = new SysrscolTIParser(330858);
    Assert.AreEqual(5, parser.Scale);
    Assert.AreEqual(12, parser.Precision);
    Assert.AreEqual(9, parser.MaxLength);
    Assert.AreEqual(106, parser.TypeID);
    Assert.AreEqual(9, parser.MaxInrowLength);

    parser = new SysrscolTIParser(396138);
    Assert.AreEqual(6, parser.Scale);
    Assert.AreEqual(11, parser.Precision);
    Assert.AreEqual(9, parser.MaxLength);
    Assert.AreEqual(106, parser.TypeID);
    Assert.AreEqual(9, parser.MaxInrowLength);
}
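
The bit masks from the preceding slides can be combined into a small type aware decoder. This is only a sketch and not OrcaMDF's SysrscolTIParser: the function names are hypothetical, it covers just the three layouts shown in the slides (decimal/numeric, datetime2, and a length-carrying type such as binary), and treating every other type as length-carrying is an assumption. The decimal byte sizes follow the documented storage sizes per precision range.

```python
def decimal_storage_size(precision):
    """Storage bytes for decimal/numeric by precision range."""
    if 1 <= precision <= 9:
        return 5
    if 10 <= precision <= 19:
        return 9
    if 20 <= precision <= 28:
        return 13
    if 29 <= precision <= 38:
        return 17
    raise ValueError("precision must be between 1 and 38")


def parse_ti(ti):
    """Decode a sysrscols.ti value into a dict of type properties."""
    type_id = ti & 0xFF
    if type_id in (106, 108):                 # decimal, numeric
        precision = (ti & 0xFF00) >> 8
        scale = (ti & 0xFF0000) >> 16
        return {"type_id": type_id, "precision": precision, "scale": scale,
                "max_length": decimal_storage_size(precision)}
    if type_id == 42:                         # datetime2
        return {"type_id": type_id, "scale": (ti & 0xFF00) >> 8}
    # Assumption: remaining types carry their length in the next two bytes,
    # as binary(50) does in the slides.
    return {"type_id": type_id, "max_length": (ti & 0xFFFF00) >> 8}


for value in (330858, 396138, 1322, 12973):
    print(value, parse_ti(value))
```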

Data Recovery
When everything else fails

Please Don't Rely on This
You should always have backups available
Make sure to test your backups
Run regular consistency checks
This is a last resort measure

What Kind of Problems to Expect
Torn pages
Corrupt pages
Bad metadata
Accidental deletes & truncations
How does OrcaMDF differ?

Torn Pages
One page = 16 disk sectors
First and last sector most important
No header
  Identify object from IAM, linked list
No slot array
  Slot count in header
  Identify record formats in body

Corrupt Pages
Checksum doesn't match content
Could be minor issue, probably major
Treat like torn page

Bad Metadata
SQL Server bugs
Corrupt/torn pages
Scan pages and identify object in header
Scan pages and look for IAM chain
Deduce schema
  App
  Docs
  Record format

Accidental Deletes & Truncations
Accidental delete
  Records may be ghosted
  Records removed from slot array
  STOP!
Accidental truncation
  Pages deallocated, physically intact
  Scan pages, linked list

Watch Out For Instant Initialization
Garbage may be mistaken for data
Was page allocated?
  Look for clues in salvageable allocation structures

OrcaMDF Future Plans
CorruptMdf class
Verifying tornbits / checksum
Utility methods
  Scan for pages belonging to object
  Scan for IAM pages
Best-effort parsing of pages

Questions
Now's the chance

Thank you!
Blog: improve.dk
Twitter: @improvedk
Email: mark@improve.dk
