
SQL Server Engine

Batch Mode
and
CPU Architectures

DBA Level 400

Stick around for RAFFLE and the AFTER EVENT!

On laser pointers and humour . . .

About me
An independent SQL consultant.
A user of SQL Server from version 2000 onwards, with 12+ years of experience.
A speaker, both at UK user group events and at conferences.
I have a passion for understanding how the database engine works at a deep level.

Everything fits in memory, so performance is as good as it will get. It fits in memory, therefore end of story.

Demonstration

Which SELECT Statement Has The Lowest Elapsed Time ?


WITH generator AS (
    SELECT TOP 3000 id = ROW_NUMBER() OVER (ORDER BY a)
    FROM (SELECT a = 1
          FROM   master.dbo.syscolumns) c1
    CROSS JOIN master.dbo.syscolumns c2
)
SELECT d.DateKey AS OrderDateKey
      ,CAST(((id - 1) % 1048576) AS money) AS Price1
      ,CAST(((id - 1) % 1048576) AS money) AS Price2
      ,CAST(((id - 1) % 1048576) AS money) AS Price3
INTO   FactInternetSalesBigNoSort
FROM   generator
CROSS JOIN [dbo].[DimDate] d

WITH generator AS (
    SELECT TOP 3000 id = ROW_NUMBER() OVER (ORDER BY a)
    FROM (SELECT a = 1
          FROM   master.dbo.syscolumns) c1
    CROSS JOIN master.dbo.syscolumns c2
)
SELECT d.DateKey AS OrderDateKey
      ,CAST(((id - 1) % 1048576) AS money) AS Price1
      ,CAST(((id - 1) % 1048576) AS money) AS Price2
      ,CAST(((id - 1) % 1048576) AS money) AS Price3
INTO   FactInternetSalesBigSorted
FROM   generator
CROSS JOIN [dbo].[DimDate] d

CREATE CLUSTERED INDEX ccsi
ON FactInternetSalesBigSorted ( OrderDateKey )

CREATE CLUSTERED COLUMNSTORE INDEX ccsi
ON FactInternetSalesBigNoSort

CREATE CLUSTERED COLUMNSTORE INDEX ccsi
ON FactInternetSalesBigSorted
WITH (DROP_EXISTING = ON)

SELECT CalendarQuarter
      ,SUM([Price1])
      ,SUM([Price2])
      ,SUM([Price3])
FROM   [dbo].[FactInternetSalesBigNoSort] f
JOIN   [DimDate] d
ON     f.OrderDateKey = d.DateKey
GROUP BY CalendarQuarter

SELECT CalendarQuarter
      ,SUM([Price1])
      ,SUM([Price2])
      ,SUM([Price3])
FROM   [dbo].[FactInternetSalesBigSorted] f
JOIN   [DimDate] d
ON     f.OrderDateKey = d.DateKey
GROUP BY CalendarQuarter

17.41Mb column store Vs. 51.7Mb column store

The fastest ?

How Well Do The Two Column Stores Scale On Larger Hardware ?


[Chart: elapsed time ( 0 - 80,000 ms ) against degree of parallelism ( 2 - 24 ) for the non-sorted and sorted column stores, created on 1,095,600,000 rows.]

Can We Use All Available CPU Resource ?


Memory access should consume all available CPU cycles ?!?

[Chart: percentage CPU utilisation ( 0 - 100 ) against degree of parallelism for the non-sorted and sorted column stores.]

Looking For Clues

Why does the query using the column store on pre-sorted data run faster ?
Why can we not utilise 100% CPU capacity ?
Let's start with tried and trusted tools and techniques.

Wait Statistics Do Not Help Here !

Stats are for the query run with a DOP of 24, a warm column store object pool and the column store created on pre-sorted data ( 1,095,600,000 rows ).
CXPACKET waits can be ignored 99.99% of the time.

Spin Locks Do Not Provide Any Clues Either


SELECT [CalendarQuarter]
      ,SUM([Price1])
      ,SUM([Price2])
      ,SUM([Price3])
FROM   [FactInternetSalesBig] f
JOIN   [DimDate] d
ON     f.OrderDateKey = d.DateKey
GROUP BY CalendarQuarter
OPTION (MAXDOP 24)

Executes in 775 ms for a warm column store object pool:
12 cores x 2.0 GHz x 0.775 s = 18,600,000,000 CPU cycles
Total spins: 293,491
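To see why the spin counts are noise rather than a clue, compare them to the total cycle budget of the run. A back-of-envelope sketch using the figures above:

```python
CORES = 12          # physical cores
CLOCK_HZ = 2.0e9    # 2.0 GHz
ELAPSED_S = 0.775   # 775 ms elapsed time

# Total cycles available to the query across all cores.
total_cycles = CORES * CLOCK_HZ * ELAPSED_S
total_spins = 293_491

assert round(total_cycles) == 18_600_000_000
# Even if every spin burned a whole cycle, spinning accounts for a
# vanishing fraction of the cycle budget.
assert total_spins / total_cycles < 0.0001
```

This is why spin locks provide no clues here: roughly 300 thousand spins against 18.6 billion cycles is under two thousandths of a percent of the work done.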

Could Query Costs Help Solve The Two Mysteries ?

Assumptions:
The buffer cache is cold.
IO cannot be performed in parallel.
Data in different columns is never correlated ( better in SQL Server 2014 ).
Hash distribution is always uniform.
Etc. . . .
Costings are based on the amount of time it took a developer's machine, a Dell OptiPlex ( according to legend ), to complete certain operations.

Our View Of Database Engine Resource Usage Is Based On . . .

wait stats, perfmon counters, extended events and dynamic management views.
We need to know and understand:
Where all our CPU cycles are going.
How the database engine utilises the CPU at a deep architectural level.

i Series CPU Architecture

A system-on-chip ( SoC ) design with CPU cores as the basic building block.
Utility services are provisioned by the "uncore" part of the CPU die: memory controller, TLB, PCI 2.0, QPI links, power and clock.
A three level cache hierarchy ( four for Sandy Bridge onwards ): per core, an L0 uop cache, a 32KB L1 instruction cache, a 32KB L1 data cache and a 256K unified L2 cache; a shared L3 cache sits on a bi-directional ring bus, connected to main memory over the memory bus.
The CPU Cache Hierarchy Latencies In CPU Cycles

Main memory                       167
L3 cache, full random access       38
L3 cache, in-page random access    18
L3 cache, sequential access        14
L2 cache, full random access       11
L2 cache, in-page random access    11
L2 cache, sequential access        11
L1 cache, full random access        4
L1 cache, in-page random access     4
L1 cache, sequential access         4
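At a given clock speed these cycle counts translate directly into wall-clock time. A minimal sketch, assuming the 2.0 GHz clock quoted for the demo hardware elsewhere in the deck:

```python
CLOCK_GHZ = 2.0  # cycles per nanosecond, assuming the deck's 2.0 GHz parts

def cycles_to_ns(cycles, ghz=CLOCK_GHZ):
    # Latency in nanoseconds = cycles / (cycles per nanosecond).
    return cycles / ghz

# Latencies from the chart above.
latencies_cycles = {"main memory": 167, "L3 full random": 38, "L2": 11, "L1": 4}
latencies_ns = {k: cycles_to_ns(v) for k, v in latencies_cycles.items()}

assert latencies_ns["main memory"] == 83.5  # a full cache miss costs ~84 ns
assert latencies_ns["L1"] == 2.0            # an L1 hit costs ~2 ns
```

The roughly 40x gap between an L1 hit and a main memory access is the whole story of this deck in one number.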

Memory Access Can Become More Costly With NUMA

[Diagram: two NUMA nodes, each with four cores ( core -> L1 -> L2 ) sharing an L3 cache; a local memory access stays within a node, a remote memory access crosses to the other node.]

NUMA Node Remote Memory Access Latency
An additional 20% overhead when accessing foreign memory !
( from coreinfo )
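Applying that ~20% remote-node penalty to the main memory figure from the latency chart gives a feel for the cost of poor NUMA locality. A back-of-envelope sketch ( the 167-cycle local figure is the chart's main memory latency ):

```python
LOCAL_MEMORY_CYCLES = 167  # local node main memory latency, from the chart
REMOTE_PENALTY = 0.20      # coreinfo reports ~20% extra for foreign memory

remote_cycles = LOCAL_MEMORY_CYCLES * (1 + REMOTE_PENALTY)

# A remote access costs roughly 200 cycles versus 167 locally.
assert round(remote_cycles, 1) == 200.4
```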

Local Vs Remote Memory Access and Thread Locality

How does SQLOS schedule hyper-threads in relation to physical cores ? ( 6 cores per socket )

[Diagram: CPU socket 0 and CPU socket 1, each with cores 0 - 5.]

Main Memory Is Holding The CPU Back, Solutions . . .

Leverage the pre-fetcher as much as possible.
Larger CPU caches:
L4 cache => Crystalwell eDRAM
DDR4 memory.
Bypass main memory:
Stacked memory
Hybrid memory cubes ( Intel )
High bandwidth memory ( AMD )

What The Pre-Fetcher Loves

Sequential scans: the column store index.

What The Pre-Fetcher Hates !

Hash table probes: random access. Can this be improved ?
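The contrast between the two access patterns can be sketched in a few lines. This is a conceptual illustration only — Python's interpreter overhead largely masks the hardware effect, which is clearest in native code — but the two functions below do exactly the same work with very different memory behaviour:

```python
import random

def sum_sequential(data):
    # Stride-1 access: the hardware pre-fetcher sees the pattern and
    # streams cache lines in ahead of the loop.
    total = 0
    for x in data:
        total += x
    return total

def sum_random(data, order):
    # Same work, but each index is unpredictable, so every access risks
    # a cache miss that the pre-fetcher cannot hide.
    total = 0
    for i in order:
        total += data[i]
    return total

data = list(range(1_000_000))
order = list(range(len(data)))
random.shuffle(order)

assert sum_sequential(data) == sum_random(data, order) == 499_999_500_000
```

A column store scan behaves like `sum_sequential`; probing a large hash table behaves like `sum_random`.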

Making Use Of CPU Stalls With Hyper-Threading ( Nehalem i7 onwards )

1. Session 1 performs an index seek on an n row B-tree; the pages are not in the CPU cache.
2. A last level cache miss causes a CPU stall ( 160+ clock cycles ) whilst the page is fetched from main memory.
3. The dead CPU stall cycles give the physical core the opportunity to run the second hyper-thread.

Obtaining An ETW Trace

Stack Walking The Database Engine

xperf -on Base -stackwalk Profile
run the SQL statement
xperf -d stackwalk.etl
analyse the resulting trace in WPA

Call Stack For Query Against Column Store On Non-Pre-Sorted Data

Hash agg lookup: weight 65,329.87
Column store scan: weight 28,488.73

Where Is The Bottleneck In The Plan ?

[Query plan annotated with control flow ( down the plan ) and data flow ( up the plan ) arrows.]

The stack trace is indicating that the bottleneck is right here, at the hash aggregate lookup.

Call Stack For Query Against Column Store On Pre-Sorted Data

Hash agg lookup weight: now 275.00, before 65,329.87
Column store scan weight

Does The OrderDateKey Column Fit In The L3 Cache ?

Table Name                   Column Name    Size (Mb)
FactInternetSalesBigNoSort   OrderDateKey     1786182
                             Price1              3871
                             Price2              3871
                             Price3              3871
FactInternetSalesBigSorted   OrderDateKey         738
                             Price1           2965127
                             Price2           2965127
                             Price3           2965127

No, the L3 cache is 20Mb in size.

The Case Of The Two Column Store Index Sizes: Conclusion

Turning the memory access on the hash aggregate table from random to sequential probes
=
CPU savings > cost of scanning an enlarged column store.
Batch Mode Hash Joins And The Ordering Of Hash Probe Inputs

Row mode hash join:
Each thread has its own hash table, fed by exchange operators that repartition both the build and probe inputs.
Expensive to repartition inputs; data skew reduces parallelism.

Batch mode hash join:
A single shared hash table, built and probed by all threads in batches ( B1 .. Bn ).
No repartitioning; data skew speeds up processing.
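The shared-hash-table idea can be sketched with threads probing one dictionary in batches. This is a toy model of the no-repartitioning scheme, not the batch engine itself; the DimDate-style keys and the `quarter` derivation are made up for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

# Build input: DimDate-style rows (date key -> toy "quarter" attribute).
build_rows = [(date_key, date_key % 4 + 1) for date_key in range(366)]
shared_hash_table = {k: quarter for k, quarter in build_rows}  # built once

# Probe input split into batches; every thread probes the SAME table,
# so no repartitioning exchange is needed and skew just means some
# batches finish faster than others.
probe_keys = [i % 366 for i in range(10_000)]
batches = [probe_keys[i:i + 1000] for i in range(0, len(probe_keys), 1000)]

def probe(batch):
    return [shared_hash_table[k] for k in batch]

with ThreadPoolExecutor(max_workers=4) as pool:
    results = [q for part in pool.map(probe, batches) for q in part]

assert len(results) == 10_000
```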

The Case Of The 60% CPU Utilisation Ceiling

If CPU capacity or IO bandwidth cannot be fully consumed, some form of contention must be present . . .

Batch Engine Call Stack

Throttling !

The Case Of The 60% CPU Utilisation Ceiling: Conclusion

The hash aggregate cannot keep up with the column store scan => the batch engine throttles the column store scan by issuing sleep system calls !!!
The integration services engine does something very similar, known as data flow engine back pressure.

The Case Of The Two Column Stores

The hash aggregate using the column store created on pre-sorted data is very CPU efficient. Why ?

Hypothesis

Column store on non-pre-sorted data: the segment scan drives random access probes of the hash table ( hash key -> value ); the hash table is likely to be at the high latency end of the cache hierarchy.

Column store on pre-sorted data: the segment scan drives sequential probes of the hash table; the hash table is likely to be at the low latency end of the cache hierarchy.

Introducing Intel VTune Amplifier XE

Investigating events at the CPU cache, clock cycle and instruction level requires software outside the standard Windows and SQL Server tool set.
Refer to Appendix D for an overview of what General exploration provides.

This Is What The CPU Stall Picture Looks Like Against DOP

[Chart: last level cache ( LLC ) misses ( 0 - 6,000,000,000 ) against degree of parallelism ( 2 - 24 ) for the non-sorted and sorted column stores.]

Which Memory Your Data Is In Matters ! Locality Matters !

Where is my data ? In the L1 data cache ? The L1 instruction cache ? The L2 unified cache ? The L3 cache ? Hopefully not all the way out over the memory bus in main memory ?!?

The Case of The CPU Pressure Point

Where are the pressure points on the CPU, and what can be done to resolve them ?

CPU Pipeline Architecture

A pipeline of logical slots runs through the processor, from the front end ( allocation ) to the back end ( retirement ).
The front end can issue four micro-ops per clock cycle.
The back end can retire up to four micro-ops per clock.

Pipeline Bubbles Are Bad !

Empty slots in the pipeline are referred to as bubbles.
Causes of front end bubbles:
Bad speculation.
CPU stalls.
Data dependencies, e.g.
A = B + C
E = A + D
Back end bubbles can be due to excessive demand for specific types of execution resource.

Making Efficient Use Of The CPU In The In-Memory World

Back end pressure: retirement is throttled due to pressure on back end resources ( port saturation ).
Front end pressure: the front end issues < 4 uops per cycle whilst the back end is ready to accept uops ( CPU stalls, bad speculation, data dependencies ).

Lots Of KPIs To Choose From, Which To Select ?

CPU cycles per retired instruction ( CPI ): this should ideally be 0.25; anything approaching 1.0 is bad.
Front end bound: the front end under-supplying the back end with work ( lower values are better ).
Back end bound: the back end cannot accept work from the front end because there is pressure on back end resources ( lower values are better ).

These Are The Pressure Point Statistics For The Sorted Column Store

[Chart: KPI value ( 0 - 0.8 ) against degree of parallelism ( 2 - 24 ) for CPI, front end bound and back end bound.]

Refer to Appendix C for the formulae from which these KPIs are derived.

The Backend Of The CPU Is Now The Bottleneck For The Batch Mode Engine

Can we help the back end keep up with the front end ?

Single Instruction Multiple Data ( SIMD )

A class of CPU instruction that allows multiple data points to be processed simultaneously.
A form of vectorised processing.
Once CPU stalls are minimised, the challenge becomes processing data on the core as efficiently as possible.

Lowering Clock Cycles Per Instruction By Leveraging SIMD

Using conventional processing, adding two arrays together, each comprising four elements, requires four instructions:
A(1) + B(1) = C(1)
A(2) + B(2) = C(2)
A(3) + B(3) = C(3)
A(4) + B(4) = C(4)

Lowering Clock Cycles Per Instruction By Leveraging SIMD

Using SIMD ( single instruction, multiple data ) commands, the addition can be performed using a single instruction:
[ A(1) A(2) A(3) A(4) ] + [ B(1) B(2) B(3) B(4) ] = [ C(1) C(2) C(3) C(4) ]
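The scalar-versus-vector contrast can be sketched as below. Note this is purely conceptual: Python itself does not emit SIMD instructions, so the "vector" version merely models one operation applied across all four lanes, the way a single AVX add would behave in native code.

```python
# Scalar version: four separate add operations, one element at a time.
def add_scalar(a, b):
    c = [0.0, 0.0, 0.0, 0.0]
    for i in range(4):
        c[i] = a[i] + b[i]
    return c

# "Vector" version: one operation over the whole four-lane group. In
# native code this maps to a single SIMD instruction; here the loop is
# merely hidden inside the interpreter, so this only models the idea.
def add_vector(a, b):
    return [x + y for x, y in zip(a, b)]

a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
assert add_scalar(a, b) == add_vector(a, b) == [11.0, 22.0, 33.0, 44.0]
```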

Does The SQL Server Database Engine Leverage SIMD Instructions ?

VTune Amplifier does not provide the option to pick out streaming SIMD extension ( SSE ) integer events.
However, for a floating point hash aggregate we would hope to see floating point AVX instructions in the call stack.

What Have We Learned ?

The Windows performance toolkit can help quantify where CPU time is going across the database engine call stack.
Memory access patterns matter: random vs sequential.
We have used VTune Amplifier to quantify the last level cache misses of an un-ordered hash aggregate.
The pressure point on the CPU is at the back end, not the front end.

All our volunteers and organisers do not get paid for organising this event. If you see them, please:
Give them a hug
Shake their hand
Say thank you
Spread the word
Get involved yourself
Don't forget to thank the sponsors for their support.
Thank the speakers for donating their time, energy and expenses.

Questions ?

Contact Details
ChrisAdkin8

chris1adkin@yahoo.co.uk

http://uk.linkedin.com/in/wollatondba

Appendices

Appendix A: Instruction Execution And The CPU Front / Back Ends

Front end: branch predict, fetch from cache, decode into the decoded instruction buffer.
Back end: execute, then reorder and retire.

Appendix B - The CPU Front / Back Ends In Detail

Front end

Back end

Appendix C - CPU Pressure Points, Important Calculations

Front end bound ( smaller is better )

IDQ_UOPS_NOT_DELIVERED.CORE / (4 * Clock ticks)

Bad speculation

(UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRED_SLOTS + 4 * INT_MISC.RECOVERY_CYCLES) / (4 * Clock ticks)

Retiring

UOPS_RETIRED.RETIRED_SLOTS / (4 * Clock ticks)

Back end bound ( ideally, should = 1 - Retiring )

1 - (Front end bound + Bad speculation + Retiring)
Appendix D - VTune Amplifier General Exploration

An illustration of what the General exploration analysis capability of the tool provides.
