
SQL Server Engine

Batch Mode
and
CPU Architectures

DBA Level 400

Stick around for RAFFLE and the AFTER EVENT!

On laser pointers and humour . . .

About me
An independent SQL consultant.
A user of SQL Server from version 2000 onwards, with 12+ years of experience.
A speaker, both at UK user group events and at conferences.
I have a passion for understanding how the database engine works at a deep level.

Everything fits in memory, so performance is as good as it will get. It fits in memory, therefore end of story.

Demonstration

Which SELECT Statement Has The Lowest Elapsed Time ?


WITH generator AS (
    SELECT TOP 3000 id = ROW_NUMBER() OVER (ORDER BY a)
    FROM (SELECT a = 1
          FROM   master.dbo.syscolumns) c1
    CROSS JOIN master.dbo.syscolumns c2
)
SELECT d.DateKey AS OrderDateKey
      ,CAST(((id - 1) % 1048576) AS money) AS Price1
      ,CAST(((id - 1) % 1048576) AS money) AS Price2
      ,CAST(((id - 1) % 1048576) AS money) AS Price3
INTO   FactInternetSalesBigNoSort
FROM   generator
CROSS JOIN [dbo].[DimDate] d

WITH generator AS (
    SELECT TOP 3000 id = ROW_NUMBER() OVER (ORDER BY a)
    FROM (SELECT a = 1
          FROM   master.dbo.syscolumns) c1
    CROSS JOIN master.dbo.syscolumns c2
)
SELECT d.DateKey AS OrderDateKey
      ,CAST(((id - 1) % 1048576) AS money) AS Price1
      ,CAST(((id - 1) % 1048576) AS money) AS Price2
      ,CAST(((id - 1) % 1048576) AS money) AS Price3
INTO   FactInternetSalesBigSorted
FROM   generator
CROSS JOIN [dbo].[DimDate] d

CREATE CLUSTERED INDEX ccsi
ON FactInternetSalesBigSorted ( OrderDateKey )

CREATE CLUSTERED COLUMNSTORE INDEX ccsi
ON FactInternetSalesBigNoSort

CREATE CLUSTERED COLUMNSTORE INDEX ccsi
ON FactInternetSalesBigSorted
WITH (DROP_EXISTING = ON)

SELECT CalendarQuarter
      ,SUM([Price1])
      ,SUM([Price2])
      ,SUM([Price3])
FROM   [dbo].[FactInternetSalesBigNoSort] f
JOIN   [DimDate] d
ON     f.OrderDateKey = d.DateKey
GROUP BY CalendarQuarter

SELECT CalendarQuarter
      ,SUM([Price1])
      ,SUM([Price2])
      ,SUM([Price3])
FROM   [dbo].[FactInternetSalesBigSorted] f
JOIN   [DimDate] d
ON     f.OrderDateKey = d.DateKey
GROUP BY CalendarQuarter

17.41Mb column store Vs. 51.7Mb column store

The fastest ?

How Well Do The Two Column Stores Scale On Larger Hardware ?


[Chart: elapsed time ( 0 - 80,000 ms ) against degree of parallelism ( 2 - 24 ) for the non-sorted and sorted column stores, created on 1,095,600,000 rows.]

Can We Use All Available CPU Resource ?


Memory access should consume all available CPU cycles ?!?

[Chart: percentage CPU utilisation ( 0 - 100 ) against degree of parallelism for the non-sorted and sorted column stores.]

Looking For Clues

Why does the query using the column store on pre-sorted data run faster ?
Why can we not utilise 100% CPU capacity ?
Let's start with tried and trusted tools and techniques.

Wait Statistics Do Not Help Here !

Stats are for the query run with a DOP of 24, a warm column store object pool and the column store created on pre-sorted data ( 1,095,600,000 rows ).
CXPACKET waits can be ignored 99.99% of the time.

Spin Locks Do Not Provide Any Clues Either


SELECT [CalendarQuarter]
      ,SUM([Price1])
      ,SUM([Price2])
      ,SUM([Price3])
FROM   [FactInternetSalesBig] f
JOIN   [DimDate] d
ON     f.OrderDateKey = d.DateKey
GROUP BY CalendarQuarter
OPTION (MAXDOP 24)

Executes in 775 ms for a warm column store object pool:
12 cores x 2.0 GHz x 0.775 s = 18,600,000,000 CPU cycles
Total spins: 293,491
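To see why the spin counts are noise rather than a clue, compare them to the total cycle budget of the run. A back-of-envelope sketch using the figures above:

```python
CORES = 12          # physical cores
CLOCK_HZ = 2.0e9    # 2.0 GHz
ELAPSED_S = 0.775   # 775 ms elapsed time

# Total cycles available to the query across all cores.
total_cycles = CORES * CLOCK_HZ * ELAPSED_S
total_spins = 293_491

assert round(total_cycles) == 18_600_000_000
# Even if every spin burned a whole cycle, spinning accounts for a
# vanishing fraction of the cycle budget.
assert total_spins / total_cycles < 0.0001
```

This is why spin locks provide no clues here: roughly 300 thousand spins against 18.6 billion cycles is under two thousandths of a percent of the work done.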

Could Query Costs Help Solve The Two Mysteries ?

Assumptions:
The buffer cache is cold.
IO cannot be performed in parallel.
Data in different columns is never correlated ( better in SQL Server 2014 ).
Hash distribution is always uniform.
Etc. . . .
Costings are based on the amount of time it took a developer's machine, a Dell OptiPlex ( according to legend ), to complete certain operations.

Our View Of Database Engine Resource Usage Is Based On . . .

wait stats, perfmon counters, extended events and dynamic management views.
We need to know and understand:
Where all our CPU cycles are going.
How the database engine utilises the CPU at a deep architectural level.

i Series CPU Architecture

A system-on-chip ( SoC ) design with CPU cores as the basic building block.
Utility services are provisioned by the "uncore" part of the CPU die: memory controller, TLB, PCI 2.0, QPI links, power and clock.
A three level cache hierarchy ( four for Sandy Bridge onwards ): per core, an L0 uop cache, a 32KB L1 instruction cache, a 32KB L1 data cache and a 256K unified L2 cache; a shared L3 cache sits on a bi-directional ring bus, connected to main memory over the memory bus.
The CPU Cache Hierarchy Latencies In CPU Cycles

Main memory                       167
L3 cache, full random access       38
L3 cache, in-page random access    18
L3 cache, sequential access        14
L2 cache, full random access       11
L2 cache, in-page random access    11
L2 cache, sequential access        11
L1 cache, full random access        4
L1 cache, in-page random access     4
L1 cache, sequential access         4
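At a given clock speed these cycle counts translate directly into wall-clock time. A minimal sketch, assuming the 2.0 GHz clock quoted for the demo hardware elsewhere in the deck:

```python
CLOCK_GHZ = 2.0  # cycles per nanosecond, assuming the deck's 2.0 GHz parts

def cycles_to_ns(cycles, ghz=CLOCK_GHZ):
    # Latency in nanoseconds = cycles / (cycles per nanosecond).
    return cycles / ghz

# Latencies from the chart above.
latencies_cycles = {"main memory": 167, "L3 full random": 38, "L2": 11, "L1": 4}
latencies_ns = {k: cycles_to_ns(v) for k, v in latencies_cycles.items()}

assert latencies_ns["main memory"] == 83.5  # a full cache miss costs ~84 ns
assert latencies_ns["L1"] == 2.0            # an L1 hit costs ~2 ns
```

The roughly 40x gap between an L1 hit and a main memory access is the whole story of this deck in one number.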

Memory Access Can Become More Costly With NUMA

[Diagram: two NUMA nodes, each with four cores ( core -> L1 -> L2 ) sharing an L3 cache; a local memory access stays within a node, a remote memory access crosses to the other node.]

NUMA Node Remote Memory Access Latency
An additional 20% overhead when accessing foreign memory !
( from coreinfo )
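Applying that ~20% remote-node penalty to the main memory figure from the latency chart gives a feel for the cost of poor NUMA locality. A back-of-envelope sketch ( the 167-cycle local figure is the chart's main memory latency ):

```python
LOCAL_MEMORY_CYCLES = 167  # local node main memory latency, from the chart
REMOTE_PENALTY = 0.20      # coreinfo reports ~20% extra for foreign memory

remote_cycles = LOCAL_MEMORY_CYCLES * (1 + REMOTE_PENALTY)

# A remote access costs roughly 200 cycles versus 167 locally.
assert round(remote_cycles, 1) == 200.4
```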

Local Vs Remote Memory Access and Thread Locality

How does SQLOS schedule hyper-threads in relation to physical cores ? ( 6 cores per socket )

[Diagram: CPU socket 0 and CPU socket 1, each with cores 0 - 5.]

Main Memory Is Holding The CPU Back, Solutions . . .

Leverage the pre-fetcher as much as possible.
Larger CPU caches:
L4 cache => Crystalwell eDRAM
DDR4 memory.
Bypass main memory:
Stacked memory
Hybrid memory cubes ( Intel )
High bandwidth memory ( AMD )

What The Pre-Fetcher Loves

Sequential scans: the column store index.

What The Pre-Fetcher Hates !

Hash table probes: random access. Can this be improved ?
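The contrast between the two access patterns can be sketched in a few lines. This is a conceptual illustration only — Python's interpreter overhead largely masks the hardware effect, which is clearest in native code — but the two functions below do exactly the same work with very different memory behaviour:

```python
import random

def sum_sequential(data):
    # Stride-1 access: the hardware pre-fetcher sees the pattern and
    # streams cache lines in ahead of the loop.
    total = 0
    for x in data:
        total += x
    return total

def sum_random(data, order):
    # Same work, but each index is unpredictable, so every access risks
    # a cache miss that the pre-fetcher cannot hide.
    total = 0
    for i in order:
        total += data[i]
    return total

data = list(range(1_000_000))
order = list(range(len(data)))
random.shuffle(order)

assert sum_sequential(data) == sum_random(data, order) == 499_999_500_000
```

A column store scan behaves like `sum_sequential`; probing a large hash table behaves like `sum_random`.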

Making Use Of CPU Stalls With Hyper-Threading ( Nehalem i7 onwards )

1. Session 1 performs an index seek on an n row B-tree; the pages are not in the CPU cache.
2. A last level cache miss causes a CPU stall ( 160+ clock cycles ) whilst the page is fetched from main memory.
3. The dead CPU stall cycles give the physical core the opportunity to run the second hyper-thread.

Obtaining An ETW Trace

Stack Walking The Database Engine

xperf -on Base -stackwalk Profile
run the SQL statement
xperf -d stackwalk.etl
analyse the resulting trace in WPA

Call Stack For Query Against Column Store On Non-Pre-Sorted Data

Hash agg lookup: weight 65,329.87
Column store scan: weight 28,488.73

Where Is The Bottleneck In The Plan ?

[Query plan annotated with control flow ( down the plan ) and data flow ( up the plan ) arrows.]

The stack trace is indicating that the bottleneck is right here, at the hash aggregate lookup.

Call Stack For Query Against Column Store On Pre-Sorted Data

Hash agg lookup weight: now 275.00, before 65,329.87
Column store scan weight

Does The OrderDateKey Column Fit In The L3 Cache ?

Table Name                   Column Name    Size (Mb)
FactInternetSalesBigNoSort   OrderDateKey     1786182
                             Price1              3871
                             Price2              3871
                             Price3              3871
FactInternetSalesBigSorted   OrderDateKey         738
                             Price1           2965127
                             Price2           2965127
                             Price3           2965127

No, the L3 cache is 20Mb in size.

The Case Of The Two Column Store Index Sizes: Conclusion

Turning the memory access on the hash aggregate table from random to sequential probes
=
CPU savings > cost of scanning an enlarged column store.
Batch Mode Hash Joins And The Ordering Of Hash Probe Inputs

Row mode hash join:
Each thread has its own hash table, fed by exchange operators that repartition both the build and probe inputs.
Expensive to repartition inputs; data skew reduces parallelism.

Batch mode hash join:
A single shared hash table, built and probed by all threads in batches ( B1 .. Bn ).
No repartitioning; data skew speeds up processing.
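The shared-hash-table idea can be sketched with threads probing one dictionary in batches. This is a toy model of the no-repartitioning scheme, not the batch engine itself; the DimDate-style keys and the `quarter` derivation are made up for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

# Build input: DimDate-style rows (date key -> toy "quarter" attribute).
build_rows = [(date_key, date_key % 4 + 1) for date_key in range(366)]
shared_hash_table = {k: quarter for k, quarter in build_rows}  # built once

# Probe input split into batches; every thread probes the SAME table,
# so no repartitioning exchange is needed and skew just means some
# batches finish faster than others.
probe_keys = [i % 366 for i in range(10_000)]
batches = [probe_keys[i:i + 1000] for i in range(0, len(probe_keys), 1000)]

def probe(batch):
    return [shared_hash_table[k] for k in batch]

with ThreadPoolExecutor(max_workers=4) as pool:
    results = [q for part in pool.map(probe, batches) for q in part]

assert len(results) == 10_000
```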

The Case Of The 60% CPU Utilisation Ceiling

If CPU capacity or IO bandwidth cannot be fully consumed, some form of contention must be present . . .

Batch Engine Call Stack

Throttling !

The Case Of The 60% CPU Utilisation Ceiling: Conclusion

The hash aggregate cannot keep up with the column store scan => the batch engine throttles the column store scan by issuing sleep system calls !!!
The integration services engine does something very similar, known as data flow engine back pressure.

The Case Of The Two Column Stores

The hash aggregate using the column store created on pre-sorted data is very CPU efficient. Why ?

Hypothesis

Column store on non-pre-sorted data: the segment scan drives random access probes of the hash table ( hash key -> value ); the hash table is likely to be at the high latency end of the cache hierarchy.

Column store on pre-sorted data: the segment scan drives sequential probes of the hash table; the hash table is likely to be at the low latency end of the cache hierarchy.

Introducing Intel VTune Amplifier XE

Investigating events at the CPU cache, clock cycle and instruction level requires software outside the standard Windows and SQL Server tool set.
Refer to Appendix D for an overview of what General exploration provides.

This Is What The CPU Stall Picture Looks Like Against DOP

[Chart: last level cache ( LLC ) misses ( 0 - 6,000,000,000 ) against degree of parallelism ( 2 - 24 ) for the non-sorted and sorted column stores.]

Which Memory Your Data Is In Matters ! Locality Matters !

Where is my data ? In the L1 data cache ? The L1 instruction cache ? The L2 unified cache ? The L3 cache ? Hopefully not all the way out over the memory bus in main memory ?!?

The Case of The CPU Pressure Point

Where are the pressure points on the CPU, and what can be done to resolve them ?

CPU Pipeline Architecture

A pipeline of logical slots runs through the processor, from the front end ( allocation ) to the back end ( retirement ).
The front end can issue four micro-ops per clock cycle.
The back end can retire up to four micro-ops per clock.

Pipeline Bubbles Are Bad !

Empty slots in the pipeline are referred to as bubbles.
Causes of front end bubbles:
Bad speculation.
CPU stalls.
Data dependencies, e.g.
A = B + C
E = A + D
Back end bubbles can be due to excessive demand for specific types of execution resource.

Making Efficient Use Of The CPU In The In-Memory World

Back end pressure: retirement is throttled due to pressure on back end resources ( port saturation ).
Front end pressure: the front end issues < 4 uops per cycle whilst the back end is ready to accept uops ( CPU stalls, bad speculation, data dependencies ).

Lots Of KPIs To Choose From, Which To Select ?

CPU cycles per retired instruction ( CPI ): this should ideally be 0.25; anything approaching 1.0 is bad.
Front end bound: the front end under-supplying the back end with work ( lower values are better ).
Back end bound: the back end cannot accept work from the front end because there is pressure on back end resources ( lower values are better ).

These Are The Pressure Point Statistics For The Sorted Column Store

[Chart: KPI value ( 0 - 0.8 ) against degree of parallelism ( 2 - 24 ) for CPI, front end bound and back end bound.]

Refer to Appendix C for the formulae from which these KPIs are derived.

The Backend Of The CPU Is Now The Bottleneck For The Batch Mode Engine

Can we help the back end keep up with the front end ?

Single Instruction Multiple Data ( SIMD )

A class of CPU instruction that allows multiple data points to be processed simultaneously.
A form of vectorised processing.
Once CPU stalls are minimised, the challenge becomes processing data on the core as efficiently as possible.

Lowering Clock Cycles Per Instruction By Leveraging SIMD

Using conventional processing, adding two arrays together, each comprising four elements, requires four instructions:
A(1) + B(1) = C(1)
A(2) + B(2) = C(2)
A(3) + B(3) = C(3)
A(4) + B(4) = C(4)

Lowering Clock Cycles Per Instruction By Leveraging SIMD

Using SIMD ( single instruction, multiple data ) commands, the addition can be performed using a single instruction:
[ A(1) A(2) A(3) A(4) ] + [ B(1) B(2) B(3) B(4) ] = [ C(1) C(2) C(3) C(4) ]
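The scalar-versus-vector contrast can be sketched as below. Note this is purely conceptual: Python itself does not emit SIMD instructions, so the "vector" version merely models one operation applied across all four lanes, the way a single AVX add would behave in native code.

```python
# Scalar version: four separate add operations, one element at a time.
def add_scalar(a, b):
    c = [0.0, 0.0, 0.0, 0.0]
    for i in range(4):
        c[i] = a[i] + b[i]
    return c

# "Vector" version: one operation over the whole four-lane group. In
# native code this maps to a single SIMD instruction; here the loop is
# merely hidden inside the interpreter, so this only models the idea.
def add_vector(a, b):
    return [x + y for x, y in zip(a, b)]

a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
assert add_scalar(a, b) == add_vector(a, b) == [11.0, 22.0, 33.0, 44.0]
```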

Does The SQL Server Database Engine Leverage SIMD Instructions ?

VTune Amplifier does not provide the option to pick out streaming SIMD extension ( SSE ) integer events.
However, for a floating point hash aggregate we would hope to see floating point AVX instructions in the call stack.

What Have We Learned ?

The Windows performance toolkit can help quantify where CPU time is going across the database engine call stack.
Memory access patterns matter: random vs sequential.
We have used VTune Amplifier to quantify the last level cache misses of an un-ordered hash aggregate.
The pressure point on the CPU is at the back end, not the front end.

All our volunteers and organisers do not get paid for organising this event. If you see them, please:
Give them a hug
Shake their hand
Say thank you
Spread the word
Get involved yourself
Don't forget to thank the sponsors for their support.
Thank the speakers for donating their time, energy and expenses.

Questions ?

Contact Details
ChrisAdkin8

chris1adkin@yahoo.co.uk

http://uk.linkedin.com/in/wollatondba

Appendices

Appendix A: Instruction Execution And The CPU Front / Back Ends

Front end: branch predict, fetch from cache, decode into the decoded instruction buffer.
Back end: execute, then reorder and retire.

Appendix B - The CPU Front / Back Ends In Detail

Front end

Back end

Appendix C - CPU Pressure Points, Important Calculations

Front end bound ( smaller is better )

IDQ_UOPS_NOT_DELIVERED.CORE / (4 * Clock ticks)

Bad speculation

(UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRED_SLOTS + 4 * INT_MISC.RECOVERY_CYCLES) / (4 * Clock ticks)

Retiring

UOPS_RETIRED.RETIRED_SLOTS / (4 * Clock ticks)

Back end bound ( ideally, should = 1 - Retiring )

1 - (Front end bound + Bad speculation + Retiring)
Appendix D - VTune Amplifier General Exploration

An illustration of what the General exploration analysis capability of the tool provides.
