Batch Mode and CPU Architectures
About me
An independent SQL consultant.
A user of SQL Server from version 2000 onwards, with 12+ years of experience.
A speaker at both UK user group events and conferences.
I have a passion for understanding how the database engine works at a deep level.
"Everything fits in memory, so performance is as good as it will get. It fits in memory, therefore end of story."
Demonstration
WITH generator AS (
    SELECT TOP 3000 id = ROW_NUMBER() OVER (ORDER BY a)
    FROM (SELECT a = 1
          FROM master.dbo.syscolumns) c1
    CROSS JOIN master.dbo.syscolumns c2
)
SELECT
     d.DateKey AS OrderDateKey
    ,CAST((id - 1) % 1048576 AS money) AS Price1
    ,CAST((id - 1) % 1048576 AS money) AS Price2
    ,CAST((id - 1) % 1048576 AS money) AS Price3
INTO FactInternetSalesBigSorted
FROM generator
CROSS JOIN [dbo].[DimDate] d;
CREATE CLUSTERED INDEX ccsi
ON FactInternetSalesBigNoSort ( OrderDateKey );
SELECT
     CalendarQuarter
    ,SUM([Price1])
    ,SUM([Price2])
    ,SUM([Price3])
FROM [dbo].[FactInternetSalesBigNoSort] f
JOIN [DimDate] d
    ON f.OrderDateKey = d.DateKey
GROUP BY CalendarQuarter
The fastest?
SELECT
     CalendarQuarter
    ,SUM([Price1])
    ,SUM([Price2])
    ,SUM([Price3])
FROM [dbo].[FactInternetSalesBigSorted] f
JOIN [DimDate] d
    ON f.OrderDateKey = d.DateKey
GROUP BY CalendarQuarter
The fastest?
[Chart: elapsed time (ms), from 0 to 70,000, against degree of parallelism (2 to 24), for the non-sorted column store.]
[Chart: percentage CPU utilization, from 0 to 100, against degree of parallelism (2 to 24), for the non-sorted and sorted column stores.]
Stats are for the query run with a DOP of 24, a warm column store object pool, and the column store created on pre-sorted data (1,095,600,000 rows).
CXPACKET waits can be ignored 99.99% of the time.
SELECT
     [CalendarQuarter]
    ,SUM([Price1])
    ,SUM([Price2])
    ,SUM([Price3])
FROM [FactInternetSalesBig] f
JOIN [DimDate] d
    ON f.OrderDateKey = d.DateKey
GROUP BY CalendarQuarter
OPTION (MAXDOP 24)
Assumptions made by the optimiser's costing model:
The buffer cache is cold.
IO cannot be performed in parallel.
Data in different columns is never correlated (better in SQL Server 2014).
Hash distribution is always uniform.
Etc. . . .
Costings are based on the amount of time it took a developer's machine, a Dell OptiPlex (according to legend), to complete certain operations.
CPU
[Diagram: a multi-core CPU die. Each core has its own L0 uop cache, 32KB L1 instruction cache, 32KB L1 data cache, and 256KB unified L2 cache; all cores share the L3 cache. The uncore contains the memory controller, TLB, PCI 2.0, QPI links, and power and clock circuitry, connecting the die to main memory.]
[Chart: memory access latency in clock cycles; a local main memory access costs 167 cycles, with the remaining bars (38, 18, 14, 11, 11, 11) showing the progressively cheaper levels of the cache hierarchy.]
[Diagram: two NUMA nodes; each node has four cores with private L1 and L2 caches, sharing an L3 cache and local memory.]
An additional 20% overhead when accessing foreign memory! (from coreinfo)
How does SQLOS schedule hyper-threads in relation to physical cores? (6 cores per socket)
[Diagram: two CPU sockets, each with cores 0 to 5, attached to main memory.]
[Diagram: a sequential scan of the column store index feeds a hash table inside the CPU.]
Can this be improved?
Making Use Of CPU Stalls With Hyper-Threading (Nehalem i7 onwards)
1. Session 1 performs an index seek on an n-row B-tree; the pages are not in the CPU cache, resulting in a last level cache miss.
2. A CPU stall takes place (160+ clock cycles) whilst the page is fetched.
3. The dead CPU stall cycles give the physical core the opportunity to execute the 2nd SQL statement on the other hyper-thread.
xperf -d stackwalk.etl (stop the trace and merge it into stackwalk.etl), then analyse the trace in Windows Performance Analyzer (WPA).
Data flow

FactInternetSalesBigNoSort
Column Name    Size (Mb)
OrderDateKey   1786182
Price1         3871
Price2         3871
Price3         3871

FactInternetSalesBigSorted
Column Name    Size (Mb)
OrderDateKey   738
Price1         2965127
Price2         2965127
Price3         2965127

No, the L3 cache is 20Mb in size.
Batch Mode Hash Joins And The Ordering Of Hash Probe Inputs
[Diagram: a row mode hash join repartitions both the build input (B1 … Bn) and the probe input (B1 … Bm) across threads through exchange operators. A batch mode hash join builds a single shared hash table, and threads consume batches from the build and probe inputs directly, with no exchanges.]
No repartitioning
Data skew speeds up processing
If CPU capacity or IO bandwidth cannot be fully consumed, some form of contention must be present . . . Throttling! Why?
Hypothesis
[Diagram: a segment scan of the column store built on non-pre-sorted data drives random access probes into a hash table of key/value pairs.]
The hash table is likely to be at the high latency end of the cache hierarchy.
This Is What The CPU Stall Picture Looks Like Against DOP
[Chart: last level cache (LLC) misses, from 0 to 6,000,000,000, against degree of parallelism (2 to 24), for the non-sorted and sorted column stores.]
CPU: where is my data?
[Diagram: the search works outwards through the hierarchy: the L1 data cache ("Here?"), the L2 unified cache ("Here?"), the L3 cache ("Here?"), and across the memory bus to main memory ("Hopefully not here?!?"). The L1 instruction cache sits alongside the L1 data cache.]
A pipeline of logical slots runs through the processor, from allocation to retirement.
[Diagram: slots flow through the CPU front end and back end; empty slots ("bubbles") represent wasted issue opportunities.]
Frontend Pressure
The front end issues fewer than 4 uops per cycle whilst the back end is ready to accept uops (CPU stalls, bad speculation, data dependencies).
[Diagram: data flows from the CPU front end into the back end.]
These Are The Pressure Point Statistics For The Sorted Column Store
[Chart: KPI value, from 0 to 0.8, against degree of parallelism (2 to 24), plotting CPI and performance.]
The Backend Of The CPU Is Now The Bottleneck For The Batch Mode Engine
[Diagram: the CPU front end feeds the back end, which is now saturated.]
Using conventional processing, adding two arrays together, each comprising four elements, requires four instructions:

A(1) + B(1) = C(1)
A(2) + B(2) = C(2)
A(3) + B(3) = C(3)
A(4) + B(4) = C(4)

Using SIMD (single instruction, multiple data) commands, the addition can be performed using a single instruction:

A(1..4) + B(1..4) = C(1..4)
Questions ?
Contact Details
ChrisAdkin8
chris1adkin@yahoo.co.uk
http://uk.linkedin.com/in/wollatondba
Appendices
[Diagram: CPU pipeline. Front end: branch prediction, instruction fetch from the cache, and decode, feeding a decoded instruction buffer. Back end: execute, then reorder and retire.]
The four top-down pipeline slot categories are: front end bound, back end bound, bad speculation, and retiring. For example:
Bad speculation = (UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRED_SLOTS + 4 * INT_MISC.RECOVERY_CYCLES) / (4 * clock ticks)