Michael R. Ault, Oracle Guru, Texas Memory Systems
Introduction
One of the new buzzwords is "Tier Zero storage." No doubt you have heard it
applied to fast disks, Flash disks, and of course DDR-based SSD storage systems. Tier
Zero is the fastest, lowest-latency level of storage, reserved for the parts of your database
or system architecture that need high-speed (low-latency) access to ensure the
performance of the entire system.
Many systems suffer from IO contention, which results in high run queues, low CPU
usage, and high IO waits. Increasing bandwidth is usually not a solution once the
minimum latency of the existing storage media has been reached.
Usually less than 10% of your database tables and indexes need the performance offered
by Tier Zero storage; the trick is picking which 10%. Many times we are sure that
because table X or index Y is the most important, it must go on Tier Zero;
however, in many cases this is simply not true.
Typically, speed increases and latency decreases as you move up the latency pyramid
from higher-numbered tiers to lower-numbered tiers, as shown in Figure 1.
Of course, if an object is small enough that it will be read once from storage and then
reside in the SGA buffer space, placing it on Tier Zero is a waste of Tier Zero space.
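One rough way to check this is to compare an object's size against the default buffer cache. The following is only a sketch: the H_LINEITEM name is an example, and the V$SGA_DYNAMIC_COMPONENTS lookup assumes an automatically managed SGA (on a manually sized system, DB_CACHE_SIZE would be the number to compare against):

```sql
-- Compare a segment's total size with the default buffer cache; an object
-- much smaller than the cache may simply stay resident after its first read,
-- making Tier Zero placement unnecessary.
SELECT sg.owner,
       sg.segment_name,
       ROUND(SUM(sg.bytes) / 1048576) seg_mb,
       ROUND(MAX(bc.bytes) / 1048576) cache_mb
FROM   dba_segments sg,
       (SELECT current_size bytes
        FROM   v$sga_dynamic_components
        WHERE  component = 'DEFAULT buffer cache') bc
WHERE  sg.segment_name = 'H_LINEITEM'   -- example object, substitute your own
GROUP  BY sg.owner, sg.segment_name;
```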
Let’s look at each of these three sources and see how they can be utilized to determine
optimal placement of assets on Tier Zero storage.
If you have worked with explain plan and the PLAN_TABLE, then the contents of the
V$SQL_PLAN view should be familiar; other than the ADDRESS, SQL_ID, and
HASH_VALUE columns, the PLAN_TABLE and the V$SQL_PLAN view are virtually identical.
The major difference between the two is that the PLAN_TABLE must be populated at the
user's request with the explain plan command and usually contains the plans for only a
few specific SQL statements, while the V$SQL_PLAN view is populated automatically and
contains the plans for all active SQL statements in the SGA SQL area.
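As a reminder of the two routes, the PLAN_TABLE route requires an explicit explain plan, while a statement already cached in the SQL area can be formatted directly by SQL_ID using DBMS_XPLAN. The statement and SQL_ID below are illustrative placeholders:

```sql
-- On-demand route: populate PLAN_TABLE for one statement, then format it.
EXPLAIN PLAN FOR
SELECT COUNT(*) FROM h_lineitem;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

-- SQL-area route: display the stored plan of an already-cached statement
-- by its SQL_ID (the SQL_ID shown here is illustrative).
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR('7zcfxggv196w2'));
```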
Once your application has been running long enough to establish its working set of SQL,
queries against the V$SQL_PLAN view can yield detailed information about which indexes
are frequently used and which tables are accessed by full or partial scans that are
inefficient from the storage point of view. An example SQL script to pull information
about table and index access paths is shown in Figure 3.
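A query in the spirit of that script can be sketched as follows. This is not the author's exact script: the FTS_MEG calculation here (segment size times executions, as a rough megabytes-scanned figure) and the column aliases are illustrative assumptions, and partitioned objects will contribute one DBA_SEGMENTS row per partition:

```sql
-- Rank application objects by access path and approximate scan volume.
SELECT p.object_owner owner,
       p.object_name  name,
       p.operation,
       p.options,
       SUM(s.executions) executions,
       ROUND(SUM(sg.bytes * s.executions) / 1048576, 1) fts_meg
FROM   v$sql_plan   p,
       v$sqlarea    s,
       dba_segments sg
WHERE  p.sql_id        = s.sql_id
AND    sg.owner        = p.object_owner
AND    sg.segment_name = p.object_name
AND    p.object_owner NOT IN ('SYS', 'SYSTEM')
GROUP  BY p.object_owner, p.object_name, p.operation, p.options
ORDER  BY fts_meg DESC;
```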
To determine which objects are candidates for Tier Zero storage, look at the access paths
that use the object, as shown by the OPTIONS column, and at the amount of access, as
shown by the FTS_MEG column. Objects with the greatest number of executions that are
sized such that they are unlikely to fit in the SGA buffer area are excellent candidates
for Tier Zero storage.
In the report in Listing 1 we see that, by overall volume of access, the H_LINEITEM table
and its index LINEITEM_IDX2 are the first candidates for moving to Tier
Zero. In fact, about the only objects that aren't candidates are the tables H_NATION,
H_REGION, and H_SUPPLIER and the index SUPPLIER_IDX1.
The report in Figure 4 was taken from a system using 28 15K-RPM, 144 GB hard drives to
service a data-warehouse-type database with 300 GB of data and 250 GB of indexes.
Oracle reports 15.7K IOPS against this setup, which is actually quite impressive. Of
course, most of these IOPS out of Oracle are bundled at the controller, usually by a
factor of 16, reducing the IOPS seen at the actual disks to around 1K. Given that each
disk is capable of between 90 and 150 random IOPS, we should be able to support about
2,800 IOPS before stressing the disk system, provided all of the IOPS are non-colliding.
To see what kind of stress we are actually experiencing, we next compare CPU time and
idle time. Figure 5 shows the Operating System Statistics section of the same report as
Figure 4.
In Figure 5 we see that the idle time is huge in comparison to the busy time and that the
IO wait time is nearly equal to the busy time. This indicates that the CPUs are waiting
for IOs to complete, signaling that the system is stressed at the IO subsystem. But how
stressed is the system? A key indicator of the amount of IO stress is the perceived disk
latency, which for the most part is driven by disk rotation speed. A 15K RPM disk
should have a maximum latency of around 5 milliseconds, less if the disk is being "short
stroked"; a short-stroked disk is one that is never filled beyond about 30% of capacity,
specifically to reduce IO latency. The Tablespace IO statistics section gives us an idea
of the latency the Oracle system is seeing from the IO subsystem. Figure 6 shows
the Tablespace IO section of the report.
Tablespace      Reads  Av Rds/s  Av Rd(ms)  Av Blks/Rd   Writes  Av Wrts/s  Buf Waits  Av Buf Wt(ms)
---------- ---------- --------- ---------- ----------- -------- ---------- ---------- -------------
HDINDEXES   1,482,896        82      568.5        11.0        0          0        110           60.6
HDDATA      1,104,429        61      948.0        46.6        4          0          0            0.0
TEMPHD         72,003         4       61.5        23.5   86,927          5     35,555          135.8
SYSAUX         11,290         1       15.3         1.6    3,462          0          0            0.0
UNDOTBS1           13         0        0.0         1.0    1,564          0          0            0.0
SYSTEM          1,158         0        1.9         1.0      272          0          1            0.0

Figure 6: Tablespace IO Report
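Between AWR snapshots, similar per-tablespace read latencies can be computed directly from the file statistics views. Note this sketch reports cumulative figures since instance startup, not an interval like the AWR report, and that V$FILESTAT times are in centiseconds, hence the multiplication by 10 to get milliseconds:

```sql
-- Per-tablespace read latency from cumulative file statistics.
SELECT ts.name tablespace,
       SUM(fs.phyrds)  reads,
       ROUND(SUM(fs.readtim) * 10 / NULLIF(SUM(fs.phyrds), 0), 1) av_rd_ms,
       SUM(fs.phywrts) writes
FROM   v$filestat   fs,
       v$datafile   df,
       v$tablespace ts
WHERE  fs.file# = df.file#
AND    df.ts#   = ts.ts#
GROUP  BY ts.name
ORDER  BY av_rd_ms DESC;
```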
From Figure 6 we can see that the read latencies for our HDINDEXES and HDDATA
tablespaces are extremely high, 568.5 and 948.0 milliseconds respectively. These
extremely high latencies definitely indicate the IO subsystem is being heavily stressed.
High numbers of buffer waits also indicate long wait times for writes: our
TEMPHD tablespace is experiencing large numbers of buffer waits, and each buffer wait
is taking about 136 milliseconds to resolve. Other than moving all of the temporary
tablespaces, data tables, and indexes to Tier Zero, how can we determine the minimum
number of objects that should be moved?
Going back to the beginning of the report, we can first examine the top wait events to see
under which categories the IO stress occurs. Figure 7 shows the "Top Five Wait Events"
for our report.
Event                        Waits  Time(s)  Avg wait(ms)  % DB time  Wait Class
------------------------ --------- -------- ------------- ---------- ----------
direct path read         1,142,119  124,112           109       77.3  User I/O
direct path write temp      68,261    6,543            96        4.1  User I/O
db file parallel read       16,671    5,399           324        3.4  User I/O
DB CPU                                5,315                      3.3
direct path read temp       70,785    4,453            63        2.8  User I/O

Figure 7: Top Five Wait Events
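An AWR (or Statspack) interval report is the right tool for this, but the same ranking can be approximated between snapshots from V$SYSTEM_EVENT. This sketch shows cumulative figures since startup (TIME_WAITED is in centiseconds):

```sql
-- Top five cumulative User I/O waits since instance startup.
SELECT event,
       total_waits,
       ROUND(time_waited / 100) time_s,
       ROUND(time_waited * 10 / NULLIF(total_waits, 0)) avg_ms
FROM   (SELECT event, total_waits, time_waited
        FROM   v$system_event
        WHERE  wait_class = 'User I/O'
        ORDER  BY time_waited DESC)
WHERE  ROWNUM <= 5;
```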
I'll bet you were expecting to see db file scattered read or db file sequential read as the
top wait events, and in a non-partitioned, non-parallel-query environment you would
have been correct. However, this database is highly partitioned and uses table-, index-,
and instance-level (it is a 4-node RAC setup) parallel query. The direct path read wait
event indicates asynchronous reads directly into the PGA, in this case because of the
parallel queries the system is running. The direct path read temp and direct path write
temp events indicate that our PGA_AGGREGATE_TARGET may be insufficient for the size of
the hashes and sorts being performed. So what does all this mean?
The direct path read wait events are being driven by full table/partition scans. If we
examine the "Segment Statistics" section of the AWR report, we can see which objects are
undergoing the most physical IO and, generally speaking, those will be the ones causing
our direct path read waits. If we saw db file scattered read or db file sequential read
waits, we could look at the same report section to see which objects were most likely
causing issues. Figure 8 shows the pertinent "Segment Statistics" sections of the report.
Segments by Physical Reads DB/Inst: TMSTPCH/TMSTPCH1 Snaps: 1630-1635
-> Total Physical Reads: 275,467,812
-> Captured Segments account for 8.1% of Total
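Outside of an AWR report, the same ranking can be approximated by querying V$SEGMENT_STATISTICS directly; as with the other dynamic views, this sketch gives cumulative values since startup rather than a snapshot interval:

```sql
-- Top ten segments by cumulative physical reads since startup.
SELECT owner, object_name, object_type, value physical_reads
FROM   (SELECT owner, object_name, object_type, value
        FROM   v$segment_statistics
        WHERE  statistic_name = 'physical reads'
        ORDER  BY value DESC)
WHERE  ROWNUM <= 10;
```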
Obviously the H_LINEITEM table is dominating our Segment Statistics. With only 8.1%
of total IO captured, the physical IO is evidently being spread over a large
number of objects, in this case probably more H_LINEITEM and H_ORDER partitions.
Looking at all three sections, we can see that the H_CUSTOMER, H_PART, and
H_SUPPLIER tables are also seeing a great deal of full scans at the partition level.
However, with less than 50% of any of the Segment Statistics actually captured, we
are probably missing a great deal of important data for determining what should be
placed on Tier Zero. As a start, though, it looks like the following tables should be
placed there:
• H_LINEITEM
• H_CUSTOMER
• H_PART
• H_PARTSUPP
In addition, the temporary tablespace stress indicates that the temporary tablespace
datafiles should be moved there as well, assuming we cannot increase
PGA_AGGREGATE_TARGET enough to relieve the temporary IO stress by moving the
sorts and hashes into memory.
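Whether a larger PGA_AGGREGATE_TARGET would actually keep the sorts and hashes in memory can be estimated from Oracle's PGA advisory view before resorting to hardware; a sketch:

```sql
-- Estimated PGA cache hit percentage and over-allocations at each
-- candidate PGA_AGGREGATE_TARGET size.
SELECT ROUND(pga_target_for_estimate / 1048576) target_mb,
       estd_pga_cache_hit_percentage estd_hit_pct,
       estd_overalloc_count
FROM   v$pga_target_advice
ORDER  BY pga_target_for_estimate;
```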
Since less than half of the actual database activity is being recorded in the
Segment Statistics area of the report, we need to look at the SQL area of the report to
determine which SQL is causing the most stress. By analyzing the top 5-10 problem SQL
statements, defined as the worst-performing SQL statements in their area of the
report, we can determine any additional objects that may need to move to Tier Zero or
other high-speed, low-latency storage. Figure 9 shows the applicable top five SQL
statements in the SQL physical reads section of the report.
SQL ordered by Reads DB/Inst: TMSTPCH/TMSTPCH1 Snaps: 1630-1635
-> Total Disk Reads: 275,467,812
-> Captured SQL account for 20.3% of Total
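The same top-readers list is available between AWR snapshots from V$SQLAREA; this sketch again reports cumulative figures since the statements entered the shared pool:

```sql
-- Top five SQL statements by cumulative disk reads.
SELECT sql_id, disk_reads, executions,
       ROUND(disk_reads / NULLIF(executions, 0)) reads_per_exec
FROM   (SELECT sql_id, disk_reads, executions
        FROM   v$sqlarea
        ORDER  BY disk_reads DESC)
WHERE  ROWNUM <= 5;
```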
The full text of the top SQL statement for physical IOs from Figure 9, SQL
ID 7zcfxggv196w2, with over 20 million reads, is shown in Figure 10:
select
s_name,
count(*) as numwait
from
h_supplier,
h_lineitem l1,
h_order,
h_nation
where
s_suppkey = l1.l_suppkey
and o_orderkey = l1.l_orderkey
and o_orderstatus = 'F'
and l1.l_receiptdate > l1.l_commitdate
and exists (
select
*
from
h_lineitem l2
where
l2.l_orderkey = l1.l_orderkey
and l2.l_suppkey <> l1.l_suppkey
)
and not exists (
select
*
from
h_lineitem l3
where
l3.l_orderkey = l1.l_orderkey
and l3.l_suppkey <> l1.l_suppkey
and l3.l_receiptdate > l3.l_commitdate
)
and s_nationkey = n_nationkey
and n_name = 'ROMANIA'
group by
s_name
order by
numwait desc,
s_name;
Figure 10: Full Text for Top SQL Query
Just by examining the SQL in Figure 10 we can’t really see much more than we knew
before. However, if we generate an execution plan using the explain plan command, we
get a much better look at what objects are actually being utilized by the query. Figure 11
shows the abbreviated explain plan for the query in Figure 10.
SQL> SET LINESIZE 130
SQL> SET PAGESIZE 0
SQL> SELECT *
2 FROM TABLE(DBMS_XPLAN.DISPLAY);
Plan hash value: 152520466
In the execution plan just the objects accessed are shown; in this case we see the
same tables from our full-table-scan list, in addition to the LINEITEM_IDX2 index. The
H_NATION table is very small and is usually fully cached early in any processing cycle,
so we can disregard it. In fact, a query against the V$SQL_PLAN view shows that several
other indexes, among them CUSTOMER_IDX1, PARTSUPP_IDX2, and LINEITEM_IDX1, are also
being heavily used by the other queries.
So after looking at IO statistics and physical waits and doing some SQL analysis, we
have determined that the following objects should be placed on Tier Zero if space allows:
• Temporary tablespace
• H_LINEITEM table
• H_CUSTOMER table
• H_PART table
• H_PARTSUPP table
• H_ORDER table
• CUSTOMER_IDX1 index
• PARTSUPP_IDX2 index
• LINEITEM_IDX1 index
• LINEITEM_IDX2 index
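Relocating the chosen objects is straightforward DDL. The sketch below assumes a tablespace named TIER0 has already been created on the SSD array (the tablespace and partition names are illustrative); note that moving a table invalidates its indexes, hence the rebuild:

```sql
-- Move a non-partitioned table and rebuild its index on Tier Zero storage.
ALTER TABLE h_customer MOVE TABLESPACE tier0;
ALTER INDEX customer_idx1 REBUILD TABLESPACE tier0;

-- Partitioned objects such as H_LINEITEM are moved partition by partition
-- (partition name shown is hypothetical).
ALTER TABLE h_lineitem MOVE PARTITION p1 TABLESPACE tier0;
```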
The Results
What happens if we move the selected items from hard disk to Tier Zero storage (in this
case solid-state devices, ranging from 0.2 ms latency for the data and index areas to
0.015 ms for the temporary areas)? The graph in Figure 12 shows the performance of 22
queries with the objects on the SSD arrays and on the original hard drives.
Figure 12: SSD and HD Queries; query times in seconds (log scale, 1 to 10,000) for the 22 queries, comparing the SSDAvg, HDTemp, and HDAvg runs
Figure 12 clearly shows the advantages of Tier Zero storage. On average, the queries
ran five times faster on the Tier Zero storage, with some queries almost 20 times
faster. Also shown in Figure 12 are the results of moving only the temporary segments
to Tier Zero; note the four queries (9, 13, 16, and 18) whose performance improved just
by moving the temporary tablespace. Of course, moving all of the suggested items to SSD
gave the greatest improvement.
Summary
Through the analysis of IO profiles, wait events, and SQL object usage, we were able to
determine which objects would benefit from being moved to Tier Zero storage. AWR
reports and some custom queries made that determination fairly easy. The final test of
our selections, actually moving the objects to Tier Zero, provided a fivefold increase in
average performance.