You are on page 1of 14

Introduction to Indexes

By Gail Shaw, 2011/01/07 (first published: 2009/10/26)

Total article views: 38289 | Views in the last 30 days: 3490


Rate this | Join the discussion | Briefcase | Print

Good indexes are the key to good performance in SQL Server and the key to creating good
indexes is to understand what indexes are and how SQL Server uses them to evaluate queries.

In this first part of a three part series, I’m going to look at the very basics of what indexes are,
what types exist in SQL and how they’re used.

What is an index?
An index is a structure within SQL that is used to quickly locate specific rows within a table. It
can be useful to imaging an index at the back of a textbook when thinking about SQL indexes.
They both serve the same purpose – to find specific information quickly.

General Structure

An index is defined on one or more columns, called key columns. The key columns (also
referred to as the index key) can be likened to the terms listed in a book index. They are the
values that the index will be used to search for. As with the index found at the back of a text
book (see figure 1), the index is sorted by the key columns.

Figure 1: Book index. Image copyright Simple Talk publishing.

If an index is created with more than one key column, it is known as a composite index.
The general structure of an index is that of a balanced tree (b-tree). The index will have a single
root page, zero or more intermediate levels and then a leaf level. A page is an 8 kilobyte chunk
of the data file, with a header and footer and is identified by a combination of File ID and Page
number.

Figure 2: Index Structure

Note: Commonly the root page is shown at the top of the tree diagram and the leaf pages at the
bottom. Think of it as an inverted tree.

In the leaf level, there’s one entry for each row in the index1. The entries in the index are ordered
logically2 in the order of the index key.

The non-leaf levels of the index contain one row per page of the level below, referencing the
lowest index key value on each page. If all of those rows fit onto a single page, then that page is
considered the root and the index is only two levels deep. If all of those rows will not fit on a
single page, then one (or more) intermediate levels are added to the index.

The number of levels in an index is referred to as the depth of the index. This is an important
consideration for evaluating the efficiency of the index. The index illustrated in figure 2 has a
depth of 3.

(1) With the exception of SQL 2008’s filtered indexes, an index will have the same number of
rows at the leaf level as the table.

(2) I’m using the phrase ‘logically ordered’ because the index does not necessarily define the
physical storage of the rows. The rows are stored in a way that SQL can retrieve them ordered.

Clustered and nonclustered


There are two main types of indexes in SQL Server, the clustered index and the nonclustered
index
Clustered indexes define the logical order of the table. The leaf level of the clustered index has
the actual data pages of the table. Because of this there can only be one clustered index per table.
A table that does not have a clustered index is referred to as a heap.

Nonclustered indexes are separate from the table. The leaf level of a nonclustered index has a
pointer as part of each index row. That pointer is either the clustered index key in the cases
where the base table has a clustered index or the RID (Row Identifier) in the cases where the
table is a heap. The RID is an 8-byte structure comprised of File ID, Page Number and Slot
Index and will uniquely identify a row in the underlying heap. Either way, the each row of a
nonclustered index has a reference to the complete data row.

Index Limits
There are a number of built-in limitations on indexes

Key size

The size of an index key is limited to a maximum of 900 bytes and a maximum of 16 columns.
This is definitely a limit, not a goal, as the larger the index key gets, the more pages in the index
and the deeper the index tree. As the number of pages and the depth of the tree increases so the
index becomes less efficient to use. Larger indexes also use more storage space and result in less
efficient use of SQL’s data cache.

Number of indexes

In SQL 2005 and earlier there was a limitation of 250 indexes per table, one clustered and 249
non-clustered. In SQL 2008, with the addition of filtered indexes, that limitation was increased to
1000, one clustered and 999 non-clustered indexes.

Both of these limits are very high and there are few circumstances where a well-designed system
should approach that limit.

The reason for this is twofold.

· As the number of indexes increases so the total size occupied by the table (with all of its
indexes) increases. Sure, hard drives are cheap and storage is abundant but increasing the size of
a database has other effects, Maintenance operations (backups, restores, consistency checks and
index rebuilds) all take longer as the size of a database increases.

· Indexes have to be kept up to date as data changes and the more indexes there are on a
table, the more places the data has to be changed. If there are 10 nonclustered indexes on a table,
an insert must be done in 11 places (the table and each of those nonclustered indexes). On
databases that are mostly read-only (decision support, data warehouses) that may be acceptable.
On databases that have frequent inserts, updates and deletes (OLTP systems), the overhead
impose by multiple indexes may not be acceptable
How SQL uses indexes
If a table does not have index, the only way to find all occurrences of a value within a table is to
read the entire table. If a table has an index, it speeds up the locating of values within that index
in two ways.

1. The index is sorted in the order of the key columns. This means that once all the matching
values have been found, the remaining portion of the table can be ignored. This is the same as a
telephone directory, where once all entries with a particular surname have been found, the rest of
the book can be ignored as no further matches are possible

2. The tree structure of the index allows a divide-and-conquer approach to locating rows,
where large portions of the table can be quickly excluded from the search. This is illustrated in
Figure 3

There are four basic operations that SQL can do on an index. It can scan the index, it can seek on
the index, it can do lookups to the index and it can update the index

Scans

An index scan is a complete read of all of the leaf pages in the index. When an index scan is
done on the clustered index, it’s a table scan in all but name.

When an index scan is done by the query processor, it is always a full read of all of the leaf pages
in the index, regardless of whether all of the rows are returned. It is never a partial scan.

A scan does not only involve reading the leaf levels of the index, the higher level pages are also
read as part of the index scan.

Seeks

An index seek is an operation where SQL uses the b-tree structure to locate either a specific
value or the beginning of a range of value. For an index seek to be possible, there must be a
SARGable3 predicate specified in the query and a matching (or partially matching) index. A
matching index is one where the query predicate used a left-based subset of the index columns.
This will be examined in much greater detail in a part 3 of this series.

The seek operation is evaluated starting at the root page. Using the rows in the root page, the
query processor will locate which page in the next lower level of the index contains the 1st row
that is being searched for. It will then read that page. If that is the leaf level of the index, the seek
ends there. If it is not the leaf then the query processor again identifies which page in the next
lower level contains the specified value. This process continues until the leaf level is reached.

Once the query processor has located the leaf page containing either the specified key value or
the beginning of the specified range of key values then it reads along the leaf pages until all rows
that match the predicate have been returned. Figure 2 shows how a seek would be done on an
index when searching for the value 4.

If the index contains all the columns that the query needs, then the index is said to be covering
for that query. If the index does not contain all the columns then SQL will do a lookup to the
base table to fetch the other columns in order to process the query.

(3) SARGable is a made-up word, constructed from the phrase Search ARGument. It refers to a
predicate that is of a form that SQL can use for an index seek. For more details see:
http://www.sql-server-performance.com/tips/t_sql_where_p2.aspx

Lookups

Lookups occur when SQL uses an index to locate rows affected by a query but that index does
not contain all the columns required to satisfy the query, that is the index is not covering for that
query. To fetch the remaining columns, SQL does a lookup to either the clustered index or heap.

A lookup to a clustered index is essentially a single row clustered index seek and it is always a
single row seek. So if a lookup is needed for 500 rows, that involves 500 individual clustered
index seeks.

Updates

Anytime that a row is changed, those changes must be made not only in the base table (clustered
index or heap) but also in any index that contains the columns that were affected by the change.
This applies to insert, update and delete operations.

Considerations for creating indexes


I’ll be going into more detail on considerations for indexes in the next two parts, but in general:

• Clustered index should be narrow, because the clustering key is part of all nonclustered
indexes.
• Composite nonclustered indexes are generally more useful than single column indexes,
unless all queries against the table filter on one column at a time.
• Indexes should be no wider than they have to be. Too many columns wastes space and
increases the amount of places that data must be changed when an insert/update/delete
occurs.
• If an index is unique, specify that it is unique. The optimiser can sometimes use that
information to generate more optimal execution plans.
• Be careful of creating lots of indexes on frequently modified tables as it can slow down
data modifications.

In part 2 of this series I’ll be looking in more detail into clustered indexes. What they are, how
they differ from nonclustered indexes and what the considerations are for creating a clustered
index.

Introduction to Indexes: Part


2 – The clustered index
By Gail Shaw, 2011/01/14 (first published: 2009/11/11)

Total article views: 21334 | Views in the last 30 days: 879


Rate this | Join the discussion | Briefcase | Print

Good indexes are the key to good performance in SQL Server and the key to creating good
indexes is to understand what indexes are and how SQL Server uses them to evaluate queries.

In the previous article I looked at the very basics of what indexes are, what types exist in SQL
and how they’re used. In this article I’m going to take a closer look at clustered indexes, what
makes them different from nonclustered indexes, how SQL uses clustered indexes and some of
the recommendations for selecting the clustered index key.

What is a clustered index?


A clustered index is an index where the leaf level of the index contains the actual data rows of
the table. Like any other index, a clustered index is defined on one or more columns – the index
key. The key columns for a clustered index are often referred to as the clustering key.

The clustered index can be looked at as a balanced tree (b-tree) structure built on top of a table,
with the rows in the table stored logically in the order of the clustering key (see figure 1). The
clustered index essentially is the table and hence there cannot be more than one on a table.
Introduction to Indexes: Part 3 – The
nonclustered index
By Gail Shaw, 2011/01/21 (first published: 2009/11/18)

Total article views: 16286 | Views in the last 30 days: 806


Rate this | Join the discussion | Briefcase | Print

Good indexes are the key to good performance in SQL Server and the key to creating good
indexes is to understand what indexes are and how SQL Server uses them to evaluate queries.

In the previous two articles I looked at the very basics of what indexes are, what types exist in
SQL and how they’re used and then took a closer look at the clustered index.

In this article I’m going to take a closer look at nonclustered indexes, how SQL uses
nonclustered indexes and some of the recommendations for selecting useful nonclustered
indexes.

What is a nonclustered index?


A nonclustered index is the second type of index in SQL Server. They have the same b-tree
structure as the clustered index (see previous article). Unlike the clustered index nonclustered
indexes does not contain the entire data row at the leaf level. Rather, the nonclustered index
contains just the columns defined in the index, and a pointer to the actual data row. See figure 1
for a high-level architecture. Assume that the nonclustered index depicted there is based off the
clustered index depicted in the previous part.
Figure 1: Nonclustered index structure

When the underlying table is a heap (has no clustered index) then the pointer to the data row is
the RID, an 8 byte structure comprised of the File ID, Page No and Slot index, i.e. the actual
location of the row within the data file.

When the underlying table has a clustered index, the pointer to the actual data row is the
clustering key. As mentioned in the previous article, this has important considerations for the
selection of a clustering key.

There can be multiple nonclustered indexes on a table. In SQL 2005 and earlier, there was a limit
of 249 nonclustered index per table. In SQL 2008, with the introduction of filtered indexes, the
limit on indexes has been increased to 999.

Include columns
Columns specified as include columns are stored at the leaf level of the nonclustered index, but
not at the intermediate or root levels. They are not part of the index key and do not count towards
the 16 column/900 byte limit on indexes. Any data type other than the old LOB types (TEXT,
NTEXT, IMAGE) are allowed for include columns, although columns defined as
varbinary(MAX) Filestream are not permitted as include columns. The ability to specify include
columns was added in SQL 2005.

Include columns are useful in that they are columns available in the index but not contributing to
the size of the index key hence they make it easier to create covering indexes (see next section)
than could be done without them.

Typically columns that appear only in the select clause of a query and not within the WHERE,
FROM or GROUP BY are candidates for INCLUDE columns.

If LOB columns are specified as include columns, the entire contents of the LOB columns are
duplicated. It’s not just a pointer to the existing LOB pages added to the nonclustered index.

The ability to specify include columns was added in SQL 2005.

Covering indexes
Covering is not a property of an index, it is not possible to say, without additional information
that a query is covering or not covering. Instead, covering deals with an index and a query
together.

An index is said to be covering for a specific query if that index contains within it all of the
columns necessary for the query. The columns may be part of the key or they may be include
columns. This means that a query that has a covering index can be evaluated completely from
that index, without needing to go to the base table (heap or cluster) to fetch additional columns.
It has been said before that creating a covering index for a query is just about the best way to
speed a query up. This is true. It is not always possible or desirable to cover every query.
Covering every query may lead to unacceptably large indexes or unacceptably large numbers of
indexes with all the attendant problems (as mentioned in part 1)

It is possible (and often desirable) for an index to be covering for more than one query.

Filtered indexes
Filtered indexes are a new feature in SQL 2008 and they allow for an index to contain only some
of the rows in the table. Prior to this, an index would always have, at the leaf level, the same
number of rows as the table. With filtered indexes, an index can be based on just a subset of the
rows.

When creating a filtered index, a predicate is specified as part of the index creation statement.
There are several limitations on this predicate. The comparison cannot reference a computed
column, a column that is declared as a user-defined type, a spatial column or a hierarchyid
column.

A clustered index cannot be filtered as it is the actual table.

How a nonclustered index is used


Seek

For SQL to do a seek on a nonclustered index, the query must have a SARGable1 predicate
referencing the index key or a left-based subset of the index key. In addition to this, the index
must be either covering or return sufficiently small number of rows that the required lookups to
the clustered index/heap (to retrieve the remainder of the columns) is not considered too
expensive by the optimiser. If the required lookups are considered too expensive then the index
will not be used.

It usually works out that the number of rows where the optimiser decides that key/RID lookups
are too expensive is somewhere around 0.5%-1% of the total rows in the table.

(1) - SARGable is a made-up word, constructed from the phrase Search ARGument. It refers to a
predicate that is of a form that SQL can use for an index seek. For more details see:
http://www.sql-server-performance.com/tips/t_sql_where_p2.aspx

Scan

An index scan is a read of all of the leaf pages in the nonclustered index. A scan of a
nonclustered index is generally more efficient than a scan of the clustered index, because
nonclustered indexes are generally smaller.
A nonclustered index scan usually indicates that the index contains all the columns required by
the query but they are in the wrong order, the query predicates are not SARGable or there are no
query predicates.

Update

When a change is made to a column that is either a key column or an include column of a
nonclustered index, that index will be modified as part of the insert statement. Indexes are never
left out of sync with the underlying table data.

Inserts and deletes will affect all nonclustered indexes on a table.

Considerations for selecting nonclustered indexes


The decision as to what columns should be indexed should be based on the queries that are run
against the table. There’s no point in indexing a column that is never used in a query.

Selectivity

In general, a nonclustered index should be selective. That is, the values in the column should be
fairly unique and queries that filter on it should return small portions of the table.

The reason for this is that key/RID lookups are expensive operations and if a nonclustered index
is to be used to evaluate a query it needs to be covering or sufficiently selective that the costs of
the lookups aren’t deemed to be too high.

If SQL considers the index (or the subset of the index keys that the query would be seeking on)
insufficiently selective then it is very likely that the index will be ignored and the query executed
as a clustered index (table) scan.

It is important to note that this does not just apply to the leading column. There are scenarios
where a very unselective column can be used as the leading column, with the other columns in
the index making it selective enough to be used.

Single column vs. Multi-column indexes

In general, wider nonclustered indexes are more useful than single column nonclustered indexes.
This is because it is very unusual for SQL to use multiple nonclustered indexes on the same table
to evaluate a query. An index defined over more than one column is referred to as a composite
index.

Consider a hypothetical table (Table1) that has three indexes on it, one on Col1, one on Col2 and
one on Col3. If a query with predicates on all three of those columns is executed against that
table, it is exceedingly unlikely for SQL to seek on all three indexes and intersect the results. It is
far more likely that one of the indexes will be used; the predicate against the column within the
selected index evaluated then the rest of the columns retrieved (from the clustered index) and
then the remaining two predicates evaluated.

Were as, if there was a single index defined with all three columns, the matching rows could be
located as a single seek operation.

Figure 2: Three separate indexes

Figure 3: One composite index

Column order
The order of the columns in an index key should be chosen based on several factors: the
selectivity of the columns, what % of queries will filter on that column and whether the queries
will filter with equality or inequality matches.

If the query predicate is based on multiple columns and there is an index where the combination
of columns forms a left-based subset of the index key, then retrieving the rows is a single seek
operation. If the query predicate is based on multiple columns and there is an index that contains
those columns as part of the index key, but they do not all form a left-based subset of the index
key, then retrieving those rows is a 2-step process; seek on the columns that do form a left-based
subset and then filter out rows that don’t match the other conditions.

Let’s look at a quick example.

Imagine we have a hypothetical table named Table1 and that there is an index on that table with
the index key consisting of three columns – Col1, Col2 and Col3.

CREATE INDEX idx_Table1_Demo


ON Table1 (Col1, Col2, Col3)

Let’s say further that there are two queries against that table.

SELECT Col1 FROM Table1 WHERE Col1 = @Var1 AND Col2 = @Var2

This query can be evaluated by a single seek operation against the two columns.
SELECT Col1 FROM Table1 WHERE Col1 = @Var1 AND Col3 = @Var3

This query cannot be evaluated by a single seek operation against the two columns. Rather it has
to be done as a seek just on one column, Col1, and the rows that are returned from that seek get
checked to see if they match the second predicate or not.

The other thing that has to be considered is the type of predicates that the queries will be using.

It is often said that the most selective column should be made the leading column of an index.
This is true, but must be considered in light of the other factors. There is little point in making a
column the leading column of an index if only 5% of the queries that use that index filter on that
column. It would mean that only those queries could seek on the index and the other 95% would
either scan or use a different index (if one existed)

In conclusion
This concludes the introductory coverage of indexes. I hope that this has been of use to people.
For further and deeper information there are a few posts on my blog, also see the Inside SQL
Server 2005 books, specifically the 1st book (The Storage Engine) and the 4th book (Query tuning
and optimisation) and SQL Server 2008 Query Performing Distilled.

You might also like