Database Administrators or Developers … Who is Responsible for Indexing?

To answer that question, I am going with “it depends” and “both”!

The reason I answer that way, is that anyone designing, building, querying, or maintaining database objects needs to have a grasp of indexing and how it will affect your work.  If you don’t have this understanding, whatever task you are doing will not be accomplished in the best way.  For example, can you really design and build tables if you don’t know what clustered and non-clustered indexes are?  Or will you be able to write fast, efficient, and scalable queries without understanding how a covering index can help?  Or if you don’t fully grasp the concept of indexing, will you really understand how crucial it is to maintain them?

I would liken the need for a database professional to understand indexing similar to a principle I learned when playing little league baseball.  I was always told to “keep your eye on the ball”.  This principle was just as applicable when I was hitting or catching.  Those who excel at baseball understand and apply this principle because it is considered one of the fundamentals of baseball.  In a similar sense, those who want to excel at being a database professional need to understand indexing, as it should be considered one of the fundamentals of databases.  Now, you could work on databases without knowledge of indexing, just like you could play baseball without keeping your eye on the ball, but you are not going to be as good as you could be or reach your full potential.

In this post, I am NOT going to cover indexing basics (ie. types, creation, etc.), index maintenance, or even how to find missing and unused indexes.  Many of those things I covered in a previous post (https://sqlbadboy.wordpress.com/2012/12/13/find-data-quickly-by-using-indexes/).  The real focus of this post will be to show you how to determine if the index choices you are making are working (i.e. measuring the benefits).

Now, let’s make an assumption regarding indexes.  Generally, there are certain indexes that you will almost always want to have:

  • Clustered index on each table
  • Non-clustered indexes for frequently used queries

So, how do we determine the impact (or measure the benefits) of adding these indexes?

Determining the impact of a clustered index

There’s a good white paper that compares tables that are organized with a clustered index versus tables that are a heap (i.e. table that lacks a clustered index) that is definitely worth reading.  (Comparing Tables Organized with Clustered Indexes versus Heaps: http://technet.microsoft.com/library/Cc917672).  It summarizes the advantages and disadvantages and some differences in performance characteristics between the two.  It also provides best practice recommendations on the merits of the two types of table organization, along with examples of when you might want to use one or the other.

However, I have found the best way to truly learn something is to try it out and prove it to yourself.  I am going to use the AdventureWorks2012 sample database for the example (Adventure Works for SQL Server 2012: http://msftdbprodsamples.codeplex.com/releases/view/55330).

1. Create a copy of the Person.Person table:

SELECT *
INTO Person.Person_NoIndexes
FROM Person.Person

2. Use the SET STATISTICS IO ON (http://msdn.microsoft.com/en-us/library/ms184361.aspx) to measure what is “happening under the covers”.  We are not going to use the timings that come back in management studio to determine anything because they can be misleading at times.

SET STATISTICS IO ON

3. Imagine that the BusinessEntityID is like an employee number (Disclaimer: I know there is a HumanResources.Employee table that has a NationalIDNumber field, but I am making that assumption for simplicity sake.).  Let’s run the following queries to show the difference between a table organized as a heap and one with a clustered index:

SELECT *
FROM Person.Person_NoIndexes
WHERE BusinessEntityID = 5

SELECT *
FROM Person.Person
WHERE BusinessEntityID = 5

The results will look like the following and we want to focus in on logical reads, which is the number of pages read from the data cache.

      • Table ‘Person_NoIndexes’. Scan count 1, logical reads 3807, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
      • Table ‘Person’. Scan count 0, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

In the first query with no clustered index, the database engine does a table scan, which essentially looks at all the data pages for that table.  In the second one, it only needs to read the page where the data is found, and the index pages to help it find the data.  Clearly, if we are querying using the clustered index, we are going to read less data pages and this will make the query more efficient and scalable.

4. Not let’s prove the scalability.  Here’s what we will do: query the table, double the table size, query it again, create a clustered index and query it again:

-- query the table
SELECT *
FROM Person.Person_NoIndexes
WHERE BusinessEntityID = 5

-- double the size
INSERT INTO Person.Person_NoIndexes(BusinessEntityID, PersonType, NameStyle ,Title, FirstName, MiddleName, LastName, Suffix, EmailPromotion, AdditionalContactInfo, Demographics, rowguid, ModifiedDate)
SELECT BusinessEntityID = BusinessEntityID + 30000, PersonType, NameStyle ,Title, FirstName, MiddleName, LastName, Suffix, EmailPromotion, AdditionalContactInfo, Demographics, rowguid, ModifiedDate
FROM Person.Person_NoIndexes

-- query the table
SELECT *
FROM Person.Person_NoIndexes
WHERE BusinessEntityID = 5

-- create a clustered index
ALTER TABLE Person.Person_NoIndexes ADD  CONSTRAINT PK_Person_NoIndexes_BusinessEntityID PRIMARY KEY CLUSTERED (BusinessEntityID ASC) ON [PRIMARY]

-- query the table
SELECT *
FROM Person.Person_NoIndexes
WHERE BusinessEntityID = 5

Here are the results for the “query the table” steps:

    • Table ‘Person_NoIndexes’. Scan count 1, logical reads 3807, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    • Table ‘Person_NoIndexes’. Scan count 1, logical reads 7612, physical reads 0, read-ahead reads 7, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    • Table ‘Person_NoIndexes’. Scan count 0, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

As you can see, without a clustered index, your query performance will gradually degrade as the table gets larger and larger, since it will have to keep reading in more and more pages.  However, when you are querying with the clustered index, even if the data gets larger, you are still reading a limited number of pages.

So, it should be obvious as to why you would always want to have a clustered index.

Determining the impact of a non-clustered index

Let’s query our copied table and notice the results:

SELECT BusinessEntityID, FirstName, MiddleName, LastName
FROM Person.Person_NoIndexes
WHERE LastName = 'Smith'

Here are the results from this query:

  • Table ‘Person_NoIndexes’. Scan count 5, logical reads 8361, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Now, if you enable the execution plans (Displaying Graphical Execution Plans: http://msdn.microsoft.com/en-us/library/ms178071(v=sql.105).aspx), it shows something quite interesting:

Pict

Although the query is using the clustered index, it recommends that we add a non-clustered index for better performance.  This information is also available through the DMVs, which I wrote about in a previous post (https://sqlbadboy.wordpress.com/2012/12/13/find-data-quickly-by-using-indexes/).  It even gives us the “impact”, which is defined as the “average percentage benefit that user queries could experience if this missing index group was implemented. The value means that the query cost would on average drop by this percentage if this missing index group was implemented.”

Here’s an example that will show the benefit gained from adding a non-clustered index:

-- query the table
SELECT BusinessEntityID, FirstName, MiddleName, LastName
FROM Person.Person_NoIndexes
WHERE LastName = 'Smith'

-- create a clustered index
CREATE NONCLUSTERED INDEX IX_Person_NoIndexes_LastName ON Person.Person_NoIndexes(LastName ASC) ON [PRIMARY]

-- query the table
SELECT BusinessEntityID, FirstName, MiddleName, LastName
FROM Person.Person_NoIndexes
WHERE LastName = 'Smith'

Here are the results for the “query the table” steps:

  • Table ‘Person_NoIndexes’. Scan count 5, logical reads 8361, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
  • Table ‘Person_NoIndexes’. Scan count 1, logical reads 643, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

It is quite evident that this non-clustered index is speeding up this query by reading less data pages.  And as the database grows this will become more pronounced.

In this post, I tried to simplify the explanation of how indexes help your queries.  And keep in mind that the more complex your queries, the more complicated your index strategies become.  Database indexing can be a deep and complicated subject, but that should not cause anyone to shy away from it.

So, if you want to excel as a database professional, whether you are an administrator or developer, make sure you view indexing as a fundamental part of you work.

Advertisements
This entry was posted in DBA, Developer, TSQL and tagged , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s