I had certainly hoped to post on a more frequent basis... blogging goal #1 for me: frequency of at least 2x per month... doable I think. Call me on it! :)
In my first ever blog post, I mentioned that I would love to teach a class in sql server. Well, to be honest, I have already been tasked with helping the other members of my team at work with performance tuning (based in part on the fab class that I was able to attend IEPTO1... sqlskills.com... check it out). So in a way I have started - although my goal is to branch out of my own company in terms of teaching, this is a great place to get my feet wet. I thought I would start my SQL blog series based, in part, on the training sessions that I've been working through with my team.
I am part of a team of developers - we support a couple analytical data stores, I've been tasked with learning as much as I can about performance tuning in a data warehouse environment. (So much more to learn!). One challenge is that much of what we read and see out there specifically related to performance tuning is directed at transactional systems - small quick batches of inserts updates or deletes. However, our issues were more surrounding loading huge quantities of data from one place to another. Then placing indexes such that we do not bring down the load process, yet make the read process quick and easy and painless. Does that sound familiar to anyone? Please please, share your thoughts here with me!! Your best tips and tricks! In the meantime, I will do my best to share what I am learning for anyone interested.
I started my team training sessions with basic understanding of index concepts. It seemed that everyone on the team was able to take something away from this discussion, so I thought I'd share!
Heap:
The definition of a heap is a table without a clustered index. So what does that mean?
I imagined it to be like my "to do" pile in my office at home (ok, at work too ;). Items often get put on the pile (the heap) in the order in which they are received. It's pretty easy to toss another item on the pile - not much work involved there. Unfortunately, I do need to go and find things in my heap every now and then and it does take a while, because I usually have to scan the whole thing in order to find the one I need.
Similar to a heap in SQL Server - it's pretty easy to toss an item (or 10 or 10,000 items) onto the heap, but (assuming no other filing system or index) a bit harder to find when I need to go back and retrieve something. There are other "gotchas" in a sql server heap as well that make reads take longer. I will get into those in a future post...(I need to gather lots of ideas;feel free to post a comment/suggestion to help me out with my ideas list!)
Clustered Index:
OK - I will admit that as I was learning the terminology around indexing in SQL Server, I struggled with clustered vs. non-clustered. To me - "cluster" infers a group of the same thing (take peanut clusters for example... little groups of peanuts surrounded by delicious chocolate), so a clustered index should infer something like that - groupings of my data based on the column(s) I define. Well, not exactly. So if you are like me - put that idea out of your head and carry on. If you are not like me, good, carry on.
Physical order to data
When you define a clustered index you are defining a specific physical order to your data based on your key columns. I like to think of a clustering key as the page numbers (and maybe chapter numbers) in a book. It allows you to quickly get to the page you were looking for. Imagine trying to find a specific chapter or page if it all was just mixed together and looked just like the rest of the words (data) in the book. You'd have to SCAN the whole book to find what you wanted - every time. If your key columns would not be unique, then SQL Server will "uniquify" your clustering key (this adds an additional 4 bytes per row to your clustering key and so in general, it's good to make your clustering keys unique where possible). This "uniquifier" allows SQL server to lookup that specific record when needed.
So, the clustering
key is the page number (or chapter, or chapter/page combination) and this is the data that will make its way up the upper
B-tree structure in your index - the wider the key, the bigger the tree. The bigger the tree the longer it takes to find that leaf you are looking for. The "leaf" level of a clustered index IS your data in order of clustering key columns - all of the data is there, not just the key columns.
Non-Clustered Index - Just a pile of peanuts?:
OK - sorry - I should not have made refrence to my terminology reference from above!
Certainly you have a book at home (or work) that includes a table of contents, or an index in the back of the book. This is the best thing to compare to a non-clustered index.
- It includes "key" references: chapter, key
word (i.e. types of animals, types of food)
- Non-clustered index key column(s).
- In the table of contents or index, it is in order by
key references (chapter 1, 2, 3; aardvark, aligator...zebra, etc.)
- So is a non-clustered index in order by the key
columns.
- This can be a different order than the clustered index
(but note that the non-clustered index does not include ALL data from
your table).
- It includes a reference back to the page number so that
it can easily look up what might be there.
- A non-clustered index will always include a reference back
to the clustering key
- It takes up space in the book.
- Yes, non-clustered indexes take up space!
Yes, a non-clustered index CAN be placed on a heap, and that table will still be a heap.
OK - SUMMARY:
Heap: good for inserting data, querying can take longer
Clustered index: physical order to data based on clustered index key
Non-clustered index: quick reference to help SQL Server find data in the heap or clustered index
This was relatively basic stuff. Overall, I try to keep explanaions less than super-technical in order that people can visualize and hopefully understand without the technical getting in the way.
I welcome your feedback!!!
'til next time!
Sarah :)