2010-01-09

vCenter database health

We have recently encountered a number of problems with vCenter databases in our production and non-production (dev/test) environments. Some problems and fixes are common to both and one issue has a different fix for version 2.5 vs. 4.0. The problems are related to the size & growth of the database and the size & growth of the database logs.

The production vCenter is version 2.5 U5 and the two non-production vCenter is version 4.0.0 U1. In all cases we are using Microsoft SQL Server 2005 Enterprise Edition x64. The prodution databases are on clustered bare-metal Windows servers and the non-production databases are on non-MSCS virtual machines. In all cases we are using named instances of SQL Server (http://msdn.microsoft.com/en-us/library/ms143531(SQL.90).aspx), rather than depending on the default instance. All instances are patched with SP2.

First the database growth problem. Based on the internal database size calculator in vCenter 2.5 & 4.0 (Administration menu > vCenter Settings ... > Statistics) the the factors that determine size are: number of virtual machines, detail, sample rate, retention and number of hosts. Of these the number of virtual machines, detail and sample rate seem to have the biggest impact. Additionally there is the record of all events & tasks which impacted more by changes to the environment (creating, changing, deleting virtual machines and hosts, etc) which the calculator obviously can not predict. If you have a relatively static environment then the growth of the events & tasks talbles will be smaller than for an environment with constant changes.

But it seems that there is more to the size of the database than just the number of records stored in each of its tables. Our default configuration for databases is to not configure a hard size limit, but let it grow indefinately. More on that later. Obviously the more records that are stored then the bigger the size of the database and more of the file system is consumed. What was not obvious to me was what happens when record(s) are deleted: the database does not automatically shrink.

If you look at the calculator window you will see that statistical data passes through four intervals. The level of detail in the statistics data can be selected so long as the following intervals have the same or lower level of detail. There is also the real-time statistical data collected and displayed in the Performance tab for each host and virtual machine. Using deductive reasoning, some meandering through the database  and some information from VMware support the life cycle of the statistical data is like this:

1) Statistical data is gathered by the host for both ESX/ESXi and all virtual machines running on that host. This is what you see in the Performance tab for each host or virtual machine. These data points are 20 seconds apart based on the resolution of the charts in vCenter and vSphere Client. I will call this interval 0. I'm not sure where this is stored.

2) The interval 0 data points are summarized into data at a definable number of minutes (1, 2, 3 or 5) and definable level of detail and definable retention period (1 to 5 days). I will call this interval 1. This appears to be stored in table VPX_HIST_STAT1.

3) The interval 1 data points (1, 2, 3 or 5 minutes) are summarized into data points that are 30 minutes apart with a definable level of detail. I will call this interval 2. This appears to be stored in table VPX_HIST_STAT2.

4) The interval 2 data points (30 minutes) are summarized into data points that are 2 hours apart with a definable level of detail. I will call this interval 3. This appears to be stored in table VPX_HIST_STAT3.

5) The interval 3 data points (2 hours) are summarized into data points that are 24 hours apart with a definable level of detail and definable retention (1 to 5 years). This appears to be stored in table VPX_HIST_STAT4.

This begs the question: what is doing all of the data summarization at each interval? It is not vCenter, its the database. The rest of this database explanation will be focused on SQL Server 2005 as we do not use Oracle for vCenter. All of our databases have "SQL Server Management Studio" installed, or "SQL Server Management Studio Express". In SSMS, navigate to the "SQL Server Agent" and then "Jobs" there you *should* see a number of jobs: "Past Day stats rollup", "Past Month stats rollup" and "Past Week stats rollup" for vCenter 2.5 and 4.0, and "Event Task Cleanup" for vCenter 4.0. BTW, if these jobs do no exist or are unable to run then you have a problem and should look at http://kb.vmware.com/kb/1004382 on how to re-create these jobs for vCenter 2.5 and 4.0, or http://kb.vmware.com/kb/1000125 to purge old data, or http://kb.vmware.com/kb/1007453 to completely delete old data.

These jobs will read data from one interval table, summarize the data, write it to the next interval table and delete from the original interval table. This constant creating and deleting of records fragments the database occupies more space in the file system than there is data. But that's not all, in addition to the statistical data there is historical data in the form of events and tasks which are stored in tables VPX_EVENT and VPX_EVENT_ARG. In the previous paragraph I mentioned that vCenter 4.0 has the additional job "Event Task Cleanup". These jobs are configurable in the menu Administration > vCenter Settings ... > Database Retention Policy. In vCenter 2.5 and previous, the VPX_EVENT and VPX_EVENT_ARG tables can grow indefinately.

What I learned from all of this is that the database size in the file system can be no where near what the vCenter calculator would tell you. Consequently you can encounter problems with the database filling up its file system (drive letter) or it becomes sluggish.

So, how to fix this? Here is a list of KB articles that VMware support sent me and that I forwarded to my DBA who found them very helpful.

How to increase the performance of the VirtualCenter Database
http://kb.vmware.com/kb/1002825

Investigating the health of a VirtualCenter database server
http://kb.vmware.com/kb/1003979

Troubleshooting transaction logs on a Microsoft SQL database server
http://kb.vmware.com/kb/1003980

Defragmenting VirtualCenter performance data indexes on a Microsoft SQL database
http://kb.vmware.com/kb/1003990

Reducing the size of the VirtualCenter database when the rollup scripts take a long time to run
http://kb.vmware.com/kb/1007453

Purging old data from the database used by vCenter Server
http://kb.vmware.com/kb/1000125
 
My next post will talk about the size and growth of the database logs. Once we have modified our SQL Server maintenance plans for the production and non-production vCenter databases I will post details.