VMware NTP timekeeping considerations in depth – Intro

Accurate timekeeping is important in almost every environment.  If time is not synced across your environment, authentication errors can occur, services and applications may not function properly, and timestamps on event logs and alerts can be off, which hinders troubleshooting.  You’re probably already aware that this is a big deal.  Beyond just referencing KB articles, I want to spend some time discussing NTP timekeeping in general, as well as the practical methods and strategies that work and, in my experience, the ones that don’t.

This will be a series of posts to try to address all major considerations with timekeeping via NTP, beginning with timekeeping within virtual machines.

NTP – Accuracy vs. Internal Synchronization

Obviously, you need your internal environment to have accurate time and to stay synced with its authentication sources.  Protocols like Kerberos, for good reason, don’t tolerate much clock skew, in order to protect against authentication replay attacks.  For example, Active Directory’s default tolerance for clock skew is five minutes.
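To make the five-minute tolerance concrete, here is a minimal Python sketch of the kind of comparison involved.  It only illustrates the arithmetic and is not how Kerberos or Active Directory implement the check; the function name `within_tolerance` and the sample timestamps are my own assumptions.

```python
from datetime import datetime, timedelta, timezone

# Active Directory's default clock skew tolerance, as mentioned above.
MAX_SKEW = timedelta(minutes=5)

def within_tolerance(client_time, server_time, max_skew=MAX_SKEW):
    """Return True if the two clocks differ by no more than max_skew."""
    return abs(client_time - server_time) <= max_skew

# Hypothetical sample: a domain controller's clock and two client clocks.
dc_time = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
client_ok = dc_time + timedelta(minutes=4)    # four minutes fast
client_bad = dc_time - timedelta(minutes=6)   # six minutes slow

print(within_tolerance(client_ok, dc_time))   # True
print(within_tolerance(client_bad, dc_time))  # False
```

Note that the check only compares the two clocks with each other.  Neither one has to be right about the real time.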

But sometimes those two goals, accuracy and internal synchronization, conflict with each other.  In those cases, which is more important?  For almost all environments, the priority should be keeping clocks in sync with each other over how accurate those clocks actually are.

Why?  Simple – application and service availability.  Chances are, if clocks are skewed too much within your environment, services and applications will become inaccessible to some or all users.  In the real world, a lack of razor-sharp clock accuracy is usually an annoyance, not a downtime event.  Obviously, that may not be the case for everyone, such as real-time stock trading companies, but for most environments it holds true.  So when you configure timekeeping, usually through NTP, and you must choose between better internal synchronization and better accuracy relative to the real time, choose synchronization.

When would these goals come into conflict?  As an example, VMs could be set to synchronize their clocks with their host via VMware Tools, or they could be configured within their OS to use an external NTP server.  It’s theoretically possible that, for some reason, your ESXi hosts’ clocks are more trustworthy than your Domain Controllers’ more often than not.  For most customers though, even if that were true, prioritize synchronization over clock accuracy.  Allowing VMware Tools to sync each VM’s clock to its host effectively means VMs running on different hosts could have different times.  Maybe the NTP service stopped on one ESXi host.  Maybe the hosts aren’t configured consistently.  It doesn’t matter why.  Prioritize synchronization instead by configuring each VM’s OS to synchronize to the same set of NTP servers, somehow, some way.
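The difference between being accurate and being in sync can be sketched with a little arithmetic.  The Python below is only an illustration: the machine names and offsets are made up, and each offset is the number of seconds a clock differs from the real time.

```python
# Five-minute tolerance, expressed in seconds.
MAX_SKEW_SECONDS = 5 * 60

def internal_spread(offsets):
    """Largest difference between any two clocks in the group."""
    return max(offsets.values()) - min(offsets.values())

def worst_accuracy_error(offsets):
    """Largest distance of any single clock from the real time."""
    return max(abs(value) for value in offsets.values())

# Hypothetical environment 1: every clock is about seven minutes fast, together.
synced_but_inaccurate = {"dc01": 420, "app01": 421, "db01": 419}

# Hypothetical environment 2: most clocks are right, one VM drifted with its host.
mostly_accurate = {"dc01": 0, "app01": 1, "db01": 360}

for name, offsets in [("synced_but_inaccurate", synced_but_inaccurate),
                      ("mostly_accurate", mostly_accurate)]:
    spread = internal_spread(offsets)
    error = worst_accuracy_error(offsets)
    print(name, "spread:", spread, "worst error:", error,
          "within tolerance:", spread <= MAX_SKEW_SECONDS)
```

The first environment is wrong about the real time by seven minutes, yet its clocks are within two seconds of each other.  The second is mostly accurate, but its spread of 360 seconds exceeds the tolerance.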

How many NTP servers, and which ones?

When configuring anything for NTP, whether it be an ESXi host or a guest, the question always comes up – how many servers should an NTP client be set to use?

Most people know the obvious answer.  More than one, right?  Of course.  Providing more than one offers redundancy in case an NTP server fails.  However, I’ve encountered many environments where just two were configured.  Three would be better for resiliency alone, but configuring exactly two NTP servers carries a risk beyond that.

Remember that NTP clients function by polling all of their configured NTP servers and then adopting the time that has the most consensus among them.  If only two NTP servers are configured and they provide different values, there is no majority, so the client cannot tell which one is wrong.  In a scenario where NTP server 1 is off by twenty minutes but NTP server 2 is correct, the client may follow the wrong server or land on a compromise such as ten minutes off.  Either result is incorrect, and worse, it may cause clock skew within the environment.  I recommend you instead use an odd number of NTP servers greater than one, and generally speaking, the more the merrier.
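The consensus idea can be sketched as a simple majority vote.  To be clear, this is not the actual NTP selection algorithm, which is considerably more involved; it is a simplified illustration, and the function name `consensus_offset` and the tolerance value are my own assumptions.  Offsets are in seconds.

```python
from statistics import mean

def consensus_offset(offsets, tolerance=2):
    """Return the mean of the largest group of offsets that agree within
    `tolerance` seconds, or None if that group is not a strict majority."""
    ordered = sorted(offsets)
    best = []
    for i, low in enumerate(ordered):
        # Collect every offset that lies within `tolerance` of this one.
        group = [x for x in ordered[i:] if x - low <= tolerance]
        if len(group) > len(best):
            best = group
    # A strict majority means more than half of all servers agree.
    if len(best) * 2 > len(ordered):
        return mean(best)
    return None

# Two servers: one is twenty minutes (1200 seconds) off, one is correct.
print(consensus_offset([1200, 0]))      # None

# Add a third server that agrees with the correct one.
print(consensus_offset([1200, 0, 1]))   # 0.5
```

With two servers that disagree, neither side has a majority, so the sketch gives up.  With three, the bad server is simply outvoted.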

But which ones?  Diversity that improves availability is good, but diversity that makes disparate time values more likely is bad.  For example, using NTP servers that sit on separate compute, separate storage, and separate physical sites is good.  Mixing internal and external NTP servers that are managed by different people on the same NTP client is generally bad, although it might be the best alternative among non-optimal choices.

In my next post in this series, I’ll go into specifics on how I generally apply these considerations to VMware environments.