I want to start this off by thanking ControlUp, LoginVSI, and Ryan Ververs-Bijkerk for their assistance with this post.
Building on my last evaluation of the performance impact of Meltdown and Spectre, I was graciously given a trial of LoginVSI, a software product used to simulate user loads, and of ControlUp's cloud-based analytics tool, ControlUp Insights.
This analysis takes into account the differences between operating systems and uses the latest Dell server hardware platform with an Intel Xeon Gold 6150 processor at its heart. This processor supports the PCID and INVPCID instructions, which give the best performance (as of today) with mitigations enabled. However, only Server 2012R2 and Server 2016 can take advantage of these hardware features.
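For the curious, PCID and INVPCID support are reported through specific CPUID feature bits (CPUID.01H:ECX[17] and CPUID.07H:EBX[10] respectively; Windows requires INVPCID to use PCID for the Meltdown mitigation). Here is a minimal sketch of decoding those bits; the sample register values are invented for illustration, as real values would come from executing the CPUID instruction:

```python
# Sketch: decoding the CPUID feature bits relevant to the Meltdown mitigation.
# PCID is reported in CPUID leaf 0x01, ECX bit 17.
# INVPCID is reported in CPUID leaf 0x07, EBX bit 10.

PCID_BIT = 1 << 17     # CPUID.01H:ECX[17]
INVPCID_BIT = 1 << 10  # CPUID.07H:EBX[10]

def has_pcid(ecx_leaf1: int) -> bool:
    """True if the PCID feature bit is set in leaf-1 ECX."""
    return bool(ecx_leaf1 & PCID_BIT)

def has_invpcid(ebx_leaf7: int) -> bool:
    """True if the INVPCID feature bit is set in leaf-7 EBX."""
    return bool(ebx_leaf7 & INVPCID_BIT)

# Hypothetical register values with the relevant bits set --
# a Skylake-SP part like the Gold 6150 reports both features.
sample_ecx_leaf1 = PCID_BIT
sample_ebx_leaf7 = INVPCID_BIT

print("PCID:", has_pcid(sample_ecx_leaf1))       # True on this sample
print("INVPCID:", has_invpcid(sample_ebx_leaf7)) # True on this sample
```

On operating systems older than Server 2012R2, the kernel never uses these features, which is why the newer OSes see a smaller mitigation penalty.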
This test was set up on four hosts: two hosts ran VMs with all of the mitigations enabled, and two ran with all of the mitigations disabled. Odd-numbered hosts had the mitigations disabled; even-numbered hosts had them enabled. I tested live production workloads as well as simulated user loads from LoginVSI. The live production workloads ran on XenApp 6.5 on Server 2008R2, and the simulated workloads ran on XenApp 7.15 CU2 with Server 2008R2, 2012R2, and 2016.
I sorted my testing logically in ControlUp by folder.
Real World Production results
The ControlUp Insights cloud product produced graphs and results that were easy and quick to interpret. These results are for XenApp 6.5, Server 2008R2.
The mitigation-disabled "Host 1" consistently used less CPU than the mitigation-enabled "Host 2". The biggest CPU spread between mitigation-enabled and mitigation-disabled on the Intel Gold 6150s was ~20%.
Another interesting result: IO utilization increased by an average of 100 IOPS on mitigation-enabled VMs, meaning the Meltdown/Spectre mitigations also tax the storage subsystem. This averaged out to a consistent 12% performance hit.
Logon duration increased 75%, from an average of 8 seconds on a mitigation-disabled VM to 14 seconds on a mitigation-enabled VM. The biggest jumps in the sub-metrics were Logon Duration (Other), going from 3s to 5s, and Group Policy time, going from 6s to 8s.
For the applications we have that measure "user interactivity", the reduction in user experience was 18%: an action on a mitigation-enabled VM took an average of 1180ms versus 990ms on a mitigation-disabled VM when measuring actions within the UI.
Honestly, I wish I had had ControlUp Insights when I did my original piece; it tracks additional metrics and presents them much more cleanly than I did. And once the information was available, it was quick to review and compare the various types of results.
LoginVSI was gracious enough to grant me licenses to their software for this testing. Their software simulates user actions, including pauses for coffee and chatting between work like typing and sending emails, reading Word or PDF documents, or reviewing presentations. The suite of applications exercised by the simulated users tends to consist of major applications from established software vendors. It is not representative of the applications that could be impacted the most by Spectre/Meltdown (generally, applications that are registry-event heavy). Regardless, it is interesting to test with these simulated users, as the workload they produce does fall under the spectrum of "real world". As with everything, your mileage will vary, and it is important to test and record your before and after impacts. ControlUp with Insights does an incredible job of this: you can easily compare different periods in time to measure the impacts, or just rely on the machine learning suggestions of their virtual experts to properly size your environment.
Since our production workloads are Windows Server 2008R2 based, I took advantage of the LoginVSI license to test all three available server operating systems: 2008R2, 2012R2, and 2016. Since newer operating systems are supposed to enable performance-enhancing hardware features that can reduce the impact of these vulnerabilities, I was curious as to *how much*. Now I have that information.
I tested user loads of 500, 300, and 100 users across two hosts: one with the Spectre/Meltdown mitigations applied and one without. Each host ran 15 VMs, each VM having 30GB RAM and 6 vCPUs, for a CPU oversubscription of 2.5:1. The host spec was a Dell PowerEdge M640 with Intel Gold 6150 processors and 512GB of memory.
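The 2.5:1 figure falls straight out of the host and VM specs; here is the arithmetic as a quick sketch (the dual-socket, 18-cores-per-socket layout is my assumption about the M640 configuration, since the Gold 6150 is an 18-core part):

```python
# vCPU oversubscription for the test hosts.
vms_per_host = 15
vcpus_per_vm = 6
sockets = 2            # assumed dual-socket Dell M640 blade
cores_per_socket = 18  # Intel Xeon Gold 6150 is an 18-core part

total_vcpus = vms_per_host * vcpus_per_vm    # 15 * 6 = 90 vCPUs
physical_cores = sockets * cores_per_socket  # 2 * 18 = 36 cores

ratio = total_vcpus / physical_cores
print(f"{ratio:.1f}:1 vCPU-to-core oversubscription")  # 2.5:1
```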
2016 – Hosts View
With 500 LoginVSI users in this workload on Server 2016, host CPU was pegged at 100% on both the mitigation-enabled and mitigation-disabled hosts. We can still see the gap in CPU utilization between the two, with the mitigation-enabled host running higher.
With 300 LoginVSI users in this workload on Server 2016, the gap is narrower but still visible.
With 100 LoginVSI users in this workload on Server 2016, the gap is barely visible; the hosts look even.
2012R2 – Hosts View
500 LoginVSI users in this workload on Server 2012R2. There is a much larger gap between the mitigation-enabled and mitigation-disabled hosts, and the non-mitigated Server 2012R2 host doesn't cap out the way Server 2016 does.
300 LoginVSI users in this workload on Server 2012R2. The separation between enabled and disabled is still very prominent.
100 LoginVSI users in this workload on Server 2012R2. Again, the separation is noticeable but appears narrower with lighter loads.
2008R2 – Hosts View
500 LoginVSI users in this workload on Server 2008R2. There is noticeable additional CPU load on the mitigation-enabled host. More interesting: overall CPU utilization appears lower than on 2012R2 or 2016.
300 LoginVSI users in this workload on Server 2008R2. The separation between enabled and disabled is still very prominent.
100 LoginVSI users in this workload on Server 2008R2. I only captured one run and the low utilization makes the difference barely noticeable.
Some interesting results for sure. I took the data and put it into a pivot table to highlight the CPU differences for each workload against each operating system.
This chart highlights the difference in CPU percentage between mitigation-enabled and mitigation-disabled systems. The raw data:
Again, interesting results. 2008R2 shows the largest average CPU separation at 14%, followed by 2012R2 at 11%, and then 2016 at 4%.
One of the things these results highlight is the "headroom" of the operating systems. 2008R2 actually consumes less CPU overall, so it has more room for separation across the three user-load tiers. On 2016, so much time is spent with the CPU pegged at 100% on both hosts that those samples register a difference of "0%". So although the smaller number on Server 2016 may lead you to believe it fares better, it actually doesn't.
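This clipping effect is easy to demonstrate with toy numbers. In the sketch below (all values invented for illustration), the same per-sample mitigation cost produces a 14% average gap when both hosts have headroom, but the gap collapses when both hosts are pinned at 100%:

```python
# How CPU capping hides the mitigation cost in averaged differences.
# When both hosts are pegged at 100%, the per-sample difference is 0%
# even though the mitigated host has simply run out of headroom.
# All sample values below are invented for illustration.

uncapped_disabled = [40, 45, 50]   # % CPU, mitigations off
uncapped_enabled  = [54, 59, 64]   # % CPU, mitigations on: +14 each

capped_disabled = [100, 100, 95]
capped_enabled  = [100, 100, 100]  # pegged: most samples differ by 0

def avg_gap(enabled, disabled):
    """Mean per-sample CPU difference, enabled minus disabled."""
    return sum(e - d for e, d in zip(enabled, disabled)) / len(enabled)

print(avg_gap(uncapped_enabled, uncapped_disabled))  # 14.0
print(avg_gap(capped_enabled, capped_disabled))      # ~1.67
```

The smaller measured gap on the capped host says nothing good about its efficiency; it just means the difference has nowhere to show up.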
This shows it a little more clearly: with mitigations *enabled*, Server 2008R2 can host 500 users at a lower average CPU load than Server 2016 can host 300 users with mitigations *disabled*.
From the get-go, Server 2016 appears to consume twice the CPU of Server 2008R2 in all non-capped scenarios, with Server 2012R2 somewhere in between.
When we compare the operating systems against the different user counts we see the impact the operating system choice has on resources.
Microsoft stated that they expected the Spectre/Meltdown mitigations to have less of an impact on newer operating systems, and that does turn out to be the case. However, the additional resource cost of the newer operating systems themselves is actually *more* than running 2008R2 or 2012R2 with mitigations enabled. So if your environment is sized for running Server 2016, you probably have infrastructure that has already been spec'ed for the much heavier OS anyway. If your infrastructure has been spec'ed for an older OS, then you will see a larger impact. And if you've spec'ed for the heavier OS (say, for a migration) but are still running your older OSes on that hardware, you will see an impact, but a smaller one than when you go live with 2016.
Previously I had stated that there are two different, important performance dimensions to consider: capacity, and how fast work actually gets done. All of these simulated measurements are about capacity. I hope to examine how speed is impacted between the OSes, but that may have to wait for a future post.
Without ControlUp Insights, tabulating the LoginVSI simulated results and properly formatting them took me weeks. The trial of Insights let me look at the real-world impact of our existing applications and workloads; had my organization owned it, I would have had this post up a long time ago, with even more data covering things like the storage subsystems. Hopefully we acquire this product in the future, and if you want to save yourself and your organization time, energy, and effort getting precise, accurate data that can be compared across scenarios you create: get ControlUp Insights.
I'm going to harp on this: YOUR WORKLOAD MATTERS MORE than these simulated results. During this exercise, I was able to determine with ControlUp Insights that one of our applications is so light that we can host 1,000 users on a single host, where that same host struggled with 200 LoginVSI users. So WORKLOAD MATTERS. Keep that in mind when reviewing these results. LoginVSI produces results that can serve as a proxy for what you can expect, if you can properly relate them to your existing workload. LoginVSI also offers the capability to produce custom workloads tailored to your specific environment or applications, so you can gauge impact with much more precision.