Meltdown came out, and it is a vulnerability whose fix may have a performance impact. Microsoft has stated that the impact will be more severe if you:
a) Are running an older OS
b) Are using an older processor
c) Run applications that perform lots of context switches
Unfortunately, our environment hits all of these nails on the head. We are running what I believe is the oldest OS Microsoft is patching for this. We are also using older processors from 2011-2013, which lack the PCID optimization (as reported by the SpeculationControl test script), which means performance is impacted even more.
I’m in a large environment where we have the ability to shuffle VMs between hosts and pin VMs to specific hosts. This allows us to measure the impact of Meltdown in its entirety. Our clusters are dedicated to Citrix XenApp 6.5 servers.
Looking at our cluster and all of the Citrix XenApp VMs, we have some VMs that are ‘application siloed’ (hosting only a single application) and some that are ‘generic’.
To determine the impact, I looked at our cluster, summed the total of each type of VM, and divided by the number of hosts. We have three geographical areas with different VM types and user loads. I am going to examine each workload type across the geographical areas and compare the outcomes.
Since Meltdown impacts applications and workloads that have lots of context switches, I used perfmon on each server to record its context switches.
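For reference, perfmon’s “Context Switches/sec” counter is derived from a cumulative system counter by sampling deltas over an interval. A minimal sketch of that idea in Python; the counter source here is simulated, since the real value would come from the Windows performance counter API:

```python
import time

def sample_rate(read_counter, interval_s=1.0, samples=5, sleep=time.sleep):
    """Turn a cumulative counter (e.g. total context switches since boot)
    into per-second rates by sampling deltas, the way perfmon derives
    'Context Switches/sec'."""
    rates = []
    prev = read_counter()
    for _ in range(samples):
        sleep(interval_s)
        cur = read_counter()
        rates.append((cur - prev) / interval_s)
        prev = cur
    return rates

# Simulated counter for illustration: grows by 1500 switches per read.
_total = {"n": 0}
def fake_counter():
    _total["n"] += 1500
    return _total["n"]

print(sample_rate(fake_counter, interval_s=1.0, samples=3, sleep=lambda s: None))
# each sampled rate is 1500 switches/sec
```

On a real server you would swap `fake_counter` for a read of the actual performance counter and use a real sleep between samples.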
The metric I am most interested in is context switches, as they have been identified as the element that highlights the impact. My workloads look like this:
Based on this chart, our largest impact should be Location B, followed by Location A, last with Location C.
However, the processors for each location are as follows:
Location A: Intel Xeon 2680
Location B: Intel Xeon 2650 v2
Location C: Intel Xeon 2680
The processors may play a role, as newer-generation processors are supposed to fare better.
To test Meltdown in a side-by-side comparison of systems with and without the mitigation, I took two identical hosts and populated them with an identical number and mix of servers. On one host we patched all the VMs with the mitigation; on the other host we left the VMs unpatched.
Using the wonderful ControlUp console, we can compare the results in its real-time dashboard. Unfortunately, the dashboard only gives us a “real time” view. ControlUp offers a product called “Insights” that can show historical data, but our organization has not subscribed to it, so I had to track performance by exporting the ControlUp views on an interval and then manually sorting and presenting the data. Using the Insights view would have been much, much faster.
ControlUp has three different views I was hoping to explore. The first is the hosts view, with performance metrics pulled directly from the VMware host. The second is the computers view and the third is the sessions view; both pull their metrics directly from the Windows server itself. However, I am unable to accurately judge performance from the Windows Server metrics because of how Windows measures CPU performance.
Another wonderful thing about ControlUp is that we can logically group our VMs into folders; from there, ControlUp can sum the values and present them in an easily digestible form. I created a logical structure like so and populated it with my VMs:
Then, within ControlUp, we can “focus” on each “Location” folder; selecting the “Folder” view presents the sums for that logical grouping.
In the hosts view we can very quickly see the impact, ranging from 5% to 26%. However, this is a real-time snapshot, so I tracked the average view and examined only “business hours”, as our load is heavily concentrated between 8AM and 4PM. After those hours we see a significant drop in load, and when the servers are not being stressed the performance looks a lot more even, with any difference barely noticeable (in a cumulative sense).
Some interesting results. We are consistently seeing longer logon times and application launch times. Two of the three environments ended up with more users on the unpatched servers, with the Citrix load balancing making that determination. The one environment that had more users on the mitigated servers is actually our least loaded in terms of servers per host and users per server, so it’s possible that adding more users would open up a gap; as of now, though, it shows that one of our environments can support an equal number of users.
Examining this view and the historical data, I encountered oddities: the CPU utilization seemed fairly even more often than not, yet the hosts view showed greater separation between the machines with the mitigation and those without. I started to explore why, and I believe I have come across this issue previously.
Windows Server 2008 R2-era servers report CPU utilization less accurately.
I believe this is also the API that ControlUp uses with these servers to report usage. When I examined a single server with Process Explorer, I noticed a *minimum* CPU utilization of 6%, yet Task Manager and ControlUp would report 0% utilization at various points. The issue is one of accuracy: summing and rounding. The more users on a server, each with more processes consuming ever so slightly more CPU, the greater the inaccuracy. Example:
We have servers with hundreds of users whose workflows look like this, each using just a fraction of a percent of CPU. Task Manager and the like will not catch these values; they round *down*. If you have 100 users each running a process that consumes 0.4% CPU, the under-reporting is on the order of 40% of the CPU! So relying on the VM metrics in ControlUp, or in Windows itself, is not helpful. Unfortunately, this destroys my ability to capture information from within the VM, requiring us to rely solely on the information from VMware. To be clear, I do NOT believe Windows Server 2012 R2 and later OSes have this discrepancy (although I have not tested), so this issue manifests itself pretty viciously on XenApp 6.5-era servers. Essentially, if Meltdown is increasing CPU time on processes by a fraction of a percent, Windows will report as if everything is OK and you will probably not notice or even suspect there is an issue!
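The rounding effect described above is easy to demonstrate: sum per-process CPU after truncating each value to a whole percent (as an integer-percent display effectively does) and compare it with the true fractional total. This is a toy illustration of the accumulation problem, not the actual Task Manager algorithm:

```python
import math

def displayed_total(per_process_cpu):
    # Truncate each process to a whole percent before summing,
    # mimicking a display that rounds fractional usage down to 0%.
    return sum(math.floor(p) for p in per_process_cpu)

def true_total(per_process_cpu):
    # Sum the real fractional per-process usage.
    return sum(per_process_cpu)

# 100 users each running a process that consumes 0.4% CPU:
procs = [0.4] * 100
print(round(true_total(procs), 1))  # 40.0 -> 40% of the box is actually busy
print(displayed_total(procs))       # 0   -> the entire 40% vanishes
```

The same code with 2012 R2-style fractional reporting would show the 40%, which is why the discrepancy only bites on the older OS.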
To try to determine whether this impact is detectable, I took two servers with the same base image, one with the mitigation installed and one without. I used Process Explorer and “saved” the process list over the course of a few hours. I ensured the servers had a similar number of users running a specific application with a specific workload, so everything was as similar as possible. In addition, I looked at processes that aren’t configurable (or, since the servers share the same base image, are configured identically). Here were my results:
Just eyeballing it, the mitigation appears to have had an impact at the fractional level. Taking the averages of the winlogon.exe and iexplore.exe processes into account:
These numbers may seem small, but once you consider the number of users, the amount wasted grows dramatically. For 100 users, winlogon.exe goes from consuming a total of 1.6% to 7.1% of the CPU, an additional load of 5.5%. iexplore.exe is even more egregious, as it spawns two processes per user and these averages are per process. For 100 users, 200 iexplore.exe processes will be spawned, and total iexplore.exe CPU utilization goes from 15.6% to 38.8%, an additional load of 23.2%. Adding the mitigation patch can impact our load pretty dramatically, and because the load may be under-reported, users can be impacted on a far greater scale as more users are added to servers that don’t actually have the resources Windows reports. For an application like IE this may just mean greater slowness, depending on the workload/workflow, but if you have an application more sensitive to these performance scenarios, your users may experience slowness even though the servers themselves look OK in most (all?) reporting tools.
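The aggregate figures above are just the per-process averages multiplied out by process count and user count. A quick sanity check of the arithmetic, with the per-process averages back-computed from the totals quoted in the text:

```python
def aggregate_load(per_process_pct, processes_per_user, users):
    """Total CPU % consumed across all users for one process type."""
    return per_process_pct * processes_per_user * users

users = 100

# winlogon.exe: one process per user; per-process averages implied
# by the 1.6% -> 7.1% totals above.
winlogon_extra = (aggregate_load(0.071, 1, users)
                  - aggregate_load(0.016, 1, users))

# iexplore.exe: two processes per user; per-process averages implied
# by the 15.6% -> 38.8% totals above.
iexplore_extra = (aggregate_load(0.194, 2, users)
                  - aggregate_load(0.078, 2, users))

print(round(winlogon_extra, 1))  # 5.5  extra CPU % from winlogon.exe
print(round(iexplore_extra, 1))  # 23.2 extra CPU % from 200 iexplore.exe
```

The point is how fast fractional per-process overhead compounds: a difference too small for Windows to display becomes a double-digit chunk of the box at scale.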
Continuing with the HOSTS view, I exported all the data ControlUp collects at a one-minute interval, added the data to Excel, and created pivot tables comparing the hosts running VMs with the mitigation patches against the hosts running VMs without. This is what I saw for Saturday-Sunday; these days are lightly loaded.
This is Location B; the host with the unpatched VMs is in orange and the host with the patched VMs is in blue. The numbers are pretty much identical when host CPU utilization is around or below 10%, but once the host starts to get loaded the separation becomes apparent.
Since these data points were captured every minute, I used a moving average of 20 data points (3 per hour) to present the data more cleanly:
Looking at the data for this Monday morning, we see the following:
Some interesting events: at 2:00AM the VMs reboot. We reboot odd and even servers on alternating days, and in setting up this test I put all the odd VMs on the blue host and the even VMs on the orange host. So the blue line rising at 2:00AM is the odd (patched) VMs rebooting. The reboot cycle is staggered over a 90-minute window (the last VMs should reboot around 3:30AM). After the reboot, the servers come up and do some “pre-user” startup work such as loading App-V packages and App-V registry pre-staging. I track the App-V registry pre-staging duration during bootup, and here are my results:
Registry pre-staging in App-V is a light-read, heavy-write exercise. Registry reading and writing are slow on 2008 R2, and our time to execute this task went from 610 seconds to 693 seconds, an overall duration increase of roughly 14%.
Looking at Location A and C
Location C (under construction)
We can see that in Location A the CPU load is pretty similar until the 20% mark, after which the separation ramps up fairly drastically. For Location C, unfortunately, we are undergoing maintenance on the ‘patched’ VMs, so I’m showing this data for transparency, but it’s only relevant up to the 14th. I’ll update it in the next few days when the ‘patched’ VMs come back online.
Now I’m going to look at how “Windows” reports CPU performance versus the host’s CPU utilization.
The takeaway here: do NOT trust the Windows CPU utilization meter (at least on 2008 R2). CPU utilization at the VM level does not appear to reflect the load on the hosts. While the patched and unpatched VMs report nearly identical levels of CPU utilization, at the host level the spread is much more dramatic.
Lastly, I am able to pull some other metrics that ControlUp tracks, namely logon duration and application launch duration. For each location I generated a report of the difference between the two environments.
Location A: Average Application Load Time
Location B: Average Application Load Time
Location A: Logon Duration
Location B: Logon Duration
In each of the metrics recorded, the experience worsens for our user base, from applications taking longer to launch to logon times increasing.
What does this all mean?
In the end, Meltdown has a significant impact on our Citrix XenApp 6.5 environment. The perfect storm of older CPUs, an older OS, and applications with workflows affected by the patch means our environment is grossly impacted. Location A has a maximum hit (as of today) of 21%; Location B has a spread of 12%. I had originally predicted that Location B would see the largest impact, but the newer v2 processors may be playing a role and may be more efficient than the older 2680.
Overall, the performance hit is not insignificant and reduces our capacity considerably once these patches are deployed. I plan to add new articles once I have more data on Meltdown, and again once we start adding the mitigations against Spectre.