Meltdown – Performance Impact Evaluation (Citrix XenApp 6.5)


Meltdown came out and it's a vulnerability whose fix may have a performance impact.  Microsoft has stated that the impact will be more severe if you:

a) Are running an older OS
b) Are using an older processor
c) Are running applications that perform lots of context switches

Unfortunately, the environment we operate hits all of these nails on the head.  We are using, I believe, the oldest OS that Microsoft is patching this for.  We are using older processors from 2011-2013 which do not have the PCID optimization (as reported by the SpeculationControl test script), which means performance is impacted even more.

I’m in a large environment where we have the ability to shuffle VM’s around hosts and put VM’s on specific hosts.  This will allow us to measure the impact of Meltdown in its entirety.  Our clusters are dedicated to Citrix XenApp 6.5 servers.

Looking at our cluster and all of the Citrix XenApp VM's, we have some VM's that are 'application siloed' (that is, they only host a single application) and some VM's that are 'generic'.

In order to determine the impact I looked at our cluster, summed up the total of each type of VM and then divided by the number of hosts.  We have 3 different geographical areas that have different VM types and user loads.  I am going to examine each of these workload types across the different geographical areas and compare the outcomes.

Since Meltdown impacts applications and workloads that have lots of context switches, I used perfmon on each server to record its context switch rate.
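
A counter collection along these lines is all that's needed; this is a minimal sketch, with an arbitrary sample interval and output path rather than my exact perfmon settings:

    # Sample the system-wide context switch rate every 15 seconds for an hour
    # and write it to CSV for later comparison.
    Get-Counter -Counter '\System\Context Switches/sec' -SampleInterval 15 -MaxSamples 240 |
        ForEach-Object {
            [pscustomobject]@{
                Time            = $_.Timestamp
                ContextSwitches = [math]::Round($_.CounterSamples[0].CookedValue, 0)
            }
        } | Export-Csv -Path "$env:COMPUTERNAME-contextswitches.csv" -NoTypeInformation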

The metrics I am interested in are the context switch values, as they've been identified as the element that highlights the impact.  My workloads look like this:

Based on this chart, our largest impact should be in Location B, followed by Location A, and lastly Location C.

However, the processors for each location are as follows:

Location A: Intel Xeon 2680
Location B: Intel Xeon 2650 v2
Location C: Intel Xeon 2680

The processors may play a role, as newer generation processors are supposed to fare better.

In order to test Meltdown in a side-by-side comparison of systems with and without the mitigation, I took two identical hosts and populated them with an identical number and mix of servers.  On one host we patched all the VM's with the mitigation, and on the other host we left the VM's without the patches.

Using the wonderful ControlUp Console, we can compare the results in their real-time dashboard.  Unfortunately, the dashboard only gives us a "real time" view.  ControlUp offers a product called "Insights" that can show the historical data, but our organization has not subscribed to it, so I've had to track the performance by exporting the ControlUp views on an interval and then manually sorting and presenting the data.  Using the Insights view would be much, much faster.

ControlUp has 3 different views I was hoping to explore.  The first is the hosts view, which shows performance metrics pulled directly from the VMware host.  The second is the computers view, and the last is the sessions view.  The computers and sessions views are metrics pulled directly from the Windows server itself.  However, I am unable to accurately judge performance from the Windows server metrics because of how Windows measures CPU performance.

Another wonderful thing about ControlUp is that we can logically group our VM's into folders; from there ControlUp can sum the values and present them in an easily digestible way.  I created a logical structure like so and populated my VM's:


And then within ControlUp we can "focus" on each "Location" folder, and if we select the "Folder" view it presents the sums of the logical view.


In the hosts view we can very quickly see the impact, ranging from 5%-26%.  However, this is a realtime snapshot, so I tracked the average view and examined only the "business hours", as our load is VERY focused on the 8AM-4PM window.  After these hours we see a significant drop in our load.  If the servers are not being stressed, the performance seems to be a lot more even or the difference is not noticeable (in a cumulative sense).


Some interesting results.  We are consistently seeing longer login times and application launch times.  2/3 of the environments have lower user counts on the patched servers, with the Citrix load balancing making that determination.  The one environment that had more users on the mitigated servers is actually our least loaded in terms of servers per host and users per server, so it's possible that a higher user load would open up a gap, but as of now it shows that one of our environments can support an equal number of users.

Examining this view and the historical data presented, I encountered some oddities — namely, the CPU utilization seemed fairly even more often than not, yet the hosts view showed greater separation between the machines with the mitigation and without.  I started to explore why and believe I may have come across this issue previously.

2008R2-era servers have less accurate reporting of CPU utilization.

I believe this is also the API that ControlUp is using with these servers to report on usage.  When I was examining a single server with Process Explorer I noticed a *minimum* CPU utilization of 6%, but Task Manager and ControlUp would report 0% utilization at various points.  The issue is one of accuracy, adding and rounding.  The more users on a server, each with more processes consuming just a sliver of CPU, the greater the inaccuracy.  Example:

Left, Task Manager. Right, Process Explorer

We have servers with hundreds of users following a workflow like this, where they are using just a fraction of a percent of CPU resources.  Task Manager and the like will not catch these values and round *down*.  If you have 100 users each running a process that consumes 0.4% CPU, then the inaccuracy is on the order of 40%!  So relying on the VM metrics from ControlUp or Windows itself is not helpful.  Unfortunately, this destroys my ability to capture information within the VM, requiring us to rely solely on the information within VMware.  To be clear, I do NOT believe Windows 2012 R2 and greater OS's have this discrepancy (although I have not tested), so this issue manifests itself pretty viciously on XenApp 6.5-era servers.  Essentially, if Meltdown is increasing CPU time on processes by a fraction of a percent, then Windows will report as if everything is OK and you will probably not notice or think there is an issue!
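
To spell out that arithmetic, here's a quick illustration of the rounding problem; the 0.4% per-process figure is just the example value from above, not a measurement:

    # 100 users, each running one process that consumes 0.4% CPU.
    $perProcessCpu = 1..100 | ForEach-Object { 0.4 }

    # What a tool that rounds each process down to a whole percent reports:
    $reported = ($perProcessCpu | ForEach-Object { [math]::Floor($_) } | Measure-Object -Sum).Sum   # 0

    # What those processes actually consume in aggregate:
    $actual = [math]::Round(($perProcessCpu | Measure-Object -Sum).Sum, 1)                           # 40

    "Reported: $reported%   Actual: $actual%"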

In order to determine if this impact is detectable, I took two servers with the same base image, one with the mitigation installed and one without.  I used Process Explorer and "saved" the process list over the course of a few hours.  I ensured the servers had a similar number of users using a specific application that only presented a specific workload, so everything was as similar as possible.  In addition, I looked at processes that aren't configurable (or, since the servers have the same base image, are configured identically).  Here were my results:

Just eyeballing it, it appears that the mitigation has had an impact at the fractional level.  Taking the averages of the winlogon.exe and iexplore.exe processes into account:


These numbers may seem small, but once you start considering the number of users, the amount wasted grows dramatically.  For 100 users, winlogon.exe goes from consuming a total of 1.6% to 7.1% of the CPU, an additional load of 5.5%.  iexplore.exe is even more egregious, as it spawns 2 processes per user and these averages are per process.  For 100 users, 200 iexplore.exe processes will be spawned.  The iexplore.exe CPU utilization goes from 15.6% to 38.8%, for an additional load of 23.2%.  Adding the mitigation patch can impact our load pretty dramatically, even though it may be under-reported, which in turn impacts users on a far greater scale by adding more users to servers that don't actually have the resources Windows reports they have.  For an application like IE, this may just mean greater slowness — depending on the workload/workflow — but if you have an application more sensitive to these performance scenarios, your users may experience slowness even though the servers themselves look OK from most (all?) reporting tools.
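
The same scaling spelled out; the per-process averages below are simply the totals above divided by the process counts, so treat them as approximations:

    # Per-process CPU averages implied by the 100-user totals above.
    $winlogon = @{ Unpatched = 1.6 / 100;  Patched = 7.1 / 100;  Count = 100 }   # one winlogon.exe per user
    $iexplore = @{ Unpatched = 15.6 / 200; Patched = 38.8 / 200; Count = 200 }   # two iexplore.exe per user

    foreach ($proc in @(@{ Name = 'winlogon.exe'; Data = $winlogon }, @{ Name = 'iexplore.exe'; Data = $iexplore })) {
        $extra = ($proc.Data.Patched - $proc.Data.Unpatched) * $proc.Data.Count
        '{0}: +{1:N1}% total CPU for 100 users' -f $proc.Name, $extra
    }
    # winlogon.exe: +5.5% total CPU for 100 users
    # iexplore.exe: +23.2% total CPU for 100 users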

Continuing with the HOSTS view, I exported all the data ControlUp collects at a one-minute interval, loaded it into Excel and created pivot tables comparing the hosts running VM's with the mitigation patches against the hosts running VM's without.  This is what I saw for Saturday-Sunday, which are lightly loaded days.

This is Location B; the host with unpatched VM's is in orange and the host with patched VM's is in blue.  The numbers are pretty much identical when CPU utilization on the host is around or below 10%, but once it starts to get loaded the separation becomes apparent.

Since these data points were taken every minute, I used a moving average of 20 data points (3 per hour) to present the data in a cleaner way:
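
The smoothing itself is nothing fancier than a rolling mean over the exported per-minute values.  A rough sketch, assuming the export is saved as a CSV; the file name and column names here are made up and will differ from the real ControlUp export headers:

    # Apply a 20-sample moving average to the exported per-minute host CPU values.
    $window  = 20
    $samples = Import-Csv '.\host-cpu-export.csv'   # hypothetical export

    for ($i = $window - 1; $i -lt $samples.Count; $i++) {
        $slice = $samples[($i - $window + 1)..$i] | ForEach-Object { [double]$_.HostCpuPercent }
        [pscustomobject]@{
            Time        = $samples[$i].Time
            SmoothedCpu = [math]::Round(($slice | Measure-Object -Average).Average, 1)
        }
    }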

Looking at the data for this Monday morning, we see the following:

Location B


Some interesting events: at 2:00AM the VM's reboot.  We reboot odd and even servers each day, and in organizing this test I put all the odd VM's on the blue host and the even VM's on the orange host.  So the blue line going up at 2:00AM is the odd (patched) VM's rebooting.  The reboot cycle is staggered over a 90 minute interval (the last VM's should reboot around 3:30AM).  After the reboot, the servers come up and do some "pre-user" startup work like loading App-V packages, App-V registry pre-staging, etc.  I track the App-V registry pre-staging duration during bootup and here are my results:

Registry pre-staging in App-V is a light-read, heavy-write exercise.  Registry reading and writing are slow on 2008R2, and our time to execute this task went from 610 seconds to 693 seconds, an overall duration increase of 14%.

Looking at Location A and C

Location A

Location C (under construction)

We can see in Location A the CPU load is pretty similar until the 20% mark; then the separation starts to ramp up fairly drastically.  For Location C, unfortunately, we are undergoing maintenance on the 'patched' VM's, so I'm showing this data for transparency but it's only relevant up to the 14th.  I'll update this in the next few days when the 'patched' VM's come back online.

Now, I'm going to look at how "Windows" reports CPU performance vs the hosts' CPU utilization.

Location A


Location B


The information this is conveying is to NOT TRUST the Windows CPU utilization meter (at least on 2008 R2).  The CPU utilization at the VM level does not appear to reflect the load on the hosts.  While the VM's with the patch and without the patch both report nearly identical levels of CPU utilization, at the host level the spread is much more dramatic.


Lastly, I am able to pull some other metrics that ControlUp tracks, namely logon duration and application launch duration.  For each of the locations I generated a report of the difference between the two environments:

Location A: Average Application Load Time

Location B: Average Application Load Time


Location A: Logon Duration


Location B: Logon Duration




In each of the metrics recorded we see a worse experience for our user base, from applications taking longer to launch to logon times increasing.

What does this all mean?

In the end, Meltdown has a significant impact on our Citrix XenApp 6.5 environment.  The perfect storm of older CPU's, an older OS and applications with workflows that are affected by the patch means our environment is grossly impacted.  Location A has a maximum hit (as of today) of 21%, and Location B has a spread of 12%.  I had originally predicted that Location B would have the largest impact; however, the newer v2 processors may be playing a role, and they may be more efficient than the older 2680.

In the end, the performance hit is not insignificant and will reduce our capacity significantly once these patches are deployed.  I plan on adding new articles once I have more data on Meltdown, and then again once we start adding the mitigations against Spectre.

CPU Utilization on the hosts. Orange is a host with VM’s without the Meltdown patches, blue is with the patches.



Citrix Storefront – Adventures in customization – Add a help button to your Storefront UI


This customization is pretty easy.  Add the following to your custom.js file:

Replace "" with the URL you want your help screen to open.


Citrix Storefront – Adventures in customization – Default to “Store” view if you have no favourited app’s


We are in the process of migrating users from Web Interface to Storefront.  We have identified a potential issue: new users are directed to the "Favourites" view, which doesn't have any applications by default; instead it has instructions on how to add apps to the Favourites view.

New users might say, “Where did my apps go?!”

The concern is users may become confused because Web Interface shows all your applications and this new view shows none.  What we want to do to solve this is default to the "Store" view if you have no favourited apps, and default to the Favourites view if you have at least 1 favourited app.


We can do this.


Just add the code above to your custom.js file and the default view will be changed to the Store view if you have no favourited apps.  Done!


AppV 5 – Raiser's Edge 7.96 – Run-time error -2147024770 (8007007e)


We are in the process of upgrading Blackbaud’s Raiser’s Edge to 7.96 and we encountered an error:

Run-time error ‘-2147024770 (8007007e)’:
Automation error
The specified module could not be found.

This error gives us a few clues as to what might be happening.  The most obvious part is the "8007007e", which is a standard Windows hex error code that translates to:

8007007E = ERROR_MOD_NOT_FOUND ("The specified module could not be found")

So RE7.exe is not finding a file it's looking for.  With most AppV packages we can suss out the missing file by using procmon and tracing for "FILE NOT FOUND" in the result field.  Unfortunately, searching for this message did NOT turn up a file that wasn't resolved by another path.  In other words, all files were accounted for.  But the error message very clearly states that a file is missing.  So the next step was to install the application locally and compare the launch differences between the local install and the AppV install.  Again, Process Monitor makes this easy with the "loaded modules" option.

The differences I found between a local install of this application and the AppV launch looked like so:

The launches were identical until the highlighted points.  The local install, which works without issue, has an extra file that gets loaded: bbcor7.dll.

It appears, somehow, this file is getting loaded and registered dynamically on a local install, but this is not happening with the AppV install.  I don't see the file being searched for at all when tracing the AppV launch with procmon.  However, executing a regsvr32 /s "C:\Program Files (x86)\Blackbaud\The Raisers Edge 7\DLL\bbcor7.dll" during sequencing does all the necessary work to register the DLL and allow RE7.exe to find and load it inside the AppV bubble.

So, long story short, execute:
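
(the same registration command quoted above; /s just suppresses the confirmation dialog)

    regsvr32 /s "C:\Program Files (x86)\Blackbaud\The Raisers Edge 7\DLL\bbcor7.dll"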

while sequencing your AppV package, and this should fix the issue.

Here is my entire sequencing script:



Citrix XenDesktop/XenApp 7.15 – The local host cache in action


The Citrix Local Host Cache feature, introduced in XenDesktop/XenApp 7.12, has some nuances that may be better demonstrated in real time than typed out in text.  I will do both in this article: a 'step by step' of what happens when you have a network or site database outage, and a realtime video highlighting the feature in action.  There are many other blogs and articles that do a great job going into the step-by-step details of the feature, but I find seeing it in action to be very informative.

To view a video of this process, scroll to the very end, or click here.

To start, I've created a powershell script that simulates a user querying the broker for a list of applications.

Columns are time of the response, the payload size received (in bytes) and the total time to respond in milliseconds.
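
The script itself boils down to a timed loop against the broker.  A rough sketch of the idea, assuming a plain HTTP POST to the XML Broker; the URL, the interval, and the stubbed-out request body are placeholders, not my actual script:

    # Poll the broker once a second, printing when it answered, how many bytes came
    # back, and how long the round trip took in milliseconds (the three columns above).
    $brokerUrl  = 'http://ddc01.example.local/scripts/wpnbr.dll'      # placeholder broker address
    $requestXml = '<NFuseProtocol version="5.4"></NFuseProtocol>'     # placeholder enumeration payload

    while ($true) {
        $sw = [System.Diagnostics.Stopwatch]::StartNew()
        try {
            $resp  = Invoke-WebRequest -Uri $brokerUrl -Method Post -Body $requestXml -ContentType 'text/xml' -UseBasicParsing -TimeoutSec 30
            $bytes = $resp.RawContentLength
        } catch {
            $bytes = 0
        }
        $sw.Stop()
        '{0:HH:mm:ss}  {1,8} bytes  {2,6} ms' -f (Get-Date), $bytes, $sw.ElapsedMilliseconds
        Start-Sleep -Seconds 1
    }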

As we’re querying the broker, the broker is reaching out to the database and then responding to the user with the information requested.


Periodically, the Citrix Config Synchronizer Service will check to ensure the local host cache database is in sync with the site database. This is an event that occurs every 2 minutes during normal operation.

To show the network connection failing, I am going to set up a continuous ping to the database server.

To simulate a network failure, I’m going to use the tool clumsy to drop all packets to and from the database server.

Clicking start in clumsy immediately stops the simulated user from getting their list of applications.


And the pings now time out.

The broker has a 20 second timeout, after which it will respond to requests with what it thinks is the current status.  The first timed-out request receives a response of "working", and thereafter a response of "pending failed" is returned.

Around 24 seconds in, the broker has noticed the database has failed and has logged its first event, 1201, "The connection between the Citrix Broker Service and the database has been lost".

Now, one minute thirty-three seconds into the failure, other Citrix services are reporting that they cannot contact the database.

Just shy of 2 minutes in, the broker service has exceeded its timeout for contacting the database and is in the process of switching to the local host cache.  It stops the "primary broker".

And then the Citrix High Availability Service becomes active, brokering user requests.

In my simulation, the time it took the user to receive a response from the LHC is a little shorter than from the site database.  The LHC response time is 80-90 milliseconds, whereas the response time for a request that involves the site database is 90-100.  This lets us visually distinguish the two modes of operation.

Top, site database response times – middle is the outage – bottom is LHC response times

How long does it take to “fall back” to the database when connectivity is restored?

I “Stopped” clumsy to restore our network connection and started a timer.


We can see the ping responses from the database immediately to verify our connection is back.


Almost immediately, all services have noticed that they have connectivity again, including the broker service.

However, we do not fall back immediately.

At one minute thirty-three seconds the broker has switched back to the primary broker, and all services have been restored.

To watch a video of this all in action, please view here:




Citrix Storefront – Adventures in customization – Define a custom resolution for a specific application


Currently, Storefront does not grant the ability to define applications with specific resolutions.  In order to configure the resolution, Citrix recommends you modify the default.ica file.  This is terrible!  If you have specific applications that require specific resolutions, what are you to do?  Direct users to a variety of stores depending on the resolution required?!

Fortunately, again, we can extend StoreFront to make it so we can configure custom resolutions for different applications on the same store.  The solution is a Storefront extension I’ve already written.

The steps to set this up:

  1. Download the Storefront_CustomizationLaunch.dll.
  2. Copy the file to C:\inetpub\wwwroot\Citrix\Store\bin
  3. Edit the web.config in the Store directory and enable the extension
  4. We need to enable Header pass-through for DesiredHRES, DesiredVRES, and TWIMode in the “C:\inetpub\wwwroot\Citrix\StoreWeb\web.config” file:
  5. Lastly, add the following to the custom.js file in your StoreWeb/custom folder:
  6. And enjoy the results!  🙂


Citrix Storefront – Adventures in customization – Prepopulate Explicit Logon Credentials


Citrix Storefront allows you to prepopulate the credentials for your Explicit Logon.  The explicit logon screen is generally seen here:

And you can prepopulate the Username/Password fields.  If you don't want to prepopulate the password, that's fine too.  There are 3 properties and none are required: Username, Password and Domain.  In order to prepopulate them you must pass your credentials through to Storefront somehow, either as a cookie, a header or a URL search query.  I will demo it with the URL search query since I already have the code for pulling the parameters.  You must have "Explicit Authentication" enabled, aka "User name and Password":

Put the following code into your custom.js file:

The URL to query is:

And the result:


Citrix Storefront – Adventures in customization – Login via credentials in URL search query


If you use a 3rd party service to connect to your Citrix Storefront environment, you may want to "pass through" credentials without using domain authentication.  This post illustrates how you can log in to your Storefront environment using nothing more than a URL with your credentials embedded in it.  To enable this functionality, this code must be in your custom.js file.

You MUST have HTTP Basic enabled as an authentication method on your Citrix Storefront Store.

The URL to login would look like this:

Put it all together:


Citrix Provisioning Server – PXE requests stop working


We use a bootable ISO in our environment to boot our VM's to a specific set of PVS servers.  This ISO varies by region, ensuring that each target device that boots is directed to its closest PVS server.

However, we have 1 region that does not leverage this capability; this region was designed to utilize the PXE services of the Citrix PVS servers.  Occasionally, we encounter VM's that will not boot and instead the console shows "PXE-E53: no boot filename received"


When I logged onto the Citrix PVS servers, I checked their services.  Both services were reported as “Running”:

When I checked the event logs I did not see any errors in either the application log or the system log.  Administrative events showed nothing out of the ordinary either.

In order to confirm that the PVS service was actually listening, I executed
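something along the lines of this (the exact switches I used may have varied; -b needs an elevated prompt):

    # -a/-n list listening ports numerically, -o adds the owning process ID,
    # -b resolves the owning executable name.
    netstat -a -n -o -b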

This showed me all the open ports the server was listening on and the processes tied to those ports.  Since PXE is a UDP operation, I examined the UDP portion of the netstat output.

Port 69 is used by TFTP to transfer files, and port 67 is used by PXE.  However, I only saw port 69; port 67 was nowhere to be found.  I restarted the "Citrix PVS PXE Service", reran netstat and confirmed that the PXE port was now listening, and matched up the process IDs to the proper services.

I restarted the failed target devices and they began to boot properly.

However, why did this fail in the first place?  I read on the Citrix forums that the Citrix services can become unbound if the network is not available when the services are started.  To test this I rebooted one of the affected Citrix PVS servers.  Sure enough, it came back up with port 67 not being monitored but the service in a 'Running' state.  I wanted to see if I could capture the flow of communication from the network and when the service started, so I used procmon and enabled "Boot Logging".

Lo and behold, the procmon monitoring on startup added enough of a delay that the PXE service bound consistently.  With boot logging stopped, the PXE service would start but fail to bind to the port.

So now this leads to a bit of a quandary.  The delay needed seems to be in the milliseconds.  I've considered a couple of solutions for this issue.

  1. A startup script that checks that both ports are listening and restarts the appropriate service if one of them is not found (a rough sketch of such a check follows this list).
  2. Change the service startup type to "Automatic (Delayed Start)".  This delays the service start by up to 2 minutes, which does mean that the PVS server will NOT be able to service target device boot requests during this window.
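
For option 1, the check itself would be small.  A rough sketch, with the service display names as assumptions (verify them against your own services list):

    # Restart the PXE / TFTP services if their UDP ports (67 / 69) are not bound.
    # Display names below are assumptions; adjust to match your services list.
    $checks = @(
        @{ Port = 67; DisplayName = 'Citrix PVS PXE Service' },
        @{ Port = 69; DisplayName = 'Citrix PVS TFTP Service' }
    )

    $udpListeners = (netstat -a -n -p UDP) -join "`n"

    foreach ($check in $checks) {
        if ($udpListeners -notmatch (':{0}\s' -f $check.Port)) {
            Get-Service -DisplayName $check.DisplayName | Restart-Service -Force
        }
    }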

I think we're going to go with option 2.  The reason is we can apply this setting change via Group Policy Preferences, which ensures that if we do any removal/upgrade of the PVS software the setting will get reapplied.  It also means we don't have to worry about losing a startup script when upgrading the OS, or about maintaining a script at all.

We've been affected by this a few times in the past and the fix has always been to restart the PVS server, but this time I managed to hit a window where the failure was happening consistently and was able to capture this information.  🙂



Citrix Provisioning Services Reverse Imaging – 2017 edition


We've been trying to upgrade our Citrix PVS tools to version 7.13 and had some issues.  I need to reverse image so I can remove the software without dealing with the in-place upgrade procedure.  So I decided this would be a good time to update my reverse imaging process.  This time I will only be using native tools built into the OS.  With that, this is my new procedure for reverse imaging.  It's much faster and easier than my last process, but it does have a dependency: you must be running Windows 2012 or greater so that dism.exe is a part of the OS.  This reverse imaging process is VMware specific but the principles should apply to all hypervisors.  The overview of the process is as follows:
Create a local hard drive on a ‘build’ server that your vDisk will be based off of.
Attach that disk to the PVS server
Image vDisk to local disk
Boot local disk, manipulate as needed
Image local disk back

  1. Create a new hard disk on your ‘build’ VM that we will use as the staging for the vDisk you want to reimage
  2. Attach the newly created disk to your PVS server
  3. Set up the disks on the PVS server.  In my example, the G: drive is a mounted VHDX of the Citrix vDisk.  I'm using DISM to capture an image of this disk to a wim that I will apply to the local disk created in step 1.  You will need to create a partition on the disk you made in step 1 and set it as active (if this is an MBR disk).  Unfortunately, DISM does not do disk-to-disk imaging, so this intermediary step is required.  In my video here, the E: drive is the local disk that belongs to my target 'build' server.  In addition, you must fix the BCD store as it will point to the original partition, not your new target partition (example DISM and BCD commands are sketched at the end of this post).
    The BCD commands are:

    You will need to replace “partition=E:” with your drive letter of the LOCAL disk.

  4. Remove your hard disk from the PVS virtual machine without deleting it.  Ensure the disk is attached to your BUILD target device.  If you are booting up your BUILD target device from an ISO or PXE you may need to disable those features so that it boots from the hard disk.  I also disconnect the network from the virtual machine because we will need to reset the computer account with AD.
  5. To reset the computer account with AD, you need to log in to the system.  The easiest method, if you know the local administrator password, is to log in with that and "rejoin" the domain.  If not, disconnect the network and then log on with an account that has recently logged onto the server.  Cached credentials should be able to pass you through.  Once you are in, connect the network and rejoin the domain.  You can typically "rejoin" by changing the domain name to the NETBIOS name, or the reverse if the NETBIOS name is present.

    At this point you can do whatever work is required.

  6. Once you are done with whatever work you need to do, set the target device to boot from hard disk, but keep your vDisk set as an option.  This will attach the vDisk on boot as a secondary drive.  This will allow us to image back to it.  You need to ensure this vDisk is in PRIVATE mode OR has a Maintenance version.  This will set the vDisk to a Read/Write mode.

    If you have your WRITE drive attached, you will need to ensure your local system disk is the LAST disk in the order for your VM.  The PVS boot loader, when set to boot from hard disk, seems to try and boot the LAST disk it detects.

    Checking the PVS Status Tray shows you’ve booted from a local hard drive.

    Something you will need to check and validate is that the Hard Disk of the Citrix vDisk is NOT the drive letter of your write cache (if you find you cannot attach a write cache).  The reason is you have probably redirected the page file, event logs or whatever and if the Citrix vDisk occupies that drive letter you will not be able to image to it because it will be in use.
  7. To image back open the Imaging Wizard:

    And go through the imaging process.

    I prefer to image to the vDisk volume (as seen in this process).

  8. Done!  Set the device to boot via the vDisk as opposed to Hard Disk, delete the local OS disk, boot up your vDisk and run any scripts to ‘finalize’ your image.
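
For reference (as mentioned in step 3), the disk-to-wim-to-disk imaging and the BCD fix-up boil down to commands along these lines.  This is a sketch rather than my exact commands: the drive letters follow the example above (G: is the mounted vDisk, E: is the new local disk) and the wim path is arbitrary:

    # Capture the mounted vDisk (G:) to an intermediate wim, then apply it to the local disk (E:).
    dism /Capture-Image /ImageFile:C:\Temp\vdisk.wim /CaptureDir:G:\ /Name:"vDisk"
    dism /Apply-Image /ImageFile:C:\Temp\vdisk.wim /Index:1 /ApplyDir:E:\

    # Point the BCD store on the new disk at its own partition instead of the original one.
    bcdedit /store E:\Boot\BCD /set '{default}' device partition=E:
    bcdedit /store E:\Boot\BCD /set '{default}' osdevice partition=E: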