Citrix Provisioning Service – Network Service Starting/Stopping services remotely


Citrix Provisioning Services has a feature within the “Provisioning Services Console” that allows you to stop/restart/start the streaming service on another server:


This feature worked with Server 2008R2 but with 2012R2 and greater it stopped working.  Citrix partially identified the issue here:


I was exploring starting and stopping the streaming service on other PVS servers from the Console and I found this information was incorrect.  Adding the NetworkService account does NOT enable the streaming service to be stopped/started/restarted from other machines.  The reason is that NETWORK SERVICE is a LOCAL account on the machine itself.  When it reaches out to communicate with another system, it authenticates as the machine account, so the remote system sees the machine account’s SID.  Since that SID does not have access to the service on the remote server, you get a failure.

In order to fix this properly we can either grant each PVS server’s machine account permissions on the service on every PVS server, OR add all the machine accounts to a security group and grant that group permission to manipulate the service on each PVS server.

I created a PowerShell script to easily add a group, user or machine account to the Streaming Service.  It will also list all the permissions:

An example of adding a group to the permissions on the service:

And now we can start the service remotely:


In order to get this working entirely I recommend the following steps:

  1. Create a Group (eg, “CTX.Servers.ProvisioningServiceServer”)
  2. Add all the PVS Machine Accounts into that group
  3. Reboot your PVS servers so each machine account picks up the new group membership in its token
  4. Run the PowerShell script on each machine to add the group permission to the streaming service
  5. Done!
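
The effect of step 4 is to add an "allow" entry for the group to the service's security descriptor.  The sketch below is not the PowerShell script from this post; it is a minimal Python illustration of the string manipulation involved, assuming the descriptor is handled in SDDL form (the format shown by `sc sdshow` and accepted by `sc sdset`).  The SID and the set of rights granted (query status, start, stop, interrogate, read control) are illustrative assumptions.

```python
def add_service_ace(sddl: str, sid: str, rights: str = "LCRPWPLORC") -> str:
    """Return the SDDL string with an allow ACE for `sid` appended to the DACL.

    Rights (assumed for illustration): LC = query status, RP = start,
    WP = stop, LO = interrogate, RC = read control.
    """
    if not sddl.startswith("D:"):
        raise ValueError("expected an SDDL string that starts with a DACL (D:)")

    ace = f"(A;;{rights};;;{sid})"
    if ace in sddl:
        return sddl  # already present, nothing to do

    # The SACL section, if any, starts with "S:" right after the last DACL ACE.
    sacl_marker = sddl.find(")S:")
    if sacl_marker == -1:
        return sddl + ace
    insert_at = sacl_marker + 1
    return sddl[:insert_at] + ace + sddl[insert_at:]


if __name__ == "__main__":
    # Hypothetical descriptor and group SID, for demonstration only.
    current = "D:(A;;CCLCSWRPWPDTLOCRRC;;;SY)S:(AU;FA;CCDCLCSWRPWPDTLOCRSDRCWDWO;;;WD)"
    print(add_service_ace(current, "S-1-5-21-1-2-3-1104"))
```

The new ACE is placed at the end of the DACL and before the SACL, so the existing entries keep their order.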

And now the script:



Meltdown + Spectre – Performance Analysis


Meltdown and Spectre (variant 2) are two vulnerabilities that came out at the same time, but they are vastly different.  Patches for both were released extremely quickly for Microsoft operating systems, but because of a variety of issues with Spectre, only Meltdown could truly be mitigated at first.  The Spectre (variant 2) mitigation had a problematic release, causing so many issues for whoever installed the fix that it had to be recalled and the release delayed by weeks.  However, around March/April 2018, the Spectre patch was finalized and the microcode released.

Threat on Performance

Spectre (variant 2), of the two, threatened to degrade performance fairly drastically.  Initial benchmarks indicated that storage was hit particularly hard.  Microsoft commented that server operating systems could be hit particularly hard.  Even worse, older operating systems would not support CPU features (PCID) that could reduce the performance impact.  Older OS’s also suffer more than newer ones because their design (at the time) involved running more code in kernel mode (font handling was singled out as an example of one of these design decisions).

As with most things on my blog I am particularly interested in the impact on Citrix/Remote Desktop Services type workloads.  I wanted to compare these workloads with the mitigations ON and OFF.


My setup consists of two ESXi (version 6.0) hosts with identical VM’s on each, hosting identical applications.  I was able to set up 4 of these pairs of hosts.  Each pair of hosts has identical processors.  The one big notable difference is that one host of each pair has the Spectre and Meltdown patch applied to the ESXi hypervisor.

The operating system of all the VM’s is Windows Server 2008 R2.  Applications are published from Citrix XenApp 6.5.



This is simply a snapshot of a single point in time to show the metrics of these systems.

Performance Considerations

Performance within a Citrix XenApp farm can be described in two ways: capacity and speed.


Generally, one would start with a “best case” test of the speed aspect of your application’s performance.

A simplified view of this is “how fast can the app do X task?”

This task can be anything.  I’ve seen it measured by an automated script flipping through tabs of an application, where each tab pulled data from a database – rendered it – then moved on to the next tab.  The total time to execute these tasks amounted to a number that they used to baseline the performance of this application.

I’ve seen it measured as simply opening an Excel document with macros and lots of formulas that pull data and perform calculations and measuring that duration.

The point of each exercise is to generate a baseline that both the app team and the Citrix team can agree to.  I’ve almost never had the baseline equal “real world” workloads; typically the test is an exaggeration of the actual workflow of users (eg, the test exaggerates CPU utilization).  Sometimes this is communicated and understood, other times not, but hopefully it gives you a starting point.

In general, and for Citrix workloads specifically, running the baseline test on a desktop usually produces a reasonable number of, “well… if we don’t put it on Citrix this is what performance will be like so this is our minimum expectation.”  Sometimes this is helpful.

Once you’ve established the speed baseline you can now look at capacity.


After establishing some measurable level of performance within the application(s) you should now be able to test capacity.

If possible, start loading up users or test users running the benchmark.  Eventually, you’ll hit a point where the server fails, either because it ran out of resources or because performance degraded so much that it errors.  If you do it right, you should be able to start to find the curve that intersects descending performance with your “capacity”.

At this point, cost may come into consideration.

Can you afford ANY performance degradation?

If not, then the curve is fairly easy.  At user X we start to see performance degrade, so X-1 is our capacity.

If yes, at what point does performance degrade so much that adding users stops making sense?  Using the “without Citrix this is how it performs on the desktop” can be helpful to establish a minimum level of performance that the Citrix solution cannot cross.

Lastly, if you have network bound applications, and you have an appropriately designed Citrix solution where the app servers sit immediately beside the network resources on a super-high bandwidth, ultra-low latency network, you may never experience performance degradation (lucky you!).  However, you may hit resource constraints in these scenarios.  Eg, although performance of the application is dependent on network, the application itself uses 1GB of RAM per instance, so you’ll be limited pretty quickly by the amount of RAM you can have in your VM’s.  These cases are generally preferred because the easy answer to increase capacity is *more hardware*, but sometimes you can squeeze in some more users with software like AppSense or WEM.

Spectre on Performance

So what is the impact Spectre has on performance — speed and/or capacity?

If Spectre simply makes a task take longer, but you can fit the same number of tasks on a given VM/Host/etc. then the impact is only on speed. Example: a task that took 5 seconds at 5% CPU utilization now takes 10 seconds at 5% CPU utilization.  Ideally, the capacity should be identical even though the task now takes twice as long.

If Spectre makes things use *more* resources, but the speed is the same, then the impact is only on capacity.  Example: a task that took 5 seconds at 5% CPU utilization now takes 5 seconds at 10% CPU utilization.  In this scenario, the speed should be identical but your capacity is now halved.

The worst case scenario is if the impact is on both speed and capacity.  In this case, neither is recoverable, except that you might be able to make up some speed with newer/faster hardware.
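
The distinction between the two examples above can be made concrete with a small Python sketch.  It assumes a deliberately simplified model in which capacity is the number of tasks that fit concurrently into a CPU budget; the `TaskProfile` type and function name are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    seconds: float      # how long one task takes (speed)
    cpu_percent: float  # CPU the task holds while it runs


def concurrent_capacity(task: TaskProfile, cpu_budget_percent: float = 100.0) -> int:
    """How many copies of the task fit into the CPU budget at the same time."""
    return int(cpu_budget_percent // task.cpu_percent)


baseline = TaskProfile(seconds=5, cpu_percent=5)
speed_only_hit = TaskProfile(seconds=10, cpu_percent=5)     # slower, same CPU
capacity_only_hit = TaskProfile(seconds=5, cpu_percent=10)  # same speed, more CPU

for name, task in [("baseline", baseline),
                   ("speed-only hit", speed_only_hit),
                   ("capacity-only hit", capacity_only_hit)]:
    print(f"{name:<18} {task.seconds:>4}s per task, "
          f"{concurrent_capacity(task)} concurrent tasks")
```

In this model the speed-only case keeps the same concurrent capacity as the baseline, while the capacity-only case fits half as many tasks.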

I’ve tested to see the impacts of Spectre in my world.  This world consists of Windows 2008 R2 with XenApp 6.5 on hardware that is 6 years old.  I was also able to procure some newer hardware to measure the impact there as well.

Test Setup

Testing was accomplished by taking 2 identically configured ESXi hosts, applying the VMware ESXi patch with the microcode for Spectre mitigation to one of the hosts, and enabling it in the operating system.  I added identical Citrix VM’s to both hosts and enabled user logins to start generating load.


Performance needs to be measured at two levels: at the Windows/VM level, and at the hypervisor/host level.  This is because the hypervisor may pick up additional work required for the mitigation that the operating system does not see, and also because Windows 2008 R2 does not accurately measure CPU performance.

Windows/VM Level – Speed

I used ControlUp to measure and capture performance information.  ControlUp is able to capture various metrics including average logon duration.  This single metric covers various system interactions: using the network to query Active Directory and pull files from network shares, using disk to store group policies in a cache, using CPU to work out which policies are applicable, and launching executables in sequence.  I believe that measuring logons is a good proxy for understanding the performance impact.  So let’s see some numbers:


The top 3 results are Spectre enabled machines, the bottom 3 are without the patch.  The results are not good.  We are seeing a 200% speed impact in this metric.

With ControlUp we can drill down further into the impact:

Without Spectre Patch


With Spectre Patch


The component that took the largest hit is Group Policy.  Again, ControlUp can drill down into this component.

Without Spectre


With Spectre

All Group Policy Preferences components take a 200% hit.  The Group Policy Preferences functions operate by pulling down an XML file from the SYSVOL store, reading the XML file, then applying whatever resultant set of policies it finds applicable.  To trace this further and find more differences, I logged into each type of machine, one with Spectre and one without, and started a Process Monitor trace.  Group Policy is applied via the Group Policy service, which runs in a separate instance of svchost.exe.  The process can be found via Task Manager:

Setting ProcMon to filter only on that PID we can begin to evaluate the performance.  I logged in again with ProcMon capturing the logon.

Spectre Patched system on left, no patch on right

Using Process Monitor, we can look at the various “Summaries” to see which particular component may be most affected:

We see that 8.45 seconds is spent on the registry, 0.40 seconds on file actions, 1.04 seconds on the ProcessGroupPolicyExRegistry instruction.

The big ticket item is the time spent with the registry.

So how does it compare to a non-spectre system?


We see that 1.97 seconds is spent on the registry, 0.33 seconds on file actions, 0.24 seconds on the ProcessGroupPolicyExRegistry instruction.

Here’s a table showing the results:
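
The figures quoted above for the two traces can be put side by side with a short Python sketch.  The numbers are the ones reported in this post; the layout and the computed ratio column are additions for illustration.

```python
# Seconds reported by the Process Monitor summaries quoted above.
patched = {"Registry": 8.45, "File": 0.40, "ProcessGroupPolicyExRegistry": 1.04}
unpatched = {"Registry": 1.97, "File": 0.33, "ProcessGroupPolicyExRegistry": 0.24}


def slowdown(with_patch: dict, without_patch: dict) -> dict:
    """Ratio of patched time to unpatched time for each component."""
    return {name: with_patch[name] / without_patch[name] for name in with_patch}


ratios = slowdown(patched, unpatched)

print(f"{'Component':<30}{'No patch':>10}{'Spectre':>10}{'Ratio':>8}")
for name, ratio in ratios.items():
    print(f"{name:<30}{unpatched[name]:>10.2f}{patched[name]:>10.2f}{ratio:>7.2f}x")
```

Registry time and the ProcessGroupPolicyExRegistry call both come out at a little over 4x, while file actions are only about 1.2x.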

So it definitely appears we need to look at the registry actions.  One of the cool things about Procmon is you can set a filter on your trace and open up the summaries and it will show you only the objects in the filter.  I set a filter for RegSetValue to see what the impact is for setting values in the registry:

RegSetValue – without spectre applied


RegSetValue – with spectre applied

1,079 RegSetValue events and a 4x performance degradation.  To test whether it is specific to write events, I changed the ProcMon filter to “Category” is “Read”:


Registry Reads – Spectre applied


Registry Reads – Spectre not applied

We see roughly the same ratio of performance degradation, perhaps a little more so.  As a further test I created a PowerShell script that will just measure creating 1000 registry values and test it on each system:
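
The PowerShell script itself is not reproduced here.  As a rough Python equivalent of the idea, the sketch below times a batch of 1000 operations and, on Windows only, plugs in a registry write via the standard `winreg` module.  The key path `Software\SpectreTimingTest` and the value names are hypothetical.

```python
import sys
import time


def time_operation(operation, count: int = 1000) -> float:
    """Call operation(index) `count` times and return the elapsed seconds."""
    start = time.perf_counter()
    for index in range(count):
        operation(index)
    return time.perf_counter() - start


def make_registry_writer(subkey: str = r"Software\SpectreTimingTest"):
    """Windows only: return a function that writes one value under HKCU."""
    import winreg  # imported here so the module still loads on other platforms

    key = winreg.CreateKey(winreg.HKEY_CURRENT_USER, subkey)

    def write(index: int) -> None:
        winreg.SetValueEx(key, f"Value{index}", 0, winreg.REG_SZ, "test")

    return write


if __name__ == "__main__" and sys.platform == "win32":
    elapsed = time_operation(make_registry_writer())
    print(f"1000 registry writes took {elapsed:.3f} seconds")
```

Running the same script on a patched and an unpatched machine and dividing the two results gives the kind of ratio discussed here.  The test key should be deleted afterwards.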

Spectre Applied


Spectre Not Applied


A 2.22x reduction in performance.  But this is writing to HKCU, which is a much smaller hive file.  What happens if I force a change on the much larger HKLM?

Spectre Applied


Spectre Not Applied


Wow.  The size of the registry hive makes a big difference in performance.  We go from 2.22x to 3.42x performance degradation.  So at a very fine-grained level, Spectre appears to have a large impact on registry operations, and the larger the hive the worse the impact.  With this information it makes sense why Spectre may impact Citrix/RDS more.  Registry operations occur with a high frequency in this world, and logons highlight it even more as group policy and the registry are very intertwined.

This actually brings to mind another metric I can measure.  We have a very large AppV package that has an 80MB registry hive that is applied to the SOFTWARE hive when the package is loaded.  The difference in the amount of time (in seconds) loading this package is:

“583.7499291” (not spectre system)
“2398.4593479” (spectre system)

This goes from 9.7 minutes to 40.0 minutes.  Another 4x drop in performance, and this would be predominantly registry related.  So this is another data point showing that registry operations are hit very hard.
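
The conversion and ratio for the two load times quoted above can be checked with a few lines of Python (the variable names are illustrative):

```python
# AppV package load times in seconds, as quoted above.
unpatched_seconds = 583.7499291
patched_seconds = 2398.4593479

unpatched_minutes = unpatched_seconds / 60
patched_minutes = patched_seconds / 60
ratio = patched_seconds / unpatched_seconds

print(f"no Spectre patch: {unpatched_minutes:.1f} minutes")
print(f"Spectre patch:    {patched_minutes:.1f} minutes")
print(f"slowdown:         {ratio:.2f}x")
```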

Windows/VM Level – Capacity

Does Spectre affect the capacity of our Citrix servers?

I recorded the CPU utilization of several VM’s that mirror each other on hosts that mirror each other with a singular difference.  One set had the Spectre mitigation enabled.  I then took their CPU utilization results:

Red = VM with Spectre, Blue = VM without Spectre

By just glancing at the data we can see that the Spectre VM’s had higher peaks and they appear higher more consistently.  Since “spiky” data is difficult to read, I smoothed out the data using a moving average:

Red = VM with Spectre, Blue = VM without Spectre

We can get a better feel for the separation in CPU utilization between Spectre enabled/disabled.  We are seeing clearly higher utilization.
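
For reference, a simple moving average of the kind used for this smoothing can be sketched in Python as follows.  The window size and the sample values are illustrative assumptions, not the data behind the graphs.

```python
def moving_average(samples: list, window: int) -> list:
    """Average each run of `window` consecutive samples."""
    if window < 1:
        raise ValueError("window must be at least 1")

    smoothed = []
    running_total = 0.0
    for index, value in enumerate(samples):
        running_total += value
        if index >= window:
            # Drop the sample that just slid out of the window.
            running_total -= samples[index - window]
        if index >= window - 1:
            smoothed.append(running_total / window)
    return smoothed


# Hypothetical spiky CPU readings (percent).
cpu_readings = [10, 20, 90, 20, 10]
print(moving_average(cpu_readings, window=3))
```

A larger window flattens spikes more aggressively, at the cost of hiding short peaks.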

Lastly, I took all of the results for each hour and produced a graph in an additive model:

This graph gives a feel for the impact during peak hours, and helps smooth out the data a bit further.  I believe what I’m seeing with each of these graphs is a performance hit measured by the VM at 25%-35%.

Host Level – Capacity

Measuring from the host level can give us a much more accurate picture of actual resources consumed.  Windows 2008 R2’s CPU counters aren’t very accurate, so if there are lots of little slices of CPU time that it misses, they can add up.

My apologies for swapping colors.  Raw data:

Blue = Spectre Applied, Red = No Spectre

Very clearly we can see the hosts with Spectre applied consume more CPU resources, even coming close to consuming 100% of the CPU resources on the hosts.  Smoothing out the data using moving averages reveals the gap in performance with more clarity.


Showing the “Max CPU” value for each hour gives another visualization of the performance hit.



Windows 2008 R2 running Citrix/RDS workloads will be impacted quite heavily.  The impact that I’ve been able to measure appears to be focused on registry-related activities.  Applications that store their settings/values/preferences in registry hives, whether that is the SOFTWARE, SYSTEM or HKCU hive, will feel a performance impact.  Logon actions on RDS servers are particularly impacted because group policies are largely registry-related items, so logon times will increase as it takes longer to process reads and writes.  CPU utilization is higher at both the Windows VM level and the hypervisor level, by up to 40%.  The impact on the speed of applications and other functions is notable although more difficult to measure.  I was able to measure a ~400% degradation in performance for CPU processing of Group Policy Preferences, but perception is a real thing, so going from 100ms to 400ms may not be noticed.  However, on applications that measure response time, we found a performance impact of 165%: what took 1000ms now takes 1650ms.

At the time of this writing, I was only able to quantify the performance impact between two of the different hosts: the Intel Xeon E5-2660 v4 and the Intel Xeon E5-2680.

The Intel Xeon E5-2660 v4 runs at a frequency 26% lower than the older 2680 (2.00GHz vs 2.70GHz).  To overcome this handicap, the newer processor must do about 35% more work per clock (2.70/2.00 = 1.35).  CPUBenchMark had the two processors with a single thread CPU score of 1616 for the Intel Xeon E5-2660 v4 and 1657 for the Intel Xeon E5-2680.  This puts them close, but even with 4 years between them the older 2680 is marginally faster.  This has played out in our testing: the higher frequency processor is faster.  Performance degradation for the two processors came out as such:


CPU                               Performance Hit
Intel Xeon E5-2680 2.70GHz        155%
Intel Xeon E5-2660 v4 2.00GHz     170%


This tells us that the frequency of the processor is more important for mitigating the performance hit.
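
The frequency and single-thread figures quoted above can be turned into a per-clock comparison with a short Python sketch.  Dividing a benchmark score by clock speed is a crude proxy for per-clock performance, so treat the result as indicative only.

```python
# Base frequencies (GHz) and single-thread scores quoted in this post.
old_ghz, old_score = 2.70, 1657   # Intel Xeon E5-2680
new_ghz, new_score = 2.00, 1616   # Intel Xeon E5-2660 v4

# How much lower the newer processor's frequency is.
frequency_deficit = 1 - new_ghz / old_ghz

# Per-clock improvement needed for the newer processor to break even.
required_per_clock_gain = old_ghz / new_ghz - 1

# Per-clock improvement implied by the quoted scores.
actual_per_clock_gain = (new_score / new_ghz) / (old_score / old_ghz) - 1

print(f"frequency deficit:       {frequency_deficit:.1%}")
print(f"required per-clock gain: {required_per_clock_gain:.1%}")
print(f"actual per-clock gain:   {actual_per_clock_gain:.1%}")
```

The implied per-clock gain falls slightly short of what is required, which is consistent with the older, higher-frequency processor scoring marginally higher.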

Keep in mind, these findings are my own.  It’s what I’ve experienced in my environment with the products and operating systems we use.  Newer operating systems are supposed to perform better, but I don’t have the ability to test that currently so I’m sharing these numbers as this is an absolute worst case type of scenario that you might come across.  Ensure you test the impact to understand how your environment will be affected!


Group Policy Preferences Registry Extension vs Group Policy Registry Extension


In various discussions I’ve read about the drawbacks of Group Policy Preferences, but are they really that bad?


…Or is it how you are using it?


There are two methods of applying registry keys/values with Group Policy.  The Group Policy Registry Extension is the “traditional” form of applying policies.  Also known as ADM or ADMX policies, when creating GPO’s with this method a binary file, “.pol”, is created.  When policy application occurs this file is read and applied to your registry.  As a binary file, this file is kept small and fast.  Reading and applying the settings should be nearly instant.

The second method of applying registry keys is with Group Policy Preferences (GPP).  This was a “new” method introduced in Windows Server 2008 with the purchase of PolicyMaker by Microsoft.  Group Policy Preferences are much, much more flexible than the traditional form.  There are different ways of applying registry values, including the CRUD model (Create, Replace, Update, Delete), filtering by the way of “Item Level Targeting“, either on an individual value or on a collection.

I’ve seen an organization heavily leverage GPP to great success.  I started to wonder though, what are the performance impacts of using GPP over the traditional method?  This post will explore the differences in the CRUD model and how it compares to the traditional method.

I intend to look at the following scenarios:

  1. Creating a registry value
  2. Updating a previous registry value
  3. Removing a registry value

However, GPP has a fourth method, “Replace” and I’ll explore what it does in addition to these 3 methods.

Creating a Registry Value

In this scenario, the registry will be clean and a new value will be created.  I’m going to refer to the Group Policy Registry Extension (AKA, Administrative Templates, ADM/ADMX) as the “traditional” method and use the abbreviation GPP for the Group Policy Preferences Registry Extension.


After reading the Registry.pol from the SYSVOL, the application of the registry key takes just 3 operations: RegCreateKey, RegSetValue, and RegCloseKey.

Each one of these operations took around 1-1.1ms, with the caveat that Process Monitor (procmon) consumes some resources capturing this information, slowing it down slightly.



We can see a new operation “RegQueryValue”.  As described by William Stanek, “The Create action creates a preference if it doesn’t already exist. For example, you can use the Create action to create and set the value of a user environment variable called CurrentOrg on computers where it does not yet exist. If the variable already exists, the value of the variable will not be changed.”

The RegQueryValue is executing the check to see if a variable already exists.  So what does GPP look like if the value is already present?

3 operations with the process exiting on a success on the value being present.

The end result is 3 operations for our traditional method, and 4 operations for the Group Policy Preferences method for creating a registry entry.

Updating a registry value

In this scenario, the registry will contain a value, and the policy will be updated with a new value.  For the traditional method this will involve changing the Microsoft “User Profiles” policy.  I set the “HomeDir” location to “TrententTest”, applied the value, then updated it to “TrententTye”.  This will ensure a new, changed key is applied.  For GPP I’m going to change the value on the policy to 0x0 from 0x1 and use the “Update” operation.


Traditional maintains a very simple “3 operation” action with updating a value having the same effect as if the value was never present to begin with.


With the “Update” action, GPP now executes just 3 operations, same as the traditional.

The end result is 3 operations for our traditional method, and 3 operations for the Group Policy Preferences method for updating a registry entry.

Removing a registry value

In this scenario, I am going to remove a registry value.  Using the traditional method this means modifying my group policy to “Not Configured”, and for GPP this means setting “Delete” for our operation.


Again, Traditional performs its work in just 3 operations.



GPP also performs this work in just 3 operations.


GPP – The Replace Method

Group Policy Preferences has another operation to explore.  “Replace”.

This operation …”creates preferences that don’t yet exist, or deletes and then creates preferences that already exist.”

This sounds like it performs a few operations.  Let’s see what it looks like:

Replace executes “6” operations.  RegOpenKey, RegDeleteValue, RegCloseKey, RegCreateKey, RegSetValue, RegCloseKey.  I’m not entirely sure why you’d want a DeleteValue before SetValue but that’s what this selection does.


Revisiting GPP: “Creating a Registry Value”

During the process of creating this post, I wondered if the 3 operation “Update” would work better for creating a key.  The GPP “Create” selection has 4 operations, but the “Update” selection only has 3 operations.  I deleted my “TrententTestPreferences” key and refreshed group policy:


3 operations!  So Group Policy Preferences has the potential to operate at the same speed as the traditional group policy IF YOU STICK TO USING “UPDATE”.  At the very least, these operations should take the same amount of time.  Of course, implementation might be a different story.

The final tally:
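
The operation counts reported in the sections above can be collected in a small Python sketch.  The layout is an assumption; the numbers are the ones stated in this post.

```python
# (action, traditional operations, GPP operations); None = not applicable.
TALLY = [
    ("Create (value absent)", 3, 4),
    ("Create (value present)", None, 3),
    ("Update", 3, 3),
    ("Delete", 3, 3),
    ("Replace", None, 6),
    ("Update used to create", None, 3),
]


def format_tally(rows) -> str:
    """Render the tally as aligned plain-text columns."""
    lines = [f"{'Action':<26}{'Traditional':>12}{'GPP':>6}"]
    for action, traditional, gpp in rows:
        shown = "-" if traditional is None else str(traditional)
        lines.append(f"{action:<26}{shown:>12}{gpp:>6}")
    return "\n".join(lines)


print(format_tally(TALLY))
```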

Stay tuned for part 2 — The Performance Comparison


Group Policy – Monolithic vs Functional Design and Performance Evaluation


Group Policy design is a hotly discussed topic, with lots of different ideas and discussions.  However, there are not a whole lot of actual metrics.  ADM and ADMX templates apply registry keys in an ‘enforced’ manner.  That is, if you or the machine has access to read the policies, the registry keys within are applied.  If you stuck purely to ADM/ADMX policies but wanted to do dynamic filtering or application of the keys/values based on a set of criteria, you’d probably design multiple policies and nested organizational units (OUs).  From here, you could filter certain policies based on the machine or user location in the OU structure, or by filtering on the policies themselves and denying access to certain groups or doing explicit allows to certain groups.  This design style, back in the day, was called “functional” design.


However, the alternative style, “monolithic” design, simplifies the group policy object (GPO) design into far fewer GPO’s.



My test setup is very simple: an organizational unit (OU) with inheritance blocked to control the application of the GPO’s.  I created 100 individual GPO’s with a single registry value, and 1 GPO with 100 values.  I chose to do a simple registry addition as it should be the best performance option for group policy.  I created a custom ADMX file for this purpose:

Monolithic simulation:


Functional simulation:



In testing these two designs I elected to focus on the one factor that would have the most impact: latency.  I set up my client machine in the OU, put a WAN emulator that can manipulate latency between the client and the domain controller, and measured the performance of the functional and monolithic designs at varying latencies.  I looked for the following event ID’s: 4257, 5257, 4016, 5016.  The x257 events mark the start and end of group policy downloading the group policy objects off the SYSVOL file share.  The x016 events mark the start and end of policy processing, which shows how long it took the policy to be processed.
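
Measuring the time between a start event and its matching end event can be sketched in Python as below.  It assumes the events have already been exported as (timestamp, event ID) pairs sorted by time; the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def durations_ms(events, start_id: int, end_id: int) -> list:
    """Pair each start event with the next end event and return durations in ms."""
    durations = []
    started = None
    for timestamp, event_id in events:
        if event_id == start_id:
            started = timestamp
        elif event_id == end_id and started is not None:
            durations.append((timestamp - started) / timedelta(milliseconds=1))
            started = None
    return durations


# Hypothetical export of Group Policy operational events.
t0 = datetime(2018, 1, 1, 9, 0, 0)
sample_events = [
    (t0, 4257),
    (t0 + timedelta(seconds=3), 5257),
    (t0 + timedelta(seconds=3, milliseconds=10), 4016),
    (t0 + timedelta(seconds=3, milliseconds=281), 5016),
]

print("download ms:  ", durations_ms(sample_events, 4257, 5257))
print("processing ms:", durations_ms(sample_events, 4016, 5016))
```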


The results:


Raw Data:

Functional GPO – applying 100 registry values

Event ID 4016 to 5016
Latency (ms)    Time (ms)
0               271
10              4089
25              8078
50              15315
75              22904
100             29820

Event ID 4257 to 5257 – Starting to download policies
Latency (ms)    Time (s)
0               0
10              3
25              6
50              12
75              17
100             22


Monolithic GPO – applying 100 registry values

Event ID 4016 to 5016
Latency (ms)    Time (ms)
0               117
10              156
25              198
50              284
75              336
100             435

Event ID 4257 to 5257 – Starting to download policies
Latency (ms)    Time (s)
0               0
10              0
25              1
50              1
75              1
100             1



There is a huge edge to the monolithic design, both in terms of how long it takes to process a single policy vs multiple policies and in its resiliency to the effects of latency.  Even ‘light’ latency can have a profound impact on performance.  Going from 0ms to 10ms increased the length of time to process a functional design by 15 times!  The monolithic design, on the other hand, was barely impacted.  Even with a latency of 100ms, it only added ~300ms of time to process the policy.  I would consider this imperceptible in the real world, whereas the functional design going from ~271ms to ~4000ms would be an extremely noticeable impact!  Let alone about 30 seconds at 100ms!

Another factor is how much additional time is required to download the policies.  This is time in addition to the processing time.  This probably shouldn’t be a huge surprise: it appears that group policies are downloaded and processed sequentially, in order.  I’m sure this is necessary to maintain some semblance of predictability if you have conflicting policy settings, so that the one last in the list (whether sorted alphabetically or what have you) can be relied on to be the winner.
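
One way to summarize the raw processing times is to fit a straight line to each design and compare the slopes, that is, how many milliseconds of processing each extra millisecond of latency costs.  The Python sketch below uses an ordinary least-squares slope; treating the relationship as linear is an assumption, although the raw data looks close to it.

```python
# Event ID 4016 to 5016 timings from the raw data above.
LATENCY_MS = [0, 10, 25, 50, 75, 100]
FUNCTIONAL_MS = [271, 4089, 8078, 15315, 22904, 29820]
MONOLITHIC_MS = [117, 156, 198, 284, 336, 435]


def least_squares_slope(xs, ys) -> float:
    """Slope of the best-fit line through the (x, y) points."""
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


functional_slope = least_squares_slope(LATENCY_MS, FUNCTIONAL_MS)
monolithic_slope = least_squares_slope(LATENCY_MS, MONOLITHIC_MS)

print(f"functional: {functional_slope:.1f} ms per ms of latency")
print(f"monolithic: {monolithic_slope:.2f} ms per ms of latency")
print(f"per GPO (functional, 100 GPOs): {functional_slope / 100:.2f}")
```

The functional design costs roughly 290 ms of processing per millisecond of added latency against roughly 3 ms for the monolithic design.  Divided across its 100 GPO's, the functional figure is about the same per-GPO cost as the single monolithic GPO, which fits the idea that the penalty scales with the number of policies.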

Adding latency, even just a little latency, has a noticeable impact.  And the more policies, the more traffic, the more the impact of latency.  Again, a loss for a functional design and a win for a more monolithic design.


Group Policy Objects can have a large impact on user experience.  The goal should be to minimize them to as few as possible.  As with everything, there are exceptions to the rules, but for Group Policy it’s important to try and maintain this rule.  Even just a little latency between the domain controller and the client can have a massive impact on group policy performance.  This can affect anything from the length of time it takes a machine to boot to how long a user waits to log into a system.



Corrupt Registry Repair with Citrix Provisioning Services


I encountered an interesting issue with a corrupt registry and worked through a solution.  The trigger seemed innocuous enough: we upgraded PowerShell to 5.1.  Upon reboot I encountered a bluescreen with code 0xF4:

I have encountered this issue before, but I haven’t recorded my troubleshooting steps until now.

Since this was a PVS target device, the easy method was deleting the version and trying to upgrade PowerShell to 5.1, which resulted in the same BSOD.  So it was easily reproducible.  So I tried it a few more times, because, why not?

I then deleted this version, and booted into the system and looked at Event Viewer for hints of what could be at fault.  Going through it was pretty obvious that it was registry corruption:

Filtering for EventID 5 shows all the attempts of booting to BSOD:


I mean, it literally says the Registry was corrupted 🙂

Where can you find this corruption?  With PVS it’s fairly simple but I believe the same process can exist for other systems including physical.  The first step was to mount the vDisk to a VM.


Mount the registry hive you suspect with corruption.

Next is to scan the registry for corruption.  Thus far, I’ve only found corruption to be detectable if it’s a key or value that cannot be read.  If the data on a value can be read but contains garbage it’s much harder to detect.  In order to avoid permissions being a problem, I open a PowerShell prompt as SYSTEM using PSEXEC.  If you don’t elevate permissions, some keys may be restricted from the Admins group and this will be detected as a failure.

Once at this stage, it’s a one-liner to scan the registry:

In my experience, corruption can be detected as “Permission Denied”, “Access Denied”, “Path does not exist” or some such:


At this point you can examine the text file to see the last path it explored:
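
The one-liner itself is not reproduced here.  The underlying idea, walking every key, recording each path as it is visited, and noting where a read fails, can be sketched in Python as below.  The walker takes the function that lists child keys as a parameter, and a Windows-only adapter based on the standard `winreg` module is included; the assumption that the suspect hive is loaded under HKLM, and the name `CORRUPT_HIVE`, are hypothetical.

```python
def scan_tree(root: str, list_children):
    """Depth-first walk that records visited paths and read failures."""
    visited = []
    failures = []
    stack = [root]
    while stack:
        path = stack.pop()
        visited.append(path)  # the last entry shows how far the scan got
        try:
            children = list_children(path)
        except OSError as error:  # includes PermissionError and FileNotFoundError
            failures.append((path, error))
            continue
        stack.extend(reversed(children))
    return visited, failures


def list_registry_subkeys(path: str) -> list:
    """Windows only: full paths of the subkeys under HKLM\\<path>."""
    import winreg  # imported here so the module still loads on other platforms

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, path) as key:
        subkey_count = winreg.QueryInfoKey(key)[0]
        return [path + "\\" + winreg.EnumKey(key, index)
                for index in range(subkey_count)]


if __name__ == "__main__":
    import sys
    if sys.platform == "win32":
        visited, failures = scan_tree("CORRUPT_HIVE", list_registry_subkeys)
        for path, error in failures:
            print(f"{path}: {error}")
```

As noted above, this only finds keys that cannot be read; values that read successfully but contain garbage will not show up.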


At this point you can open regedit (as SYSTEM) and examine the keys within that path.  Clicking through each one revealed the corrupted key:

Attempting to view the permissions on this key reveals the corruption also exists at the ACL level:

“The requested security information is either unavailable or can’t be displayed”.

Deleting the key may fail as well:

At this point you need to evaluate how to manage the corruption.  If you cannot delete the key, rename it, or in some way replace it, you may have an option like I did.  You can rename a branch higher up in the tree, go to an existing system with the same keys (with PVS I can go to the previous version and export that tree) and reimport.

Unload the hive and boot up the system –> and you may have a fully working system!

I’ve used this trick here, and on a corrupt COMPONENTS hive in the past.  With the COMPONENTS hive I got lucky: other machines didn’t have the same key in COMPONENTS, but I could replace the corrupted keys with ones from a branched vDisk.



Meltdown – Performance Impact Evaluation (Citrix XenApp 6.5)


Meltdown came out and it’s a vulnerability whose fix may have a performance impact.  Microsoft has stipulated that the impact will be more severe if you:

a) Are running an older OS
b) Are using an older processor
c) Are running applications that perform lots of context switches

Unfortunately, the environment we are operating hits all of these nails on the head.  We are using, I believe, the oldest OS that Microsoft is patching this for.  We are using older processors from 2011-2013 which do not have the PCID optimization (as reported by the SpeculationControl test script) which means performance is impacted even more.

I’m in a large environment where we have the ability to shuffle VM’s around hosts and put VM’s on specific hosts.  This will allow us to measure the impact of Meltdown in its entirety.  Our clusters are dedicated to Citrix XenApp 6.5 servers.

Looking at our cluster and all of the Citrix XenApp VM’s, we have some VM’s that are ‘application siloed’ — that is they only host a single application and some VM’s that are ‘generic’.

In order to determine the impact I looked at our cluster, summed up the total of each type of VM and then just divided by the number of hosts.  We have 3 different geographical areas that have different VM types and user loads.  I am going to examine each of these workload types across the different geographical areas and see what are the outcomes.

Since Meltdown impacts applications and workloads that have lots of context switches, I used perfmon on each server to record its context switches.

The metrics I am interested in are the context switch values as they’ve been identified as the element that highlights the impact.  My workloads look like this:

Based on this chart, our largest impact should be at Location B, followed by Location A, and lastly Location C.

However, the processors for each location are as follows:

Location A: Intel Xeon 2680
Location B: Intel Xeon 2650 v2
Location C: Intel Xeon 2680

The processors may play a role, as newer generation processors are supposed to fare better.

In order to test Meltdown in a side-by-side comparison of systems with and without the mitigation I took two identical hosts and populated them with an identical amount and type of servers. On one host we patched all the VM’s with the mitigation and with the other host we left the VM’s without the patches.

Using the wonderful ControlUp Console, we can compare the results in their real-time dashboard.  Unfortunately, the dashboard only gives us a “real time” view.  ControlUp offers a product called “Insights” that can show the historical data, but our organization has not subscribed to it, so I’ve had to track the performance by exporting the ControlUp views on an interval and then manually sorting and presenting the data.  Using the Insights view would be much, much faster.

ControlUp has 3 different views I was hoping to explore.  The first is the hosts view, which shows performance metrics pulled directly from the VMware host.  The second is the computers view, and the last is the sessions view.  The computers and sessions views are metrics pulled directly from the Windows server itself.  However, I am unable to accurately judge performance from the Windows Server metrics because of how Windows measures CPU performance.

Another wonderful thing about ControlUp is we can logically group our VM’s into folders, from there ControlUp can sum the values and present it in an easily digestible presentation.  I created a logical structure like so and populated my VM’s:


And then within ControlUp we can “focus” on each “Location” folder and if we select the “Folder” view it presents the sums of the logical view.


In the hosts view, we can very quickly see the impact, ranging from 5%-26%.  However, this is a realtime snapshot, so I tracked the average view and examined only the “business hours”, as our load is VERY focused on 8AM-4PM.  After these hours we see a significant drop in our load.  If the servers are not being stressed the performance seems to be a lot more even, or the difference is not noticeable (in a cumulative sense).


Some interesting results.  We are consistently seeing longer login times and application launch times.  2/3 of the environments have lower user counts on the unpatched servers, with the Citrix load balancing making that determination.  The one environment that had more users on the mitigation servers is actually our least loaded in terms of servers per host and users per server, so it’s possible that more users would drive a gap, but as of now it shows that one of our environments can support an equal number of users.

Examining this view and the historical data presented I encountered oddities — namely the CPU utilization seemed to be fairly even more often than not, but the hosts view showed greater separation between the machines with mitigation and without.  I started to explore why and believe I may have come across this issue previously.

2008R2-era servers have less accurate reporting of CPU utilization.

I believe this is also the API that ControlUp is using with these servers to report on usage.  When I was examining a single server with process explorer I noticed a *minimum* CPU utilization of 6%, but task manager and ControlUp would report 0% utilization at various points.  The issue is one of accuracy, adding and rounding.  The more users on a server, with more processes each consuming ever so little CPU, the greater the inaccuracy.  Example:

Left, Task Manager. Right, Process Explorer

We have servers with hundreds of users utilizing a workflow like this where they are using just a fraction of a percent of CPU resources.  Task Manager and the like will not catch these values and will round *down*.  If you have 100 users each using a process that consumes 0.4% CPU then our inaccuracy is on the scale of 40%!  So relying on the VM metrics of ControlUp or Windows itself is not helpful.  Unfortunately, this destroys my ability to capture information within the VM, requiring us to rely solely on the information within VMware.  To be clear, I do NOT believe Windows 2012 R2 and greater OS’s have this discrepancy (although I have not tested), so this issue manifests itself pretty viciously in the XenApp 6.5-era servers.  Essentially, if Meltdown is increasing CPU times on processes by a fraction of a percent then Windows will report as if everything is ok and you will probably not actually notice or think there is an issue!
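To make the rounding problem concrete, here is a minimal Python sketch.  It assumes a meter that truncates each process reading down to a whole percent before summing; that is a simplified model of the behaviour described above, not a statement of how Task Manager or ControlUp are actually implemented, and the function names are made up for illustration.

```python
import math


def actual_total(per_process_cpu, processes):
    """True aggregate CPU: every process contributes its real share."""
    return per_process_cpu * processes


def reported_total(per_process_cpu, processes, resolution=1.0):
    """Aggregate as seen by a meter that rounds each reading DOWN.

    `resolution` is the smallest step the meter can display
    (assumed here to be one whole percent).
    """
    shown = math.floor(per_process_cpu / resolution) * resolution
    return shown * processes


users = 100
per_process = 0.4  # percent CPU per process, from the example above

print("actual load   :", round(actual_total(per_process, users), 1))
print("reported load :", round(reported_total(per_process, users), 1))
```

Under this model every individual reading is below the meter's resolution, so the whole load disappears from the per-VM numbers while still being consumed on the host.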

In order to try and determine if this impact is detectable I took two servers with the same base image, one with the mitigation installed and the other without.  I used process explorer and “saved” the process list over the course of a few hours.  I ensured the servers had a similar number of users using a specific application that only presented a specific workload so everything was as similar as possible.  In addition, I looked at processes that aren’t configurable (or, since the servers have the same base image, are configured identically).  Here were my results:

Just eyeballing it, it appears that the mitigation has had an impact on the fractional level.  When taking the average of the Winlogon.exe and iexplore.exe processes into account:


These numbers may seem small, but once you start considering the number of users the amount wasted grows dramatically.  For 100 users, winlogon.exe goes from consuming a total of 1.6% to 7.1% of the CPU, resulting in an additional load of 5.5%.  The iexplore.exe is even more egregious as it spawns 2 processes per user, and these averages are per process.  For 100 users, 200 iexplore.exe processes will be spawned.  The iexplore.exe CPU utilization goes from 15.6% to 38.8%, for an additional load of 23.2%.  Adding the mitigation patch can impact our load pretty dramatically, even though it may be under-reported.  That under-reporting can impact users on a far greater scale, because more users get added to servers that don’t have the resources Windows is reporting they have.  For an application like IE, this may just mean greater slowness, depending on the workload/workflow, but if you have an application more sensitive to these performance scenarios your users may experience slowness even though the servers themselves look OK from most (all?) reporting tools.
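The arithmetic above can be checked with a short Python sketch.  The totals are the ones quoted in the text; the per-process averages are back-calculated from those totals, and the process counts assume one winlogon.exe per user and two iexplore.exe per user.  The helper names are illustrative only.

```python
def per_process_average(total_cpu, process_count):
    """Average CPU percent consumed by a single process."""
    return total_cpu / process_count


def added_load(total_before, total_after):
    """Extra CPU percent consumed after the mitigation, to one decimal."""
    return round(total_after - total_before, 1)


USERS = 100

# name -> (process count, total CPU % unpatched, total CPU % patched)
workloads = {
    "winlogon.exe": (USERS * 1, 1.6, 7.1),    # assumed one per user
    "iexplore.exe": (USERS * 2, 15.6, 38.8),  # two per user, per the text
}

for name, (count, before, after) in workloads.items():
    print(
        name,
        "processes:", count,
        "avg before:", round(per_process_average(before, count), 3),
        "avg after:", round(per_process_average(after, count), 3),
        "added load:", added_load(before, after),
    )
```

Each per-process figure stays well under one percent, which is why none of this shows up in a meter that only resolves whole percentages.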

Continuing with the HOSTS view, I exported all data ControlUp collects on a minute interval, added the data to Excel and created my pivot tables comparing the hosts that are hosting servers with the mitigation patches and the ones without.  This is what I saw for Saturday-Sunday; these days are lightly loaded.

This is Location B.  The host with VM’s that are unpatched is in orange and the host with patched VM’s is in blue.  The numbers are pretty identical when CPU utilization on the host is around or below 10%, but once it starts to get loaded the separation becomes apparent.

Since these datapoints were every minute, I used a moving average of 20 data points (20 minutes, or 3 windows per hour) to present the data in a cleaner way:
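For anyone wanting to reproduce the smoothing outside of Excel, a simple moving average can be sketched in Python as follows.  This is a minimal sketch of the general technique, not the exact Excel calculation used for these charts; the sample values are made up.

```python
def moving_average(samples, window=20):
    """Simple moving average over the trailing `window` samples.

    Returns one value per full window, so the result is
    len(samples) - window + 1 items long.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    running = 0.0
    for i, value in enumerate(samples):
        running += value
        if i >= window:
            # Drop the sample that just fell out of the window.
            running -= samples[i - window]
        if i >= window - 1:
            averages.append(running / window)
    return averages


# Hypothetical per-minute host CPU readings (percent) for one hour.
cpu_per_minute = [10 + (minute % 7) for minute in range(60)]
smoothed = moving_average(cpu_per_minute, window=20)
print(len(cpu_per_minute), "raw points ->", len(smoothed), "smoothed points")
```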

Looking at the data for this Monday morning, we see the following:

Location B


Some interesting events, at 2:00AM the VM’s reboot.  We reboot odd and even servers each day, and in my organization of testing this, I put all the odd VM’s on the blue host, and the even VM’s on the orange host.  So the blue line going up at 2:00AM is the odd (patched) VM’s rebooting.  The reboot cycle is staggered to take place over a 90 minute interval (last VM’s should reboot around 3:30AM).  After the reboot, the servers come up and do some “pre-user” startup work like loading AppV packages, App-V registry prestaging, etc.  I track the App-V registry pre-staging duration during bootup and here are my results:

Registry Pre-staging in AppV is a light-Read heavy-Write exercise.  Registry reading and writing are slow in 2008R2 and our time to execute this task went from 610 seconds to 693 seconds for an overall duration increase of 14%.

Looking at Location A and C

Location A

Location C (under construction)

We can see in Location A the CPU load is pretty similar until the 20% mark, then separation starts to ramp up fairly drastically.  For Location C, unfortunately, we are undergoing maintenance on the ‘patched’ VM’s, so I’m showing this data for transparency but it’s only relevant up to the 14th.  I’ll update this in the next few days when the ‘patched’ VM’s come back online.

Now, I’m going to look at how “Windows” is reporting CPU performance vs the hosts’ CPU utilization.

Location A


Location B


The information this is conveying is to NOT TRUST the Windows CPU utilization meter (at least 2008 R2).  The CPU Utilization on the VM-level does not appear to reflect the load on the hosts.  While the VM’s with the patch and without the patch both report nearly identical levels of CPU utilization, on the host level the spread is much more dramatic.


Lastly, I am able to pull some other metrics that ControlUp tracks, namely Logon Duration and Application Launch Duration.  For each of the locations I got a report of the difference between the two environments:

Location A: Average Application Load Time

Location B: Average Application Load Time


Location A: Logon Duration


Location B: Logon Duration




In each of the metrics recorded we experience a worsening experience for our user base, from the application taking longer to launch, to logon times increasing.

What does this all mean?

In the end, Meltdown has a significant impact on our Citrix XenApp 6.5 environment.  The perfect storm of older CPU’s, an older OS and applications that have workflows that are impacted by the patch means our environment is grossly impacted.  Location A has a maximum hit (as of today) of 21%, and Location B has a spread of 12%.  I had originally predicted that Location B would have the largest impact; however, the newer V2 processors may be playing a role, and the V2 processors may be more efficient than the older 2680.

In the end, the performance hit is not insignificant and reduces our capacity significantly once these patches are deployed.  I plan on adding new articles once I have more data on Meltdown and then further again once we start adding the mitigations against Spectre.

CPU Utilization on the hosts. Orange is a host with VM’s without the Meltdown patches, blue is with the patches.



Citrix Storefront – Adventures in customization – Add a help button to your Storefront UI


This customization is pretty easy.  Add the following to your custom.js file:

Replace “” with the URL you want your help screen to be.


Citrix Storefront – Adventures in customization – Default to “Store” view if you have no favourited app’s


We are in the process of migrating users from Web Interface to Storefront.  We have identified a potential issue: new users are directed to the “Favourites” view, which doesn’t have any applications by default.  Instead it has instructions on how to add apps to the favourites view.

New users might say, “Where did my apps go?!”

The concern is users may become confused because Web Interface shows all your applications, and this new view shows none.  What we want to do to solve this is default to the “Store” view if you have no favourite apps, and default to the favourites view if you have at least 1 favourited app.


We can do this.


Just add the code above to your custom.js file and the default view will be changed to the store if you have no favorited apps.  Done!


AppV 5 – Raiser’s Edge 7.96 – Run-time error -2147024770 (8007007e)


We are in the process of upgrading Blackbaud’s Raiser’s Edge to 7.96 and we encountered an error:

Run-time error ‘-2147024770 (8007007e)’:
Automation error
The specified module could not be found.

This error is giving us a few clues as to what might be happening.  The most obvious message is the “8007007e” which is a standard windows error hex code which translates to:

8007007E = FileNotFoundException
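The signed decimal in the dialog and the hex value are the same 32-bit number.  As a quick check, here is a small Python sketch (the helper name is made up for illustration) that converts one to the other and splits out the standard HRESULT fields:

```python
def decode_hresult(signed_value):
    """Split a signed 32-bit HRESULT into its standard fields."""
    unsigned = signed_value & 0xFFFFFFFF  # reinterpret as unsigned 32-bit
    return {
        "hex": f"{unsigned:08X}",
        "failure": bool(unsigned >> 31),        # severity bit
        "facility": (unsigned >> 16) & 0x1FFF,  # 7 is FACILITY_WIN32
        "code": unsigned & 0xFFFF,              # underlying Win32 error code
    }


result = decode_hresult(-2147024770)
print(result["hex"], "facility", result["facility"], "code", result["code"])
```

The low 16 bits come out as 126, which is the Win32 error ERROR_MOD_NOT_FOUND, and its message text is exactly the “The specified module could not be found.” line shown in the dialog.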

So RE7.exe is not finding a file it’s looking for.  With most AppV packages we can suss out the file it’s missing by using procmon and tracing for “FILE NOT FOUND” in the result field.  Unfortunately, my searching for this message did NOT result in finding a file that wasn’t resolved by another path.  In other words, all files were accounted for.  But the error message very clearly states that a file is missing.  So the next step is to install the application locally and compare the launch differences between the local install and the AppV install.  Again, process monitor makes this easy by using the “loaded modules” option.

The differences I found between a local install of this application and the AppV launch looked like so:

The launches were identical until the highlighted points.  The local install, which works without issue, has an extra file that gets loaded: bbcor7.dll.

It appears, somehow, this file is getting loaded and registered dynamically on a local install, but this is not happening with the AppV install.  I don’t see the file get searched for at all with the AppV install and tracing with procmon.  However, executing a regsvr32 /s “C:\Program Files (x86)\Blackbaud\The Raisers Edge 7\DLL\bbcor7.dll” during sequencing does do all the necessary work to register and allow RE7.exe to find and load the file in an AppV bubble.

So, long story short, execute:

regsvr32 /s "C:\Program Files (x86)\Blackbaud\The Raisers Edge 7\DLL\bbcor7.dll"

while sequencing your AppV package and this should fix the issue.

Here is my entire sequencing script:



Citrix XenDesktop/XenApp 7.15 – The local host cache in action


The Citrix Local Host Cache feature, introduced in XenDesktop/XenApp 7.12, has some nuances that may be better demonstrated in real time than typed out in text.  I will do both in this article: a ‘step by step’ of what occurs when you have a network or site database outage, as well as a realtime video highlighting the feature in action.  There are many other blogs and articles that do a great job going into the step by step details of the feature but I find seeing it in action to be very informative.

To view a video of this process, scroll to the very end, or click here.

To start, I’ve created a powershell script that simulates a user querying the broker for a list of applications.

Columns are time of the response, the payload size received (in bytes) and the total time to respond in milliseconds.
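The idea behind that output can be sketched in a few lines.  The sketch below is in Python rather than PowerShell and is not the script used in this article: `fake_broker` is a hypothetical stand-in for the real broker query, and the column layout is only an approximation of the three columns described above.

```python
import time
from datetime import datetime


def timed_request(fetch):
    """Call fetch() once and return (timestamp, payload_bytes, elapsed_ms)."""
    started = time.perf_counter()
    payload = fetch()
    elapsed_ms = (time.perf_counter() - started) * 1000
    stamp = datetime.now().strftime("%H:%M:%S")
    return stamp, len(payload), elapsed_ms


def fake_broker():
    """Hypothetical stand-in for the real 'list my applications' query."""
    time.sleep(0.01)  # pretend the broker took about 10 ms
    return b"<apps>placeholder response</apps>"


for _ in range(3):
    stamp, size, ms = timed_request(fake_broker)
    print(f"{stamp}  {size:>8} bytes  {ms:8.1f} ms")
```

Swapping `fake_broker` for a function that performs the real request would give the same three columns: when the response arrived, how big it was, and how long it took.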

As we’re querying the broker, the broker is reaching out to the database and then responding to the user with the information requested.


Periodically, the Citrix Config Synchronizer Service will check to ensure the local host cache database is in sync with the site database. This is an event that occurs every 2 minutes during normal operation.

To show the network connection failing, I am going to set up a continuous ping to the database server.

To simulate a network failure, I’m going to use the tool clumsy to drop all packets to and from the database server.

Clicking start in clumsy immediately stops the simulated user from getting their list of applications.


And the pings now time out.

The broker has a 20-second timeout, after which it will respond to requests with what it thinks is the current status. The first timed-out request receives a response of “working” and thereafter a response of “pending failed” will be returned.

Around 24 seconds in, the broker has noticed the database has failed and has logged its first event, 1201, “The connection between the Citrix Broker Service and the database has been lost”.

Now one minute thirty-three seconds into the failure, other Citrix services are reporting they cannot contact the database.

Just shy of 2 minutes, the broker service has exceeded its timeout for contacting the database and is in the process of switching to the local host cache. It stops the “primary broker”.

And then the Citrix High Availability Service becomes active, brokering user requests.

In my simulation the time it took the user to receive a response from the LHC is a little shorter than from the site database. The LHC response time is 80-90 milliseconds, whereas the response time for a request that includes the site database is 90-100 milliseconds. This information allows us to visually see the two different modes of operation in action.

Top, site database response times – middle is the outage – bottom is LHC response times

How long does it take to “fall back” to the database when connectivity is restored?

I “Stopped” clumsy to restore our network connection and started a timer.


We can see the ping responses from the database immediately to verify our connection is back.


Almost immediately, all services have noticed that they have connectivity again, including the broker service.

However, we do not fall back immediately.

At one minute thirty-three seconds the broker has switched back to the primary broker, and all services have been restored.

To watch a video of this all in action, please view here:


