
User Profile Manager – Unavoidable Delays

2018-06-30

I’ve been exploring logon time optimization and noticed the “User Profile Service” phase always showed up for 1-3 seconds.  I wondered why and began my investigation.

The first thing I needed to do was separate the “User Profile Service” into its own process.  By default it shares a svchost process with other services, which makes procmon’ing difficult.

Making this change is easy:
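A minimal sketch of the change (the service’s short name is ProfSvc; a reboot is needed for it to relaunch in a dedicated process):

    # Move the User Profile Service out of the shared svchost into its own process.
    # Call sc.exe explicitly -- in PowerShell, "sc" aliases Set-Content.
    sc.exe config ProfSvc type= own

    # Verify the change, then reboot
    sc.exe qc ProfSvc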

Now that the User Profile Service is running in its own process, we can use Process Monitor to target that PID.

I logged onto the RDS server with a different account and started my procmon trace.  I then logged into the server:

One of the beautiful things about a video like this is we can start to go through frame-by-frame if needed to observe the exact events that are occurring.  Process Monitor also gives us a good overview of what’s happening with the “Process Activity” view:

9,445 file events, 299,668 registry events.  Registry, by far, has the most events occurring on it.  And so we investigate:

  1. On new logins, the registry hive is copied from the Default User profile to your profile directory, the hive is mounted, and then security permissions are set.

    Setting the initial permissions of the user hive began at 2:14:46.3208182 and finished at 2:14:46.4414112, spanning a total of 121 milliseconds.  Pretty quick, but to minimize logon duration it’s worth examining each key in the Default User hive and ensuring you do not have any unnecessary keys, since each key will have its permissions evaluated and modified.
  2. The Profile Notification system now kicks off.

    The User Profile Service now goes through each “ProfileNotification” module and, if it is applicable, executes whatever action the module is responsible for.  In my screenshot we can see the User Profile Service alerts the “WSE”.  Each key actually contains a friendly name, giving you a hint about its role:

    It also appears we can measure the duration of each module by the “RegOpenKey” and “RegCloseKey” events tied to that module.

    In my procmon log, the WSE took 512ms, the next module “WinBio” took 1ms, etc.  The big time munchers for my system were:
    WSE: 512ms
    SyncCenter: 260ms
    SHACCT: 14ms
    SettingProfileHandler: 4ms
    GPSvc: 59ms
    GamesUX: 60ms
    DefaultAssociationsProfileHandler: 4450ms (!)
  3. In the previous screenshot we can see that ProfileNotification kicks off two events that it runs through its list of modules: Create and Load.  Load takes 153ms in total, so Create is what is triggering our event.
  4. DefaultAssociationsProfileHandler consumes the majority of the User Profile Service time.  What the heck is it doing?  It appears the Default Association Profile Handler is responsible for creating the associations between several different components and your ability to customize them.  It associates (that I can see):
    ApplicationToasts (eg, popup notifications)
    RegisteredApplications
    File Extensions
    DefaultPrograms
    UrlAssociations
    The GPO “Set Default Associations via XML file” is processed and the above is re-run with the XML file values.
  5. Do we need these associations?

    Honestly…   Maybe.

    However, does this need to be *blocking* the login process?  Probably not.  This could be an option to run asynchronously, with you, as the admin, gambling that any required associations will be set before the user gets the desktop/app.  Or if you have applications that are entirely single purpose and simply read and write to a database somewhere, then this is superfluous.

  6. Can we disable it?

    Yes…

    But I’m on the fence about whether this is a good idea.  To disable it, I’ve found that deleting the “DefaultAssociationsProfileHandler” key does work: associations are skipped and we logon 1-4 seconds faster.  However, launching a file directly, or a shortcut with a URL handler, will prompt you to choose your default program (as should be expected).

I’m exploring this idea: deleting the key entirely and using SetUserFTA to set associations.
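A sketch of that experiment, assuming the module lives under the ProfileNotification key seen in the procmon trace (export a backup first so it’s reversible):

    # Hypothetical sketch: remove the DefaultAssociationsProfileHandler module so
    # the Create event skips default-association processing at logon.
    $key = "HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion\ProfileNotification\DefaultAssociationsProfileHandler"

    # Back the key up as a .reg file so it can be restored (assumes C:\Temp exists)
    reg.exe export ($key -replace "^HKLM:", "HKLM") "C:\Temp\DefaultAssociationsProfileHandler.reg" /y

    # Delete the key; associations can then be set with SetUserFTA instead
    Remove-Item -Path $key -Recurse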

We have ~400 App-V applications that write/overwrite approximately 800 different registered applications and file extensions into our registry hive (we publish globally, which puts them there).  This is the reason I started this investigation: some of our servers with lots of App-V applications were reporting longer UserProfileService times, and tying it all together, this one module in the User Profile Service appears to be the culprit.  And with Spectre increasing the duration of registry operations by 400%, this became noticeable very quickly in our testing.

Lastly, time is still being consumed on RDS and server platforms by infuriating garbage services like GamesUX (“Games Explorer”).  It tweaks a nerve when I see time being consumed by wasteful processes.


Citrix Provisioning Service – Network Service Starting/Stopping services remotely

2018-05-02

Citrix Provisioning Services has a feature within the “Provisioning Services Console” that allows you to stop/restart/start the streaming service on another server:

 

This feature worked with Server 2008R2 but with 2012R2 and greater it stopped working.  Citrix partially identified the issue here:

 

I was exploring starting and stopping the streaming service on other PVS servers from the console and found this information was incorrect.  Adding the NetworkService does NOT enable the streaming service to be stopped/started/restarted from other machines.  The reason is that NETWORKSERVICE is a LOCAL account on the machine itself.  When it attempts to reach out and communicate with another system, it is translated into a proper SID, which matches the machine account.  Since that SID communicating across the wire does not have access to the service, you get a failure.

In order to fix this properly, we can either add machine account permissions for each PVS server on each service, OR add all the machine accounts into a security group and grant that group permission to manipulate the service on each PVS server.
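A minimal sketch of the underlying idea using sc.exe and SDDL; it assumes the streaming service’s short name is StreamService (verify with Get-Service) and a hypothetical group name (run it elevated on each PVS server):

    # Grant a group start/stop rights on the PVS streaming service via SDDL.
    $svc   = "StreamService"                                  # assumed service short name
    $group = "DOMAIN\CTX.Servers.ProvisioningServiceServer"   # hypothetical group

    # Resolve the group to a SID
    $acct = New-Object System.Security.Principal.NTAccount($group)
    $sid  = $acct.Translate([System.Security.Principal.SecurityIdentifier]).Value

    # Read the service's current security descriptor in SDDL form
    $sddl = (sc.exe sdshow $svc | Where-Object { $_ }) -join ""

    # RP = start, WP = stop, DT = pause/continue, LC = query status
    $ace = "(A;;RPWPDTLC;;;$sid)"

    # Append the ACE to the DACL, keeping any SACL ("S:" section) at the end
    $newSddl = if ($sddl -match "^(?<dacl>.*?)(?<sacl>S:.*)$") {
                   $Matches["dacl"] + $ace + $Matches["sacl"]
               } else {
                   $sddl + $ace
               }

    sc.exe sdset $svc $newSddl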

I created a PowerShell script to easily add a group, user, or machine account to the Streaming Service.  It will also list all the permissions:

An example adding a Group to the permissions to the service:

And now we can start the service remotely:

 

In order to get this working entirely I recommend the following steps:

  1. Create a Group (eg, “CTX.Servers.ProvisioningServiceServer”)
  2. Add all the PVS Machine Accounts into that group
  3. Reboot your PVS server to gain that group membership token
  4. Run the PowerShell script on each machine to add the group permission to the streaming service:
  5. Done!

And now the script:

 


Meltdown + Spectre – Performance Analysis

2018-04-30

Meltdown and Spectre (variant 2) are two vulnerabilities that came out at the same time, but they are vastly different.  Patches for both were released extremely quickly for Microsoft OS’s, but because of a variety of issues with Spectre, only Meltdown could truly be mitigated at first.  The Spectre (variant 2) mitigation had such a problematic release, causing numerous issues for whoever installed the fix, that it had to be recalled and the release delayed for weeks.  However, around March/April 2018, the Spectre patch was finalized and the microcode released.

Threat on Performance

Of the two, Spectre (variant 2) threatened to degrade performance fairly drastically.  Initial benchmarks mentioned that storage was hit particularly hard, and Microsoft commented that server OS’s could suffer the most.  Even worse, older operating systems do not support the CPU features (PCID) that could reduce the performance impact.  Older OS’s also suffer more due to designs (at the time) that ran more code in kernel mode (font rendering was singled out as an example of one of these design decisions) than newer OS’s.

As with most things on my blog I am particularly interested in the impact against Citrix/Remote Desktop Services type of workloads.  I wanted to test ON/OFF workloads of the mitigation impacts.

Setup

My setup consists of pairs of ESXi (version 6.0) hosts with identical VM’s on each, hosting identical applications.  I was able to set up 4 of these pairs of hosts, and each pair has identical processors.  The one notable difference is that one host of each pair has the Spectre and Meltdown patch applied to the ESXi hypervisor.

The operating system of all the VM’s is Windows Server 2008 R2.  Applications are published from Citrix XenApp 6.5.

 

 

This is simply a snapshot of a single point in time to show the metrics of these systems.

Performance Considerations

Performance within a Citrix XenApp farm can be described in two ways: capacity and speed.

Speed

Generally, one would run a “best case” test of the speed aspect of your application’s performance.

A simplified view of this is “how fast can the app do X task?”

This task can be anything.  I’ve seen it measured by an automated script flipping through tabs of an application, where each tab pulled data from a database – rendered it – then moved on to the next tab.  The total time to execute these tasks amounted to a number that they used to baseline the performance of this application.

I’ve seen it measured as simply opening an Excel document with macros and lots of formulas that pull data and perform calculations, and measuring that duration.

The point of each exercise is to generate a baseline that both the app team and the Citrix team can agree to.  I’ve almost never had the baseline equal “real world” workloads, typically the test is an exaggeration of the actual workflow of users (eg, the test exaggerates CPU utilization).  Sometimes this is communicated and understood, other times not, but hopefully it gives you a starting point.

In general, and for Citrix workloads specifically, running the baseline test on a desktop usually produces a reasonable number of, “well… if we don’t put it on Citrix this is what performance will be like so this is our minimum expectation.”  Sometimes this is helpful.

Once you’ve established the speed baseline you can now look at capacity.

Capacity

After establishing some measurable level of performance within the application(s) you should now be able to test capacity.

If possible, start loading up users or test users running the benchmark.  Eventually, you’ll hit a point where the server fails, either because it ran out of resources or because performance degraded so much that it errors.  If you do it right, you should be able to find the curve where descending performance intersects your “capacity”.

At this point, cost may come into consideration.

Can you afford ANY performance degradation?

If not, then the curve is fairly easy.  At user X we start to see performance degrade, so X-1 is our capacity.

If yes, at what point does performance degrade so much that adding users stops making sense?  Using the “without Citrix this is how it performs on the desktop” can be helpful to establish a minimum level of performance that the Citrix solution cannot cross.

Lastly, if you have network-bound applications, and you have an appropriately designed Citrix solution where the app servers sit immediately beside the network resources on super-high bandwidth, ultra-low latency links, you may never experience performance degradation (lucky you!).  However, you may hit resource constraints in these scenarios.  Eg, although performance of the application is dependent on the network, the application itself uses 1GB of RAM per instance, so you’ll be limited pretty quickly by the amount of RAM you can have in your VM’s.  These cases are generally preferred because the easy answer to increase capacity is *more hardware*, but sometimes you can squeeze out some more users with software like AppSense or WEM.

Spectre on Performance

So what is the impact Spectre has on performance — speed and/or capacity?

If Spectre simply makes a task take longer, but you can fit the same number of tasks on a given VM/Host/etc. then the impact is only on speed. Example: a task that took 5 seconds at 5% CPU utilization now takes 10 seconds at 5% CPU utilization.  Ideally, the capacity should be identical even though the task now takes twice as long.

If Spectre makes things use *more* resources, but the speed is the same, then the impact is only on capacity.  Example: a task that took 5 seconds at 5% CPU utilization still takes 5 seconds but now at 10% CPU utilization.  In this scenario, the speed is identical but your capacity is now halved.

The worst case scenario is if the impact is on both speed and capacity.  In this case, neither is recoverable, though you might be able to make up some speed with newer/faster hardware.

I’ve tested to see the impacts of Spectre in my world.  This world consists of Windows 2008 R2 with XenApp 6.5 on hardware that is 6 years old.  I was also able to procure some newer hardware to measure the impact there as well.

Test Setup

Testing was accomplished by taking 2 identically configured ESXi hosts, applying the VMware ESXi patch with the microcode for the Spectre mitigation to one of the hosts, and enabling the mitigation in the operating system.  I added identical Citrix VM’s to both hosts and enabled user logins to start generating load.

 

Performance needs to be measured at two levels: the Windows/VM level, and the hypervisor/host level.  This is because the hypervisor may pick up additional mitigation work that the operating system does not see, and because Windows 2008 R2 does not accurately measure CPU performance.

Windows/VM Level – Speed

I used ControlUp to measure and capture performance information.  ControlUp is able to capture various metrics, including average logon duration.  This singular metric includes various system interactions: using the network by querying Active Directory, pulling files from network shares, disk to store group policies in a cache, CPU processing of which policies are applicable, and executables being launched in a sequence.  I believe that measuring logons is a good proxy for understanding the performance impact.  So let’s see some numbers:

 

The top 3 results are Spectre enabled machines, the bottom 3 are without the patch.  The results are not good.  We are seeing a 200% speed impact in this metric.

With ControlUp we can drill down further into the impact:

Without Spectre Patch

 

With Spectre Patch

 

The component that took the largest hit is Group Policy.  Again, ControlUp can drill down into this component.

Without Spectre

 

With Spectre

All group policy preference components take a 200% hit.  The Group Policy Preferences functions operate by pulling down an XML file from the SYSVOL store, reading the XML file, then applying whatever the resultant set of policies finds applicable.  In order to trace down further and find more differences, I logged into each type of machine, one with Spectre and one without, and started a Process Monitor trace.  Group Policy is applied via the Group Policy service, which is a separate instance of svchost.exe.  The process can be found via Task Manager:

Setting ProcMon to filter only on that PID, we can begin to evaluate the performance.  I logged in again with procmon capturing the logon.

Spectre Patched system on left, no patch on right

Using ProcessMonitor, we can look at the various “Summaries” to see which particular component may be most affected:


We see that 8.45 seconds is spent on the registry, 0.40 seconds on file actions, 1.04 seconds on the ProcessGroupPolicyExRegistry instruction.

The big ticket item is the time spent with the registry.

So how does it compare to a non-spectre system?

 

We see that 1.97 seconds is spent on the registry, 0.33 seconds on file actions, 0.24 seconds on the ProcessGroupPolicyExRegistry instruction.

Here’s a table showing the results:

So it definitely appears we need to look at the registry actions.  One of the cool things about Procmon is that you can set a filter on your trace, open up the summaries, and see only the objects in the filter.  I set a filter for RegSetValue to see what the impact is for setting values in the registry:

RegSetValue – without spectre applied

 

RegSetValue – with spectre applied

1,079 RegSetValue events and a 4x performance degradation.  Just to test whether it is specific to write events, I changed the procmon filter to filter on the category “Read”.

 

Registry Reads – Spectre applied

 

Registry Reads – Spectre not applied

We see roughly the same ratio of performance degradation, perhaps a little more so.  As a further test, I created a PowerShell script that simply measures creating 1000 registry values and ran it on each system:
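The idea behind the script is straightforward; here is a sketch (not the original), using a hypothetical scratch key:

    # Time the creation of 1000 registry values.
    $key = "HKCU:\Software\RegBench"   # hypothetical scratch key
    New-Item -Path $key -Force | Out-Null

    $elapsed = Measure-Command {
        for ($i = 0; $i -lt 1000; $i++) {
            New-ItemProperty -Path $key -Name "Value$i" -Value $i -PropertyType DWord -Force | Out-Null
        }
    }
    "{0:N0} ms to write 1000 values" -f $elapsed.TotalMilliseconds

    # Clean up; point $key at an HKLM: path to repeat against the larger hive
    Remove-Item -Path $key -Recurse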

Spectre Applied

 

Spectre Not Applied

 

A 2.22x reduction in performance.  But this is writing to HKCU, which is a much smaller file.  What happens if I force a change on the much larger HKLM?

Spectre Applied

 

Spectre Not Applied

 

Wow.  The size of the registry hive makes a big difference in performance: we go from a 2.22x to a 3.42x performance degradation.  So at a granular level, Spectre appears to have a large impact on registry operations, and the larger the hive, the worse the impact.  With this information it makes a lot of sense why Spectre may impact Citrix/RDS more: registry operations occur with high frequency in this world, and logons highlight it even more, as group policy and the registry are deeply intertwined.

This actually brings to mind another metric I can measure.  We have a very large App-V package with an 80MB registry hive that is applied to the SOFTWARE hive when the package is loaded.  The difference in the amount of time (in seconds) loading this package is:

583.7499291 seconds (non-Spectre system)
2398.4593479 seconds (Spectre system)

This goes from 9.7 minutes to 39.9 minutes, another 4x drop in performance, and it would be predominately registry related.  So: another data point that registry operations are hit very hard.
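The measurement itself can be as simple as wrapping the package load in Measure-Command; a sketch assuming the App-V 5 client cmdlets and a hypothetical package path:

    # Time how long a large App-V package takes to add and mount.
    Import-Module AppvClient
    $pkg = "\\contentshare\appv\BigPackage.appv"   # hypothetical package path

    $t = Measure-Command {
        Add-AppvClientPackage -Path $pkg | Mount-AppvClientPackage
    }
    "{0:N7} seconds" -f $t.TotalSeconds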

Windows/VM Level – Capacity

Does Spectre affect the capacity of our Citrix servers?

I recorded the CPU utilization of several VM’s that mirror each other on hosts that mirror each other with a singular difference.  One set had the Spectre mitigation enabled.  I then took their CPU utilization results:

Red = VM with Spectre, Blue = VM without Spectre

By just glancing at the data we can see that the Spectre VM’s had higher peaks and they appear higher more consistently.  Since “spiky” data is difficult to read, I smoothed out the data using a moving average:

Red = VM with Spectre, Blue = VM without Spectre

We can get a better feel for the separation in CPU utilization between Spectre enabled/disabled.  We are seeing clearly higher utilization.

Lastly, I took all of the results for each hour and produced a graph in an additive model:

This graph gives a feel for the impact during peak hours, and helps smooth out the data a bit further.  I believe what I’m seeing with each of these graphs is a performance hit measured by the VM at 25%-35%.

Host Level – Capacity

Measuring from the host level can give us a much more accurate picture of actual resources consumed.  Windows 2008 R2’s counter isn’t very accurate, and if there are lots of little slices, they can add up.
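To pull host-level utilization out of vCenter for this comparison, VMware PowerCLI works well; a sketch with hypothetical host names:

    # Export CPU utilization for a patched/unpatched host pair.
    Connect-VIServer -Server vcenter.example.local
    $vmHosts = Get-VMHost -Name "esxi-spectre.example.local", "esxi-baseline.example.local"

    Get-Stat -Entity $vmHosts -Stat "cpu.usage.average" -Start (Get-Date).AddDays(-7) -IntervalMins 5 |
        Select-Object Entity, Timestamp, Value |
        Export-Csv -Path .\host-cpu.csv -NoTypeInformation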

My apologies for swapping colors.  Raw data:

Blue = Spectre Applied, Red = No Spectre

Very clearly we can see the hosts with Spectre applied consume more CPU resources, even coming close to consuming 100% of the CPU resources on the hosts.  Smoothing out the data using moving averages reveals the gap in performance with more clarity.

 

Showing the “Max CPU” hit per hour gives another visualization of the performance hit.

 

Summary

Windows 2008 R2 Citrix/RDS workloads will be impacted quite heavily.  The impact I’ve been able to measure appears to be focused on registry-related activities.  Applications that store their settings/values/preferences in registry hives, whether the SOFTWARE, SYSTEM, or HKCU hive, will feel a performance impact.  Logon actions on RDS servers are particularly affected because group policies are largely registry-related items, so logon times will increase as reads and writes take longer to process.  CPU utilization is higher at both the Windows VM level and the hypervisor level, up to 40%.  The impact on the speed of applications and other functions is notable although more difficult to measure.  I was able to measure a ~400% degradation in CPU processing for Group Policy Preferences, but perception is a real thing, so going from 100ms to 400ms may not be noticed.  However, on applications that measure response time, we found a performance impact of 165%: what took 1000ms now takes 1650ms.

At the time of this writing, I was only able to quantify the performance impact between two of the different hosts: the Intel Xeon E5-2660 v4 and the Intel Xeon E5-2680.

The Intel Xeon E5-2660 v4 runs at a frequency 26% lower than the older 2680.  In order to overcome this handicap, the processor must have improved at a per-clock rate higher than the 26% frequency loss.  CPUBenchmark gives the two processors single-thread CPU scores of 1616 for the Intel Xeon E5-2660 v4 and 1657 for the Intel Xeon E5-2680.  This puts them close, but even after 4 years the 2680 is marginally faster.  This played out in our testing: the higher-frequency processor is faster.  Performance degradation for the two processors came out as such:

 

Processor / CPU performance hit:
Intel Xeon E5-2680 2.70GHz: 155%
Intel Xeon E5-2660 v4 2.00GHz: 170%

 

This tells us that a higher processor frequency helps mitigate the performance hit.

Keep in mind, these findings are my own.  It’s what I’ve experienced in my environment with the products and operating systems we use.  Newer operating systems are supposed to perform better, but I don’t currently have the ability to test that, so I’m sharing these numbers as an absolute worst-case scenario you might come across.  Ensure you test the impact to understand how your environment will be affected!


Meltdown – Performance Impact Evaluation (Citrix XenApp 6.5)

2018-01-15

Meltdown came out and it’s a vulnerability whose fix may have a performance impact.  Microsoft has stipulated that the impact will be more severe if you:

a) are on an older OS
b) are using an older processor
c) run applications that utilize lots of context switches

Unfortunately, the environment we are operating hits all of these nails on the head.  We are using, I believe, the oldest OS that Microsoft is patching for this.  We are using older processors from 2011-2013, which do not have the PCID optimization (as reported by the SpeculationControl test script), meaning performance is impacted even more.

I’m in a large environment where we have the ability to shuffle VM’s around hosts and put VM’s on specific hosts.  This will allow us to measure the impact of Meltdown in its entirety.  Our clusters are dedicated to Citrix XenApp 6.5 servers.

Looking at our cluster and all of the Citrix XenApp VM’s, we have some VM’s that are ‘application siloed’ — that is they only host a single application and some VM’s that are ‘generic’.

In order to determine the impact I looked at our cluster, summed up the total of each type of VM and then just divided by the number of hosts.  We have 3 different geographical areas that have different VM types and user loads.  I am going to examine each of these workload types across the different geographical areas and see what are the outcomes.

Since Meltdown impacts applications and workloads that have lots of context switches, I used perfmon on each server to record its context switches.
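A sketch of that collection using the relevant perfmon counter:

    # Sample context switches per second, the metric identified as highlighting
    # Meltdown's impact. Add -ComputerName to sample remote servers.
    Get-Counter -Counter "\System\Context Switches/sec" -SampleInterval 5 -MaxSamples 12 |
        ForEach-Object {
            "{0:HH:mm:ss}  {1:N0} context switches/sec" -f $_.Timestamp, $_.CounterSamples[0].CookedValue
        }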

The metric I am interested in is the context switch value, as it has been identified as the element that highlights the impact.  My workloads look like this:

Based on this chart, our largest impact should be Location B, followed by Location A, and lastly Location C.

However, the processors for each location are as follows:

Location A: Intel Xeon 2680
Location B: Intel Xeon 2650 v2
Location C: Intel Xeon 2680

The processors may play a role, as newer-generation processors are supposed to fare better.

In order to test Meltdown in a side-by-side comparison of systems with and without the mitigation, I took two identical hosts and populated them with an identical number and type of servers.  On one host we patched all the VM’s with the mitigation; on the other host we left the VM’s without the patches.

Using the wonderful ControlUp console, we can compare the results in their real-time dashboard.  Unfortunately, the dashboard only gives us a “real time” view.  ControlUp offers a product called “Insights” that can show the historical data, but our organization has not subscribed to it, so I’ve had to track performance by exporting the ControlUp views on an interval and then manually sorting and presenting the data.  Using the Insights view would be much, much faster.

ControlUp has 3 different views I was hoping to explore.  The first is the hosts view, with performance metrics pulled directly from the VMware host.  The second is the computers view, and the last is the sessions view.  The computers and sessions views are metrics pulled directly from the Windows server itself.  However, I am unable to accurately judge the Windows Server metrics because of how the OS measures CPU performance.

Another wonderful thing about ControlUp is we can logically group our VM’s into folders, from there ControlUp can sum the values and present it in an easily digestible presentation.  I created a logical structure like so and populated my VM’s:

 

And then within ControlUp we can “focus” on each “Location” folder and if we select the “Folder” view it presents the sums of the logical view.

HOSTS

In the hosts view we can very quickly see the impact, ranging from 5%-26%.  However, this is a realtime snapshot.  I tracked the average view and examined only “business hours”, as our load is VERY focused on 8AM-4PM; after these hours we see a significant drop in load.  If the servers are not being stressed, the performance seems a lot more even, or the difference is not noticeable (in a cumulative sense).

FOLDERS

Some interesting results.  We are consistently seeing longer login times and application launch times.  2/3 of the environments have lower user counts on the unpatched servers, with the Citrix load balancing making that determination.  The one environment that had more users on the mitigated servers is actually our least loaded in terms of servers per host and users per server, so it’s possible that more users would open up a gap; as of now, it shows that one of our environments can support an equal number of users.

Examining this view and the historical data presented, I encountered oddities: namely, the CPU utilization seemed fairly even more often than not, yet the hosts view showed greater separation between the machines with mitigation and without.  I started to explore why and believe I may have come across this issue previously.

2008R2-era servers have less accurate reporting of CPU utilization.

I believe this is also the API that ControlUp uses with these servers to report on usage.  When I was examining a single server with Process Explorer, I noticed a *minimum* CPU utilization of 6%, but Task Manager and ControlUp would report 0% utilization at various points.  The issue is one of accuracy, summation, and rounding.  The more users on a server, with more processes each consuming ever so slightly more CPU, the greater the inaccuracy.  Example:

Left, Task Manager. Right, Process Explorer

We have servers with hundreds of users utilizing a workflow like this, where they use just a fraction of a percent of CPU resources.  Task Manager and the like will not catch these values and will round *down*.  If you have 100 users running a process that consumes 0.4% CPU, then our inaccuracy is on the 40% scale!  So relying on the VM metrics of ControlUp or Windows itself is not helpful.  Unfortunately, this destroys my ability to capture information within the VM, requiring us to rely solely on the information within VMware.  To be clear, I do NOT believe Windows 2012 R2 and greater OS’s have this discrepancy (although I have not tested), so this issue manifests itself pretty viciously in XenApp 6.5-era servers.  Essentially, if Meltdown is increasing CPU time on processes by a fraction of a percent, Windows will report as if everything is OK and you will probably not notice or think there is an issue!

In order to determine whether this impact is detectable, I took two servers with the same base image, one with the mitigation installed and one without.  I used Process Explorer and “saved” the process list over the course of a few hours.  I ensured the servers had a similar number of users using a specific application that presented a specific workload, so everything was as similar as possible.  In addition, I looked at processes that aren’t configurable (or, since the servers have the same base image, are configured identically).  Here were my results:

Just eyeballing it, it appears that the mitigation has had an impact at the fractional level.  Taking the averages of the winlogon.exe and iexplore.exe processes into account:

 

These numbers may seem small, but once you start considering the number of users, the amount wasted grows dramatically.  For 100 users, winlogon.exe goes from consuming a total of 1.6% to 7.1% of the CPU, an additional load of 5.5%.  The iexplore.exe case is even more egregious, as it spawns 2 processes per user and these averages are per process.  For 100 users, 200 iexplore.exe processes will be spawned.  The iexplore.exe CPU utilization goes from 15.6% to 38.8%, for an additional load of 23.2%.  Adding the mitigation patch can impact our load pretty dramatically, even though it may be under-reported, thus impacting users on a far greater scale by adding more users to servers that don’t actually have the resources Windows reports they have.  For an application like IE, this may just mean greater slowness, depending on the workload/workflow, but if you have an application more sensitive to these performance scenarios, your users may experience slowness even though the servers themselves look OK from most (all?) reporting tools.

Continuing with the HOSTS view, I exported all the data ControlUp collects on a minute interval, added it to Excel, and created pivot tables comparing the hosts with mitigated VM’s against the hosts without.  This is what I saw for Saturday-Sunday; these days are lightly loaded.

This is Location B; the host with unpatched VM’s is in orange and the host with patched VM’s is in blue.  The numbers are pretty identical when CPU utilization on the host is around or below 10%, but once it starts to get loaded, the separation becomes apparent.

Since these datapoints were every minute, I used a moving average of 20 data points (3 per hour) to present the data in a cleaner way:

Looking at the data for this Monday morning, we see the following:

Location B

 

Some interesting events: at 2:00AM the VM’s reboot.  We reboot odd and even servers each day, and in organizing this test I put all the odd VM’s on the blue host and the even VM’s on the orange host.  So the blue line going up at 2:00AM is the odd (patched) VM’s rebooting.  The reboot cycle is staggered over a 90-minute interval (the last VM’s should reboot around 3:30AM).  After the reboot, the servers come up and do some “pre-user” startup work like loading App-V packages, App-V registry pre-staging, etc.  I track the App-V registry pre-staging duration during bootup, and here are my results:

Registry pre-staging in App-V is a light-read, heavy-write exercise.  Registry reading and writing are slow in 2008R2, and our time to execute this task went from 610 seconds to 693 seconds, an overall duration increase of 14%.

Looking at Location A and C

Location A

Location C (under construction)

We can see in Location A that the CPU load is pretty similar until the 20% mark, then the separation starts to ramp up fairly drastically.  For Location C, unfortunately, we are undergoing maintenance on the ‘patched’ VM’s, so I’m showing this data for transparency, but it’s only relevant up to the 14th.  I’ll update this in the next few days when the ‘patched’ VM’s come back online.

Now, I’m going to look at how “Windows” reports CPU performance vs the hosts’ CPU utilization.

Location A

 

Location B

 

The takeaway here is to NOT TRUST the Windows CPU utilization meter (at least on 2008 R2).  The CPU utilization at the VM level does not appear to reflect the load on the hosts.  While the VM’s with and without the patch report nearly identical levels of CPU utilization, at the host level the spread is much more dramatic.

 

Lastly, I am able to pull some other metrics that ControlUp tracks: namely, logon duration and application launch duration.  For each of the locations I got a report of the difference between the two environments.

Location A: Average Application Load Time

Location B: Average Application Load Time

 

Location A: Logon Duration

 

Location B: Logon Duration

 

 

 

In each of the metrics recorded, the experience worsens for our user base, from applications taking longer to launch to logon times increasing.

What does this all mean?

In the end, Meltdown has a significant impact on our Citrix XenApp 6.5 environment.  The perfect storm of older CPU’s, an older OS, and applications with workflows impacted by the patch means our environment is grossly affected.  Location A has a maximum hit (as of today) of 21%, and Location B a spread of 12%.  I had originally predicted that Location B would have the largest impact; however, the newer v2 processors may be playing a role, and the v2 processors may be more efficient than the older 2680.

In the end, the performance hit is not insignificant and reduces our capacity significantly once these patches are deployed.  I plan on adding new articles once I have more data on Meltdown, and further again once we start adding the mitigations against Spectre.

CPU Utilization on the hosts. Orange is a host with VM’s without the Meltdown patches, blue is with the patches.

 


Citrix Storefront – Adventures in customization – Add a help button to your Storefront UI

2017-12-27

This customization is pretty easy.  Add the following to your custom.js file:

Replace “http://www.google.ca” with the URL you want your help screen to be.


Citrix Storefront – Adventures in customization – Default to “Store” view if you have no favourited app’s

2017-12-22

We are in the process of migrating users from Web Interface to Storefront.  We have identified a potential issue: new users are directed to the “Favourites” view, which doesn’t have any applications by default; instead, it has instructions on how to add apps to the Favourites view.

New users might say, “Where did my apps go?!”

The concern is that users may become confused because Web Interface shows all your applications and this new view shows none.  What we want to do to solve this is default to the “Store” view if you have no favourite apps, and default to the Favourites view if you have at least 1 favourite.

 

We can do this.

 

Just add the code above to your custom.js file and the default view will be changed to the store if you have no favorited apps.  Done!


Citrix XenDesktop/XenApp 7.15 – The local host cache in action

2017-11-29

The Citrix Local Host Cache feature, introduced in XenDesktop/XenApp 7.12, has some nuances that may be better demonstrated in realtime than typed out in text.  I will do both in this article: a step-by-step of what happens during a network or site database outage, and a realtime video highlighting the feature in action.  There are many other blogs and articles that do a great job going into the step-by-step details of the feature, but I find seeing it in action very informative.

To view a video of this process, scroll to the very end, or click here.

To start, I’ve created a PowerShell script that simulates a user querying the broker for a list of applications.

Columns are the time of the response, the payload size received (in bytes), and the total time to respond in milliseconds.
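The script is essentially a timed polling loop; a simplified sketch (not the original) with a hypothetical broker URL, printing the same three columns:

    # Poll the broker once per second; log time, payload size, and response time.
    $url = "http://xd-broker.example.local/scripts/wpnbr.dll"   # hypothetical endpoint
    while ($true) {
        $sw = [System.Diagnostics.Stopwatch]::StartNew()
        try {
            $r = Invoke-WebRequest -Uri $url -UseBasicParsing
            $sw.Stop()
            "{0:HH:mm:ss}  {1,8} bytes  {2,8:N2} ms" -f (Get-Date), $r.RawContentLength, $sw.Elapsed.TotalMilliseconds
        } catch {
            $sw.Stop()
            "{0:HH:mm:ss}  request failed after {1:N2} ms" -f (Get-Date), $sw.Elapsed.TotalMilliseconds
        }
        Start-Sleep -Seconds 1
    }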

As we’re querying the broker, the broker is reaching out to the database and then responding to the user with the information requested.

 

Periodically, the Citrix Config Synchronizer Service will check to ensure the local host cache database is in sync with the site database. This is an event that occurs every 2 minutes during normal operation.

To show the network connection failing, I am going to set up a continuous ping to the database server.

To simulate a network failure, I’m going to use the tool clumsy to drop all packets to and from the database server.

Clicking start in clumsy immediately stops the simulated user from getting their list of applications.

 

And the pings now time out.

The broker has a 20-second timeout, after which it will respond to requests with what it thinks is the current status.  The first timed-out request receives a response of “working”, and thereafter a response of “pending failed” is returned.

Around 24 seconds in, the broker has noticed the database has failed and has logged its first event, 1201: “The connection between the Citrix Broker Service and the database has been lost”.

One minute thirty-three seconds into the failure, other Citrix services are now reporting they cannot contact the database.

Just shy of 2 minutes in, the broker service has exceeded its timeout for contacting the database and is in the process of switching to the Local Host Cache.  It stops the “primary broker”.

And then the Citrix High Availability Service becomes active, brokering user requests.

In my simulation, the amount of time it takes the user to receive a response from the LHC is a little faster than from the site database.  The LHC response time is 80-90 milliseconds, whereas the response time for a request that includes the site database is 90-100.  This allows us to visually distinguish the two modes of operation.

Top, site database response times – middle is the outage – bottom is LHC response times

How long does it take to “fall back” to the database when connectivity is restored?

I “Stopped” clumsy to restore our network connection and started a timer.

 

We can see the ping responses from the database immediately to verify our connection is back.

 

Almost immediately, all services have noticed that they have connectivity again, including the broker service.

However, we do not fall back immediately.

At one minute thirty-three seconds, the broker has switched back to the primary broker and all services have been restored.

To watch a video of this all in action, please view here:

 

 


Citrix Storefront – Adventures in customization – Define a custom resolution for a specific application

2017-11-14

Currently, Storefront does not grant the ability to define specific resolutions for individual applications.  In order to configure the resolution, Citrix recommends you modify the default.ica file.  This is terrible!  If you had specific applications that required specific resolutions, what are you to do?  Direct users to a variety of stores depending on the resolution required?!

Fortunately, again, we can extend StoreFront to make it so we can configure custom resolutions for different applications on the same store.  The solution is a Storefront extension I’ve already written.

The steps to set this up:

  1. Download the Storefront_CustomizationLaunch.dll.
  2. Copy the file to C:\inetpub\wwwroot\Citrix\Store\bin
  3. Edit the web.config in the Store directory and enable the extension
  4. We need to enable Header pass-through for DesiredHRES, DesiredVRES, and TWIMode in the “C:\inetpub\wwwroot\Citrix\StoreWeb\web.config” file:
  5. Lastly, add the following to the custom.js file in your StoreWeb/custom folder:
  6. And enjoy the results!  🙂


Citrix Storefront – Adventures in customization – Prepopulate Explicit Logon Credentials

2017-10-31

Citrix Storefront allows you to prepopulate the credentials for your Explicit Logon.  The explicit logon screen is generally seen here:

And you can prepopulate the Username/Password fields.  If you don’t want to prepopulate the password, that’s fine too.  There are 3 properties and none are required: Username, Password, and Domain.  In order to prepopulate, you must pass your credentials through to Storefront somehow, either as a cookie, a header, or a URL search query.  I will demo the URL search query since I already have the code for pulling the parameters.  You must have “Explicit Authentication” enabled, aka “User name and Password”:

Put the following code into your custom.js file:

The url to query is:

And the result:


Citrix Storefront – Adventures in customization – Login via credentials in URL search query

2017-10-30

If you use a 3rd-party service to connect to your Citrix Storefront environment, you may want to “pass through” credentials without using domain authentication.  This post illustrates how you can log in to your Storefront environment using nothing more than a URL with your credentials embedded in it.  To enable this functionality, this code must be in your custom.js file.

You MUST have HTTP Basic enabled as an authentication method on your Citrix Storefront Store.

The URL to login would look like this:

Put it all together:
