Blog

Citrix PVS Image Management Design Challenges – Case Study

2018-10-29

Challenge:

Different applications have differing levels of operating system support.  A vendor might only support their app on a specific operating system, or on multiple /older/ operating systems, or the organization hasn’t updated to a new version that supports the new OS; whether because the new version has an upgrade cost the org cannot afford at this time, because the people who could assist with the upgrade are tied up in other projects, or because the org has lowered the priority of the app due to low user count and no impact to the app’s functionality outside of newer OS support.  Whatever the situation, inevitably, someone will come to the Citrix team and ask that these applications be made available.  And, in all honesty, Citrix has made this entirely possible with the 7.1x product.  The LTSR of XenApp/XenDesktop supports 2008R2 for the next 4 years, and 2012R2 and 2016 will continue to be supported by the latest release until the next LTSR, where 2012R2 will probably receive another 5 year reprieve from the Citrix side.  You probably won’t have Microsoft support, but that doesn’t stop organizations from continuing to run Server 2003 for apps that can’t move beyond that platform.

What does this mean:

Application support requirements now necessitate multiple OS’s being available through Citrix.  The goal is to have each app on the highest supported OS.  For instance, if an app is supported on 2008R2 and 2012R2, have the app run on 2012R2.  By December 2018, this means supplying the organization the opportunity and flexibility to have their app on a choice of 5 different operating systems:
2008R2
2012R2
2016
2019
Windows 10 ERS

All on the XenApp 7.1x platform.

By providing these choices, we should be ensuring maximum application compatibility and support from the vendor, and future proofing.

Current State:

The original design of our environment was a single image, using layering technologies to add software without causing compatibility issues.  The technology of choice was AppV (version 4.6 as of the original design).

It was simple, it would be effective, and it would be easy to manage.

Using machine group memberships, AppV layers could be customized per machine.  For instance, we could have pools of servers with different personalities.  We had a “Generic” pool upon which 200 or so AppV applications would reside, and if a new application request came down, instead of silo’ing a XenApp VM for 3-10 users, we could put them on the generic pool and grow the generic pool if required.  This is very cost effective, reduces overhead on the hosts by reducing the number of VM’s, reduces the number of uniquely sized VM’s, and overall makes management easier.  Utilizing AppV, it became trivial to have multiple applications, even the exact same versions of an application with different configurations, available for users to run from the same VM.  However, we have large user-based applications with specific performance goals and monitored metrics that are easier to manage when they run in their own silo.  With AppV we could still use that same image but apply only the specific AppV packages, keeping those machines fully unique for their users.  And from there it again remained trivial to manage the single image.

Utilizing this model has kept our image count low while maintaining a high degree of flexibility.  We have servers targeted for specific and unique purposes, or servers hosting applications that are more general in nature.  By reducing the number of different VM configurations running on a variety of different hosts, we are able to fairly accurately predict our performance and thus plan accordingly for sizing.

Are we really on one image?

No.  We have 16 of them.

Although one image was the original design goal, required OS features crippled our ability to maintain a low image count.  The worst offenders were applications with specific Internet Explorer version requirements.  Being unable to run multiple versions of IE on a single image meant we had to split our single image.  We ended up splitting it into 3: images with IE9, IE10, and IE11.  To top it off, when we originally went through the design with AppV, AppV 5.0 had just come out and had crippling limitations.  AppV 4.6 was still the preferred platform while AppV 5 matured.  Eventually, we had a couple of applications that would not work on 4.6, either due to hitting the package size limit (2GB IIRC) or due to crashing when running within the bubble.  AppV 5.0 at the time was just plain broken and a huge number of our apps outright failed on it.  Due to these limitations, our decision was to create new images, branching the 3 images even further to a total of 8.

So how did you get to 16?

We have a completely separate test environment.  Once an image was created, it was duplicated for test.  The design was to ensure any application upgrades, component upgrades, Windows Updates, policy changes, troubleshooting, etc would be done in the test environment first before being implemented in prod.  This has worked fairly well, however we encountered something that caused our test environment to not be reflective of production: TIME.

One of the challenges we have faced was testing of applications, components, or tweaks lasting months.  Testing a fix/feature/upgrade usually followed this path:

The challenges in this process occur when the development of fixes/upgrades/enhancements spans multiple image “baking” sessions over a period of weeks or months.  It seems no matter how good you think your documentation is, if the development spans 10s to 100s of tweaks over a sufficient amount of time, something important will be lost in the chain.  The process of promoting the change into production followed this path:

Couldn’t you just copy test into production?

In the original design, test and production were wholly separate and lived apart.  The process of documenting and pushing changes from test into prod actually worked quite well until we started encountering the lengthy development/test time scales, and the numerous /different/ changes that occur across the different images.  Copying from test into production was done a few times, with painful results each time.  The original image and design were not done to accommodate such a scenario, so the items that needed to be updated from test to prod were numerous and onerous.  Scripts, registry, files, group policies…  during some of these copies outages were incurred because, again, something was inevitably missed.  At this point, trying to flush out all the different changes that would have to be done across 8 images and then maintaining that list was fairly daunting for our team.  Having outages is a really, really bad thing.

What’s wrong with 16 images?

Our images have branched from each other in various forms going back to that first image created around 2011.  These images all have lineages 7 years old, and with age comes problems.  We have been finding that these images have picked up corruption in some form or another.  Whether it’s a registry key that’s lost its ACL or is completely corrupted, Windows Updates that fail to install, a file whose content is now garbage as opposed to clean text, or the complete inability to uninstall or upgrade some software packages.  Thus far, none of these problems have been unsolvable, but the fact they are cropping up with more regularity across all of our images is very worrisome.  These issues are occurring more often, and the time taken to resolve them is getting excessive.  When we looked at just updating these images for XenDesktop 7 we found we were unable to do so because XenApp 6.5 would not uninstall from some of them!

Last word on “Current State”

Being a large environment, we could have several applications/components simultaneously under this test/development process, compounding the number of changes.

Because of the separation of test and production, with the number of changes made during the development stage targeting test servers/components/configs, something inevitably gets lost/forgotten/missed.  Copying test images to production is untenable in this process.  There are so many hard-coded pieces embedded within the test images or configurations that need to be updated that, when we *have* had to do a test-to-prod copy, things break and a lot of effort goes into tracking down these additional hard-coded changes.

All of these issues exist within just a single OS, with a single version of Citrix.  The future will be multi-OS with multiple versions of the Citrix VDA.  Our current design is unfeasible if the various OS’s start encountering similar, unique image requirements.  And it may be naive to think we can avoid it.  But I believe we can put forth a design to make this more manageable, more predictable and more stable.


Meltdown + Spectre – Performance Impact Analysis (Take 2)

2018-09-14

I want to start this off by thanking ControlUp, LoginVSI and Ryan Ververs-Bijkerk for his assistance in helping me with this post.

Building on my last evaluation of the performance impact of Meltdown and Spectre, I was graciously given a trial of the LoginVSI software product used to simulate user loads and ControlUp’s cloud based analytic tool, ControlUp Insights.

This analysis takes into consideration the differences between operating systems and uses the latest Dell server hardware platform with an Intel Gold 6150 as the processor at its heart.  This processor contains the full PCID instructions to give the best performance (as of today) with mitigations enabled.  However, only Server 2012R2 and Server 2016 can take advantage of these hardware features.

Test Setup

This test was set up on 4 hosts.  2 of the hosts had VM’s with all the mitigations enabled, and 2 hosts had all of the mitigation features disabled.  I tested live production workloads and simulated user loads from LoginVSI.  The live production workloads were run on XenApp 6.5 on 2008R2, and the simulated workloads were on XenApp 7.15CU2 with 2008R2, 2012R2 and 2016.

Odd host numbers had the mitigation disabled, even host numbers had the mitigation enabled.

I sorted my testing logically in ControlUp by folder.

Real World Production results

The ControlUp Insights cloud product produced graphs and results that were easy and quick to interpret.  These results are for XenApp 6.5, Server 2008R2.

Hosts View

CPU Utilization

The mitigation-disabled “Host 1” consistently used less CPU than the mitigation-enabled “Host 2”.  The biggest spread in CPU was ~20% on the Intel Gold 6150’s between mitigation enabled and disabled.

IO Utilization

Another interesting result was that IO utilization increased by an average of 100 IOPS for mitigation-enabled VM’s.  This means the Meltdown/Spectre mitigations also tax the storage subsystem.  This averaged out to a consistent 12% hit in performance.

User Experience

The logon duration of the VM’s increased 75%, from an average of 8 seconds on a mitigation-disabled VM to 14 seconds on a mitigation-enabled VM.  The biggest jumps in the sub-metrics were Logon Duration (Other), going from 3s to 5s, and Group Policy time, going from 6s to 8s.

Mitigation Disabled

Mitigation Enabled

For applications we have that measure “user interactivity” the reduction in the user experience was 18%.  This meant that an action on a mitigation-enabled VM took an average of 1180ms vs 990ms on a mitigation-disabled VM when measuring actions within the UI.

Honestly, I wish I had had ControlUp Insights earlier when I did my original piece; it provides much more detail in terms of tracking additional metrics and presents it much more cleanly than I did.  Also, once the information was available it was super quick and easy to look at and compare the various types of results.

Simulated Results

LoginVSI was gracious enough to grant me licenses to their software for this testing.  Their software provides simulated user actions, including pauses for coffee and chatting, between work like typing/sending emails, reading Word or PDF documents, or reviewing presentations.  The suite of software tested by the users tends to be major applications produced by major corporations with experience producing software.  It is not representative of the applications that could be impacted the most by Spectre/Meltdown (generally, applications that are registry-event heavy).  Regardless, it is interesting to test with these simulated users, as the workload produced by them does fall within the spectrum of “real world”.  As with everything, your mileage will vary and it is important to test and record your before and after impacts.  ControlUp with Insights does an incredible job of this: you can easily compare different periods in time to measure the impacts, or just rely on the machine learning suggestions of their virtual experts to properly size your environment.

Since our production workloads are Windows Server 2008R2 based, I took advantage of the LoginVSI license to test all three available server operating systems: 2008R2, 2012R2, and 2016.  Since newer operating systems are supposed to enable performance-enhancing hardware features that can reduce the impact of these vulnerabilities, I was curious as to *how much*.  I now have this information.

I tested user loads of 500, 300, and 100 users across 2 hosts.  I tested one host with Spectre/Meltdown applied and one without.  Each host ran 15 VM’s, with each VM having 30GB RAM and 6 vCPU, for a CPU oversubscription of 2.5:1.  The host spec was a Dell PowerEdge M640 with Intel 6150 Gold processors and 512GB of memory.

2016 – Hosts View

500 LoginVSI users with this workload on Server 2016 pegged the hosts’ CPU at 100% on both the Meltdown/Spectre enabled and disabled hosts.  We can still see the gap in CPU utilization between the two, with the Meltdown/Spectre-enabled host sitting higher.

300 LoginVSI users with this workload, on Server 2016 we see the gap is narrow but still visible.

100 LoginVSI users with this workload, on Server 2016 we see the gap is barely visible, it looks even.

2012R2 – Hosts View

500 LoginVSI users in this workload on Server 2012 R2.  There definitely appears to be a much larger gap between the meltdown enabled and disabled hosts.  And Server 2012R2 non-mitigated doesn’t cap out like Server 2016.

300 LoginVSI users in this workload on Server 2012R2.  The separation between enabled and disabled is still very prominent.

100 LoginVSI users in this workload on Server 2012R2.  Again, the separation is noticeable but appears narrower with lighter loads.

2008R2 – Hosts View

500 LoginVSI users in this workload on Server 2008R2.  There is noticeable additional CPU load on the Meltdown/Spectre host.  More interesting is that overall CPU utilization appears lower than 2012R2 or 2016.

300 LoginVSI users in this workload on Server 2008R2.  The separation between enabled and disabled is still very prominent.

100 LoginVSI users in this workload on Server 2008R2.  I only captured one run and the low utilization makes the difference barely noticeable.

Some interesting results for sure.  I took the data and put it into a pivot table to highlight the CPU differences for each workload against each operating system.

This chart highlights the difference in CPU percentage between mitigation-enabled and disabled systems.  The raw data:

Again, interesting results.  2008R2 seems to have the largest average CPU separation, hitting 14%, followed by 2012R2 at 11%, and then 2016 with a difference of 4%.

One of the things about these results is that they highlight the “headroom” of the operating systems.  2008R2 actually consumes less CPU and so it has more room for separation between the 3 tiers.  On 2016, there is so much time where the CPU was pegged at 100% for both types of host that the difference works out to “0%”.  So although the smaller number on Server 2016 may lead you to believe it’s better, it’s actually not.

This shows it a little more clearly.  With mitigations *enabled*, Server 2008R2 can do 500 users at a lower average CPU load than 2016 can do 300 users with mitigations *disabled*.

From the get-go, Server 2016 appears to consume 2x more CPU than Server 2008R2 in all non-capped scenarios with Server 2012 somewhere in between.

When we compare the operating systems against the different user counts we see the impact the operating system choice has on resources. 

Mitigation Disabled:

100 Users
300 Users
500 Users

Mitigation Enabled:

100 Users
300 Users
500 Users

Final Word

Microsoft stated that they expected to see less of an impact from the Spectre/Meltdown mitigations on newer operating systems.  Indeed, this turns out to be the case.  However, the additional resource cost of the newer operating systems is actually *more* than running 2008R2 or 2012R2 with mitigations enabled.  So if your environment is sized for running Server 2016, you probably have infrastructure that has already been spec’ed for the much heavier OS anyway.  If your infrastructure has been spec’ed for the older OS, then you will see a larger impact.  However, if you’ve spec’ed for the larger OS (say, for a migration activity) but are running your older OS’s on that hardware, you will see an impact, but it will be less than when you go live with 2016.

Previously I had stated that there are two different, important performance mechanisms to consider; capacity and how fast work actually gets done.  All of these simulated measurements are about capacity.  I hope to see how speed is impacted between the OS’s, but that may have to wait for a future posting. 

Tabulating the LoginVSI simulated results without ControlUp Insights has taken me weeks of properly formatting the results.  I was able to use a trial of ControlUp Insights to look at the real world impact of our existing applications and workloads.  Had my organization purchased Insights, I would have had this post up a long time ago with even more data, looking at things like the storage subsystems.  Hopefully we acquire this product in the future, and if you want to save yourself and your organization time, energy and effort getting precise, accurate data that can be compared against scenarios you create: get ControlUp Insights.


I’m going to harp on this, but YOUR WORKLOAD MATTERS MORE than these simulated results.  During this exercise, I was able to determine with ControlUp Insights that one of our applications is so light that we can host 1,000 users from a single host where that host struggled with 200 LoginVSI users.  So WORKLOAD MATTERS.  Just something to keep in mind when reviewing these results.  LoginVSI produces results that can serve as a proxy for what you can expect if you can properly relate these results to your existing workload.  LoginVSI also offers the capability to produce custom workloads tailored to your specific environment or applications so you can gauge impact with much more precision. 


User Profile Manager – Unavoidable Delays

2018-06-30

I’ve been exploring optimizing logon times and noticed “User Profile Service” always showed up for 1-3 seconds.  I asked why and began my investigation.

The first thing I needed to do was separate the “User Profile Service” into its own process.  It is originally configured to share the same process as other services, which makes procmon’ing difficult.

Making this change is easy:
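As a sketch, one way to do it (assuming the service name ProfSvc) is from an elevated prompt, followed by a service restart or reboot:

```powershell
# Minimal sketch: reconfigure the User Profile Service ('ProfSvc') to run in
# its own svchost.exe instance so it gets a dedicated PID to trace.
sc.exe config ProfSvc type= own

# To revert to the default shared host later:
# sc.exe config ProfSvc type= share
```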

Now that the User Profile Service is running in its own process, we can use Process Monitor to target that PID.

I logged onto the RDS server with a different account and started my procmon trace.  I then logged into the server:

One of the beautiful things about a video like this is we can start to go through frame-by-frame if needed to observe the exact events that are occurring.  Process Monitor also gives us a good overview of what’s happening with the “Process Activity” view:

9,445 file events, 299,668 registry events.  Registry, by far, has the most events occurring on it.  And so we investigate:

  1. On new logins the registry hive is copied from Default User to your profile directory, the hive is mounted, and then security permissions are set.

    Setting the initial permissions of the user hive began at 2:14:46.3208182 and finished at 2:14:46.4414112, spanning a total of 121 milliseconds.  Pretty quick, but to minimize logon duration it’s worth examining each key in the Default User hive and ensuring you do not have any unnecessary keys.  Each of these keys will have its permissions evaluated and modified.
  2. The Profile Notification system now kicks off.

    The user profile server now goes through each “ProfileNotification” and, if it’s applicable, executes whatever action the module is responsible for.  In my screenshot we can see the User Profile Service alerts the “WSE”.  Each key actually contains the friendly name, giving you a hint about its role:

    It also appears we can measure the duration of each module by the “RegOpenKey” and “RegCloseKey” events tied to that module.

    In my procmon log, the WSE took 512ms, the next module “WinBio” took 1ms, etc.  The big time munchers for my system were:
    WSE: 512ms
    SyncCenter: 260ms
    SHACCT: 14ms
    SettingProfileHandler: 4ms
    GPSvc: 59ms
    GamesUX: 60ms
    DefaultAssociationsProfileHandler: 4450ms (!)
  3. In the previous screenshot we can see the ProfileNotification kicks off two events that it runs through its list of modules: Create and Load.  Load takes 153ms in total, so Create is what is triggering our delay.
  4. DefaultAssociationsProfileHandler consumes the majority of the User Profile Service time.  What the heck is it doing?  It appears the Default Association Profile Handler is responsible for creating the associations between several different components and your ability to customize them.  It associates (that I can see):
    ApplicationToasts (eg, popup notifications)
    RegisteredApplications
    File Extensions
    DefaultPrograms
    UrlAssociations
    The GPO “Set Default Associations via XML file” is processed and the above is re-run with the XML file values.
  5. Do we need these associations?

    Honestly…   Maybe.

    However, does this need to be *blocking* the login process?  Probably not.  This could be an option to run asynchronously, with you, as the admin, gambling that any required associations will be set before the user gets the desktop/app…  Or, if you have applications that are entirely single purpose and simply read and write to a database somewhere, then this is superfluous.

  6. Can we disable it?

    Yes…

    But I’m on the fence about whether this is a good idea.  To disable it, I’ve found deleting the “DefaultAssociationsProfileHandler” key does work; associations are skipped and we log on 1-4 seconds faster.  However, launching a file directly, or a shortcut with a URL handler, will prompt you to choose your default program (as should be expected).

I’m exploring this idea: deleting the key entirely and using SetUserFTA to set associations.
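As a rough sketch, the removal looks like this (the ProfileNotification key path is from my environment; verify it on your build, and export the key first so it can be restored):

```powershell
# Hedged sketch: back up and remove the DefaultAssociationsProfileHandler
# module so the User Profile Service skips default-association processing
# at logon.  Verify the key path before deleting anything.
$key = 'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion\ProfileNotification\DefaultAssociationsProfileHandler'

# Export the key first so it can be restored if associations break.
reg.exe export 'HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\ProfileNotification\DefaultAssociationsProfileHandler' C:\Temp\DefaultAssociationsProfileHandler.reg /y

# Remove the module; subsequent logons skip the association rebuild.
Remove-Item -Path $key -Recurse
```

File-type associations would then have to be handled some other way, for example with a tool like SetUserFTA at logon.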

We have ~400 App-V applications that write/overwrite approximately 800 different registered applications and file extensions into our registry hive (we publish globally — this puts them there).  The reason I started this investigation is that some of our servers with lots of AppV applications were reporting longer UserProfileService times, and tying it all together, this one module in the User Profile Service appears to be the culprit.  And with Spectre increasing the duration of registry operations by 400%, this became noticeable really quickly in our testing.

Lastly, time is still being consumed on RDS and server platforms by the infuriating garbage services like GamesUX (“Games Explorer”).  It just tweaks a nerve a little bit when I see time being consumed on wasteful processes.


Citrix Provisioning Service – Network Service Starting/Stopping services remotely

2018-05-02

Citrix Provisioning Services has a feature within the “Provisioning Services Console” that allows you to stop/restart/start the streaming service on another server:

 

This feature worked with Server 2008R2 but with 2012R2 and greater it stopped working.  Citrix partially identified the issue here:

 

I was exploring starting and stopping the streaming service on other PVS servers from the Console and found this information was incorrect.  Adding the NetworkService does NOT enable the streaming service to be stopped/started/restarted from other machines.  The reason is that NETWORKSERVICE is a LOCAL account on the machine itself.  When it reaches out and communicates with another system, it is translated into a proper SID, which matches the machine account.  Since that SID coming across the wire does not have access to the service, you get a failure.

In order to fix this properly, we can either add the machine account permissions for each PVS server to each service, OR we can add all machine accounts into a security group and grant that group permission to manipulate the service on each PVS server.

I created a PowerShell script to easily add a group, user, or machine account to the Streaming Service.  It will also list all the permissions:

An example of adding a group to the service’s permissions:

And now we can start the service remotely:

 

In order to get this working entirely I recommend the following steps:

  1. Create a Group (eg, “CTX.Servers.ProvisioningServiceServer”)
  2. Add all the PVS Machine Accounts into that group
  3. Reboot your PVS server to gain that group membership token
  4. Run the powershell script on each machine to add the group permission to the streaming service:
  5. Done!

And now the script:
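In outline, the script does something like the following (a minimal sketch, not the full script; the service name 'StreamService' and the access-rights string are assumptions from my environment and worth double-checking):

```powershell
# Sketch: list the streaming service's security descriptor and append an ACE
# that lets a domain group query/start/stop it.  Run elevated on each PVS server.
$serviceName = 'StreamService'                                    # verify with Get-Service
$accountName = 'DOMAIN\CTX.Servers.ProvisioningServiceServer'     # group, user, or machine account

# Translate the account to a SID for the SDDL string.
$sid = (New-Object System.Security.Principal.NTAccount($accountName)).Translate([System.Security.Principal.SecurityIdentifier]).Value

# Current security descriptor (this doubles as the "list the permissions" step).
$sddl = (sc.exe sdshow $serviceName | Where-Object { $_ -match '^D:' }).Trim()
Write-Output "Current SDDL: $sddl"

# Allow ACE granting query config/status, start, stop, and pause/continue.
$ace = "(A;;CCLCSWRPWPDTLOCRRC;;;$sid)"

# Insert the ACE before the SACL portion (S:) if one exists, otherwise append.
$saclPos = $sddl.IndexOf('S:')
$newSddl = if ($saclPos -ge 0) { $sddl.Insert($saclPos, $ace) } else { $sddl + $ace }

sc.exe sdset $serviceName $newSddl
```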

 


Meltdown + Spectre – Performance Analysis

2018-04-30

Meltdown and Spectre (variant 2) are two vulnerabilities that came out at the same time; however, they are vastly different.  Patches for both were released extremely quickly for Microsoft OS’s, but because of a variety of issues with Spectre, only Meltdown could truly be mitigated at first.  The Spectre (variant 2) mitigation had a problematic release, causing numerous issues for whoever installed the fix, to the point that it had to be recalled and the release delayed by weeks.  However, around March/April 2018, the Spectre patch was finalized and the microcode released.

Threat on Performance

Spectre (variant 2), of the two, threatened to degrade performance fairly drastically.  Initial benchmarks mentioned that storage was hit particularly hard.  Microsoft commented that server OS’s could be hit particularly hard.  Even worse, older operating systems do not support the CPU features (PCID) that could reduce the performance impact.  Older OS’s also suffer more due to designs (at the time) that involved running more code in kernel mode (font rendering was singled out as an example of one of these design decisions) than newer OS’s.

As with most things on my blog I am particularly interested in the impact against Citrix/Remote Desktop Services type of workloads.  I wanted to test ON/OFF workloads of the mitigation impacts.

Setup

My setup consists of two ESXi (version 6.0) hosts with identical VM’s on each, hosting identical applications.  I was able to set up 4 of these pairs of hosts.  Each pair of hosts has identical processors.  The one big notable difference is that one host of each pair has the Spectre and Meltdown patch applied to the ESXi hypervisor.

The operating system of all the VM’s is Windows Server 2008 R2.  Applications are published from Citrix XenApp 6.5.

 

 

This is simply a snapshot of a single point in time to show the metrics of these systems.

Performance Considerations

Performance within a Citrix XenApp farm can be described in two ways.  Capacity and speed.

Speed

Generally, one would test for a “best case” test of the speed aspect of your applications performance.

A simplified view of this is “how fast can the app do X task?”

This task can be anything.  I’ve seen it measured by an automated script flipping through tabs of an application, where each tab pulled data from a database – rendered it – then moved on to the next tab.  The total time to execute these tasks amounted to a number that they used to baseline the performance of this application.

I’ve seen it measured as simply opening an excel document with macros and lots of formulas that pull data and perform calculations and measuring that duration.

The point of each exercise is to generate a baseline that both the app team and the Citrix team can agree to.  I’ve almost never had the baseline equal “real world” workloads, typically the test is an exaggeration of the actual workflow of users (eg, the test exaggerates CPU utilization).  Sometimes this is communicated and understood, other times not, but hopefully it gives you a starting point.

In general, and for Citrix workloads specifically, running the baseline test on a desktop usually produces a reasonable number of, “well… if we don’t put it on Citrix this is what performance will be like so this is our minimum expectation.”  Sometimes this is helpful.

Once you’ve established the speed baseline you can now look at capacity.

Capacity

After establishing some measurable level of performance within the application(s) you should now be able to test capacity.

If possible, start loading up users or test users running the benchmark.  Eventually, you’ll hit a point where the server fails — either because it ran out of resources, performance degraded so much it errors, etc.  If you do it right, you should be able to start to find the curve that intersects descending performance with your “capacity”.

At this point, cost may come into consideration.

Can you afford ANY performance degradation?

If not, then the curve is fairly easy: at user count X we start to see performance degrade, so X-1 is our capacity.

If yes, at what point does performance degrade so much that adding users stops making sense?  Using the “without Citrix this is how it performs on the desktop” can be helpful to establish a minimum level of performance that the Citrix solution cannot cross.

Lastly, if you have network-bound applications, and you have an appropriately designed Citrix solution where the app servers sit immediately beside the network resources on super-high-bandwidth, ultra-low-latency links, you may never experience performance degradation (lucky you!).  However, you may hit resource constraints in these scenarios.  Eg, although performance of the application is dependent on the network, the application itself uses 1GB of RAM per instance — you’ll be limited pretty quickly by the amount of RAM you can have in your VM’s.  These cases are generally preferred because the easy answer to increase capacity is *more hardware*, but sometimes you can squeeze in some more users with software like AppSense or WEM.

Spectre on Performance

So what is the impact Spectre has on performance — speed and/or capacity?

If Spectre simply makes a task take longer, but you can fit the same number of tasks on a given VM/Host/etc. then the impact is only on speed. Example: a task that took 5 seconds at 5% CPU utilization now takes 10 seconds at 5% CPU utilization.  Ideally, the capacity should be identical even though the task now takes twice as long.

If Spectre makes things use *more* resources, but the speed is the same, then the impact is only on capacity.  Example: a task that took 5 seconds at 5% CPU utilization now takes 5 seconds at 10% CPU utilization.  In this scenario, the speed should be identical but your capacity is now halved.

The worst case scenario is if the impact is on both, speed and capacity.  In this case, neither are recoverable except you might be able to make up some speed with newer/faster hardware.

I’ve tested to see the impacts of Spectre in my world.  This world consists of Windows 2008 R2 with XenApp 6.5 on hardware that is 6 years old.  I was also able to procure some newer hardware to measure the impact there as well.

Test Setup

Testing was accomplished by taking 2 identically configured ESXi hosts, applying the VMWare ESXi patch with the microcode for Spectre mitigation to one of the hosts, and enabling it in the operating system.  I added identical Citrix VM’s to both hosts and enabled user logins to start generating load.

 

Performance needs to be measured at two levels: at the Windows/VM level, and at the hypervisor/host level.  This is because the hypervisor may pick up the additional work required for the mitigation that the operating system does not see, and also because Windows 2008 R2 does not accurately measure CPU performance.

Windows/VM Level – Speed

I used ControlUp to measure and capture performance information.  ControlUp is able to capture various metrics including average logon duration.  This singular metric includes various system interactions, from using the network by querying Active Directory, pulling files from network shares, disk to store group policies in a cache, CPU processing which policies are applicable, and executables being launched in a sequence.  I believe that measuring logons is a good proxy for understanding the performance impact.  So lets see some numbers:

 

The top 3 results are Spectre enabled machines, the bottom 3 are without the patch.  The results are not good.  We are seeing a 200% speed impact in this metric.

With ControlUp we can drill down further into the impact:

Without Spectre Patch

 

With Spectre Patch

 

The component that took the largest hit is Group Policy.  Again, ControlUp can drill down into this component.

Without Spectre

 

With Spectre

All group policy preference components take a 200% hit.  The Group Policy Preferences functions operate by pulling down an XML file from the SYSVOL store, reading the XML file, then applying whatever the resultant set of policies finds applicable.  In order to trace down further and find more differences, I logged into each type of machine, one with Spectre and one without, and started a Process Monitor trace.  Group Policy is applied via the Group Policy service, which is a separate instance of svchost.exe.  The process can be found via Task Manager:
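As an alternative to hunting through Task Manager, the PID can also be pulled with a quick query (gpsvc is the service name of the Group Policy Client; on older PowerShell, Get-WmiObject works the same way):

```powershell
# Ask WMI which svchost.exe instance is hosting the Group Policy Client
# service so the Process Monitor filter can be set on that PID.
Get-CimInstance Win32_Service -Filter "Name='gpsvc'" |
    Select-Object Name, ProcessId, State
```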

Setting ProcMon to filter only on that PID we can begin to evaluate the performance.  I relogged in with procmon capturing the logon.

Spectre Patched system on left, no patch on right

Using ProcessMonitor, we can look at the various “Summaries” to see which particular component may be most affected:


We see that 8.45 seconds is spent on the registry, 0.40 seconds on file actions, 1.04 seconds on the ProcessGroupPolicyExRegistry instruction.

The big ticket item is the time spent with the registry.

So how does it compare to a non-spectre system?

 

We see that 1.97 seconds is spent on the registry, 0.33 seconds on file actions, 0.24 seconds on the ProcessGroupPolicyExRegistry instruction.

Here’s a table showing the results:

So it definitely appears we need to look at the registry actions.  One of the cool things about Procmon is you can set a filter on your trace and open up the summaries and it will show you only the objects in the filter.  I set a filter for RegSetValue to see what the impact is for setting values in the registry:

RegSetValue – without spectre applied

 

RegSetValue – with spectre applied

1,079 RegSetValue events and a 4x performance degradation.  Just to test if it is specific to write events I changed the procmon filter to filter on “category” “Read”

 

Registry Reads – Spectre applied

 

Registry Reads – Spectre not applied

We see roughly the same ratio of performance degradation, perhaps a little more so.  As a further test, I created a PowerShell script that measures creating 1000 registry values and ran it on each system:
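Something along these lines (a rough sketch rather than the exact script; the key path is a throwaway):

```powershell
# Create 1000 registry values under a throwaway key and time the whole run.
# Point $path at a key under HKLM:\SOFTWARE instead to exercise the much
# larger SOFTWARE hive, as described below.
$path = 'HKCU:\Software\SpectreRegTest'
New-Item -Path $path -Force | Out-Null

$elapsed = Measure-Command {
    1..1000 | ForEach-Object {
        New-ItemProperty -Path $path -Name "TestValue$_" -Value $_ -PropertyType DWord -Force | Out-Null
    }
}
'{0:N0} ms to create 1000 registry values' -f $elapsed.TotalMilliseconds

# Clean up the throwaway key afterwards.
Remove-Item -Path $path -Recurse
```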

Spectre Applied

 

Spectre Not Applied

 

A 2.22x reduction in performance.  But this is writing to the HKCU…  which is a much smaller file.  What happens if I force a change on the much larger HKLM?

Spectre Applied

 

Spectre Not Applied

 

Wow.  The size of the registry hive makes a big difference in performance.  We go from a 2.22x to a 3.42x performance degradation.  So at a granular level, Spectre appears to have a large impact on registry operations, and the larger the hive, the worse the impact.  With this information it makes a lot of sense why Spectre may impact Citrix/RDS more.  Registry operations occur with a high frequency in this world, and logons highlight it even more, as group policy and the registry are very intertwined.

This actually brings to mind another metric I can measure.  We have a very large AppV package with an 80MB registry hive that is applied to the SOFTWARE hive when the package is loaded.  The difference in the amount of time (in seconds) loading this package is:

“583.7499291” (not spectre system)
“2398.4593479” (spectre system)

This goes from 9.7 minutes to 39.9 minutes.  Another 4x drop in performance, and this would be predominately registry related.  So, another data point showing that registry operations are hit very hard.
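For reference, the timing itself can be captured along these lines (a rough sketch; the cmdlets are from the App-V 5 client module and the package path is a placeholder):

```powershell
# Time the add/publish/mount of the App-V package, the stage where its
# registry data is staged into the SOFTWARE hive.
$pkgPath = '\\appv-share\Packages\BigPackage\BigPackage.appv'   # hypothetical path

$loadTime = Measure-Command {
    Add-AppvClientPackage -Path $pkgPath |
        Publish-AppvClientPackage -Global |
        Mount-AppvClientPackage | Out-Null
}
Write-Output ('{0:N1} seconds to load the package' -f $loadTime.TotalSeconds)
```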

Windows/VM Level – Capacity

Does Spectre affect the capacity of our Citrix servers?

I recorded the CPU utilization of several VM’s that mirror each other on hosts that mirror each other with a singular difference.  One set had the Spectre mitigation enabled.  I then took their CPU utilization results:

Red = VM with Spectre, Blue = VM without Spectre

By just glancing at the data we can see that the Spectre VM’s had higher peaks and they appear higher more consistently.  Since “spiky” data is difficult to read, I smoothed out the data using a moving average:

Red = VM with Spectre, Blue = VM without Spectre

We can get a better feel for the separation in CPU utilization between Spectre enabled/disabled.  We are seeing clearly higher utilization.

Lastly, I took all of the results for each hour and produced a graph in an additive model:

This graph gives a feel for the impact during peak hours, and helps smooth out the data a bit further.  I believe what I’m seeing with each of these graphs is a performance hit measured by the VM at 25%-35%.

Host Level – Capacity

Measuring from the host level can give us a much more accurate picture of actual resources consumed.  Windows 2008 R2’s CPU counters aren’t very accurate, and lots of little slices of CPU time can add up without being reported.

My apologies for swapping colors.  Raw data:

Blue = Spectre Applied, Red = No Spectre

Very clearly we can see the hosts with Spectre applied consume more CPU resources, even coming close to consuming 100% of the CPU resources on the hosts.  Smoothing out the data using moving averages reveals the gap in performance with more clarity.

 

Showing the “Max CPU” hit per hour gives another visualization of the performance hit.

 

Summary

Windows 2008 R2, for Citrix/RDS workloads, will be impacted quite heavily.  The impact that I’ve been able to measure appears to be focused on registry-related activities.  Applications that store their settings/values/preferences in registry hives, whether the SOFTWARE, SYSTEM, or HKCU hive, will feel a performance impact.  Logon actions on RDS servers are particularly impacted because group policies are largely registry-related items, thus logon times will increase as it takes longer to process reads and writes.  CPU utilization is higher at both the Windows VM level and the hypervisor level, by up to 40%.  The impact on the speed of applications and other functions is notable, although more difficult to measure.  I was able to measure a ~400% degradation in performance for CPU processing of Group Policy Preferences, but perception is a real thing, so going from 100ms to 400ms may not be noticed.  However, on applications that measure response time, we found response times grew to 165% of baseline: what took 1000ms now takes 1650ms.

At the time of this writing, I was only able to quantify the performance impact between two of the different hosts.  The Intel Xeon E5-2660 v4 and Intel Xeon E5-2680.

The Intel Xeon E5-2660 v4 has a base frequency 26% lower than the older 2680.  In order to overcome this handicap, the processor must have improved at a per-clock rate higher than the 26% frequency loss.  CPUBenchmark scores the two processors with single-thread CPU scores of 1616 for the Intel Xeon E5-2660 v4 and 1657 for the Intel Xeon E5-2680.  This puts them close, but even after 4 years the 2680 is marginally faster.  This played out in our testing: the higher-frequency processor fared better.  Performance degradation for the two processors came out as such:

 

CPU Performance Hit
Intel Xeon E5-2680 2.70GHz 155%
Intel Xeon E5-2660 v4 2.00GHz 170%

 

This tells us that the frequency of the processor is more important for mitigating the performance hit.

Keep in mind, these findings are my own.  It’s what I’ve experienced in my environment with the products and operating systems we use.  Newer operating systems are supposed to perform better, but I don’t have the ability to test that currently so I’m sharing these numbers as this is an absolute worst case type of scenario that you might come across.  Ensure you test the impact to understand how your environment will be affected!


Group Policy Preferences Registry Extension vs Group Policy Registry Extension

2018-04-11

In various discussions I’ve read about the drawbacks of Group Policy Preferences but is it really that bad?

 

…Or is it how you are using it?

 

There are two methods of applying registry keys/values with Group Policy.  The Group Policy Registry Extension is the “traditional” form of applying policies.  Also known as ADM or ADMX policies, when creating GPO’s with this method a binary file, “.pol”, is created.  When policy application occurs this file is read and applied to your registry.  As a binary file, this file is kept small and fast.  Reading and applying the settings should be nearly instant.

The second method of applying registry keys is with Group Policy Preferences (GPP).  This was a “new” method introduced in Windows Server 2008 with the purchase of PolicyMaker by Microsoft.  Group Policy Preferences are much, much more flexible than the traditional form.  There are different ways of applying registry values, including the CRUD model (Create, Replace, Update, Delete) and filtering by way of “Item Level Targeting“, either on an individual value or on a collection.

I’ve seen an organization heavily leverage GPP to great success.  I started to wonder, though, what the performance impacts of using GPP over the traditional method are.  This post will explore the differences in the CRUD model and how it compares to the traditional method.

I intend to look at the following scenarios:

  1. Creating a registry value
  2. Updating a previous registry value
  3. Removing a registry value

However, GPP has a fourth method, “Replace” and I’ll explore what it does in addition to these 3 methods.

Creating a Registry Value

In this scenario, the registry will be clean and a new value will be created.  I’m going to refer to the Group Policy Registry Extension (AKA, Administrative Templates, ADM/ADX) as the “traditional” method and use the abbreviation GPP for the Group Policy Preferences Registry Extension.

Traditional:

After reading the Registry.Pol from the sysvol, the application of the registry key takes just 3 operations.  RegCreateKey, RegSetValue, and RegCloseKey.

Each one of these operations took around 1-1.1ms, with the caveat that Process Monitor (procmon) consumes some resources capturing this information, slowing it down slightly.

 

GPP:

We can see a new operation “RegQueryValue”.  As described by William Stanek, “The Create action creates a preference if it doesn’t already exist. For example, you can use the Create action to create and set the value of a user environment variable called CurrentOrg on computers where it does not yet exist. If the variable already exists, the value of the variable will not be changed.”

The RegQueryValue is executing the check to see if a variable already exists.  So what does GPP look like if the value is already present?

3 operations, with the process exiting on success because the value is already present.

The end result is 3 operations for our traditional method and 4 operations for the Group Policy Preferences method for creating a registry entry.

Updating a registry value

In this scenario, the registry will contain a value, and the policy will be updated with a new value.  For the traditional method this will involve changing the Microsoft “User Profiles” policy.  I set the “HomeDir” location to “TrententTest”, applied the value, then updated it to “TrententTye”.  This will ensure a new, changed key is applied.  For GPP I’m going to change the value on the policy to 0x0 from 0x1 and use the “Update” operation.

Traditional:

Traditional maintains a very simple “3 operation” action with updating a value having the same effect as if the value was never present to begin with.

GPP:

With the “Update” action, GPP now executes just 3 operations, same as the traditional.

The end result is 3 operations for our traditional method and 3 operations for the Group Policy Preferences method for updating a registry entry.

Removing a registry value

In this scenario, I am going to remove a registry value.  Using the traditional method this means modifying my group policy to “Not Configured”, and for GPP this means setting “Delete” for our operation.

Traditional:

Again, Traditional performs its work in just 3 operations.

 

GPP:

GPP also performs this work in just 3 operations.

 

GPP – The Replace Method

Group Policy Preferences has another operation to explore.  “Replace”.

This operation …”creates preferences that don’t yet exist, or deletes and then creates preferences that already exist.”

This sounds like it performs a few operations.  Lets see what it looks like:

Replace executes “6” operations.  RegOpenKey, RegDeleteValue, RegCloseKey, RegCreateKey, RegSetValue, RegCloseKey.  I’m not entirely sure why you’d want a DeleteValue before SetValue but that’s what this selection does.

 

Revisiting GPP: “Creating a Registry Value”

During the process of creating this post, I wondered if the 3 operation “Update” would work better for creating a key.  The GPP “Create” selection has 4 operations, but the “Update” selection only has 3 operations.  I deleted my “TrententTestPreferences” key and refreshed group policy:

 

3 operations!  So Group Policy Preferences has the potential to operate at the same speed as the traditional group policy IF YOU STICK TO USING “UPDATE”.  At the very least, these operations should take the same amount of time.  Of course, implementation might be a different story.

The final tally:

Stay tuned for part 2 — The Performance Comparison


Group Policy – Monolithic vs Functional Design and Performance Evaluation

2018-04-09

Group Policy design is a hotly discussed topic, with lots of different ideas and discussions.  However, there are not a whole lot of actual metrics.  ADM and ADMX templates apply registry keys in an ‘enforced’ manner.  That is, if you or the machine has access to read the policies, the registry keys within are applied.  If you stuck purely to ADM/ADMX policies but wanted to do dynamic filtering or application of the keys/values based on a set of criteria, you’d probably design multiple policies and nested organizational units (OUs).  From there, you could filter certain policies based on the machine or user location in the OU structure, or by filtering on the policies themselves and denying access to certain groups or doing explicit allows for certain groups.  This design style, back in the day, was called “functional” design.

 

However, the alternative style, “monolithic” design, simplifies the group policy object (GPO) design into much fewer GPO’s.

 

Setup

My test setup is very simple: an organizational unit (OU) with inheritance blocked to control the application of the GPO’s.  I created 100 individual GPO’s, each with a single registry value, and 1 GPO with 100 values.  I chose to do a simple registry addition as it should be the best-performing option for group policy.  I created a custom ADMX file for this purpose:

Monolithic simulation:

 

Functional simulation:

 

Testing

In testing these two designs I elected to focus on the one factor that would have the most impact: latency.  I set up my client machine in the OU, put a WAN emulator that can manipulate latency in the path, and measured the performance of the functional and monolithic designs at varying latencies.  I looked for the following event IDs: 4257, 5257, 4016, 5016.  The x257 events correspond to group policy downloading the group policy objects off the SYSVOL file share.  The x016 events determine how long it took the policy to be processed.
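Pulling those durations out of the Group Policy operational log can be done with something like the following (a sketch; it naively pairs the most recent start/end events, which is good enough on an otherwise idle test client):

```powershell
# Grab the event IDs mentioned above from the Group Policy operational log
# and subtract the timestamps of each start/end pair.
$log = 'Microsoft-Windows-GroupPolicy/Operational'
$events = Get-WinEvent -FilterHashtable @{ LogName = $log; Id = 4016,5016,4257,5257 } -MaxEvents 50 |
    Sort-Object TimeCreated

# Processing time: most recent 4016 to the 5016 that follows it.
$start4016 = $events | Where-Object Id -eq 4016 | Select-Object -Last 1
$end5016   = $events | Where-Object Id -eq 5016 | Where-Object { $_.TimeCreated -gt $start4016.TimeCreated } | Select-Object -First 1
'{0:N0} ms from 4016 to 5016' -f ($end5016.TimeCreated - $start4016.TimeCreated).TotalMilliseconds

# Download time: most recent 4257 to the 5257 that follows it.
$start4257 = $events | Where-Object Id -eq 4257 | Select-Object -Last 1
$end5257   = $events | Where-Object Id -eq 5257 | Where-Object { $_.TimeCreated -gt $start4257.TimeCreated } | Select-Object -First 1
'{0:N1} s from 4257 to 5257' -f ($end5257.TimeCreated - $start4257.TimeCreated).TotalSeconds
```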

 

The results:

 

Raw Data:

Functional GPO – applying 100 registry values

Event ID 4016 to 5016 – Processing time
Latency (ms)    Time (ms)
0               271
10              4089
25              8078
50              15315
75              22904
100             29820

Event ID 4257 to 5257 – Starting to download policies
Latency (ms)    Time (s)
0               0
10              3
25              6
50              12
75              17
100             22

 

Monolithic GPO – applying 100 registry values

Event ID 4016 to 5016 – Processing time
Latency (ms)    Time (ms)
0               117
10              156
25              198
50              284
75              336
100             435

Event ID 4257 to 5257 – Starting to download policies
Latency (ms)    Time (s)
0               0
10              0
25              1
50              1
75              1
100             1

 

Analysis:

There is a huge edge to the monolithic design, both in terms of how long it takes to process a single policy vs multiple policies and in its resiliency to the effects of latency.  Even ‘light’ latency can have a profound impact on performance.  Going from 0ms to 10ms increased the length of time to process the functional design by 15 times!  The monolithic design, on the other hand, was barely impacted.  Even with a latency of 100ms, it only added ~300ms to the time to process the policy.  I would consider this imperceptible in the real world, whereas the functional design going from ~271ms to ~4000ms would be an extremely noticeable impact!  Let alone the roughly 30 seconds at 100ms!

Another factor is how much additional time is required to download the policies.  This is time in addition to the processing time.  This probably shouldn’t be a huge surprise; it appears that group policies are downloaded and processed sequentially, in order.  I’m sure this is necessary to maintain some semblance of predictability: if you have conflicting policy settings, the one last in the list (whether sorted alphabetically or what have you) can be relied on to be the winner.

Adding latency, even just a little latency, has a noticeable impact.  And the more policies, the more traffic, the more the impact of latency.  Again, a loss for a functional design and a win for a more monolithic design.

Conclusion:

Group Policy Objects can have a large impact on user experience.  The goal should be to minimize them to as few as possible.  As with everything, there are exceptions to the rule, but for Group Policy it’s important to try and maintain this one.  Even a little latency between the domain controller and the client can have a massive impact on group policy performance.  This can affect everything from the length of time it takes a machine to boot to the delay a user experiences logging into a system.

 


Corrupt Registry Repair with Citrix Provisioning Services

2018-04-02
/ /
in Blog
/

I encountered an interesting issue and worked through a solution for a corrupt registry.  The issue started innocuously enough: we upgraded PowerShell to 5.1.  Upon reboot I encountered a bluescreen with code 0xF4:

I have encountered this issue before, but I haven’t recorded my troubleshooting steps until now.

Since this was a PVS target device, the easy method was deleting the version and trying to upgrade PowerShell to 5.1, which resulted in the same BSOD.  So it was easily reproducible.  So I tried it a few more times, because, why not?

I then deleted this version, booted into the system, and looked at Event Viewer for hints of what could be at fault.  Going through it, it was pretty obvious that it was registry corruption:

Filtering for EventID 5 shows all the attempts of booting to BSOD:

 

I mean, it literally says the Registry was corrupted 🙂

Where can you find this corruption?  With PVS it’s fairly simple, but I believe the same process applies to other systems, including physical ones.  The first step was to mount the vDisk to a VM.

 

Mount the registry hive you suspect with corruption.

Next is to scan the registry for corruption.  Thus far, I’ve only found corruption to be detectable if it’s a key or value that cannot be read.  If the data in a value can be read but contains garbage, it’s much harder to detect.  In order to avoid permissions being a problem, I open a PowerShell prompt as SYSTEM using PsExec.  If you don’t elevate permissions, some keys may be restricted from the Admins group and this will be detected as a failure.

Once at this stage, it’s a one-liner to scan the registry:
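Something along these lines (a minimal sketch; 'HKLM:\PVS_SOFTWARE' is a placeholder for wherever the suspect hive was loaded, and the SYSTEM prompt comes from psexec -s -i powershell.exe):

```powershell
# Walk the mounted hive, logging every key name; keys that cannot be opened
# are recorded as non-terminating errors in $RegErrors.
Get-ChildItem -Path 'HKLM:\PVS_SOFTWARE' -Recurse -ErrorAction SilentlyContinue -ErrorVariable RegErrors |
    ForEach-Object { $_.Name } | Out-File C:\Temp\RegScan.txt

# The errors point at the suspect branches; the tail of RegScan.txt shows
# roughly where the walk was when each error was hit.
$RegErrors | Out-File C:\Temp\RegScanErrors.txt
```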

In my experience, corruption can be detected as “Permission Denied”, “Access Denied”,”Path does not exist” or some such:

 

At this point you can examine the text file to see the last path it explored:

 

At this point you can open regedit (as SYSTEM) and examine the keys within that path.  Clicking through each one revealed the corrupted key:

Attempting to view the permissions on this key reveals the corruption also exists at the ACL level:

“The requested security information is either unavailable or can’t be displayed”.

Deleting the key may fail as well:

At this point you need to evaluate how to deal with the corruption.  If you cannot delete the key, rename it, or in some way replace it, you may have an option like I did…  You can rename a branch higher up in the tree, go to an existing system with the same keys (with PVS I can go to the previous version and export that tree), and reimport.

Unload the hive and boot up the system –> and you may have a fully working system!

I’ve used this trick here, and on a corrupt COMPONENTS hive in the past.  With the COMPONENTS hive I got lucky: I could replace the corrupted keys with ones from a branched vDisk.  Other machines didn’t have the same key in COMPONENTS, so having that branched version was fortunate.

 


Meltdown – Performance Impact Evaluation (Citrix XenApp 6.5)

2018-01-15

Meltdown came out and it’s a vulnerability whose fix may have a performance impact.  Microsoft has stipulated that the impact will be more severe if you:

a) Are on an older OS
b) Are using an older processor
c) Have applications that utilize lots of context switches

Unfortunately, the environment we are operating hits all of these nails on the head.  We are using, I believe, the oldest OS that Microsoft is patching this for.  We are using older processors from 2011-2013 which do not have the PCID optimization (as reported by the SpeculationControl test script) which means performance is impacted even more.
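For reference, that check comes from Microsoft’s SpeculationControl module on the PowerShell Gallery (assuming PowerShellGet is available on the box), which reports both the mitigation state and whether the hardware exposes the PCID performance optimization:

```powershell
# Pull and run Microsoft's SpeculationControl check.
Install-Module SpeculationControl -Scope CurrentUser
Get-SpeculationControlSettings
```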

I’m in a large environment where we have the ability to shuffle VM’s around hosts and put VM’s on specific hosts.  This will allow us to measure the impact of Meltdown in its entirety.  Our clusters are dedicated to Citrix XenApp 6.5 servers.

Looking at our cluster and all of the Citrix XenApp VM’s, we have some VM’s that are ‘application siloed’ — that is they only host a single application and some VM’s that are ‘generic’.

In order to determine the impact I looked at our cluster, summed up the total of each type of VM, and then divided by the number of hosts.  We have 3 different geographical areas that have different VM types and user loads.  I am going to examine each of these workload types across the different geographical areas and see what the outcomes are.

Since Meltdown impacts applications and workloads that have lots of context switches, I used perfmon on each server to record its context switches.
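An equivalent capture can be done from PowerShell rather than the perfmon GUI; a sketch (the counter path is the standard system counter, the interval and output path are just what I would reach for):

```powershell
# Sample the system-wide context switch rate once a minute and append it to
# a CSV for later comparison between locations.
Get-Counter -Counter '\System\Context Switches/sec' -SampleInterval 60 -Continuous |
    ForEach-Object {
        [pscustomobject]@{
            Time            = $_.Timestamp
            ContextSwitches = [math]::Round($_.CounterSamples[0].CookedValue)
        }
    } | Export-Csv C:\Temp\ContextSwitches.csv -NoTypeInformation -Append
```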

The metrics I am interested in are the context switch values, as they’ve been identified as the element that highlights the impact.  My workloads look like this:

Based on this chart, our largest impact should be Location B, followed by Location A, last with Location C.

However, the processors for each location are as follows:

Location A: Intel Xeon 2680
Location B: Intel Xeon 2650 v2
Location C: Intel Xeon 2680

The processors may play a role, as newer generation processors are supposed to fare better.

In order to test Meltdown in a side-by-side comparison of systems with and without the mitigation I took two identical hosts and populated them with an identical amount and type of servers. On one host we patched all the VM’s with the mitigation and with the other host we left the VM’s without the patches.

Using the wonderful ControlUp console, we can compare the results in their real-time dashboard.  Unfortunately, the dashboard only gives us a “real time” view.  ControlUp offers a product called “Insights” that can show the historical data, but our organization has not subscribed to it, so I’ve had to track the performance by exporting the ControlUp views on an interval and then manually sorting and presenting the data.  Using the Insights view would be much, much faster.

ControlUp has 3 different views I was hoping to explore.  The first is the hosts view, with performance metrics pulled directly from the VMware host.  The second is the computers view, and the last is the sessions view; the computers and sessions views are metrics pulled directly from the Windows server itself.  However, I am unable to accurately judge the performance from the Windows Server metrics because of how Windows measures CPU performance (more on this below).

Another wonderful thing about ControlUp is that we can logically group our VM’s into folders; from there ControlUp can sum the values and present them in an easily digestible way.  I created a logical structure like so and populated it with my VM’s:

 

And then within ControlUp we can “focus” on each “Location” folder and, if we select the “Folder” view, it presents the sums for that logical grouping.

HOSTS

In the hosts view, we can very quickly see an impact, ranging from 5%-26%.  However, this is a realtime snapshot, so I tracked the average view and examined only “business hours”, as our load is VERY focused on 8AM-4PM.  After these hours we see a significant drop in our load, and if the servers are not being stressed the performance difference seems to be a lot more even, or not noticeable (in a cumulative sense).

FOLDERS

Some interesting results.  We are consistently seeing longer login times and application launch times.  Two of the three environments have lower user counts on the unpatched servers, with the Citrix load balancing making that determination.  The one environment that had more users on the mitigated servers is actually our least loaded in terms of servers per host and users per server, so it’s possible that more users would open up a gap; but as of now it shows that one of our environments can support an equal number of users.

Examining this view and the historical data presented I encountered oddities — namely the CPU utilization seemed to be fairly even more often than not, but the hosts view showed greater separation between the machines with mitigation and without.  I started to explore why and believe I may have come across this issue previously.

2008R2-era servers have less accurate reporting of CPU utilization.

I believe this is also the API that ControlUp is using with these servers to report on usage.  When I was examining a single server with Process Explorer I noticed a *minimum* CPU utilization of 6%, but Task Manager and ControlUp would report 0% utilization at various points.  The issue is one of accuracy: a summing and rounding problem.  The more users on a server, each running more processes that consume just a sliver of CPU, the greater the inaccuracy.  Example:

Left, Task Manager. Right, Process Explorer

We have servers with hundreds of users utilizing a workflow like this, where each is using just a fraction of a percent of CPU resources.  Task Manager and the like will not catch these values and will round *down*.  If you have 100 users each running a process that consumes 0.4% CPU, then the under-reporting is on the order of 40% of the CPU!  So relying on the VM metrics from ControlUp or Windows itself is not helpful.  Unfortunately, this destroys my ability to capture information within the VM, requiring us to rely solely on the information within VMware.  To be clear, I do NOT believe Windows 2012 R2 and greater OS’s have this discrepancy (although I have not tested), so this issue manifests itself pretty viciously in the XenApp 6.5-era servers.  Essentially, if Meltdown is increasing CPU time on processes by a fraction of a percent, then Windows will report as if everything is OK and you will probably not notice or think there is an issue!

In order to try and determine if this impact is detectable, I took two servers with the same base image, one with the mitigation installed and one without.  I used Process Explorer and “saved” the process list over the course of a few hours.  I ensured the servers had a similar number of users using a specific application that presented a specific workload, so everything was as similar as possible.  In addition, I looked at processes that aren’t configurable (or, since the servers have the same base image, are configured identically).  Here were my results:

Just eyeballing it, it appears that the mitigation has had an impact at the fractional level.  Taking the averages of the winlogon.exe and iexplore.exe processes into account:

 

These numbers may seem small, but once you start considering the number of users, the amount wasted grows dramatically.  For 100 users, winlogon.exe goes from consuming a total of 1.6% to 7.1% of the CPU, an additional load of 5.5%.  The iexplore.exe numbers are even more egregious, as it spawns 2 processes per user and these averages are per process.  For 100 users, 200 iexplore.exe processes will be spawned.  The iexplore.exe CPU utilization goes from 15.6% to 38.8%, for an additional load of 23.2%.  Adding the mitigation patch can impact our load pretty dramatically, even though it may be under-reported, which then impacts users on a far greater scale as more users are added to servers that don’t actually have the resources Windows is reporting they have.  For an application like IE, this may just mean greater slowness (depending on the workload/workflow), but if you have an application more sensitive to these performance scenarios your users may experience slowness even though the servers themselves look OK from most (all?) reporting tools.
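To make the scaling explicit, the per-process averages implied by those totals work out to roughly 0.016% vs 0.071% for winlogon.exe and 0.078% vs 0.194% per iexplore.exe process; these are back-calculated from the 100-user figures above, so treat them as illustrative.  The arithmetic:

# Back-calculated per-process averages (illustrative); scale back up to 100 users
$users = 100
$winlogonUnpatched = 0.016; $winlogonPatched = 0.071   # % CPU per winlogon.exe process
$ieUnpatched       = 0.078; $iePatched       = 0.194   # % CPU per iexplore.exe process (2 per user)

"winlogon.exe: {0:N1}% -> {1:N1}% of total CPU" -f ($users * $winlogonUnpatched), ($users * $winlogonPatched)
"iexplore.exe: {0:N1}% -> {1:N1}% of total CPU" -f (2 * $users * $ieUnpatched), (2 * $users * $iePatched)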

Continuing with the HOSTS view, I exported all the data ControlUp collects on a one-minute interval, added it to Excel, and created pivot tables splitting the hosts running servers with the mitigation patches from the ones without.  This is what I saw for Saturday-Sunday; these days are lightly loaded.

This is Location B; the host with unpatched VM’s is in orange and the host with patched VM’s is in blue.  The numbers are pretty much identical when CPU utilization on the host is around or below 10%, but once it starts to get loaded the separation becomes apparent.

Since these datapoints were captured every minute, I used a moving average of 20 data points (3 per hour) to present the data in a cleaner way:
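The same smoothing can be scripted rather than done in Excel; here is a rough sketch that collapses the per-minute export into 20-minute bucket averages (the CSV path and the ‘Time’/‘CPU %’ column names are assumptions about the export format):

# Collapse the per-minute ControlUp export into 20-minute bucket averages (3 per hour)
$rows = Import-Csv C:\Temp\HostCpuExport.csv

$buckets = for ($i = 0; $i -lt $rows.Count; $i += 20) {
    $end    = [math]::Min($i + 19, $rows.Count - 1)
    $window = $rows[$i..$end] | ForEach-Object { [double]$_.'CPU %' }
    [pscustomobject]@{
        BucketStart = $rows[$i].Time
        AvgCpu      = [math]::Round(($window | Measure-Object -Average).Average, 2)
    }
}
$buckets | Export-Csv C:\Temp\HostCpuSmoothed.csv -NoTypeInformation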

Looking at the data for this Monday morning, we see the following:

Location B

 

Some interesting events: at 2:00AM the VM’s reboot.  We reboot odd and even servers each day, and in organizing this test I put all the odd VM’s on the blue host and the even VM’s on the orange host.  So the blue line going up at 2:00AM is the odd (patched) VM’s rebooting.  The reboot cycle is staggered to take place over a 90 minute interval (the last VM’s should reboot around 3:30AM).  After the reboot, the servers come up and do some “pre-user” startup work like loading App-V packages, App-V registry pre-staging, etc.  I track the App-V registry pre-staging duration during bootup and here are my results:

Registry pre-staging in App-V is a light-read, heavy-write exercise.  Registry reading and writing are slow in 2008R2, and our time to execute this task went from 610 seconds to 693 seconds, an overall duration increase of 14%.

Looking at Locations A and C

Location A

Location C (under construction)

We can see in Location A that the CPU load is pretty similar until the 20% mark, at which point the separation starts to ramp up fairly drastically.  For Location C, unfortunately, we are undergoing maintenance on the ‘patched’ VM’s, so I’m showing this data for transparency but it’s only relevant up to the 14th.  I’ll update this in the next few days when the ‘patched’ VM’s come back online.

Now, I’m going to look at how “Windows” is reporting CPU performance versus the hosts’ CPU utilization.

Location A

 

Location B

 

The takeaway here is: do NOT trust the Windows CPU utilization meter (at least on 2008 R2).  The CPU utilization at the VM level does not appear to reflect the load on the hosts.  While the VM’s with the patch and without the patch both report nearly identical levels of CPU utilization, at the host level the spread is much more dramatic.

 

Lastly, I am able to pull some other metrics that ControlUp tracks, namely Logon Duration and Application Launch Duration.  For each of the locations I generated a report of the difference between the two environments:

Location A: Average Application Load Time

Location B: Average Application Load Time

 

Location A: Logon Duration

 

Location B: Logon Duration

 

 

 

In each of the metrics recorded, the experience worsens for our user base, from applications taking longer to launch to logon times increasing.

What does this all mean?

In the end, Meltdown has a significant impact on our Citrix XenApp 6.5 environment.  The perfect storm of older CPU’s, an older OS and applications with workflows that are affected by the patch means our environment is grossly impacted.  Location A has a maximum hit (as of today) of 21%, and Location B has a spread of 12%.  I had originally predicted that Location B would have the largest impact; however, the newer v2 processors may be playing a role, and they may simply be more efficient than the older 2680.

In the end, the performance hit is significant and reduces our capacity considerably once these patches are deployed.  I plan on adding new articles once I have more data on Meltdown, and further again once we start adding the mitigations against Spectre.

CPU Utilization on the hosts. Orange is a host with VM’s without the Meltdown patches, blue is with the patches.

 

Read More

Citrix Storefront – Adventures in customization – Add a help button to your Storefront UI

2017-12-27
/ /
in Blog
/

This customization is pretty easy.  Add the following to your custom.js file:

Replace “http://www.google.ca” with the URL you want your help screen to point to.

Read More