Performance

Citrix XenApp Enumeration Performance – IMA vs FMA

2017-04-11

We’re exploring upgrading our Citrix XenApp 6.5 environment to 7.X (currently 7.13), and some of our architecture decisions are driven by the performance of the XenApp infrastructure components.  In 6.5 this component is the “Citrix Independent Management Architecture” (IMA) service; in 7.13 it is the “Citrix Broker Service” (FMA).  The performance I’ll be measuring is how long it takes to enumerate applications for a user, which in XenApp 6.5 is the most intensive task put on the broker.  I’ve taken our existing XenApp 6.5 TEST environment and “migrated” it to 7.X.  The environment has 189 enabled applications with various security groups applied to each application, and the user I will be testing with has access to 55 of them.  When the broker/IMA service receives the XML request, it has to evaluate each application to see if the user has permission and return the results.  The request differs slightly between the FMA broker and the IMA service.  This is what the FMA requests will look like:

 

And the IMA requests:

In our environment, we have measured a peak load of 600 connections per second against our XenApp 6.5 IMA service.  We split this load over 7 servers, load-balanced via NetScaler VIPs, which brings the peak down to roughly 85 connections per server per second.  What’s a “connection”?  A connection is a request to the IMA service and a response from it.  This would be considered a single connection in my definition:

 

This is a single request (in RED) and response (in BLUE).  No further follow up is required by the client.

I’m going to profile a single response and request to better understand the individual performance of each product.

This is what my network traffic will look like (on the 7.X broker service):

The total time from when the FMA broker receives a single request to when it begins the response is:

Initial receipt of traffic at 37.567664-37.567827.
Response starts at 37.633986

Response ends at 37.634432.

Total time for FMA request for list of applications and the response for that list:

For IMA, the total time from when the IMA service receives a single request to when it begins the response is:

Initial receipt of traffic at 38.359944-38.360198.
Response starts at 38.440197

Response ends at 38.450032.

Total time for IMA request for list of applications and the response for that list:

Why the size difference (18KB vs 24KB)?

Looking at the data returned from the FMA versus the IMA shows there is a new field passed by the FMA broker as part of ‘AppData’:

These two lines add 61 bytes per application.  A standard application entry in the response (with title) is ~331 bytes for IMA and ~400 bytes for FMA.

However, these single requests are exactly that: single.

In order to get a better feel, I ran the requests continuously in a loop: send a request to the FMA, then to the IMA, delay 1 second, and resend.  This should give me a more accurate feel for the performance differences.  I ran this over a period of 10 minutes.  My results were:

 

IMA is faster by approximately 30ms per request.
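For reference, the loop itself was nothing fancy.  Here is a minimal sketch of the idea; the server names, the endpoint path, and the captured request bodies are placeholders for my environment, and requests is a third-party Python module:

    import time
    import statistics
    import requests  # third-party module: pip install requests

    FMA_URL = "http://fma-broker/scripts/wpnbr.dll"   # placeholder endpoint
    IMA_URL = "http://ima-server/scripts/wpnbr.dll"   # placeholder endpoint
    HEADERS = {"Content-Type": "text/xml"}

    # Request bodies captured from the network traces above (not reproduced here).
    fma_body = open("fma_request.xml", "rb").read()
    ima_body = open("ima_request.xml", "rb").read()

    def timed_post(url, body):
        """Return the request/response round trip in milliseconds."""
        start = time.perf_counter()
        requests.post(url, data=body, headers=HEADERS, timeout=30)
        return (time.perf_counter() - start) * 1000.0

    fma_times, ima_times = [], []
    deadline = time.time() + 600          # run for ten minutes
    while time.time() < deadline:
        fma_times.append(timed_post(FMA_URL, fma_body))
        ima_times.append(timed_post(IMA_URL, ima_body))
        time.sleep(1)                     # one-second delay between iterations

    print(f"FMA mean: {statistics.mean(fma_times):.1f} ms over {len(fma_times)} requests")
    print(f"IMA mean: {statistics.mean(ima_times):.1f} ms over {len(ima_times)} requests")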

Next up is load testing IMA vs FMA.


Citrix Workspace Environment Manager – Limitations (part 1?)

2017-04-03

In a brief evaluation of Citrix Workspace Environment Manager (WEM), I looked at the utility of the product to replace Group Policy Preferences (GPP) in a XenApp context.  For this I focused on replacing a set of registry keys we apply via GPP in our XenApp environment.  My results were not favorable for using WEM for the registry portion, as WEM pushes processing of its entries into the ‘shell’ session.  For XenApp, the shell session typically comes up quickly, so the application may launch before those keys are present (which is bad: the application needs those keys present first).  So although logon times may be reduced, this scenario does not work for the registry portion.  We are still exploring the effects of WEM and whether some other components that operate synchronously within GPP can be moved to it.  One of the big ‘wins’ for this approach may be drive mappings, which apply synchronously and must be processed before a user is allowed to log on.  Moving these to WEM may be a win worth exploring… IF the application doesn’t require the drive mappings before being launched.  But that’s for another article.

However, for the registry portion of WEM we did encounter a few ‘gotchas’ worth mentioning if you are going to use it.

WEM does not do ‘Registry Binary’ keys.

Well…  it says it does.  And it kind of does.  But odds are you are not going to get the results you expect.

Looking at a simple REG_BINARY key, the data it contains is displayed as hexadecimal:

If I want to use WEM to apply this key, I would create an entry within WEM:

Does this look correct to you?  I thought it looked correct to me.

HINT: it's not.

 

However, when I apply this key I get a completely wrong result.  Why?  Because WEM is applying the hexadecimal code in the ASCII field of the REG_BINARY.

In order to get WEM to apply the code properly we would need to copy/convert the ASCII from the REG_BINARY.

 

However, I have bad results doing so:

WEM with ASCII converted binary values applied.

 

Result:

This is closer, but still grossly wrong.

 

Why is this wrong?  WEM stores everything as XML, and XML files do not take kindly to binary or non-ASCII data.  WEM stores these values as their ASCII representations rather than as REG_BINARY data, so if your REG_BINARY contains a non-ASCII character (and there’s a 99% chance it does) it will fail to apply the key properly.
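A minimal demonstration of the underlying problem (nothing WEM-specific here, just how XML 1.0 treats control characters):

    import xml.etree.ElementTree as ET

    # Printable ASCII survives the round trip just fine.
    ET.fromstring("<Value>41 42 43</Value>")

    # A character reference to 0x01 (a byte you will find in most REG_BINARY
    # data) is rejected outright, because XML 1.0 forbids it.
    try:
        ET.fromstring("<Value>&#x1;</Value>")
    except ET.ParseError as err:
        print("parse failed:", err)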

 

Even worse, during my time fiddling with this, I BROKE WEM.

If the ASCII representation contained something like “&#x1”, it caused WEM to crash when applying the values.  So REG_BINARYs are completely out.  In my limited testing we only had one REG_BINARY to apply, but in our environment we use GPP to apply 5 different REG_BINARY keys, so using WEM for those applications is right out.  I filed a ‘feature enhancement’ request with Citrix to support applying REG_BINARYs properly, but I was told this was operating as expected, so I’m not holding my breath.  This does limit the utility of WEM.

 

 


Citrix Workspace Environment Manager vs. Group Policy Preferences – Performance Story

2017-03-10

Part one: Citrix Workspace Environment Manager – First Impressions
Part two: Citrix Workspace Environment Manager – Second Impressions

If you’ve been following along, I’ve been exploring using WEM to apply some registry values instead of using Group Policy Preferences.  WEM does things differently, which requires some thinking about how to implement it.  The lack of a Boolean OR is the biggest drawback, but it is possible to work around it, and our environment hasn’t required long chains of AND/OR conditions, so all the settings migrations I have done were possible.

But the meat of this post is HOW quickly can WEM process registry entries vs. GPP.

In order to compare the two I turned to an old standby: Procmon.  I logged into one of my Citrix servers with another account and started Procmon.  I then launched an application published on this server (Notepad).  I used ControlUp to measure the performance of the actual logon and the group policy extension processing.  The one we are particularly interested in is Group Policy Registry.  This measures the performance of the Group Policy Registry portion:

 

Group Policy Registry executed in 3494ms.  However, I have two GPO objects with values that were evaluated by this client side extension.  For WEM I only migrated a single GPO, so I’ll have to focus on that single one for the CSE.  To find the Group Policy engine, I used ControlUp to drill into the svchost.exe processes.  The PID was 1872:

One of the cool things about procmon is being able to suss out time stamps exactly.

For the Group Policy Registry CSE I can see it was activated at exactly 2:33:12.2056862.  From there it checks the group policy history for the XML file, then compares it to the one in SYSVOL:

With this particular policy, we check whether it’s a 32-bit or 64-bit system, we check for various pieces of software to see if they’re installed or not, and we then apply registry keys based on those results.  For instance:

We have GPP registry values that are set via item-level targeting depending on whether PDF Architect is installed or not.  You can literally see it check for PDF Architect and then set whatever values we determined need to be set by that result (ShowAssociationDialog, ShowPresentation, etc.).

However cool this is, this GPO is not the one I want 🙂

I want the next GPO ({E6775312-…}).  This GPO is the one that I have converted to WEM, as it only dealt with group membership.  WEM can filter on conditions like whether a file/folder exists, but since I didn’t want to do another thousand or so registry entries I focused on the smaller GPO.

 

This is the real meat.  We can see the GPO I’m interested in started processing at 2:33:14.5252890.

And then completed at 2:33:15.2480580.

The CSE didn’t actually finish though, until 2:33:15.706579 :

 

It looks like it was finishing some RSOP work (RSOP logging is disabled, BTW), storing things in the user registry hive like GPOLink, GPOName, etc.  Either way, these actions add time for the CSE to complete.  The total time spent in the Group Policy Registry CSE was:

~3501ms.

The total time reported by ControlUp was 3494ms.  So I’m probably a bit off on the start/finish of the CSE, but pretty goddamn close.  The real meat of the GPO Registry processing (i.e., the GPO I was actually concerned about) was:

1181ms.
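For reference, those durations are simply deltas between the Procmon timestamps.  A quick sketch of turning two of them into milliseconds (Procmon shows seven fractional digits, so trim one since Python's datetime only keeps microseconds):

    from datetime import datetime

    def elapsed_ms(start, end, fmt="%I:%M:%S.%f"):
        """Elapsed milliseconds between two same-day Procmon timestamps."""
        return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds() * 1000

    print(elapsed_ms("2:33:12.205686", "2:33:15.706579"))   # CSE start to finish, ~3501 ms
    print(elapsed_ms("2:33:14.525289", "2:33:15.706579"))   # the migrated GPO only, ~1181 ms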

 

So how does WEM do?

One of the ways that WEM ‘helps’ the logon-time stats is by pushing the processing into the user session, where it can be processed asynchronously.  WEM creates a log in your %userprofile% folder that you can examine to see when it starts:

 

Unfortunately, WEM’s log isn’t very granular.  Procmon will fix that again 🙂

The entry I’m looking for in WEM is “MainController.ProcessRegistryValues()”.  This tells us when it starts doing the registry work:

The processing started after 3:34:39.  Procmon will help us zero in on that time:

We can see pretty clearly that it starts applying registry values at 3:34:39.4995477.

It completes at…

3:34:46.5753350.

Total time:

7076ms.

Hmmm.  About twice as slow.  It is certainly possible that the WMI queries are what is killing my performance, but without another way to check the group membership of the server the user is on, I am hobbled.  It’s possible that if we were implementing WEM from the get-go we could come up with a naming scheme that works better within the limitations of the wildcard (although I think it’s a bug in WEM rather than a poor naming scheme, just my opinion), but reworking our entire environment is not feasible.

In order to determine if WEM will perform better without the WQL queries, I manually edited the condition to be focused on ‘Computer Name Match’ and specified all the relevant servers.

The results?

Started processing: 12:37:56.029

Finished at 12:37:59.634

 

Total time: 3605ms.

So there are big savings from skipping the WQL processing.

 

Final Results

 

 

But it still doesn’t compare to the Group Policy Preferences Registry CSE.  The CSE is still faster, pretty significantly actually.  And there are other considerations to weigh for WEM as well.  It applies the registry values *after* you’ve logged in, whereas GPP does it before.  This allows WEM to operate asynchronously and should reduce logon times, but there is a drawback, and for us it’s a big one.  Most of our apps need those registry values in place *before* the application launches, so setting the values after or while an application is launching may lead to some inconsistent experiences.  For us, this caveat only applies to a XenApp environment.

In a XenDesktop environment, a premium is generally placed on getting to the desktop, where the user still has to navigate to an application to launch it.  That navigation probably takes long enough that WEM will be able to apply its required values before the user launches their application.  In this scenario, saving 3-4 seconds of the user waiting for their desktop to appear is valuable, and WEM (mostly) solves this by pushing the work asynchronously into the shell.

For XenApp, we are still evaluating WEM to see how it performs with roaming users and with printers changing on session reconnection, depending on client name/IP space or some other variable.  That investigation will come in the future.  For now, it looks like we’re going to keep using Group Policy Preferences to apply registry settings on XenApp.

Stay tuned for some WEM gotchas I’ve learned doing this exercise.

 


Citrix Workspace Environment Manager – Second Impressions

2017-03-08

Workspace Environment Manager does not do Boolean logic.  It only does AND statements.

Note: These conditions are AND statements, not OR statements. Adding multiple conditions will require all of them to trigger for the filter to be considered triggered.

So we have to reconsider how we will apply some settings.  An example:

How would you apply this setting without WEM?  The setting is targeted to a specific set of servers that we’ve put into a group, AND the user must be a member of at least 1 of about 10 groups for it to apply.  And this is what we’ve set.  It seems fairly logical, and the way the targeting editor is set up it’s actually fairly easy to follow.

How do we configure this in WEM?

WEM works by applying settings based on your individual user account or a group containing users.  So each group needs to be added to WEM and then configured individually.  Alternatively, this can change your thinking and you can create groups that you want settings to apply towards.  For instance, in this example I only have two statements I need to be true.

The server must belong to a particular group.
The user must belong to at least one of several groups.

So I started by attempting to get to the result we wanted.

I exported the value as we wanted it from the registry and imported it into WEM:

Now to create our conditions.  I want this registry action to apply only if you are on a specific server.  We add our servers into logical groups, so I attempted to set the filter condition to target the server via this specific AD group.  It turns out this is not possible with the built-in filter conditions.

The Active Directory filter conditions are only evaluated against the user and not the machine.  However, there is an option for ‘ComputerName Match’:

Our server naming convention is CTX (for Citrix), the app or silo name (e.g., EPIC), a 3-digit numerical designation, and the letter “T” at the end for a test server (or no letter for a production server).  So our naming scheme looks like so:

CTXAPP301 – for a prod server
CTXAPP301T – for a test server

The guide for WEM does specify you can use a wild card “*” to allow arbitrary matches.  So I tested it and found it has limitations.
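Treating it as a standard shell-style glob, here is what I expected the match to do; a quick Python check purely for illustration:

    from fnmatch import fnmatch

    names = ["CTXAPP301", "CTXAPP301T", "CTXAPP302T", "CTXAPP310"]

    # Prod and test servers both match a trailing wildcard.
    print([n for n in names if fnmatch(n, "CTXAPP3*")])    # all four names

    # A wildcard in the middle should match only the test servers.
    print([n for n in names if fnmatch(n, "CTXAPP3*T")])   # CTXAPP301T, CTXAPP302T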

The wild card appears to only be valid if it’s at the beginning or end of the name.  For instance, setting the match string to:

CTXAPP3*

will match both the test and prod servers.  However, set CTXAPP3*T and the wildcard is ‘dropped’ from the search (according to the WEM logs), so the search string actually becomes CTXAPP3T.  Which is utterly useless.  So a proper naming scheme is very important.  Unfortunately, this will not work for us unless we manually specify all of our computer names.  For example, we would have to do the following for this to work:

CTXAPP301T, CTXAPP302T, CTXAPP303T, etc.

Although listing every server works, it is not maintainable.  Adding or removing servers would require us to keep this filter updated fairly constantly.  Is there another option?

Apparently, you can execute a WMI Query.  We can use this to see if the computer is a member of a group.  The command to do so is:

WEM supports variable replacement through a list of HASHTAGS

I modified my command as such:

Executing this command does check whether the computer is a member of a specific group.  The only caveat is that the user account that logs on must have permission in AD to read group memberships, which is usually granted.  In my case it is, so this command works.
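For illustration only (this is not the WQL string from the screenshot above), the equivalent directory lookup the condition performs, checking the computer's group membership with the logged-on user's own AD read permissions, might look something like this with the ldap3 module; the server, credentials, and DNs are placeholders:

    from ldap3 import Server, Connection, NTLM

    conn = Connection(Server("dc01.example.com"), user="EXAMPLE\\someuser",
                      password="...", authentication=NTLM, auto_bind=True)

    # Find the computer object and read its group memberships.
    conn.search("DC=example,DC=com",
                "(&(objectClass=computer)(cn=CTXAPP301T))",
                attributes=["memberOf"])

    member_of = conn.entries[0].memberOf.values if conn.entries else []
    print("CN=Citrix Test Servers,OU=Groups,DC=example,DC=com" in member_of)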

Next on the Group Policy Preference is the ‘AND – OR’.  We have the filter condition now that ensures these values will only be applied if the computer is in a certain group.  Now we need the value to apply if the user is a member of a certain group.

An easy solution might be to create a super-group, “HungAppTimeout” or some such, and add all the groups I want that value to apply to as members.  Alternatively, we can ‘configure’ each user group with the ‘server must belong to group’ filter, which should satisfy the same requirements.  I chose this route for our evaluation of WEM to avoid creating large swaths of groups simply for the purpose of applying a single registry entry.

Instead of doing the ‘OR’, we select the individual groups that this check would be against and specify that we apply this setting to each of those groups.

To do that we add each group to the ‘Configured Users’:

And then for each group, under ‘Assignment’ we apply our setting with the filter:

And now each group has this value applied with the appropriate conditions.

Continuing, we have the following policy:

So this filtering is applied to a collection, not the individual settings.  The filtering checks to see if the computer is a member of a specific group of servers, and whether the user is a member of a specific group.

In order to accomplish the same result I have no choice but to create a parent group for the machines.  Instead of an ‘OR’, we create a new group and add both server groups to it.  This should produce the same effective ‘OR’ for the machine check, at least.  Then we apply all the settings to the specific user groups so only they get the values applied.

In total we have 154 individual registry entries to apply.

So how does it compare to Group Policy Preferences?

Stay Tuned 🙂


Citrix Workspace Environment Manager – First Impressions

2017-03-02

Citrix Workspace Environment Manager (WEM) can be used as a replacement for Active Directory (AD) Group Policy Preferences (GPP).  It does not deal with machine policies, however, so AD Group Policy Objects (GPOs) are still required to apply policies to machines.  But WEM’s goal isn’t to manipulate machine policies; it is to improve user logon times by replacing the user portion of an AD GPO.  A GPO has two different engines for applying settings: a Registry policy engine and the engine that drives Client Side Extensions (CSEs).  The biggest time consumer in a GPO is processing the logic and actions of the CSEs.  I’ll look at each engine and what they mean for WEM.

Registry Extension

The first is the ‘Registry’ policy engine.  This engine is confusingly called the “Registry Extension”, as opposed to the CSE called “Group Policy Registry”.  The “Registry Extension” is the engine that applies registry settings set via ‘Administrative Templates’.

These settings are ‘dumb’ in that no logic processing is required.  When a setting is Enabled or Disabled, whatever key carries that value is applied immediately.  Processing of this engine is very, very fast, so migrating these policy settings would bring minimal or no improvement to logon times (unless you have a ton of GPOs applying and network latency becomes your primary blocker).

If you use ControlUp to Analyze GPO Extension Load Times it will display the Registry Extension and the Group Policy Registry CSE:

Client Side Extensions

CSEs, however, allow you to embed complex logic and actions that require processing to determine whether a setting should be applied, or how it should be applied.  One of the most powerful of these is the Registry CSE.  This CSE allows you to apply registry settings with Boolean logic, filtered on a huge number of variables.

All of this logic is stored in an XML document that is pulled when the group policy object is processed.  This file is located in “C:\ProgramData\Microsoft\Group Policy\History\GUID OF GPO\SID OF USER\Preferences\Registry”.
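As a rough illustration of what lives in that file, here is a sketch that parses the cached Registry.xml and counts how many entries carry item-level-targeting filters.  The element names reflect the usual GPP schema (RegistrySettings/Collection/Registry/Filters), so treat them as assumptions, and the GUID/SID in the path are placeholders to fill in:

    import xml.etree.ElementTree as ET

    # Fill in the real GPO GUID and user SID from the History folder.
    path = (r"C:\ProgramData\Microsoft\Group Policy\History"
            r"\{GPO-GUID}\{USER-SID}\Preferences\Registry\Registry.xml")

    total = filtered = 0
    for entry in ET.parse(path).getroot().iter("Registry"):
        total += 1
        if entry.find("Filters") is not None:   # entry carries item-level targeting
            filtered += 1

    print(f"{total} registry entries, {filtered} with item-level targeting")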

Parsing and executing the Boolean logic takes time.  This is where we would hope WEM can be faster.  In our existing environment, this processing consumes the majority of our logon time:

Migrating Group Policy Preferences to WEM

Let’s look at some of our registry preferences and what is required to migrate them into WEM.

Basic settings (e.g., ‘Always applied’).

“Visual Effects”

These settings have no filters and are applied to all users.  To migrate them to WEM I’ve exported these values and set them into a registry file:

Switching to WEM, I selected ‘Actions’, then ‘Registry Entries’, and imported the registry file.

An interesting side note, it appears the import excluded the REG_BINARY.  However you can create the REG_BINARY via the GUI:

To set the Registry Entries I created a filter condition called “Always True”

And then created a rule “Always True”

We have a user group that encompasses all of our Citrix users, which I added under ‘Configure Users’.  Then, during the assignment of the registry keys, I selected the ‘Always True’ filter:

And now these registry keys have been migrated to WEM.  It would be nice to ‘group’ these keys like you can with a collection in Group Policy Preferences.  Without that ability, the name of the action becomes really important, as it’s the only way you can filter:

Next I’ll look at replacing Group Policy Preferences that contain some boolean logic.


Tracing Citrix Provisioning Service (PVS) Target Device Boot Performance – Process Monitor

2017-01-31

Non-persistent Citrix PVS target devices have a more complicated boot process than a standard VM.  This is because the Citrix PVS server components play a big role in acting as the boot disk, sending UDP packets over the network to the target device.  This adds a delay that you simply cannot avoid (albeit possibly a small one, but there is no denying that network communication will be slower than a local hard disk/SSD).

One of the things we can do is set the PVS target devices up in such a way that we can get real, measurable data on what the target device is doing while it’s booting.  This will give us visibility into what we may actually require for our target devices.

There are two programs that I use to measure boot performance.  Windows Performance Toolkit and Process Monitor.  I would not recommend running both at the same time because the logging does add some overhead (especially procmon in my humble experience).

The next bit of this post will detail how to offline inject the necessary software and tools into your target device image to begin capturing boot performance data.

Process Monitor

For Process Monitor you must extract the boot driver and inject the process monitor executable itself into the image.

To extract the boot driver, simply launch Process Monitor and, under the Options menu, select ‘Enable Boot Logging’.

Then browse to your C:\Windows\System32\Drivers folder, and with “Show Hidden Files” enabled, copy out Procmon23.sys

It might be a good idea to disable boot logging if you did it on your personal system now 🙂

 

Now we need to inject the following registry entry into our image:

Here are the steps in action:

Seal/promote the image.
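If you would rather script the injection against the offline hive, a rough sketch follows.  The service values here are assumptions based on what Procmon’s boot-logging entry typically creates; export the real key from a machine where ‘Enable Boot Logging’ was turned on and mirror that:

    import subprocess
    import winreg

    # Run elevated. E: is the mounted vDisk; offline SYSTEM hives use ControlSet001.
    subprocess.run(r"reg load HKLM\PVSIMAGE E:\Windows\System32\config\SYSTEM",
                   check=True, shell=True)

    key = winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE,
                             r"PVSIMAGE\ControlSet001\Services\PROCMON23")
    winreg.SetValueEx(key, "Type", 0, winreg.REG_DWORD, 1)    # kernel driver (assumed)
    winreg.SetValueEx(key, "Start", 0, winreg.REG_DWORD, 0)   # boot start (assumed)
    winreg.SetValueEx(key, "ErrorControl", 0, winreg.REG_DWORD, 1)
    winreg.SetValueEx(key, "ImagePath", 0, winreg.REG_EXPAND_SZ,
                      r"System32\drivers\Procmon23.sys")
    winreg.CloseKey(key)

    subprocess.run(r"reg unload HKLM\PVSIMAGE", check=True, shell=True)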

On next boot you will have captured boot information:

To see how to use the Windows Performance Toolkit for boot tracing Citrix PVS target devices, click here.


Tracing Citrix Provisioning Service (PVS) Target Device Boot Performance – Windows Performance Toolkit

2017-01-31

Non-persistent Citrix PVS target devices have a more complicated boot process than a standard VM.  This is because the Citrix PVS server components play a big role in acting as the boot disk, sending UDP packets over the network to the target device.  This adds a delay that you simply cannot avoid (albeit possibly a small one, but there is no denying that network communication will be slower than a local hard disk/SSD).

One of the things we can do is set the PVS target devices up in such a way that we can get real, measurable data on what the target device is doing while it’s booting.  This will give us visibility into what we may actually require for our target devices.

There are two programs that I use to measure boot performance.  Windows Performance Toolkit and Process Monitor.  I would not recommend running both at the same time because the logging does add some overhead (especially procmon in my humble experience).

The next bit of this post will detail how to offline inject the necessary software and tools into your target device image to begin capturing boot performance data.

Windows Performance Toolkit

The Windows Performance Toolkit must be installed on the image, or you can copy the files from an existing install into your image from the following path:

To offline inject, simply mount your vDisk image and copy the files there:

 

Then the portion of it that we are interested in is “xbootmgr.exe” (aka boot logging).  In order to enable boot logging we need to inject the following registry key into our PVS Image:

Seal/promote the image.

On next boot you will have captured boot information:

To see how to use Process Monitor for boot tracing Citrix PVS target devices, click here.


Lets Make PVS Target Device Booting Great Again (Part 2)

2017-01-05

Continuing on from Part 1, we are looking to optimize the PVS boot process to be as fast as it possibly can be.  In Part 1 we implemented Jumbo Frames across both the PVS target device and the PVS server, and discovered that Jumbo Frames only apply to the portion of the boot after the BNIStack kicks in.

In this part we are going to examine the option “I/O burst size (KB)”  This policy is explained in the help file:

I/O burst size — The number of bytes that will be transmitted in a single read/write transaction before an ACK is sent from the server or device. The larger the IO burst, the faster the throughput to an individual device, but the more stress placed on the server and network infrastructure. Also, larger IO Bursts increase the likelihood of lost packets and costly retries. Smaller IO bursts reduce single client network throughput, but also reduce server load. Smaller IO bursts also reduce the likelihood of retries. IO Burst Size / MTU size must be <= 32, i.e. only 32 packets can be in a single IO burst before a ACK is needed.

What are these ACKs, and can we see them?  We can.  They are UDP packets sent back from the target device to the PVS server.  If you open Procmon on the PVS server and start up a target device, an ACK looks like so:

These highlighted 48-byte UDP Receive packets?  They are the ACKs.

And if we enable the disk view with the network view:

 

With each 32KB read of the hard disk we send out 24 packets: 23 at 1464 bytes and 1 at 440 bytes.  Add them all together and we get 34,112 bytes of data.  This implies an overall overhead of 1,344 bytes per sequence of reads, or 56 bytes per packet.  I confirmed it’s a per-packet overhead by looking at a different read event at a different size:

If we look at the first read event (8,192 bytes) we can see there are 6 packets, 5 at 1464 and one at 1208, totaling 8,528 bytes of traffic.  8,528 – 8,192 = 336 bytes of overhead / 6 packets = 56 bytes.

The same happens with the 16,384 byte read next in the list.  12 packets, 11 at 1464 and one at 952, totaling 17,056.  17056 – 16384 = 672 bytes of overhead / 12 packets = 56 bytes.

So it’s consistent.  For every packet at the standard 1506 MTU you are losing 3.8% to overhead.  But there is secretly more overhead than just that: for every read there is also a 48-byte ACK on top.  Admittedly, it’s not much, but it’s present.

And how does this look with Jumbo Frames?

For a 32KB read we satisfied the request in 4 packets.  3 x 8972 bytes and 1 at 6076 bytes totalling 32,992 bytes of transmitted data.  Subtracting the transmitted data from what is really required 32,992-32,768 = 224 bytes of overhead or…  56 bytes per packet 🙂

This amounts to a measly 0.6% of overhead when using jumbo frames (an immediate 3% gain!).
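As a quick sanity check of the per-packet figures, using the byte counts pulled straight from the traces above, every case works out to 56 bytes of overhead per packet:

    reads = [
        ("standard MTU, 32 KB read", 34112, 32768, 24),
        ("standard MTU,  8 KB read",  8528,  8192,  6),
        ("standard MTU, 16 KB read", 17056, 16384, 12),
        ("jumbo frames, 32 KB read", 32992, 32768,  4),
    ]
    for label, wire_bytes, data_bytes, packets in reads:
        print(label, "->", (wire_bytes - data_bytes) / packets, "bytes of overhead per packet")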

But what about this 32KB value.  What happens if we adjust it longer (or shorter)?

Well, there is a limitation that handicaps us…  even if we use Jumbo Frames.  It is stated here:

IO Burst Size / MTU size must be <= 32, i.e. only 32 packets can be in a single IO burst before a ACK is needed

Because Jumbo Frames don’t occur until after the BNIStack kicks in, we are limited to working out this math at the 1506 MTU size.

The caveat is that the size isn’t actually the MTU size of 1506.  The math is based on the data that fits within each packet, which is 1464 bytes of UDP payload minus the 56 bytes of per-packet overhead we found earlier, leaving 1408 bytes of disk data per packet.  Doing the math in reverse gives us 1408 x 32 = 45,056 bytes.  This equals a clean 44K (45,056 / 1,024) maximum size.  Setting I/O Burst to 44K, the target device still boots.  Counting the packets, there are 32 packets.

So if we up the IO/Burst by 1K to 45K (45*1024 = 46,080 bytes) will it still boot?

It does not boot.  This enforces a hard limit of 44K for I/O Burst until the 1st stage supports a larger MTU size.  I have only explored EFI booting, so I suppose it’s possible another boot method allows for larger MTU?
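One way to sanity-check that limit, assuming the 56-byte per-packet overhead applies throughout, is to count how many packets a given burst size needs at 1408 usable bytes per packet.  A quick sketch:

    import math

    usable_per_packet = 1464 - 56    # UDP payload minus the 56-byte overhead

    for burst_kb in (32, 44, 45):
        packets = math.ceil(burst_kb * 1024 / usable_per_packet)
        print(f"{burst_kb}K burst -> {packets} packets",
              "(ok)" if packets <= 32 else "(exceeds the 32-packet limit)")

This gives 24 packets for a 32K burst (matching the traces earlier), exactly 32 packets at 44K, and 33 packets at 45K, which is one over the limit.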

The reads themselves are split now, hitting the ‘version’ and the ‘base’ with the base being 25,600 + 20,480 for the version (46,080 bytes).  I believe this is normal for versioning though.

So what’s the recommendation here?

Good question.  Citrix defaults to a 32K I/O Burst Size.  If we break the operation of a burst down, we have 4 portions:

  1. Hard drive read time
  2. Packet send time
  3. Acknowledgement of receipt
  4. Turnaround time from receipt to next packet send

The times that I have for each portion at a 32K size appear to be (in milliseconds):

  1. 0.3
  2. 0.5
  3. 0.2
  4. 0.4

A total time of ~1.4ms per read transaction at 32K.

For 44K I have the following:

  1. 0.1
  2. 0.4
  3. 0.1
  4. 0.4

For a total time of ~1.0ms per read transaction at 44K.

I suspect the 0.4ms difference could well be within the margin of error of my hand-based counting.  I took my numbers from a random sampling of 3 transactions and averaged them, and I cannot guarantee they were at the same spot of the boot process.

However, it appears the difference between them is close to negligible.  The question that must be posed is: what is the cost of a ‘retry’ for a missed or faulty UDP packet?  From the evidence I have it should be fairly small, but I haven’t figured out a way to test or measure the turnaround time of a ‘retry’ yet.

Citrix has a utility that gives you some information on what kind of gain you might get.  It’s called ‘Stream Console’ and it’s available in the Provisioning Services folder:

 

With 4K I/O burst it does not display any packets sent larger because they are limited to that size

 

8K I/O Burst Size. Notice how many 8K sectors are read over 4K?

 

16K I/O Burst Size

 

To compare the performance of the I/O Burst Size options, I simply tried each size 3 times and took the boot-time results as posted by the Status Tray utility.  The unfortunate thing about the Status Tray is that its time/throughput calculations are rounded to the second.  This means the throughput isn’t entirely accurate: a second is a LARGE value when you’re talking about the difference between 8 and 9 seconds, and being just under or over the rounding threshold will change the results at these numbers.  But I’ll present my results anyway:

To me, the higher the I/O Burst Size, the better the performance.

Again, caveats are that I do not know what the impact of a retry is, but if reading from the disk and resending the packet takes ~1ms then I imagine the ‘cost’ of a retry is very low, even with the larger sizes.  However, if your environment has longer disk reads, high latency, and a poor network with dropped or lost packets then it’s possible, I suppose, that higher I/O burst is not for you.

But I hope most PVS environments are something better designed and you actually don’t have to worry about it.  🙂


Lets Make PVS Target Device Booting Great Again (Part 1)

2016-12-30

Some discussions have swirled recently about implementing VDI.  One of the challenges with VDI is slow boot times, which necessitates having machines pre-powered on: a pool of machines sits consuming server resources until a logon request comes in and more machines are powered on to meet the demand…  But what if your boot time is measured in seconds?  Something so low you could keep the ‘pool’ of standby machines at 1 or 2, or even none!

I’m interested in investigating if this is possible.   I previously looked at this as a curiosity and achieved some good results:

 

However, that was a non-domain Server 2012 R2 fresh out of the box.  I tweaked my infrastructure a bit by storing the vDisk on a RAM Disk with Jumbo Frames (9k) to supercharge it somewhat.

Today, I’m going to investigate this again with PVS 7.12, UEFI, Windows 10, on a domain.  I’ll show how I investigated booting performance and see what we can do to improve it.

The first thing I’m going to do is install Windows 10, join it to the domain and create a vDisk.

Done.  Because I don’t have SCVMM set up on my home lab, I had to muck my way to enabling UEFI HDD boot.  I went into the PVS folder (C:\ProgramData\Citrix\Provisioning Services) and copied BDMTemplate_uefi.vhd to my Hyper-V target device folder.

I then edited my Hyper-V Target Device (Gen2) and added the VHD:

I then mounted the VHD and modified the PVSBOOT.INI file so it pointed to my PVS server:

 

 

I then created my target device in the PVS console:

 

And voilà!  It booted.

 

And out of the gate we are getting 8-second boot times.  At this point I don’t have it set up with a RAM drive or anything, so this is pretty stock, albeit on really fast hardware.  My throughput is crushing my previous speed record, so if I can reduce the number of bytes read (it’s literally bytes read / time = throughput) I can improve my boot time.  On the flip side, I can try to increase my throughput, but that’s a bit harder.

However, there are some tricks I can try.

I have Jumbo Frames enabled across my network.  At this stage I do not have them set but we can enable them to see if it helps.

To verify their operation I’m going to trace the boot operation from the PVS server using procmon:

We can clearly see the UDP packet size is capping out at 1464 bytes, making it 1464 + 8-byte UDP header + 20-byte IP header = 1492 bytes.  So I enabled Jumbo Frames.

Under Server Properties in the PVS console I adjusted the MTU to match the NIC:

 

You then need to restart the PVS services for it to take effect.

I then made a new vDisk version and enabled Jumbo Frames in the OS of the target device.  I did a quick ping test to validate that Jumbo Frames are passing correctly.

I then started procmon on the PVS server, set the target device to boot…

and…

 

1464-byte UDP packets.  A little smaller than the 9000 bytes or so they’re supposed to be.  Scrolling down a little further, however, shows:

 

Notice the amount of UDP packets sent in the smaller frame size?

 

Approximately 24 packets until it gets a “Receive” notification to send the next batch of packets.  These 24 packets account for ~34,112 bytes of data per sequence.  Total time for each batch of packets is 4-6ms.

If we follow through to when the jumbo frames kick in we see the following:

This is a bit harder to read because the MIO (Multiple Input Output) kicks in here and so there are actually two threads executing the read operations as opposed to the single thread above.

Regardless, I think I’ve hit on a portion that is executing more or less sequentially.  The total amount of data being passed in these sequences is ~32,992 bytes, but the time to execute on them is 1-2ms!  We have essentially doubled the performance of these reads.
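As a rough back-of-the-envelope figure, using the midpoints of the batch times observed above (so treat these as ballpark numbers only), the per-sequence throughput works out to:

    def mb_per_sec(total_bytes, ms):
        return total_bytes / (ms / 1000) / (1024 * 1024)

    print(f"standard MTU : {mb_per_sec(34112, 5.0):.1f} MB/s per sequence (~4-6 ms per batch)")
    print(f"jumbo frames : {mb_per_sec(32992, 1.5):.1f} MB/s per sequence (~1-2 ms per batch)")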

So why is the data being sent like this?  Again, procmon brings some visibility here:

Each “UDP Receive” packet is a confirmation that the data it received was good, and it instructs the Stream Process to read and send the next portion of the file on the disk.  If we move to the jumbo frame portion of the boot process, we can see the IO goes all over the place in size and in where the reads occur:

So, again, jumbo frames are a big help here, as all requests under 8K can be serviced in 1 packet, and there are usually MORE requests under 8K than above.  Fortunately, Procmon can give us some numbers to illustrate this.  I started and stopped a procmon trace for a network boot with Jumbo Frames and for one without:

Standard MTU (1506)

 

Jumbo Frame MTU (9014)

 

The number we are really after is 192.168.1.88:6905.  The total number of events is solidly cut in half, with the number of sends about a third less!  It was fast enough that it processed double the amount of data in bytes sent to the target device and bytes received from the target device!

Does this help our throughput?  Yes, it does:

 

“But Trentent!  That doesn’t show the massive gains you are spewing!  It’s only 4MB/s more in Through-put!”

And you are correct.  So why aren’t we seeing more gains?  The issue lies with how PVS boots.  It boots in two stages.  If you are familiar with PVS on Hyper-V from a year or more ago, you are probably aware of this issue.  Essentially, PVS breaks the boot into a first stage (the bootloader stage) which starts in, essentially, a lower-performance mode (standard MTU).  Once the BNIStack loads, it kicks into jumbo packet mode with the loading of the synthetic NIC driver.  The benefit from Jumbo Frames doesn’t occur until this stage.  So when does Jumbo Frames kick in?  You can see it in Event Viewer.

From everything I see with Procmon, first-stage boot ends at that first Ntfs event.  So out of the original 8 seconds, 4 are spent in first-stage boot where jumbo packets are not enabled.  Everything after that is impacted (positively).  So for the remaining 4 seconds of the boot, bringing that down by a second is a 25% improvement!  Not small potatoes.

I intend to do more investigation into what I can do to improve boot performance for PVS target devices so stay tuned!  🙂


AppV 5.1 Sequencer – Not capturing all registry keys – Update

2016-10-31

My previous post touched on an issue we are having with some applications: the AppV sequencer is not capturing all registry keys.  I have been working with Microsoft on this issue for over 2 years but just recently made some headway in getting it addressed.  I have good news and bad news.  The good news is that the root cause of this issue appears to have been found.

It appears that ETW (Event Tracing for Windows) will capture some of the events out of order and the AppV sequencer will then apply that out of order sequence.  The correct sequence of events should be:
RegDeleteKey
RegCreateKey

But in certain scenarios it’s capturing the events as:

RegCreateKey
RegDeleteKey

By capturing the deletion last, the AppV sequencer is effectively being told not to capture the registry key.
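A toy illustration of why the ordering matters (purely to show the logic, not how the sequencer is actually implemented):

    def replay(events):
        keys = set()
        for op, key in events:
            if op == "RegCreateKey":
                keys.add(key)
            else:                      # RegDeleteKey
                keys.discard(key)
        return keys

    correct   = [("RegDeleteKey", "HKCU/Software/App"), ("RegCreateKey", "HKCU/Software/App")]
    reordered = [("RegCreateKey", "HKCU/Software/App"), ("RegDeleteKey", "HKCU/Software/App")]

    print(replay(correct))     # {'HKCU/Software/App'}  -> key ends up in the package
    print(replay(reordered))   # set()                  -> key is dropped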

Microsoft has corrected this issue in the ‘Anniversary’ edition of Windows 10 (Build 14393+) and sequencing in this OS will capture all the registry keys correctly.

The bad news is that Microsoft is still only evaluating backporting the fix to older versions of Windows, specifically Windows 2008 R2.  Windows 2008 R2 is still widely used, and AppV best practice is to sequence on the OS you plan to deploy to, but if the older OS sequences unreliably this complicates the ability to ‘trust’ the product.  This fix still needs to be backported to 2012, 2012 R2, and the related client OSes as well, so hopefully they get it.  The reason I was told 2008 R2 may not get the fix is that it is no longer in mainstream support, but Windows 7 SP1 currently is, and it is analogous to 2008 R2.  So hopefully the effort to fix this bug will be backported and we’ll have a solid AppV platform where we can sequence our packages with confidence.
