
Tracing Citrix Provisioning Service (PVS) Target Device Boot Performance – Process Monitor

2017-01-31

Non-Persistent Citrix PVS Target Devices have a more complicated boot process than a standard VM. This is because the Citrix PVS server components play a big role in acting as the boot disk, sending UDP packets over the network to the target device. This adds a delay that you simply cannot avoid (albeit possibly a small one, but there is no denying network communication will be slower than a local hard disk/SSD).

One of the things we can do is set the PVS target devices up in such a way that we can get real, measurable data on what the target device is doing while it’s booting.  This will give us visibility into what we may actually require for our target devices.

There are two programs that I use to measure boot performance: Windows Performance Toolkit and Process Monitor. I would not recommend running both at the same time because the logging does add some overhead (especially Process Monitor, in my humble experience).

The next bit of this post will detail how to offline inject the necessary software and tools into your target device image to begin capturing boot performance data.

Process Monitor

For Process Monitor you must extract the boot driver and inject the process monitor executable itself into the image.

To extract the boot driver, simply launch Process Monitor and, under the Options menu, select 'Enable Boot Logging'.

Then browse to your C:\Windows\System32\Drivers folder and, with "Show Hidden Files" enabled, copy out Procmon23.sys.

It might be a good idea to disable boot logging now if you did this on your personal system 🙂

 

Now we need to inject the following registry entry into our image:
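The screenshot with the exact entry isn't reproduced here, but as a rough sketch of the idea: Procmon's 'Enable Boot Logging' option registers Procmon23.sys as a boot-start driver service, so offline injection means recreating that service entry in the image's offline SYSTEM hive. The hive path, mount name, and the specific values below are assumptions based on what Procmon creates on a live system, so verify them against your own machine with boot logging enabled:

```python
import subprocess

# Assumptions: the vDisk is mounted with its Windows volume at P:\ and we load
# its SYSTEM hive under a temporary name. The PROCMON23 service values below
# mirror what Process Monitor's "Enable Boot Logging" option creates locally.
HIVE = r"P:\Windows\System32\config\SYSTEM"
MOUNT = r"HKLM\PVSIMAGE"
KEY = MOUNT + r"\ControlSet001\Services\PROCMON23"

def reg(*args):
    subprocess.run(["reg", *args], check=True)

reg("load", MOUNT, HIVE)
try:
    reg("add", KEY, "/v", "Type", "/t", "REG_DWORD", "/d", "1", "/f")          # kernel driver
    reg("add", KEY, "/v", "Start", "/t", "REG_DWORD", "/d", "0", "/f")         # boot start
    reg("add", KEY, "/v", "ErrorControl", "/t", "REG_DWORD", "/d", "1", "/f")  # normal
    reg("add", KEY, "/v", "Group", "/t", "REG_SZ", "/d", "FSFilter Activity Monitor", "/f")
    reg("add", KEY, "/v", "ImagePath", "/t", "REG_EXPAND_SZ", "/d",
        r"System32\Drivers\Procmon23.sys", "/f")
finally:
    reg("unload", MOUNT)
```

Remember to also copy Procmon.exe itself into the image (as noted above) so you can open the captured boot log once the target device comes up.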

Here are the steps in action:

Seal/promote the image.

On next boot you will have captured boot information:

To see how to use the Windows Performance Toolkit for boot tracing Citrix PVS Target Devices, click here.


Tracing Citrix Provisioning Service (PVS) Target Device Boot Performance – Windows Performance Toolkit

2017-01-31

Non-Persistent Citrix PVS Target Devices have a more complicated boot process than a standard VM. This is because the Citrix PVS server components play a big role in acting as the boot disk, sending UDP packets over the network to the target device. This adds a delay that you simply cannot avoid (albeit possibly a small one, but there is no denying network communication will be slower than a local hard disk/SSD).

One of the things we can do is set the PVS target devices up in such a way that we can get real, measurable data on what the target device is doing while it’s booting.  This will give us visibility into what we may actually require for our target devices.

There are two programs that I use to measure boot performance: Windows Performance Toolkit and Process Monitor. I would not recommend running both at the same time because the logging does add some overhead (especially Process Monitor, in my humble experience).

The next bit of this post will detail how to offline inject the necessary software and tools into your target device image to begin capturing boot performance data.

Windows Performance Toolkit

For the Windows Performance Toolkit, it must either be installed in the image, or you can copy the files from an existing install into your image from the following path:

To offline inject, simply mount your vDisk image and copy the files there:
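As a minimal sketch of that copy step, assuming the toolkit was installed via the Windows ADK at its default path and the mounted vDisk's Windows volume shows up as P:\ (both assumptions; adjust to your environment):

```python
import shutil

# Assumptions: WPT installed via the Windows ADK at its default location, and
# the PVS vDisk (VHD/VHDX) mounted so that its Windows volume is visible as P:\.
SRC = r"C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit"
DST = r"P:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit"

# Copy the entire toolkit (xbootmgr.exe, xperf.exe, supporting DLLs, etc.) into the image.
shutil.copytree(SRC, DST, dirs_exist_ok=True)
print("Copied Windows Performance Toolkit into the mounted vDisk.")
```

Copying the whole folder is the simplest way to make sure xbootmgr.exe has everything it needs alongside it.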

 

The portion of it that we are interested in is "xbootmgr.exe" (i.e., boot logging). In order to enable boot logging we need to inject the following registry key into our PVS image:
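The screenshot with the exact key isn't reproduced here, and this sketch is only one way to get a boot trace going, not necessarily the author's: inject a Run value into the image's offline SOFTWARE hive that launches xbootmgr at logon, which then reboots the target and records the following boot. The paths, value name, and trace flags are assumptions; adjust them to your install:

```python
import subprocess

# Assumptions: the vDisk's Windows volume is mounted at P:\ and we load its offline
# SOFTWARE hive. This is not necessarily the exact key from the original screenshot;
# it is one way to have xbootmgr start a boot trace, writing results to C:\boottrace.
HIVE = r"P:\Windows\System32\config\SOFTWARE"
MOUNT = r"HKLM\PVSIMAGE"
RUN_KEY = MOUNT + r"\Microsoft\Windows\CurrentVersion\Run"
XBOOTMGR = r'"C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit\xbootmgr.exe"'
CMD = XBOOTMGR + r' -trace boot -traceFlags BASE+CSWITCH -resultPath C:\boottrace'

def reg(*args):
    subprocess.run(["reg", *args], check=True)

reg("load", MOUNT, HIVE)
try:
    reg("add", RUN_KEY, "/v", "BootTrace", "/t", "REG_SZ", "/d", CMD, "/f")
finally:
    reg("unload", MOUNT)
```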

Seal/promote the image.

On next boot you will have captured boot information:

To see how to use Process Monitor for boot tracing Citrix PVS Target Devices, click here.


Let's Make PVS Target Device Booting Great Again (Part 2)

2017-01-05

Continuing on from Part 1, we are looking to optimize the PVS boot process to be as fast as it possibly can be. In Part 1 we implemented Jumbo Frames across both the PVS target device and the PVS server, and discovered that Jumbo Frames only apply to the portion of the boot where the BNIStack kicks in.

In this part we are going to examine the option "I/O burst size (KB)". This policy is explained in the help file:

I/O burst size — The number of bytes that will be transmitted in a single read/write transaction before an ACK is sent from the server or device. The larger the IO burst, the faster the throughput to an individual device, but the more stress placed on the server and network infrastructure. Also, larger IO Bursts increase the likelihood of lost packets and costly retries. Smaller IO bursts reduce single client network throughput, but also reduce server load. Smaller IO bursts also reduce the likelihood of retries. IO Burst Size / MTU size must be <= 32, i.e. only 32 packets can be in a single IO burst before a ACK is needed.

What are these ACKs, and can we see them? We can. They are UDP packets sent back from the target device to the PVS server. If you open Procmon on the PVS server and start up a target device, an ACK looks like so:

These highlighted 48-byte UDP Receive packets? They are the ACKs.

And if we enable the disk view with the network view:

 

With each 32KB read of the hard disk we send out 24 packets: 23 at 1464 bytes and 1 at 440 bytes. Add them all together and we get 34,112 bytes of data. That is 1,344 bytes (34,112 - 32,768) of overhead per sequence of reads, or 56 bytes per packet. I confirmed it's a per-packet overhead by looking at a different read event at a different size:

If we look at the first read event (8,192) we can see there are 6 packets, 5 at 1464 and one at 1208, totaling 8528 bytes of traffic.  8528 – 8192 = 336 bytes of overhead / 6 packets = 56 bytes.

The same happens with the 16,384 byte read next in the list.  12 packets, 11 at 1464 and one at 952, totaling 17,056.  17056 – 16384 = 672 bytes of overhead / 12 packets = 56 bytes.

So it's consistent. For every packet at the standard 1506 MTU you are losing about 3.8% (56 / 1464) to overhead. But there is secretly more overhead than just that: for every read there is a 48-byte ACK on top. Admittedly, it's not much, but it's present.
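A few lines of Python make it easy to sanity-check that 56-byte figure across all three read sizes quoted above:

```python
# Sanity-check the per-packet overhead math using the reads observed above:
# (read size in bytes, list of UDP payload sizes seen for that read)
reads = {
    32768: [1464] * 23 + [440],
    8192:  [1464] * 5  + [1208],
    16384: [1464] * 11 + [952],
}

for size, packets in reads.items():
    wire = sum(packets)
    overhead = wire - size
    print(f"{size:>6} byte read -> {len(packets):>2} packets, "
          f"{wire} bytes on the wire, {overhead} bytes overhead "
          f"({overhead // len(packets)} bytes/packet)")

# Each full 1464-byte packet therefore carries 1464 - 56 = 1408 bytes of disk data,
# and at standard MTU roughly 56 / 1464 of every packet is overhead.
print(f"overhead per standard packet: {56 / 1464:.1%}")
```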

And how does this look with Jumbo Frames?

For a 32KB read we satisfied the request in 4 packets: 3 at 8972 bytes and 1 at 6076 bytes, totaling 32,992 bytes of transmitted data. Subtracting what is actually required from the transmitted data, 32,992 - 32,768 = 224 bytes of overhead or…  56 bytes per packet 🙂

This amounts to a measly 0.6% of overhead when using jumbo frames (an immediate 3% gain!).

But what about this 32KB value? What happens if we adjust it larger (or smaller)?

Well, there is a limitation that handicaps us…  even if we use Jumbo Frames.  It is stated here:

IO Burst Size / MTU size must be <= 32, i.e. only 32 packets can be in a single IO burst before a ACK is needed

Because Jumbo Frames don’t occur until after the BNIStack kicks in, we are limited to working out this math at the 1506 MTU size.

The caveat is that the size isn't actually based on the MTU size of 1506. The math is based on the data that fits within each packet, which is 1408 bytes (the 1464-byte payload minus the 56 bytes of per-packet overhead). Doing the math in reverse gives us 1408 x 32 = 45,056 bytes. This equals a clean 44K (45,056 / 1024) maximum size. Setting I/O Burst to 44K, the target device still boots, and counting the packets there are exactly 32.
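Putting the 32-packets-per-burst rule together with the 1408 bytes of data carried per standard packet, the 44K ceiling (and why 45K fails) drops out of a quick calculation:

```python
import math

# Max I/O burst at standard MTU: 32 packets per burst, 1408 data bytes per packet
# (1464-byte UDP payload minus the 56-byte per-packet overhead observed above).
PACKETS_PER_BURST = 32
DATA_PER_PACKET = 1464 - 56   # = 1408 bytes

max_burst = PACKETS_PER_BURST * DATA_PER_PACKET
print(max_burst, max_burst / 1024)              # 45056 bytes -> 44.0 KB

# A 45K burst needs more than 32 packets, which is why it refuses to boot:
print(math.ceil(45 * 1024 / DATA_PER_PACKET))   # 33 packets
```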

So if we up the IO/Burst by 1K to 45K (45*1024 = 46,080 bytes) will it still boot?

It does not boot.  This enforces a hard limit of 44K for I/O Burst until the 1st stage supports a larger MTU size.  I have only explored EFI booting, so I suppose it’s possible another boot method allows for larger MTU?

The reads themselves are split now, hitting the 'version' and the 'base', with 25,600 bytes coming from the base and 20,480 bytes from the version (46,080 bytes total). I believe this is normal for versioning though.

So what’s the recommendation here?

Good question. Citrix defaults to a 32K I/O Burst Size. If we break down the operation of a burst we have 4 portions:

  1. Hard drive read time
  2. Packet send time
  3. Acknowledgement of receipt
  4. Turnaround time from receipt to next packet send

The times that I have for each portion at a 32K size appear to be (in milliseconds):

  1. 0.3
  2. 0.5
  3. 0.2
  4. 0.4

A total time of ~1.4ms per read transaction at 32K.

For 44K I have the following:

  1. 0.1
  2. 0.4
  3. 0.1
  4. 0.4

For a total time of ~1.0ms per read transaction at 44K.

I suspect the 0.4ms difference could well be within the margin of error of my hand-based counting. I took my numbers from a random sampling of 3 transactions and averaged them, and I cannot guarantee they were at the same spot of the boot process.

However, it appears the difference between them is close to negligible. The question that must be posed is: what's the cost of a 'retry', that is, a missed or faulty UDP packet? From the evidence I have it should be fairly small, but I haven't figured out a way to test or detect what the turnaround time of a 'retry' is yet.
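As a back-of-envelope comparison only (it inherits all the margin-of-error caveats above), those per-transaction times imply roughly the following streaming rates:

```python
# Back-of-envelope: implied streaming rate from the hand-timed transaction costs above.
# bytes per transaction, seconds per transaction
scenarios = {
    "32K burst": (32 * 1024, 1.4e-3),
    "44K burst": (44 * 1024, 1.0e-3),
}

for name, (size, seconds) in scenarios.items():
    print(f"{name}: {size / seconds / 1_000_000:.1f} MB/s")
# 32K burst: ~23.4 MB/s, 44K burst: ~45.1 MB/s -- but a 0.2-0.4 ms swing in the
# hand-counted timings moves these numbers a lot, which is why the real-world
# difference may be much smaller.
```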

Citrix has a utility that gives you some information on what kind of gain you might get.  It’s called ‘Stream Console’ and it’s available in the Provisioning Services folder:

 

With a 4K I/O Burst Size it does not display any larger packet sends, because they are limited to that size.

 

8K I/O Burst Size. Notice how many more 8K reads there are compared to 4K?

 

16K I/O Burst Size

 

To compare the performance of all the I/O Burst Size options, I simply tried each size 3 times and took the boot times reported by the Status Tray utility. The unfortunate thing about the Status Tray is that its time/throughput calculations are rounded to the second. This means the throughput figures aren't entirely accurate, since a second is a LARGE value when you're talking about the difference between 8 and 9 seconds. If you are just under or over whatever the rounding threshold is, it will change your results once we start getting to these numbers. But I'll present my results anyways:

To me, the higher the I/O Burst Size, the better the performance.

Again, the caveat is that I do not know what the impact of a retry is, but if reading from the disk and resending the packet takes ~1ms then I imagine the 'cost' of a retry is very low, even with the larger sizes. However, if your environment has longer disk reads, high latency, or a poor network with dropped or lost packets, then it's possible, I suppose, that a higher I/O Burst Size is not for you.

But I hope most PVS environments are better designed than that and you don't actually have to worry about it. 🙂


Let's Make PVS Target Device Booting Great Again (Part 1)

2016-12-30

Some discussions have swirled recently about implementing VDI. One of the challenges with VDI is slow boot times, which necessitates having machines pre-powered on: a pool of machines sits there consuming server resources until a logon request comes in and more machines are powered on to meet the demand… But what if your boot time were measured in seconds? Something so low you could keep the 'pool' of standby machines at 1 or 2, or even none!

I’m interested in investigating if this is possible.   I previously looked at this as a curiosity and achieved some good results:

 

However, that was a non-domain Server 2012 R2 fresh out of the box.  I tweaked my infrastructure a bit by storing the vDisk on a RAM Disk with Jumbo Frames (9k) to supercharge it somewhat.

Today, I’m going to investigate this again with PVS 7.12, UEFI, Windows 10, on a domain.  I’ll show how I investigated booting performance and see what we can do to improve it.

The first thing I’m going to do is install Windows 10, join it to the domain and create a vDisk.

Done. Because I don't have SCVMM set up in my home lab, I had to muck my way through enabling UEFI HDD boot. I went into the PVS folder (C:\ProgramData\Citrix\Provisioning Services) and copied BDMTemplate_uefi.vhd to my Hyper-V target device folder.

I then edited my Hyper-V Target Device (Gen2) and added the VHD:

I then mounted the VHD and modified the PVSBOOT.INI file so it pointed to my PVS server:

 

 

I then created my target device in the PVS console:

 

And voilà! It booted.

 

And out of the gate we are getting 8-second boot times. At this point I don't have it set up with a RAM drive or anything, so this is pretty stock, albeit on really fast hardware. My throughput is crushing my previous speed record, so if I can reduce the number of bytes read (it's literally bytes read / time = throughput) I can improve my boot time. On the flip side, I can try to increase my throughput, but that's a bit harder.

However, there are some tricks I can try.

My network supports Jumbo Frames end to end. At this stage I do not have them set for PVS or the target device, but we can enable them to see if it helps.

To verify their operation I’m going to trace the boot operation from the PVS server using procmon:

We can clearly see the UDP packet size is capping out at 1464 bytes, making it 1464 + 8-byte UDP header + 20-byte IP header = 1492 bytes. So I enabled Jumbo Frames.
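As a quick sanity check on those sizes, the same framing math applied to the standard and jumbo payloads observed in procmon lines up with the 1506 and 9014 figures used later in this post (the 14-byte Ethernet header is my assumption):

```python
# UDP payload observed in procmon -> on-the-wire frame size.
# IP header = 20 bytes, UDP header = 8 bytes, Ethernet header assumed to be 14 bytes.
def frame_size(udp_payload, eth_header=14):
    ip_packet = udp_payload + 8 + 20
    return ip_packet, ip_packet + eth_header

for payload in (1464, 8972):
    ip_packet, eth_frame = frame_size(payload)
    print(f"{payload} byte payload -> {ip_packet} byte IP packet, {eth_frame} byte frame")
# 1464 -> 1492 / 1506, and 8972 -> 9000 / 9014, matching the MTU figures quoted here.
```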

Under Server Properties in the PVS console I adjusted the MTU to match the NIC:

 

You then need to restart the PVS services for it to take effect.

I then made a new vDisk version and enabled Jumbo Frames in the OS of the target device.  I did a quick ping test to validate that Jumbo Frames are passing correctly.
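For reference, the kind of ping test meant here is a don't-fragment ping sized just under the jumbo payload limit, aimed at the PVS server; this sketch just wraps the standard Windows ping command (the server address is a placeholder):

```python
import subprocess

# A don't-fragment ping close to the jumbo payload limit: 8972 bytes of ICMP data
# plus 28 bytes of ICMP/IP headers fills a 9000-byte MTU. The address below is a
# placeholder for your PVS server.
PVS_SERVER = "192.168.1.1"
result = subprocess.run(
    ["ping", "-f", "-l", "8972", "-n", "2", PVS_SERVER],  # -f = don't fragment, -l = payload size
    capture_output=True, text=True,
)
print(result.stdout)
# If jumbo frames are not passing end to end, you'll typically see
# "Packet needs to be fragmented but DF set." instead of normal replies.
```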

I then started procmon on the PVS server, set the target device to boot…

and…

 

1464-byte UDP packets. A little smaller than the ~9000 bytes they're supposed to be. Scrolling down a little further, however, shows:

 

Notice the number of UDP packets sent at the smaller frame size?

 

Approximately 24 packets until it gets a “Receive” notification to send the next batch of packets.  These 24 packets account for ~34,112 bytes of data per sequence.  Total time for each batch of packets is 4-6ms.

If we follow through to when the jumbo frames kick in we see the following:

This is a bit harder to read because the MIO (Multiple Input Output) kicks in here and so there are actually two threads executing the read operations as opposed to the single thread above.

Regardless, I think I've hit on a portion that is executing more-or-less sequentially. The total amount of data being passed in these sequences is ~32,992 bytes, but the time to execute on them is 1-2ms! We have essentially cut our per-batch latency in half, if not better.
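Putting rough numbers on that: using the batch sizes and timings described above, the per-batch rate works out to something like this (very approximate, given the millisecond-level hand timing):

```python
# Rough per-batch throughput from the two sequences described above.
# Standard MTU: ~34,112 bytes in 4-6 ms; Jumbo frames: ~32,992 bytes in 1-2 ms.
batches = {
    "standard MTU": (34112, (0.004, 0.006)),
    "jumbo frames": (32992, (0.001, 0.002)),
}

for name, (nbytes, (t_low, t_high)) in batches.items():
    print(f"{name}: {nbytes / t_high / 1e6:.1f} - {nbytes / t_low / 1e6:.1f} MB/s per batch")
# standard MTU: ~5.7 - 8.5 MB/s per batch; jumbo frames: ~16.5 - 33.0 MB/s per batch.
```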

So why is the data being sent like this?  Again, procmon brings some visibility here:

Each "UDP Receive" packet is a validation that the data received was good, and it instructs the Stream Process to read and send the next portion of the file on the disk. If we move to the jumbo frame portion of the boot process, we can see the IO sizes and read locations go all over the place:

So, again, jumbo frames are a big help here, as all requests under 8K can be serviced in 1 packet, and there are usually MORE requests under 8K than above. Fortunately, Procmon can give us some numbers to illustrate this. I started and stopped the procmon trace for a run of a network boot with Jumbo Frames and one without:

Standard MTU (1506)

 

Jumbo Frame MTU (9014)

 

The numbers we are really after are the ones for 192.168.1.88:6905. The total number of events is solidly cut in half, with the number of sends about a third less! And it was fast enough to process double the amount of data, in both bytes sent to the target device and bytes received from the target device!

Does this help our throughput?  Yes, it does:

 

“But Trentent!  That doesn’t show the massive gains you are spewing!  It’s only 4MB/s more in Through-put!”

And you are correct. So why aren't we seeing more gains? The issue lies with how PVS boots: it boots in two stages. If you are familiar with PVS on Hyper-V from a year ago or more, you are probably aware of this issue. Essentially, PVS breaks the boot into a first stage (the bootloader stage) which runs in a lower-performance mode (standard MTU). Once the BNIStack loads, it kicks into Jumbo Packet mode along with the loading of the Synthetic NIC driver. The benefit from Jumbo Frames doesn't occur until this second stage. So when do Jumbo Frames kick in? You can see it in Event Viewer.

From everything I see with Procmon, first-stage boot ends at that first Ntfs event. So out of the original 8 seconds, 4 are spent on first-stage boot where Jumbo Packets are not enabled; everything after that point benefits. So for the remaining 4 seconds of boot, bringing that down by a second is a 25% improvement! Not small potatoes.

I intend to do more investigation into what I can do to improve boot performance for PVS target devices so stay tuned!  🙂
