2016

Printers and their impact on logon duration

2019-03-05
/ /
in Blog
/

40 second logons.  100 second logons.  200 second logons.  

Our users had an average logon time of 8-11 seconds.  But some users were hitting 40-200 seconds.  Users were frustrated and calling the help desk.

Why?

I did not want to individually examine all of the users whom had a long logon duration.  There was dozens of dozens of users, but after sampling 5 of them, the root cause was consistent.  Printers were causing long logons.  

What I found was users were encountering the long logons on specific workstations.  Further investigation revealed these workstations had globally mapped printers.  Some workstations had dozens of printers.  Some of the printers were long gone, having been moved/re-IP’ed, shutdown, or re-provisioned and the entry in the workstation not updated.  Citrix, depending on Receiver version, attempts to do a check to validate the printer’s operation before mapping and this would wait for a network timeout (or a crash of Receiver).  When I stopped the print spooler on the local workstation, logons came back down to the proper range.

What I sought to do was create something to make identifying this root cause easier, quicker, and with more specificity — in other words was there a particular printer causing the delay?

Citrix Printing Methods

Citrix has two methods of printing.  “Direct connection” and local printing.  Direct connection has the potential for larger impact on logon duration as it executes additional steps and actions on the Citrix server.

To enable “Direct Connection”, Citrix has a policy “Direct connection to print servers”.  By default, this policy is enabled.

With this policy enabled, it changes the behavior of how Citrix connects printers you have mapped locally as network printers.  Citrix has a diagram here:

But I feel this is missing temporal information.  I’ve created a video to highlight the steps.

Direct Connection Process

Why is this important?

Citrix has an option that is an absolute requirement for some applications.  
“Wait for printers to be created (server desktop)” (Virtual Apps and Desktops 7.x) or “Start this application without waiting for printers to be created. (Unchecked)” (Citrix XenApp 6.5).

This policy needs to be enabled for numerous applications that require a specific printer is present before the application starts.  This is usually due to applications that have pre-defined printers like label printers and do a check on app launch.  If you have an application like that you probably  have this feature enabled.

Citrix has many options for managing printers and they can impact your logon process.

Focusing on Direct Connection being enabled, if “Wait for printers to be created (server desktop)” policy is also enabled, care MUST be taken to minimize logon times.

Direct connection does something that can have a very adverse affect on logon time that is enabled by default.  When it connects directly to the print server it will check and install the print driver on the Citrix server.  The installation activity is visible via process monitor.

The DrvInst.exe process is the installation of printer drivers

The installation of printer driver, by default, will only occur if the driver is inbox or prestaged on the Citrix server.  But the check to see if a print driver needs to be installed occurs every time a session is created.  This behavior can be changed by policy.

Why does this impact logon times?  Driver installation can take a while, especially if there is communication issues between the Citrix server and print server.  Or if the print server is under stress and just simply slow to respond.  Or if the driver package that needs to be loaded is especially large, has lots of forms or page formats, multiple trays, etc.  This all adds up.

How can I tell what the impact of printers might be on Logon Duration?

I’ve been working on updating a Script-Based Action (SBA) originally posted by Guy Leech on the Citrix blogs.  This enhancement to the SBA will output more information breaking down what’s consuming your logon times.  Specifically, this tackles how long printers take in the ControlUp column: “Logon Duration – Other”

I’ve created a video showing how to enable the additional logging and how to run the new Script Based Action and what the output looks like:

What’s new in the output?

For the individual printers, direct connection printers now show two lines, the Driver load time and Printer load/connection time. 

The driver load time is titled with Driver and then the UNC path to the printer. 

The Printer and UNC path shows the the actual connection, printer configuration and establishment time.  All Direct Connection printers will be shown with UNC paths.

Citrix Universal Print Driver (UPD) mapped printers using the Citrix Universal Print Driver (print jobs that get sent back to the client for processing) are shown with their “Friendly Name” and do not have a matching “Driver” component as no driver loading is required.  In the example output above, “\\printsrv\HP Color LaserJet 5550” is a direct connection printer and “Zebra R110Xi HF (300 dpi)” is a Citrix UPD printer.

Here is an example of the full output:

The “Connect to Printers” is a sub-phase under “Pre-Shell (Userinit)“.  Each “Printer:” and “Driver:” is a sub-phase under “Connect to Printers“.  In this example screenshot, “Pre-Shell (Userinit)” was consuming 117.6 seconds of a 133.2 second logon duration.  Prior to this script that was all that was known.  Now we can see that connecting to printers consumes 97.7 of those 117.6 seconds!

Awesome!  What’s the catch?

In order to generate this information, some more verbose logging is required on the Citrix server side.   I specifically worked to ensure that no 3rd party utilities would be required so we simply need to enable 4 additional features.

Command Line Audit
PrintServer/Operational Logs
Audit Process Creation/Termination

I’ve written a batch script to enable these features:

Easy!

Wait a minute!  Windows Server 2008R2 doesn’t have Command Line Auditing!

Good catch!  Windows Server 2008R2 doesn’t have command line auditing so it will miss the driver load information.  There is no way around this using native tools, but modification could be done to use a 3rd party tool like sysmon.  The output on a 2008R2 system would show the following (with all other features enabled).

Awesome!  Let me at this script!

This script is available on the ControlUp site here.  If you have ControlUp you can run simply add it via the “Script Based Actions” button and further derive more information on your users logon performance!  

Last Gotcha

In some instances the “Connect to Printers phase” may start before “Pre-Shell”. This has been observed and is actually happening! The operating system is attempting to start *some* printer events asynchronously in certain situations and so the connect to Printers may start before userinit, but overall connecting to printers may still be blocking logons while it trys and completes.

Steve Elgan kindly pointed out that it’s important you size your event logs to ensure the data is present for the users whom you want to monitor. If your event logs are sized too small than the data being queried by the script won’t be present. So ensure you have your Security and the Printer event logs sized large enough to capture all of this data! For an idea of scale, you can look at your log size and the oldest entry there. If your log is 1MB in size and your oldest entry is from an hour ago, than sizing your log to 24MB should capture about a day’s worth of information.

Hope this helps you reduce your long logon durations!

Read More

Meltdown + Spectre – Performance Impact Analysis (Take 2)

2018-09-14
/ /
in Blog
/

I want to start this off by thanking ControlUp, LoginVSI and Ryan Ververs-Bijkerk for his assistance in helping me with this post.

Building on my last evaluation of the performance impact of Meltdown and Spectre, I was graciously given a trial of the LoginVSI software product used to simulate user loads and ControlUp’s cloud based analytic tool, ControlUp Insights.

This analysis takes into consideration the differences of operating systems and uses the latest Dell server hardware platform with a Intel Gold 6150 as the processor at its heart.  This processor contains the full PCID instructions to give the best performance (as of today) with mitigations enabled.  However, only Server 2012R2 and Server 2016 can take advantage of these hardware features.

Test Setup

This test was setup on 4 hosts.  2 of the hosts had VM’s where all the mitigations were enabled and 2 hosts had all of the mitigation features disabled.  I tested live production workloads and simulated user loads from LoginVSI.  The live production workloads were run on XenApp 6.5 on 2008R2 and the simulated workloads were on XenApp 7.15CU2 with 2008R2, 2012R2 and 2016.

Odd host numbers had the mitigation disabled, even host number had the mitigation enabled.

I sorted my testing logically in ControlUp by folder.

Real World Production results

The ControlUp Insights cloud product produced graphs and results that were easy and quick to interpret.  These results are for XenApp 6.5, Server 2008R2.

Hosts View

CPU Utilization

The mitigation-disabled “Host 1” had a higher consistency of using less CPU than the mitigation-enabled “Host 2”. The biggest spread in CPU was ~20% on the Intel Gold 6150’s with mitigation enabled to disabled.

IO Utilization

Another interesting result was IO utilization increased by an average of 100 IOPS for mitigation-enabled VM’s.  This means that Meltdown/Spectre also tax the storage subsystem more.  This averaged out to a consistent 12% hit in performance.

User Experience

The logon duration of the VM’s increased 43% from an average of 8 seconds on a mitigation-disabled VM to 14 seconds on a mitigation enabled VM.  The biggest jump in the sub-metrics were Logon Duration (Other) and Group Policy time going from 3-5s and 6-8s.

Mitigation Disabled

Mitigation Enabled

For applications we have that measure “user interactivity” the reduction in the user experience was 18%.  This meant that an action on a mitigation-enabled VM took an average of 1180ms vs 990ms on a mitigation-disabled VM when measuring actions within the UI.

Honestly, I wish I had ControlUp Insights earlier when I did my original piece, it provides much more detail in terms of tracking additional metrics and presents it much more cleanly that I did.  Also, when the information was available it was super quick and fast to look and compare the various types of results.

Simulated Results

LoginVSI was gracious enough to grant me licenses to their software for this testing.  Their software provides simulated user actions including pauses for coffee and chatting between work like typing/sending emails, reading word or PDF documents, or reviewing presentations.  The suite of software tested by the users tends to be major applications produced by major corporations who have experience producing software.  It is not representative of applications that could be impacted the most by Spectre/Meltdown (generally, applications that are registry event heavy).  Regardless, it is interesting to test with these simulated users as the workload produced by them do fall under the spectrum of “real world”.  As with everything, your mileage will vary and it is important to test and record your before and after impacts.  ControlUp with Insights does an incredible job of this and you can easily compare different periods in time to measure the impacts, or just rely on the machine learning suggestions of their virtual experts to properly size your environment.

Since our production workloads are Windows Server 2008R2 based, I took advantage of the LoginVSI license to test all three available server operating systems: 2008R2, 2012R2, and 2016.  Since newer operating systems are supposed to enable performance enhancing hardware features that can reduce the impact of these vulnerabilities, I was curious to *how much*.  I have this information now. 

I tested user loads of 500, 300, and 100 users across 2 hosts.  I tested one with Spectre/Meltdown applied and one without.  Each host ran 15VM’s with each VM having 30GB RAM and 6vCPU for an CPU oversubscription of 2.5:1.  The host spec was a Dell PowerEdge M640 with Intel 6150 Gold processors and 512GB of memory.

2016 – Hosts View

500 LoginVSI users with this workload, on Server 2016 pegged the hosts CPU to 100% on both the Meltdown/Spectre enabled and disabled hosts.  We can still see the gap between the CPU utilization between the two with the Meltdown Spectre hosts

300 LoginVSI users with this workload, on Server 2016 we see the gap is narrow but still visible.

100 LoginVSI users with this workload, on Server 2016 we see the gap is barely visible, it looks even.

2012R2 – Hosts View

500 LoginVSI users in this workload on Server 2012 R2.  There definitely appears to be a much larger gap between the meltdown enabled and disabled hosts.  And Server 2012R2 non-mitigated doesn’t cap out like Server 2016.

300 LoginVSI users in this workload on Server 2012R2.  The separation between enabled and disabled is still very prominent.

100 LoginVSI users in this workload on Server 2012R2.  Again, the separation is noticeable but appears narrower with lighter loads.

2008R2 – Hosts View

500 LoginVSI users in this workload on Server 2008R2.   Noticeable additional CPU load with the Meltdown/Spectre host.  A more interesting thing is it apperas overall CPU utilization is lower than 2012R2 or 2016.

300 LoginVSI users in this workload on Server 2008R2.  The separation between enabled and disabled is still very prominent.

100 LoginVSI users in this workload on Server 2008R2.  I only captured one run and the low utilization makes the difference barely noticeable.

Some interesting results for sure.  I took the data and put it into a pivot table to highlight the CPU differences for each workload against each operating system.

This chart hightlights the difference in CPU percentage between mitigation enabled and disabled systems.  The raw data:

Again, interesting results.  2008R2 seems to have the largest average CPU seperation, hitting 14%, followed by 2012R2 at 11% and than 2016 having a difference of 4%.

One of things about these results is that they highlight the “headroom” of the operating systems.  2008R2 actually consumes less CPU and so it has more room for separation between the 3 tiers.  On the 2016, there is so much time spent where the CPU was pegged at 100% for both types of host that makes a difference of “0%”.  So although the smaller number on server 2016 may lead you to believe it’s better, it’s actually not.

This shows it a little more clear.  With mitigations *enabled*, Server 2008R2 can do 500 users at less average CPU load than 2016 can do 300 users with mitigations *disabled*.

From the get-go, Server 2016 appears to consume 2x more CPU than Server 2008R2 in all non-capped scenarios with Server 2012 somewhere in between.

When we compare the operating systems against the different user counts we see the impact the operating system choice has on resources. 

Mitigation Disabled:

100 Users
300 Users
500 Users

Mitigation Enabled:

100 Users
300 Users
500 Users

Final Word

Microsoft stated that they expected to see less of an impact of the Spectre/Meltdown mitigations with newer operating systems.  Indeed this does turn out to be the case.  However, the additional resource cost of newer operating systems is actually *more* than running 2008R2 or 2012R2 with mitigations enabled .  So if you’re environment is sized for running Server 2016, you probably have infrastructure that has already been spec’ed for the much heavier OS anyways.  If your infrastructure has been spec’ed for the older OS than you will see a larger impact.  However, if you’ve spec’ed for the larger OS (say for a migration activity) but are running your older OS’s on that hardware, you will see an impact but it will be less than when you go live with 2016.

Previously I had stated that there are two different, important performance mechanisms to consider; capacity and how fast work actually gets done.  All of these simulated measurements are about capacity.  I hope to see how speed is impacted between the OS’s, but that may have to wait for a future posting. 

Tabulating the LoginVSI simulated results without ControlUp Insights has taken me weeks to properly format the results.  I was able to use a trial of ControlUp Insights to look at the real world impact of our existing applications and workloads.  If my organization ever purchases Insights I would have had this post up a long time ago with even more data, looking at things like the storage subsystems.  Hopefully we acquire this product in the future and if you want to save yourself and your organization time, energy and effort getting precise, accurate data that can be compared against scenarios you create: get ControlUp Insights.


I’m going to harp on this, but YOUR WORKLOAD MATTERS MORE than these simulated results.  During this exercise, I was able to determine with ControlUp Insights that one of our applications is so light that we can host 1,000 users from a single host where that host struggled with 200 LoginVSI users.  So WORKLOAD MATTERS.  Just something to keep in mind when reviewing these results.  LoginVSI produces results that can serve as a proxy for what you can expect if you can properly relate these results to your existing workload.  LoginVSI also offers the capability to produce custom workloads tailored to your specific environment or applications so you can gauge impact with much more precision. 

Read More