ControlUp

Printers and their impact on logon duration

2019-03-05
/ /
in Blog
/

40 second logons.  100 second logons.  200 second logons.  

Our users had an average logon time of 8-11 seconds.  But some users were hitting 40-200 seconds.  Users were frustrated and calling the help desk.

Why?

I did not want to individually examine all of the users whom had a long logon duration.  There was dozens of dozens of users, but after sampling 5 of them, the root cause was consistent.  Printers were causing long logons.  

What I found was users were encountering the long logons on specific workstations.  Further investigation revealed these workstations had globally mapped printers.  Some workstations had dozens of printers.  Some of the printers were long gone, having been moved/re-IP’ed, shutdown, or re-provisioned and the entry in the workstation not updated.  Citrix, depending on Receiver version, attempts to do a check to validate the printer’s operation before mapping and this would wait for a network timeout (or a crash of Receiver).  When I stopped the print spooler on the local workstation, logons came back down to the proper range.

What I sought to do was create something to make identifying this root cause easier, quicker, and with more specificity — in other words was there a particular printer causing the delay?

Citrix Printing Methods

Citrix has two methods of printing.  “Direct connection” and local printing.  Direct connection has the potential for larger impact on logon duration as it executes additional steps and actions on the Citrix server.

To enable “Direct Connection”, Citrix has a policy “Direct connection to print servers”.  By default, this policy is enabled.

With this policy enabled, it changes the behavior of how Citrix connects printers you have mapped locally as network printers.  Citrix has a diagram here:

But I feel this is missing temporal information.  I’ve created a video to highlight the steps.

Direct Connection Process

Why is this important?

Citrix has an option that is an absolute requirement for some applications.  
“Wait for printers to be created (server desktop)” (Virtual Apps and Desktops 7.x) or “Start this application without waiting for printers to be created. (Unchecked)” (Citrix XenApp 6.5).

This policy needs to be enabled for numerous applications that require a specific printer is present before the application starts.  This is usually due to applications that have pre-defined printers like label printers and do a check on app launch.  If you have an application like that you probably  have this feature enabled.

Citrix has many options for managing printers and they can impact your logon process.

Focusing on Direct Connection being enabled, if “Wait for printers to be created (server desktop)” policy is also enabled, care MUST be taken to minimize logon times.

Direct connection does something that can have a very adverse affect on logon time that is enabled by default.  When it connects directly to the print server it will check and install the print driver on the Citrix server.  The installation activity is visible via process monitor.

The DrvInst.exe process is the installation of printer drivers

The installation of printer driver, by default, will only occur if the driver is inbox or prestaged on the Citrix server.  But the check to see if a print driver needs to be installed occurs every time a session is created.  This behavior can be changed by policy.

Why does this impact logon times?  Driver installation can take a while, especially if there is communication issues between the Citrix server and print server.  Or if the print server is under stress and just simply slow to respond.  Or if the driver package that needs to be loaded is especially large, has lots of forms or page formats, multiple trays, etc.  This all adds up.

How can I tell what the impact of printers might be on Logon Duration?

I’ve been working on updating a Script-Based Action (SBA) originally posted by Guy Leech on the Citrix blogs.  This enhancement to the SBA will output more information breaking down what’s consuming your logon times.  Specifically, this tackles how long printers take in the ControlUp column: “Logon Duration – Other”

I’ve created a video showing how to enable the additional logging and how to run the new Script Based Action and what the output looks like:

What’s new in the output?

For the individual printers, direct connection printers now show two lines, the Driver load time and Printer load/connection time. 

The driver load time is titled with Driver and then the UNC path to the printer. 

The Printer and UNC path shows the the actual connection, printer configuration and establishment time.  All Direct Connection printers will be shown with UNC paths.

Citrix Universal Print Driver (UPD) mapped printers using the Citrix Universal Print Driver (print jobs that get sent back to the client for processing) are shown with their “Friendly Name” and do not have a matching “Driver” component as no driver loading is required.  In the example output above, “\\printsrv\HP Color LaserJet 5550” is a direct connection printer and “Zebra R110Xi HF (300 dpi)” is a Citrix UPD printer.

Here is an example of the full output:

The “Connect to Printers” is a sub-phase under “Pre-Shell (Userinit)“.  Each “Printer:” and “Driver:” is a sub-phase under “Connect to Printers“.  In this example screenshot, “Pre-Shell (Userinit)” was consuming 117.6 seconds of a 133.2 second logon duration.  Prior to this script that was all that was known.  Now we can see that connecting to printers consumes 97.7 of those 117.6 seconds!

Awesome!  What’s the catch?

In order to generate this information, some more verbose logging is required on the Citrix server side.   I specifically worked to ensure that no 3rd party utilities would be required so we simply need to enable 4 additional features.

Command Line Audit
PrintServer/Operational Logs
Audit Process Creation/Termination

I’ve written a batch script to enable these features:

Easy!

Wait a minute!  Windows Server 2008R2 doesn’t have Command Line Auditing!

Good catch!  Windows Server 2008R2 doesn’t have command line auditing so it will miss the driver load information.  There is no way around this using native tools, but modification could be done to use a 3rd party tool like sysmon.  The output on a 2008R2 system would show the following (with all other features enabled).

Awesome!  Let me at this script!

This script is available on the ControlUp site here.  If you have ControlUp you can run simply add it via the “Script Based Actions” button and further derive more information on your users logon performance!  

Last Gotcha

In some instances the “Connect to Printers phase” may start before “Pre-Shell”. This has been observed and is actually happening! The operating system is attempting to start *some* printer events asynchronously in certain situations and so the connect to Printers may start before userinit, but overall connecting to printers may still be blocking logons while it trys and completes.

Steve Elgan kindly pointed out that it’s important you size your event logs to ensure the data is present for the users whom you want to monitor. If your event logs are sized too small than the data being queried by the script won’t be present. So ensure you have your Security and the Printer event logs sized large enough to capture all of this data! For an idea of scale, you can look at your log size and the oldest entry there. If your log is 1MB in size and your oldest entry is from an hour ago, than sizing your log to 24MB should capture about a day’s worth of information.

Hope this helps you reduce your long logon durations!

Read More

Citrix Logon Simulator’s – Part 2

2019-02-06
/ /
in Blog
/

In my previous post I was looking at utilizing a Logon Simulator to setup some proactive monitoring of a Citrix environment.  I setup some goals for myself:

  1. Minimize the number of VM’s to run the robots
  2. As little resource consumption as possible
  3. Still provide operational alerts
  4. Operate on-premise

I want the footprint of these robots to be tiny.  This must done on Server Core.

I want to run multiple instances of the logon simulator concurrently.

I need to be able to test “Stores” that do not have “Receiver for Web” sites enabled.

I want it so if I reboot the robot he picks up and starts running.

The Choice.

In order to successfully hit these specified targets I opt’ed to use ControlUp’s Logon Simulator.  It can target the Store Service so it works with our “Receiver for Web”-less stores.  It also has the features to generate events that can be targeted to send out notifications of an application launch failure.

The Setup

In order to achieve my goals I need the following:

  • A Service Account that will be logging onto the Citrix servers
  • The robot (Windows 2019 Server Core)

I installed Server Core 2019 and added it to the domain.

Configure Autologon

I configured group policy preferences to setup AutoLogon for my service account.  This group policy object is set to the OU the robots reside in.

Group Policy Preferences settings to configure Autologon for our service account.

However, I did not include the required “DefaultPassword” registry with the password.  In order to embed the password in a more secure fashion, I had to manually use Sysinternals AutoLogons.  This keeps the password from being stored in plain text in the registry but this does need to be manually executed on each robot.

Configuring Autologon

The account MUST be a regular user and not a member of the “Administrators” group.  This is a requirement of the ControlUp Logon Simulator.

Prerequisites gotcha’s

Because I selected to use pure Server Core, there are some components that require fixing for full compatibility.  This can actually be alleviated immediately by installing the Feature On Demand (FoD) “Server App Compatibility”, but this would increase both memory utilization and consume more disk space for our robot.  However, if you prefer the easy way out, adding the FoD fixes everything and you can skip the “Fixes” section.  Or just run the Logon Simulator on a operating system with the desktop experience.  Otherwise, follow the steps in each of the solutions.

Fixes

Unable to install Citrix Reciever/Workspace App

“TrolleyExpress.exe – System Error”
The code execution cannot proceed because oledlg.dll was not found.  Reinstalling the program may fix this problem.

Solution

Copy oledlg.dll from the SysWow64 in either the install.wim or another “Windows Server 2019 (Desktop Experience)” install and put it in the C:\Windows\SysWow64 folder of your robot.

 

ControlUp Logon Simlulator is cropped on smaller displays

Solution

Set the resolution larger in your VM.

 

ControlUp Logon Simulator errors when you attempt to save the configuration

“The logon simulator failed for the following reason: Creating an instance of the COM component with CLSID {C0B4E2F3-BA21-4773-8DBA-335EC946EB8B} from the IClassFactory failed due to the following error: 80040111 ClassFactory cannot supply requested class (Exception from HRESULT: 0x80040111 (CLASS_E_CLASSNOTAVAILABLE)).  An application event log has been written to handle this crash.”

Solution

Copy ExplorerFrame.dll from the SysWow64 in either the install.wim or another “Windows Server 2019 (Desktop Experience)” install and put it in the C:\Windows\SysWow64 folder of your robot.  Add the following registry:

ControlUp Logon Simulator detects admin rights

Admin rights detected The logon simulator should not be run as an administrator, please restart the app as a standard user.

Solution

Run the logon simulator as a standard user.


Configuration

Once you’ve implemented all the fixes, install Citrix Workspace App and ControlUp Logon Simulator with an account that is an administrator.

Configure ControlUpLogonSim.  With the simulator open, enter your Storefront details, ensuring to use the “Store” account as seen in the Storefront console.

 

 

For the “Resource to Launch” ensure the name matches the display name in Storefront:

 

In order to avoid session stealing in the simulator, each application will require a unique user name.  Setup a unique account per application you are going to test.

 

From here, enter your logon credentials for the account associated with the application.  Run your first test by clicking the green triangle and ensure it works correctly.

 

Now that we have a successful run we set “Repeat Test” to ON and save the configuration.

I then created another application to monitor by renaming the “Resource to Launch” as another application and saved a second configuration.  I saved all my files to a C:\Swinst folder.

 

The point of all of this is to ensure the simulator is running in an automated fashion.  To do so, we need to be able to configure the simulator to “launch” multiple different applications when the operating system starts.  We have already configured autologon, we’ve setup our configuration files for each application we want to monitor, now we need to set the monitor to auto-start.

Add the following registry key:

And create a file “C:\Swinst\StartAppMonitors.cmd” with the following contents:

And watch the magic fly!

And so the final question and the point of all this work, how much does this consume for our resources?

 

1.1GB of RAM for the ENTIRE system, a peak CPU consumption of 7%, and the processes required to do the monitor use no CPU and only ~55MB of RAM.  Each Citrix process consumes ~20MB of memory and is the most significant consumer of CPU but at the single digit % range.

I anticipate doing some more stress testing to determine what the maximum amount of monitors I can get on one system, but I’m thrilled with these results.  With this one box I would expect to be able to monitor dozens of application…  Maybe a hundred?

In the end, this was a fair bit of work to get this setup on Server Core, but I do believe the savings in resource consumption and overhead reduction will pay off.

Read More

Meltdown + Spectre – Performance Impact Analysis (Take 2)

2018-09-14
/ /
in Blog
/

I want to start this off by thanking ControlUp, LoginVSI and Ryan Ververs-Bijkerk for his assistance in helping me with this post.

Building on my last evaluation of the performance impact of Meltdown and Spectre, I was graciously given a trial of the LoginVSI software product used to simulate user loads and ControlUp’s cloud based analytic tool, ControlUp Insights.

This analysis takes into consideration the differences of operating systems and uses the latest Dell server hardware platform with a Intel Gold 6150 as the processor at its heart.  This processor contains the full PCID instructions to give the best performance (as of today) with mitigations enabled.  However, only Server 2012R2 and Server 2016 can take advantage of these hardware features.

Test Setup

This test was setup on 4 hosts.  2 of the hosts had VM’s where all the mitigations were enabled and 2 hosts had all of the mitigation features disabled.  I tested live production workloads and simulated user loads from LoginVSI.  The live production workloads were run on XenApp 6.5 on 2008R2 and the simulated workloads were on XenApp 7.15CU2 with 2008R2, 2012R2 and 2016.

Odd host numbers had the mitigation disabled, even host number had the mitigation enabled.

I sorted my testing logically in ControlUp by folder.

Real World Production results

The ControlUp Insights cloud product produced graphs and results that were easy and quick to interpret.  These results are for XenApp 6.5, Server 2008R2.

Hosts View

CPU Utilization

The mitigation-disabled “Host 1” had a higher consistency of using less CPU than the mitigation-enabled “Host 2”. The biggest spread in CPU was ~20% on the Intel Gold 6150’s with mitigation enabled to disabled.

IO Utilization

Another interesting result was IO utilization increased by an average of 100 IOPS for mitigation-enabled VM’s.  This means that Meltdown/Spectre also tax the storage subsystem more.  This averaged out to a consistent 12% hit in performance.

User Experience

The logon duration of the VM’s increased 43% from an average of 8 seconds on a mitigation-disabled VM to 14 seconds on a mitigation enabled VM.  The biggest jump in the sub-metrics were Logon Duration (Other) and Group Policy time going from 3-5s and 6-8s.

Mitigation Disabled

Mitigation Enabled

For applications we have that measure “user interactivity” the reduction in the user experience was 18%.  This meant that an action on a mitigation-enabled VM took an average of 1180ms vs 990ms on a mitigation-disabled VM when measuring actions within the UI.

Honestly, I wish I had ControlUp Insights earlier when I did my original piece, it provides much more detail in terms of tracking additional metrics and presents it much more cleanly that I did.  Also, when the information was available it was super quick and fast to look and compare the various types of results.

Simulated Results

LoginVSI was gracious enough to grant me licenses to their software for this testing.  Their software provides simulated user actions including pauses for coffee and chatting between work like typing/sending emails, reading word or PDF documents, or reviewing presentations.  The suite of software tested by the users tends to be major applications produced by major corporations who have experience producing software.  It is not representative of applications that could be impacted the most by Spectre/Meltdown (generally, applications that are registry event heavy).  Regardless, it is interesting to test with these simulated users as the workload produced by them do fall under the spectrum of “real world”.  As with everything, your mileage will vary and it is important to test and record your before and after impacts.  ControlUp with Insights does an incredible job of this and you can easily compare different periods in time to measure the impacts, or just rely on the machine learning suggestions of their virtual experts to properly size your environment.

Since our production workloads are Windows Server 2008R2 based, I took advantage of the LoginVSI license to test all three available server operating systems: 2008R2, 2012R2, and 2016.  Since newer operating systems are supposed to enable performance enhancing hardware features that can reduce the impact of these vulnerabilities, I was curious to *how much*.  I have this information now. 

I tested user loads of 500, 300, and 100 users across 2 hosts.  I tested one with Spectre/Meltdown applied and one without.  Each host ran 15VM’s with each VM having 30GB RAM and 6vCPU for an CPU oversubscription of 2.5:1.  The host spec was a Dell PowerEdge M640 with Intel 6150 Gold processors and 512GB of memory.

2016 – Hosts View

500 LoginVSI users with this workload, on Server 2016 pegged the hosts CPU to 100% on both the Meltdown/Spectre enabled and disabled hosts.  We can still see the gap between the CPU utilization between the two with the Meltdown Spectre hosts

300 LoginVSI users with this workload, on Server 2016 we see the gap is narrow but still visible.

100 LoginVSI users with this workload, on Server 2016 we see the gap is barely visible, it looks even.

2012R2 – Hosts View

500 LoginVSI users in this workload on Server 2012 R2.  There definitely appears to be a much larger gap between the meltdown enabled and disabled hosts.  And Server 2012R2 non-mitigated doesn’t cap out like Server 2016.

300 LoginVSI users in this workload on Server 2012R2.  The separation between enabled and disabled is still very prominent.

100 LoginVSI users in this workload on Server 2012R2.  Again, the separation is noticeable but appears narrower with lighter loads.

2008R2 – Hosts View

500 LoginVSI users in this workload on Server 2008R2.   Noticeable additional CPU load with the Meltdown/Spectre host.  A more interesting thing is it apperas overall CPU utilization is lower than 2012R2 or 2016.

300 LoginVSI users in this workload on Server 2008R2.  The separation between enabled and disabled is still very prominent.

100 LoginVSI users in this workload on Server 2008R2.  I only captured one run and the low utilization makes the difference barely noticeable.

Some interesting results for sure.  I took the data and put it into a pivot table to highlight the CPU differences for each workload against each operating system.

This chart hightlights the difference in CPU percentage between mitigation enabled and disabled systems.  The raw data:

Again, interesting results.  2008R2 seems to have the largest average CPU seperation, hitting 14%, followed by 2012R2 at 11% and than 2016 having a difference of 4%.

One of things about these results is that they highlight the “headroom” of the operating systems.  2008R2 actually consumes less CPU and so it has more room for separation between the 3 tiers.  On the 2016, there is so much time spent where the CPU was pegged at 100% for both types of host that makes a difference of “0%”.  So although the smaller number on server 2016 may lead you to believe it’s better, it’s actually not.

This shows it a little more clear.  With mitigations *enabled*, Server 2008R2 can do 500 users at less average CPU load than 2016 can do 300 users with mitigations *disabled*.

From the get-go, Server 2016 appears to consume 2x more CPU than Server 2008R2 in all non-capped scenarios with Server 2012 somewhere in between.

When we compare the operating systems against the different user counts we see the impact the operating system choice has on resources. 

Mitigation Disabled:

100 Users
300 Users
500 Users

Mitigation Enabled:

100 Users
300 Users
500 Users

Final Word

Microsoft stated that they expected to see less of an impact of the Spectre/Meltdown mitigations with newer operating systems.  Indeed this does turn out to be the case.  However, the additional resource cost of newer operating systems is actually *more* than running 2008R2 or 2012R2 with mitigations enabled .  So if you’re environment is sized for running Server 2016, you probably have infrastructure that has already been spec’ed for the much heavier OS anyways.  If your infrastructure has been spec’ed for the older OS than you will see a larger impact.  However, if you’ve spec’ed for the larger OS (say for a migration activity) but are running your older OS’s on that hardware, you will see an impact but it will be less than when you go live with 2016.

Previously I had stated that there are two different, important performance mechanisms to consider; capacity and how fast work actually gets done.  All of these simulated measurements are about capacity.  I hope to see how speed is impacted between the OS’s, but that may have to wait for a future posting. 

Tabulating the LoginVSI simulated results without ControlUp Insights has taken me weeks to properly format the results.  I was able to use a trial of ControlUp Insights to look at the real world impact of our existing applications and workloads.  If my organization ever purchases Insights I would have had this post up a long time ago with even more data, looking at things like the storage subsystems.  Hopefully we acquire this product in the future and if you want to save yourself and your organization time, energy and effort getting precise, accurate data that can be compared against scenarios you create: get ControlUp Insights.


I’m going to harp on this, but YOUR WORKLOAD MATTERS MORE than these simulated results.  During this exercise, I was able to determine with ControlUp Insights that one of our applications is so light that we can host 1,000 users from a single host where that host struggled with 200 LoginVSI users.  So WORKLOAD MATTERS.  Just something to keep in mind when reviewing these results.  LoginVSI produces results that can serve as a proxy for what you can expect if you can properly relate these results to your existing workload.  LoginVSI also offers the capability to produce custom workloads tailored to your specific environment or applications so you can gauge impact with much more precision. 

Read More

ControlUp – Realtime Troubleshooting Example

2017-04-13
/ /
in Blog
/

I’ve been using a troubleshooting tool for some time and I can’t speak highly enough about.  It has saved myself so much time in finding/resolving and remediating issues.  I decided to capture myself utilizing the tool *in realtime* to remediate a reported issue that came in during a weekend call.  I was able to find the servers that had the issues within 5 mins and prevent them from hosting users (effectively remediating the issue).

The root of the issue is we had some servers reboot during the daylight savings time change and this messed up some of our scripts that were relying on some timing to load our AppV packages.  After receiving the call I started my recording.

Youtube link.  Please ensure to select the highest quality.

Without ControlUp, trying to search all the servers was a manual task or a scripted task that was executed serially.  These tasks, in the past, would take tons of time.  Generally, an issue like this would have taken an hour or so to resolve.  With ControlUp this was resolved in less than 5 minutes.

Higher Quality Version here.

Read More

ControlUp – Script Based Action (SBA) Action Auditing

2017-01-30
/ /
in Blog
/

ControlUp is a tool we use to monitor our Citrix environment.  We have multiple people and multiple times actions are run via ControlUp and an easier way to review the actions would be nice.  ControlUp keeps all machine actions executed by them on the local machine’s event log.  To review these logs I decided, what better way than to use ControlUp!

The Script Based Action (SBA):

And the steps to create the SBA:

  1. Create a new SBA and name it “ControlUp Action Auditing” and click ‘Next’
  2. Set the ‘Assigned to:’ “Computer” and ‘Execution Context:’ as “ControlUp Console”
  3. Add the script and set the ‘Execution Timeout (seconds)’ to whatever will satisfy querying your remote systems (I set mine to 120).
  4. Setup the variables
    $args[0] = ‘Name’ property
    $args[1]: 
    (note the “pipe” symbol in the ‘Input Validation string’
    $args[2]: 
  5. Save and Finalize the SBA.  And now it in action:
Read More

Examining Logon Durations with Control Up – Profile Load Time

2016-09-15
/ /
in Blog
/

Touching on my last point with an invalid AD Home Directory attribute, I decided to examine it in more detail as what is causing the slowness on logon.

60s

52 seconds for Profile Load time.  This is caused by this:

bad_attribute

The Home Folder server ‘WSAPVSEQ07’ I created that share and set the home directory to it.  I then simulated a ‘server migration’ by shutting this box down.  This means that the server responds to pings because it’s still present in DNS.

ping

But why does it take so long?

If I do a packet capture on the server I’m trying to launch my application from, and set it to trace the ip.addr of the ‘powered off’ server, here’s what we see:

slow-logon-packet-capture

The logon process attempts to query the server at the ‘28.9’ second mark and stops at the ‘73.9’ second mark.  A total of 45 seconds.  This is all lumped into the ‘User Profile Load Time’.  So what’s happening here?

We can see the initial attempt to connection occurs at 28.9 seconds, then 31.9 seconds, then finally 37.9 seconds.  The time span is 3s between the first and second try then 6s between the second and third try.  This is explained here.

In Windows 7 and Windows Server 2008 R2, the TCP maximum SYN retransmission value is set to 2, and is not configurable. Because of the 3-second limit of the initial time-out value, the TCP three-way handshake is limited to a 21-second timeframe (3 seconds + 2*3 seconds + 4*3 seconds = 21 seconds).

Is this what we are seeing?  It’s close.  We are seeing the first two items for sure, but then in instead of a ‘3rd’ attempt, it starts over but at the same formula.

= 45 seconds.

According to the KB article, we are seeing the Max SYN retransmissions (2) for each syn sent.  This article contains a hotfix we can install to change the value of the Max SYN retransmissions, but it’s a minimum of 2 which it’s set to anyways.  However, there is an additional hotfix to modify the 3 second time period.

The minimum I’ve found is we can reduce the 3 second time period to 100ms.

values

This reduces the logon time to:

faster_logon

19 seconds.

What does this packet capture look like?  Like this:
lower_syn_values

Even with the ‘Initial RTO’ set to 100ms, Windows has a MinRTO value of 300ms:

minrto

After the initial ‘attempt’ there is a 10-12 second delay.

Setting the MinRTO to the minimum 20ms

minrto_20

Reduces our logon time further now because our SYN packets are now about 200ms apart:

minrto_200ms

We are now 16 seconds, 13 seconds spent on the profile upon which 12 seconds was timing out this connection.

16s-logon

 

Would you implement this?  I would strongly recommend against it.  SYN’s were designed so blips in the network can be overcome.  Unfortunately, I know of no way to get around the ‘Home Directory responds to DNS but not to ping’ timeout.  The following group policies have no effect:

gpo1

gpo2

 

And I suspect the reason it has no effect is because the server still responds to DNS so the SYN sequence takes place and blocks regardless of this GPO settings.  Undoubtedly, it’s because this comes into play:
gpo_explain

Since our user has a home directory the preference is to ‘always wait for the network to be initialized before logging the user on’.  There is no override.

Is there a way to detect a dead home directory server during a logon?  Outside of long logon’s I don’t see any event logging or anything that would point us to determine if this is the issue without having to do a packet capture in a isolated environment.

 

 

Read More

ControlUp – Dissecting Logon Times a step further (invalid Home Directory impact)

2016-09-07
/ /
in Blog
/

Continuing on from my previous post, we were still having certain users with logons in the dozens of seconds to minutes.  I wanted to find out why and see if there is anything further that could be done.

60second_profile

 

After identifying a user with a long logon with ControlUp I ran the ‘Analyze Logon Duration’ script:

51-1second_profile

 

Jeez, 59.4 seconds to logon with 51.2 seconds of that spent on the User Profile portion.  What is going on?  I turned to process monitor to capture the logon process:

screen-shot-2016-09-07-at-8-24-18-pm

Well, there appears to be a 1 minute gap between the cmd.exe command from when WinLogon.exe starts it.  The stage it ‘freezes’ at is “Please wait for the user profile service”.

 

Since there is no data recorded by Process Monitor I tested by deleting the users profile.  It made no difference, still 60 seconds.  But, since I now know it’s not the user profile it must be something else.  Experience has taught me to blame the user object and look for network paths.  50 seconds or so just *feels* like a network timeout.  So I examined the users AD object:

screen-shot-2016-09-07-at-8-46-19-pm

 

Well well well, we have a path.  Is it valid?

screen-shot-2016-09-07-at-8-49-42-pm

 

 

It is not valid.  So is my suspicion correct?  I removed the Home Directory path and relaunched:

without_homedir_logon_time

Well that’s much, much better!

So now I want ControlUp to identify this potential issue.  Unfortunately, I don’t really see any events I can key in on that says ‘Attempting to map home drive’.  But what we can do is pull that AD attribute and test to see if it’s valid and let us know if it’s not.  This is the output I now generate:

new_script

 

I revised the messaging slightly as I’ve found the ‘Group Policy’ phase can be affected if GPP Drive Maps reference the home directory attribute as well.

 

So I took my previous script and updated it further.  This time with a check for valid home directories.  I also added some window sizing information to give greater width for the response as ‘Interim Delay’ was getting truncated when there were long printer names.  Here is the further updated script:

Read More

ControlUp – Dissecting Logon Times a step further (Printer loading)

2016-08-31
/ /
in Blog
/

We have applications that require printers be loaded before the application is started.  This is usually because the application will check for a specific printer or default printer and if one is not set (because it hasn’t mapped into the session) then it’ll throw up a dialog or not start entirely.

So we have this value ‘unchecked’ for some applications:

Screen Shot 2016-08-31 at 12.12.08 AM

But how does this impact our logon times?

Well… Our organization just underwent a print server migration/upgrade where some print servers were decommissioned and don’t exist.  But some users still have them or references to them on their end points.  We do have policies that only map your default printer, but some users are on a policy to map ‘all’ printers they have on their system.

What’s the impact?

Screen Shot 2016-08-31 at 12.15.10 AM

Waiting for printers before starting the application…

 

Screen Shot 2016-08-31 at 12.26.50 AM

Without waiting for printers

16 Seconds?  How is that so?

Well, it turns out waiting for printers and the subsystem components to support them add a fair amount of time, and then worse is network printers that don’t go anywhere anymore.  I’ve seen these logons wait for connection before timing out, all the while the user sits there and waits.  The script that comes with ControlUp for analyzing logons is good, but I wanted to know more on why some systems had long logon times and the only clue was Pre-Shell (userinit) taking up all the time.  So I dug into the print logs and found a way to measure their impact.

Screen Shot 2016-08-31 at 12.32.05 AM

With my modified script we can clearly see waiting for the printers takes ~15.4s with a few printers over a few seconds and the rest at 0.5 seconds or so.  One thing about this process is that mapping printers is synchronous.  So when or if 1 stalls, the whole process gets stuck.  All my printers were local except for the ‘Generic / Text Only’ which was a network printer where I powered off the server.  It hung the longest at 5.9 seconds, but I’ve seen ‘non-existant’ network mapped printers hang for 150 seconds or so…

To facilitate finding the printers we need to pass the clientName to the server and the Print Service Logs need to be enabled.

You can enable the print service logs on server 2008R2 by executing the following:

The ControlUp arguments need to look like this now:

Screen Shot 2016-08-31 at 12.40.36 AM

Here is my updated script:

I hope to dig into other startup components and further drill down into what our user launch process looks like.  We wait, and we see 🙂

Read More

ControlUp – List AppV5 recent events on various servers

2016-08-26
/ /
in Blog
/

David Falkus just posted a blog post on using Powershell to combine multiple AppV5 logs into a single view and orders them chronologically so you can see the events as they occurred.

Since this was a PowerShell script we can use ControlUp to import it, tweak it to accept some server variables and then get the output back to us.  Here is a video of this in action:

Here is the recipe for it:

1

2

3

4

 

And the script:

 

Read More

Using ControlUp to launch a Citrix application published on a server

2016-08-23
/ /
in Blog
/

Occasionally, we have Citrix servers that ‘die’ in a peculiar way.  What happens may vary when they die but the usual symptoms are something like:

  1. The server is still somewhat responsive.  It responds to pings, RPC requests (tasklist /s %servername%)
  2. The server is not responsive.  You cannot RDP to it, console CTRL-ALT-DEL fails, etc.

This is frustrating because the services appear to be operating so the Citrix server will say, “hey, I’m working!  I can take sessions!”  And usually these servers won’t have any sessions because logons actually fail so their “XenApp Server Load” is low, so its priority for sessions to be directed to it is higher!  So how do we detect these servers with these issues?  Unfortunately, I haven’t seen any events in the Event Viewer or anything that stands out to search and find these troublesome servers.  Using ControlUp, sometimes it’s obvious because that troublesome server will have a much lower session count than other servers or something else is at fault and triggers the ‘Stress Level’ to go critical.  But these flags don’t usually exist if the problem has just occurred, they usually are more visible after time has passed.

Our helpdesk asked if there was a way they could test servers to help pinpoint a troublesome one.  I came up with a “Script-based Action” that targets a specific Citrix server and lists all published applications on that server.  You then select the application and it generates a ICA file and tries to launch it.  You need to have permission to the application on Citrix and Powershell remoting enabled on the XenApp servers/ZDC’s .  So if your a Citrix admin and PS Remoting is enabled this script will work out of the box.

However, I tried to make the script dynamic so you could query the XenApp servers from a standalone server without installing Citrix Powershell SDK locally.  To do this I use PowerShell remoting so you need to have PowerShell remoting enabled on your Citrix servers in your environment.  Secondly, if you have ‘lower’ privilege users you need to grant them the ability to connect to the servers via PowerShell remoting (by default only Administrators have access).  To grant them access you need to do the following:

powershell-perms

And in the ‘Set-PSSessionConfiguration’ command you need to enable the ‘Invoke’ permissions on your support group:
permissions
As well, you need to grant view properties on your Citrix farm since the group needs to query application properties, and workergroups (if you publish your applications to workergroups):

IMA_Perms

Now that we have our permissions configured we can create our ControlUp Script-Based action:

SBA_1 SBA_2 SBA_3 SBA_4

So what does this look like?

And the script:

 

Read More