ControlUp

Meltdown + Spectre – Performance Impact Analysis (Take 2)

2018-09-14
/ /
in Blog
/

I want to start this off by thanking ControlUp, LoginVSI and Ryan Ververs-Bijkerk for his assistance in helping me with this post.

Building on my last evaluation of the performance impact of Meltdown and Spectre, I was graciously given a trial of the LoginVSI software product used to simulate user loads and ControlUp’s cloud based analytic tool, ControlUp Insights.

This analysis takes into consideration the differences of operating systems and uses the latest Dell server hardware platform with a Intel Gold 6150 as the processor at its heart.  This processor contains the full PCID instructions to give the best performance (as of today) with mitigations enabled.  However, only Server 2012R2 and Server 2016 can take advantage of these hardware features.

Test Setup

This test was setup on 4 hosts.  2 of the hosts had VM’s where all the mitigations were enabled and 2 hosts had all of the mitigation features disabled.  I tested live production workloads and simulated user loads from LoginVSI.  The live production workloads were run on XenApp 6.5 on 2008R2 and the simulated workloads were on XenApp 7.15CU2 with 2008R2, 2012R2 and 2016.

Odd host numbers had the mitigation disabled, even host number had the mitigation enabled.

I sorted my testing logically in ControlUp by folder.

Real World Production results

The ControlUp Insights cloud product produced graphs and results that were easy and quick to interpret.  These results are for XenApp 6.5, Server 2008R2.

Hosts View

CPU Utilization

The mitigation-disabled “Host 1” had a higher consistency of using less CPU than the mitigation-enabled “Host 2”. The biggest spread in CPU was ~20% on the Intel Gold 6150’s with mitigation enabled to disabled.

IO Utilization

Another interesting result was IO utilization increased by an average of 100 IOPS for mitigation-enabled VM’s.  This means that Meltdown/Spectre also tax the storage subsystem more.  This averaged out to a consistent 12% hit in performance.

User Experience

The logon duration of the VM’s increased 43% from an average of 8 seconds on a mitigation-disabled VM to 14 seconds on a mitigation enabled VM.  The biggest jump in the sub-metrics were Logon Duration (Other) and Group Policy time going from 3-5s and 6-8s.

Mitigation Disabled

Mitigation Enabled

For applications we have that measure “user interactivity” the reduction in the user experience was 18%.  This meant that an action on a mitigation-enabled VM took an average of 1180ms vs 990ms on a mitigation-disabled VM when measuring actions within the UI.

Honestly, I wish I had ControlUp Insights earlier when I did my original piece, it provides much more detail in terms of tracking additional metrics and presents it much more cleanly that I did.  Also, when the information was available it was super quick and fast to look and compare the various types of results.

Simulated Results

LoginVSI was gracious enough to grant me licenses to their software for this testing.  Their software provides simulated user actions including pauses for coffee and chatting between work like typing/sending emails, reading word or PDF documents, or reviewing presentations.  The suite of software tested by the users tends to be major applications produced by major corporations who have experience producing software.  It is not representative of applications that could be impacted the most by Spectre/Meltdown (generally, applications that are registry event heavy).  Regardless, it is interesting to test with these simulated users as the workload produced by them do fall under the spectrum of “real world”.  As with everything, your mileage will vary and it is important to test and record your before and after impacts.  ControlUp with Insights does an incredible job of this and you can easily compare different periods in time to measure the impacts, or just rely on the machine learning suggestions of their virtual experts to properly size your environment.

Since our production workloads are Windows Server 2008R2 based, I took advantage of the LoginVSI license to test all three available server operating systems: 2008R2, 2012R2, and 2016.  Since newer operating systems are supposed to enable performance enhancing hardware features that can reduce the impact of these vulnerabilities, I was curious to *how much*.  I have this information now. 

I tested user loads of 500, 300, and 100 users across 2 hosts.  I tested one with Spectre/Meltdown applied and one without.  Each host ran 15VM’s with each VM having 30GB RAM and 6vCPU for an CPU oversubscription of 2.5:1.  The host spec was a Dell PowerEdge M640 with Intel 6150 Gold processors and 512GB of memory.

2016 – Hosts View

500 LoginVSI users with this workload, on Server 2016 pegged the hosts CPU to 100% on both the Meltdown/Spectre enabled and disabled hosts.  We can still see the gap between the CPU utilization between the two with the Meltdown Spectre hosts

300 LoginVSI users with this workload, on Server 2016 we see the gap is narrow but still visible.

100 LoginVSI users with this workload, on Server 2016 we see the gap is barely visible, it looks even.

2012R2 – Hosts View

500 LoginVSI users in this workload on Server 2012 R2.  There definitely appears to be a much larger gap between the meltdown enabled and disabled hosts.  And Server 2012R2 non-mitigated doesn’t cap out like Server 2016.

300 LoginVSI users in this workload on Server 2012R2.  The separation between enabled and disabled is still very prominent.

100 LoginVSI users in this workload on Server 2012R2.  Again, the separation is noticeable but appears narrower with lighter loads.

2008R2 – Hosts View

500 LoginVSI users in this workload on Server 2008R2.   Noticeable additional CPU load with the Meltdown/Spectre host.  A more interesting thing is it apperas overall CPU utilization is lower than 2012R2 or 2016.

300 LoginVSI users in this workload on Server 2008R2.  The separation between enabled and disabled is still very prominent.

100 LoginVSI users in this workload on Server 2008R2.  I only captured one run and the low utilization makes the difference barely noticeable.

Some interesting results for sure.  I took the data and put it into a pivot table to highlight the CPU differences for each workload against each operating system.

This chart hightlights the difference in CPU percentage between mitigation enabled and disabled systems.  The raw data:

Again, interesting results.  2008R2 seems to have the largest average CPU seperation, hitting 14%, followed by 2012R2 at 11% and than 2016 having a difference of 4%.

One of things about these results is that they highlight the “headroom” of the operating systems.  2008R2 actually consumes less CPU and so it has more room for separation between the 3 tiers.  On the 2016, there is so much time spent where the CPU was pegged at 100% for both types of host that makes a difference of “0%”.  So although the smaller number on server 2016 may lead you to believe it’s better, it’s actually not.

This shows it a little more clear.  With mitigations *enabled*, Server 2008R2 can do 500 users at less average CPU load than 2016 can do 300 users with mitigations *disabled*.

From the get-go, Server 2016 appears to consume 2x more CPU than Server 2008R2 in all non-capped scenarios with Server 2012 somewhere in between.

When we compare the operating systems against the different user counts we see the impact the operating system choice has on resources. 

Mitigation Disabled:

100 Users
300 Users
500 Users

Mitigation Enabled:

100 Users
300 Users
500 Users

Final Word

Microsoft stated that they expected to see less of an impact of the Spectre/Meltdown mitigations with newer operating systems.  Indeed this does turn out to be the case.  However, the additional resource cost of newer operating systems is actually *more* than running 2008R2 or 2012R2 with mitigations enabled .  So if you’re environment is sized for running Server 2016, you probably have infrastructure that has already been spec’ed for the much heavier OS anyways.  If your infrastructure has been spec’ed for the older OS than you will see a larger impact.  However, if you’ve spec’ed for the larger OS (say for a migration activity) but are running your older OS’s on that hardware, you will see an impact but it will be less than when you go live with 2016.

Previously I had stated that there are two different, important performance mechanisms to consider; capacity and how fast work actually gets done.  All of these simulated measurements are about capacity.  I hope to see how speed is impacted between the OS’s, but that may have to wait for a future posting. 

Tabulating the LoginVSI simulated results without ControlUp Insights has taken me weeks to properly format the results.  I was able to use a trial of ControlUp Insights to look at the real world impact of our existing applications and workloads.  If my organization ever purchases Insights I would have had this post up a long time ago with even more data, looking at things like the storage subsystems.  Hopefully we acquire this product in the future and if you want to save yourself and your organization time, energy and effort getting precise, accurate data that can be compared against scenarios you create: get ControlUp Insights.


I’m going to harp on this, but YOUR WORKLOAD MATTERS MORE than these simulated results.  During this exercise, I was able to determine with ControlUp Insights that one of our applications is so light that we can host 1,000 users from a single host where that host struggled with 200 LoginVSI users.  So WORKLOAD MATTERS.  Just something to keep in mind when reviewing these results.  LoginVSI produces results that can serve as a proxy for what you can expect if you can properly relate these results to your existing workload.  LoginVSI also offers the capability to produce custom workloads tailored to your specific environment or applications so you can gauge impact with much more precision. 

Read More

ControlUp – Realtime Troubleshooting Example

2017-04-13
/ /
in Blog
/

I’ve been using a troubleshooting tool for some time and I can’t speak highly enough about.  It has saved myself so much time in finding/resolving and remediating issues.  I decided to capture myself utilizing the tool *in realtime* to remediate a reported issue that came in during a weekend call.  I was able to find the servers that had the issues within 5 mins and prevent them from hosting users (effectively remediating the issue).

The root of the issue is we had some servers reboot during the daylight savings time change and this messed up some of our scripts that were relying on some timing to load our AppV packages.  After receiving the call I started my recording.

Youtube link.  Please ensure to select the highest quality.

Without ControlUp, trying to search all the servers was a manual task or a scripted task that was executed serially.  These tasks, in the past, would take tons of time.  Generally, an issue like this would have taken an hour or so to resolve.  With ControlUp this was resolved in less than 5 minutes.

Higher Quality Version here.

Read More

ControlUp – Script Based Action (SBA) Action Auditing

2017-01-30
/ /
in Blog
/

ControlUp is a tool we use to monitor our Citrix environment.  We have multiple people and multiple times actions are run via ControlUp and an easier way to review the actions would be nice.  ControlUp keeps all machine actions executed by them on the local machine’s event log.  To review these logs I decided, what better way than to use ControlUp!

The Script Based Action (SBA):

And the steps to create the SBA:

  1. Create a new SBA and name it “ControlUp Action Auditing” and click ‘Next’
  2. Set the ‘Assigned to:’ “Computer” and ‘Execution Context:’ as “ControlUp Console”
  3. Add the script and set the ‘Execution Timeout (seconds)’ to whatever will satisfy querying your remote systems (I set mine to 120).
  4. Setup the variables
    $args[0] = ‘Name’ property
    $args[1]: 
    (note the “pipe” symbol in the ‘Input Validation string’
    $args[2]: 
  5. Save and Finalize the SBA.  And now it in action:
Read More

Examining Logon Durations with Control Up – Profile Load Time

2016-09-15
/ /
in Blog
/

Touching on my last point with an invalid AD Home Directory attribute, I decided to examine it in more detail as what is causing the slowness on logon.

60s

52 seconds for Profile Load time.  This is caused by this:

bad_attribute

The Home Folder server ‘WSAPVSEQ07’ I created that share and set the home directory to it.  I then simulated a ‘server migration’ by shutting this box down.  This means that the server responds to pings because it’s still present in DNS.

ping

But why does it take so long?

If I do a packet capture on the server I’m trying to launch my application from, and set it to trace the ip.addr of the ‘powered off’ server, here’s what we see:

slow-logon-packet-capture

The logon process attempts to query the server at the ‘28.9’ second mark and stops at the ‘73.9’ second mark.  A total of 45 seconds.  This is all lumped into the ‘User Profile Load Time’.  So what’s happening here?

We can see the initial attempt to connection occurs at 28.9 seconds, then 31.9 seconds, then finally 37.9 seconds.  The time span is 3s between the first and second try then 6s between the second and third try.  This is explained here.

In Windows 7 and Windows Server 2008 R2, the TCP maximum SYN retransmission value is set to 2, and is not configurable. Because of the 3-second limit of the initial time-out value, the TCP three-way handshake is limited to a 21-second timeframe (3 seconds + 2*3 seconds + 4*3 seconds = 21 seconds).

Is this what we are seeing?  It’s close.  We are seeing the first two items for sure, but then in instead of a ‘3rd’ attempt, it starts over but at the same formula.

= 45 seconds.

According to the KB article, we are seeing the Max SYN retransmissions (2) for each syn sent.  This article contains a hotfix we can install to change the value of the Max SYN retransmissions, but it’s a minimum of 2 which it’s set to anyways.  However, there is an additional hotfix to modify the 3 second time period.

The minimum I’ve found is we can reduce the 3 second time period to 100ms.

values

This reduces the logon time to:

faster_logon

19 seconds.

What does this packet capture look like?  Like this:
lower_syn_values

Even with the ‘Initial RTO’ set to 100ms, Windows has a MinRTO value of 300ms:

minrto

After the initial ‘attempt’ there is a 10-12 second delay.

Setting the MinRTO to the minimum 20ms

minrto_20

Reduces our logon time further now because our SYN packets are now about 200ms apart:

minrto_200ms

We are now 16 seconds, 13 seconds spent on the profile upon which 12 seconds was timing out this connection.

16s-logon

 

Would you implement this?  I would strongly recommend against it.  SYN’s were designed so blips in the network can be overcome.  Unfortunately, I know of no way to get around the ‘Home Directory responds to DNS but not to ping’ timeout.  The following group policies have no effect:

gpo1

gpo2

 

And I suspect the reason it has no effect is because the server still responds to DNS so the SYN sequence takes place and blocks regardless of this GPO settings.  Undoubtedly, it’s because this comes into play:
gpo_explain

Since our user has a home directory the preference is to ‘always wait for the network to be initialized before logging the user on’.  There is no override.

Is there a way to detect a dead home directory server during a logon?  Outside of long logon’s I don’t see any event logging or anything that would point us to determine if this is the issue without having to do a packet capture in a isolated environment.

 

 

Read More

ControlUp – Dissecting Logon Times a step further (invalid Home Directory impact)

2016-09-07
/ /
in Blog
/

Continuing on from my previous post, we were still having certain users with logons in the dozens of seconds to minutes.  I wanted to find out why and see if there is anything further that could be done.

60second_profile

 

After identifying a user with a long logon with ControlUp I ran the ‘Analyze Logon Duration’ script:

51-1second_profile

 

Jeez, 59.4 seconds to logon with 51.2 seconds of that spent on the User Profile portion.  What is going on?  I turned to process monitor to capture the logon process:

screen-shot-2016-09-07-at-8-24-18-pm

Well, there appears to be a 1 minute gap between the cmd.exe command from when WinLogon.exe starts it.  The stage it ‘freezes’ at is “Please wait for the user profile service”.

 

Since there is no data recorded by Process Monitor I tested by deleting the users profile.  It made no difference, still 60 seconds.  But, since I now know it’s not the user profile it must be something else.  Experience has taught me to blame the user object and look for network paths.  50 seconds or so just *feels* like a network timeout.  So I examined the users AD object:

screen-shot-2016-09-07-at-8-46-19-pm

 

Well well well, we have a path.  Is it valid?

screen-shot-2016-09-07-at-8-49-42-pm

 

 

It is not valid.  So is my suspicion correct?  I removed the Home Directory path and relaunched:

without_homedir_logon_time

Well that’s much, much better!

So now I want ControlUp to identify this potential issue.  Unfortunately, I don’t really see any events I can key in on that says ‘Attempting to map home drive’.  But what we can do is pull that AD attribute and test to see if it’s valid and let us know if it’s not.  This is the output I now generate:

new_script

 

I revised the messaging slightly as I’ve found the ‘Group Policy’ phase can be affected if GPP Drive Maps reference the home directory attribute as well.

 

So I took my previous script and updated it further.  This time with a check for valid home directories.  I also added some window sizing information to give greater width for the response as ‘Interim Delay’ was getting truncated when there were long printer names.  Here is the further updated script:

Read More

ControlUp – Dissecting Logon Times a step further (Printer loading)

2016-08-31
/ /
in Blog
/

We have applications that require printers be loaded before the application is started.  This is usually because the application will check for a specific printer or default printer and if one is not set (because it hasn’t mapped into the session) then it’ll throw up a dialog or not start entirely.

So we have this value ‘unchecked’ for some applications:

Screen Shot 2016-08-31 at 12.12.08 AM

But how does this impact our logon times?

Well… Our organization just underwent a print server migration/upgrade where some print servers were decommissioned and don’t exist.  But some users still have them or references to them on their end points.  We do have policies that only map your default printer, but some users are on a policy to map ‘all’ printers they have on their system.

What’s the impact?

Screen Shot 2016-08-31 at 12.15.10 AM

Waiting for printers before starting the application…

 

Screen Shot 2016-08-31 at 12.26.50 AM

Without waiting for printers

16 Seconds?  How is that so?

Well, it turns out waiting for printers and the subsystem components to support them add a fair amount of time, and then worse is network printers that don’t go anywhere anymore.  I’ve seen these logons wait for connection before timing out, all the while the user sits there and waits.  The script that comes with ControlUp for analyzing logons is good, but I wanted to know more on why some systems had long logon times and the only clue was Pre-Shell (userinit) taking up all the time.  So I dug into the print logs and found a way to measure their impact.

Screen Shot 2016-08-31 at 12.32.05 AM

With my modified script we can clearly see waiting for the printers takes ~15.4s with a few printers over a few seconds and the rest at 0.5 seconds or so.  One thing about this process is that mapping printers is synchronous.  So when or if 1 stalls, the whole process gets stuck.  All my printers were local except for the ‘Generic / Text Only’ which was a network printer where I powered off the server.  It hung the longest at 5.9 seconds, but I’ve seen ‘non-existant’ network mapped printers hang for 150 seconds or so…

To facilitate finding the printers we need to pass the clientName to the server and the Print Service Logs need to be enabled.

You can enable the print service logs on server 2008R2 by executing the following:

The ControlUp arguments need to look like this now:

Screen Shot 2016-08-31 at 12.40.36 AM

Here is my updated script:

I hope to dig into other startup components and further drill down into what our user launch process looks like.  We wait, and we see 🙂

Read More

ControlUp – List AppV5 recent events on various servers

2016-08-26
/ /
in Blog
/

David Falkus just posted a blog post on using Powershell to combine multiple AppV5 logs into a single view and orders them chronologically so you can see the events as they occurred.

Since this was a PowerShell script we can use ControlUp to import it, tweak it to accept some server variables and then get the output back to us.  Here is a video of this in action:

Here is the recipe for it:

1

2

3

4

 

And the script:

 

Read More

Using ControlUp to launch a Citrix application published on a server

2016-08-23
/ /
in Blog
/

Occasionally, we have Citrix servers that ‘die’ in a peculiar way.  What happens may vary when they die but the usual symptoms are something like:

  1. The server is still somewhat responsive.  It responds to pings, RPC requests (tasklist /s %servername%)
  2. The server is not responsive.  You cannot RDP to it, console CTRL-ALT-DEL fails, etc.

This is frustrating because the services appear to be operating so the Citrix server will say, “hey, I’m working!  I can take sessions!”  And usually these servers won’t have any sessions because logons actually fail so their “XenApp Server Load” is low, so its priority for sessions to be directed to it is higher!  So how do we detect these servers with these issues?  Unfortunately, I haven’t seen any events in the Event Viewer or anything that stands out to search and find these troublesome servers.  Using ControlUp, sometimes it’s obvious because that troublesome server will have a much lower session count than other servers or something else is at fault and triggers the ‘Stress Level’ to go critical.  But these flags don’t usually exist if the problem has just occurred, they usually are more visible after time has passed.

Our helpdesk asked if there was a way they could test servers to help pinpoint a troublesome one.  I came up with a “Script-based Action” that targets a specific Citrix server and lists all published applications on that server.  You then select the application and it generates a ICA file and tries to launch it.  You need to have permission to the application on Citrix and Powershell remoting enabled on the XenApp servers/ZDC’s .  So if your a Citrix admin and PS Remoting is enabled this script will work out of the box.

However, I tried to make the script dynamic so you could query the XenApp servers from a standalone server without installing Citrix Powershell SDK locally.  To do this I use PowerShell remoting so you need to have PowerShell remoting enabled on your Citrix servers in your environment.  Secondly, if you have ‘lower’ privilege users you need to grant them the ability to connect to the servers via PowerShell remoting (by default only Administrators have access).  To grant them access you need to do the following:

powershell-perms

And in the ‘Set-PSSessionConfiguration’ command you need to enable the ‘Invoke’ permissions on your support group:
permissions
As well, you need to grant view properties on your Citrix farm since the group needs to query application properties, and workergroups (if you publish your applications to workergroups):

IMA_Perms

Now that we have our permissions configured we can create our ControlUp Script-Based action:

SBA_1 SBA_2 SBA_3 SBA_4

So what does this look like?

And the script:

 

Read More

Using VMWare Remote Console with ControlUp

2016-08-13
/ /
in Blog
/

I wanted to connect to the console session of some of our VM’s but ControlUp doesn’t have a native way of doing so.  Enter Script-Based-Actions and the ability to create those features!  Here is a video of it in action:

VMWare Remote Console on ControlUp

We use multiple individual vCenter servers so I have a list of them I need to connect to in order to find the VM and get the required data.  This takes a bit longer but is still faster than running 6 different vCenter consoles.  You will need to modify the vCenter list in my script and add your own:

 

Read More