Performance

Exploring the Citrix XML 6.5 Broker in more detail

2014-11-28

The Citrix XML broker actually relies on many pieces to ensure fast and proper operation. This CTX article describes the process for XenApp 6 (seems applicable to 6.5 as well).

The part that is relevant to the XML broker is steps 4-9.

4. The user’s credentials are forwarded from the XML service to the IMA service via HTTP (or HTTPS).

5. The IMA then forwards them to the local Lsass.exe.

6. The Lsass.exe encrypts the credentials and passes them to the domain controller.

7. The domain controller returns the SIDs (user’s SID and the list of group SIDs) back to Lsass.exe and to IMA.

8. IMA uses the SIDs to search the Local Host Cache (LHC) for a list of applications and the Worker Group Preference policy for that authenticated user.

9. The list of the applications together with the user’s worker group preference policy are returned to the Web Interface.

So what does this look like in a packet capture?

Starting with packet #32 we see the initial POST request for a list of applications.

Steps 5, 6 and 7 are packets 36-86: LSASS goes out to AD to grab the SIDs.

Step 7.5: It appears that the first time you enumerate applications on a broker, that information is queried from the SQL database and stored in the Local Host Cache.  This would be packets 87-94.  Subsequent queries do not show this traffic.

Lastly, step 9 is all the traffic we see after packet 95 in red; the return of the XML data.  For our setup, our XML brokers responded with the following timings:

Step 4: 1ms
Step 5: 2ms
Steps 6 and 7: 47ms
Step 7.5 (DB query is not in LHC): 2ms
Step 8: 14ms
Step 9: 14ms

Total: 80ms

With a freshly created LHC this is what process monitor sees:

Again, IMASrv.exe will actually NOT be present in this list if you have already executed at least one query against the XML broker; it only shows up when it queries the database (wssql011n02) and stores the response in the LHC.

So, what could contribute to slow XML broker performance?

Utilizing the WCAT script we can continuously hammer the XML broker with however many connections we desire.  The XML brokers have a maximum of 16 threads to deal with the incoming traffic, but at 80ms per request/response that works out to roughly 200 requests per second, so the queue would have to get fairly long to create a noticeable performance impact.  In addition, previous tests on the 4.5 broker showed that additional CPUs help improve the performance of the XML broker; I think the 6.5 broker shows better performance with fewer CPUs.

Utilizing the bandwidth emulator clumsy, I’m going to simulate some poor network conditions against the XML broker to see how XML response times vary.  The only network dependencies I can see are the source (Web Interface), Active Directory, and (potentially) the SQL database.

Just starting the clumsy software with its filter-by-IP capability added about 800ms to the total roundtrip time.  Something to consider with other network management/threat protection software, I imagine…

Anyways, adding just 20ms of lag to and from the web interface to the XML broker added another 500ms to the total processing time.  That is, 1500ms on the low end and 2200ms on the high end.  Increasing packet latency to 100ms brought the total processing time to 2700ms on the low end and 3800ms on the high end.  I think it’s safe to say that having the web interface and XML brokers beside each other for the lowest possible latency is a big performance win.

Targeting Active Directory with a latency of 20ms brought times from 420ms to 550ms.  Increasing that to 100ms brought the response times up to 890ms.  Not too shabby; AD seems more resilient to latency.

Targeting the SQL database with a latency of 100ms showed the first query after the LHC was rebuilt go to 600ms and then back down to 420ms thereafter.  Locality to the database seems to have the lowest impact, but the 100ms lag did increase the LHC rebuild time to about 3 minutes, from near instant before.

I did try to test a heavy disk load against the XML broker, but I was running this server on PVS with RAM Cache with disk overflow enabled, which means my LHC is stored in RAM, and no matter how hard I pounded the C:\ drive I couldn’t make an impact.


Load testing Citrix XML broker

2014-11-27

Previously, we encountered performance issues with the Citrix Web Interface due to our user load.  I devised a test using Microsoft’s WCAT to hammer the web interface servers.  We found that after removing the ASPX processing limitation, logins were still slow, and some XML brokers were taking a long time to respond.

I’ve been tasked with finding out why.  The Citrix XML server is a basic web server that takes an XML POST, processes it and spits back an XML file in response.  To test the performance of the XML server I created a PowerShell script to send the same XML request that occurs when you log in through the web interface.

XML-Test.ps1:
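A minimal sketch of the idea (not the original script): it assumes the Citrix XML service answers POSTs on /scripts/wpnbr.dll with a Content-Type of text/xml, and that request.xml holds the enumeration request captured from a real Web Interface logon (e.g. with Fiddler).  The CSV input and the logged output match the samples below.

$requestXml = Get-Content .\request.xml -Raw    # captured application-enumeration request body
$brokers    = Import-Csv .\list_of_xml.csv      # columns: Farm,Broker,Port

foreach ($b in $brokers) {
    $url = "http://$($b.Broker):$($b.Port)/scripts/wpnbr.dll"   # assumed XML service path
    $sw  = [System.Diagnostics.Stopwatch]::StartNew()
    try {
        Invoke-WebRequest -Uri $url -Method Post -Body $requestXml -ContentType 'text/xml' -UseBasicParsing | Out-Null
    }
    catch {
        Write-Warning "Request to $($b.Broker) failed: $_"
    }
    $sw.Stop()
    # log: time completed, broker, port, response time in milliseconds
    "{0},{1},{2},{3}" -f (Get-Date), $b.Broker, $b.Port, $sw.Elapsed.TotalMilliseconds | Add-Content .\xml-response-times.csv
}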

 
The list_of_xml.csv looks like this:
Farm,Broker,Port
farm,wsctxrshipxml1,80
The output of the file looks like so:
11/26/2014 3:34:29 PM,wsctxrshipxml1,80,312.197
11/26/2014 3:34:30 PM,wsctxrshipxml1,80,345.0929
11/26/2014 3:34:31 PM,wsctxrshipxml1,80,255.165
11/26/2014 3:34:33 PM,wsctxrshipxml1,80,294.3027
11/26/2014 3:34:34 PM,wsctxrshipxml1,80,300.1806
The times are in milliseconds on the far right.
Utilizing this information we can gather how quickly the XML brokers respond, and using that as a baseline we can start load testing to see how response times change under load.
To do this, we go back to WCAT.  I set my XML.ubr file like so:
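A rough sketch of such a scenario file (WCAT 6.3-style syntax; the URL and POST body are assumptions, the body being the same captured enumeration request used by the PowerShell test script):

scenario
{
    name = "xml_broker_enumeration";

    default
    {
        version    = HTTP11;
        statuscode = 200;
    }

    transaction
    {
        id     = "enumerate_applications";
        weight = 1;

        request
        {
            verb = POST;
            url  = "/scripts/wpnbr.dll";
            setheader
            {
                name  = "Content-Type";
                value = "text/xml";
            }
            postdata = "<NFuseProtocol version=\"5.1\"><RequestAppData>...</RequestAppData></NFuseProtocol>";
        }
    }
}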
I then started the client with:
start “” wcclient.exe localhost
And let it roll.  When monitoring the server, the process that takes up the most time is IMASRV.EXE.  I imagine that is because the XML service is really just a simple web server that accepts the traffic, hands it off to IMASRV.exe to actually process it, gets the response back and sends it back to the requestor.
I started load testing at 11:49:00.  After 40 concurrent connections the XML service is responding to requests at 2000ms per request.
With this testing we can now try to improve the performance of the XML broker.  We monitored one of our brokers, made some changes to it and reran the test to see the impact.  The largest positive impact we saw was adding CPUs to the XML brokers.  The following graphs illustrate the differences we saw:

 

I capped the top graph at around 20,000ms for responding to an XML request, and the bottom graph at the maximum number of concurrent connections reached at 3,000ms.  3,000ms would be very high for XML enumeration in my humble opinion, but still tolerable.  Ideally, you would keep all requests under 1,000ms.  Broken down by CPU count:

1 CPU: tops out at about 12-15 concurrent requests; only ~2 stay under 1,000ms
2 CPUs: 24-28 concurrent requests; ~5 stay under 1,000ms
4 CPUs: about 60 concurrent requests; ~21 stay under 1,000ms
8 CPUs: exceeds 3,000ms at 120 concurrent requests; ~30 stay under 1,000ms
Again, this is the same query that the web interface sends when you log in to a farm.  So if you have 10 farms and they all take 1,000ms to respond to an XML request, you will sit at the login screen for 10 seconds.  StoreFront allows parallel requests, which would (potentially) reduce that to 1 second, but for Web Interface (and even StoreFront) having an optimized XML broker configuration is ideal and, apparently, is very dependent on the number of CPUs you can give it.  Recommendation: as many as possible.  Unfortunately, I did not have the ability to test 16 or 32 core systems, but for an Enterprise environment I would try to keep 8 as a minimum.

Optimizing Citrix Web Interface 5.4

2014-11-27

My previous post detailed testing the limits of your Citrix Web Interface.  During this testing we discovered there appeared to be a limit on the number of connections the Citrix Web Interface could accommodate.  This discovery masked the true limitation: the Citrix Web Interface is limited by the number of threads its w3wp.exe process can spawn to handle ASPX pages.  The w3wp.exe process spawns threads to handle the incoming connections/load, and once it exceeds 48 threads, further requests go into the application queue.  This limitation exists when running an ASP.NET application under ASP.NET 2.0 in Classic mode.  In Integrated mode the limit is determined differently and does not apply.  Unfortunately, the Citrix Web Interface runs in Classic mode under ASP.NET 2.0.

This limit is the number of CPUs you have times an arbitrary per-CPU number Microsoft decided would be good for dealing with ASPX pages back in the IIS 5/IIS 6 days.  It is woefully low and, sadly, Citrix does not raise it when you install the Web Interface application.

But, we can increase the per processor thread count to prevent the queuing.  To do so you must edit the “C:\Windows\Microsoft.NET\Framework\v2.0.50727\CONFIG\machine.config” and set the processModel with a new set of thread values.
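For reference, the processModel element sits under system.web in machine.config, and autoConfig has to be turned off for manual values to take effect.  The numbers below are illustrative only (the post implies roughly 500 per CPU, i.e. a 1,000-thread limit per process on a 2-processor box):

<system.web>
    <processModel autoConfig="false"
                  maxWorkerThreads="500"
                  maxIoThreads="500"
                  minWorkerThreads="250"
                  minIoThreads="250" />
</system.web>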

The values are maxWorkerThreads, maxIoThreads, minWorkerThreads and minIoThreads.  All values are multiplied by the number of processors you have and apply *per w3wp.exe process*.  So if you have a XenApp website and a services site running in separate application pools, each process then has a 1,000-thread limit on a 2-processor system.  When we implemented these values we found we could hammer the web interface and the ASPX pages would consistently respond without issue.  We did find that hammering the web interface with 1,000 threads then revealed a new performance limitation: though the ASPX pages were being processed, so the explicit login page now consistently appeared without issue, logging in and showing the applications was now slow, taking several dozen seconds at times.  By allowing our web interface to process more connections and pass those connections on, some of our dedicated XML brokers were having issues keeping up with the requests.

My next post will detail testing the capabilities of the XML server and seeing if we can determine the best way to optimize XML performance.


Slow Citrix Web Interface 5.4 on ASPX pages (or how to load test your WI)

2014-11-23

We’ve been having issues with our Citrix Web Interface (5.4.2) cratering to the point that you cannot even get to the explicit logon page, getting stuck on SilentDetection.aspx or another aspx page.

One Moment Please…  This screen takes minutes to resolve

So what could be causing this?  More often than not, it’s a broken XML broker that isn’t responding in a reasonable amount of time.  But XML brokers don’t come into play with ‘Explicit Logon’, especially if you can’t even get to the logon page.  In our scenario, we have two production Citrix Web Interface servers which are load-balanced in a round-robin fashion.

So I theorized that we are experiencing a load issue: our servers cannot handle the number of connections that our clients and PNA agents are producing.  How can we prove this out?  How many connections can our web interface accept?

To find the maximum number of connections our web interface servers can sustain, I came up with a plan.  We need to thoroughly exercise the ASP.NET scripts that Citrix has put together for the web interface.  To do that, I decided on a sequence of logging in via explicit logon, skipping client installation (so client detection occurs) and then application enumeration.  This sequence hits the following ASPX pages:

 

The timeline of processing the aspx pages on an unloaded server looks like this:
The initial silentDetection.aspx’s take the majority of the processing time, about 4.3 seconds.  To simulate this load, Microsoft has a tool called “WCAT”, the Web Capacity Analysis Tool.  This tool allows you to set up “scenarios”, which are sequences of HTTP actions (GET/POST/etc.) that simulate each step of the sequence above.  To record this sequence for WCAT, you need to install Fiddler 2 and then install a WCAT scenario creator.  There are numerous WCAT scenario creators, but I used this one.  To create the scenario, do the following:
1) Launch Fiddler 2 and click ‘Launch’
2) Go through your login sequence on your web server until you hit the page with application enumerations.

 

 

3) Go back to Fiddler 2 and at the bottom of the screen on the left in the black bar, type “wcat reset”
4) Go Edit > Select All.  Then in the black bar, type ‘wcat addtrans’
5) Lastly, type ‘wcat save’.
This will create a fiddler.wcat file in your C:\Program Files (x86)\Fiddler2 directory.
You can now open this file as a text file to see the scenario you just created.  One of the issues with this WCAT scenario creator is that you will need to fix up some cookie values.  Cookie values are captured verbatim, which causes problems because double quotes inside them need to be escaped in the scenario file.
So, for example, a captured Cookie header whose value itself contains double quotes won’t work as captured.  You need to escape the inner double quotes with a backslash.
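As an illustration (the cookie name and value below are made up, not from the original capture), the corrected setheader block in the .ubr file ends up looking like this:

setheader
{
    name  = "Cookie";
    value = "WIUser=\"en-US\"; WIAuthId=abc123";
}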
Once you’ve escaped the double quotes, save the file as ‘scenario.ubr’; you can now set up WCAT.
Install WCAT on some test boxes or a single box (for the purposes of this demo I’m doing just a single box).
Copy over your scenario.ubr to the C:\Program Files\wcat folder.  I took the settings.ubr file from the Samples directory in the wcat folder and put it in the root of the wcat folder as well.
You then need to start the controller and attach clients to this server.  Open a command prompt to the wcat directory.  The command line I used for the controller is:
wcctl -t scenario.ubr -f settings.ubr -s wsctxwi01t -c 20 -v 20 -w 300 -u 300 -n 100
the values are:
-c = # of clients to connect to this controller.
-v = # of virtual clients to launch per client
-w = warm up time before all clients are connected
-u = duration of testing
-n = cooldown period for clients to disconnect.
Since I’m testing from one computer I then used the following command to start all my clients:
for /l %A IN (1,1,20) do (start “” wcclient.exe localhost)
The 20 clients connect and start loading the server, following the sequence we recorded earlier.
If you want to see where your maximum user load sits vs. maximum performance (e.g., near-instant speed), go to your web interface server and create a data collector set with the relevant IIS/ASP.NET counters (at minimum the ones referenced below: Current Connections, Requests in Application Queue, Request Execution Time and Request Wait Time):
Without any clients connected, start your data collector, then start the WCAT load testing.
This is what I see with regards to our load:
You can see that around 220 current connections the Requests in Application Queue counter shoots up.  At this point any additional requests for ASPX files go into a queue, which makes the website appear ‘frozen’ while it waits for the ASPX file to be processed and sent down to you.  Citrix attempts to ‘hide’ this by loading ‘loading.htm’ during ASPX processing.  This page is your ‘One Moment Please’ page.  If you show the Request Execution Time counter, we can see the longest time taken to process an ASPX page:

 

In my example here, the longest time it took to process an ASPX page is the lighter pink line: 14 seconds.
The Request Wait Time is how long a request sits in the queue waiting for its ASPX page to start processing.  So the dark blue line with the spike in the bottom right was a 16-second wait time waiting for my page to start processing, and with the request execution time being 14 seconds, this particular request may have taken 30 seconds in total.
With this information we can fairly accurately determine that we would want to keep the number of current connections at or under about 150, just to keep a safe buffer.  To do that, we could increase the number of web interface servers or break our web server out into different application pools.  Our Citrix web interface was originally designed with a monolithic application pool: our PNA website and the other websites all run under one process.  In our situation, we could exceed our connection limit during times when Citrix Receiver checks in to enumerate applications at the same time as users are trying to use the web site.  If we break the PNA website out to a new application pool, we can effectively double the number of connections we can process at any one time.  The limit appears to be the w3wp.exe process’s ability to process ASPX pages.  Adding another application pool creates another w3wp.exe process, and if we max out the connections on one of the websites, the other website/process will continue to work without issue.

AppV 5 first launch application performance

2014-11-17

Our AppV 5 environment is a full infrastructure implementation.  We utilize the management/streaming server to pull the applications down to our Citrix XenApp 6.5 servers.  Our XenApp servers are provisioned with Citrix PVS, and we enable the Cache in RAM with disk overflow write-cache mode.  To maximize CPU performance we have our ESXi hosts set to maximum performance and disable power management in the BIOS of the hosts.  We have some applications that are very latency-sensitive, and the switching of power states on the ESXi hosts has caused performance degradation, so we have power management disabled.  We have set up our PVS servers with a secondary D: WriteCache disk where we fully mount the AppV 5 packages, removing the streaming latency that going over a network may add.

Because of some performance concerns with the Shared Content Store (SCS), I was tasked with coming up with a way of determining whether there is a performance impact in switching away from fully mounted applications.  My plan was to measure a baseline based on disk performance.  The SMB share where we store our .appv packages actually has the same performance characteristics as the local disk.  Since AppV packages are immutable, the only performance consideration we should be concerned with is READ performance from the SMB share compared to the local disk.  The writes occur in %userprofile%\appdata\vfs, which is stored on the C:\ drive.  The Cache in RAM with disk overflow feature ensures that write performance into those directories is fast and should be near instant.

With that said, I’ve used the diskspd.exe utility (new from Microsoft) to measure performance.  AppV 5 utilizes a 64KB allocation size, so that’s what we’ll set as our -b value.  We’ll measure latency statistics comparing the local disk to the file share as well.

D:\diskspd.exe -c1G -b64K -L -d60 D:\test.dat
D:\diskspd.exe -c1G -b64K -L -d60 \\citrixnas01\ctx_images_test\test.dat

Results:
D:\

SMB share:

 

Performance of the SMB share vs local disk:
MB/s: 96%
IO per s: 96%
AvgLat: 46%

Based on these results, the local disk appears to be nearly identical to the SMB share, with the average latency on the local disk a little more than half that of the share.  Although it’s roughly half on average, we are still in the sub-1ms range, which is significantly faster than you could get from a physical server with a single local disk.

The next test I have is launching an application and timing how long it takes to get to the splash screen.  For this test I’ve written an AutoIt script that takes two parameters: the name of the program to launch and the window title to monitor for.
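A rough equivalent of the idea in PowerShell (a sketch, not the AutoIt script itself; saved as something like Time-AppLaunch.ps1): launch the command, poll for a window whose title matches, and emit the elapsed time in the comma-separated format described below.

param(
    [string]$Program,       # e.g. a .cmd file that launches the App-V application
    [string]$WindowTitle    # window title (or fragment) that signals the splash/login screen
)

$sw = [System.Diagnostics.Stopwatch]::StartNew()
Start-Process -FilePath $Program

# poll until some process shows a matching main window title
do {
    Start-Sleep -Milliseconds 100
    $found = Get-Process | Where-Object { $_.MainWindowTitle -like "*$WindowTitle*" }
} until ($found)

$sw.Stop()
# columns: time completed, duration (ms), window checked for, command executed
"{0},{1},{2},{3}" -f (Get-Date), [math]::Round($sw.Elapsed.TotalMilliseconds), $WindowTitle, $Program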

I set up a cmd file with my program (Epic) because it takes some parameters prior to launch.  I then pointed my timer script at it.

The results (with SCS):

The columns are Time Completed, Duration (in ms), Window to check for, Command executed.

After doing a Net Stop AppvClient / Net Start AppvClient and then executing our AppV application, it takes 196 seconds to start the application.  After that initial launch it takes 12-15 seconds to start.  Something is really dragging our initial application launch time down.  I’ve found that if I stop/start the service, I need to do an add/publish via PowerShell (Add-AppvClientPackage and Publish-AppvClientPackage) for that application to get below the 196 seconds; this takes first launch down to 48 seconds.  This is how long it takes to start the same application after a system restart:

First launch time after a restart is 48 seconds, and subsequent launches are essentially identical to the stop/start of the AppV client service plus add/publish.  That makes sense, as our AppV5_Data_Precache script does an add/publish.  Evidently, we’re going to have to dig further into AppV to understand what’s causing it to take so long.  To start, I’m going to detail our package a bit.

The application I’m testing this with is Epic.  It’s a huge application.  AppXManifest is 72MB, FilesystemMetaData.xml is 1.7MB, Registry.Dat is 62MB.

It contains 22,000 files totalling around 2GB in size.

When AppV is “launching” the application for the first time, it consumes memory and CPU for the entire 196 seconds, peaking at nearly 600MB of RAM and 50% CPU (though most of the time it sits at around 25% CPU).

AppV utilization before application launch

 

Start of application launch

 

Peak during launch
Application launched

The AppV debug logs do not give a whole lot of info as to what AppVClient.exe is doing during this time.  Most of the logs show the application “start” as the components are set up, and then the point when the application has launched.  Almost all of the logs cover the first second or two of application launch and the last second or two before the GUI appears.

I launched the application at 12:36:24; it finally displayed the GUI at 12:39:38.

The only log that shows data during the entire time is the SHARED PERFORMANCE log.  Unfortunately, the log is undecipherable to me.

Lots of PreCreate, PreCleanup, PreAcquireForSection with no relevant data.

Perfmon.exe doesn’t do a whole lot better with large gaps between file/process/network accesses:

What is it doing between 1:17:41 and 1:18:29?  CPU is pegged but no disk activity

Showing Registry accesses also shows huge gaps between the AppVClient.exe process accessing the system.

Registry information still shows huge gaps in time where the AppVClient.exe is processing

So I’m not sure what the hold up is with regard to the delay for this application.  None of the usual tools I use to monitor performance is giving me any hints or indications of why it’s delaying launch.

First launch delay when stopping/restarting AppV service.
Total time is 221s for the first launch, 15s for subsequent launches

 

 

First launch delay when starting application after a full system restart.
Total time is 38s for first launch then 13s for subsequent launches.

 


AppV 5 – Formatted volume allocation size matters

2014-08-15

After discovering that AppV 5 configures an allocation size independent of the file system beneath it, I explored the impact of different formatted allocation sizes on AppV packages; specifically mounting AppV packages.

I took our AppV setup and set the PackageInstallationRoot to D:\AppVData\PackageInstallationRoot, then formatted the D: drive with different allocation sizes and mounted the AppV package.  I timed how long it took to mount the package over 4 runs per allocation size, recorded the AppV file allocation size, and recorded the total, actual size of the package on the drive.  Package details:

The results:

Different Allocation Sizes

 

Actual Used Space vs. Allocation size

 

Actual AppV 5 file allocation size vs. formatted allocation size

 

AppV 5 Mount Time vs. formatted allocation size

I would propose using 64KB allocation sizes for the AppV volume if possible (note: this only applies to fully mounted packages).  There does appear to be some benefit to using larger allocation sizes.  One of the main ones is that the properties of the PackageInstallationRoot now more accurately reflect the space actually consumed on the file system on Windows 2008 / Windows 7.  Another is a speed improvement around the 4KB allocation size, with minor gains beyond that.
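For what it’s worth, formatting the package volume with 64KB clusters is a one-liner on Server 2012 / Windows 8 and later via the Storage module (older systems can use the classic "format D: /FS:NTFS /A:64K" instead); the drive letter and label below are examples:

Format-Volume -DriveLetter D -FileSystem NTFS -AllocationUnitSize 65536 -NewFileSystemLabel "AppVData"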


AppV 5 Shared Content Store performance profiled vs. local disk

2014-06-10

We are looking to utilize AppV 5’s Shared Content Store, and one of the things I was interested in was what kind of overhead network delivery may have vs. local disk.  What I have is a physical server with 3x300GB 10,000 RPM SAS drives in a RAID-5 vs. a CIFS share on SSD.  The catch with the CIFS share is that it is 300 kilometers (about 200 miles) away in another city.  I used this share as I don’t have another share local to the physical box; this brings the performance of the share down, but it’s still faster than the disk.  The IOMeter readings for the Shared Content Store and the local disk hosting the PackageInstallationRoot are:

 

IOMeter settings were 100% read with 4KB transfer sizes.

The CIFS share is 2x faster on average, pulling 2,803 IOPS vs. 1,324.  Bandwidth of the CIFS share is higher as well: 11.48MB/s vs. 5.43MB/s for the local disk.

To test the performance I did the following: I loaded an AppV 5 application to disk, then wrote an AutoIt script to get the start time, launch the program, wait for the login screen and, once the login screen is visible, get the finish time.  I then stopped and started the AppV client service (this ensures nothing is cached in RAM) and ran the AutoIt script again.  I did this thirty times.  I then turned on Shared Content Store (SCS) mode, loaded the package in that manner and retested (the test loop is sketched below, after the cold-start numbers).  The results (average of all 30 runs):

Local Disk: 9.6s
SCS: 9.7s

For the 30 runs the Local Disk deviated +/- 0.2s from average so the range was from 9.4s to 9.8s.

For the SCS the deviation was much bigger, which is probably to be expected when you’re pulling from a file share 6ms away across a shared pipe.  Deviation on the network was +/- 1.85s, with a range from 8.2s to 11.9s.  Amazingly, there were numerous runs where the network bested the local disk, coming in under 9.4s for launch time.  I’m sure if the SCS share were local we’d get more consistent and probably even faster performance; a server local to the share gets about 8x better IOMeter numbers.
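For reference, the cold-launch loop described above boils down to something like the following sketch (assuming a window-polling timer script such as the Time-AppLaunch.ps1 sketch from the first-launch post above; the launcher path and window title are examples, and the actual harness was AutoIt):

# 30 cold launches: restart the App-V client service between runs so nothing is cached in RAM
1..30 | ForEach-Object {
    Restart-Service -Name AppVClient -Force
    .\Time-AppLaunch.ps1 -Program 'C:\scripts\launch-app.cmd' -WindowTitle 'Login' | Add-Content .\launch-times-localdisk.csv
}

# switch the client to Shared Content Store mode, then re-add/publish the package and repeat the loop
Set-AppvClientConfiguration -SharedContentStoreMode 1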

Now, how about when the application has already been launched so files are cached?  Results (average of all 30 runs):

Local Disk: 4.8s
SCS: 4.8s

Local Disk deviation/range: +/- 1.4s, range 3.5-6.3s
SCS deviation/range: +/- 1.85s, range 3.4s-7.1s

Again, the SCS has a bigger range but again pulls ahead of the local disk numerous times.

A curious thing I found was that after the 14th run in the loop the times improved greatly:

I’m not sure if that’s the application or AppV having some caching structure, but I’m leaning towards AppV/Windows doing something with the file cache.  I’m unsure how to prove this out.


Time how long a WMI filter call takes

2014-02-25

The following command will time how long a WMI filter call will take to execute on your server/PC.
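A sketch of the idea: wrap the WQL query that your WMI filter uses in Measure-Command (the query below is a sample filter-style query, not necessarily the original command):

Measure-Command {
    Get-WmiObject -Query "SELECT * FROM Win32_OperatingSystem WHERE Version LIKE '6.1%'"
} | Select-Object TotalMilliseconds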

 


Slow PowerShell CSV reading and object generation

2013-12-17

I am attempting to read a CSV file and decided the best way to do it was with PowerShell’s native import-csv tool.  What was required was to read this CSV and then generate a registry file for import into numerous computers.  This was my result:

The CSV file we have has about 1,500 lines in it.  To generate the registry file utilizing this method took 6 minutes and 45 seconds.  This is unacceptably slow.  I then started googling ways to speed up this processing and came across this article:
http://stackoverflow.com/questions/6386793/how-to-use-powershell-to-reorder-csv-columns

There, Roman Kuzmin suggested handling the file as a text file instead of as PowerShell objects.  The syntax used to convert the file into values that can be substituted into text is a bit different, but I decided to explore it.  His example code is as follows:

Essentially, he is proposing reading the file line by line and extracting the data by manually splitting the text string.  The split string becomes an array you can use for substitution.  This is my final code using his example:
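A reconstruction sketch of that approach (the CSV columns, registry path and value names here are made up; the original slow version did the same work with Import-Csv and per-object property access):

$lines  = Get-Content .\computers.csv          # read the CSV as plain text
$output = New-Object System.Text.StringBuilder
[void]$output.AppendLine('Windows Registry Editor Version 5.00')

foreach ($line in $lines | Select-Object -Skip 1) {   # skip the header row
    $fields = $line -split ','                        # manual split instead of Import-Csv
    [void]$output.AppendLine('')
    [void]$output.AppendLine("[HKEY_LOCAL_MACHINE\SOFTWARE\MyApp\$($fields[0])]")
    [void]$output.AppendLine("""Setting""=""$($fields[1])""")
}

$output.ToString() | Set-Content .\import.reg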

 

The total time for this?  0.07 seconds.  Incredibly fast.  So, utilize the Import-Csv command and object-based processing with caution: even a moderately sized file can take an unacceptable amount of time.


Testing Windows Storage Spaces Performance on Windows 2012 R2

2013-08-08

Windows Storage Spaces parity performance on Windows Server 2012 is terrible.  Microsoft’s justification for it is that it’s not meant to be used for anything except “workloads that are almost exclusively read-based, highly sequential, and require resiliency, or workloads that write data in large sequential append blocks (such as bulk backups).”

I find this statement a bit amusing because trying to back anything up at 20MB/sec takes forever.  If you set up a Storage Spaces parity volume at 12TB (available space) and you have 10TB of data to copy to it just to get it going, it will take you roughly 8,738 minutes, or 145 hours, or 6 straight days.  I have no idea who thought anything like that would be acceptable.  Maybe they want to adjust their use case to volumes under 1GB?

Anyways, with 2012 R2 there may be some feature enhancements, including new features for Storage Spaces: tiered storage and write-back caching.  These allow you to use fast media like flash as a staging ground so writes complete faster, and the fast media can then transfer that data to the slower storage at a more convenient time.  Does this fix the performance issues in 2012?  How does the new dual parity perform?

To test, I made two VMs: one a generic 2012 and one a 2012 R2.  They have the exact same volumes, 6x10GB volumes in total.  The volumes are broken down into 4x10GB volumes on a 4x4TB RAID-10 array, 1x10GB volume on a 256GB Samsung 840 Pro SSD and 1x10GB volume on a RAMDisk (courtesy of DataRAM).  Performance for each set of volumes is:

4x4TB RAID-10 -> 220MB/s write, 300MB/s read
256GB Samsung 840 Pro SSD -> ~250MB/s write, 300MB/s read
DataRAM RAMDisk -> 4000MB/s write, 4000MB/s read

The Samsung SSD volume has a small sequential write advantage, and it should have a significant seek advantage as well; since the volume is dedicated on the Samsung, it should be significantly faster, as you could probably divide by 6 to get the individual performance of the volumes sharing the single RAID array.  The DataRAM RAMDisk drive should crush both of them for read and write performance in all situations.  For my (admittedly weak) testing, I only tested sequential performance.

First thing I did was create my storage pool with my 6 volumes that reside on the RAID-10.  I used this powershell script to create them:
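A minimal sketch of such a script using the Storage cmdlets (pool and disk names are made up, and ResiliencySettingName was switched between Simple, Parity and Mirror for the different tests):

$disks = Get-PhysicalDisk -CanPool $true
$sub   = Get-StorageSubSystem
New-StoragePool -FriendlyName "TestPool" -StorageSubSystemFriendlyName $sub.FriendlyName -PhysicalDisks $disks
New-VirtualDisk -StoragePoolFriendlyName "TestPool" -FriendlyName "TestDisk" -ResiliencySettingName Simple -UseMaximumSize
Get-VirtualDisk -FriendlyName "TestDisk" | Get-Disk | Initialize-Disk -PartitionStyle GPT -PassThru |
    New-Partition -AssignDriveLetter -UseMaximumSize | Format-Volume -FileSystem NTFS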

Then I created a striped (simple) disk to determine the maximum performance among my 6 volumes.  I mapped a drive to my DataRAM RAMDisk and copied a 1.5GB file from it using xcopy /j.

Performance to the stripe seemed good.  About 1.2Gb/s (150MB/s)

I then deleted the volume and recreated it as a single parity drive.

Executing the same command xcopy /j I seemed to be averaging around 348Mb/s (43.5MB/s)

This is actually faster than what I remember getting previously (around 20MB/s) and this is through a VM.

I then deleted the volume and recreated it as a dual parity drive.  To get the dual parity drive to work I actually had to add a 7th disk; neither 5 nor 6 would work, as it would tell me I lacked sufficient space.

Executing the same command xcopy /j I seemed to be averaging around 209Mb/s (26.1MB/s)

I added my SSD volume to the VM and deleted the storage spaces volume.  I then added my SSD volume to the pool and recreated it with “tiered” storage now.

When I specified the SSD as the tiered storage, it removed my ability to create a parity volume, so I created a simple volume for this testing.

Performance was good.  I achieved 2.0Gb/s (250MB/s) to the volume.

With the RAMDisk as the SSD tier I achieved 3.2Gb/s (400MB/s).  My 1.5GB file may not be big enough to ramp up to the maximum speed, but it works.  Tiered storage makes a difference, though I didn’t try to “overfill” the tiered storage section.

I wanted to try the write-back cache with the parity to see if that helps.  I found this page that tells me it can only be enabled through PowerShell at this time.
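The key is the WriteCacheSize parameter on New-VirtualDisk; a sketch (pool/disk names and the 1GB size are examples, not the values used in this test):

New-VirtualDisk -StoragePoolFriendlyName "TestPool" -FriendlyName "ParityDisk" -ResiliencySettingName Parity -UseMaximumSize -WriteCacheSize 1GB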

I enabled the write cache with both my SSD and RAMDisk as part of the pool, and the performance I got for copying the 1.5GB file was 1.8Gb/s (225MB/s).

And this is on a single parity drive!  Even though the copy completed quickly, I could see in Resource Monitor that the copy to the E: drive did not stop; after hitting the cache at ~200MB/s it dropped down to ~30-45MB/s for several seconds afterwards.

You can see xcopy.exe is still going but there is no more network activity.  The total is in Bytes per second and you can see it’s writing to the E: drive at about 34.13MB/s

I imagine this is the ‘Microsoft Magic’ going on where the SSD/write cache is now purging out to the slower disks.

I removed the RAMDisk SSD to see what impact it may have if it’s just hitting the stock SSD.

Removing the RAMDisk SSD and leaving the stock SSD I hit about 800Mb/s (100MB/s).

This is very good!  I reduced the writecache size to see what would happen if the copy exceeded the cache…  I recreated the volume with the writecachesize at 100MB

As soon as the write cache filled up it was actually a little slower than before, 209Mb/s (26.1MB/s).  100MB just isn’t enough to help.

100MB of cache is just not enough to help

Here I am now at the end.  It appears tiered storage only helps mirrored or striped volumes.  Since those are the fastest volumes anyway, the benefits aren’t as high as they could be.  With parity drives, though, the write cache setting has a profound impact on the initial performance of the system.  As long as whatever fills the cache has enough time to purge to disk in between, you’ll be OK.  By that I mean: without an SSD present and the write cache at its default, a 1GB file will copy over at 25MB/s in 40 seconds.  With a 100MB SSD cache present it will take 36 seconds, because once the cache is full it is bottlenecked by how fast it can empty itself.  Even worse, in my small-scale test, it hurt performance by about 50%.  A large enough cache probably won’t encounter this issue as long as there is sufficient time for it to clear.  It might be worthwhile to invest in a good UPS as well: if you have a 100GB cache that is near full and the power goes out, it will take about 68 minutes for the cache to finish dumping itself to disk.  At 1TB worth of cache you could be looking at 11.37 hours.  I’m not sure how Server 2012 R2 deals with a power outage on the write cache, but since it’s part of the pool I imagine on reboot it will just pick up where it left off…?

Anyways, with Storage Spaces I do have to give Microsoft kudos.  It appears they were able to come close to doubling the single parity performance, to ~46MB/s.  Dual parity is at about 26MB/s in my test environment.  With the write cache everything is extremely fast until the cache becomes full; after that it’s painful, so it’s very important to size your cache appropriately.  I have a second system with 4x4TB drives in a mirrored storage pool configuration.  Once 2012 R2 comes out I suspect I’ll update to it and change my mirror into a single parity with a 500GB SSD cache drive.  Once that happens I’ll try to remember to retest these performance numbers and we’ll see what happens 🙂
