Performing reliable I/O tests on a storage system is something that can turn into an art. Sometimes it is the art of defining trustworthy and repeatable methods, but sadly more often it is the art of configuring ad-hoc tests to make the measurement tool say what the vendor wants it to say. Faking I/O tests is one of the easiest tasks: often you only need to omit certain parameters from the published results, such as latency or block size, to make them look completely different. You only need to use a 512 byte block size to skyrocket your IOPS, even if you know no “real” application will use that block size, or you can publish your results while omitting the huge latency you suffered while running those tests…
There are some professional tools for reliable tests, like TPC-C or SPECsfs. They are really powerful, and most of all they offer repeatable tests regardless of the storage they are run against. They would be almost perfect, but they are also really expensive; in fact even many vendors use them only on their high-end storage arrays. For us simple users, they are out of reach.
In an “amateur” situation instead, one of the most common tools is without doubt IOmeter. It’s really easy to use, and it allows you to run tests really quickly. However, its tests are far from “real”: it’s easy to measure the maximum performance of a storage system, but not to check real performance in production scenarios. It’s like a drag race compared to timing a lap on a circuit: a dragster is the fastest way to win a drag race, but it’s not at all the best solution for driving around.
Lately IOmeter has become even more unreliable, especially since many storage systems use caching or SSDs, or when you are using a server-side caching solution. IOmeter is not able to create “hot spots”, as Howard Marks has clearly explained here. Once it has created the test file, a few GBs in size, IOmeter reads and writes evenly across the whole file. There is no way to have a “new” data block that the caching system has never seen before, so it cannot simulate a “read miss”.
Another solution, often quoted in VMware environments, is VMmark. It was created directly by VMware, and it configures several virtual machines executing different applications (Exchange Server, web server, application servers…) running some common workloads. It is certainly realistic, really close to a proper production environment, but its configuration is a problem: it’s really cumbersome and time consuming, and it uses a large number of virtual machines.
In order to work around all those problems, I chose to follow a different path and created my own solution. I make no claim that it is the best one, but I think it is a simple way to run reliable tests, and most of all it creates easily repeatable tests, so you can then compare the results.
In my lab I’m using a NetApp FAS2020 storage array; you can read the details of my lab in this dedicated page. I wanted to have a starting point, so I ran different tests on my storage. In the future, this will be my baseline for comparisons.
The Virtual Machine
My solution is based on a single virtual machine running Microsoft Windows Server 2008 R2 Standard; it has 4 vCPUs configured as 1 socket with 4 cores, 4 GB RAM and a 30 GB thick disk, plus a secondary 350 GB disk used to run the tests. I need that many vCPUs and a large disk in order to create enough data and I/O to saturate the several caches of the storage array and the SSDs used for caching inside ESXi; otherwise data is never read from or written to the disks, and the final results are too high, and not true.
Virtual hardware is version 9, and both the operating system and the VMware Tools are updated to November 2013. In order to always get the same results, I’m not applying any further updates.
Finally, I exported this VM as an OVA file, so I can deploy it in other environments.
FIO
I discovered FIO (Flexible IO tester) thanks to a suggestion from one of my friends at Fusion-IO (funny enough, both the tool and the company have the same abbreviation… :D). Even if it is far less known, after testing it I can say for sure it’s way better than IOmeter. First of all, its code is continuously updated; when I wrote this article the latest version was 2.1.2, released on 7th August 2013. Think about it, the last IOmeter version was released in 2006… Also, you can run it on both Linux and Windows; for the latter there is a port of the previous version, 2.1.1, and you can get it here.
FIO has some interesting configuration options: thanks to them you can generate multiple binary data files, mix the I/O among them and access all of them randomly, in order to create as much entropy as possible and ultimately really stress the storage.
FIO is used via the command line, and you can save all the parameters in a configuration file; you then use this configuration file by running the command “fio config_file“. I ran several tests before finding a good configuration; you can use my configuration files as a starting point to develop your own tests. These configuration files were created for the Windows version; if you want to use them on Linux you need to update them. You can also change the IO depth and the number of parallel jobs to see how the storage reacts to those changes. Here are my files (change the extension to .fio before using them); right after the list you will find a sketch of what one of these job files looks like:
FIO Max Real I/O: 100% read, 100% sequential, block size 8k, IO depth 32, 16 jobs
FIO Max Bandwidth: 100% read, 100% sequential, block size 1M, IO depth 32, 16 jobs
FIO Real Life Test: 80% read, 100% random, block size 8k, IO depth 32, 16 jobs
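Since the original files are attached as downloads, here is a minimal sketch of what the “Real Life Test” job file might look like in FIO’s job-file syntax. The directory, file sizes and runtime are placeholder values I’m assuming for illustration, not my exact original settings, so adapt them to your own disk layout (on Linux you would switch the I/O engine to libaio and use Linux-style paths):

# real-life.fio - sketch of the "Real Life Test" profile described above
[global]
# Windows async I/O engine; on Linux use ioengine=libaio
ioengine=windowsaio
# bypass the OS cache so the I/O really reaches the storage
direct=1
# 100% random access, 80% reads / 20% writes, 8k block size
rw=randrw
rwmixread=80
bs=8k
# queue depth and number of parallel jobs, as in the tests above
iodepth=32
numjobs=16
# spread each job across several data files, picked at random,
# to create as much entropy as possible
nrfiles=8
file_service_type=random
# where to create the test files: the secondary test disk (placeholder path)
directory=e\:\fio
# data per job and test duration (placeholder values)
size=4g
time_based
runtime=600
# report the 16 jobs as one aggregated result
group_reporting

[real-life-test]

The other two profiles only differ in the access pattern and block size (rw=read for the sequential tests, bs=1m for the bandwidth test), so you can derive them from the same skeleton.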
With these configurations, my NetApp FAS2020 reached a Max I/O of 12,717 IOPS and a Max Bandwidth of 199.57 MB/s, while the Real Life Test gave me 2,800 IOPS, with 22.40 MB/s of bandwidth and 181 ms of average latency.
Also, just for fun, I tried a totally silly test, that is the overall maximum IOPS, by configuring the block size at 512 bytes. My NetApp made slightly more than 23,000 IOPS. As you can see, by simply setting the block size to 8k (a more realistic value) IOPS fell to 12,717. Here is another example of why we need to run meaningful tests.
JetStress
JetStress is the official Microsoft tool to simulate Exchange Server workloads. Compared to the other available tool (LoadGen), this one does not need a complete Exchange Server installed and configured in order to run the tests. I chose the 2010 version even if 2013 is already available, since 2010 is much more widespread, so the test is much more interesting. You can follow this tutorial in order to install and configure JetStress.
Once JetStress is installed, it’s really simple and easy to use. You start the graphical version of the program (using the “Run as Administrator” option) and choose to start a new test. You have several options; I created a performance type test, and you can run the same test I did by using these parameters:
The minimum duration of the test is 2 hours, and for the whole duration JetStress really simulates every possible activity of an Exchange Server. If you take a look at the log, you can see information like this:
Operation mix: Sessions 8, Inserts 40%, Deletes 20%, Replaces 5%, Reads 35%, Lazy Commits 70%.
As you can see, all activities are multi-threaded, and they are a mix of writes, updates, reads and deletes. Once the test is completed, the result looks like this one (I only removed some details that are not needed for the purpose of these tests):
Microsoft Exchange Jetstress 2010
Performance Test Result Report
Overall, it is a really reliable test. As you can see, even if the FIO result was 2,800 IOPS, JetStress only reached 1,013, simply because it is a much more “real” test.
HammerDB
For a long time I looked for a tool able to simulate a database server. Don’t count on Microsoft: they have a tool called SQLio, but it’s not related to SQL at all, it’s simply an I/O benchmark tool, just like IOmeter or FIO. They also have SQLioSim; this one is a proper SQL simulator but it does not run I/O tests, it’s more aimed at testing storage resiliency by introducing errors in the database, and furthermore its I/O pattern is too random and cannot be repeated, so tests are not comparable.
Same problem with Oracle: there is Orion, but in the end it is another I/O simulator, even if it’s dedicated to Oracle, and most of all the links on the vendor’s website do not work… There is an alternative called SLOB, but for several months the binaries have not been downloadable anymore and the author never replied on his blog about this problem… (UPDATE: the author of SLOB2, Kevin Closson, has commented on this post that he fixed the broken link; you can try his tool by going here).
In the end I chose HammerDB. It needs a database server installed in order to run the tests, but you can simply use one of the free versions of the supported databases, and follow their guide to install the most common database servers. Also, HammerDB is available for both Linux and Windows, so I was able to run everything inside my Windows VM.
Following their guide, I installed and configured PostgreSQL. In my opinion it’s a better choice than Microsoft SQL Server Express or Oracle Express since it does not have any limit on CPU or RAM usage: PostgreSQL is completely free and you can push it to the limits of the machine where it’s running.
Once you have installed the database server and HammerDB, you can start the tests. HammerDB can run a complete OLTP test based on the TPC-C specifications, and this is a huge advantage of this software: you get final results that you can then compare with other systems running TPC-C too. There are different guides helping you run a TPC-C test; I used this one, specifically written for PostgreSQL. By the way, this document is a great resource to learn about TPC-C tests.
To run the tests, HammerDB needs to be installed on a system other than the database server, so I used another machine in my lab to drive the tests, hosted on a different storage system, and I also separated the two VMs using vMotion.
Tests can be configured with different parameters. If, like me, you are using a VM with 4 vCPUs, you need to configure 100 Warehouses (5 for each vCPU, rounded up to the next 100: 4 × 5 = 20, so 100) and 10 Virtual Users, 1 for every 10 Warehouses. I’m not looking for the best possible performance in the TPC-C test, I only want to design a reusable configuration that can be run in every situation.
The final goal of TPC-C is to evaluate the performance of a database server when running transactions. HammerDB gives us back a value called TPM, that is Transactions Per Minute, together with the NOPM value, that is New Orders Per Minute. In fact, TPC-C simulates a database used for processing the orders of a large company…
My VM running on the FAS2020 reached 17,790 TPM and 7,845 NOPM, and as you can see in the graph, the controller cache had a hard time trying to cope with the high I/O stream reaching the storage. There is a peak value at 28,896 TPM, but also much lower values:
Final notes
Performance tests are a difficult art to master. You often risk getting results that only have value in the test environment you used to obtain them, especially if that lab is a “home lab” like mine. Bottlenecks like CPU, memory, disk controller or network are always lying in wait; they can distort the results and ultimately give readers wrong information. There are dedicated labs and companies that run these kinds of tests, and even they fail sometimes; so think about the overall value of results coming out of a home lab.
Because of these reasons, don’t take my results, or those of other bloggers, as ultimate truths, even if some of them will try to convince you of the opposite. Rather, take my tests as a starting point and create YOUR own tests; just be sure they can be repeated on several different systems at different times.