Advanced X-Ray: reducing runtime by re-using VMs.

Published: October 5, 2020 (Updated: July 14, 2021) in X-Ray, benchmarking, xray by gary.

How to speed up your X-Ray benchmark development cycle by re-using (recycling) benchmark VMs and, more importantly, their data sets.

Problem: For large datasets, creating the data on disk can be time-consuming.

Consider a cluster where we wish to write 2TB per node and inter-node bandwidth is 10GbE per node. Assuming we have enough storage bandwidth, the throughput is wire-bound. At roughly 1GB/s of usable network throughput, our pre-fill stage takes about 2,000 seconds, which is over half an hour. Waiting half an hour before our modeled workload can start can be a bit frustrating.
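The arithmetic behind that estimate can be sketched in a few lines of Python. The function name and the 80% link-efficiency figure are assumptions made for illustration: 80% is simply the value that reproduces the 2,000 second figure above, not a measured number.

```python
def prefill_seconds(bytes_per_node, link_gbit_per_s, efficiency=1.0):
    """Seconds needed to push bytes_per_node over one link.

    Assumes the network is the only bottleneck (storage keeps up).
    `efficiency` is a hypothetical derating factor for protocol overhead.
    """
    usable_bytes_per_s = link_gbit_per_s * 1e9 / 8 * efficiency
    return bytes_per_node / usable_bytes_per_s


TB = 1e12  # decimal terabyte

line_rate = prefill_seconds(2 * TB, 10)      # ideal 10GbE, 1.25 GB/s
derated = prefill_seconds(2 * TB, 10, 0.8)   # assumed ~1 GB/s usable

print(f"line rate: {line_rate:.0f} s ({line_rate / 60:.0f} min)")
print(f"derated:   {derated:.0f} s ({derated / 60:.0f} min)")
```

Either way the pre-fill is close to half an hour per run, which is the cost that re-using the VMs avoids.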

Even scenarios with relatively small working-set sizes can benefit from separating provisioning from workload execution. When we re-use VMs we avoid the cost of cloning, booting, establishing IP addresses and so on. This can help a lot when iteratively developing a scenario.

Solution: Re-use the VMs between multiple tests/scenarios.

A typical approach is to split the scenario into two parts:

Part 1 – An X-Ray scenario that clones, creates and boots the VMs and disks, and populates the data.
- This “Deployment Scenario” will be run once.
Part 2 – An X-Ray scenario that contains the main workload. It re-uses the VMs deployed in Part 1.
- To connect the VMs across the two separate scenarios we use the concept of Curie VM IDs.

Important things to remember

- Remove the teardown step from the “Deployment” scenario.
- Remove the cleanup step from the “Workload” scenario.
- Any “standard” scenario that includes a teardown or cleanup step will wipe out the VMs, even if they are using an “id”.
- Ensure that the “Curie VM id” is the same in the Deployment and Workload scenarios.
- Ensure that a unique RunID is generated each time the workload scenario is run, and that the workload identifier uses the RunID.
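Two of these checks are easy to automate before submitting a pair of scenarios. The sketch below is not part of X-Ray: `scenario_id` and `check_pair` are hypothetical helpers, and the scenario text is a trimmed-down stand-in. It works on the raw text with a regular expression because the files contain Jinja markup and may not parse as plain YAML. It does not check for teardown or cleanup steps.

```python
import re

# Matches a top-level "id: <value>" line (no leading indentation).
ID_PATTERN = re.compile(r"^id:\s*(\S+)\s*$", re.MULTILINE)


def scenario_id(text):
    """Return the top-level id of a scenario file, or None if absent."""
    match = ID_PATTERN.search(text)
    return match.group(1) if match else None


def check_pair(deployment_text, workload_text):
    """Return a list of problems found; an empty list means both checks pass."""
    problems = []
    dep_id = scenario_id(deployment_text)
    wl_id = scenario_id(workload_text)
    if dep_id is None or wl_id is None:
        problems.append("both scenarios need a top-level id")
    elif dep_id != wl_id:
        problems.append(f"id mismatch: {dep_id} vs {wl_id}")
    if "{{ runid }}" not in workload_text:
        problems.append("workload scenario never references {{ runid }}")
    return problems


# Hypothetical, heavily abbreviated scenario files.
deployment = "name: deploy_step\nid: 255\n"
workload = (
    "name: workload_step\n"
    "id: 255\n"
    "workloads:\n"
    "  SRV_VIRT_WLOAD {{ runid }}:\n"
    "    vm_group: SERVA\n"
)

print(check_pair(deployment, workload))
print(check_pair(deployment, workload.replace("id: 255", "id: 7")))
```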

Example

We create an ID that ties the set of VMs created in one X-Ray test to subsequent X-Ray tests. In this case I use the number 255, and I will use the same “tag” (id: 255) every time I want to re-use these VMs. Then we create a “runid” to make each re-execution of the same scenario on the same VMs unique. Without this step, when we try to re-use the VMs the scenario will exit immediately or behave strangely in some way. Last, ensure that the runid is used as part of the workload name, and in the call to that workload in the run step.

name: server_virt_simulator_setup_step
display_name: "Server Virtualization simulator setup"
summary: Server virtualization 75 Per node.
id: 255

estimated_runtime: {{ _estimated_runtime }}
{% set runid = range(1,9999999999)|random %}
presets:
  small:
    vars:
      _estimated_runtime: 3600
      _node_selector: ":n"
      _iops_expected_value: 40
...
workloads:
  SRV_VIRT_WLOAD {{ runid }}:
    vm_group: SERVA
    config_file: {{ workload_file }}
...
run:
  - workload.Start:
      workload_name: SRV_VIRT_WLOAD {{ runid }}

Now I can run my first workload, which will set up and prefill the VMs for me. Then I can run my experimental workload(s) on the same set of VMs without having to wait for VMs to be cloned, prefilled and powered on.
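To see why the runid has to appear in both places, here is a rough Python model of what the template does on each run. It is only a sketch: `new_runid` and `render_scenario` are hypothetical names, and the dictionary is a simplified stand-in for the rendered scenario. It assumes the `random` filter picks one element of the range, which is my understanding of how Jinja2 behaves.

```python
import random

# Same bounds as the template's range(1,9999999999).
RUNID_RANGE = range(1, 9999999999)


def new_runid(rng):
    """Pick one value from the range, once per rendering of the scenario."""
    return rng.choice(RUNID_RANGE)


def render_scenario(runid):
    """Build a simplified stand-in for the rendered scenario.

    The runid is chosen once and re-used, so the name defined under
    `workloads` and the name passed to workload.Start always match.
    """
    name = f"SRV_VIRT_WLOAD {runid}"
    return {
        "id": 255,  # fixed: ties this run to the VMs deployed earlier
        "workloads": {name: {"vm_group": "SERVA"}},
        "run": [{"workload.Start": {"workload_name": name}}],
    }


rng = random.Random()
for _ in range(2):
    scenario = render_scenario(new_runid(rng))
    started = scenario["run"][0]["workload.Start"]["workload_name"]
    # The run step must refer to a workload that was actually defined.
    assert started in scenario["workloads"]
    print(scenario["id"], started)
```

The scenario id stays at 255 across runs while the workload name changes, so each re-execution is distinct but targets the same VMs.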

So, why don’t we do this by default…?

Normally in X-Ray the VMs are created from scratch every time. This saves us from having to remember how many disks of what size are in use or how many CPUs are allocated, because those variables are embedded in the test. Put another way, since every aspect of the test is (a) created every time and (b) documented in the test, I never have to keep track of anything about the worker VMs. The worker VMs are ephemeral and exist only for the duration of the test, just like in a microservice pattern. This kind of idempotency is the key to making benchmarking as code a reality.

…and why you might want to do it despite the drawbacks…

However, particularly during the research phase, a benchmark developer might want to optimize for iteration velocity over test hygiene. That is why re-using X-Ray test VMs is not the default behavior, but it is possible.

Prism Output

You can observe the change in naming from the Prism UI. The default X-Ray worker VM naming scheme is:

__curie_test_<random_number>_<vm_group_name>_<index>

When we want to re-use the VMs created in one test in subsequent tests, we need to tell X-Ray which VMs to use (there may be hundreds of VMs on the system and potentially several called __curie_test_<something>).

This is what the VM name looks like with the re-use feature. Rather than generating a random number, I specify my own ID, in this case 255.

With re-use, Prism will show the X-Ray worker VMs with names like:
__curie_test_<my_id>_<vm_group_name>_<index>
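If you export a list of VM names from the cluster, the naming scheme makes it easy to pick out the VMs that belong to a given ID. The sketch below assumes the name format is exactly as shown above. The helper names, the sample VM names and the zero-based index are all made up for illustration.

```python
def worker_vm_name(test_id, vm_group, index):
    """Build the expected worker VM name for a given id, group and index."""
    return f"__curie_test_{test_id}_{vm_group}_{index}"


def vms_for_id(vm_names, test_id):
    """Return the names that belong to test_id.

    The trailing underscore in the prefix keeps id 255 from also
    matching ids such as 2551.
    """
    prefix = f"__curie_test_{test_id}_"
    return [name for name in vm_names if name.startswith(prefix)]


# Hypothetical inventory: two re-usable VMs, one from another test,
# and one unrelated VM.
inventory = [
    "__curie_test_255_SERVA_0",
    "__curie_test_255_SERVA_1",
    "__curie_test_2551_SERVA_0",
    "some_other_vm",
]

print(worker_vm_name(255, "SERVA", 0))
print(vms_for_id(inventory, 255))
```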

Acknowledgements:

Thanks to Bob Allegreti and Bill Eubanks for blazing the trail on this technique.