Analyzing benchmark results with BM
Benchmarking is an important concept for evaluating the performance of computer systems. It can be considered essentially a three-step process:
- Step 1
- Define the benchmarks.
- Step 2
- Execute the benchmarks and produce numbers.
- Step 3
- Analyze the benchmark numbers.
Each step has its own challenges: Step 1 is about writing programs that test sufficiently interesting/relevant parts of the target product. In Step 2 one needs to define what resource to measure and ensure that the measurements are as reliable as possible in various software/hardware contexts (note that QTestLib provides some help here through the
-median option and by supporting callgrind). Step 3 is concerned with "higher level" analysis problems like how to detect performance regressions as reliably as possible, how to compare different implementations of a particular algorithm and so on.
This article introduces BM, a tool that attempts to solve some of the problems that may arise in Step 3. In particular we focus on how potential performance regressions can be detected. Currently the idea is to evaluate this tool for our internal QA needs, but we hope that others may find it interesting too and give us some feedback.
Benchmark results and execution context
BM is based on QTestLib concepts: First, a benchmark is identified by the combination of a test case, a test function, and a data tag (optional). The benchmark should execute some part(s) of the target product. Second, a benchmark result (i.e. the result of executing a particular benchmark) is essentially a single, non-negative number.
A benchmark result is produced within a context made up of the following components:
- The resource used for measuring the benchmark.
- The compiler and OS used for building and executing the benchmark.
- The physical computer on which the benchmark was executed.
- The development branch (of the target product) tested by the executable benchmark program.
- The timestamp and ID of the current branch revision.
Note that the host may identify a virtual machine that may run on different physical hosts. In this case the physical hosts should be as identical as possible with respect to both hardware and software.
Time series as snapshots of revision history
By varying the snapshot component only, successive runs of a particular benchmark form a time series of benchmark results. The notion of a performance regression is defined with respect to this time series. The question to ask is essentially:
How likely is it that a modification to the product itself is affecting the performance of the benchmark in a negative way?
(It would of course also be interesting to know if the benchmark performance is affected in a positive way — BM makes no distinction between the two cases.)
Ideally, new benchmark results should be produced each time the revision history is advanced. In practice this may be hard to achieve if the time it takes to produce a new set of benchmark results exceeds the product update interval.
Note that any modification done to the host or platform of a particular time series (e.g. reconfigurations or upgrades) will in theory invalidate the history in the sense that it may not make sense to compare results before and after the modification. The BM tool currently has no support for this scenario.
Change and stability
For our purpose, the only really interesting property of a sequence of benchmark results is how the results change along the revision history. Moreover, it is normally the current (i.e. most recent) change that we're most interested in. Who cares about how the benchmark performance changed one year ago if some code that was submitted yesterday is likely to cause a bad performance regression?
Ideally, Step 2 (defined earlier) will produce reliable numbers completely free of "statistical effects", but for various reasons this cannot be guaranteed in a real life scenario. Since nobody wants to waste their time chasing "false positives", we need some mechanism to help us reason about the confidence. BM takes a heuristic approach to this problem by letting the user define a difference tolerance (DT) and a stability tolerance (ST):
- Two values are considered significantly equal if their relative difference does not exceed DT (i.e. if |(v1 - v2) / v1| ≤ DT)
- If vn is the last value in the time series, the current change (if any) is the latest value vc before vn that differs significantly from vn.
- A value v is considered stable if it is preceded consecutively by at least ST values that are significantly equal to v.
- The current change is stable if both vc and vn are stable.
The following figures show examples of stable and non-stable current changes (ST is 4 in all cases):
By sorting the benchmarks on their current changes (considering both sign and magnitude) and on stability, we hope to get a useful indication of likely candidates for performance regressions. Once a potential regression is detected, it may need additional manual verification before eventually pinpointing the code modification(s) that caused it.
The BM system consists of the following main components:
- a bmserver (written in Qt) for managing the database of benchmark results,
- a bmclient (written in Qt) for uploading data to and retrieving data from the bmserver, and
New benchmark results are uploaded to the bmserver by invoking the bmclient with the benchmark results (in a QTestLib XML output file) along with the execution context. Benchmark results already in the database may be retrieved directly through the bmclient (which can be useful for debugging), but normally it is better to use the web client. In the latter case the bmclient is invoked as a CGI program by a web server.
The following figure shows the client/server architecture with arrow directions indicating the main data flow:
The web interface is organized into three main sections:
- A platform summary section that shows overall statistics for each platform/branch combination.
- A benchmark summary section that shows more detailed statistics for the most interesting benchmarks of a particular platform/branch combination (the ranking criteria may be defined by the user). Other features in this section include detailed views of individual time series, comparing the latest results of two branches, and comparing results in a local file against results in the database.
- An overall settings section.
Here are some screenshots:
Note: BM is currently known to work properly with Firefox only (partly due to the use of the HTML canvas tag for graphics).
Status and plans
It goes without saying that a proper assessment of the usefulness of BM depends on the database containing a fair amount of production data. As we speak, the Qt benchmarks (under the /tests/benchmarks/ directory) are being run at regular intervals on a Linux platform.
We would like to invite anyone interested to give us feedback on the BM tool. Does this approach make sense? How can it be improved? Etc. The repository is here (and there's a README to get you started).
Subscribe to our newsletter
Try Qt 5.15 LTS Now!
Download the latest release here: www.qt.io/download.
Qt 5.15 was developed with a strong focus on quality and is a long-term-supported (LTS) release that will be supported for 3 years.
Check out all our open positions here and follow us on Instagram to see what it's like to be #QtPeople.
Näytä tämä julkaisu Instagramissa.
Want to build something for tomorrow, join #QtPeople today! We have loads of cool jobs you don’t want to miss! http://qt.io/careers #builtwithQt #software #developers #coding #framework #tool #tooling #C++ #QML #engineers #sales #tech #technology #UI #UX #CX #Qt #Qtdev #global #openpositions #careers #job
Henkilön Qt (@theqtcompany) jakama julkaisu