The Icon Bar: Adventures in Optimisation

Welcome to the first in a (very) infrequent series of articles about optimisation. In this article I'll be looking at audio mixing in C/C++ using GCC 4.7.4, running benchmarks of various routines and optimisations across multiple machines in order to work out what works well and what doesn't.

The initial code

The code I'm aiming to improve is EDuke32's AudioLib. Used by open-source versions of many classic 90's DOS games, this code has a long and storied history, and can be traced back to Jonathon Fowler's C conversion of Apogee's original assembler code.

The original Apogee code looks like it was optimised for playback of unsigned 8-bit sound samples. A volume-specific lookup table is used to convert each sample to signed 16 bit and apply volume scaling (with a healthy dose of self-modifying code for patching in the correct table/buffer addresses). For mid-nineties x86 hardware, this is probably the best approach available. And for ARM, table lookup is a pretty good approach too. But changes which have been made over the years have eroded the performance of the code, making it many times slower than the original. How slow exactly? Slow enough for audio mixing in my RISC OS port of NBlood to consume over 50% of CPU time on a 1GHz Cortex-A8. And that's in a build where I've told GCC to target the Cortex-A8 CPU & use VFP/NEON floating point.

My profiling results showed that 20% of the CPU time was spent in MV_Mix16BitMono, and 32% of CPU time was spent in the library function lrintf. Examination of the source, and what GCC's done (and failed to do) during optimisation, gives me some clues as to why performance is so bad:

SCALE_SAMPLE is a (templated) function which converts the input sample to a float (including applying the correct bias for unsigned data), multiplies it by a float volume scale (in 0-1 range), and then converts it back to the original (signed/unsigned) integer format.
When SCALE_SAMPLE converts the result back to integer form, it uses the Blrintf function, which on ARM calls through to lrintf. UnixLib's lrintf looks like its around 20 instructions long. For comparison, the x86 version of Blrintf is just a couple of inline/intrinsic assembler instructions.
SCALE_SAMPLE is used twice, meaning that we're (essentially) performing three multiplies per sample. But the multiplication factors are all constant for the duration of the loop, so all that's needed is one multiply using a scale factor that was calculated outside the loop.
GCC isn't smart enough to optimise the clamp function down to a SSAT instruction. So that's seven instructions instead of one instruction.
GCC isn't unrolling the loop

Additionally, the table-based volume scaling system will only be accurate for 8-bit sample data. For 16-bit samples, the practice of performing two table lookups (one for the top half and one for the bottom half) introduces significant inaccuracies.

The optimisations

Based on the above analysis, and it's easy to draw up a list of optimisations to try out:

Different algorithm implementations:
- An all-float solution: convert the input sample to a float, multiply it once with a precomputed scale factor, then convert to int (using a direct cast instead of Blrintf), accumulate, and store.
- An all-lookup table solution: remove SCALE_SAMPLE (producing an inner loop that's effectively identical to Jonathon's original), and instead use lookup tables which take into account the two extra volume scales.
- An all-integer solution: use fixed-point arithmetic to apply a single volume scale.
- Integer versions optimised for various ARM targets:
  - StrongARM RiscPC (ARMv3M target)
  - Iyonix (ARMv5TE target)
  - Raspberry Pi 1 (ARMv6 target)
  - NEON integer (BeagleBoard, Raspberry Pi 2+, iMX6, etc.) (ARMv7 + NEON target)
Different compiler optimisation settings:
- -O2 (default optimisation level used by NBlood)
- -O3
- -O3 -funroll-loops -fmodulo-sched -ftree-vectorize -ffast-math (O3 + loop unrolling + better unrolled loop scheduling + auto NEON vectorisation + IEEE-violating FP math optimisations)
Different compiler machine targets:
- Default CPU, software floating point
- StrongARM, software floating point
- XScale, software floating point
- Raspberry Pi 1, software floating point
- Raspberry Pi 1, VFP
- Cortex-A8, software floating point
- Cortex-A8, NEON
Take advantage of the fact that (for stereo 16bit output) MV_SampleSize is always 4 and MV_RightChannelOffset is always 2. I.e. it's a straightforward stream of 16 bit values with the left & right channels interleaved. (My hunch is that in the DOS era this may have not always been the case, but for current versions of AudioLib it looks like these are the only settings used)

That's a lot of combinations to build and test.

Also, note that although my original profiling highlighted that MV_Mix16BitMono was the main function being used (8-bit mono source to 16-bit mono dest), for this initial round of optimisation I ended up focusing purely on the more complex MV_Mix16BitStereo16 function (16-bit mono source to 16-bit stereo dest). I suspect this was actually a mistake where I just misread my notes and thought it was MV_Mix16BitStereo16 which was taking up all the time and not MV_Mix16BitMono. So later on I may need to do a second round of profiling to make sure that the conclusions I'm drawing from these tests are still valid for the 8-bit routines.

Multi-compilation using `make`

To simplify testing, I developed the following makefile which handles compiling each test program/source file for multiple different CPUs, and with different optimisation settings.

CPUS = soft fpa armv3m-soft armv3m-fpa armv5te-soft armv5te-fpa armv6-vfp armv6-fpa armv7-neon armv7-fpa
OPTS = O2 O3 On
 
GCC_soft = -mfpu=fpa -msoft-float
GCC_fpa = -mfpu=fpa -mhard-float -mlibscl -DNO_ALIGNED_MALLOC
GCC_armv3m-soft = -march=armv3m -mtune=strongarm $(GCC_soft)
GCC_armv3m-fpa = -march=armv3m -mtune=strongarm $(GCC_fpa)
GCC_armv5te-soft = -march=armv5te -mtune=xscale $(GCC_soft)
GCC_armv5te-fpa = -march=armv5te -mtune=xscale $(GCC_fpa)
GCC_armv6-vfp = -mcpu=arm1176jzf-s -mfpu=vfp -mfloat-abi=softfp
GCC_armv6-fpa = -mcpu=arm1176jzf-s $(GCC_fpa)
GCC_armv7-neon = -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp
GCC_armv7-fpa = -mcpu=cortex-a8 $(GCC_fpa)
 
GCC_O2 = -O2
GCC_O3 = -O3
GCC_On = -O3 -funroll-loops -fmodulo-sched -ftree-vectorize -ffast-math
 
SRCS = armv3m armv5te armv6 float integer justlut orig neon
 
# CPU+Optimisation combinations
CPUOPTS = $(foreach cpu,$(CPUS),$(addprefix $(cpu)_, $(OPTS)))
 
# Test + CPU + optimisation
DESTS = $(foreach src,$(SRCS),$(addprefix bin/$(src)_, $(CPUOPTS)))
 
GCCFLAGS = -Wall -static -std=gnu++11
 
empty =
space = $(empty) $(empty)
 
all:: $(DESTS)
 
$(DESTS):
  g++ $(GCCFLAGS) $(foreach word,$(wordlist 2,100,$(subst _,$(space),$@)), $(GCC_$(word))) -o $@ $(notdir $(word 1,$(subst _,$(space),$@)).cpp) -MMD -MF d/$(notdir $(basename $@))
  -elf2aif $(subst /,.,$@)
 
# Dependencies
-include d/*

Makefile for easy compilation with multiple compiler settings

The above makefile will take each of the source files (armv3m.cpp, integer.cpp, justlut.cpp, etc.) and produce multiple outputs (in a 'bin' folder), each using different compiler settings. E.g. the file orig_armv3m-fpa_O2 corresponds to orig.cpp, built with the GCC_armv3m-fpa and GCC_O2 compiler settings.

The key parts of the makefile are:

The DESTS (and CPUDESTS) variables which build up the list of output files, using the foreach and addprefix functions to iterate over the SRCS, OPTS, and CPUS variables.
The single $(DESTS) recipe, which uses subst to split $@ (the target filename) back up into its component parts, and uses it to determine the input filename ($(notdir ...) construct), and which compiler flags to use $(foreach ...).

The makefile also asks GCC to output dependency information, so that dependencies don't have to the btracked manually. (This will require you to manually create a 'd' folder before running it for the first time)

Using the above makefile, my eight algorithm implementations get compiled to 240 binaries ready for execution. However, note that some of these will be invalid - e.g. NEON intrinsics compiled for an ARMv3 target. In these cases, #defines were used to detect the CPU architecture at compile time, allowing the code to compile cleanly and exit with a simple "unsupported" message at runtime.

With a few more helper scripts (including TSL and !NetTask to allow for remote execution over my home network), I had a setup where I could easily run all the programs on multiple machines and collect together the profiling results.

ARM optimisations in detail

Before we get to the results, here are the inner loops of the different ARM-optimised routines:

ARMv3M

int32_t * __restrict dest32 = (int32_t *) dest;
do
{
  int16_t const isample0 = B_LITTLE16(source[position >> 16]);
 
  position += rate;
 
  int32_t mix = *dest32;
  int32_t left,right;
  uint32_t magic = 0x80000000;
 
  int32_t const sample0 = isample0*leftvol;
  int32_t const sample1 = isample0*rightvol;
 
  asm("adds %0,%1,%2,lsl #16\n\tsbcvs %0,%3,#0" : "=r" (left) : "r" (sample0), "r" (mix), "r" (magic) : "cc");
  asm("adds %0,%1,%2\n\tsbcvs %0,%3,#0" : "=r" (right) : "r" (sample1), "r" (mix), "r" (magic) : "cc");
 
  *dest32++ = (((uint32_t)left)>>16) | (right & 0xffff0000);
}
while (--length);

ARMv3M (StrongARM) optimised mixing loop

Notes:

leftvol and rightvol are int32_t's containing the volume scaled to lie in the range 0-65536 (inclusive).
MV_RightChannelOffset == 2 and MV_SampleSize == 4 is exploited to allow dest to be treated as a word array, avoiding much of the slow byte-based halfword load/store that must be used on the RiscPC. This results in the left sample being in the low 16 bits of mix, and the right right sample in the high 16 bits.
Mixing and clamping is performed on 32bit values, using the ADDS & SBCVS sequence. This results in left and right containing 32bit output values, with only the top 16 bits being relevant.
For speed, the left channel value isn't masked out of mix when the right channel is being mixed & clamped - but the error this introduces is tiny.
You might be tempted to think that the ADDS instructions could be replaced with MLAS - but that won't work, because MULS/MLAS doesn't update the V flag.
Although I've labelled this code as ARMv3M, it'll actually run fine on any (32bit) ARM hardware

ARMv5TE

int32_t * __restrict dest32 = (int32_t *) dest;
do
{
  /* Volumes are negated and halved so that -32768 represents full volume */
  uint32_t arm_volumes = (-(leftvol >> 1)) & 0xffff;
  arm_volumes |= (-(rightvol >> 1)) << 16;
 
  int16_t const isample0 = B_LITTLE16(source[position >> 16]);
 
  position += rate;
 
  int32_t mix = *dest32;
  int32_t left,right;
  left = mix << 16;
  right = mix;
 
  int32_t left2,right2;
  asm ("smulbb %0,%1,%2" : "=r" (left2) : "r" (isample0), "r" (arm_volumes));
  asm ("smulbt %0,%1,%2" : "=r" (right2) : "r" (isample0), "r" (arm_volumes));
  asm ("qdsub %0,%1,%2" : "=r" (left) : "r" (left), "r" (left2));
  asm ("qdsub %0,%1,%2" : "=r" (right) : "r" (right), "r" (right2));
 
  *dest32++ = (((uint32_t)left)>>16) | (right & 0xffff0000);
}
while (--length);

ARMv5TE (Iyonix) optimised mixing loop

Similar to the ARMv3M code, dest is accessed as a word at a time
This packs both the left & right volumes into a single register, using the SMULxy instruction to perform signed 16x16 -> 32bit multiplication.
This also requires the volumes to be negated and halved, to avoid overflow when they're represented in 16bit form
QDSUB is used to mix and clamp, at the same time as undoing the negation and halving

ARMv6

do
{
  int16_t const isample0 = B_LITTLE16(source[position >> 16]);
 
  position += rate;
 
  int32_t left = *dest;
  int32_t right = *(dest + (MV_RightChannelOffset >> 1));
 
  /* left += (isample0 * leftvol) >> 16 */
  /* Of course, this prevents volume boosting */
  asm ("smlawb %0,%1,%2,%3" : "=r" (left) : "r" (leftvol), "r" (isample0), "r" (left));
  asm ("smlawb %0,%1,%2,%3" : "=r" (right) : "r" (rightvol), "r" (isample0), "r" (right));
  asm ("ssat %0,#16,%1" : "=r" (left) : "r" (left));
  asm ("ssat %0,#16,%1" : "=r" (right) : "r" (right));
 
  *dest = left;
  *(dest + (MV_RightChannelOffset >> 1)) = right;
  dest += MV_SampleSize >> 1;
}
while (--length);

ARMv6 (Pi 1) optimised mixing loop

This uses the ARMv5TE SMLAWx instruction together with the ARMv6 SSAT instruction
Since SMLAWB uses a 32bit accumulator, dest is accessed using halfwords
SSAT is the one-instruction replacement for clamp() that GCC failed to spot
There's some more scope for optimisation here (hard-code for MV_RightChannelOffset == 2 and MV_SampleSize == 4), so maybe some more work is needed

NEON

/* Volumes are negated and halved so that -32768 represents full volume */
/* Workaround bug in RISC OS GCC 4.7.4: vdup_n_s16, vset_lane_s16 with non-const values fail, so pack the values into a u32 instead */
uint32_t arm_volumes = (-(leftvol >> 1)) & 0xffff;
arm_volumes |= (-(rightvol >> 1)) << 16;
int16x4_t volumes = vreinterpret_s16_u32(vdup_n_u32(arm_volumes));
 
while(length >= 4)
{
  /* Load 4 samples */
  int16x4_t isample;
  isample = vld1_lane_s16(source + (position>>16), isample, 0);
  position += rate;
  isample = vld1_lane_s16(source + (position>>16), isample, 1);
  position += rate;
  isample = vld1_lane_s16(source + (position>>16), isample, 2);
  position += rate;
  isample = vld1_lane_s16(source + (position>>16), isample, 3);
  position += rate;
 
  /* Load and de-interleave dest buffer */
  int16x4x2_t mix = vld2_s16(dest);
 
  /* Volume scaling */
  int16x4_t sample0 = vqdmulh_lane_s16(isample, volumes, 0); /* (isample * volume * 2) >> 16 */
  int16x4_t sample1 = vqdmulh_lane_s16(isample, volumes, 1);
 
  /* Accumulate, using subtraction to counter the negated volume value */
  mix.val[0] = vqsub_s16(mix.val[0], sample0);
  mix.val[1] = vqsub_s16(mix.val[1], sample1);
 
  /* Interleave & store */
  vst2_s16(dest, mix);
 
  dest += 8;
  length -= 4;
}

NEON integer optimised mixing loop

Since the loop always processes four samples at a time, a second loop must be placed after it to deal with any remaining samples.
It uses the same 'negate and halve' approach as the ARMv5TE version. But with NEON, we have to do the doubling in the multiply instruction instead of in the mix & clamp. This means there'll be some unnecessary clamping if you have a sample with value -32768 and you're playing it back at full volume (-32768). But that should be minor enough to not cause any problems.

The results

For testing the performance of the routines, I settled on using six different machines, covering five different architectures, mainly focused on the slowest machines I had available in each category.

233MHz ARMv4 StrongARM RiscPC
600MHz ARMv5 XScale Iyonix
700MHz ARMv6 ARM11 Raspberry Pi 1
1GHz ARMv7 Cortex-A8 BeagleBoard-xM
1.5GHz ARMv7 Cortex-A15 Titanium
1.6GHz x86 Intel Atom 330 - to validate my theories, and check my optimisations aren't going to ruin x86 users!

After looking over the test results, I've been able to draw the following conclusions:

The existing code is terrible, on ARM and x86. Admittedly x86 suffers less than ARM does, but it's still the slowest algorithm I tried.
-O3 didn't make any difference
-O3 with loop unrolling & fast math was hit-or-miss - sometimes resulting in slightly faster code, sometimes slightly slower. So not something I can rely on.

The path to integer math

The graph below shows the performance of the float-only, LUT-only, and integer-only routines. Performance is relative to the original algorithm: e.g. the float results which are clustered around the number 5 mean that the float algorithm is about 5 times faster than the original algorithm.

I've included two sets of test results for the ARM targets: the 'generic' results are for a binary simply compiled with -O2 (i.e. targetting a generic CPU and using softfloat math), while the 'optimal' version is for -O2 and targeting the machine-specific CPU & FP hardware (or softfloat if there's no FP hardware).

Relative performance of CPU-agnostic routines

Although each machine is affected in different ways, there's a clear progression here: the original routine is slowest, float-only is faster than the original, LUT-only is faster than float-only, and integer-only is the fastest of them all.

One interesting pair of datapoints on the graph are the Titanium and Pi 1 optimised float results. There's a big jump from 'original' to 'float', and then a relatively flat section leading to 'LUT'. The first jump suggests that these machines really don't like having to do lots of conversions between float and int, while the second jump suggests that (compared to the BB-xM) their FPUs are actually pretty good - the BB-xM sees a much higher performance increase when going from float-only to LUT-only. But testing on more VFP/NEON machines is probably needed before drawing any general conclusions from this.

Also, don't be dismayed at the fact that the optimised Titanium integer version is only 11.4x faster than the original code, while the iyonix manages to reach a 30x improvement. In terms of raw numbers, the Titanium is actually the fastest of all the machines tested.

Raw performance figures (samples per second):

Machine/test	Original	Float	LUT	Integer
BB-xM generic	685,343.8125	2,892,623.25	9,039,449	12,336,188
BB-xM optimal	1,115,506.375	5,879,865	15,978,302	17,158,516
Iyonix generic	245,856.046875	1,184,831.625	4,338,935	6,229,164.5
Iyonix optimal	256,375.546875	1,255,779.625	6,710,886.5	7,695,970.5
Pi1 generic	305,707.28125	1,335,765.5	4,559,026	6,502,797
Pi1 optimal	892,405.125	6,291,456	7,695,970.5	8,962,188
SA generic	102,650.609375	454,913.65625	1,553,445.875	2,231,012.75
SA optimal	102,902.460938	455,902.625	1,588,751.5	2,279,513
Ti generic	1,198,372.625	5,331,742.5	23,301,688	36,649,260
Ti optimal	4,559,026	29,676,680	37,948,468	51,909,704
x86 generic	9,547,569	20,936,012	27,616,620	33,604,030

From these results we can see that x86 is initially the fastest out there, with 2x the performance of the Titanium. But as soon as the mixed float + integer code is removed the Titanium jumps ahead, even managing to be faster for when the code was compiled for a generic CPU.

ARM optimisated versions

Now that we've worked out that integers are best, we need to work out what ARM optimisations are best. AudioLib contains 9 dedicated mixing functions (and a couple of other mixing loops dotted around the place), and I don't really feel like writing optimised versions for every architecture under the sun - so being able to reduce the list to one or two good ones will make things a lot more manageable.

So in this case, rather than looking at performance relative to the original routine, the graph is showing performance relative to the fastest routine.

Relative performance of ARM optimised routines

Again, there are two lines for each machine: 'baseline' and 'optimal'. These represent whether the code was compiled for the lowest-possible CPU or the best-possible CPU; e.g. for the ARMv3M routine, running on ARMv7, 'baseline' indicates that it was compiled for ARMv3M, while 'optimal' indicates it was compiled for ARMv7. This is relevant because in addition to working out what source code variants are important, I also need to decide on how many different versions of the NBlood binary I need to build/distribute.

This graph is a bit harder to interpret, but if we break it down by machine then things become clearer:

StrongARM - There are only two routines valid here, integer and ARMv3M. Luckily the ARMv3M optimisations result in a healty 1.4x performance boost over the CPU-agnostic integer version (and in general the ARMv3M version is always better than the integer version).
Iyonix - When looking at this, you need to realise that the ARMv5TE baseline is identical to the ARMv5TE optimal. So the dip from ARMv3M optimal to ARMv5TE baseline doesn't mean that the ARMv5TE baseline is slower than the ARMv5TE optimal routine. Instead, it means that the ARMv3M optimal routine is actually faster than the ARMv5TE routine. Looking at the disassembly of the code, the ARMv5TE routine is one instruction shorter than the ARMv3M routine, so there must be something else going on with the instruction scheduling or cycle timing to cause such a performance difference.
Pi 1 - Similar to the Iyonix, an ARMv6-optimised build of the ARMv6 code is actually slightly slower than an ARMv6 build of the ARMv5TE code. But for baseline performance, we can at least see a steady and sizable increase through the different algorithms + CPUs.
BB-xM - This also sees a dip in performance for the optimised ARMv5TE code. But, at least we can see that NEON has a healthy ~30% lead over the other implementations.
Titanium - Another mysterious drop for ARMv5TE, and another ~30% gain for NEON.

Raw performance figures (samples per second):

Machine/test	Integer	ARMv3M	ARMv5TE	ARMv6	NEON
BB-xM baseline	12,336,188	19,418,074	21,559,506	23,519,460	31,457,280
BB-xM optimal	17,158,516	23,068,672	21,762,900	23,519,460	31,457,280
Iyonix baseline	6,229,164.5	9,446,631	10,082,462	-	-
Iyonix optimal	7,695,970.5	10,280,157	10,082,462	-	-
Pi1 baseline	6,502,797	11,135,320	12,098,954	13,719,685	-
Pi1 optimal	8,962,188	12,991,207	13,981,014	13,719,685	-
SA baseline	2,231,012.75	3,226,387.75	-	-	-
SA optimal	2,279,513	3,226,387.75	-	-	-
Ti baseline	36,649,260	94,371,840	128,736,064	207,618,048	268,435,456
Ti optimal	51,909,704	143,270,784	140,509,184	207,618,048	268,435,456

Conclusions

For the audiolib source, I'm likely go with the following three code variants: Integer, ARMv3M, and NEON. ARMv5TE and ARMv6 specific versions just don't look like they're going to be worth the effort - most of the performance gains from the ARMv5TE/ARMv6 versions come from the fact that the compiler has been told to use v5 or v6 instructions, not from the changes that were made to the source code.

For (ARM) binary releases, this means we're looking at at least two configurations: NEON and ARMv3M. Possibly also ARMv5TE and/or ARMv6 (with VFP) depending on how much of an effect they have on the game as a whole.

Other notes

The observant of you will have spotted that the makefile I provided above also builds FPA versions of the binaries. Depending on machine, the 'float' code when built for FPA is generally about 5-10 times slower than soft-float, and from 15 (BB-xM) to 58 (Titanium) times slower than VFP.

I also ran the tests twice, to measure the effects of different buffer sizes. The results presented in this article are for a small mix buffer (1024 bytes, i.e. 256 frames of 16bit sample pairs, the same as AudioLib is typically configured to use). 4MB of voice data is mixed into this buffer (2 million audio frames), simulating the expected runtime behaviour (small buffer which has lots of voices mixed into it).

The alternative arrangement tested, of having a large (4MB) mix buffer (with 2MB of voice data to mix into it) showed improved performance due to less time spent in the setup code at the start of the mix function. (The setup code typically contains four floating point multiplies, for calculating the two volume scale factors). However, except for the FPA builds, this performance improvement appears to be under 5%, suggesting that there isn't much to gain from tweaking the buffer size / function prologue. Also note that although the mix buffer size was 4MB, it was necessary to perform the mixing in 64K-frame chunks (effective mix buffer size of 256KB), to avoid overflows in the position variable.

And finally, to check that my optimisations hadn't broken the code, each build also contains a unit test that compares the mix output against a known-good reference implementation, for a few different volume settings, and reports the maximum error seen.

Adventures in Optimisation - Audio mixing