When someone says multi-core, we unconsciously think SMP. That worked out well for us until recently, when ARM announced big.LITTLE. ARM’s big.LITTLE architecture is the first mass-produced AMP architecture and, as we’ll see next, it raises the bar for how hard multi-core programming is.
A tale of an impossible bug
It all started with a bug report against a phone with such a CPU, the Exynos chipset used on Samsung phones in Europe.
Apps created with our software were dying with SIGILL at seemingly random places.
Nothing could reasonably explain what was happening, and the crash was happening with valid instructions. This immediately made us suspect bad instruction cache flushing.
After reviewing all JIT code around cache flushing, we were sure that we were calling __clear_cache properly. That led us to look around for how other virtual machines or compilers do cache flushing on ARM64, and we found out about some related errata on the Cortex A53. ARM’s description of those issues is both cryptic and vague, but we tried the workaround anyway. No luck there.
Next we went with the other usual suspects. A lying signal handler? Nope. Funky userspace CPU emulation? No. Broken libc implementation? Nice try. Faulty hardware? We reproduced it on multiple devices. Bad luck or karma? Yes!
Some of us could not sleep with such an amazing puzzle in front of us and kept staring at memory dumps around failure sites. And there was this funny thing: the fault address was always on the third or fourth line of the memory dumps.
This was our only clue, and there are no coincidences when it comes to this sort of byzantine bug. Our memory dumps showed 16 bytes per line, and the SIGILL would always land on an address whose low byte was somewhere between 0x40-0x7f or 0xc0-0xff.
We aligned the memory dump to help verify whether the code allocator was doing something funky:
$ grep SIGILL *.log
custom_01.log:E/mono (13964): SIGILL at ip=0x0000007f4f15e8d0
custom_02.log:E/mono (13088): SIGILL at ip=0x0000007f8ff76cc0
custom_03.log:E/mono (12824): SIGILL at ip=0x0000007f68e93c70
custom_04.log:E/mono (12876): SIGILL at ip=0x0000007f4b3d55f0
custom_05.log:E/mono (13008): SIGILL at ip=0x0000007f8df1e8d0
custom_06.log:E/mono (14093): SIGILL at ip=0x0000007f6c21edf0
[...]
With that we came to our first good hypothesis: Bad cache flushing was happening only on the upper 64 bytes of every 128-byte block. Those numbers, if you deal with low level programming, immediately remind you of cache line sizes. And that is where it all started to make sense.
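The hypothesis is easy to check against the addresses in the log excerpt above. The following is a minimal sketch, not code from the original investigation: the helper names in_upper_half and count_upper_half are made up for illustration, and the array simply copies the six fault addresses shown in the logs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Fault addresses copied from the grep output above. */
static const uint64_t sigill_ips[] = {
    0x0000007f4f15e8d0ULL, 0x0000007f8ff76cc0ULL, 0x0000007f68e93c70ULL,
    0x0000007f4b3d55f0ULL, 0x0000007f8df1e8d0ULL, 0x0000007f6c21edf0ULL,
};

/* Hypothetical helper: true when addr lies in the upper half of its
   block_size-byte block. block_size must be a power of two. */
static bool in_upper_half(uint64_t addr, uint64_t block_size)
{
    assert(block_size >= 2 && (block_size & (block_size - 1)) == 0);
    return (addr & (block_size - 1)) >= block_size / 2;
}

/* Hypothetical helper: counts how many of the n addresses fall in the
   upper half of their block. */
static size_t count_upper_half(const uint64_t *addrs, size_t n,
                               uint64_t block_size)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (in_upper_half(addrs[i], block_size))
            count++;
    return count;
}
```

Calling count_upper_half (sigill_ips, 6, 128) counts every one of the six addresses, which is what you would expect if only the second 64 bytes of each 128-byte block were being left stale.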
Here is a pseudo version of how libgcc does cache flushing on arm64:
void __clear_cache (char *address, size_t size)
{
    /* Queried once, then reused for every later call. */
    static size_t cache_line_size = 0;

    if (!cache_line_size)
        cache_line_size = get_current_cpu_cache_line_size ();

    for (size_t i = 0; i < size; i += cache_line_size)
        flush_cache_line (address + i);
}
In the above pseudo-code, get_current_cpu_cache_line_size is a CPU instruction that returns the line size of its caches, and flush_cache_line flushes the cache line that contains the supplied address.
At that point we were using our own version of this function, so we instrumented it to print the cache line size as returned by the CPU and, lo and behold, it printed both 128 and 64. We double-checked that this was indeed the case. So we went to that particular CPU’s manual, and it turns out that the big core has a 128-byte instruction cache line, while on the LITTLE core it is only 64 bytes.
So what was happening is that __clear_cache would be called first on a big core and cache 128 as the instruction cache line size. Later it would be called on one of the LITTLE cores, step through the range 128 bytes at a time, and so skip every other 64-byte cache line when flushing. It doesn’t get simpler than that. We removed the caching and it all worked.
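To make the failure mode concrete, here is a small simulation of the flush loop. It is only a sketch: count_missed_lines is a hypothetical helper, it flushes nothing, and it assumes the region starts on a cache-line boundary. It walks the region with the stride the loop believes in and counts how many real cache lines are never touched.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of real_line-byte cache lines in
   [0, size) that a flush loop stepping by `stride` bytes never touches.
   Assumes the region starts on a cache-line boundary. */
static size_t count_missed_lines(size_t size, size_t stride, size_t real_line)
{
    assert(stride > 0 && real_line > 0);

    size_t missed = 0;
    for (size_t line = 0; line < size; line += real_line) {
        int flushed = 0;
        /* Does any offset visited by the loop fall inside this line? */
        for (size_t i = 0; i < size; i += stride) {
            if (i / real_line == line / real_line) {
                flushed = 1;
                break;
            }
        }
        if (!flushed)
            missed++;
    }
    return missed;
}
```

For a 512-byte region, a stale stride of 128 on a core with 64-byte lines misses 4 of the 8 lines, and the missed lines are exactly the upper 64 bytes of each 128-byte block. The opposite mismatch (stride 64 on 128-byte lines) misses nothing and only does redundant work, which is why the smaller line size is the safe choice.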
Summary
Some ARM big.LITTLE CPUs can have cores with different cache line sizes, and pretty much no code out there is ready to deal with it, because it assumes all cores are symmetrical.
Worse, not even the ARM ISA is ready for this. An astute reader might realize that computing the cache line size on every invocation is not enough for user space code: a process can be rescheduled onto a different CPU in the middle of the __clear_cache function, where the cache line size it read earlier might not be valid anymore.
Therefore, we have to try to figure out a global minimum of the cache line sizes across all CPUs.
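One way to approximate such a global minimum is to re-read the line size on every flush and keep the smallest value seen so far. The sketch below illustrates that idea and is not the code from the Mono pull request. All function and variable names are made up for illustration. It relies on the CTR_EL0 register layout (IminLine in bits [3:0], DminLine in bits [19:16], each the log2 of the line size in 4-byte words). The tracking is not thread-safe as written, and it can only learn about cores the code has actually run on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode the smallest instruction cache line size, in bytes, from a
   CTR_EL0 value: IminLine (bits [3:0]) is log2 of the size in words. */
static size_t icache_line_from_ctr(uint64_t ctr)
{
    return (size_t)4 << (ctr & 0xf);
}

/* Same for the data cache: DminLine lives in bits [19:16]. */
static size_t dcache_line_from_ctr(uint64_t ctr)
{
    return (size_t)4 << ((ctr >> 16) & 0xf);
}

/* Smallest instruction cache line size seen so far; 0 means "none yet".
   Assumption: a real implementation would update this atomically. */
static size_t min_icache_line = 0;

/* Record a freshly read line size and return the running minimum. */
static size_t observe_icache_line(size_t observed)
{
    assert(observed > 0);
    if (min_icache_line == 0 || observed < min_icache_line)
        min_icache_line = observed;
    return min_icache_line;
}

#if defined(__aarch64__)
/* Read CTR_EL0 on the core this thread is currently running on. */
static uint64_t read_ctr_el0(void)
{
    uint64_t ctr;
    __asm__ volatile ("mrs %0, ctr_el0" : "=r" (ctr));
    return ctr;
}
#endif
```

A flush routine built on this would call observe_icache_line (icache_line_from_ctr (read_ctr_el0 ())) at the top of every invocation and use the returned value as its stride. Once a LITTLE core has been observed, the stride stays at 64 even if the thread later runs on, or migrates to, a big core.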
Here is our fix for Mono: Pull Request.
Other projects have already adopted our fix as well: Dolphin and PPSSPP.