tl;dr: If you want your Android app to run twice as fast, put all your application classes in the package "com.badlogic.gdx" (or subpackages).
While trying to evaluate how fit JOML currently is for running on Android devices, I did some benchmarking. With a fresh Android Studio installation, I ran a simple application containing benchmarks of some JOML methods on an Android 6.0.1 device.
90 ns. for a 4x4 matrix multiplication, I thought, was not bad.
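For anyone wanting to reproduce this kind of per-invocation number, here is a minimal sketch of a timing loop. It is not the benchmark code I actually ran; the class name MulBenchmark, the Workload interface and the warmup/iteration counts are all made up for illustration, and a hand-rolled loop like this only gives rough numbers.

```java
public final class MulBenchmark {
    /** Hypothetical workload interface; not part of JOML or libGDX. */
    public interface Workload {
        float run();
    }

    // Results are written here so the runtime cannot discard the measured work.
    public static volatile float sink;

    /** Returns the average wall-clock time per invocation in nanoseconds. */
    public static double averageNanos(Workload w, int warmup, int iterations) {
        float acc = 0f;
        // Warm-up phase: give the runtime a chance to compile the hot code.
        for (int i = 0; i < warmup; i++) {
            acc += w.run();
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            acc += w.run();
        }
        long elapsed = System.nanoTime() - start;
        sink = acc;
        return (double) elapsed / iterations;
    }
}
```

The workload would wrap a single mul() call and return one element of the result, so that the multiplication cannot be optimized away.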
I also tested the math classes of the latest libGDX version with the arm64-v8a shared library against it. The shared library loaded just fine via System.loadLibrary("gdx") and the native Matrix4.mul() method successfully called the native function. But its performance was nowhere near that of the pretty standard textbook Java-only solution in JOML. LibGDX performed at around 1370 ns. per invocation.
Next I tried to optimize libGDX's Matrix4.mul() method. First, this meant getting rid of the very slow JNI invocation, which internally also made calls to expensive VM runtime routines. The actual matrix multiplication arithmetic was the least of the runtime cost.
So, I translated JOML's matrix multiplication to the float array accesses that libGDX uses.
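To give an idea of what a Java-only multiplication on a float array looks like, here is a minimal sketch. It is not the exact code I benchmarked; the class FloatArrayMat4 and its static mul() are hypothetical names, and I am assuming a 16-element column-major layout (element at row r, column c stored at index c * 4 + r).

```java
public final class FloatArrayMat4 {
    private FloatArrayMat4() {}

    /** Returns a new 16-element identity matrix (column-major). */
    public static float[] identity() {
        float[] m = new float[16];
        m[0] = m[5] = m[10] = m[15] = 1f;
        return m;
    }

    /**
     * Computes dest = a * b for column-major 4x4 matrices.
     * dest may be the same array as b, but must not be the same array as a.
     */
    public static void mul(float[] a, float[] b, float[] dest) {
        for (int c = 0; c < 4; c++) {
            // Read column c of b into locals before writing column c of dest.
            float b0 = b[c * 4];
            float b1 = b[c * 4 + 1];
            float b2 = b[c * 4 + 2];
            float b3 = b[c * 4 + 3];
            for (int r = 0; r < 4; r++) {
                dest[c * 4 + r] = a[r] * b0 + a[4 + r] * b1
                                + a[8 + r] * b2 + a[12 + r] * b3;
            }
        }
    }
}
```

A real implementation would probably unroll the loops by hand; the loops here are only to keep the sketch short.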
Then I ran the benchmark again.
Result: 47 ns. for the new Java-only libGDX Matrix4.mul() method.
I thought: Nice, so on Android, arithmetic and memory accesses on float array elements are actually twice as fast compared to using primitive float fields like JOML does.
So, I stripped JOML's Matrix4f of everything except mul(), also used a float array instead of the float fields, and finally used the same mul() method in JOML's class which I had already used for libGDX's Matrix4.mul().
I expected that now JOML would perform exactly as fast as libGDX's new Java-only mul() method.
That was not the case. Still around 90 ns.
That could not be, I thought.
Whatever I tried, such as changing the access pattern of the float array in the mul() method from row-major to column-major, nothing worked. JOML's Matrix4f class now literally looked exactly like my modified libGDX Matrix4 class. Something was fishy here.
Out of curiosity, I just refactored/moved JOML's Matrix4f class into the com.badlogic.gdx.math package, rebuilt everything, wiped the app completely from the device (as always after each test) and reran the benchmark.
48 ns.... wtf.
Moving libGDX's Matrix4 class into any package other than a subpackage of com.badlogic.gdx also degraded the performance of its mul() method from 49 ns. to 90 ns.
Result: If you want your application to run fast on an Android 6.0.1 device, just put your application classes in the com.badlogic.gdx package.
...I'll look into Android's sources to see whether they actually have a codepath for classes in the 'com.badlogic.gdx' package...