If some of this isn't clear to some folks...look at this
Java vs. Java comparison, which peaks at over 43x (or, again, some numbers for Riven's library). Avoiding cache misses is a very big deal in high-performance code...each miss costs the equivalent of hundreds to thousands of operations that could have been performed instead (and that cost keeps growing as the CPU-memory gap widens).
Cache misses are indeed expensive, but the article clearly shows (at the end) that the bottleneck is not cache misses but the GC, which tries to clear 50M objects on each run: a full GC takes between 1s and 10s, and a few of them happen per run. If the GC is running for >20s per run while the compact data version runs in ~0.8s, it's obvious that we can't just blame cache misses.
IMHO the article doesn't so much show the advantages of optimal memory usage as the inability of the GC to efficiently handle a certain use case.
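To make the distinction concrete, here is a rough sketch of the two layouts being compared. This is my own illustration, not the article's benchmark code: the `Point` class, the method names, and the sizes are all made up. The object version allocates one heap object per element, each of which the collector has to trace; the compact version is a single primitive array with no references to follow.

```java
import java.util.ArrayList;
import java.util.List;

public class LayoutSketch {
    // Hypothetical element type: one separate heap object per point.
    static final class Point {
        final float x, y;
        Point(float x, float y) { this.x = x; this.y = y; }
    }

    // Object layout: n Point instances plus the list's backing array of references.
    static List<Point> buildObjects(int n) {
        List<Point> points = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            points.add(new Point(i, 2f * i));
        }
        return points;
    }

    // Compact layout: one float[] holding x,y pairs back to back.
    static float[] buildCompact(int n) {
        float[] data = new float[2 * n];
        for (int i = 0; i < n; i++) {
            data[2 * i] = i;
            data[2 * i + 1] = 2f * i;
        }
        return data;
    }

    // Sums x + y over all points by following one reference per element.
    static double sumObjects(List<Point> points) {
        double sum = 0;
        for (Point p : points) {
            sum += p.x + p.y;
        }
        return sum;
    }

    // Sums the same values by walking the array sequentially.
    static double sumCompact(float[] data) {
        double sum = 0;
        for (float v : data) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 1_000; // tiny on purpose; the article's case is 50M objects
        System.out.println(sumObjects(buildObjects(n)));
        System.out.println(sumCompact(buildCompact(n)));
    }
}
```

Both versions compute the same result and differ only in how many objects end up on the heap. To see how much of a run is GC rather than cache behavior, one option is to run each version at a large `n` with GC logging enabled (for example `-Xlog:gc` on JDK 9+) and compare pause times against total run time.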