I've thought about compression a bit more. Delta encoding together with a hand-crafted LZ77 variant (really not that hard to do) is probably a good option for you, under the following assumptions:

- within a "scanline" or a Morton/Hilbert path there are long runs of constant data (-> run-length encoding would shine here)

- if there is variance it changes with more or less constant frequency so the first order derivative would be more or less constant (-> delta encoding would shine here which in turn makes LZ77 perform well, too) (to handle positive as well as negative derivatives you can zigzag-encode the deltas to get all numbers positive again)

- if there are not many "runs" of the same tile type (or first order derivative) there _might_ be recurring patterns (which again would be great for LZ77)

These are also the assumptions being made by the PNG compression algorithm, which is basically delta encoding plus deflate (LZ77 + Huffman).

Now you can tune a custom LZ77 implementation to be more suited for run-length encoding as opposed to pattern matching by altering the coding scheme used for the <offset, length> pairs and literals, both in how many bits to allocate for each of that information and especially how you entropy-encode these bits.

If you expect there to be lots of runs of the same data (tile numbers or derivative) and not so much patterns found in some distances to each another, then first use the popular LZ77 scheme which uses lengths longer than the offset to indicate repetition and therefore enables runlength encoding, and then Huffman-encode the offset using for example 2 bits for the "offset = 1" case (i.e. directly following the current encoder position in the LZ77 lookahead buffer) and a 3-bit extra-bits code with 2-3 extra bits for the offset up to 13 bytes behind in the LZ77 window. Then probably a 1-bit extra-bit code with 2-3 extra bits for the length of the run.

For example, using this scheme:

## Encoding of literal

* '0' + <literal byte> <-- first 0 bit indicates "now comes a literal byte" (like LZSS does)

## Encoding of offsets (with always a first 1 bit to indicate "now comes <offset, length> and not literal")

* 1 = '10' <-- we expect large single-byte runs so keep it the most common and shortest case

* 2-5 = '110' + 2 extra bits

* 6-13 = '111' + 3 extra bits <-- we don't want to find recurring patterns that far behind

## Encoding of lengths

* 3-6 = '0' + 2 extra bits <-- here is where we can tweak things, too...

* 7-14 = '1' + 3 extra bits <-- ...if we expect longer runs to be more common, we just swap the intervals

This is to be efficient to represent moderately long runs of data very close to the current encoding position, but rather inefficient for patterns that are further behind in the LZ77 window (while hopefully still being more efficient than not caring about patterns at all and _just_ implementing run-length coding).

Essentially, what we want here as well is to avoid having to encode a (canonical) Huffman tree in the data as well, since the 32x32 tiles data is rather very short in total.

Next, I would encode the tile type/number completely separately from the "second layer" which you mentioned. If the latter is more constant we can make use of that by first encoding the tile type/number of all 32x32 tiles which we expect to have a higher entropy than the second layer, and then encode the second layer of all 32x32 tiles with an encoding scheme that favors low-entropy long runs even more, probably by just using run-length coding.

But we won't know what works unless there is some example data.