As you may have seen in the "What I did today"-thread, I recently worked a bit with normal maps. I finally managed to exactly replicate the standard inputs and algorithm of 3D modeling programs, normal map baking software and all major engines out there, allowing me to use generated (baked) normal maps. My test object is a cube with smoothed normals, which uses a normal map to get back its blockiness with just some smoothed edges and corners. It all looks good... at a distance. Zoom in and you get this:

Ouch. The 8-bit normal map simply doesn't have enough precision to exactly "unsmooth" the smoothed normal back to a flat surface, causing that blocking artifact. This is even with bilinear filtering enabled! The problem is simply that the gradient stored in the normal map to counteract the smoothness simply doesn't have good enough precision, causing aliasing (adjacent pixels get rounded to the same value), so even filtering doesn't even help if the input is already aliased.

The only real solution here is to generate a 16-bit normal map and upload it to VRAM at 16-bit precision too. That's a massive cost though. I'm currently using RGTC/BC5, which allows you to store two channels (X and Y) of the normal map compressed to half the size, with Z reconstructed as sqrt(1 - x^2 + y^2). BC5 compresses a block of 16 pixels to just 16 bytes, meaning that each normal map texel only uses 1 byte! This is a massive saving, allowing me to have more normal maps and/or higher resolution normal maps in memory. Going to 16-bit normal map precision (using the same compression trick as above) would force me to drop the compression as there is no 16-bit compression formats (at least not on OGL3 level hardware), meaning I'd need 4 times as much memory for a normal map. That instantly cuts off the first mipmap level if I want to stay at about the same memory usage as before. That's not really something I can afford. However, I read somewhere it it is possible to cram out some extra precision out of BC5, so I decided to run some experiments.

Let's first go through some theory about how BC4 and BC5 work. BC5 is just the two-channel version of the single-channel BC4, so I will be using BC4 for this example. BC4 divides the texture into 4x4 texel blocks. It then stores two 8-bit reference values in each block and 3-bit indices for each texel. BC5 is then decoded using some special logic depending on the order of the reference values. If the first value is bigger than the second one, the index describes a linear blend between the two values. If the first value is smaller, the index is partly used as a blending factor, but can also represent the constants 0.0 and 1.0. Here's some pseudo code:

1 2 3 4 5 6 7 8 | index = <value between 0 and 7> |

NOTE: This is not the exact algorithm used! The indices don't linearly map to values exactly like this! See https://msdn.microsoft.com/en-us/library/windows/desktop/bb694531%28v=vs.85%29.aspx#BC4 for exact info on the specification!

This is the basis for how we can gain more precision out of a BC4 and BC5 than 8 bits in some special cases. If we look at the exact math done here, we see that in the first case we blend together the two reference values based on a 3-bit blending factor. This gives us 8 possible colors: one of the two original colors or one of six values evenly spaced between them. So in theory, even if the reference values themselves are only stored at 8-bit precision, since we can access 6 values inbetween the two reference values we can actually get a result that has a higher than 8-bit precision in some cases, up to something inbetween 10 and 11 bits. This would be a pretty major gain for absolutely zero cost!

However, this makes one grave assumption: When a compressed texture is read, the uncompressed result is stored in the GPU's hardware texture cache. The specification of BC4 and BC5 do not require the decompressed result to be stored at float precision, meaning that the GPU is technically allowed to simply decompress to 8-bit values, throwing away any the extra precision. However, when ATI came up with BC5 it was specifically tailored for normal map compression, and they explicitly stated that they stored the decompressed values at 16-bit precision, more than enough to accurately store the decompressed 11-bit-ish result!

I decided to try check out how Nvidia had implemented using a simple trick. If decompressed values are only stored in 8-bit precision, the resulting values when sampling the BC5 compressed texture will be exactly n/255f. Hence, I disabled all texture filtering and wrote a tiny shader that calculated textureColor*255.0, and checked how far that was from a multiple of 255. If all values are exact multiples of 255 the precision would only be 8 bits, but during testing I found that certain parts of my compressed normal map had values that were completely different from multiples of 255! So, it seems Nvidia too stores the decompressed result at higher-than-8-bit precision, which is exactly what I was hoping for! However, when I tried to upload 16-bit texture data to a BC5 texture I was unable to gain any extra precision. It seems like the driver's texture compressor converts the 16-bit inputs to 8-bit inputs before compressing the texture, so it can't be relied on for this. Drats!

I plan on writing a small offline texture compressor that brute forces the best reference values and indices for each block in the texture to create an optimally compressed normal map. I have a feeling that this will improve the quality a lot, especially for gradients like the ones my test cube has.