Converting floats/doubles to 10/11/16/N bit floats
Offline theagentd
« Posted 2016-09-23 00:10:34 »

GPUs often use floats smaller than 32 bits to avoid spending a full 4 bytes per color channel. There are a number of such formats, with 16-bit floats being the most common, but 10-bit and 11-bit formats are fairly common too. See this page for more info: https://www.opengl.org/wiki/Small_Float_Formats

There's no native support for <32-bit floats in Java, but it can be really useful to be able to work with smaller float values. Here are some use case examples:
 - You can save a lot of space by storing vertex attributes as 16-bit floats, especially normals and other attributes that don't need full 32-bit precision.
 - You can create 16-bit float texture data, or even R11F_G11F_B10F texture data offline and save it to a file without an OpenGL context, or something similar.
 - You can avoid some wasted memory bandwidth by reading back a 16-bit float texture in its native format and doing the unpacking on the CPU, although the driver may be faster at converting to 32-bit than my code...
 - You can generally save space when writing binary float data to files, as you can choose exactly how many bits to use for the exponent and the mantissa, and whether you need a sign bit at all.
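
As a rough illustration of the R11F_G11F_B10F case: the three components share one 32-bit word, with red in the low 11 bits, green in the next 11 and blue in the top 10 (the layout OpenGL uses for GL_UNSIGNED_INT_10F_11F_11F_REV). The sketch below only does the packing step. The class and method names are made up for this example, and it assumes the components have already been encoded as 11/11/10-bit float bit patterns.

```java
final class PackedFloatTexel {

    // Packs already-encoded small floats into one 32-bit texel.
    // r11 and g11 are 11-bit patterns (5 exponent + 6 mantissa bits),
    // b10 is a 10-bit pattern (5 exponent + 5 mantissa bits).
    static int pack(int r11, int g11, int b10) {
        return ((b10 & 0x3FF) << 22) | ((g11 & 0x7FF) << 11) | (r11 & 0x7FF);
    }

    // Extract the individual bit patterns again.
    static int red(int texel)   { return texel & 0x7FF; }
    static int green(int texel) { return (texel >>> 11) & 0x7FF; }
    static int blue(int texel)  { return texel >>> 22; }

    public static void main(String[] args) {
        // 1.0 has biased exponent 15 and a zero mantissa in both formats.
        int one11 = 15 << 6; // 0x3C0
        int one10 = 15 << 5; // 0x1E0
        int texel = pack(one11, one11, one10);
        System.out.println(Integer.toHexString(texel)); // prints 781e03c0
    }
}
```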


Storytime, the code is at the bottom =P
I first wrote a function to convert a 32-bit float to a 16-bit float and back again using the Wikipedia specification, but then I realized that there are other float formats out there, so I decided to rework it a bit. I instead made two generic converter functions: one takes in a double value and converts it to a format with a given number of exponent and mantissa bits (with the sign being optional), and the other converts back. This also allowed me to test the system by using my functions to convert from 64-bit floats to 32-bit floats and comparing that to a simple cast. So now I have a generic function that can handle any number of bits <=32, with a varying size mantissa and exponent for whatever needs you have.
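
For reference, the three fields that a generic converter like this has to re-scale can be pulled out of a double with a few shifts and masks. This is just a standalone sketch of the IEEE 754 double layout (1 sign bit, 11 exponent bits with a bias of 1023, 52 mantissa bits); the class and method names are placeholders and not part of the code at the bottom.

```java
final class DoubleFields {

    // Sign bit: 1 for negative values (including -0.0), 0 otherwise.
    static long sign(double d) {
        return Double.doubleToRawLongBits(d) >>> 63;
    }

    // Exponent with the IEEE 754 bias of 1023 removed.
    static long unbiasedExponent(double d) {
        return ((Double.doubleToRawLongBits(d) >>> 52) & 0x7FF) - 1023;
    }

    // The 52 explicitly stored mantissa bits (the leading 1 is implicit).
    static long mantissa(double d) {
        return Double.doubleToRawLongBits(d) & ((1L << 52) - 1);
    }

    public static void main(String[] args) {
        double d = -2.5; // -1.25 * 2^1
        System.out.println(sign(d));                   // prints 1
        System.out.println(unbiasedExponent(d));       // prints 1
        System.out.println(mantissa(d) == (1L << 50)); // prints true (0.25 * 2^52)
    }
}
```

Converting to a smaller format then comes down to swapping the bias and keeping only the top mantissa bits, with special handling at both ends of the exponent range.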

Features
 - Denormals handled correctly for all bit counts.
 - Infinity/NaN preserved.
 - Clamps negative values to zero if the output value has no sign.
 - Values too big for the small format are rounded to infinity.
 - Values too small for the small format are rounded to 0.
 - Positive/negative zeroes preserved.
 - No dependencies.
 - Static functions for everything.
 - Shortcut methods for half floats, 11-bit floats and 10-bit floats.
 - Good performance (~50-100 million conversions per second).
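
To get a feel for where the "too big" and "too small" cut-offs in the list above sit, here is a small sketch that computes the limits of a format from its exponent and mantissa bit counts. It assumes the usual IEEE-style layout (bias of 2^(exponentBits-1) - 1, top exponent reserved for infinity/NaN, bottom exponent for zero/denormals); the class and method names are invented for the example.

```java
final class SmallFloatLimits {

    static int bias(int exponentBits) {
        return (1 << (exponentBits - 1)) - 1;
    }

    // Largest finite value: all mantissa bits set, largest non-special exponent.
    static double maxFinite(int exponentBits, int mantissaBits) {
        return Math.scalb(2.0 - Math.scalb(1.0, -mantissaBits), bias(exponentBits));
    }

    // Smallest positive normal value.
    static double minNormal(int exponentBits) {
        return Math.scalb(1.0, 1 - bias(exponentBits));
    }

    // Smallest positive denormal value (only the lowest mantissa bit set).
    static double minDenormal(int exponentBits, int mantissaBits) {
        return Math.scalb(1.0, 1 - bias(exponentBits) - mantissaBits);
    }

    public static void main(String[] args) {
        System.out.println(maxFinite(5, 10)); // 16-bit half: prints 65504.0
        System.out.println(maxFinite(5, 6));  // 11-bit:      prints 65024.0
        System.out.println(maxFinite(5, 5));  // 10-bit:      prints 64512.0
    }
}
```

With round-to-nearest, inputs a little above maxFinite still land on maxFinite, and inputs below roughly half of minDenormal end up as zero.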
 

Accuracy test
From my tests, converting doubles to 32-bit floats using my conversion function (and back again) gives results that are 100% identical to doing a simple double-->float cast in Java (and back again). This test consisted of converting 18 253 611 008 random double bit patterns to floats and back again, with every result identical to just casting. This should mean that the conversion is 100% accurate for 16-bit values as well, but this is harder to test.
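
A comparison like that can be sketched roughly as follows. This is not the original test code, just a minimal harness under my own assumptions: the candidate converter is passed in as a function from double to 32-bit float bits, NaNs are treated as equal to each other regardless of payload, and the names are hypothetical. The stand-in candidate in main is the cast itself, so it trivially reports zero mismatches; the converter under test would be swapped in there.

```java
import java.util.Random;
import java.util.function.DoubleToIntFunction;

final class CastComparison {

    // Counts inputs for which the candidate's 32-bit pattern differs from a plain cast.
    // NaNs are compared as "both NaN", since NaN payloads need not match.
    static long countMismatches(DoubleToIntFunction candidate, long samples, long seed) {
        Random random = new Random(seed);
        long mismatches = 0;
        for (long i = 0; i < samples; i++) {
            // Reinterpret random 64-bit patterns as doubles.
            double d = Double.longBitsToDouble(random.nextLong());
            float expected = (float) d;
            float actual = Float.intBitsToFloat(candidate.applyAsInt(d));
            boolean same = Float.isNaN(expected)
                    ? Float.isNaN(actual)
                    : Float.floatToRawIntBits(expected) == Float.floatToRawIntBits(actual);
            if (!same) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        // Stand-in candidate: the cast itself. Swap in the converter under test here.
        DoubleToIntFunction viaCast = d -> Float.floatToRawIntBits((float) d);
        System.out.println(countMismatches(viaCast, 1_000_000, 42L)); // prints 0
    }
}
```

One caveat with uniformly random bit patterns is that only a minority of them fall inside the finite range of a float, and exact rounding ties are very unlikely to come up, so it may be worth adding targeted inputs (halfway cases, values around the denormal boundary) on top of the random ones.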


Comments and suggestions are welcome.



public class FloatConversion {

   private static final int DOUBLE_EXPONENT_BITS = 11;
   private static final long DOUBLE_EXPONENT_MASK = (1L << DOUBLE_EXPONENT_BITS) - 1;
   private static final long DOUBLE_EXPONENT_BIAS = 1023;
   
   private static final long DOUBLE_MANTISSA_MASK = (1L << 52) - 1;
   
   public static long doubleToSmallFloat(double d, boolean hasSign, int exponentBits, int mantissaBits){
     
      long bits = Double.doubleToRawLongBits(d);
     
      long s = -(bits >>> 63);
      long e = ((bits >>> 52) & DOUBLE_EXPONENT_MASK) - DOUBLE_EXPONENT_BIAS;
      long m = bits & DOUBLE_MANTISSA_MASK;
      int exponentBias = (1 << (exponentBits-1)) - 1;
     
      if(!hasSign && d < 0){
         //Clamp negative numbers to 0 when we don't have an output sign. NaN never satisfies
         //d < 0, so NaNs of either sign (and -0.0) fall through to the normal handling below.
         return 0;
      }
     
     
     
      long sign = s;
      long exponent = 0;
      long mantissa = 0;
     
     
     

     
      if(e <= -exponentBias){

         double abs = Double.longBitsToDouble(bits & 0x7FFFFFFFFFFFFFFFL);
         
         //Value is too small, calculate an optimal denormal value.
         exponent = 0;
         
         int denormalExponent = exponentBias + mantissaBits - 1;
         double multiplier = Double.longBitsToDouble((denormalExponent + DOUBLE_EXPONENT_BIAS) << 52);
         
         //Odd-even rounding
         mantissa = (long)Math.rint(abs * multiplier);
         
      }else if(e <= exponentBias){
         
         //A value in the normal range of this format. We can convert the exponent and mantissa
         //directly by changing the exponent bias and dropping the extra mantissa bits (with correct
         //rounding to minimize the error).
         
         exponent = e + exponentBias;
         
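         int shift = 52 - mantissaBits;
         long mantissaBase = m >> shift;
         //Round to nearest, ties to even, to match what a double-->float cast does.
         long remainder = m & ((1L << shift) - 1);
         long half = 1L << (shift-1);
         long rounding = (remainder > half || (remainder == half && (mantissaBase & 1) != 0)) ? 1 : 0;
         mantissa = mantissaBase + rounding;

         //If rounding overflows the mantissa (to 1024 for half floats), the carry has to go into
         //the exponent, which may round the result up to infinity (exponent 31, mantissa 0 for
         //half floats). The code below is not actually needed, because the fields are added
         //together at the end and the carry overflows into the exponent bits by itself,
         //but it's here for clarity.
         //exponent += mantissa >> mantissaBits;
         //mantissa &= (1L << mantissaBits) - 1;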
         
      }else{
         
         //We have 3 cases here:
         // 1. exponent = 1024 and mantissa != 0 ---> NaN
         // 2. exponent = 1024 and mantissa == 0 ---> Infinity
         // 3. value is too big for the small float ---> Infinity
         //So, if the value isn't NaN we want infinity.
         exponent = (1 << exponentBits) - 1;
         if(e == 1024 && m != 0){
            mantissa = 1; //NaN
         }else{
            mantissa = 0; //infinity
         }
      }
     
      if(hasSign){
         return (sign << (mantissaBits + exponentBits)) + (exponent << mantissaBits) + mantissa;
      }else{
         return (exponent << mantissaBits) + mantissa;
      }
     
   }
   
   public static double smallFloatToDouble(long f, boolean hasSign, int exponentBits, int mantissaBits){

      int exponentBias = (1 << (exponentBits-1)) - 1;

      //Mask out just the sign bit, so this works whether or not f was sign-extended from a short.
      long s = hasSign ? (f >>> (exponentBits + mantissaBits)) & 1 : 0;
      long e = ((f >>> mantissaBits) & ((1 << exponentBits) - 1)) - exponentBias;
      long m = f & ((1 << mantissaBits) - 1);

      long sign = s;
      long exponent = 0;
      long mantissa = 0;

      if(e <= -exponentBias){
         
         //We have a float denormal value. Cheat a bit with the calculation...

         int denormalExponent = exponentBias + mantissaBits - 1;
         double multiplier = Double.longBitsToDouble((DOUBLE_EXPONENT_BIAS - denormalExponent) << 52);
         
         return (1 - (sign << 1)) * (m * multiplier);

      }else if(e <= exponentBias){
         
         //We have a normal value that can be directly converted by just changing the exponent
         //bias and shifting the mantissa.
         
         exponent = e + DOUBLE_EXPONENT_BIAS;
         int shift = 52 - mantissaBits;
         mantissa = m << shift;
      }else{
         
         //We either have infinity or NaN, depending on if the mantissa is zero or non-zero.
         exponent = 2047;
         if(m == 0){
            mantissa = 0; //infinity
         }else{
            mantissa = 1; //NaN
         }
      }
     
      return Double.longBitsToDouble(((sign << 63) | (exponent << 52) | mantissa));
   }
   
   //Half floats
   
   public static short floatToHalf(float f){
      return (short) doubleToSmallFloat(f, true, 5, 10);
   }
   
   public static float halfToFloat(short h){
      return (float) smallFloatToDouble(h, true, 5, 10);
   }
   
   public static short doubleToHalf(double d){
      return (short) doubleToSmallFloat(d, true, 5, 10);
   }
   
   public static double halfToDouble(short h){
      return smallFloatToDouble(h, true, 5, 10);
   }
   
   
   //OpenGL 11-bit floats
   
   public static short floatToF11(float f){
      return (short) doubleToSmallFloat(f, false, 5, 6);
   }
   
   public static float f11ToFloat(short f){
      return (float) smallFloatToDouble(f, false, 5, 6);
   }
   
   public static short doubleToF11(double f){
      return (short) doubleToSmallFloat(f, false, 5, 6);
   }
   
   public static double f11ToDouble(short f){
      return smallFloatToDouble(f, false, 5, 6);
   }
   
   
   //OpenGL 10-bit floats.
   
   public static short floatToF10(float f){
      return (short) doubleToSmallFloat(f, false, 5, 5);
   }
   
   public static float f10ToFloat(short f){
      return (float) smallFloatToDouble(f, false, 5, 5);
   }
   
   public static short doubleToF10(double f){
      return (short) doubleToSmallFloat(f, false, 5, 5);
   }
   
   public static double f10ToDouble(short f){
      return smallFloatToDouble(f, false, 5, 5);
   }
}
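
The comment in the normal-range branch talks about dropping the extra mantissa bits with correct rounding. As a standalone sketch of what that step looks like with round-to-nearest, ties-to-even (the rounding mode a Java double-->float cast uses), here is a small helper. The class and method names are illustrative only and it is not taken from the class above.

```java
final class MantissaRounding {

    // Drops the lowest `shift` bits of a non-negative mantissa using
    // round-to-nearest, ties-to-even. Requires 1 <= shift <= 62.
    static long roundHalfEven(long mantissa, int shift) {
        long kept = mantissa >>> shift;
        long remainder = mantissa & ((1L << shift) - 1);
        long half = 1L << (shift - 1);
        if (remainder > half || (remainder == half && (kept & 1) == 1)) {
            kept++; // may carry out of the mantissa; the caller bumps the exponent
        }
        return kept;
    }

    public static void main(String[] args) {
        // Dropping 2 bits, so the dropped part is a tie when it equals 0b10.
        System.out.println(roundHalfEven(0b0110, 2)); // tie, kept part 1 is odd:  prints 2
        System.out.println(roundHalfEven(0b1010, 2)); // tie, kept part 2 is even: prints 2
        System.out.println(roundHalfEven(0b1011, 2)); // above the tie:            prints 3
    }
}
```

Ties matter more for float-->half than for double-->float, since a float only has 13 bits to drop and exact halfway values show up regularly.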

Myomyomyo.
Offline basil_

« JGO Bitwise Duke »


Medals: 418
Exp: 13 years



« Reply #1 - Posted 2016-09-23 06:24:39 »

thanks for sharing :)
Offline princec

« JGO Spiffy Duke »


Medals: 1147
Projects: 3
Exp: 20 years


Eh? Who? What? ... Me?


« Reply #2 - Posted 2016-09-23 08:33:10 »

Just look at all that bollocks which will become mercifully obsolete in just a couple of years' time :)

It's great code, but just a reminder to me of just how pointlessly annoying programming around hardware limitations can be.

Cas :)

Offline ShadedVertex
« Reply #3 - Posted 2016-09-23 09:55:21 »

Just a quick question: did you learn all of this at university? I haven't gotten to university yet, so I wouldn't know. Where did you learn all of this? Lol I'm getting desperate :P
Offline Roquen

JGO Kernel


Medals: 518



« Reply #4 - Posted 2016-09-23 12:23:21 »

princec: memory footprint.
Offline theagentd
« Reply #5 - Posted 2016-09-23 14:24:42 »

Quote from ShadedVertex: "Just a quick question: did you learn all of this at university? I haven't gotten to university yet, so I wouldn't know. Where did you learn all of this? Lol I'm getting desperate :P"
We did have a lecture or two on how floating point numbers work at uni, but I just looked up the specifications of the different formats on Wikipedia.

Quote from Roquen: "princec: memory footprint."
Yeah, the point here is to halve the bandwidth and memory usage.

Offline princec



« Reply #6 - Posted 2016-09-23 14:57:29 »

Indeed that would be the point... my take on it is just to go "meh" and wait for the hardware to catch up so we don't have to worry about pages and pages of this sort of thing in our codebases, which, when you get right down to it, are just horrible incomprehensible hacks to work around hardware deficiencies. It's nice and all but I do look forward to a time when none of this is necessary.

Cas :)

Offline Riven
Administrator

« JGO Overlord »


Medals: 1371
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #7 - Posted 2016-09-23 19:49:05 »

@Cas: the hardware will never catch up, because the hardware will never be fast enough.

Even in hardware 20 years from now, when we have realtime photon-mapping, memory bandwidth will be a bottleneck.
Any way or form to halve your data size is bound to yield performance gains.

Offline Roquen




« Reply #8 - Posted 2016-09-24 11:04:40 »

The only out here is some unforeseen engineering miracle.

This paper has an old graph of the gap: http://gec.di.uminho.pt/discip/minf/ac0102/1000gap_proc-mem_speed.pdf

Related: https://fgiesen.wordpress.com/2016/08/07/why-do-cpus-have-multiple-cache-levels/