Java-Gaming.org Hi !
Featured games (90)
games approved by the League of Dukes
Games in Showcase (741)
Games in Android Showcase (225)
games submitted by our members
Games in WIP (823)
games currently in development
News: Read the Java Gaming Resources, or peek at the official Java tutorials
 
    Home     Help   Search   Login   Register   
Pages: [1]
  ignore  |  Print  
  Faster Than Light JSON Parser  (Read 3711 times)
0 Members and 1 Guest are viewing this topic.
Offline KaiHH

JGO Kernel


Medals: 481



« Posted 2017-07-03 18:30:19 »

Probably everyone has done this at some point Smiley, but here is my take on an event/push-based SAX-like JSON parser which I needed for a library I am working on.

I needed a very fast parser working on a byte[] input that just provides me with events about structural JSON elements so that a consumer can implement a hierarchical model on that. The parser had to be as straight-forward and as fuzz-less as possible. No abstraction, no internal AST building, nothing.

First I tried a JavaCC-generated one, which ultimately became the bottleneck of the whole library (needless to say that the requirements also ruled out any other JSON parsing library since most of them also handle marshalling/POJO-mapping which I did not need).
So I opted for a minimal, complete handwritten solution.
I also post this in the hope that someone will probably provide an even more optimized solution. I could really use it!

EDIT:
All parsers developed so far can be found here: https://github.com/httpdigest/ftljson/tree/master/src/ftljson
Offline Sickan
« Reply #1 - Posted 2017-07-03 21:02:48 »

That is some well thought out code, really shows what can be done when performance is the priority. Besides, JSON validation should be done on the dev machine, not on the user's machine.
Offline KaiHH

JGO Kernel


Medals: 481



« Reply #2 - Posted 2017-07-05 19:51:55 »

Since the first post I gratefully got lots of help and improvements from @Spasi. He and I optimized the parser to get to an absolute top-peak performance. The code of the initial post now shows the latest and fastest version of the parser.

On a Ryzen 1800X it achieves almost 1.1 Gigabyte per second of parsing performance (measured with a JMH benchmark reading a ~9KB file as byte[] array on JDK9 b176). The parser is now also able to handle escape sequences in object keys and string literals correctly, and to handle numbers in scientific notation. Of course, any actual conversion to a numeric type representation still needs to be done by the client.

I am pretty certain that you will not find any faster Java/JVM-language-based parser than this.
It is even 2x as fast as noggit, which claims to be "the world's fastest streaming JSON parser for Java".
Games published by our own members! Check 'em out!
Legends of Yore - The Casual Retro Roguelike
Offline Abuse

JGO Ninja


Medals: 60


falling into the abyss of reality


« Reply #3 - Posted 2017-07-05 22:20:07 »

Really minor tweak;

byte c
in isescaped(...) should be an int. (As it is being used as a counter, not intermediary storage of a value copied from input)

....and the method itself should be named isEscaped(...)  Pointing

I'd also decide whether the class is going to be stateful, or not.
At the moment you're storing pos as state, but passing the byte[] & ParseListener around as parameters.

Go with one, or the other; doing both looks messy.
Given the aim is minimal weight & dependencies, I'd be inclined to make it stateless so everything can be static.
Either way, I doubt it'd have any noticeable impact upon performance.
Offline princec

« JGO Spiffy Duke »


Medals: 974
Projects: 3
Exp: 16 years


Eh? Who? What? ... Me?


« Reply #4 - Posted 2017-07-05 22:42:19 »

Now make it do HJSON Smiley

Cas Smiley

Offline Spasi
« Reply #5 - Posted 2017-07-06 08:32:17 »

I'd also decide whether the class is going to be stateful, or not.
At the moment you're storing pos as state, but passing the byte[] & ParseListener around as parameters.

Go with one, or the other; doing both looks messy.
Given the aim is minimal weight & dependencies, I'd be inclined to make it stateless so everything can be static.
Either way, I doubt it'd have any noticeable impact upon performance.

This mix of functional/stateless parameter passing and internal state is by design. Because the JVM is the leakiest abstraction in modern software development. Possibly only rivaled by SQL query optimizers.

1. The problem starts with bytecode size. More bytecode in a method == harder for the JVM to optimize.
2. Once you start dealing with bytecode size, you need to extract repeating code and corner cases to other methods.
3. It all ends when you need to return two values from a method. You can't do that without state in Java.

There's a major difference between input/listener and pos. The former is external state, the latter private/internal. The JVM sees that it has full control over pos and is able to perform the full range of optimizations. It's also likely that the entire parser instance can be eliminated via escape analysis.
Offline Sickan
« Reply #6 - Posted 2017-07-06 10:27:59 »

As it operates on bytes, it doesn't support non-ASCII characters, no?
Offline KaiHH

JGO Kernel


Medals: 481



« Reply #7 - Posted 2017-07-06 10:32:58 »

As it operates on bytes, it doesn't support non-ASCII characters, no?
It does support the UTF-8 character coding, since all recognized tokens (", :, 0-9, etc.) are also in the ASCII range (highest bit not set). Or in other words: There will not be any byte of the 1-4 bytes in a single UTF-8 character encoding that can be misinterpreted as a recognized token by the parser, since all such bytes use the highest order bits to indicate that there are more bytes to follow. Quite clever of them to design the character coding this way.
So you can use any UTF-8 encoded character you like in object key names and string value literals.
Offline Abuse

JGO Ninja


Medals: 60


falling into the abyss of reality


« Reply #8 - Posted 2017-07-06 11:31:09 »

As it operates on bytes, it doesn't support non-ASCII characters, no?
It does support the UTF-8 character coding, since all recognized tokens (", :, 0-9, etc.) are also in the ASCII range (highest bit not set). Or in other words: There will not be any byte of the 1-4 bytes in a single UTF-8 character encoding that can be misinterpreted as a recognized token by the parser, since all such bytes use the highest order bits to indicate that there are more bytes to follow. Quite clever of them to design the character coding this way.
So you can use any UTF-8 encoded character you like in object key names and string value literals.

Makes sense; so ParseInterface#stringLiteral(int off, int len) is the number of bytes that make up the literal, not necessarily the number of characters.
Should make sure the ParseInterface javadoc makes that clear.
Offline Abuse

JGO Ninja


Medals: 60


falling into the abyss of reality


« Reply #9 - Posted 2017-07-06 11:50:43 »

Going back to isEscaped; granted it's a fringe case, but rather than counting the escaped characters, just toggle a boolean?

  • It would eliminate the limitation of 2 billion sequential escaped characters persecutioncomplex
  • It might be quicker. (not that it'd ever be noticeable outside of the degenerate case of every character in every literal being '\' )
  • It demonstrates the intent of the code a little more obviously

1  
2  
3  
4  
5  
6  
7  
    private boolean isEscaped(byte[] input) {
        int p = pos - 1;
        boolean escaped = false;
        while (read(input, p--) == '\\')
            escaped=!escaped;
        return escaped;
    }


Though looking at the bytecode generated by javac 8, it appears that negating booleans is done with a branch now!? (I believe it used to be done with an xor)
Presumably that's something optimized by the JIT.
Games published by our own members! Check 'em out!
Legends of Yore - The Casual Retro Roguelike
Offline KaiHH

JGO Kernel


Medals: 481



« Reply #10 - Posted 2017-07-06 12:04:27 »

Going back to isEscaped; granted it's a fringe case, but rather than counting the escaped characters, just toggle a boolean?
Thanks! I went with this:
1  
2  
3  
4  
int p = pos;
boolean c = false;
for (; read(input, --p) == '\\'; c ^= true);
return c;

XOR'ing a boolean with true is quite faster (benchmarked with lots of escapes) than the not operator, since the latter incurs a branch.
Offline Abuse

JGO Ninja


Medals: 60


falling into the abyss of reality


« Reply #11 - Posted 2017-07-06 12:34:46 »

I'm almost certain that a single case switch...

1  
2  
3  
4  
5  
         switch (c) {
         case '"':
            string(input, listener, inobject);
            continue;
         }

...will be slower than a simple if branch.
The reason being even a tableswitch involves a branch to check bounds (both upper & lower).

The index must be of type int and is popped from the operand stack. If index is less than low or index is greater than high, then a target address is calculated by adding default to the address of the opcode of this tableswitch instruction. Otherwise, the offset at position index - low of the jump table is extracted. The target address is calculated by adding that offset to the address of the opcode of this tableswitch instruction. Execution then continues at the target address.

Of course your benchmarking might have found otherwise?

#flibble dibble silly 90% quote content check#
Offline KaiHH

JGO Kernel


Medals: 481



« Reply #12 - Posted 2017-07-06 12:40:48 »

There is a GitHub project now with a benchmark suite /src/bench/B.java using various test data JSON files.
Simply `./mvnw compile exec:java` if you want to give it a go:
  https://github.com/httpdigest/ftljson
Running the benchmark class as Java application also works very nicely in IntelliJ if you would like to hack and try an even faster version. Smiley
Offline Riven
Administrator

« JGO Overlord »


Medals: 1324
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #13 - Posted 2017-07-06 20:15:55 »

XOR'ing a boolean with true is quite faster (benchmarked with lots of escapes) than the not operator, since the latter incurs a branch.

XOR may be faster, but there is no reason that NOT would incur a branch.

It's just an unconditional bit-flip, unless HotSpot screws up. Clueless

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings!
Offline KaiHH

JGO Kernel


Medals: 481



« Reply #14 - Posted 2017-07-06 23:21:58 »

The quest for ultimate performance is not over, yet...
Two more tricks gave another ~8% performance increase:
1. Replace the switches for the "numeric" and "ignore" cases, which were just used to decide true/false, with a byte[256] table lookup (benefit: reduced bytecode size)
2. Implement that table via a ByteBuffer and do the lookup with sun.misc.Unsafe.getByte() to get rid of the default Java array access bounds checks.

Source of the parser: https://github.com/httpdigest/ftljson/blob/master/src/ftljson/JsonParserKS9.java
Companion Unsafe class: https://github.com/httpdigest/ftljson/blob/master/src/ftljson/Unsafe.java

Now, I'm even getting 1.12 Gigabytes per second on my i7-3820QM DDR3-1600 notebook, which was earlier only possible on a high-end desktop class PC. Smiley
Offline Abuse

JGO Ninja


Medals: 60


falling into the abyss of reality


« Reply #15 - Posted 2017-07-06 23:45:41 »

2. Implement that table via a ByteBuffer and do the lookup with sun.misc.Unsafe.getByte() to get rid of the default Java array access bounds checks.

Wow, you really are trusting the integrity of your data  Clueless
Offline KaiHH

JGO Kernel


Medals: 481



« Reply #16 - Posted 2017-07-07 00:01:49 »

Well, not more than trusting that (aByte & 0xFF) can only return values in the range 0-255.
Offline CoDi^R
« Reply #17 - Posted 2017-07-07 11:28:13 »

Interesting topic. FYI if I read the bechmark results right, ks5/6 are faster than ks7/8/9 here on my MacBook Pro 2013, Java version "1.8.0_112".

Robotality - steamworks4j - @code_disaster - codi^r @ #java-gaming
Offline abcdef
« Reply #18 - Posted 2017-07-07 18:59:56 »

I had 5 and 6 fastest overall on Linux Mint
Offline abcdef
« Reply #19 - Posted 2017-07-07 19:29:51 »

I created a new test based on 5 and changed these two methods

1  
2  
3  
4  
5  
6  
        private static boolean ignore(byte b) {
        return (b<=32 || b == 44 || b == 58);
   }
   private static boolean isNumberPart(byte c) {
        return ((c>=48 && c <=57) || c ==43 || c==45 || c==46 || c==69|| c==101);
   }


The times improved further still on 5.
Offline KaiHH

JGO Kernel


Medals: 481



« Reply #20 - Posted 2017-07-07 20:16:52 »

I had 5 and 6 fastest overall on Linux Mint
The OS variant is not as much interesting as the CPU instruction set (x86, amd64/x86_64, ARMv?), CPU vendor (AMD, Intel, ARM) and the Java runtime version you are using. Would be nice if you could provide any insight on those. Thanks for testing!
Offline Icecore
« Reply #21 - Posted 2017-07-07 22:40:48 »

I hope I not mess something up)
Its 2 AM)

1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
17  
18  
19  
20  
21  
22  
23  
24  
25  
26  
27  
28  
29  
30  
31  
32  
33  
34  
35  
36  
37  
38  
39  
40  
41  
42  
43  
44  
45  
46  
47  
48  
49  
50  
51  
52  
53  
54  
55  
56  
57  
58  
59  
60  
61  
62  
63  
64  
65  
66  
67  
68  
69  
70  
71  
72  
73  
74  
75  
76  
77  
78  
79  
80  
81  
82  
83  
84  
85  
86  
87  
88  
89  
90  
91  
92  
93  
94  
95  
96  
97  
98  
99  
100  
101  
102  
public class JsonParserKS6{
   public static interface ParseListener{
      void beginObject();
      void endObject();
      void booleanLiteral(boolean value);
      void numberLiteral(int off, int len);
      void stringLiteral(int off, int len);
      void nullLiteral();
      void beginObjectEntry(int off, int len);
      void beginList();
      void endList();
   }

   final private static byte[] ignore = _to_B("\t\n\r ,:");
   final private static byte[] isNumeric = _to_B("-0123456789");
   final private static byte[] isNumberPart = _to_B("0123456789+-.e");
   final private static byte _ByteIn = 1;

   private static byte[] _to_B(String text){
      byte[] data = new byte[256];
      for(int i = 0; i < text.length(); i++){
         char c = text.charAt(i);
         data[c] = _ByteIn;
      }
      return data;
   }

   private int pos = -1;
   private byte[] input;
   private ParseListener listener;
   private boolean isEscaped(){
      int p = pos;
      boolean escaped = false;
      for(; input[--p] == '\\'; escaped ^= true){}
      return escaped;
   }
   private void json(boolean inobject, boolean cntn){
      do{
         byte pByte;
         do{
            pByte = input[++pos];
         }while(ignore[pByte] == _ByteIn);
         
         int start;
         switch (pByte){
         case '"':
            start = pos + 1;
            while(input[++pos] != '"' || isEscaped()){}
            if(inobject){
               listener.beginObjectEntry(start, pos - start);
               json(false, false);
            }else{
               listener.stringLiteral(start, pos - start);
            }
            break;
         case '[':
            listener.beginList();
            json(false, true);
            break;
         case ']':
            listener.endList();
            return;
         case 'f':
            listener.booleanLiteral(false);
            pos += 4;
            break;
         case 'n':
            listener.nullLiteral();
            pos += 3;
            break;
         case 't':
            listener.booleanLiteral(true);
            pos += 3;
            break;
         case '{':
            listener.beginObject();
            json(true, true);
            break;
         case '}':
            listener.endObject();
            return;
         default:
            if(isNumeric[pByte] == _ByteIn){
               start = pos;
               while(isNumberPart[input[++pos]] == _ByteIn){}
               listener.numberLiteral(start, pos-- - start);
            }
            else{
               new RuntimeException("Unknown Byte=" + pByte + " '" + (char)pByte + "' at pos=" + pos);
            }
            break;
         }
      }while(cntn);
   }
   public int json(byte[] input, ParseListener listener){
      pos = -1;
      this.input = input;
      this.listener = listener;
      json(false, false);
      return pos;
   }
}

Last known State: Reassembled in Cyberspace
End Transmission....
..
.
Offline KaiHH

JGO Kernel


Medals: 481



« Reply #22 - Posted 2017-07-07 22:43:51 »

I hope I not mess something up)
I'm afraid you did. Smiley In the repo there is a unit test which you can use to verify the parser works correctly.
Offline Icecore
« Reply #23 - Posted 2017-07-07 22:45:38 »

I hope I not mess something up)
I'm afraid you did. Smiley In the repo there is a unit test which you can use to verify the parser works correctly.
i don't see it(
up: think i find)  looks like i fix it(i may be wrong)
up3: nope it was benchmark, i find test only now XD
https://github.com/httpdigest/ftljson/blob/master/test/test/ParserTest.java
- Test passed on file "json2.json" ^^

p.s hm - maybe JVM give better performance on small functions

Last known State: Reassembled in Cyberspace
End Transmission....
..
.
Offline abcdef
« Reply #24 - Posted 2017-07-08 10:30:02 »

I had 5 and 6 fastest overall on Linux Mint
The OS variant is not as much interesting as the CPU instruction set (x86, amd64/x86_64, ARMv?), CPU vendor (AMD, Intel, ARM) and the Java runtime version you are using. Would be nice if you could provide any insight on those. Thanks for testing!

The CPU is :

Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz

Java version is :

java version "1.8.0_111"
Java(TM) SE Runtime Environment (build 1.8.0_111-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.111-b14, mixed mode)
Pages: [1]
  ignore  |  Print  
 
 

 
Ecumene (110 views)
2017-09-30 02:57:34

theagentd (144 views)
2017-09-26 18:23:31

cybrmynd (245 views)
2017-08-02 12:28:51

cybrmynd (241 views)
2017-08-02 12:19:43

cybrmynd (240 views)
2017-08-02 12:18:09

Sralse (254 views)
2017-07-25 17:13:48

Archive (872 views)
2017-04-27 17:45:51

buddyBro (1016 views)
2017-04-05 03:38:00

CopyableCougar4 (1573 views)
2017-03-24 15:39:42

theagentd (1373 views)
2017-03-24 15:32:08
List of Learning Resources
by elect
2017-03-13 14:05:44

List of Learning Resources
by elect
2017-03-13 14:04:45

SF/X Libraries
by philfrei
2017-03-02 08:45:19

SF/X Libraries
by philfrei
2017-03-02 08:44:05

SF/X Libraries
by SkyAphid
2017-03-02 06:38:56

SF/X Libraries
by SkyAphid
2017-03-02 06:38:32

SF/X Libraries
by SkyAphid
2017-03-02 06:38:05

SF/X Libraries
by SkyAphid
2017-03-02 06:37:51
java-gaming.org is not responsible for the content posted by its members, including references to external websites, and other references that may or may not have a relation with our primarily gaming and game production oriented community. inquiries and complaints can be sent via email to the info‑account of the company managing the website of java‑gaming.org
Powered by MySQL Powered by PHP Powered by SMF 1.1.18 | SMF © 2013, Simple Machines | Managed by Enhanced Four Valid XHTML 1.0! Valid CSS!