Java-Gaming.org    
Featured games (81)
games approved by the League of Dukes
Games in Showcase (498)
Games in Android Showcase (116)
games submitted by our members
Games in WIP (563)
games currently in development
News: Read the Java Gaming Resources, or peek at the official Java tutorials
 
    Home     Help   Search   Login   Register   
Pages: [1]
  ignore  |  Print  
  [Renamed] Ye mighty lexer compiler!  (Read 10258 times)
0 Members and 1 Guest are viewing this topic.
Offline keldon85

Senior Member


Medals: 1



« Posted 2007-11-23 12:40:37 »

Updated 20 July 2012 with Github link

I've done it! Ye mighty auto tokenizer allows you to define a grammar and tokenizer strings. I've never done it, looked at other code, or read anything on how to do this so there is a chance I've used a pattern I never knew I was using, etc.

You feed it a token grammar and a string, and this son-of-a-gun will give you the next token. Warning: like any declarative language this will do what you tell it to do, therefore do not give it faulty grammars (like ones that will accept nothing)!

1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
17  
18  
19  
20  
21  
22  
23  
24  
25  
26  
27  
28  
29  
30  
31  
32  
33  
34  
35  
      ArrayList<Node> alphaNumericList = new ArrayList<Node>();
      ArrayList<Node> sentenceList = new ArrayList<Node>();
      ArrayList<Node> anotherWordList = new ArrayList<Node>();
      ArrayList<Node> wordList = new ArrayList<Node>();
      ArrayList<Node> whitespaceList = new ArrayList<Node>();
     
      Node whitespace;
      Node word;
      Node anotherWord;
      Node sentence;
     
      whitespaceList.add ( TerminalFactory.createTerminalString(" "));
      whitespaceList.add (  RepetitionFactory.createRepetition(TerminalFactory.createTerminalString(" ")) );
      whitespace = ListFactory.createList(whitespaceList);
     
      alphaNumericList.add(number());
      alphaNumericList.add(letter());
     
      wordList.add (letter());
      wordList.add(RepetitionFactory.createRepetition( letter() ));
      word = ListFactory.createList(wordList);

      anotherWordList.add(whitespace);
      anotherWordList.add(word);
      anotherWord = ListFactory.createList(anotherWordList);
     
      sentenceList.add (word);
      sentenceList.add (RepetitionFactory.createRepetition(anotherWord));
      sentenceList.add (OptionFactory.createOptional(TerminalFactory.createTerminalString(".")));
      sentence = ListFactory.createList(sentenceList);
     
     
      System.out.println ( Parser.parse("The dirty fairies are dead", sentence));
      System.out.println ( Parser.parse("The dirty fairies are dead.", sentence));
      System.out.println ( Parser.parse("The dirty fai.ries are dead.", sentence));


Output:
1  
2  
3  
26
27
14



Legal disclaimer:
By downloading the said file, knowingly or not, you agree to have no rights to its code or your knowledge of the knowledge gained upon mentally processing it. You have no copying rights, understanding rights, or right to process any thoughts derived from the knowledge of the said file. You are however given the right to live and breathe under the condition you do so without violation of any stated rule in this disclaimer.


Code available at https://github.com/keldon85-avasopht/mighty-parser

As for the approach, I've never done anything like this before, I just looked at the rules for grammars and created factories for them. The parser does most of the work, but you can see from the [poorly (un)commented] code how it works.

Offline keldon85

Senior Member


Medals: 1



« Reply #1 - Posted 2007-11-23 12:43:44 »

One more thing; I will eventually add some form of faulty grammar detection. Basically if you can reach an end node without a decision or entering a node from a node then that node is faulty. A node is also faulty if it contains a grammar inside itself that is faulty - i.e. node.getInside() is faulty, so I will conjure up the code for it at some time.

Offline ryanm

Senior Member


Projects: 1
Exp: 15 years


Used to be bleb


« Reply #2 - Posted 2007-11-23 13:46:01 »

Huh That seems like a lot of code to recreate String.indexOf(). Perhaps a fuller example might help extol the benefits?
Games published by our own members! Check 'em out!
Legends of Yore - The Casual Retro Roguelike
Offline keldon85

Senior Member


Medals: 1



« Reply #3 - Posted 2007-11-23 14:34:57 »

It does a lot more than string.indexOf; it allows you to define a grammar for parsing. That was just an example to show you how to define a grammar for a simple sentence, which could end with a period. You can define any grammar you want, html, css, a command system in French, you can even define a programming language and have it parse that.

Next I will upgrade the code to return the parse tree too, which will make it "ye mighty auto lexer", or something.

Offline keldon85

Senior Member


Medals: 1



« Reply #4 - Posted 2007-11-23 17:59:15 »

I've made a further update to have it select from a list of grammars, returning the grammar that gave the best match. Note that ambiguous grammars will give ambiguous results, such as one grammar accepting white space (only) and another beginning with being able to accept white space. On another note I've added in detection of faulty grammar!

Offline Abuse

JGO Knight


Medals: 12


falling into the abyss of reality


« Reply #5 - Posted 2007-11-24 00:26:29 »

Can't make head nor tail of your example Huh

However, is this not something that is achievable with regular expressions?

Make Elite IV:Dangerous happen! Pledge your backing at KICKSTARTER here! https://dl.dropbox.com/u/54785909/EliteIVsmaller.png
Offline keldon85

Senior Member


Medals: 1



« Reply #6 - Posted 2007-11-24 00:37:31 »

Can't make head nor tail of your example Huh

However, is this not something that is achievable with regular expressions?

The example is simple, I defined the following grammar ...
1  
2  
3  
word = LETTER {LETTER} .
anotherWord = whiteSpace word .
sentence = word {anotherWord} ["."] .

 ... and then passed the grammar the provided sentences to see how much of it is accepted by the grammar.

It is achievable using regular expressions (or JavaCC), however figuring this out by yourself from scratch without resorting to any text book or reference as to how it is (or could be) achieved is something that not a lot of people can attest to having experience with. Also it allows me to implement it anywhere, such as on an embedded device restricted to using C/assembler, or just about anywhere I can think of.

There is more to be added, for example it currently tells you how much has been accepted, but does not quite tokenize to the last reasonable end for a grammar. I've modified the code to determine of a string is acceptable in its entirety, but I will make one final modification to make it only return a reasonable string that is entirely compatible (if you get what I mean).

Offline keldon85

Senior Member


Medals: 1



« Reply #7 - Posted 2007-11-24 01:14:52 »

Okay, I've made the next update - the parser will now return the largest valid token from a string, which it was not doing before. Before the code could either determine whether a string completely satisfies a grammar or extract the largest string that is accepted. Now it will extract the largest valid grammar, hence is a complete tokenizer & lexer.

Now if I want I could have some factories that load a grammar from file or something, or even a regular expression.

Offline DzzD
« Reply #8 - Posted 2007-11-24 06:13:19 »

sry, I dont understand anything, could you explain a simple test case please ?

Offline keldon85

Senior Member


Medals: 1



« Reply #9 - Posted 2007-11-24 08:28:43 »

---=== THIS POST CONTAINS A SIMPLE EXPLANATION AND TEST CASE ===---

sry, I dont understand anything, could you explain a simple test case please ?

Okay, basically this code can automatically tokenize any string based on the [CFG] rules you give it. And for free it also doubles up as an automatic lexer.

So as a test case, consider the following grammar (shown in the last post, and also in the code example) ...
1  
2  
3  
word = LETTER {LETTER} .
anotherWord = whiteSpace word .
sentence = word {anotherWord} ["."] .

... and also the following rule (also defined in the code) ...
1  
whitespace = " " {" "} .

... if you were to call Parser.parse ("This is a lovely valid sentence. This is the next sentence!", sentence), it would return you the number of characters that form a valid sentence, which would be "This is a lovely sentence."

If you then were to call Parser.parse ("       This is not a valid sentence because sentences do not begin with whitespace.", sentence), it would return -1 to tell you that it could not return a token. If you were to pass whitespace instead of sentence then it would return 7, i.e. "       "!

My latest code update will also return a lexeme (as com.avasopht.mightyParser.traversing.Token).

Edit: I did a little speed test with a valid sentence formed of 12 031 characters; it searched 2 140 577 nodes and took 1.469 seconds to parse. Bearing in mind that the grammar depicting sentences involves branching for each character, increasing the search space. If you have a specific word then it generates a list that does not require so many nodes, so if your grammar defines the text, "some_reserved_keyword" then it will not need to branch at each letter.

Games published by our own members! Check 'em out!
Legends of Yore - The Casual Retro Roguelike
Offline Riven
« League of Dukes »

JGO Overlord


Medals: 800
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #10 - Posted 2007-11-24 13:43:25 »

Something like...

1  
2  
digits = DIGIT {DIGIT} .
decimal = digits {["."]} {digits} {["e"] digits} .



?

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Offline keldon85

Senior Member


Medals: 1



« Reply #11 - Posted 2007-11-24 14:02:10 »

Something like...

1  
2  
digits = DIGIT {DIGIT} .
decimal = digits {["."]} {digits} {["e"] digits} .



?
Yes, although it should be ...
1  
decimal = digits ["." digits] ["e" digits] .

... as the one you wrote will accept "123............123e123e123e123". And never infinitely accept an option, take out my grammar check and turn on debugging in Parser.java to see the damage that could do Smiley

Edit: just saw the TinyCode challenges, interesting!

Offline Riven
« League of Dukes »

JGO Overlord


Medals: 800
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #12 - Posted 2007-11-24 14:27:31 »

I added a few { and } because those parts are optional.

I wouldn't know how I'd do it any other way.

123
123e456
123.456
123.456e789

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Offline keldon85

Senior Member


Medals: 1



« Reply #13 - Posted 2007-11-24 14:41:32 »

Ah lol; I write {} as [optional] repeats and [] as options, whereas you write {} as options ! And although I don't do it on purpose, I [often] tend to write an optional words in square brackets when writing [normal] sentences in chat Cheesy

Offline keldon85

Senior Member


Medals: 1



« Reply #14 - Posted 2007-11-24 16:35:27 »

Ok, I've written what appears to be the final [necessary] update. There was a possibility of terminating early and not getting the largest token size by accepting every character in the string and reaching an end node without fully satisfying the grammar completely (i.e. an incomplete branch). Although it never occurred (in my presence) that bug has been fixed. The only expected updates (as of now) will be commenting.

Note that most of the work takes place in parser, which is just your every day depth first search using iteration as opposed to recursion! The rest are simple factories, which I may update to be instances, therefore removing the need for implementing code to have to know about lists and collections.

Offline Riven
« League of Dukes »

JGO Overlord


Medals: 800
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #15 - Posted 2007-11-24 17:21:03 »

That's all nice, but a product without proper 'marketing' is worthless.


Could you show us something that might drop a few yaws?

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Offline keldon85

Senior Member


Medals: 1



« Reply #16 - Posted 2007-11-24 18:46:48 »

That's all nice, but a product without proper 'marketing' is worthless.
Very true! Well my purpose of writing it was down to two motivations: we don't have anything like this in our code repository at work, especially for embedded [memory starved] systems, and it was really nagging me that I knew it could be done with little difficulty. Apart from that it offers nothing above the others, although I am considering left recursion removal, or at least detection of it.

If I can achieve removal of left recursion I'll think about pushing it as a useful tool, second to that I'm working on a C++ version. As for some 'yaw' value, I'll think of something!

Offline DzzD
« Reply #17 - Posted 2007-11-25 00:49:43 »

ok, thanks for explanation you post above, I have a better understanding now, but may be I still missed something :-(

what is better in it than directly using regular expression ? I guess it is more powerfull enabling more stuff ?

Offline keldon85

Senior Member


Medals: 1



« Reply #18 - Posted 2007-11-25 01:20:46 »

It's not really about it being more powerful or better, it's all personal; I literally work up at 6am Wednesday morning after going to sleep at 1 thinking that this is possible. I had no idea I'd have a solution for Friday morning (or Thursday evening when I completely figured it out). My first idea was terribly dumb (on Wednesday morning), I'm quite surprised at how easy it was. It's basically a really simple search for the longest valid placement of letters that lead to the end node!

Technically speaking iterative deepening should not only find the solution, but will do so without getting caught in an infinite loop on left recursion! Source code is available online for anyone who wants to have a try! I'll even race ya!! On second thoughts it wouldn't! It's still infinite, so it will tell you if the entire string is valid, but not return a token.

Offline keldon85

Senior Member


Medals: 1



« Reply #19 - Posted 2007-11-28 11:48:12 »

I've made a further advancement in the code, which I need to update some time. Basically the parser can have the saving of the stack eliminated as it is redundant in the presence of the "choice stack". This [should] give it a significant speed improvement.

Offline Niwak

Senior Member


Medals: 1
Projects: 1



« Reply #20 - Posted 2007-11-28 12:06:43 »

Have you tried http://www.antlr.org/ ?
I've used it a lot and it is a very convenient and efficient solution for this type of problem.
Especially given that it comes with a grammar design/interpret/debug tool.
Offline keldon85

Senior Member


Medals: 1



« Reply #21 - Posted 2007-12-06 13:58:42 »

That does seem interesting; my code currently works as a lex/tokenizer compiler, but I'm also looking to make it generate parse trees. In fact it is practically capable of doing that! What you have here though is the code to do so yourself; you can examine it and see how simple these programs are. Note that there is an error I've updated on my workstation but not here. Since I'm at work I have no access to it.

The nodes should [also] correspond to Von Neumann architecture, where the NODE structure can compile to code/commands. Of course I have no current ambitions for that, though it would be fun. Second to fun, it would be nice to have the code interpret the grammar in text, i.e. you could give it this text stream ...
1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
17  
18  
19  
20  
21  
22  
23  
24  
25  
 program = block "." .
 
 block =
     ["const" ident "=" number {"," ident "=" number} ";"]
     ["var" ident {"," ident} ";"]
     {"procedure" ident ";" block ";"} statement .
 
 statement =
     [ident ":=" expression
     | "call" ident
     | "begin" statement {";" statement} "end"
     | "if" condition "then" statement
     | "while" condition "do" statement
     ] .
 
 condition =
     "odd" expression
     | expression ("="|"#"|"<"|"<="|">"|">=") expression
     .
 
 expression = ["+"|"-"] term {("+"|"-") term} .
 
 term = factor {("*"|"/") factor} .
 
 factor = ident | number | "(" expression ")" .

... and then use the following code ...
1  
NODE term_grammar = grab_identity ("term");

... I think I'll do this some weekend or something.

Offline keldon85

Senior Member


Medals: 1



« Reply #22 - Posted 2007-12-07 16:18:28 »

Oh yeah one more thing, full source code is available at the Mighty Parser github!

Offline Riven
« League of Dukes »

JGO Overlord


Medals: 800
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #23 - Posted 2007-12-07 20:46:15 »

Lazy as I am, could you post something that compiles, and that walks the generated tree ?

It's lots easier to mess with something that works Smiley

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Offline keldon85

Senior Member


Medals: 1



« Reply #24 - Posted 2007-12-08 03:43:26 »

It compiles and walks the tree; just import the Jar into Eclipse. In the Parser class there is a boolean to debug_output the tree walking. I've made some additions in the office as I tend to do tool researching out of hours, so when I get back into the office I'll give you the update that's got it screwing around with a particular type of config file I use. It lexes it, and I'll even make it parse the file (on Monday at around 18:20 GMT after I finish work).

If there is trouble compiling, try creating a java project (in Eclipse) and importing into the src(source) folder. If you don't use Eclipse then maybe someone could cook up an ant makefile (I have no idea how easy/hard they are).

So if that code isn't enough (at the moment) it should be at 18:20GMT on Monday - maybe over the weekend if I go in! So I guess I'll have to do something to make this more 'visible'; and outputting to the console is so pre 2k, I'll output to a JFrame or something. I will [eventually] make it read the grammars from text in the not too distant future!

Edit: not today, I've got a ticket for a film screening Cool (wootages)

Offline Fuco

Innocent Bystander





« Reply #25 - Posted 2010-04-07 22:47:09 »

what is better in it than directly using regular expression ? I guess it is more powerfull enabling more stuff ?

CFGs are indeed more powerful then regular expresions, in fact, infinitely times more powerful. See http://en.wikipedia.org/wiki/Chomsky_hierarchy

Type-3, or Regular languages, described by RE, are only a subset of CFG (Type-2). Very simple example of language you can't describe with RE is a^nba^n (that is, some number of a, then b, and then same number of a again).

EDIT: wow, I'm a necromancer... Cheesy
Offline Roquen
« Reply #26 - Posted 2010-04-08 06:24:32 »

Bring out your dead.

Generally, if the grammar is simple enough, I write descent parsers otherwise I like packrat/PEG.
Offline DzzD
« Reply #27 - Posted 2010-04-08 17:05:01 »

wow finaly ! I will have wait for this answer nearly for 3 years now (ayway interresting)

Offline Nate

JGO Kernel


Medals: 147
Projects: 4
Exp: 14 years


Esoteric Software


« Reply #28 - Posted 2010-04-08 21:41:45 »

Every time I try to use a damned parser generator I fail. I've put serious effort into it multiple times. Tried JavaCC and antlr. I just don't seem to be able to wrap my head around how to approach solving ambiguity problems.

keldon85's API looks nice and straightforward. I would worry that it silently ignores ambiguity though.

Offline keldon85

Senior Member


Medals: 1



« Reply #29 - Posted 2012-07-20 11:29:40 »

Although this topic has not been posted in for 180 days, I believe responding here is justified as there are unanswered questions.

I've uploaded the code to github now, but you should be aware that another library (parboiled) does the same job, and goes further to return a parse tree and provides a nice way to express your grammars.

Quote
wow finaly ! I will have wait for this answer nearly for 3 years now (ayway interresting)
Oh, I thought I pretty much responded to your question (though I didn't quote you). Sorry about that.

Quote
keldon85's API looks nice and straightforward. I would worry that it silently ignores ambiguity though.
It most likely does.

---

I ended up using it for lexing in a scripting system by the way.

Pages: [1]
  ignore  |  Print  
 
 
You cannot reply to this message, because it is very, very old.

 

Add your game by posting it in the WIP section,
or publish it in Showcase.

The first screenshot will be displayed as a thumbnail.

radar3301 (12 views)
2014-09-21 23:33:17

BurntPizza (30 views)
2014-09-21 02:42:18

BurntPizza (22 views)
2014-09-21 01:30:30

moogie (20 views)
2014-09-21 00:26:15

UprightPath (28 views)
2014-09-20 20:14:06

BurntPizza (32 views)
2014-09-19 03:14:18

Dwinin (48 views)
2014-09-12 09:08:26

Norakomi (74 views)
2014-09-10 13:57:51

TehJavaDev (102 views)
2014-09-10 06:39:09

Tekkerue (50 views)
2014-09-09 02:24:56
List of Learning Resources
by Longor1996
2014-08-16 10:40:00

List of Learning Resources
by SilverTiger
2014-08-05 19:33:27

Resources for WIP games
by CogWheelz
2014-08-01 16:20:17

Resources for WIP games
by CogWheelz
2014-08-01 16:19:50

List of Learning Resources
by SilverTiger
2014-07-31 16:29:50

List of Learning Resources
by SilverTiger
2014-07-31 16:26:06

List of Learning Resources
by SilverTiger
2014-07-31 11:54:12

HotSpot Options
by dleskov
2014-07-08 01:59:08
java-gaming.org is not responsible for the content posted by its members, including references to external websites, and other references that may or may not have a relation with our primarily gaming and game production oriented community. inquiries and complaints can be sent via email to the info‑account of the company managing the website of java‑gaming.org
Powered by MySQL Powered by PHP Powered by SMF 1.1.18 | SMF © 2013, Simple Machines | Managed by Enhanced Four Valid XHTML 1.0! Valid CSS!