  Parsing URLs from text  (Read 1479 times)
Wildern (Junior Devvie)
« Posted 2009-08-07 14:01:15 »

I know this isn't really game related, but I am hoping there is enough Java expertise here to point me in the right direction.

I need to scan many blocks of text and identify every URL present, whether it is in an anchor tag, an image tag, or plain text.  I believe I can handle the URLs within valid HTML markup just fine.  It is the plain-text portion that is going to give me trouble, specifically attempts to hide the URL within other markup such as size or color.  I can't really show an example of the color trick, but basically they make the surrounding text close enough to the background color to appear invisible.

aaaaa[size=15pt]www.foo.org[/size]aaaa
[size=5pt]aaaaa[/size][size=15pt]www[/size].[size=15pt]foo[/size].[size=15pt]org[/size][size=5pt]aaaaa[/size]

Regular expressions are not really an option, as speed is critical: I need to process 300-500 blocks of text per second, where the smallest block would be roughly the size of this post.

Pointers to open source projects already handling something similar would be perfect, but, if I need to roll my own solution, parsing advice is welcome as well.

Thanks in advance.
Riven (« League of Dukes », « JGO Overlord »)
« Reply #1 - Posted 2009-08-07 14:50:07 »

Something like:

public class UrlScanner // enclosing class; the name is arbitrary
{
   // Callback that receives every URL-like run found in the input.
   static class Feedback
   {
      public void found(String input, int off, int end)
      {
         System.out.println("found: " + input.substring(off, end));
      }
   }

   // Single pass, no regex: reports every run of URL characters
   // that contains at least one period.
   public static void findUrlSimilarPattern(String input, Feedback feedback)
   {
      int off = -1; // start of the current run, -1 when outside a run

      int periods = 0;
      for (int end = 0; end < input.length(); end++)
      {
         char c = input.charAt(end);

         if ((c >= 'a' && c <= 'z') ||
             (c >= 'A' && c <= 'Z') ||
             (c >= '0' && c <= '9') ||
             (c == '%' || c == '_' || c == '/') ||
             (c == '?' || c == '&' || c == '=') ||
             (c == '-' || c == '.' || c == ':'))
         {
            if (off == -1)
               off = end;

            if (c == '.')
               periods++;

            if (end == input.length() - 1)
               end++; // ensure last match (if any) is found
            else
               continue;
         }

         // c ended the run (or the input ended): report it if it had a period
         if (periods != 0 && off != -1)
         {
            feedback.found(input, off, end);
         }

         periods = 0;
         off = -1;
      }
   }

   public static void main(String[] args)
   {
      Feedback feedback = new Feedback();
      findUrlSimilarPattern("hello world www google com bye world", feedback);
      findUrlSimilarPattern("hello world www.google.com bye world", feedback);
      findUrlSimilarPattern("<font>hello world</font>http://www.google.com/search?q=oh%20no<font>www.bye.world.com</font>really! no.com", feedback);
   }
}


Output:
found: www.google.com
found: http://www.google.com/search?q=oh%20no
found: www.bye.world.com
found: no.com

Wildern (Junior Devvie)
« Reply #2 - Posted 2009-08-07 15:51:54 »

Yes, something very similar to that, thanks!

The cases that are proving troublesome are similar to the following:
<font size=1>foo</font><font size=5>w<font>ww.go</font>ogle.com</font><font size=1>bar</font>


Here, simply stripping out the HTML would yield the string "foowww.google.combar", which is not the URL a person would see when presented with the rendered HTML.

Making the prefix/postfix text nearly invisible through color changing is also something I would like to be able to overcome, but I realize that will be a bit more difficult.

I may have to write a document parser that has heuristics to determine if two separate tokens should be merged or not based on their associated font/color attributes and what the intervening punctuation was.
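
A rough sketch of how such a heuristic might start, for the font-size case only. It assumes that only <font size=N> tags matter, that a nested <font> without a size inherits its parent's size, and that text below a chosen threshold counts as hidden. The class name, method names, and the defaultSize/minSize parameters are made up for illustration; color handling and malformed markup are not covered.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;

final class VisibleTextSketch
{
   /** Returns the text of html that is rendered at a font size >= minSize. */
   static String visibleText(String html, int defaultSize, int minSize)
   {
      StringBuilder out = new StringBuilder(html.length());
      Deque<Integer> sizes = new ArrayDeque<>(); // stack of effective font sizes
      sizes.push(defaultSize);

      int i = 0;
      while (i < html.length())
      {
         char c = html.charAt(i);
         if (c != '<')
         {
            // ordinary text: keep it only if the current size is large enough
            if (sizes.peek() >= minSize)
               out.append(c);
            i++;
            continue;
         }

         int close = html.indexOf('>', i);
         if (close == -1)
            break; // unterminated tag: ignore the rest of the input

         String tag = html.substring(i + 1, close).trim().toLowerCase(Locale.ROOT);
         i = close + 1;

         if (tag.equals("/font"))
         {
            if (sizes.size() > 1) // never pop the default size
               sizes.pop();
         }
         else if (tag.equals("font") || tag.startsWith("font "))
         {
            sizes.push(parseSize(tag, sizes.peek()));
         }
         // any other tag is dropped without changing the size
      }
      return out.toString();
   }

   /** Reads the number after "size=", or returns inherited if there is none. */
   static int parseSize(String tag, int inherited)
   {
      int at = tag.indexOf("size=");
      if (at == -1)
         return inherited;

      int p = at + 5;
      if (p < tag.length() && (tag.charAt(p) == '"' || tag.charAt(p) == '\''))
         p++; // skip an opening quote

      int value = 0;
      boolean any = false;
      while (p < tag.length() && tag.charAt(p) >= '0' && tag.charAt(p) <= '9')
      {
         value = value * 10 + (tag.charAt(p) - '0');
         p++;
         any = true;
      }
      return any ? value : inherited;
   }
}
```

With a default size of 3 and a threshold of 2, visibleText applied to the example above returns "www.google.com", which can then be handed to a scanner such as findUrlSimilarPattern. Where to put the threshold is a judgment call.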
Riven
« Reply #3 - Posted 2009-08-07 16:25:06 »

Not to mention HTML encoding, like: &#NNN;

Now, the thing is: do you want to parse the URLs, or simply remove them?

In the end you simply cannot prevent this behaviour. If you encounter a URL like "foowww.google.combar" (after stripping tags) that's a clear indication somebody is trying to bend the rules, and it should be handled like any other URL.
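
The HTML encoding mentioned above (numeric character references, written &#NNN; in HTML) can be undone before the text is scanned. A minimal sketch, assuming only the decimal form is used; named entities such as &amp; and hexadecimal references are left untouched. The class and method names are hypothetical, and no regular expressions are involved, in keeping with the speed requirement.

```java
final class NumericEntitySketch
{
   /** Replaces decimal references such as &#119; with the character they encode. */
   static String decodeDecimalEntities(String input)
   {
      StringBuilder out = new StringBuilder(input.length());
      int i = 0;
      while (i < input.length())
      {
         char c = input.charAt(i);
         if (c == '&' && i + 1 < input.length() && input.charAt(i + 1) == '#')
         {
            int start = i + 2;
            int p = start;
            int code = 0;
            // read at most 7 digits so the accumulator cannot overflow
            while (p < input.length() && p - start < 7 && isDigit(input.charAt(p)))
            {
               code = code * 10 + (input.charAt(p) - '0');
               p++;
            }

            boolean hasDigits = p > start;
            boolean terminated = p < input.length() && input.charAt(p) == ';';
            if (hasDigits && terminated && Character.isValidCodePoint(code))
            {
               out.appendCodePoint(code);
               i = p + 1; // continue after the ';'
               continue;
            }
         }
         // not a well-formed decimal reference: copy the character as-is
         out.append(c);
         i++;
      }
      return out.toString();
   }

   private static boolean isDigit(char ch)
   {
      return ch >= '0' && ch <= '9';
   }
}
```

For instance, decodeDecimalEntities("&#119;&#119;&#119;.foo.org") returns "www.foo.org", so a later scan sees the same text a reader would.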

Riven
« Reply #4 - Posted 2009-08-07 16:42:06 »

Or... try to pattern match this:


 ****    ****    ****    ****   *      *****       ****    ****   *    *
*    *  *    *  *    *  *    *  *      *          *    *  *    *  **  **
*       *    *  *    *  *       *      *****      *       *    *  * ** *
*  ***  *    *  *    *  *  ***  *      *          *       *    *  *    *
*    *  *    *  *    *  *    *  *      *          *    *  *    *  *    *
 ****    ****    ****    ****   *****  *****  **   ****    ****   *    *



Wildern (Junior Devvie)
« Reply #5 - Posted 2009-08-07 17:10:31 »

I need to be able to identify the destination of the URL, which includes seeing through obfuscation via encoding and redirects.
If the "bad guys" resort to ASCII art or embedded images, that is a win.
I just don't want to pass any text containing a bad URL that could be clicked or copied with cut/paste.

I already have the system for dealing with the URLs in C/C++, and I was hoping not to have to rewrite it all from scratch when switching to Java.
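
For the first step of identifying a destination, the standard java.net.URI class can extract the host from a candidate string. The following is only a sketch under stated assumptions: the class and method names are hypothetical, plain-text matches without a scheme are assumed to mean http, and following redirects (which needs a network request) is not attempted.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

final class DestinationHostSketch
{
   /** Returns the lower-cased host of a candidate URL, or null if none can be parsed. */
   static String destinationHost(String candidate)
   {
      // Plain-text matches such as "www.google.com" carry no scheme;
      // assume http so that URI treats the first part as the authority.
      String withScheme = candidate.contains("://") ? candidate : "http://" + candidate;
      try
      {
         String host = new URI(withScheme).getHost();
         return host == null ? null : host.toLowerCase(Locale.ROOT);
      }
      catch (URISyntaxException e)
      {
         return null; // not parseable as a URI, e.g. it contains spaces
      }
   }
}
```

Given "http://www.google.com/search?q=oh%20no", destinationHost returns "www.google.com"; given a string with spaces, it returns null. The lower-casing is there so that hosts differing only in case compare equal against a blocklist.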