Java-Gaming.org
Parsing URLs from text

Wildern (Junior Member)
« Posted 2009-08-07 16:01:15 »

I know this isn't really game-related, but I am hoping there is enough Java expertise here to point me in the right direction.

I need to scan many blocks of text and identify every URL present, whether it appears in an anchor tag, an image tag, or plain text. I believe I can handle URLs within valid HTML markup just fine. The plain-text portion is what will give me trouble, specifically attempts to hide the URL within other markup such as size or color. I can't really show an example of the color trick, but basically the surrounding text is given a color close enough to the background color to appear invisible.

aaaaa[size=15pt]www.foo.org[/size]aaaa
[size=5pt]aaaaa[/size][size=15pt]www[/size].[size=15pt]foo[/size].[size=15pt]org[/size][size=5pt]aaaaa[/size]

Regular expressions are not really an option because speed is critical: I need to process 300-500 blocks of text per second, where the smallest block is roughly the size of this post.

Pointers to open-source projects that already handle something similar would be perfect, but if I need to roll my own solution, parsing advice is welcome as well.

Thanks in advance.
Riven (JGO Overlord, League of Dukes)
« Reply #1 - Posted 2009-08-07 16:50:07 »

Something like:

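// Class name is a placeholder; the original post omitted the class header.
public class UrlScanner
{
   // Callback invoked for every URL-like run found in the input.
   static class Feedback
   {
      public void found(String input, int off, int end)
      {
         System.out.println("found: " + input.substring(off, end));
      }
   }

   // Single pass over the input, no regular expressions: reports every
   // maximal run of URL-legal characters that contains at least one period.
   public static void findUrlSimilarPattern(String input, Feedback feedback)
   {
      int off = -1;    // start of the current run, or -1 when outside a run
      int periods = 0; // number of periods seen in the current run

      for (int end = 0; end < input.length(); end++)
      {
         char c = input.charAt(end);

         if ((c >= 'a' && c <= 'z') ||
             (c >= 'A' && c <= 'Z') ||
             (c >= '0' && c <= '9') ||
             (c == '%' || c == '_' || c == '/') ||
             (c == '?' || c == '&' || c == '=') ||
             (c == '-' || c == '.' || c == ':'))
         {
            if (off == -1)
               off = end;

            if (c == '.')
               periods++;

            if (end == input.length() - 1)
               end++; // ensure last match (if any) is found
            else
               continue;
         }

         // Reached a separator (or the end of the input): report the run
         // only if it contained a period, then reset for the next run.
         if (periods != 0 && off != -1)
         {
            feedback.found(input, off, end);
         }

         periods = 0;
         off = -1;
      }
   }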

   public static void main(String[] args)
   {
      Feedback feedback = new Feedback();
      findUrlSimilarPattern("hello world www google com bye world", feedback);
      findUrlSimilarPattern("hello world www.google.com bye world", feedback);
      findUrlSimilarPattern("<font>hello world</font>http://www.google.com/search?q=oh%20no<font>www.bye.world.com</font>really! no.com", feedback);
   }
}


Output:
found: www.google.com
found: http://www.google.com/search?q=oh%20no
found: www.bye.world.com
found: no.com
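
One caveat with a scan like the one above: any run of allowed characters that contains a period is reported, so a word ending a sentence ("world.") would be reported too. Below is a minimal sketch of a follow-up check that could be applied to each candidate. The names `CandidateFilter` and `looksLikeHost` are made up for illustration and are not part of the code above; the rule (a period with a letter or digit on both sides) is only a heuristic and would still accept something like "3.14".

```java
// Hypothetical helper, not part of the posted code: a cheap sanity check
// for a candidate substring before it is reported as a URL.
final class CandidateFilter
{
   static boolean isAlnum(char c)
   {
      return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
   }

   // True if the candidate contains at least one period that has a letter
   // or digit directly before it and directly after it.
   static boolean looksLikeHost(String candidate)
   {
      for (int i = 1; i < candidate.length() - 1; i++)
      {
         if (candidate.charAt(i) == '.'
            && isAlnum(candidate.charAt(i - 1))
            && isAlnum(candidate.charAt(i + 1)))
            return true;
      }
      return false;
   }

   public static void main(String[] args)
   {
      System.out.println(looksLikeHost("www.google.com")); // true
      System.out.println(looksLikeHost("world."));         // false
      System.out.println(looksLikeHost("..."));            // false
   }
}
```

The check is a single pass over the candidate with no allocation, so it should not change the overall cost of the scan much.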

Wildern (Junior Member)
« Reply #2 - Posted 2009-08-07 17:51:54 »

Yes, something very similar to that, thanks!

The cases that are proving troublesome are similar to the following:
<font size=1>foo</font><font size=5>w<font>ww.go</font>ogle.com</font><font size=1>bar</font>


Here, simply stripping out the HTML would result in the string "foowww.google.combar", which is not the URL a person would easily see when presented with the rendered HTML.

Making the prefix/postfix text nearly invisible through color changing is also something I would like to be able to overcome, but I realize that will be a bit more difficult.

I may have to write a document parser that has heuristics to determine if two separate tokens should be merged or not based on their associated font/color attributes and what the intervening punctuation was.
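
A rough sketch of that idea, limited to the font-size case: walk the markup once, track the effective `<font size=N>` on a small stack, and keep only text that is at least a chosen size. Everything here is an assumption for illustration: the names `VisibleText`, `extract` and `parseSize`, the default size of 3, and the threshold passed by the caller. Relative sizes such as `size=+2`, color attributes, and BBCode are not handled.

```java
// Sketch only: keeps text whose effective <font size=N> is at least minSize.
final class VisibleText
{
   static String extract(String html, int minSize)
   {
      StringBuilder out = new StringBuilder();
      // Each opening <font> tag is at least 6 characters long, so this
      // stack cannot overflow.
      int[] sizes = new int[html.length() / 6 + 2];
      int depth = 0;
      sizes[0] = 3; // assumed default font size

      int i = 0;
      while (i < html.length())
      {
         char c = html.charAt(i);
         int close = (c == '<') ? html.indexOf('>', i) : -1;
         if (close == -1)
         {
            // plain text (or a stray '<'): keep it only if large enough
            if (sizes[depth] >= minSize)
               out.append(c);
            i++;
            continue;
         }

         String tag = html.substring(i + 1, close).trim().toLowerCase(java.util.Locale.ROOT);
         if (tag.startsWith("/font"))
         {
            if (depth > 0)
               depth--;
         }
         else if (tag.equals("font") || tag.startsWith("font "))
         {
            int size = parseSize(tag, sizes[depth]); // inherit when absent
            depth++;
            sizes[depth] = size;
         }
         // every other tag is dropped without touching the size stack
         i = close + 1;
      }
      return out.toString();
   }

   // Returns the numeric value of size=N in the tag, or fallback if missing.
   static int parseSize(String tag, int fallback)
   {
      int p = tag.indexOf("size=");
      if (p == -1)
         return fallback;
      p += 5;
      if (p < tag.length() && (tag.charAt(p) == '"' || tag.charAt(p) == '\''))
         p++;
      int value = 0;
      int digits = 0;
      while (p < tag.length() && digits < 4
         && tag.charAt(p) >= '0' && tag.charAt(p) <= '9')
      {
         value = value * 10 + (tag.charAt(p) - '0');
         p++;
         digits++;
      }
      return digits == 0 ? fallback : value;
   }

   public static void main(String[] args)
   {
      String tricky = "<font size=1>foo</font><font size=5>w<font>ww.go</font>ogle.com</font><font size=1>bar</font>";
      System.out.println(extract(tricky, 3)); // www.google.com
   }
}
```

The output of `extract` could then be fed to a URL scan, so that the tiny "foo" and "bar" padding no longer sticks to the visible URL.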
Riven (JGO Overlord, League of Dukes)
« Reply #3 - Posted 2009-08-07 18:25:06 »

Not to mention HTML character encoding, like: &#NNN;

Now, the thing is: do you want to parse the URLs, or simply remove them?

In the end you simply cannot prevent this behaviour. If you encounter a URL like "foowww.google.combar" (after stripping tags) that's a clear indication somebody is trying to bend the rules, and it should be handled like any other URL.
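
For the encoding case mentioned above, a minimal sketch of decoding decimal numeric character references before scanning, so that an encoded "www" becomes visible to the scan. The names `EntityDecoder` and `decodeNumeric` are invented for this example; named entities such as `&amp;` and hexadecimal references such as `&#x77;` are deliberately left untouched.

```java
// Sketch: decodes decimal numeric character references such as &#119;
final class EntityDecoder
{
   static String decodeNumeric(String input)
   {
      StringBuilder out = new StringBuilder(input.length());
      int i = 0;
      while (i < input.length())
      {
         char c = input.charAt(i);
         if (c == '&' && i + 2 < input.length() && input.charAt(i + 1) == '#')
         {
            int p = i + 2;
            int code = 0;
            // read at most 7 digits so the value cannot overflow an int
            while (p < input.length() && p - i - 2 < 7
               && input.charAt(p) >= '0' && input.charAt(p) <= '9')
            {
               code = code * 10 + (input.charAt(p) - '0');
               p++;
            }
            // accept only a well-formed reference: digits followed by ';'
            if (p > i + 2 && p < input.length() && input.charAt(p) == ';'
               && Character.isValidCodePoint(code))
            {
               out.appendCodePoint(code);
               i = p + 1;
               continue;
            }
         }
         // anything else is copied through unchanged
         out.append(c);
         i++;
      }
      return out.toString();
   }

   public static void main(String[] args)
   {
      System.out.println(decodeNumeric("&#119;&#119;&#119;.foo.org")); // www.foo.org
   }
}
```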

Riven (JGO Overlord, League of Dukes)
« Reply #4 - Posted 2009-08-07 18:42:06 »

Or... try to pattern-match this:


 ****    ****    ****    ****   *      *****       ****    ****   *    *
*    *  *    *  *    *  *    *  *      *          *    *  *    *  **  **
*       *    *  *    *  *       *      *****      *       *    *  * ** *
*  ***  *    *  *    *  *  ***  *      *          *       *    *  *    *
*    *  *    *  *    *  *    *  *      *          *    *  *    *  *    *
 ****    ****    ****    ****   *****  *****  **   ****    ****   *    *



Wildern (Junior Member)
« Reply #5 - Posted 2009-08-07 19:10:31 »

I need to be able to identify the destination of the URL, even when it is obfuscated via encoding or redirects.
If the "bad guys" resort to ASCII art or embedded images, that counts as a win.
I just don't want to pass any text containing a bad URL that could be clicked or copied and pasted.

I already have a system for dealing with the URLs in C/C++; I was hoping not to have to rewrite it all from scratch when switching to Java.
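
As a small illustration of the encoding side of this, here is a sketch that undoes `%XX` escapes for ASCII values, so that "%67oogle" reads as "google" before the URL is checked. The names `PercentDecoder` and `decodeAscii` are hypothetical. Escapes of 0x80 and above (parts of multi-byte UTF-8 sequences) are left as written, double encoding is not unwound, and following redirects needs network access and is outside the scope of this sketch.

```java
// Sketch: decodes %XX escapes whose value is plain ASCII (below 0x80).
final class PercentDecoder
{
   // Value of a hexadecimal digit, or -1 if c is not one.
   static int hex(char c)
   {
      if (c >= '0' && c <= '9') return c - '0';
      if (c >= 'a' && c <= 'f') return c - 'a' + 10;
      if (c >= 'A' && c <= 'F') return c - 'A' + 10;
      return -1;
   }

   static String decodeAscii(String url)
   {
      StringBuilder out = new StringBuilder(url.length());
      for (int i = 0; i < url.length(); i++)
      {
         char c = url.charAt(i);
         if (c == '%' && i + 2 < url.length())
         {
            int hi = hex(url.charAt(i + 1));
            int lo = hex(url.charAt(i + 2));
            if (hi >= 0 && lo >= 0 && (hi * 16 + lo) < 0x80)
            {
               out.append((char) (hi * 16 + lo));
               i += 2; // skip the two hex digits just consumed
               continue;
            }
         }
         // not a decodable escape: copy the character through unchanged
         out.append(c);
      }
      return out.toString();
   }

   public static void main(String[] args)
   {
      System.out.println(decodeAscii("http://www.%67oogle.com/")); // http://www.google.com/
   }
}
```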