  [SOLVED] Website info Grabbing? (Answer: Regex)  (Read 2372 times)
Offline jonjava
« Posted 2012-02-22 19:33:32 »

Good evening!

Lots of websites share very useful information. Simple things, like this week's lottery numbers, as on this page: http://www.yle.fi/tekstitv/txt/P471_01.html. The point is that every week this web page gets updated with the newest lottery numbers.

Now, how can these numbers be used in my application? Websites everywhere have information you would want to use, but they don't necessarily publish it in a form that is easy for applications to consume. Instead it comes in a format that is easy for human beings to read.

What is the best way of using such pages in applications? I was thinking about downloading the .html page from the URL and dividing it into lines, then going through each line until I find the one I want, based on a matching string that sits close to the information I'm really after. In the example above this would be the string "OIKEAT NUMEROT:", which directly precedes the winning lottery numbers. However, the .html is riddled with HTML tags, so you'd have to work your way through all the nonsense to get to the information you want (the lottery numbers).
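The line-scanning idea described above could be sketched roughly as follows. This is only an illustration: the class name, the `findLine` helper, and the sample HTML (loosely modeled on the page's markup) are assumptions, and nothing is downloaded here.

```java
class LineScanSketch {
    // Hypothetical helper: returns the first line that contains the marker, or null.
    static String findLine(String html, String marker) {
        for (String line : html.split("\\r?\\n")) {
            if (line.contains(marker)) {
                return line;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Made-up sample input, not the real page source
        String html = "<pre>\n"
                + "<font>OIKEAT NUMEROT:</font> <font color=\"#000000\">5,8,12,17,25,35,38     </font>\n"
                + "</pre>";
        // The returned line still contains the font tags around the numbers
        System.out.println(findLine(html, "OIKEAT NUMEROT:"));
    }
}
```

The line that comes back is still full of tags, which is exactly the problem raised above: locating the line is easy, cleaning it up is the fiddly part.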

Next week the layout is the same but something might have changed slightly and your code might not work. Is there a better way?

For instance, in the browser the winning lottery numbers are directly preceded by the string "OIKEAT NUMEROT:" - grabbing the numbers from a layout like that would be much easier than an .html file.

Is there a way to convert an .html file into plain text as it appears in the browser? Is this the right way?

Input appreciated!

Thank you,
jon

[EDIT]

Here's the final program
http://lotto.pastebay.net/366193

Offline Cero
« Reply #1 - Posted 2012-02-22 20:04:59 »

Quote
Is there a way to convert an .html file into plain text as it appears in the browser? Is this the right way?

Yeah, you just strip all the tags.
I'm not sure exactly how you do this in Java; PHP has built-in functions for it, and I guess Java has something similar.
Also, it's just a big string, so all the usual string operations work on it too.

A quick search for "strip html tags java" brought this up: http://htmlcleaner.sourceforge.net/
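Since the page is just one big string, a crude tag stripper can be written with plain `String.replaceAll`. This is a minimal sketch, not what HtmlCleaner does: the class and method names are made up, and the pattern will misbehave on comments, scripts, or a literal `>` inside an attribute value.

```java
class StripTagsSketch {
    // Removes everything between '<' and the next '>' (a rough approximation of a tag)
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // Made-up sample input
        String html = "<font>OIKEAT NUMEROT:</font> <font color=\"#000000\">5,8,12</font>";
        System.out.println(stripTags(html)); // prints "OIKEAT NUMEROT: 5,8,12"
    }
}
```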

Offline sproingie

« Reply #2 - Posted 2012-02-22 20:16:41 »

JSoup is also good for pulling out tags and content, leaving the rest for regex matching.  HTMLUnit is a test framework, but also useful for general web-scraping tasks.  If you need to handle javascript, you're probably stuck with Selenium, which is a massive pill to swallow, but ultimately it'll do damn near anything you can imagine (just not fast).

Offline jonjava
« Reply #3 - Posted 2012-04-07 00:56:16 »

sproingie: I wish that you'd have more thoroughly highlighted the importance of regex (regular expression) since I now realize that regular expressions were the ultimate answer to my question. You really don't need the html parser at all for what I was trying to do.

I have found a video, http://www.youtube.com/watch?v=kWyoYtvJpe4, that explains regex quite well. It is in Python, but you shouldn't have much trouble translating the ideas into Java: http://docs.oracle.com/javase/tutorial/essential/regex/.

For those familiar with Unix, regex is basically a super powerful version of grep.


So the answer to my original question, how to find and grab this week's lottery numbers from http://www.yle.fi/tekstitv/txt/P471_01.html (the 7 numbers that follow "OIKEAT NUMEROT:"), would have been something like this, with the help of regex:

OIKEAT NUMEROT: 5,8,12,17,25,35,38    

// Param1: Regex
// Param2: website url
obj = grab( r'OIKEAT NUMEROT:\s*(\d{1,2}),\s*(\d{1,2}),\s*(\d{1,2}),\s*(\d{1,2}),\s*(\d{1,2}),\s*(\d{1,2}),\s*(\d{1,2})\s*' , url );
int[] num = new int[7];
for(int i=0; i<7; i++)
 num[i] = obj.get(i);

// or something like that, I haven't explored how exactly regex are utilized in java yet.

Offline sproingie

« Reply #4 - Posted 2012-04-07 02:34:18 »

sproingie: I wish that you'd have more thoroughly highlighted the importance of regex (regular expression) since I now realize that regular expressions were the ultimate answer to my question. You really don't need the html parser at all for what I was trying to do.

You really shouldn't parse HTML with regular expressions or bad things happen

Offline jonjava
« Reply #5 - Posted 2012-04-07 02:36:18 »

I made a sample program that uses the correct Java regex to find this week's lottery numbers and additional numbers from the website.

The only thing that differed from the Python video was basically that you have to use double backslashes.
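To illustrate the double-backslash point: a Python raw string such as `r'\d{1,2}'` becomes `"\\d{1,2}"` in Java, because the Java compiler consumes one level of backslashes before the regex engine ever sees the pattern. A small sketch, with a made-up class and helper name:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackslashSketch {
    // Returns the first 1-2 digit number found in the text, or null if there is none.
    static String firstNumber(String text) {
        // The regex \d{1,2} written as a Java string literal: the backslash is doubled
        Matcher m = Pattern.compile("(\\d{1,2})").matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstNumber("OIKEAT NUMEROT: 17,25")); // prints "17"
    }
}
```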

First, here are the working Java regular expressions that find the wanted information:
Quote
String regex = "OIKEAT NUMEROT:.+[>](\\d{1,2})\\s*.\\s*(\\d{1,2})\\s*.\\s*(\\d{1,2})\\s*.\\s*(\\d{1,2})\\s*.\\s*(\\d{1,2})\\s*.\\s*(\\d{1,2})\\s*.\\s*(\\d{1,2})\\s*";
      
String regex = "LISÄNUMEROT:.+[>](\\d{1,2})\\s*.\\s*(\\d{1,2})";
      

And here's the program (1 class):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Lotto {
   
   private int[] numerot = new int[7];
   private int[] lisanumerot = new int[2];
   
   
   
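   public Lotto(String urlstr){
      String text = null;
      try{
         text = saveUrl(urlstr);
      }
      catch (IOException e){
         // Report the failure instead of silently swallowing it
         System.err.println("Could not download " + urlstr + ": " + e);
      }
     
      // Without page text the arrays keep their default zeros
      if( text == null ) return;
     
      numerot = findOikeatNumerot(text); // Find this week's correct lottery numbers ( 7 numbers )
      lisanumerot = findLisaNumerot(text); // Find this week's correct additional numbers ( 2 numbers )
   }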
   
   private int[] findLisaNumerot(String text) {
      //
      // Template: LISÄNUMEROT: 22,36
      // Actual Text to filter with regex:    LISÄNUMEROT:</font> <font color="#000000">22,36    
      //
      String regex = "LISÄNUMEROT:.+[>](\\d{1,2})\\s*.\\s*(\\d{1,2})";
      Pattern myPattern =
            Pattern.compile(regex);
      Matcher matcher =
            myPattern.matcher(text);

      if(!matcher.find()){
         System.out.println("Not found!");
         // Return zeros of the expected length (one slot per capture group)
         return new int[matcher.groupCount()];
      } else {
         System.out.println("Found!");
      }
      int[] num = new int[matcher.groupCount()];
      for(int i=1; i < matcher.groupCount()+1; i++){
         num[i-1] = Integer.parseInt( matcher.group(i) );
      }
     
     
      return num;
   }

   private int[] findOikeatNumerot(String text ) {
      //
      // Template: OIKEAT NUMEROT: 5,8,12,17,25,35,38
      // Actual Text to filter with regex: 00">OIKEAT NUMEROT:</font> <font color="#000000">5,8,12,17,25,35,38     </f
      //
      String regex = "OIKEAT NUMEROT:.+[>](\\d{1,2})\\s*.\\s*(\\d{1,2})\\s*.\\s*(\\d{1,2})\\s*.\\s*(\\d{1,2})\\s*.\\s*(\\d{1,2})\\s*.\\s*(\\d{1,2})\\s*.\\s*(\\d{1,2})\\s*";
      Pattern myPattern =
            Pattern.compile(regex);
      Matcher matcher =
            myPattern.matcher(text);

      if(!matcher.find()){
         System.out.println("Not found!");
         // Return zeros of the expected length (one slot per capture group)
         return new int[matcher.groupCount()];
      } else {
         System.out.println("Found!");
      }
      int[] num = new int[matcher.groupCount()];
      for(int i=1; i < matcher.groupCount()+1; i++){
         num[i-1] = Integer.parseInt( matcher.group(i) );
      }
     
     
      return num;
   }

   public static void main(String[] args){
      Lotto obj = new Lotto("http://www.yle.fi/tekstitv/txt/P471_01.html");
      System.out.println(obj);
   }
   
   @Override
   public String toString(){
     
      // Print out OIKEAT NUMEROT
      String str = "";
      str += "OIKEAT NUMEROT:";
      for (int num : numerot){
         str += " " + num;
      }
     
      // Print out LISANUMEROT
      str += "\n"; // new line
      str += "LISÄNUMEROT:";
      for (int num : lisanumerot){
         str += " " + num;
      }
     
      return str;
   }
   
   public String saveUrl(String urlString) throws MalformedURLException, IOException
    {
       StringBuilder text = new StringBuilder();
       // try-with-resources closes the reader even if reading fails
       try (BufferedReader in = new BufferedReader(
             new InputStreamReader( new URL(urlString).openStream() ) ))
       {
          String line;
         
          // Lines are joined without separators, so the regex '.' can match across them
          while((line = in.readLine()) != null){
             text.append(line);
          }
       }
       
       System.out.println(text);
       
       return text.toString();
    }
}

Offline sproingie

« Reply #6 - Posted 2012-04-07 02:39:42 »

A slightly more serious reply: my job involves some seriously intimate knowledge of regular expressions, and I've worked on regex engines, and I can tell you that they're really not the appropriate tool for the job.  For any given page where you already know the exact html it produces, you can scrape with regexes, yes, but for general purpose parsing, it's impossible to parse html with normal regexes, and really quite tricky, error-prone, and very very slow to do it with the extended dialects.
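The brittleness can be seen with the seven-number pattern from the program above. In this sketch the first sample string follows the markup quoted in that program's comments, while the second is a hypothetical variation in which the site wraps one number in a `<b>` tag; the class name and both samples are assumptions made for illustration.

```java
import java.util.regex.Pattern;

class BrittleRegexSketch {
    // The seven-number pattern from the program above, unchanged
    static final Pattern NUMBERS = Pattern.compile(
            "OIKEAT NUMEROT:.+[>](\\d{1,2})\\s*.\\s*(\\d{1,2})\\s*.\\s*(\\d{1,2})"
            + "\\s*.\\s*(\\d{1,2})\\s*.\\s*(\\d{1,2})\\s*.\\s*(\\d{1,2})\\s*.\\s*(\\d{1,2})\\s*");

    static boolean matches(String html) {
        return NUMBERS.matcher(html).find();
    }

    public static void main(String[] args) {
        // Markup as the program expects it
        String known = "OIKEAT NUMEROT:</font> <font color=\"#000000\">5,8,12,17,25,35,38     </font>";
        // Hypothetical change: one number wrapped in an extra tag
        String changed = "OIKEAT NUMEROT:</font> <font color=\"#000000\">5,<b>8</b>,12,17,25,35,38</font>";
        System.out.println(matches(known));   // prints "true"
        System.out.println(matches(changed)); // prints "false"
    }
}
```

A single extra tag between two numbers is enough to make the pattern stop matching, even though the page would look almost the same in a browser.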
Offline jonjava
« Reply #7 - Posted 2012-04-07 02:40:01 »

sproingie: I wish that you'd have more thoroughly highlighted the importance of regex (regular expression) since I now realize that regular expressions were the ultimate answer to my question. You really don't need the html parser at all for what I was trying to do.

You really shouldn't parse HTML with regular expressions or bad things happen



:O


Quote
A slightly more serious reply: my job involves some seriously intimate knowledge of regular expressions, and I've worked on regex engines, and I can tell you that they're really not the appropriate tool for the job.  For any given page where you already know the exact html it produces, you can scrape with regexes, yes, but for general purpose parsing, it's impossible to parse html with normal regexes, and really quite tricky, error-prone, and very very slow to do it with the extended dialects.

Hmm. What would be the better way then? Turning the HTML into pure text (i.e. removing all tags etc.) and regexing that?

Offline sproingie

« Reply #8 - Posted 2012-04-07 02:47:45 »

Any HTML parser, like jsoup, can return the text inside tags.  Then you run a regex on that, yes. 

Realistically, if you know for sure they don't put any intervening tags in between the numbers you're looking for, you can just go ahead and regex match the html source.  It's more brittle as solutions go, but all web scraping is a bit hacky.  It's just not usable as a general-purpose solution.
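Once the text has been pulled out of the tags, the regex gets much simpler. The sketch below assumes the plain text has already been extracted by some HTML parser (that step is not shown), and the class name, method name, and sample input are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlainTextNumbersSketch {
    // The marker followed by a comma-separated list of 1-2 digit numbers
    private static final Pattern LIST =
            Pattern.compile("OIKEAT NUMEROT:\\s*(\\d{1,2}(?:\\s*,\\s*\\d{1,2})*)");

    // Returns the numbers after the marker, or an empty list if the marker is not found.
    static List<Integer> parse(String plainText) {
        List<Integer> numbers = new ArrayList<>();
        Matcher m = LIST.matcher(plainText);
        if (m.find()) {
            for (String part : m.group(1).split(",")) {
                numbers.add(Integer.parseInt(part.trim()));
            }
        }
        return numbers;
    }

    public static void main(String[] args) {
        // Made-up plain-text input, as it might look after tag removal
        System.out.println(parse("OIKEAT NUMEROT: 5,8,12,17,25,35,38     "));
    }
}
```

Because the pattern accepts a list of any length, it does not need one capture group per number.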
Offline jonjava
« Reply #9 - Posted 2012-04-07 03:04:32 »

Yeah, it's a bit messy with the tags, but a simple function like the one below should :P remove most of the cluttering tags, in which case the regex becomes much easier to handle.

private String removeTags(String text){
      StringBuilder str = new StringBuilder();
      boolean insideTag = false;
      // Copies only the characters that lie outside <...> tags,
      // including any text after the last tag
      for(int i=0; i < text.length(); i++){
         char c = text.charAt(i);
         if(c == '<'){
            insideTag = true;
         } else if(c == '>'){
            insideTag = false;
         } else if(!insideTag){
            str.append(c);
         }
      }
     
      return str.toString();
   }


The thing that I was wondering about the HTML parser: How do you find the exact tag you're looking for? Since the information is separated by font tags and what not, how do you find the correct one?
