How to parse an HTML page using Java to look for dates on the page

I have to look for dates on an HTML page. They can be in varied formats, such as dd/mm/yy, dd/mm/yyyy, january 24-28 2010, december 12-14, 12-14 december, and so on. How do I look for them and get all the dates on the page?


Basically, the HTML will be well-formed, right? And the XPaths from which you want to take the dates will mostly be fixed as well.

I mean, make a list of the XPaths you want to read and write an XSLT for them. Convert the HTML into your own smaller XML using an XSLT transformation.

Then you can use JAXB or Castor for XML-to-object mapping if you want to get all the values into some POJO, or you can read the values directly using SAX parsing.
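
As a rough illustration of the "list of XPaths" idea, here is a minimal sketch that uses only the JDK's built-in DOM and XPath support. It assumes the markup is already well-formed XML; the class name XPathDateReader, the method readAll and the class='date' markup are made up for the example and are not taken from any real page.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathDateReader {
    // Parses well-formed markup once, evaluates each XPath against it,
    // and returns the trimmed text content of every matching node.
    static List<String> readAll(String xhtml, List<String> xpaths) {
        List<String> values = new ArrayList<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            for (String expr : xpaths) {
                NodeList nodes = (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
                for (int i = 0; i < nodes.getLength(); i++) {
                    values.add(nodes.item(i).getTextContent().trim());
                }
            }
        } catch (Exception e) {
            // Malformed markup or an invalid XPath expression ends up here.
            throw new IllegalArgumentException("Could not parse markup or evaluate XPath", e);
        }
        return values;
    }

    public static void main(String[] args) {
        // Hypothetical page fragment; real pages will need their own XPaths.
        String page = "<html><body><div id='event'><span class='date'>12/03/2010</span></div>"
                + "<p class='date'>january 24-28 2010</p></body></html>";
        System.out.println(readAll(page, List.of("//*[@class='date']")));
    }
}
```

This only helps when you know in advance where the dates live on the page; it does nothing for dates buried in free text.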

Hope this helps.

parth.


  • Load the page into a String.
  • Define a regexp for each date format that may be used.
  • Extract the hits from the String with your regexps.

In practice, though, covering every format can be very hard, if not impossible.
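
The steps above could be sketched roughly as follows for the formats named in the question. The class name DatePatterns and the method findDates are made up for the example, the month list is English-only, and the pattern is an assumption about what counts as a date rather than a complete solution.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DatePatterns {
    private static final String MONTH =
            "(?:january|february|march|april|may|june|july|august"
            + "|september|october|november|december)";
    // A day or a day range: "24" or "24-28".
    private static final String DAYS = "\\d{1,2}(?:\\s*-\\s*\\d{1,2})?";

    private static final Pattern DATE = Pattern.compile(
            // dd/mm/yyyy or dd/mm/yy (four-digit year tried first)
            "\\b\\d{1,2}/\\d{1,2}/(?:\\d{4}|\\d{2})\\b"
            // january 24-28 2010, december 12-14
            + "|\\b" + MONTH + "\\s+" + DAYS + "(?:,?\\s+\\d{4})?\\b"
            // 12-14 december, optionally followed by a year
            + "|\\b" + DAYS + "\\s+" + MONTH + "(?:\\s+\\d{4})?\\b",
            Pattern.CASE_INSENSITIVE);

    // Returns every substring of the text that matches one of the formats above.
    static List<String> findDates(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = DATE.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String page = "Held 12/03/10 and 24/01/2010; january 24-28 2010, "
                + "December 12-14, also 12-14 december.";
        findDates(page).forEach(System.out::println);
    }
}
```

Expect false positives: "may 12" will match in ordinary sentences, and running the pattern over raw HTML will also hit attribute values and scripts.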


  • Download your page with URL.openConnection(), Commons HttpClient, or whatever you're comfortable with.
  • Run the downloaded page through an HTML cleaner like JTidy. You'll get a DOM tree.
  • Use whatever tools you have to extract the data from this DOM. Personally, I'd use XSLT or XQuery, but even a primitive traversal of the DOM tree would do.

JAXB and Castor, mentioned above, aren't suitable for this task. The regexp approach may also work, but I think it's much harder to implement.
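
A "primitive traversal" of the DOM might look something like the sketch below. It assumes the page has already been cleaned into well-formed markup (for example by JTidy, which is not shown here) and parses it with the JDK's own DocumentBuilder instead. The class name DomTextCollector and its methods are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomTextCollector {
    // Depth-first walk: appends every non-empty text node, separated by spaces,
    // and skips the contents of <script> and <style> elements.
    static void collect(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                out.append(text).append(' ');
            }
            return;
        }
        String name = node.getNodeName();
        if ("script".equalsIgnoreCase(name) || "style".equalsIgnoreCase(name)) {
            return;
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, out);
        }
    }

    // Parses already-cleaned markup and returns the text a reader would see.
    static String visibleText(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            StringBuilder out = new StringBuilder();
            collect(doc.getDocumentElement(), out);
            return out.toString().trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Markup is not well-formed", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><style>p{}</style></head><body>"
                + "<p>Opens <b>12/03/2010</b></p>"
                + "<script>var d='01/01/1999';</script></body></html>";
        System.out.println(visibleText(page));
    }
}
```

The resulting plain text can then be searched for dates without tripping over tags, attributes or script contents.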


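Assume you have the HTML file as a local text file.

The following code matches dates such as 02-11-10, 2-3-2010, 02/01/2010 and 2-1-2010 using the regular expression "([0-9]{1,2}[/-][0-9]{1,2}[/-]([0-9]{4}|[0-9]{2}))" in Java.

You can add support for other date formats by extending the regular expression.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DateFinder
{
   // Compiled once and reused for every line. The four-digit year is tried
   // before the two-digit year so that "2010" is not cut short to "20".
   private static final Pattern DATE =
      Pattern.compile("([0-9]{1,2}[/-][0-9]{1,2}[/-]([0-9]{4}|[0-9]{2}))");

   public static void main(String[] args)
   {
      // try-with-resources closes the reader even if reading fails
      try (BufferedReader in = new BufferedReader(new FileReader("test.html")))
      {
         String str;
         while ((str = in.readLine()) != null)
         {
            Matcher matcher = DATE.matcher(str);
            while (matcher.find()) {
               System.out.println("Date: " + matcher.group());
            }
         }
      }
      catch (IOException e)
      {
         e.printStackTrace();
      }
   }
}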
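
A regular expression of this kind also accepts strings such as 31/02/2010 or 99-99-99, so it may be worth checking each hit against a real calendar. The sketch below does that with java.time; it assumes day-first order (dd/mm, as in the question) and maps two-digit years to 2000-2099. The class name NumericDateValidator is made up for the example.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;
import java.util.Optional;

class NumericDateValidator {
    // STRICT resolution rejects impossible dates such as 31 February.
    private static final DateTimeFormatter FOUR_DIGIT_YEAR =
            DateTimeFormatter.ofPattern("d/M/uuuu").withResolverStyle(ResolverStyle.STRICT);
    // "uu" is interpreted relative to the base year 2000 (assumption: 10 means 2010).
    private static final DateTimeFormatter TWO_DIGIT_YEAR =
            DateTimeFormatter.ofPattern("d/M/uu").withResolverStyle(ResolverStyle.STRICT);

    // Returns the parsed date, or an empty Optional if the match is not a real date.
    static Optional<LocalDate> parse(String match) {
        String normalized = match.replace('-', '/');
        for (DateTimeFormatter format : new DateTimeFormatter[] {FOUR_DIGIT_YEAR, TWO_DIGIT_YEAR}) {
            try {
                return Optional.of(LocalDate.parse(normalized, format));
            } catch (DateTimeParseException e) {
                // Not in this format; try the next one.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(parse("02-11-10"));
        System.out.println(parse("2/1/2010"));
        System.out.println(parse("31/02/2010"));
    }
}
```

Whether a value like 02-11-10 means 2 November or 11 February cannot be decided from the string alone, so the day-first choice here has to match the pages you are scanning.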