Extract dates from web page

I want to extract dates in different formats from web pages. I am using the Selenium2 Java API to interact with the browser, and I also use jQuery to interact with the document, so solutions for either layer are welcome.

Dates can have very different formats in different locales, and months can be written as names or as numbers. I need to match as many dates as possible, and I am aware that there are many combinations.

For example, if I have an HTML element like this:

<div class="tag_view">
    Last update: May,22,2011 
    View :40
</div>

I want the relevant part of the date to be extracted and recognized:

May,22,2011

This should now be converted to a regular Java Date object.

Update

This should work with the HTML from any web page; the date can be contained in any element, in any format. For example, here on Stack Overflow the source code looks like this:

<span class="relativetime" title="2011-05-13 14:45:06Z">May 13 at 14:45</span>

I want this done in the most efficient way, and I guess that would be a jQuery selector or filter which returns a standardized date representation. But I am open to your suggestions.


Since we can't limit ourselves to any specific element type or children of any element, you're basically talking about searching the whole page's text for dates. The only way to do this with any kind of efficiency is to use regular expressions. Since you're looking for dates in any format, you need a regex for each acceptable format. Once you define what those are, just compile the regexes and run something like:

var datePatterns = new Array();
datePatterns.push(/\d\d\/\d\d\/\d\d\d\d/g);
datePatterns.push(/\d\d\d\d\/\d\d\/\d\d/g);
...

var stringToSearch = $('body').html(); // change this to be more specific if at all possible
var allMatches = new Array();
for (var i = 0; i < datePatterns.length; i++) {
    // match() returns null when nothing is found, so fall back to an empty array
    allMatches = allMatches.concat(stringToSearch.match(datePatterns[i]) || []);
}

You can find more date regexes by googling around, or write them yourself; they're pretty easy. One thing to note: you could probably combine some of the regexes above into one to make the program more efficient, but I'd be very careful with that, because it can make your code hard to read very quickly. One regex per date format seems much cleaner.
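
Since the question also asks about the Java layer, here is a rough Java sketch of the same one-regex-per-format idea. The class name DateScan and the method findAll are made up for illustration, and the two patterns simply mirror the JavaScript ones above. It assumes you already have the page text as a String, for example from something like driver.findElement(By.tagName("body")).getText() in WebDriver.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DateScan {
    // One compiled pattern per accepted format (hypothetical list, extend as needed).
    private static final List<Pattern> DATE_PATTERNS = Arrays.asList(
        Pattern.compile("\\d\\d/\\d\\d/\\d\\d\\d\\d"),  // e.g. 22/05/2011 or 05/22/2011
        Pattern.compile("\\d\\d\\d\\d/\\d\\d/\\d\\d")   // e.g. 2011/05/22
    );

    /** Returns every substring of text that matches any of the patterns. */
    public static List<String> findAll(String text) {
        List<String> matches = new ArrayList<>();
        for (Pattern pattern : DATE_PATTERNS) {
            Matcher m = pattern.matcher(text);
            while (m.find()) {
                matches.add(m.group());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(findAll("Created 2011/05/13, updated 22/05/2011"));
    }
}
```

Matches come back grouped by pattern rather than in page order, so sort by m.start() if the position on the page matters to you.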


You could use getText to get the element's text and then split the String. For the div in your first example:

String s = selenium.getText("css=div.tag_view");
String date = s.split("Last update:")[1].split("View :")[0].trim();
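
To get from that string to a Java Date object, as the question asks, something like the sketch below might work. The class name ParseExtracted is made up, and the pattern "MMM,d,yyyy" is an assumption that only fits the "May,22,2011" layout from the example; other layouts need their own pattern.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class ParseExtracted {
    /** Parses strings shaped like "May,22,2011" (English month names only). */
    public static Date parse(String raw) throws ParseException {
        SimpleDateFormat format = new SimpleDateFormat("MMM,d,yyyy", Locale.US);
        format.setLenient(false); // reject impossible dates such as "Feb,31,2011"
        return format.parse(raw.trim());
    }

    public static void main(String[] args) throws ParseException {
        System.out.println(parse(" May,22,2011 "));
    }
}
```

Passing Locale.US explicitly keeps the month-name parsing independent of the default locale of the machine running the test.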


I will answer this myself because I came up with a working solution. Comments are welcome, though.

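import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Extracts the first date found in the given text.
 * Tries DD/MM/YYYY first, then MM/DD/YYYY, then YYYY/MM/DD.
 *
 * @return the parsed Date, or null if no date was found
 * @throws ParseException if a matched date string cannot be parsed
 */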
public Date extractDate(String text) throws ParseException {
    Date date = null;
    boolean dateFound = false;

    String year = null;
    String month = null;
    String monthName = null;
    String day = null;
    String hour = null;
    String minute = null;
    String second = null;
    String ampm = null;

    String regexDelimiter = "[-:\\/.,]";
    String regexDay = "((?:[0-2]?\\d{1})|(?:[3][01]{1}))";
    String regexMonth = "(?:([0]?[1-9]|[1][012])|(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Sept|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?))";
    String regexYear = "((?:[1]{1}\\d{1}\\d{1}\\d{1})|(?:[2]{1}\\d{3}))";
    String regexHourMinuteSecond = "(?:(?:\\s)((?:[0-1][0-9])|(?:[2][0-3])|(?:[0-9])):([0-5][0-9])(?::([0-5][0-9]))?(?:\\s?(am|AM|pm|PM))?)?";
    String regexEndswith = "(?![\\d])";

    // DD/MM/YYYY
    String regexDateEuropean =
        regexDay + regexDelimiter + regexMonth + regexDelimiter + regexYear + regexHourMinuteSecond + regexEndswith;

    // MM/DD/YYYY
    String regexDateAmerican =
        regexMonth + regexDelimiter + regexDay + regexDelimiter + regexYear + regexHourMinuteSecond + regexEndswith;

    // YYYY/MM/DD
    String regexDateTechnical =
        regexYear + regexDelimiter + regexMonth + regexDelimiter + regexDay + regexHourMinuteSecond + regexEndswith;

    // see if there are any matches
    Matcher m = checkDatePattern(regexDateEuropean, text);
    if (m.find()) {
        day = m.group(1);
        month = m.group(2);
        monthName = m.group(3);
        year = m.group(4);
        hour = m.group(5);
        minute = m.group(6);
        second = m.group(7);
        ampm = m.group(8);
        dateFound = true;
    }

    if(!dateFound) {
        m = checkDatePattern(regexDateAmerican, text);
        if (m.find()) {
            month = m.group(1);
            monthName = m.group(2);
            day = m.group(3);
            year = m.group(4);
            hour = m.group(5);
            minute = m.group(6);
            second = m.group(7);
            ampm = m.group(8);
            dateFound = true;
        }
    }

    if(!dateFound) {
        m = checkDatePattern(regexDateTechnical, text);
        if (m.find()) {
            year = m.group(1);
            month = m.group(2);
            monthName = m.group(3);
            day = m.group(4);
            hour = m.group(5);
            minute = m.group(6);
            second = m.group(7);
            ampm = m.group(8);
            dateFound = true;
        }
    }

    // construct date object if date was found
    if(dateFound) {
        String dateFormatPattern = "";
        String dayPattern = "";
        String dateString = "";

        if(day != null) {
            dayPattern = "d" + (day.length() == 2 ? "d" : "");
        }

        if(day != null && month != null && year != null) {
            dateFormatPattern = "yyyy MM " + dayPattern;
            dateString = year + " " + month + " " + day;
        } else if(monthName != null) {
            if(monthName.length() == 3) dateFormatPattern = "yyyy MMM " + dayPattern;
            else dateFormatPattern = "yyyy MMMM " + dayPattern;
            dateString = year + " " + monthName + " " + day;
        }

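        if(hour != null && minute != null) {
            // "hh" is the 12-hour clock, "HH" the 24-hour clock; choose by whether am/pm was matched
            dateFormatPattern += (ampm != null ? " hh:mm" : " HH:mm");
            dateString += " " + hour + ":" + minute;
            if(second != null) {
                dateFormatPattern += ":ss";
                dateString += ":" + second;
            }
            if(ampm != null) {
                dateFormatPattern += " a";
                dateString += " " + ampm.toUpperCase();
            }
        }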

        if(!dateFormatPattern.equals("") && !dateString.equals("")) {
            //TODO support different locales
            SimpleDateFormat dateFormat = new SimpleDateFormat(dateFormatPattern.trim(), Locale.US);
            date = dateFormat.parse(dateString.trim());
        }
    }

    return date;
}

private Matcher checkDatePattern(String regex, String text) {
    Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    return p.matcher(text);
}
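
For the "support different locales" TODO, one possible direction is to try the parse against a list of candidate locales and keep the first one that succeeds. This is only a sketch: the class name MultiLocaleParse and the candidate list are made up, and the month-name alternation in regexMonth would also have to be extended with the localized names before a non-English date ever reaches this step.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Arrays;
import java.util.Date;
import java.util.List;
import java.util.Locale;

public class MultiLocaleParse {
    // Hypothetical candidate list; use whatever locales the target pages are written in.
    private static final List<Locale> CANDIDATES =
        Arrays.asList(Locale.US, Locale.GERMANY, Locale.FRANCE);

    /** Returns the first successful parse, or null if no candidate locale fits. */
    public static Date parse(String dateString, String pattern) {
        for (Locale locale : CANDIDATES) {
            try {
                SimpleDateFormat format = new SimpleDateFormat(pattern, locale);
                format.setLenient(false);
                return format.parse(dateString);
            } catch (ParseException e) {
                // the month name is not valid in this locale; try the next one
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(parse("2011 Mai 22", "yyyy MMMM d"));
    }
}
```

The order of the list matters when a month name is valid in more than one locale, so put the most likely locale first.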