HTML Parsing/Scraping Algorithm Help (Java)

I am writing an HTML scraper in Java. It fetches the HTML of a page, and I want to grab the words that are written in all capital letters and store them in a database. My problem right now is that I cannot write the algorithm that parses each line of the returned HTML to find those words. Below is essentially the format I am working with. IMPORTANT: the all-caps words are always the first word on a line, so I only need to look at the start of each line and decide whether the whole first word is in capitals. If it is, I want to add the word to a list; if it isn't, I want to move on to the next line. So it would look like this:

list of names ----> This line should be skipped because first word is not all CAPS

AARON ....

ABRAHAM ....

ANGELA ...

AMY ...

ASHLEY....

       AARON through ASHLEY should be added to list because first word is all CAPS 

I am able to get the HTML in the format above, but now I am having a hard time writing the algorithm for getting the first word of each line and then checking whether it is all caps.

Does anybody know how to do this without an external parser, using just loops and lists? Thanks, I appreciate the help.


First, instead of reinventing the wheel, and because parsing bad HTML can be a pain, I'd use an existing HTML parser such as TagSoup or Jericho. I'd lean toward Jericho here, since it has built-in functionality to extract all text from HTML markup.

Then, I'd use a regex (\p{Upper}+) to extract all words in uppercase. See java.util.regex.
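
As a rough sketch of that regex step (the class and method names below are made up for illustration, and the input is assumed to be plain text with the markup already stripped), the pattern can be wrapped in word boundaries so that the "A" in "Aaron" is not picked up on its own:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class UpperWordExtractor {
    // \b ... \b restricts the match to whole words made only of upper-case letters.
    private static final Pattern UPPER_WORD = Pattern.compile("\\b\\p{Upper}+\\b");

    // Returns every all-upper-case word found anywhere in the text.
    static List<String> extract(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = UPPER_WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(extract("list of names\nAARON ....\nAmy ABRAHAM"));
    }
}
```

Note that this collects every upper-case word in the text, not only the first word of each line, and that single-letter words such as "A" or "I" count as matches too.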


You could do this with a regular expression:

for (String line: lines) {
    if (line.matches("[A-Z]+\\b.*")) {
        ...
    }
}

This matches any line that starts with one or more capital letters [A-Z]+, followed by a word boundary \\b, followed by anything else .*. Because matches() has to match the whole line, the capitals must be at the very beginning. You could get rid of the \\b.* if you only expect there to be a single name on each line and nothing after.
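
If you also want the word itself rather than just a yes/no answer, one option is to put the capitals in a capturing group. This is only a sketch: the class name, the method name, and the assumption that the lines are already in a List<String> are mine, not something from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration.
class FirstWordRegex {
    // Group 1 captures the leading run of capitals; \b rejects words like "Amy".
    private static final Pattern LEADING_CAPS = Pattern.compile("([A-Z]+)\\b.*");

    static List<String> collect(List<String> lines) {
        List<String> names = new ArrayList<>();
        for (String line : lines) {
            Matcher m = LEADING_CAPS.matcher(line);
            if (m.matches()) {          // the whole line must fit the pattern
                names.add(m.group(1));  // keep only the all-caps first word
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(collect(List.of("list of names", "AARON ....", "Amy ...", "ASHLEY....")));
    }
}
```

Lines that begin with whitespace will not match as written; call trim() on each line first if your input is indented.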

Alternatively you could use a String.split() to break up the line into words and then check the first word for all caps:

for (String line: lines) {
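    // trim() plus \\s+ avoids an empty first token on indented lines
    String[] words = line.trim().split("\\s+");

    // isEmpty() skips blank lines, which would otherwise pass the comparison
    if (words.length > 0 && !words[0].isEmpty() && words[0].equals(words[0].toUpperCase())) {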
        ...
    }
}

Here \\s matches any space, tab, or other whitespace character.


String line = "AARON asdfasdflökj";

int i;
String cmp;

if( (i=line.indexOf(' ')) != -1 ) {
    cmp = line.substring( 0, i );
} else {
    cmp = line;
}

if( cmp.equals( cmp.toUpperCase() ) ) {
    // Line starts with all capitals
} else {
    // ...
}

The first if checks whether there's a space in the String line and, if so, keeps only the text before it. The second if checks whether that first word equals its upper-case form, i.e. whether it contains no lower-case letters.
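
One caveat with comparing against toUpperCase(): strings with no letters at all, such as "123" or "....", also equal their upper-case form. Since the question asks for plain loops and lists, here is a sketch that checks each character instead. The names are hypothetical, and it assumes the "first word" is the run of letters at the start of the line (after any leading whitespace), so "ASHLEY...." yields "ASHLEY".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration.
class CapsLineFilter {
    // Assumption: the first word is the leading run of letters, after optional whitespace.
    static String firstWord(String line) {
        int start = 0;
        while (start < line.length() && Character.isWhitespace(line.charAt(start))) {
            start++;
        }
        int end = start;
        while (end < line.length() && Character.isLetter(line.charAt(end))) {
            end++;
        }
        return line.substring(start, end);
    }

    // True only for a non-empty word whose characters are all upper-case letters.
    static boolean isAllCaps(String word) {
        if (word.isEmpty()) {
            return false;
        }
        for (int i = 0; i < word.length(); i++) {
            if (!Character.isUpperCase(word.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    static List<String> collect(List<String> lines) {
        List<String> names = new ArrayList<>();
        for (String line : lines) {
            String word = firstWord(line);
            if (isAllCaps(word)) {
                names.add(word);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(collect(List.of("list of names", "AARON ....", "ASHLEY....", "Amy ...", "", "123 abc")));
    }
}
```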
