
How to split a string by tag name in an HTML page using String.split()

I want to split the following string according to the td tags:

<html>

<body>
  <table>
    <tr><td>data1</td></tr>
    <tr><td>data2</td></tr>
    <tr><td>data3</td></tr>
    <tr><td>data4</td></tr>
  </table>
</body>
</html>

I've tried split("h2"); and split("[h2]"); but this way the split method splits the HTML code wherever it finds "h" or "2" and, if I am not mistaken, also "h2".

My ultimate goal is to retrieve everything between <td> and </td>

Can anyone please tell me how to do this using only split()?

Thanks a lot


No.

That would mean — in essence — parsing HTML with regex. We don't do that 'round these parts.
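
As a side note on why the attempts in the question behaved that way: String.split() treats its argument as a regular expression, so "[h2]" is a character class rather than a literal tag name. The sketch below only illustrates that behavior; the class name SplitRegexDemo and the helper splitToList are made up for this example, and the last call is not a robust way to read table cells (it leaves the surrounding markup in the result and breaks as soon as a tag has attributes).

```java
import java.util.Arrays;
import java.util.List;

class SplitRegexDemo {
    // Hypothetical helper: String.split() compiles `regex` as a regular expression.
    static List<String> splitToList(String input, String regex) {
        return Arrays.asList(input.split(regex));
    }

    public static void main(String[] args) {
        String heading = "<h2>title</h2>";

        // "[h2]" is a character class: it matches a single 'h' OR a single '2'.
        System.out.println(splitToList(heading, "[h2]")); // [<, , >title</, , >]

        // "h2" matches only the two-character sequence "h2".
        System.out.println(splitToList(heading, "h2"));   // [<, >title</, >]

        // "</?td>" matches an opening or a closing td tag without attributes.
        System.out.println(splitToList("<tr><td>data1</td></tr>", "</?td>")); // [<tr>, data1, </tr>]
    }
}
```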


Here is how to reach your ultimate goal:

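import java.util.regex.Matcher;
import java.util.regex.Pattern;

String html = ""; // your html
// [^<]* only matches cells that contain no nested tags
Pattern p = Pattern.compile("<td>([^<]*)</td>", Pattern.MULTILINE | Pattern.DOTALL);

for (Matcher m = p.matcher(html); m.find(); ) {
    String tag = m.group(1); // the text between <td> and </td>
    System.out.println(tag);
}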

Please note that this code was written here without a compiler, but it gives the idea.

BUT why do you want to parse HTML using regex? I agree with the others: use an HTML parser, or an XML parser if your HTML is well-formed.


You cannot successfully parse HTML (or in your case, get the data between TD tags) with regular expressions. You should take a look at a simple HTML parser:

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

public static List<String> extractTDs(String html) throws IOException {
    final List<String> tdList = new ArrayList<String>();

    ParserDelegator parserDelegator = new ParserDelegator();
    ParserCallback parserCallback = new ParserCallback() {
        // Text collected since the last end tag
        StringBuffer buffer = new StringBuffer();

        @Override
        public void handleText(final char[] data, final int pos) {
            buffer.append(data);
        }

        @Override
        public void handleEndTag(Tag t, final int pos) {
            // A closing </td> means the buffer holds that cell's text
            if (Tag.TD.equals(t)) {
                tdList.add(buffer.toString());
            }
            // Start fresh after every end tag, TD or not
            buffer = new StringBuffer();
        }
    };

    // true = ignore any charset declared in the document
    parserDelegator.parse(new StringReader(html), parserCallback, true);

    return tdList;
}


String.split or regexes should not be used to parse markup languages, as they have no notion of depth (HTML is a recursive grammar and needs a recursive parser). Consider what would happen if your <td> looked like:

<td>
  <table><tr><td> td inside a td? </td></tr></table>
</td>

A regex would greedily match everything between the outer <td>...</td> giving you unwanted results.
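
To make the nested case concrete, here is a minimal sketch that runs a greedy and a reluctant pattern over a nested cell. The class name NestedTdDemo, the helper capture, and the sample string are invented for this illustration. Neither pattern returns the two cells separately: the greedy one swallows the inner table, and the reluctant one stops at the first </td> and cuts the outer cell short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedTdDemo {
    // Hypothetical helper: collects group(1) of every match of `regex` in `input`.
    static List<String> capture(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<td>outer <table><tr><td>inner</td></tr></table> tail</td>";

        // Greedy: runs to the LAST </td>, so the inner table ends up inside the result.
        System.out.println(capture("<td>(.*)</td>", html));
        // [outer <table><tr><td>inner</td></tr></table> tail]

        // Reluctant: stops at the FIRST </td>, which belongs to the inner cell.
        System.out.println(capture("<td>(.*?)</td>", html));
        // [outer <table><tr><td>inner]
    }
}
```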

You should use an HTML parser like Johan mentioned.


You should really use an HTML parser, such as NekoHTML or HtmlParser.

If (and only if) you have a very small set of controlled HTML, you could (although I generally recommend against it) use a regex such as:

(?<=\\<td\\>)\\w+(?=\\</td\\>)
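
A minimal sketch of how that pattern might be used from Java, assuming plain <td> tags without attributes; the class name LookaroundTdDemo and the method extract are made up for this example. Inside a Java string literal only the backslash of \w needs doubling, and < and > do not need escaping. Because \w+ matches only letters, digits and underscores, a cell containing a space or punctuation is silently skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundTdDemo {
    // Lookbehind and lookahead assert the surrounding tags without consuming them.
    private static final Pattern CELL = Pattern.compile("(?<=<td>)\\w+(?=</td>)");

    // Hypothetical helper: returns the text of every cell made up only of word characters.
    static List<String> extract(String html) {
        List<String> cells = new ArrayList<>();
        Matcher m = CELL.matcher(html);
        while (m.find()) {
            cells.add(m.group());
        }
        return cells;
    }

    public static void main(String[] args) {
        String html = "<tr><td>data1</td></tr>"
                    + "<tr><td>data2</td></tr>"
                    + "<tr><td>two words</td></tr>"; // skipped: contains a space
        System.out.println(extract(html)); // [data1, data2]
    }
}
```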