
Regex to extract link content

I'll be the first to admit that my regex knowledge is hopeless. I am using Java with the following code:

Matcher m = Pattern.compile(">[^<>]*</a>").matcher(html);
while (m.find()) {
 resp.getWriter().println(html.substring(m.start(), m.end()));
}

I get the following list:

>Link Text a</a>
>Link Text b</a>

What am I missing to remove the > and the </a>?

Cheers.


You can do that by wrapping a group around that part of your regex and then using group(X) where X is the number of the group:

Matcher m = Pattern.compile(">([^<>]*)</a>").matcher(html);
while (m.find()) {
 resp.getWriter().println(m.group(1));
}

But, a better way would be to use a simple parser for this:

import java.io.*;
import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

public class HtmlParseDemo {
   public static void main(String [] args) throws Exception {
       Reader reader = new StringReader("foo <a href=\"#\">Link 1</a> bar <a href=\"#\">Link <b>2</b> more</a> baz");
       HTMLEditorKit.Parser parser = new ParserDelegator();
       parser.parse(reader, new LinkParser(), true);
       reader.close();
   }
}

class LinkParser extends HTMLEditorKit.ParserCallback {

    private boolean linkStarted = false;
    private StringBuilder b = new StringBuilder();

    public void handleText(char[] data, int pos) {
        if(linkStarted) b.append(new String(data));
    }

    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if(t == HTML.Tag.A) linkStarted = true;
    }

    public void handleEndTag(HTML.Tag t, int pos) {
        if(t == HTML.Tag.A) {
            linkStarted = false;
            System.out.println(b);
            b = new StringBuilder();
        }
    }
}

Output:

Link 1
Link 2 more


Have you looked at using a capturing group?

Pattern.compile(">([^<>]*)</a>")

Note, however, that it's generally not recommended to use regular expressions for HTML, since HTML isn't a regular language. You will get more reliable results by using an HTML parser such as JTidy.


Keep in mind that due to its limited nature, your regex (and regex in general) may run into problems if the HTML you're trying to parse is slightly more complex. For example, the following would fail to parse correctly, but is completely valid (and common) HTML:

<a href="blah.html">this is only a <em>single</em> link</a>

You might be better off using a DOM parser (I'm pretty sure Java has plenty of options in this regard), which you can then ask for the inner text of each <a> tag.
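To make that failure concrete, here is a minimal sketch that runs the capturing-group version of the question's pattern against the nested example above. The class name NestedLinkDemo and the helper extract are made up for illustration; only the pattern and the sample HTML come from this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original question.
public class NestedLinkDemo {
    // The question's pattern with a capturing group around the link text.
    private static final Pattern LINK_TEXT = Pattern.compile(">([^<>]*)</a>");

    // Collects group 1 of every match found in the given HTML string.
    static List<String> extract(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = LINK_TEXT.matcher(html);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<a href=\"blah.html\">this is only a <em>single</em> link</a>";
        // Only the text between the last '>' and '</a>' is captured.
        System.out.println(extract(html)); // prints [ link]
    }
}
```

The pattern can only match text that contains no tags at all, so everything before the closing </em> is lost. A parser that tracks open and close tags does not have this limitation.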


I'm late to the party but I'd like to point out another alternative:

(?<=X)      X, via zero-width positive lookbehind

If you put your initial > into that mess, i.e.

(?<=>)[^<>]*</a>

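then it should not be returned as part of your result. The trailing </a> is still part of the match, though; a zero-width positive lookahead, (?=</a>), would exclude it in the same way.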

Untested, though. Good luck!
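For what it's worth, here is a minimal sketch of the lookaround idea in Java. It pairs the lookbehind from this answer with a lookahead for the closing tag, so the whole match is just the link text and no group is needed. The class name LookaroundDemo, the helper extract, and the sample HTML are assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
public class LookaroundDemo {
    // (?<=>)    asserts a '>' immediately before the match without consuming it.
    // (?=</a>)  asserts '</a>' immediately after the match without consuming it.
    private static final Pattern LINK_TEXT = Pattern.compile("(?<=>)[^<>]*(?=</a>)");

    // Collects the full match (group 0) of every occurrence in the given HTML string.
    static List<String> extract(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = LINK_TEXT.matcher(html);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<a href=\"#\">Link Text a</a> and <a href=\"#\">Link Text b</a>";
        System.out.println(extract(html)); // prints [Link Text a, Link Text b]
    }
}
```

This has the same limitation as the capturing-group version: link text that contains nested tags is not matched as a whole.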


A nice, quick way to test your regular expressions is to use a regex editor such as the following Eclipse plugin: http://brosinski.com/regex/
