Regex to extract link content

2022-12-11 17:24

I'll be the first to admit that my regex knowledge is hopeless. I am using Java with the following code:

Matcher m = Pattern.compile(">[^<>]*</a>").matcher(html);
while (m.find()) {
 resp.getWriter().println(html.substring(m.start(), m.end()));
}

I get the following list:

>Link Text a</a>
>Link Text b</a>

What am I missing to remove the > and the </a>?

Cheers.

You can do that by wrapping a group around that part of your regex and then using group(X) where X is the number of the group:

Matcher m = Pattern.compile(">([^<>]*)</a>").matcher(html);
while (m.find()) {
 resp.getWriter().println(m.group(1));
}
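
For reference, here is a self-contained sketch of the same idea that can be run outside a servlet. The class name LinkTextExtractor, the method extractLinkTexts, and the sample HTML are made up for illustration; only the pattern comes from the snippet above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkTextExtractor {
    // Group 1 captures the text between '>' and '</a>'.
    private static final Pattern LINK_TEXT = Pattern.compile(">([^<>]*)</a>");

    // Hypothetical helper: returns the captured text of every match, in order.
    static List<String> extractLinkTexts(String html) {
        List<String> texts = new ArrayList<>();
        Matcher m = LINK_TEXT.matcher(html);
        while (m.find()) {
            texts.add(m.group(1)); // group(0) would still include '>' and '</a>'
        }
        return texts;
    }

    public static void main(String[] args) {
        // Sample input assumed for this sketch.
        String html = "<p><a href=\"a.html\">Link Text a</a> and "
                    + "<a href=\"b.html\">Link Text b</a></p>";
        for (String text : extractLinkTexts(html)) {
            System.out.println(text);
        }
    }
}
```

With the sample input this prints Link Text a and Link Text b on separate lines, without the surrounding > and </a>.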

But, a better way would be to use a simple parser for this:

import java.io.*;
import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

public class HtmlParseDemo {
   public static void main(String [] args) throws Exception {
       Reader reader = new StringReader("foo <a href=\"#\">Link 1</a> bar <a href=\"#\">Link <b>2</b> more</a> baz");
       HTMLEditorKit.Parser parser = new ParserDelegator();
       parser.parse(reader, new LinkParser(), true);
       reader.close();
   }
}

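class LinkParser extends HTMLEditorKit.ParserCallback {

    // True while the parser is between an <a> start tag and its </a> end tag.
    private boolean linkStarted = false;
    // Collects the text of the current link, including text inside nested tags.
    private StringBuilder b = new StringBuilder();

    @Override
    public void handleText(char[] data, int pos) {
        if(linkStarted) b.append(new String(data));
    }

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if(t == HTML.Tag.A) linkStarted = true;
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        if(t == HTML.Tag.A) {
            // End of the link: print what was collected and reset the buffer.
            linkStarted = false;
            System.out.println(b);
            b = new StringBuilder();
        }
    }
}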

Output:

Link 1
Link 2 more

Have you looked at using a capturing group?

Pattern.compile(">([^<>]*)</a>")

Note however that it's generally not recommended to use regular expressions for HTML, since HTML isn't regular. You will get more reliable results by using an HTML parser such as JTidy.

Keep in mind that due to its limited nature, your regex (and regex in general) may run into problems if the HTML you're trying to parse is slightly more complex. For example, the following would fail to parse correctly, but is completely valid (and common) HTML:

<a href="blah.html">this is only a <em>single</em> link</a>

You might be better off using a DOM parser (I'm pretty sure Java has plenty of options in this regard), from which you can then request the inner text of each <a> tag.
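
To make the failure concrete, the sketch below runs the capturing-group pattern from this thread against the nested example. The class name NestedMarkupDemo and the method firstLinkText are hypothetical names chosen for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestedMarkupDemo {
    // Same pattern as in the thread: text between '>' and '</a>' with no tags inside.
    private static final Pattern LINK_TEXT = Pattern.compile(">([^<>]*)</a>");

    // Hypothetical helper: returns group 1 of the first match, or null if none.
    static String firstLinkText(String html) {
        Matcher m = LINK_TEXT.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<a href=\"blah.html\">this is only a <em>single</em> link</a>";
        // Only the text after the last nested tag is captured, not the whole link text.
        System.out.println("[" + firstLinkText(html) + "]"); // prints "[ link]"
    }
}
```

The pattern can only match from the last > before </a>, so everything up to and including the closing </em> is lost. A parser that tracks open and close tags does not have this limitation.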

I'm late to the party but I'd like to point out another alternative:

(?<=X)      X, via zero-width positive lookbehind

If you put your initial > into that mess, i.e.

(?<=>)[^<>]*</a>

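then the > should not be returned as part of your match. Note that the trailing </a> is still matched; a lookahead, (?=</a>), would exclude it in the same way.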

Untested, though. Good luck!
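
A minimal sketch of the lookaround approach, assuming a lookahead is added for the closing tag so that neither delimiter ends up in the match. The class name LookaroundDemo, the method extract, and the sample HTML are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaroundDemo {
    // (?<=>)    : the match must be preceded by '>' (not consumed)
    // (?=</a>)  : the match must be followed by '</a>' (not consumed)
    private static final Pattern LINK_TEXT = Pattern.compile("(?<=>)[^<>]*(?=</a>)");

    // Hypothetical helper: collects every whole match; no capturing group is needed.
    static List<String> extract(String html) {
        List<String> texts = new ArrayList<>();
        Matcher m = LINK_TEXT.matcher(html);
        while (m.find()) {
            texts.add(m.group());
        }
        return texts;
    }

    public static void main(String[] args) {
        // Sample input assumed for this sketch.
        String html = "<a href=\"#\">Link Text a</a> <a href=\"#\">Link Text b</a>";
        extract(html).forEach(System.out::println);
    }
}
```

Because both assertions are zero-width, m.group() returns only the link text. The same caveat about nested tags applies as with the capturing-group version.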

A quick way to test your regular expressions is to use a regex editor such as the following Eclipse plugin: http://brosinski.com/regex/
