Get text between HTML tags
Possible duplicate: RegEx matching HTML tags and extracting text
I need to get the text between HTML tags such as <p></p> or whatever. My pattern is this:
Pattern pText = Pattern.compile(">([^>|^<]*?)<");
Does anyone know a better pattern? This one is not very useful. I need it to extract the content of a web page for indexing.
Thanks
SO is about to descend on you. But let me be the first to say, don't use regular expressions to parse HTML. Here is a list of Java HTML Parsers. Look around until you see an API that suits your fancy and use that instead.
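As a rough illustration of the parser approach, here is a minimal sketch that uses the HTML parser bundled with the JDK (javax.swing.text.html). This is only one option and not necessarily one of the parsers on the list above; the class name ParserTextExtractor and the method extract are made up for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text runs reported by the JDK's HTML parser.
class ParserTextExtractor {
    static List<String> extract(String html) {
        List<String> out = new ArrayList<>();
        // The parser calls handleText once for each run of text it finds between tags.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                String text = new String(data).trim();
                if (!text.isEmpty()) {
                    out.add(text);
                }
            }
        };
        try {
            // The third argument tells the parser to ignore any charset declared in the markup.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(extract("<p>Hello</p><p>World</p>"));
    }
}
```

Unlike a regex, a parser of this kind tolerates unclosed tags and attribute values that contain angle brackets, which is the main reason to prefer it for indexing real pages.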
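It looks like you are trying to use the | operator inside a negated character class, which neither works nor is needed. Inside a character class, | is a literal character, and so is ^ anywhere but the first position. Just specify the characters that you don't want to match: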
Pattern pText = Pattern.compile(">([^<>]*?)<");
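For reference, a minimal sketch of how the corrected pattern might be applied with a Matcher. The class name RegexTextExtractor and the method extract are made up for this example, and it assumes simple markup: comments, scripts, or attribute values containing > will confuse it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper built around the corrected pattern.
class RegexTextExtractor {
    // A '>' followed by a run of characters that are neither '<' nor '>', up to the next '<'.
    private static final Pattern TEXT = Pattern.compile(">([^<>]*?)<");

    static List<String> extract(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = TEXT.matcher(html);
        while (m.find()) {
            // Group 1 is the text between the two tags; skip whitespace-only runs.
            String text = m.group(1).trim();
            if (!text.isEmpty()) {
                out.add(text);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [Hello, world, again]
        System.out.println(extract("<p>Hello</p><div>world <b>again</b></div>"));
    }
}
```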
Don't use regular expressions when parsing HTML.
Use XPath instead (if your HTML is well formed). You can reference text nodes very easily using the text() node test.
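A minimal sketch of the XPath route using the javax.xml.xpath API from the JDK. It assumes the input is well-formed XML (for example XHTML without an external DTD); the class name XPathTextExtractor and the method extract are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: selects every text node in a well-formed document.
class XPathTextExtractor {
    static List<String> extract(String xhtml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        List<String> out = new ArrayList<>();
        try {
            // "//text()" selects all text nodes, in document order.
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//text()",
                    new InputSource(new StringReader(xhtml)),
                    XPathConstants.NODESET);
            for (int i = 0; i < nodes.getLength(); i++) {
                String text = nodes.item(i).getNodeValue().trim();
                if (!text.isEmpty()) {
                    out.add(text);
                }
            }
        } catch (XPathExpressionException e) {
            // Also raised when the input is not well-formed XML.
            throw new IllegalArgumentException("Could not evaluate XPath on input", e);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [Hello, world, again]
        System.out.println(extract(
                "<html><body><p>Hello</p><div>world <b>again</b></div></body></html>"));
    }
}
```

To restrict the result to particular elements, a narrower expression such as //p/text() can be used in place of //text().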