Java regular expression transform HTML list to text
I have data in the form:
<ol>
<li>example1</li>
<li>example2</li>
<li>example3</li>
</ol>
which needs to turn into
# example1
# example2
# example3
The pound sign has to be associated with the ol HTML tag. I'm using Java regular expressions, and this is what I have so far:
info = info.replaceAll("(?s).<ol>\n(<li>(.*?)</li>\n)*</ol>","# $2");
info is a String object containing the data. Also, there may be line breaks in between the li tags. When I run it, it only prints the last item, i.e. the result is
# example3
example2 and example1 are missing
Any thoughts on what I'm doing wrong?
Your regex has a couple of problems:
- it contains a capturing group inside a capturing group
- overall, it will only match once (it includes <ol> for a start, and there's only one of these).
The solution I'd recommend: don't tie yourself in knots. Write a loop with a Matcher.find(), pulling out the matches one by one and adding them to a string buffer. It would go something like this:
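import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Match one <li> at a time; DOTALL lets '.' span line breaks inside an item.
Pattern p = Pattern.compile("<li>(.*?)</li>", Pattern.DOTALL);
Matcher m = p.matcher("..."); // pass the input string here
StringBuilder sb = new StringBuilder();
while (m.find()) {
    // group(1) is the text between <li> and </li>
    sb.append("# ").append(m.group(1)).append("\n");
}
String result = sb.toString();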
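If the pound sign should only be produced for items inside an <ol>, the same find() loop can be nested: an outer loop over the <ol> blocks and an inner loop over the <li> items of each block. The following is only a sketch under that assumption; the class name OlToText and the method convert are made up for illustration, and it returns just the converted lines, dropping any text outside the lists.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OlToText {
    // DOTALL lets '.' cross the line breaks between the tags.
    private static final Pattern OL = Pattern.compile("<ol>(.*?)</ol>", Pattern.DOTALL);
    private static final Pattern LI = Pattern.compile("<li>(.*?)</li>", Pattern.DOTALL);

    /** Returns one "# item" line per <li> found inside any <ol> block. */
    static String convert(String html) {
        StringBuilder sb = new StringBuilder();
        Matcher ol = OL.matcher(html);
        while (ol.find()) {
            // Only search for items inside the body of this <ol>.
            Matcher li = LI.matcher(ol.group(1));
            while (li.find()) {
                sb.append("# ").append(li.group(1).trim()).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String info = "<ol>\n<li>example1</li>\n<li>example2</li>\n<li>example3</li>\n</ol>";
        System.out.print(convert(info));
    }
}
```

Items of a <ul> are ignored because the inner matcher never sees them. Nested lists are not handled by this sketch.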
I would argue you can achieve a more robust solution using XPath and Java's document parser, as follows:
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
public class Foo {

    public static void main(String[] args) throws Exception {
        final String info = "<html>\n<body>\n<ol>\n<li>example1</li>\n<li>example2</li>\n<li>example3</li>\n</ol>\n</body>\n</html>";
        final Document document = parseDocument(info);
        final XPathExpression xPathExpression = getXPathExpression("//ol/li");
        final NodeList nodes = (NodeList) xPathExpression.evaluate(document, XPathConstants.NODESET);
        // Prints # example1\n# example2\n# example3
        for (int i = 0; i < nodes.getLength(); i++) {
            final Node liNode = nodes.item(i);
            if (liNode.hasChildNodes()) {
                System.out.println("# " + liNode.getChildNodes().item(0).getTextContent());
            }
        }
    }

    private static Document parseDocument(final String info) throws Exception {
        final DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        final DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new ByteArrayInputStream(info.getBytes("UTF-8")));
    }

    private static XPathExpression getXPathExpression(final String expression) throws Exception {
        final XPathFactory factory = XPathFactory.newInstance();
        final XPath xpath = factory.newXPath();
        return xpath.compile(expression);
    }
}
The answer to "what you are doing wrong" is that you are replacing the entire single regex (which matches from ol all the way to /ol) with the value of your second group. The second group was in a repeated fragment, so the result of $2 was the last match of that group.
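This behaviour is easy to reproduce. The sketch below uses a simplified form of the pattern from the question (the leading "(?s)." is left out so that the input can start directly with <ol>); the class name LastCaptureDemo and its members are hypothetical names chosen for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LastCaptureDemo {
    // Simplified version of the pattern in the question: group 2 sits inside a repeated group.
    static final String REGEX = "<ol>\n(<li>(.*?)</li>\n)*</ol>";

    /** Returns what group 2 holds after the whole list has been matched, or null if there is no match. */
    static String lastCapture(String html) {
        Matcher m = Pattern.compile(REGEX).matcher(html);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String html = "<ol>\n<li>example1</li>\n<li>example2</li>\n<li>example3</li>\n</ol>";
        // The pattern matches the whole list once; group 2 keeps only its last iteration.
        System.out.println(lastCapture(html));               // example3
        System.out.println(html.replaceAll(REGEX, "# $2"));  // # example3
    }
}
```

Each pass through the repeated group overwrites the previous capture, so by the time the replacement is built only the final item is left.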
I would use a simpler solution instead of a complex regex. For example:
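import java.util.Scanner;

Scanner scann = new Scanner(str); // the parameter can also be a File or an InputStream
scann.useDelimiter("</?ol>");
StringBuilder out = new StringBuilder();
while (scann.hasNext())
{
    // Each token is the text before, inside, or after an <ol> block.
    // Note: <li> elements outside an <ol> are rewritten as well.
    out.append(scann.next().replaceAll("<li>(.*?)</li>\n*", "# $1\n"));
}
str = out.toString();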
Don't use regular expressions for parsing XML/HTML. Full stop. You'll never handle all the possible variations that can legally occur in the input, and you'll forever be telling people who supply the content that you're sorry, you can only handle a restricted subset of XML/HTML, and they will forever be cursing you. And if you do get to the point where you can handle 99% of legal input, your code will be unmaintainable and slow.
There are off-the-shelf parsers to do this job - use them.
info = info.replaceAll("(?:<ol>|\\G)\\s*<li>(.+?)</li>(?:\\s*</ol>)?",
"# $1\n");
(?:<ol>|\G) ensures that each bunch of matches starts either with <ol> or where the last match left off, so it can never start matching inside a <ul> element.
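To see the effect of the \G anchor, here is a small sketch that runs the replacement on input containing both a <ul> and an <ol>. The class name OrderedListOnly and the sample input are assumptions made for the demonstration.

```java
final class OrderedListOnly {
    // (?:<ol>|\G) : start at an <ol>, or exactly where the previous match ended.
    static final String REGEX = "(?:<ol>|\\G)\\s*<li>(.+?)</li>(?:\\s*</ol>)?";

    /** Rewrites the items of every <ol> as "# item" lines and leaves other markup alone. */
    static String convert(String html) {
        return html.replaceAll(REGEX, "# $1\n");
    }

    public static void main(String[] args) {
        String html = "<ul>\n<li>skip</li>\n</ul>\n"
                    + "<ol>\n<li>example1</li>\n<li>example2</li>\n</ol>";
        // The <ul> block is printed unchanged, followed by "# example1" and "# example2".
        System.out.print(convert(html));
    }
}
```

One caveat, as far as I can tell: before the first match \G also matches at the very start of the input, so a fragment that begins directly with <li> would be rewritten even without an <ol>.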
EDIT: fixing the <ul> problem mentioned by hoipolloi with this lookahead:
(?=((?!</ul>)(.|\n))*</ol>)
This one worked on your example:
info = info.replaceAll(
    "(?:<ol>\\s*)?<li>(.*?)</li>(?=((?!</ul>)(.|\n))*</ol>)(?:\\s*</ol>)?",
    "# $1"
);
- (?:<ol>\s*)? - If it exists, match <ol> plus any whitespace following it. The (?: means don't capture this group.
- <li>(.*?)</li> - Match an <li>anything</li> and capture the anything in the first group. The *? means match any length, non-greedily (i.e. match the first </li> after the <li>).
- New clause (?=((?!</ul>)(.|\n))*</ol>) - Ensure that an </ol> follows this <li> before a </ul>.
- (?:\s*</ol>)? - And match any trailing whitespace plus </ol>.
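Put together, the pieces can be tried on input that mixes a <ul> with an <ol>. This is only a sketch; the class name LookaheadListDemo and the sample input are made up for the illustration, and the backslashes are doubled because the pattern is written as a Java string literal.

```java
final class LookaheadListDemo {
    // Same pattern as above, escaped for a Java string literal.
    static final String REGEX =
        "(?:<ol>\\s*)?<li>(.*?)</li>(?=((?!</ul>)(.|\n))*</ol>)(?:\\s*</ol>)?";

    /** Rewrites only those <li> items that are followed by </ol> before any </ul>. */
    static String convert(String html) {
        return html.replaceAll(REGEX, "# $1");
    }

    public static void main(String[] args) {
        String html = "<ul>\n<li>skip</li>\n</ul>\n"
                    + "<ol>\n<li>example1</li>\n<li>example2</li>\n</ol>";
        // The <ul> block stays as it is; the <ol> items become "# example1" and "# example2".
        System.out.println(convert(html));
    }
}
```

The line breaks between the items are not part of the match, which is why the replacement string needs no "\n" of its own. On long inputs the (.|\n)* part may recurse deeply in Java's regex engine, so this is probably best kept to small documents.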