Fastest way to perform a lot of string replacements in Java

I have to write some sort of parser that takes a String and replaces certain sets of characters with others. The code looks like this:

noHTMLString = noHTMLString.replaceAll("</p>", "\n");
noHTMLString = noHTMLString.replaceAll("<br/>", "\n\n");
noHTMLString = noHTMLString.replaceAll("<br />", "\n\n");
//here goes A LOT of lines like these ones

The function is very long and performs a lot of string replacements. The issue is that it takes a lot of time because the method is called many times, which slows down the application.

I have read some threads here about using StringBuilder as an alternative, but it lacks a replaceAll method. Also, as noted in "Does string.replaceAll() performance suffer from string immutability?", the replaceAll method in the String class works with Pattern and Matcher, and Matcher.replaceAll() uses a StringBuilder to build the value it eventually returns, so I don't know whether switching to StringBuilder would really reduce the time spent on the substitutions.

Do you know a fast way to do a lot of String replacements? Do you have any advice for this problem?

Thanks.

EDIT: I have to create a report that has a few fields containing HTML text. For each row I call the method that replaces all the HTML tags and special characters inside these strings. With a full report it takes more than 3 minutes to parse all the text. The problem is that I have to invoke the method very often.


I found that org.apache.commons.lang.StringUtils is the fastest, if you don't want to bother with the StringBuffer.

You can use it like this:
noHTMLString = StringUtils.replace(noHTMLString, "</p>", "\n");

I did performance testing, and found this to be faster than my custom StringBuffer solution (similar to the one @extraneon proposed).
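If adding Commons Lang is not an option, the JDK's own String.replace(CharSequence, CharSequence) also does literal (non-regex) replacement. The sketch below is a dependency-free stand-in, not the StringUtils code itself; the class name LiteralReplacer and the replacement table are made up for illustration, and whether it matches StringUtils.replace in speed depends on the JDK version, so measure it on your own data.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LiteralReplacer {
    // Hypothetical replacement table. LinkedHashMap keeps insertion order,
    // so the replacements are applied in the order they are listed.
    private static final Map<String, String> REPLACEMENTS = new LinkedHashMap<>();
    static {
        REPLACEMENTS.put("</p>", "\n");
        REPLACEMENTS.put("<br/>", "\n\n");
        REPLACEMENTS.put("<br />", "\n\n");
    }

    static String strip(String input) {
        String result = input;
        for (Map.Entry<String, String> e : REPLACEMENTS.entrySet()) {
            // String.replace treats both arguments as literal text, not as a regex.
            result = result.replace(e.getKey(), e.getValue());
        }
        return result;
    }
}
```

This still makes one pass over the string per table entry, so it addresses the regex overhead but not the number of passes.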


It looks like you're parsing HTML there. Have you thought about using a third-party library instead of reinventing the wheel?


I agree with Martijn in using a ready-built solution instead of parsing it yourself - there's loads of stuff built into Java in the javax.xml package. A neat solution would be to use XSLT transformation to replace, this looks like an ideal use case for it. However, it is complicated.

To answer the question, have you considered using the regular expression libraries? It looks like you have many different things you want to match and replace with the same thing (\n or the empty string). Using regular expressions you could build an expression like "<br>|<br/>|<br />", or something more clever like "<br.*?>", to create a matcher object on which you can call replaceAll.
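As a rough sketch of that idea, the three <br> variants can be folded into one pattern that is compiled once and reused. The class name BreakTags is hypothetical, and the pattern used here is "<br\s*/?>" rather than "<br.*?>", on the assumption that you do not want to match unrelated tags that merely start with "br".

```java
import java.util.regex.Pattern;

class BreakTags {
    // Compiled once and reused. Matches <br>, <br/> and <br />
    // (optional whitespace, then an optional slash, before the closing '>').
    private static final Pattern BR = Pattern.compile("<br\\s*/?>");

    static String replaceBreaks(String input) {
        // One pass replaces every variant with two newlines.
        return BR.matcher(input).replaceAll("\n\n");
    }
}
```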


I fully agree with Martijn here. Pick the right tool for the job.

If, however, your file is not HTML but only contains some HTML tokens, there are a few ways you can speed things up.

First, if some amount of the input does not contain replaceable elements, consider starting with something like:

if (input.indexOf('<') == -1) {
    return input;
}

Second, consider a regex:

Pattern p = Pattern.compile( your_regex );

Don't make a pattern for every single replaceAll line; try to combine them (regex has an OR operator) and let Pattern optimize the regex. Reuse the compiled pattern instead of compiling it on every call, since compilation is fairly expensive.

If regexes are a bit too complex, you can also implement a faster (but potentially less readable) replacement engine yourself:
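When the tokens map to different replacements, one way to combine them is a single alternation plus a lookup table, using Matcher.appendReplacement and appendTail. This is only a sketch under assumptions: the class name TagReplacer and the three table entries are illustrative, taken from the lines in the question.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagReplacer {
    // Hypothetical table mapping each matched token to its replacement.
    private static final Map<String, String> REPLACEMENTS = new HashMap<>();
    static {
        REPLACEMENTS.put("</p>", "\n");
        REPLACEMENTS.put("<br/>", "\n\n");
        REPLACEMENTS.put("<br />", "\n\n");
    }

    // One alternation covering every key in the table, compiled once.
    private static final Pattern TOKENS = Pattern.compile("</p>|<br/>|<br />");

    static String strip(String input) {
        Matcher m = TOKENS.matcher(input);
        StringBuffer out = new StringBuffer(input.length());
        while (m.find()) {
            // Look up the replacement for the token that matched;
            // quoteReplacement keeps '$' and '\' in it from being interpreted.
            m.appendReplacement(out, Matcher.quoteReplacement(REPLACEMENTS.get(m.group())));
        }
        m.appendTail(out); // copy whatever follows the last match
        return out.toString();
    }
}
```

The input is scanned once regardless of how many tokens are in the table, at the cost of keeping the pattern and the table in sync.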

StringBuilder result = new StringBuilder(input.length());
for (int i = 0; i < input.length(); i++) {
  char c = input.charAt(i);

  if (c != '<') {
    result.append(c); // ordinary character: copy it through unchanged
    continue;
  }

  int closePos = input.indexOf('>', i);
  if (closePos == -1) { // no closing '>': copy the rest as-is
    result.append(input, i, input.length());
    return result.toString();
  }
  String token = input.substring(i + 1, closePos); // text between '<' and '>'
  i = closePos; // the loop's i++ resumes just after the '>'
  if (token.equals("/p")) {
    result.append("\n");
  } else if (token.equals(...)) {
  } else if (...) {
  }
}
return result.toString();

This may have some errors :)

The advantage is you have to iterate through the input only once. The big disadvantage is that it is not all that easy to understand. You could also write a state machine, analyzing per character what the new state should be, and that would probably be faster and even more work.
