Need to clean malformed tags using a regular expression
I'm looking for a regular expression that handles the following situation.
I need to clean certain tags within free-flowing text. For example, the text contains two important tags: <2004:04:12> and <name of person>. Unfortunately, some of the tags are missing the "<" or ">" delimiter.
For example, some are as follows:
1) <2004:04:12 , I need this to be <2004:04:12>
2) 2004:04:12>, I need this to be <2004:04:12>
3) <John Doe , I need this to be <John Doe>
I attempted to use the following for situation 1:
String regex = "<\\d{4}-\\d{2}-\\d{2}\\w*{2}[^>]";
String output = content.replaceAll(regex,"$0>");
This did find all instances of "<2004:04:12" and the result was "<2004:04:12 >". However, I need to eliminate the space prior to the ending tag.
I'm not sure this is the best way. Any suggestions?
Thanks
Basically, you are looking for a negative look-ahead, like this:
String regex = "<\\d{4}:\\d{2}:\\d{2}(?!>)";
String output = content.replaceAll(regex,"$0>");
This handles the date tags: the look-ahead only checks the next character without consuming it, so no stray space ends up inside the tag. No regex can recognise an arbitrary name, though, so for the name tags you must either define precisely what a name can look like or accept that the same approach will not work.
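As a quick check of the look-ahead idea, here is a minimal sketch. It assumes the colon-separated dates shown in the question, and the class and method names (LookaheadDateFix, closeDateTags) are made up for illustration.

```java
class LookaheadDateFix {
    // Matches "<yyyy:mm:dd" only when the next character is not '>'.
    // The look-ahead consumes nothing, so no trailing space is captured in $0.
    static final String OPEN_ONLY = "<\\d{4}:\\d{2}:\\d{2}(?!>)";

    // Hypothetical helper: appends the missing '>' to unclosed date tags.
    static String closeDateTags(String content) {
        return content.replaceAll(OPEN_ONLY, "$0>");
    }

    public static void main(String[] args) {
        // Unclosed tag gets its '>' directly after the date.
        System.out.println(closeDateTags("on <2004:04:12 we met")); // on <2004:04:12> we met
        // A tag that is already closed is left alone.
        System.out.println(closeDateTags("on <2004:04:12> we met")); // on <2004:04:12> we met
    }
}
```

Note that this only covers situation 1 from the question; a date with a missing "<" is not touched.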
To fix the dates, you can match any date with zero, one, or two angle brackets:
String regex = "(\\s?\\<?)(\\d{4}:\\d{2}:\\d{2})(\\>?\\s)";
String replace = " <$2> ";
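Here is a rough sketch of the same idea applied to the examples from the question. It is a variation rather than the exact pattern above: it leaves the surrounding whitespace alone instead of normalising it, and it adds digit guards so that it does not fire inside longer numbers. The names DateTagNormalizer and normalizeDates are hypothetical.

```java
class DateTagNormalizer {
    // Optional '<', then the date itself (group 1), then optional '>'.
    // (?<!\d) and (?!\d) keep the pattern from matching inside longer digit runs.
    static final String DATE = "<?(?<!\\d)(\\d{4}:\\d{2}:\\d{2})(?!\\d)>?";

    // Hypothetical helper: rewrites every date so it has both brackets.
    static String normalizeDates(String content) {
        return content.replaceAll(DATE, "<$1>");
    }

    public static void main(String[] args) {
        System.out.println(normalizeDates("a <2004:04:12 , b")); // a <2004:04:12> , b
        System.out.println(normalizeDates("a 2004:04:12>, b"));  // a <2004:04:12>, b
        System.out.println(normalizeDates("a <2004:04:12> b"));  // a <2004:04:12> b
    }
}
```

Be aware that a bare date with no bracket at all is also wrapped, which matches the "zero brackets" case above but may not be what you want if the text contains untagged dates.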
To recognise a name, we assume that each part of the name begins with a capital letter and that the only separator is a space. We match the angle bracket explicitly at the start or end, and the character immediately before or after the name must be whitespace or punctuation.
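// Opening bracket present, closing bracket missing. The possessive quantifiers (*+)
// stop the match from backtracking to a shorter name inside a well-formed tag.
String regex = "(\\<[A-Z][a-zA-Z]*+(\\s[A-Z][a-zA-Z]*+)*+)(?=[\\.!?:;\\s])";
String replace = "$1>";
// Closing bracket present, opening bracket missing.
String regex = "(?<=[\\.!?:;\\s])([A-Z][a-zA-Z]*(\\s[A-Z][a-zA-Z]*)*\\>)";
String replace = "<$1";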
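One thing to watch with two separate patterns: the second one can also fire on the last word of a well-formed multi-word tag (the "Doe>" in "<John Doe>"). A possible way around that, sketched below under the same "capitalised words separated by single spaces" assumption, is a single pass that matches the name together with whichever brackets are present and only rewrites it when at least one bracket was found. This is a swapped-in approach, not the two-pattern method above, and NameTagFixer and fixNameTags are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NameTagFixer {
    // Optional '<' (group 1), capitalised words separated by single spaces
    // (group 2), optional '>' (group 3).
    private static final Pattern NAME = Pattern.compile(
            "(<)?([A-Z][a-zA-Z]*(?: [A-Z][a-zA-Z]*)*)(>)?");

    // Hypothetical helper: completes name tags that have only one bracket.
    static String fixNameTags(String content) {
        Matcher m = NAME.matcher(content);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            boolean hasBracket = m.group(1) != null || m.group(3) != null;
            // Capitalised words with no bracket at all are treated as ordinary text.
            String replacement = hasBracket ? "<" + m.group(2) + ">" : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fixNameTags("<John Doe , hello"));   // <John Doe> , hello
        System.out.println(fixNameTags("met John Doe> today")); // met <John Doe> today
        System.out.println(fixNameTags("<John Doe> ok"));       // <John Doe> ok
    }
}
```

The usual caveat still applies: a capitalised word that happens to sit right next to the name (for example at the start of a sentence) will be pulled into the tag, so the definition of a name may need tightening for real data.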