开发者

RegexKitLite: Match Expression --> Match anything except ] --> Match ]

I am essentially attempting to replace all of the footnotes in a large text. There are various reasons I am doing this in Objective-C, so please assume that constraint.

Every footnote beings with this: [Footnote

Every footnote ends with this: ]

There can be absolutely anything between those two markers, including line breaks. However, there will never be ] between them.

So, essentially I want to match [Footnote, then match anything except ], until ] is matched.

This is the closest I have been able to get to identifying all of the footnotes:

NSString *regexString = @"[\\[][F][o][o][t][n][o][t][e][^\\]\n]*[\\]]";

Using this regular expression manages to identify 780/889 footnotes. It also appears that none of those 780 are false alarms. The only ones it appears to miss are those footnotes that have line breaks in them.

I have spent a lengthly amount of time on www.regular-expressions.info, specifically on the page about dots (http://www.regular-expressions.info/dot.html). This has helped me to create the above regular expressions, but I have not truly figured out how to include any character or line break, except right bracket.

Using the following regular expression instead manages to capture all of the footnotes, but it captures way too much text, because * is greedy: (?s)[\\[][F][o][o][t][n][o][t][e].*[\\]]

Here is some sample text that the regular expression is run on:

  <p id="id00082">[Footnote 1: In the history of Florence in the early part of the XVIth century <i>Piero di Braccio Martelli</i> is frequently mentioned as <i>Commissario della Signoria</i>. He was famous for his learning and at his death left four books on Mathematics ready for the press; comp. LITTA, <i>Famiglie celebri Italiane</i>, <i>Famiglia Martelli di Firenze</i>.—In the Official Catalogue of MSS. in the Brit. Mus., New Series Vol. I., where this passage is printed, <i>Barto</i> has been wrongly given for Braccio.</p>

  <p id="id00083">2. <i>addi 22 di marzo 1508</i>. The Christian era was computed in Florence at that time from the Incarnation (Lady day, March 25th). Hence this should be 1509 by our reckoning.</p>

  <p id="id00084">3. <i>racolto tratto di molte carte le quali io ho qui copiate</i>. We must suppose that Leonardo means that he has copied out his own MSS. and not those of others. The first thirteen leaves of the MS. in the Brit. Mus. are a fair copy of some notes on physics.]</p>

  <p id="id00085">Suggestions for the arrangement of MSS treating of particular subjects.(5-8).</p>

When you put together the science of the motions of water, remember to include under each proposition its application and use, in order that this science may not be useless.--

[Footnote 2: A comparatively small portion of Leonardo's notes on water-power was published at Bologna in 1828, under the title: "_Del moto e misura dell'Acqua, di L. da Vinci_".]

In this example there are two footnotes and some non-footnote text. The first footnote, as you can see, contains two line breaks inside it. The second one contains no line breaks.

The first regular expression I mentioned above will manage to capture Footnote 2 in this example text, but it will not capture Footno开发者_Python百科te 1 because it contains line breaks.

Any improvements on my regular expression would be most appreciated.


Try

@"\\[Footnote[^\\]]*\\]";

This should match across newlines. No need to put a single character into a character class, either.

As a commented, multiline regex (without string escapes):

\[        # match a literal [
Footnote  # match literal "Footnote"
[^\]]*    # match zero or more characters except ]
\]        # match ]

Inside a character class ([...]), the caret ^ takes on a different meaning; it negates the contents of the class. So [ab] matches a or b, whereas [^ab] matches any character except a or b.

Of course, if you have nested footnotes, this will malfunction. A text like [Footnote foo [footnote bar] foo] will match from the beginning until bar]. To avoid this, change the regex to

@"\\[Footnote[^\\]\\[]*\\]";

so neither opening nor closing brackets are allowed. Then of course, you only match the innermost Footnotes and will have to apply the same regex twice (or more, depending on the maximum level of nesting) to the entire text, "peeling back" layer by layer.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜