开发者

.NET Regular Expression: Get paragraphs

I'm trying to get paragraphs from a string in C# with Regular Expressions. By paragraphs; I mean string blocks ending with double or more \r\n. (NOT HTML paragraphs <p>)...

Here is a sample text:

For example this is a paragraph with a carriage return here

and a new line here.

At 开发者_如何学编程this point, second paragraph starts. A paragraph ends if double or more \r\n is matched or

if reached at the end of the string ($).

I tried the pattern:

Regex regex = new Regex(@"(.*)(?:(\r\n){2,}|\r{2,}|\n{2,}|$)", RegexOptions.Multiline);

but this does not work. It matches every line ending with a single \r\n. What I need is to get all characters including single carriage returns and newline chars till reached a double \r\n.


.* is being greedy and consuming as much as it can. Your second set of () has a $ so the expression that is being used is (.*)(?). In order to make the .* not be greedy, follow it with a ?.

When you specify RegexOptions.Multiline, .NET will split the input on line breaks. Use RegexOptions.Singleline to make it treat the entire input as one.

Regex regex = new Regex(@"(.*?)(?:(\r\n){2,}|\r{2,}|\n{2,}|$)", RegexOptions.Singleline);


An opposite approach will be to match the separators instead of the paragraphs, making the problem almost trivial. Consider:

string[] paragraphs = Regex.Split(text, @"^\s*$", RegexOptions.Multiline);

By splitting the input string by empty lines you can easily get all paragraphs. If you only want blank lines with no spaces you can simplify that even further, and use the parretn ^$. In that case you can also use the non-regex String.Split, with an array of separators:

string[] separators = {"\n\n", "\r\r", "\r\n\r\n"};
string[] paragraphs = text.Split(separators,
                                 StringSplitOptions.RemoveEmptyEntries);


Do you have to use a regular expression? Tools like COCO/R could make this job pretty easy as well. In addition it might just prove to be faster than generating code at runtime using a regex.

COMPILER YourParaProcessor
// your code goes here
TOKENS
newLine= '\r'|'\n'.
paraLetter = ANY - '\n' - '\r' .

YourParaProcessor 
=
 {Paragraph}
.

Paragraph =
  {paraLetter} '\r\n' .
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜