Regular expression question: Until next match OR End Of Document

I'm working on a document parser to extract data from some documents that I've been given and I'm coding in C#. The documents are in the form:


(Type 1): (potentially multi-lined string)
(Type 2): (potentially multi-lined string)
(Type 3): (potentially multi-lined string)
...
(Type N): (potentially multi-lined string)
(Type 1): (potentially multi-lined string)
...
End Of Document.

The document repeats (Type 1)-(Type N) M times in the same format.

I'm having trouble with the multi-line strings and with finding the last iteration of (Type 1)-(Type N).

What I need to do is capture each (potentially multi-lined string) in a group named after its preceding (Type #).

Here is a snippet of the document that I'm trying to match:

Name: John Dow
Position: VP. over Development
Bio: Here is a really long string of un important stuff
that could include words like "Bio" or "Name".  Some times I have problems
here, but for the most part it should be normal Bio information
Position History: Vp. over Development
Sr. Project Manager
Jr. Project Manager
Developer
Peon
Notes: Here are some notes that may or may not be multilined
and if it is, all the lines need to be captured for this person.
Name: Joe Noob
Position: Peon
Bio: I'm a peon, so I have little bio
Position History: Peon
Notes: few notes
Name: Jane Smith
Position: VP. over Sales
Bio: Here is a really long string of more un important stuff
that could include words like "Bio" or "Name".  Some times I have problems
here, but for the most part it should be normal Bio information
Position History: Vp. over Sales
Sales Manager
Secretary
Notes: Here are some notes that may or may not be multilined
and if it is, all the lines need to be captured for this person.

The order of (Type #) is always the same, and each one is always preceded by a newline.

What I have:

Name:\s(?:(?.*?)\r\n)+?Position:\s(?:(?.*?)\r\n)+?Bio:\s(?:(?.*?)\r\n)+?Position History:\s(?:(?.*?)\r\n)+?Notes:\s(?:(?.*?)\r\n)+?

Any help would be great!


Because you're using lazy matching, the last token takes only as much as it must. You can solve that by adding a lookahead at the end of your pattern, so that it matches up to the next token:

(?=^Name:|$)

Here's the full regex:

Name:\s(?:(.*?)\s+)Position:\s(?:(.*?)\s+)Bio:\s(?:(.*?)\s+)Position History:\s(?:(.*?)\s+)Notes:\s(?:(.*?)\s+)(?=^Name:|$)

Example: http://regexhero.net/tester/?id=92982feb-806f-4d0a-96a3-5ef6689a0e01
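
For readers who want to try the lookahead idea directly in C#, here is a minimal sketch rather than a definitive implementation. The group names (Name, Position, Bio, PositionHistory, Notes) and the RecordParser class are made up for illustration. The pattern also differs a little from the one above: it assumes RegexOptions.Singleline so that `.` can cross line breaks, anchors each label with `^` under RegexOptions.Multiline, and uses `\z` instead of `$` for the end of the document, because under Multiline `$` also matches at the end of every line.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RecordParser
{
    // Group names are hypothetical; they are not taken from the question.
    private static readonly Regex RecordPattern = new Regex(
        @"^Name:\s(?<Name>.*?)\s+" +
        @"^Position:\s(?<Position>.*?)\s+" +
        @"^Bio:\s(?<Bio>.*?)\s+" +
        @"^Position History:\s(?<PositionHistory>.*?)\s+" +
        @"^Notes:\s(?<Notes>.*?)\s*" +
        // Stop the lazy Notes group at the next record or at the end of the input.
        @"(?=^Name:|\z)",
        RegexOptions.Singleline | RegexOptions.Multiline);

    // Returns one Match per record; read fields via match.Groups["Notes"].Value etc.
    public static MatchCollection Parse(string document)
    {
        return RecordPattern.Matches(document);
    }
}
```

Each lazy group grows only until the next label appears at the start of a line, so a multi-line Bio or Notes value ends up in a single group value with its line breaks preserved. A value line that itself starts with "Name:" would still cut a record short.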


The simplest fix would be to do the match in right-to-left mode:

Regex r = new Regex(@"Name:\s(?:(.*?)\r\n)+?" +
                    @"Position:\s(?:(.*?)\r\n)+?" +
                    @"Bio:\s(?:(.*?)\r\n)+?" +
                    @"Position History:\s(?:(.*?)\r\n)+?" +
                    @"Notes:\s(?:(.*?)\r\n)+?",
                    RegexOptions.Singleline | RegexOptions.RightToLeft);

By the way, I had to delete a bunch of inappropriate question marks to make it work at all. You did want those groups to capture, didn't you?
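
A hedged sketch of how that right-to-left regex might be consumed follows. The RightToLeftParser class and its helper are hypothetical, and the sketch assumes the input uses \r\n line endings and ends with one. Because the groups sit inside repeated `(?: ... )+?` blocks, a multi-line value can arrive as several captures, and a right-to-left search finds the last record first, so both captures and matches are sorted by their Index to restore document order.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RightToLeftParser
{
    private static readonly Regex RecordPattern = new Regex(
        @"Name:\s(?:(.*?)\r\n)+?" +
        @"Position:\s(?:(.*?)\r\n)+?" +
        @"Bio:\s(?:(.*?)\r\n)+?" +
        @"Position History:\s(?:(.*?)\r\n)+?" +
        @"Notes:\s(?:(.*?)\r\n)+?",
        RegexOptions.Singleline | RegexOptions.RightToLeft);

    // Joins all captures of one group in document order.
    private static string JoinCaptures(Group group)
    {
        return string.Join("\n", group.Captures.Cast<Capture>()
                                      .OrderBy(c => c.Index)
                                      .Select(c => c.Value));
    }

    // Returns one array per record, in document order:
    // [Name, Position, Bio, Position History, Notes].
    public static List<string[]> Parse(string document)
    {
        return RecordPattern.Matches(document).Cast<Match>()
            .OrderBy(m => m.Index)
            .Select(m => Enumerable.Range(1, 5)
                                   .Select(i => JoinCaptures(m.Groups[i]))
                                   .ToArray())
            .ToList();
    }
}
```

Sorting by Index keeps the sketch independent of the order in which the engine reports matches and captures.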


Try this one:

(?'tag'[\w\s]+):\s*(?'val'.*([\r\n][^:]*)*)

I just grouped the label preceding the ':' as the named group 'tag', and the (potentially multi-line) text as the named group 'val'.
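
One caveat with the pattern as written: `[^:]*` also matches letters and line breaks, so 'val' appears able to run on into the next line's label up to its colon. Below is a minimal sketch of the same tag/value idea that avoids this by assuming the five labels from the sample are known in advance. The TagValueParser class and the Labels constant are made up for illustration. A line counts as a continuation of the current value only if it does not start with one of those labels followed by a colon.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagValueParser
{
    // Assumed label set, taken from the sample document in the question.
    private const string Labels = "Name|Position History|Position|Bio|Notes";

    private static readonly Regex FieldPattern = new Regex(
        @"^(?'tag'" + Labels + @"):[ \t]*" +
        // First line of the value, then any following lines that do not start a new label.
        @"(?'val'[^\r\n]*(?:\r?\n(?!(?:" + Labels + @"):)[^\r\n]*)*)",
        RegexOptions.Multiline);

    // Returns (tag, value) pairs in document order; values are trimmed.
    public static List<KeyValuePair<string, string>> Parse(string document)
    {
        var fields = new List<KeyValuePair<string, string>>();
        foreach (Match m in FieldPattern.Matches(document))
        {
            fields.Add(new KeyValuePair<string, string>(
                m.Groups["tag"].Value,
                m.Groups["val"].Value.Trim()));
        }
        return fields;
    }
}
```

This yields a flat list of fields rather than one match per person, so a caller would start a new record each time the tag is "Name".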