Regular expression question: Until next match OR End Of Document
I'm working on a document parser to extract data from some documents that I've been given and I'm coding in C#. The documents are in the form:
(Type 1): (potentially multi-lined string)
(Type 2): (potentially multi-lined string)
(Type 3): (potentially multi-lined string)
...
(Type N): (potentially multi-lined string)
(Type 1): (potentially multi-lined string)
...
End Of Document.
The document repeats (Type 1)-(Type N) M times in the same format.
I'm having trouble with the multi-line strings and with matching the last iteration of (Type 1)-(Type N).
What I need to do is capture each (potentially multi-lined string) in a group named after its preceding (Type #).
Here is a snippet of the document that I'm trying to match:
Name: John Dow
Position: VP. over Development
Bio: Here is a really long string of un important stuff
that could include words like "Bio" or "Name". Some times I have problems here,
but for the most part it should be normal Bio information
Position History: Vp. over Development
Sr. Project Manager
Jr. Project Manager
Developer
Peon
Notes: Here are some notes that may or may not be multilined
and if it is, all the lines need to be captured for this person.
Name: Joe Noob
Position: Peon
Bio: I'm a peon, so I have little bio
Position History: Peon
Notes: few notes
Name: Jane Smith
Position: VP. over Sales
Bio: Here is a really long string of more un important stuff
that could include words like "Bio" or "Name". Some times I have problems here,
but for the most part it should be normal Bio information
Position History: Vp. over Sales
Sales Manager
Secretary
Notes: Here are some notes that may or may not be multilined
and if it is, all the lines need to be captured for this person.
The order of (Type #) is always the same and they're always preceded by a newline.
What I have:
Name:\s(?:(?.*?)\r\n)+?Position:\s(?:(?.*?)\r\n)+?Bio:\s(?:(?.*?)\r\n)+?Position History:\s(?:(?.*?)\r\n)+?Notes:\s(?:(?.*?)\r\n)+?
Any help would be great!
Because you're using lazy matching, the last token takes only as much as it must. You can solve that by adding a lookahead at the end of your pattern, so the last group matches up to the next token or the end of the document:
(?=^Name:|$)
Here's the full regex:
Name:\s(?:(.*?)\s+)Position:\s(?:(.*?)\s+)Bio:\s(?:(.*?)\s+)Position History:\s(?:(.*?)\s+)Notes:\s(?:(.*?)\s+)(?=^Name:|$)
Example: http://regexhero.net/tester/?id=92982feb-806f-4d0a-96a3-5ef6689a0e01
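For readers who want to try the lookahead idea from C#, here is a minimal sketch rather than a definitive implementation. It makes a few assumptions that are not in the answer above: the class name RecordParser, the group names (chosen to mirror the labels), and the combination of RegexOptions.Multiline and RegexOptions.Singleline. It also swaps $ for \z in the lookahead, because under Multiline $ matches at the end of every line, while \z matches only at the very end of the input.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; the name and shape are assumptions.
public static class RecordParser
{
    // Multiline: ^ matches at the start of every line.
    // Singleline: . also matches newlines, so a value can span lines.
    private static readonly Regex RecordRegex = new Regex(
        @"^Name:\s(?<Name>.*?)\s+" +
        @"^Position:\s(?<Position>.*?)\s+" +
        @"^Bio:\s(?<Bio>.*?)\s+" +
        @"^Position History:\s(?<PositionHistory>.*?)\s+" +
        @"^Notes:\s(?<Notes>.*?)\s*" +
        @"(?=^Name:|\z)", // stop before the next record or at end of input
        RegexOptions.Multiline | RegexOptions.Singleline);

    private static readonly string[] Fields =
        { "Name", "Position", "Bio", "PositionHistory", "Notes" };

    // Returns one dictionary per record, keyed by group name.
    public static List<Dictionary<string, string>> Parse(string text)
    {
        var records = new List<Dictionary<string, string>>();
        foreach (Match m in RecordRegex.Matches(text))
        {
            var record = new Dictionary<string, string>();
            foreach (string field in Fields)
            {
                record[field] = m.Groups[field].Value;
            }
            records.Add(record);
        }
        return records;
    }
}
```

Multi-line values keep their internal line breaks, and trailing whitespace before the next label is left out of each group. A label that happens to start a line inside a value (for example a Bio line beginning with "Position:") would still confuse this sketch.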
The simplest fix would be to do the match in right-to-left mode:
Regex r = new Regex(@"Name:\s(?:(.*?)\r\n)+?" +
@"Position:\s(?:(.*?)\r\n)+?" +
@"Bio:\s(?:(.*?)\r\n)+?" +
@"Position History:\s(?:(.*?)\r\n)+?" +
@"Notes:\s(?:(.*?)\r\n)+?",
RegexOptions.Singleline | RegexOptions.RightToLeft);
By the way, I had to delete a bunch of inappropriate question marks to make it work at all. You did want those groups to capture, didn't you?
Try this one:
(?'tag'[\w\s]+):\s*(?'val'.*([\r\n][^:]*)*)
I just grouped the label preceding the ':' as the named group 'tag' and the (potentially multi-line) text after it as the named group 'val'.
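A variant of the same tag/value idea can be sketched in C# as follows. This is not the pattern above: it assumes the set of labels is fixed and known in advance (Name, Position, Bio, Position History, Notes, as in the question) and uses a lookahead to end each value at the next label or at the end of the input. The class name FieldScanner and the method Scan are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; assumes the label set is fixed.
public static class FieldScanner
{
    // "Position History" is listed before "Position" for readability;
    // the ':' that follows disambiguates the two either way.
    private const string Labels = "Name|Position History|Position|Bio|Notes";

    // Multiline: ^ matches at each line start. Singleline: . matches newlines.
    private static readonly Regex FieldRegex = new Regex(
        @"^(?<tag>" + Labels + @"):\s(?<val>.*?)\s*" +
        @"(?=^(?:" + Labels + @"):|\z)",
        RegexOptions.Multiline | RegexOptions.Singleline);

    // Returns (label, value) pairs in document order.
    public static List<KeyValuePair<string, string>> Scan(string text)
    {
        var fields = new List<KeyValuePair<string, string>>();
        foreach (Match m in FieldRegex.Matches(text))
        {
            fields.Add(new KeyValuePair<string, string>(
                m.Groups["tag"].Value, m.Groups["val"].Value));
        }
        return fields;
    }
}
```

Since the labels always appear in the same order, a caller could start a new record each time the tag is "Name". Values may contain colons here, because only a known label at the start of a line ends a value.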