Regular expression question: Until next match OR End Of Document

Posted: 2023-02-06 12:36

I'm working on a document parser to extract data from some documents that I've been given and I'm coding in C#. The documents are in the form:


(Type 1): (potentially multi-lined string)
(Type 2): (potentially multi-lined string)
(Type 3): (potentially multi-lined string)
...
(Type N): (potentially multi-lined string)
(Type 1): (potentially multi-lined string)
...
End Of Document.

The document repeats (Type 1)-(Type N) M times in the same format.

I'm having trouble with the multi-lined strings and with finding the last iteration of (Type 1)-(Type N).

What I need to do is capture each (potentially multi-lined string) in a group named by its preceding (Type #).

Here is a snippet of the document that I'm trying to match:

Name: John Dow
Position: VP. over Development
Bio: Here is a really long string of un important stuff
that could include words like "Bio" or "Name".  Some times I have problems
here, but for the most part it should be normal Bio information
Position History: Vp. over Development
Sr. Project Manager
Jr. Project Manager
Developer
Peon
Notes: Here are some notes that may or may not be multilined
and if it is, all the lines need to be captured for this person.
Name: Joe Noob
Position: Peon
Bio: I'm a peon, so I have little bio
Position History: Peon
Notes: few notes
Name: Jane Smith
Position: VP. over Sales
Bio: Here is a really long string of more un important stuff
that could include words like "Bio" or "Name".  Some times I have problems
here, but for the most part it should be normal Bio information
Position History: Vp. over Sales
Sales Manager
Secretary
Notes: Here are some notes that may or may not be multilined
and if it is, all the lines need to be captured for this person.

The order of (Type #) is always the same and they're always preceded by a newline.

What I have:

Name:\s(?:(?.*?)\r\n)+?Position:\s(?:(?.*?)\r\n)+?Bio:\s(?:(?.*?)\r\n)+?Position History:\s(?:(?.*?)\r\n)+?Notes:\s(?:(?.*?)\r\n)+?

Any help would be great!

Because you're using lazy matching, the last token takes only as much as it must. You can solve that by adding a lookahead at the end of your pattern, to match until the next token or the end of the document:

(?=^Name:|$)

Here's the full regex:

Name:\s(?:(.*?)\s+)Position:\s(?:(.*?)\s+)Bio:\s(?:(.*?)\s+)Position History:\s(?:(.*?)\s+)Notes:\s(?:(.*?)\s+)(?=^Name:|$)

Example: http://regexhero.net/tester/?id=92982feb-806f-4d0a-96a3-5ef6689a0e01
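As a rough sketch of how this lookahead approach might look in C#, here is a variant with named groups. The group names, the ParseWithLookahead helper and the sample text are assumptions made up for illustration, not part of the answer above. It also swaps `$` for `\z` and uses `\s*` before the lookahead: with RegexOptions.Multiline (needed so `^` matches at line starts) `$` also matches at the end of every line, which could cut a multi-line Notes value short.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    // Hypothetical pattern: the same idea as the regex above, with named groups.
    // Singleline lets '.' cross line breaks; Multiline lets '^' match at line starts.
    static readonly Regex Record = new Regex(
        @"^Name:\s(?<Name>.*?)\s+" +
        @"^Position:\s(?<Position>.*?)\s+" +
        @"^Bio:\s(?<Bio>.*?)\s+" +
        @"^Position History:\s(?<History>.*?)\s+" +
        @"^Notes:\s(?<Notes>.*?)\s*" +
        @"(?=^Name:|\z)",   // stop at the next record or at the very end of the input
        RegexOptions.Singleline | RegexOptions.Multiline);

    // Returns one Match per record, in document order.
    public static List<Match> ParseWithLookahead(string doc)
    {
        var result = new List<Match>();
        foreach (Match m in Record.Matches(doc))
            result.Add(m);
        return result;
    }

    public static void Main()
    {
        // Made-up sample in the same shape as the question's snippet.
        string doc =
            "Name: John Dow\r\n" +
            "Position: VP. over Development\r\n" +
            "Bio: A bio that mentions \"Bio\" or \"Name\"\r\n" +
            "and continues on a second line\r\n" +
            "Position History: Vp. over Development\r\n" +
            "Developer\r\n" +
            "Notes: first note line\r\n" +
            "second note line\r\n" +
            "Name: Joe Noob\r\n" +
            "Position: Peon\r\n" +
            "Bio: I'm a peon, so I have little bio\r\n" +
            "Position History: Peon\r\n" +
            "Notes: few notes\r\n";

        foreach (Match m in ParseWithLookahead(doc))
        {
            Console.WriteLine("Name:  " + m.Groups["Name"].Value);
            Console.WriteLine("Notes: " + m.Groups["Notes"].Value);
        }
    }
}
```

Each group here holds the whole value in a single capture, embedded line breaks included. A value line that itself starts with one of the labels followed by a colon would still confuse the pattern.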

The simplest fix would be to do the match in right-to-left mode:

Regex r = new Regex(@"Name:\s(?:(.*?)\r\n)+?" +
                    @"Position:\s(?:(.*?)\r\n)+?" +
                    @"Bio:\s(?:(.*?)\r\n)+?" +
                    @"Position History:\s(?:(.*?)\r\n)+?" +
                    @"Notes:\s(?:(.*?)\r\n)+?",
                    RegexOptions.Singleline | RegexOptions.RightToLeft);

By the way, I had to delete a bunch of inappropriate question marks to make it work at all. You did want those groups to capture, didn't you?
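For illustration, here is a minimal sketch of how the results of that right-to-left regex might be read. The ParseRightToLeft helper and the sample text are assumptions for this example, and it assumes every line, including the last one, ends in `\r\n`. Because of the lazy `(?:(.*?)\r\n)+?` loops, each line of a value tends to land in its own capture, so the sketch joins `Group.Captures` rather than reading `Group.Value`. It sorts matches and captures by `Index` so it does not depend on the order in which the right-to-left engine returns them.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RightToLeftDemo
{
    static readonly Regex Record = new Regex(
        @"Name:\s(?:(.*?)\r\n)+?" +
        @"Position:\s(?:(.*?)\r\n)+?" +
        @"Bio:\s(?:(.*?)\r\n)+?" +
        @"Position History:\s(?:(.*?)\r\n)+?" +
        @"Notes:\s(?:(.*?)\r\n)+?",
        RegexOptions.Singleline | RegexOptions.RightToLeft);

    // Returns one string[] per record: { name, position, bio, history, notes }.
    public static List<string[]> ParseRightToLeft(string doc)
    {
        var records = new List<string[]>();
        // Sort by position so records come out in document order.
        foreach (Match m in Record.Matches(doc).Cast<Match>().OrderBy(x => x.Index))
        {
            var fields = new string[5];
            for (int i = 0; i < 5; i++)
            {
                // Groups 1..5 correspond to the five labels; one capture per line.
                var lines = m.Groups[i + 1].Captures
                    .Cast<Capture>()
                    .OrderBy(c => c.Index)
                    .Select(c => c.Value);
                fields[i] = string.Join("\n", lines);
            }
            records.Add(fields);
        }
        return records;
    }

    public static void Main()
    {
        // Made-up sample; every line ends in \r\n.
        string doc =
            "Name: John Dow\r\n" +
            "Position: VP. over Development\r\n" +
            "Bio: line one\r\nline two\r\n" +
            "Position History: Developer\r\nPeon\r\n" +
            "Notes: note one\r\nnote two\r\n" +
            "Name: Joe Noob\r\n" +
            "Position: Peon\r\n" +
            "Bio: I'm a peon\r\n" +
            "Position History: Peon\r\n" +
            "Notes: few notes\r\n";

        foreach (string[] rec in ParseRightToLeft(doc))
            Console.WriteLine(rec[0] + " | " + rec[4].Replace("\n", " / "));
    }
}
```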

Try this one:

(?'tag'[\w\s]+):\s*(?'val'.*([\r\n][^:]*)*)

I just grouped the label preceding the ':' as the named group 'tag', and the (potentially multi-line) text as the named group 'val'.
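One caveat with the pattern as written: `[^:]*` keeps going until the next colon, so 'val' can run on into the following label (for the sample, the first 'val' would end with "Position"). Below is a hedged sketch of a tag/value variant that avoids this by refusing to continue onto a line that starts with a known label. The fixed label list, the ParseFields helper, the trimming of values and the sample text are all assumptions for this example, not part of the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagValueDemo
{
    // Labels assumed from the question's sample document.
    const string Labels = "Name|Position History|Position|Bio|Notes";

    // 'tag' is a known label at the start of a line; 'val' is the rest of that line
    // plus any following lines that do not themselves start with "<label>:".
    static readonly Regex Field = new Regex(
        @"^(?<tag>" + Labels + @"):[ \t]*" +
        @"(?<val>[^\r\n]*(?:\r?\n(?!(?:" + Labels + @"):)[^\r\n]*)*)",
        RegexOptions.Multiline);

    // Groups the tag/value pairs into records; a "Name" tag starts a new record.
    public static List<Dictionary<string, string>> ParseFields(string doc)
    {
        var records = new List<Dictionary<string, string>>();
        foreach (Match m in Field.Matches(doc))
        {
            string tag = m.Groups["tag"].Value;
            if (tag == "Name" || records.Count == 0)
                records.Add(new Dictionary<string, string>());
            // Trim drops the trailing line break that the last value may pick up.
            records[records.Count - 1][tag] = m.Groups["val"].Value.Trim();
        }
        return records;
    }

    public static void Main()
    {
        // Made-up sample in the same shape as the question's snippet.
        string doc =
            "Name: John Dow\n" +
            "Position: VP. over Development\n" +
            "Bio: A bio that mentions \"Bio\" or \"Name\"\n" +
            "and continues on a second line\n" +
            "Position History: Vp. over Development\nDeveloper\n" +
            "Notes: first note line\nsecond note line\n" +
            "Name: Joe Noob\n" +
            "Position: Peon\n" +
            "Bio: I'm a peon\n" +
            "Position History: Peon\n" +
            "Notes: few notes\n";

        foreach (var rec in ParseFields(doc))
            foreach (var field in rec)
                Console.WriteLine(field.Key + " => " + field.Value);
    }
}
```

The trade-off is that the labels have to be known up front, which the question says is the case here since the order and set of (Type #) labels is fixed.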
