Regular expression is too greedy
I'm writing a regular expression to match data from the IMDb soundtracks data file. My regexes are mostly working, although they are in places slurping too much text into my named groups. Take the following regex for example:
"^ Performed by '?(?<performer>.*)('? \(qv\))?$"
The performer group includes the string ' (qv) as well as the performer's name. Unfortunately, because the records are not consistently formatted, some performers' names are surrounded by single quotation marks whilst others are not. This means the quotation marks are optional as far as the regex is concerned.
I've tried marking the last group as a greedy (non-backtracking) group using the ?> group specifier, but this appeared to have no effect on the results.
I can improve the results by changing the performer group to match a small range of characters, but this reduces my chances of parsing the name out correctly. Furthermore, if I were to just exclude the apostrophe character, I would then be unable to parse, e.g., band names containing apostrophes, such as Elia's Lonely Friends Band who performed Run For Your Life featured in Resident Evil: Apocalypse.
Update: Here's an example input line that the regex should match, as requested. Other formats are also presented which my existing regex won't handle.
" Performed by 'Carmen Silvera' (qv)"
Here is a solution to your immediate problem. Note, however, that I looked through the IMDb soundtracks data file, and this will not handle every format in there.
var exp = new Regex(@"^ Performed by '?(?<performer>.*?)('? \(qv\))?$");
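Basically, you need to make the performer group non-greedy (lazy): .*? matches as few characters as possible, which leaves the trailing ' (qv) for the optional group to consume.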
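To make the difference concrete, here is a minimal sketch that runs both patterns against the sample line from the question. The class name GreedyVsLazyDemo and the second input line (the band name wrapped in quotes and followed by (qv)) are assumptions made for illustration; the real data file may format that record differently.

```csharp
using System;
using System.Text.RegularExpressions;

class GreedyVsLazyDemo
{
    static void Main()
    {
        // Original pattern from the question: greedy .* in the performer group.
        var greedy = new Regex(@"^ Performed by '?(?<performer>.*)('? \(qv\))?$");
        // Same pattern with a lazy quantifier (.*?) in the performer group.
        var lazy = new Regex(@"^ Performed by '?(?<performer>.*?)('? \(qv\))?$");

        // Sample line taken from the question.
        string line = " Performed by 'Carmen Silvera' (qv)";

        // Greedy: .* runs to the end of the line, so the optional
        // trailing group matches nothing and the suffix is captured.
        Console.WriteLine(greedy.Match(line).Groups["performer"].Value);
        // prints: Carmen Silvera' (qv)

        // Lazy: .*? stops as soon as the rest of the pattern can match.
        Console.WriteLine(lazy.Match(line).Groups["performer"].Value);
        // prints: Carmen Silvera

        // Hypothetical line (format assumed, not taken from the data file):
        // an apostrophe inside the name does not end the lazy match early,
        // because the optional group only matches ' when " (qv)" follows it.
        string band = " Performed by 'Elia's Lonely Friends Band' (qv)";
        Console.WriteLine(lazy.Match(band).Groups["performer"].Value);
        // prints: Elia's Lonely Friends Band
    }
}
```

One limitation of this pattern: the closing quote is only consumed as part of the optional (qv) group, so a quoted name that is not followed by (qv) would keep its trailing quotation mark in the performer group.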
I'll add a comment to explain why this isn't going to be good enough for your project long term.