How to parse a string to extract year range values?

I received a change request and I'm unsure how to best approach it. If the client searches for something and they specify a year or year range greater than what we have in our database, I have to return the result that corresponds to the latest year range that we have.

Currently, the results we have in the db all follow one of the following patterns:

Thing1 Thing2 S1 // some results have no year
Thing1 Thing2 2006-07 Series 6 // some results have 'Series X'
Thing1 Thing2 2006-2007 S12 RP // some results have SN or SN YZ
Thing1 Thing2 2020-21 S6 // some results don't have a full second year
Thing1 Thing2 2022-2024 S12
Thing1 Thing2 2024 Onwards // the result that matches the final year just has the year & 'Onwards'

There are more results for Thing1 Thing2 available in the world, going up to 2060, but we only keep 14 years' worth of data, because after 14 years (say 2026 or 2028) the data is exactly the same as in the previous years.

The maximum year we have, and the maximum year in existence, increase by 2 years every 2 years. So in 2012, we'll have Thing1 Thing2 2026 Onwards, and the maximum in existence will be 2062.

So basically, I need to identify when the client searches for [Thing1 (or) Thing2 with a year range], and if the first year value is greater than [this year + 14] I have to return [this year + 14], but only if the current year is an even year, otherwise I have to return [this year + 13].
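The rule above can be sketched roughly as follows. The names `YearCap`, `LatestStoredYear`, and `Clamp` are hypothetical, and the sketch assumes the requested year is compared against the same capped value that gets returned. The current year is passed in rather than read from the clock so the logic is easy to test.

```csharp
using System;

public static class YearCap
{
    // Latest start year kept in the database (assumption based on the rule
    // described above): current year + 14 when the current year is even,
    // otherwise current year + 13.
    public static int LatestStoredYear(int currentYear)
    {
        return currentYear % 2 == 0 ? currentYear + 14 : currentYear + 13;
    }

    // Returns the requested start year unchanged, or the latest stored year
    // when the request goes beyond what the database holds.
    public static int Clamp(int requestedYear, int currentYear)
    {
        return Math.Min(requestedYear, LatestStoredYear(currentYear));
    }
}
```

For example, `YearCap.Clamp(2030, 2010)` gives 2024, while `YearCap.Clamp(2022, 2010)` gives 2022.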

The trouble I'm having is how to identify a year in the middle of a string that doesn't follow a well defined pattern, other than the first part of the year range starts with a 4 digit year.

What is the best way for me to go about this? Could somebody suggest how I could approach a solution to this issue? Thanks.


This regex pattern would work nicely: \b(?<Year1>\d{4})(-(?<Year2>\d{2,4}))?\b

Explanation:

  • \b: is a word-boundary to ensure we're capturing the years entirely on their own and not as part of another word (i.e., no partial match) - this is used to anchor both ends of the pattern
  • (?<Year1>\d{4}): named capture group to match 4 digits
  • (-(?<Year2>\d{2,4}))?: this matches the - dash and then uses a named capture group for the second year, which matches 2 to 4 digits since those years vary in length. The outer parentheses group this pattern together, and the trailing ? makes the entire group optional for cases where the second year doesn't exist.

Technically, the \d{2,4} part accepts 07, 107, and 2007. A 3-digit year is obviously incorrect, so I suggest you perform additional error checking to catch such input. You could prevent it in the pattern by changing that part to \d{2}|\d{4}, but then an input such as 2006-107 would match Year1 only: Year2 would come back empty and that part of the user's input would be lost silently.
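One possible shape for that extra check is sketched below. The names `YearRange` and `TryExpandEndYear` are hypothetical, and the sketch assumes a 2-digit second year should take its century from Year1, and that a range running backwards counts as invalid.

```csharp
using System;

public static class YearRange
{
    // Expands the captured Year2 text into a full year, taking the century
    // from year1. Returns false for anything that is not a 2- or 4-digit
    // year, or when the resulting range would run backwards.
    public static bool TryExpandEndYear(int year1, string year2Text, out int year2)
    {
        year2 = 0;
        if (string.IsNullOrEmpty(year2Text)) return false;
        foreach (char c in year2Text)
        {
            if (c < '0' || c > '9') return false;   // ASCII digits only
        }

        int candidate;
        if (year2Text.Length == 4)
        {
            candidate = int.Parse(year2Text);
        }
        else if (year2Text.Length == 2)
        {
            candidate = year1 / 100 * 100 + int.Parse(year2Text);
            // A range such as 1999-00 rolls over into the next century.
            if (candidate < year1) candidate += 100;
        }
        else
        {
            return false;   // e.g. a 3-digit value such as "107"
        }

        if (candidate < year1) return false;   // e.g. 2006-2005
        year2 = candidate;
        return true;
    }
}
```

With this, `TryExpandEndYear(2006, "07", out y)` succeeds with `y == 2007`, and `TryExpandEndYear(2006, "107", out y)` returns false, so the caller can decide how to report the bad input instead of losing it.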

Here's the code:

using System;
using System.Text.RegularExpressions;

string[] inputs = { "Thing1 Thing2 S1", "Thing1 Thing2 2006-07 Series 6", "Thing1 Thing2 2006-2007 S12 RP", "Thing1 Thing2 2020-21 S6", "Thing1 Thing2 2022-2024 S12", "Thing1 Thing2 2024 Onwards" };
// Year1 is mandatory; the "-Year2" part is optional.
string pattern = @"\b(?<Year1>\d{4})(-(?<Year2>\d{2,4}))?\b";
Regex rx = new Regex(pattern);

foreach (var input in inputs)
{
    Match m = rx.Match(input);
    Console.WriteLine("{0}: {1}", m.Success, input);
    if (m.Success)
    {
        string year1 = m.Groups["Year1"].Value;
        // Year2 is an empty string when the optional group did not take part in the match.
        string year2 = m.Groups["Year2"].Value;
        Console.WriteLine("Year1: {0}, Year2: {1}", year1, year2 == "" ? "N/A" : year2);
    }
    Console.WriteLine();
}


Maybe simply searching for the first run of 4 consecutive digits (if any) in the string and treating it as the year would work?
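A minimal sketch of that idea without a regex, assuming "4 numeric characters" means a run of exactly four ASCII digits with no digit on either side. The names `YearScan` and `FirstFourDigitRun` are hypothetical.

```csharp
using System;

public static class YearScan
{
    // Returns the first run of exactly four ASCII digits as a number,
    // or -1 when the text contains no such run.
    public static int FirstFourDigitRun(string text)
    {
        int i = 0;
        while (i < text.Length)
        {
            if (!IsAsciiDigit(text[i]))
            {
                i++;
                continue;
            }

            // Measure the whole digit run so that longer or shorter runs are skipped.
            int start = i;
            while (i < text.Length && IsAsciiDigit(text[i])) i++;

            if (i - start == 4)
            {
                return int.Parse(text.Substring(start, 4));
            }
        }
        return -1;
    }

    private static bool IsAsciiDigit(char c)
    {
        return c >= '0' && c <= '9';
    }
}
```

`FirstFourDigitRun("Thing1 Thing2 2020-21 S6")` returns 2020, and `FirstFourDigitRun("Thing1 Thing2 S1")` returns -1. Unlike a pattern anchored with \b, this also accepts four digits glued to letters (for example "S2006"), which may or may not be what you want.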


Or use a regular expression, for example:

perl -ne 'print "$1:$2:$3\n" if /(\d\d\d\d)-(\d\d(\d\d)?)/'