
Regex questions

I'm trying to extract some text from a large text file. The text I'm looking for is:

Type:Production Color:Red

I pass the whole text to the following method to get (Type:Production, Color:Red):

    private static void FindKeys(IEnumerable<string> keywords, string source)
    {
        var found = new Dictionary<string, string>(10);
        var keys = string.Join("|", keywords.ToArray());
        var matches = Regex.Matches(source, @"(?<key>" + @"\B\s" + keys + @"\B\s" + "):");

        foreach (Match m in matches)
        {
            var key = m.Groups["key"].ToString();
            // The value is everything between this match and the next one.
            var start = m.Index + m.Length;
            var nx = m.NextMatch();
            var end = (nx.Success ? nx.Index : source.Length);
            found.Add(key, source.Substring(start, end - start));
        }

        foreach (var n in found)
            Console.WriteLine("Key={0}, Value={1}", n.Key, n.Value);
    }

My problems are the following:

  1. The search returns _Type: as well, where I only need Type:
  2. The search returns Color:Red\n\n\n\n\n (with the rest of the text), where I only need Color:Red

So, basically:

  1. How can I force the regex to match Type exactly and ignore _Type?
  2. How can I get only the text after the : and ignore the \n\n and any other text?

I hope this is clear


Your regex currently looks like this:

    (?<key>\B\sWord1|Word2|Word3\B\s):


I see the following issues here:

  • First, Word1|Word2|Word3 should be put in parentheses. Otherwise, the regex searches for \B\sWord1 or Word2 or Word3\B\s, which is presumably not what you want.

  • Why \B\s? A non-boundary followed by whitespace? That doesn't make sense. I guess you just want \b (= word boundary). There's no need for one at the end, because the colon already constitutes a word boundary.
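
The grouping issue from the first point can be seen in a small sketch. The class name, the keywords Type and Color, and the input string are made up for illustration; only Regex.IsMatch is a real API here.

```csharp
using System;
using System.Text.RegularExpressions;

// Illustrative only: AlternationDemo and its inputs are hypothetical.
public static class AlternationDemo
{
    // Without a group, the trailing ':' belongs only to the last alternative,
    // so this means: (\bType) OR (Color:).
    public const string Ungrouped = @"\bType|Color:";

    // With a group, ':' must follow whichever keyword matched.
    public const string Grouped = @"\b(Type|Color):";

    public static void Main()
    {
        const string input = "Type=1"; // contains "Type" but no "Type:"

        Console.WriteLine(Regex.IsMatch(input, Ungrouped)); // True: bare "Type" matches
        Console.WriteLine(Regex.IsMatch(input, Grouped));   // False: no keyword followed by ':'
    }
}
```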

So, I would suggest using something like the following. It fixes the _Type problem, because there is no word boundary between _ and Type (since _ is considered a word character).

    (?<key>\b(Word1|Word2|Word3)):
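
As a rough sketch of how such a pattern could be built from the keyword list, assuming the keywords are Type and Color: the class and method names below are hypothetical, and the Regex.Escape call is an addition that guards against keywords containing regex metacharacters.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Illustrative only: KeyPatternDemo and its methods are hypothetical names.
public static class KeyPatternDemo
{
    // Builds (?<key>\b(Type|Color)): from the keyword list.
    public static string BuildPattern(IEnumerable<string> keywords)
    {
        var alternatives = string.Join("|", keywords.Select(k => Regex.Escape(k)));
        return @"(?<key>\b(" + alternatives + ")):";
    }

    // Returns the keys found in source, in order of appearance.
    public static List<string> FindKeys(IEnumerable<string> keywords, string source)
    {
        return Regex.Matches(source, BuildPattern(keywords))
                    .Cast<Match>()
                    .Select(m => m.Groups["key"].Value)
                    .ToList();
    }

    public static void Main()
    {
        // "_Type:" is skipped: '_' and 'T' are both word characters,
        // so there is no word boundary between them.
        var keys = FindKeys(new[] { "Type", "Color" },
                            "_Type:Test Type:Production Color:Red");
        Console.WriteLine(string.Join(", ", keys)); // Type, Color
    }
}
```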


If the text following the key is always just a single word, I'd match it in the regex as well. Here \s* allows for whitespace after the colon (I don't know whether you need this), and \w+ ensures that only word characters, i.e. no line breaks etc., are matched as the value.

    (?<key>\b(Word1|Word2|Word3)):\s*(?<value>\w+)


Then you just need to iterate through all the matches and extract the key and value groups. No need for any string operations or index arithmetic.
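
A minimal sketch of that loop, assuming the keywords Type and Color and a pattern with named groups key and value as described above. The class name, method name and sample text are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Illustrative only: KeyValueDemo and FindPairs are hypothetical names.
public static class KeyValueDemo
{
    public static Dictionary<string, string> FindPairs(IEnumerable<string> keywords, string source)
    {
        var alternatives = string.Join("|", keywords.Select(k => Regex.Escape(k)));
        // \b rejects "_Type"; \w+ stops the value at the first non-word character.
        var pattern = @"\b(?<key>" + alternatives + @"):\s*(?<value>\w+)";

        var found = new Dictionary<string, string>();
        foreach (Match m in Regex.Matches(source, pattern))
        {
            // The indexer overwrites a repeated key, where Add would throw.
            found[m.Groups["key"].Value] = m.Groups["value"].Value;
        }
        return found;
    }

    public static void Main()
    {
        var source = "_Type:Test\nType:Production Color:Red\n\n\nsome other text";
        foreach (var pair in FindPairs(new[] { "Type", "Color" }, source))
            Console.WriteLine("Key={0}, Value={1}", pair.Key, pair.Value);
    }
}
```

With this sample text, Type maps to Production and Color maps to Red; the trailing line breaks and the _Type entry are not picked up.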

So if I understand correctly, you have:

  • Pairs of key:values
  • Each pair is separated by a space
  • Within each pair, the key and value are separated by “:”

Then I would not use regex at all. I would:

  • use String.Split(' ') to get an array of pairs
  • loop over all the pairs
  • use String.Split(':') to get the key and value from each pair
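
The steps above could be sketched like this, assuming the input is a single line made up only of space-separated key:value pairs. The class and method names are hypothetical; splitting with a count of 2 is an addition so that a value containing a further colon stays intact.

```csharp
using System;
using System.Collections.Generic;

// Illustrative only: SplitDemo and ParsePairs are hypothetical names.
public static class SplitDemo
{
    public static Dictionary<string, string> ParsePairs(string line)
    {
        var found = new Dictionary<string, string>();

        // Split the line into pairs, ignoring runs of consecutive spaces.
        foreach (var pair in line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            // Split each pair at the first ':' only.
            var parts = pair.Split(new[] { ':' }, 2);
            if (parts.Length == 2)
                found[parts[0]] = parts[1];
        }
        return found;
    }

    public static void Main()
    {
        foreach (var pair in ParsePairs("Type:Production Color:Red"))
            Console.WriteLine("Key={0}, Value={1}", pair.Key, pair.Value);
    }
}
```

Note that this version does not filter by keyword, so a pair such as _Type:Test would be kept under the key _Type.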





