.NET Regex parse markup for repeated values in certain section but not others

2023-03-31 01:51 Q&A author:

I need to use .NET regular expressions to scrape the values between <value> tags of a markup file such as this (copy/pasted excerpt):

<Title>Section1</Title>

<attributeArray><name>Name1</name><value>Value1</value></attributeArray>

<attributeArray><name>Name2</name><value>Value2</value></attributeArray>

<attributeArray><name>Name3</name><value>Value3</value></attributeArray>

<attributeArray><name>Name4</name><value>Value4</value></attributeArray>

<Title>Section2</Title>

<attributeArray><name>Name1</name><value>Value1</value></attributeArray>

<attributeArray><name>Name2</name><value>Value2</value></attributeArray>

<attributeArray><name>Name3</name><value>Value3</value></attributeArray>

<attributeArray><name>Name4</name><value>Value4</value></attributeArray>

</node>

The actual text goes on to include 6 sections. The problem is that the tag names are identical in every section, and I only need to extract the values from, say, Section2 (not from sections 1, 3, 4, 5, or 6).

I have struggled with this for a couple of days and tried various conditional expressions, which were new to me, like this:

(?(<node>Section2)(.*?<value>(?<Value>.*?)<\/value>.*?))

The idea was: if the section is Section2, then parse the values. But it only extracts the first value; it does not iterate through each <value> in the markup. The markup usually has around 10 values that I need to extract (abbreviated in the example above).

This is not being done in code so I don't have the liberty of using an XML parser.

Any suggestions would be greatly appreciated - or if I can clarify further let me know.

An afterthought: if there is a way to include the title text with each value match, I could parse all 6 sections and later filter the results by the section I am after. That would also work.

example:

match1
group1 = Section2
group2 = Value1

match2
group1 = Section2
group2 = Value2

match3
group1 = Section2
group2 = Value3

match4
group1 = Section2
group2 = Value4

Thanks!

Here's one option:

(?:
   <Title>Section2</Title>    # Match the header
   |                          # or
   \G(?!\A)                   # Match where the previous match ended
)\s*
<attributeArray>
    <name>(?<name>[^<]*)</name>
    <value>(?<value>[^<]*)</value>
</attributeArray>

The first match includes the header, and each following match must start where the previous one ended. The whitespace and # comments in the pattern require RegexOptions.IgnorePatternWhitespace.
Working example: http://regexhero.net/tester/?id=321ce843-923d-4556-9b99-dbb72175929a
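
If the pattern is ever run from C# rather than a regex tool, the \G approach above could be wrapped roughly as follows. This is only a sketch: the class and method names (SectionScanner, ValuesOf) are made up for illustration, and it assumes the section title is passed in as plain text.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SectionScanner
{
    // Hypothetical helper (not part of the original answer): returns the
    // <value> texts of the attributes that directly follow the given title.
    public static List<string> ValuesOf(string markup, string sectionTitle)
    {
        // Regex.Escape also escapes whitespace and '#', so the title is safe
        // to splice into a pattern that ignores pattern whitespace.
        string pattern = @"
            (?:
               <Title>" + Regex.Escape(sectionTitle) + @"</Title>   # the wanted header
               |                                                    # or
               \G(?!\A)                                             # where the previous match ended
            )\s*
            <attributeArray>
                <name>(?<name>[^<]*)</name>
                <value>(?<value>[^<]*)</value>
            </attributeArray>";

        var values = new List<string>();
        // Each successive match must begin where the previous one ended,
        // so the chain stops at the first thing that is not an attribute.
        foreach (Match m in Regex.Matches(markup, pattern, RegexOptions.IgnorePatternWhitespace))
        {
            values.Add(m.Groups["value"].Value);
        }
        return values;
    }
}
```

Calling SectionScanner.ValuesOf(text, "Section2") should then yield only the values of that section, because the chain of \G matches breaks at the next <Title>.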

Note that the above will fail if there are other elements, which you didn't mention, between the values or after the title. You can get around that with a probably less efficient pattern that relies on .NET regexes supporting variable-length lookbehinds:

(?<=                          # lookbehind - check that before the current position
   <Title>Section2</Title>    #  we can see the wanted title,
   (?:(?!<Title>).)*          #  followed by no more title between it and here.
)
<attributeArray>
    <name>(?<name>[^<]*)</name>
    <value>(?<value>[^<]*)</value>
</attributeArray>

Example: http://regexhero.net/tester/?id=743c4de6-1b8a-48a4-a69b-63f3624de594

If you want to, you can change the title to <Title>(?<title>[^<]*)</Title>, capture all values in the file, and filter by the wanted title - it will be added to each match.
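
A rough C# sketch of that title-capturing variant is shown here. The names TitledValues and All are hypothetical, and the sketch assumes RegexOptions.Singleline is wanted so that "." inside the lookbehind can cross the line breaks between a title and its attributes.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TitledValues
{
    // Hypothetical helper: returns every attribute in the text, tagged with
    // the nearest preceding <Title>, so the caller can filter afterwards.
    public static List<(string Title, string Name, string Value)> All(string markup)
    {
        const string pattern = @"
            (?<=                                # lookbehind: before the current position
               <Title>(?<title>[^<]*)</Title>   #  there is a title (captured),
               (?:(?!<Title>).)*                #  with no other title between it and here
            )
            <attributeArray>
                <name>(?<name>[^<]*)</name>
                <value>(?<value>[^<]*)</value>
            </attributeArray>";

        // Singleline is an assumption of this sketch: it lets '.' match '\n'.
        var options = RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline;

        var rows = new List<(string Title, string Name, string Value)>();
        foreach (Match m in Regex.Matches(markup, pattern, options))
        {
            rows.Add((m.Groups["title"].Value,
                      m.Groups["name"].Value,
                      m.Groups["value"].Value));
        }
        return rows;
    }
}
```

Filtering is then an ordinary loop or LINQ query over the returned rows, for example keeping only those whose Title is "Section2".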

Lastly, here's a similar approach that will work in other flavors: it captures the key/value pairs that come before the title Section3, assuming the sections are well ordered:

<attributeArray>
    <name>(?<name>[^<]*)</name>
    <value>(?<value>[^<]*)</value>
</attributeArray>
(?=
   (?:(?!<Title>).)*
   <Title>Section3</Title>
)

Example: http://regexhero.net/tester/?id=8d8ae0e8-5f10-439f-a5a5-50d0b4e73bd2
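
For completeness, the lookahead variant could be driven from C# along these lines. Again a sketch with invented names (BeforeTitle, ValuesBefore); it assumes Singleline mode so the lookahead can scan across line breaks, and it takes the title of the following section as its argument.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BeforeTitle
{
    // Hypothetical helper: returns the values of the attributes whose next
    // <Title> in the text is the given one (i.e. the section just before it).
    public static List<string> ValuesBefore(string markup, string nextTitle)
    {
        string pattern = @"
            <attributeArray>
                <name>(?<name>[^<]*)</name>
                <value>(?<value>[^<]*)</value>
            </attributeArray>
            (?=
               (?:(?!<Title>).)*                                    # no title in between
               <Title>" + Regex.Escape(nextTitle) + @"</Title>      # then the next section's title
            )";

        // Singleline is an assumption of this sketch: it lets '.' match '\n'.
        var options = RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline;

        var values = new List<string>();
        foreach (Match m in Regex.Matches(markup, pattern, options))
        {
            values.Add(m.Groups["value"].Value);
        }
        return values;
    }
}
```

One limitation of this design: the last section has no following title, so its values cannot be selected this way.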

I recommend using a CaptureCollection:

using System;
using System.Text.RegularExpressions;

string s = @"<Title>Section1</Title>
<attributeArray><name>Name1</name><value>Value1-1</value></attributeArray>
<attributeArray><name>Name2</name><value>Value1-2</value></attributeArray>
<attributeArray><name>Name3</name><value>Value1-3</value></attributeArray>
<attributeArray><name>Name4</name><value>Value1-4</value></attributeArray>

<Title>Section2</Title>
<attributeArray><name>Name1</name><value>Value2-1</value></attributeArray>
<attributeArray><name>Name2</name><value>Value2-2</value></attributeArray>
<attributeArray><name>Name3</name><value>Value2-3</value></attributeArray>
<attributeArray><name>Name4</name><value>Value2-4</value></attributeArray>

<Title>Section3</Title>
<attributeArray><name>Name1</name><value>Value3-1</value></attributeArray>
<attributeArray><name>Name2</name><value>Value3-2</value></attributeArray>
<attributeArray><name>Name3</name><value>Value3-3</value></attributeArray>
<attributeArray><name>Name4</name><value>Value3-4</value></attributeArray>";

// Group 1 is the section title; group 2 is re-captured once per <attributeArray>.
Regex r = new Regex(
  @"<Title>(Section2)</Title>(?:\s*<attributeArray>.*?<value>(.*?)</value></attributeArray>)+");
Match m = r.Match(s);
if (m.Success)
{
  string section = m.Groups[1].Value;
  int i = 0;
  // Captures holds every string group 2 matched, in order.
  foreach (Capture c in m.Groups[2].Captures)
  {
    Console.WriteLine("match{0}\ngroup1 = {1}\ngroup2 = {2}\n",
                      ++i, section, c.Value);
  }
}

m.Groups[2].Value would return Value2-4, the last thing to be captured in group #2. But all the intermediate captures are retained, and can be accessed through the Captures property.
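
The same idea extends to pairing each name with its value, since both groups are captured exactly once per repetition. The following is a sketch only; SectionCaptures and Pairs are illustrative names, and the pattern assumes every <attributeArray> contains exactly one <name> followed by one <value>, as in the sample.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SectionCaptures
{
    // Hypothetical helper: one Match for the whole section, then the
    // name/value captures are read back in parallel.
    public static List<KeyValuePair<string, string>> Pairs(string markup, string sectionTitle)
    {
        string pattern =
            "<Title>" + Regex.Escape(sectionTitle) + "</Title>" +
            @"(?:\s*<attributeArray><name>(?<name>[^<]*)</name>" +
            @"<value>(?<value>[^<]*)</value></attributeArray>)+";

        var pairs = new List<KeyValuePair<string, string>>();
        Match m = Regex.Match(markup, pattern);
        if (!m.Success)
        {
            return pairs;
        }

        // Both groups sit inside the repeated group, so they collect
        // one capture per repetition and stay index-aligned.
        CaptureCollection names = m.Groups["name"].Captures;
        CaptureCollection values = m.Groups["value"].Captures;
        for (int i = 0; i < values.Count; i++)
        {
            pairs.Add(new KeyValuePair<string, string>(names[i].Value, values[i].Value));
        }
        return pairs;
    }
}
```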
