.NET Regex parse markup for repeated values in certain section but not others
I need to use .NET regular expressions to scrape some values between <value> tags of a markup file such as this (copy/pasted excerpt):
<Title>Section1</Title>
<attributeArray><name>Name1</name><value>Value1</value></attributeArray>
<attributeArray><name>Name2</name><value>Value2</value></attributeArray>
<attributeArray><name>Name3</name><value>Value3</value></attributeArray>
<attributeArray><name>Name4</name><value>Value4</value></attributeArray>
<Title>Section2</Title>
<attributeArray><name>Name1</name><value>Value1</value></attributeArray>
<attributeArray><name>Name2</name><value>Value2</value></attributeArray>
<attributeArray><name>Name3</name><value>Value3</value></attributeArray>
<attributeArray><name>Name4</name><value>Value4</value></attributeArray>
</node>
The actual text goes on to include 6 sections. The problem is that the tag names in every section are identical, and I only need to extract the values from, say, Section2 (not sections 1, 3, 4, 5, or 6).
I have struggled with this for a couple of days and tried various conditional expressions, which are new to me, such as this:
(?(<node>Section2)(.*?<value>(?<Value>.*?)<\/value>.*?))
The intent is: if the section is Section2, then parse the values. But it only extracts the first value; it does not iterate through each <value> in the markup, and the markup usually has around 10 values that I need to extract (abbreviated in the example above).
This is not being done in code so I don't have the liberty of using an XML parser.
Any suggestions would be greatly appreciated. If I can clarify further, let me know.
An afterthought: if there is a way to include the title text with each value match, then I could parse all 6 sections and later filter the results by the section I am after. That would also work.
Example:
match1
group1 = Section2
group2 = Value1
match2
group1 = Section2
group2 = Value2
match3
group1 = Section2
group2 = Value3
match4
group1 = Section2
group2 = Value4
Thanks!
Here's one option:
(?:
<Title>Section2</Title> # Match the header
| # or
\G(?!\A) # Match where the previous match ended
)\s*
<attributeArray>
<name>(?<name>[^<]*)</name>
<value>(?<value>[^<]*)</value>
</attributeArray>
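The first match includes the header, and each following match must start where the previous one ended. The pattern is laid out for RegexOptions.IgnorePatternWhitespace, which is what allows the line breaks and # comments.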
Working example: http://regexhero.net/tester/?id=321ce843-923d-4556-9b99-dbb72175929a
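If the pattern is run from C#, a minimal sketch could look like the following. The sample markup is hypothetical and shortened to two values per section, and the top-level statements assume C# 9 or later. It only illustrates how the \G continuation behaves; it is not the only way to drive the pattern.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical sample input, shortened to two values per section.
string markup = @"<Title>Section1</Title>
<attributeArray><name>Name1</name><value>Value1-1</value></attributeArray>
<attributeArray><name>Name2</name><value>Value1-2</value></attributeArray>
<Title>Section2</Title>
<attributeArray><name>Name1</name><value>Value2-1</value></attributeArray>
<attributeArray><name>Name2</name><value>Value2-2</value></attributeArray>
<Title>Section3</Title>
<attributeArray><name>Name1</name><value>Value3-1</value></attributeArray>";

// Same pattern as above; it needs IgnorePatternWhitespace because of
// the multi-line layout and the # comments.
string pattern = @"
    (?:
        <Title>Section2</Title>   # match the header
      |                           # or
        \G(?!\A)                  # continue where the previous match ended
    )\s*
    <attributeArray>
        <name>(?<name>[^<]*)</name>
        <value>(?<value>[^<]*)</value>
    </attributeArray>";

var values = new List<string>();
foreach (Match m in Regex.Matches(markup, pattern, RegexOptions.IgnorePatternWhitespace))
{
    // One match per <attributeArray> element that directly follows the Section2 title.
    values.Add(m.Groups["value"].Value);
    Console.WriteLine("{0} = {1}", m.Groups["name"].Value, m.Groups["value"].Value);
}
```

The chain of matches stops at the first thing after the title that is not an <attributeArray> element, here the Section3 title, so the values from Section1 and Section3 are never collected.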
Note that the above will fail if there are other elements you didn't mention between the values, or between the title and the values. You can get around that with a probably less efficient pattern, using the fact that .NET regexes support variable-length lookbehinds:
(?<= # lookbehind - check that before the current position
<Title>Section2</Title> # we can see the wanted title,
(?:(?!<Title>).)* # followed by no more title between it and here.
)
<attributeArray>
<name>(?<name>[^<]*)</name>
<value>(?<value>[^<]*)</value>
</attributeArray>
Example: http://regexhero.net/tester/?id=743c4de6-1b8a-48a4-a69b-63f3624de594
If you want to, you can change the title to <Title>(?<title>[^<]*)</Title>, capture all values in the file, and filter by the wanted title: the title will be added to each match.
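As a rough sketch of that idea in C#, the snippet below captures the title inside the lookbehind and filters afterwards. The sample markup is hypothetical, and the top-level statements assume C# 9 or later. RegexOptions.Singleline is added on the assumption that the elements sit on separate lines, so that the dot in the lookbehind can cross line breaks.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical sample input, shortened to two values per section.
string markup = @"<Title>Section1</Title>
<attributeArray><name>Name1</name><value>Value1-1</value></attributeArray>
<attributeArray><name>Name2</name><value>Value1-2</value></attributeArray>
<Title>Section2</Title>
<attributeArray><name>Name1</name><value>Value2-1</value></attributeArray>
<attributeArray><name>Name2</name><value>Value2-2</value></attributeArray>";

string pattern = @"
    (?<=
        <Title>(?<title>[^<]*)</Title>   # capture the nearest preceding title
        (?:(?!<Title>).)*                # with no other title in between
    )
    <attributeArray>
        <name>(?<name>[^<]*)</name>
        <value>(?<value>[^<]*)</value>
    </attributeArray>";

// Singleline lets '.' match newlines; IgnorePatternWhitespace allows the layout above.
var options = RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline;

var all = new List<KeyValuePair<string, string>>();   // title -> value, every section
var section2 = new List<string>();                    // values filtered to Section2
foreach (Match m in Regex.Matches(markup, pattern, options))
{
    string title = m.Groups["title"].Value;
    string value = m.Groups["value"].Value;
    all.Add(new KeyValuePair<string, string>(title, value));
    if (title == "Section2")
        section2.Add(value);
}

foreach (var pair in all)
    Console.WriteLine("group1 = {0}, group2 = {1}", pair.Key, pair.Value);
```

Every match carries its own section title, so the filtering can happen after the regex has run, which is the shape asked for in the question's afterthought.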
Lastly, here's a similar approach that will also work in other flavors: it captures key/value pairs before the title Section3, assuming the sections are well ordered:
<attributeArray>
<name>(?<name>[^<]*)</name>
<value>(?<value>[^<]*)</value>
</attributeArray>
(?=
(?:(?!<Title>).)*
<Title>Section3</Title>
)
Example: http://regexhero.net/tester/?id=8d8ae0e8-5f10-439f-a5a5-50d0b4e73bd2
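For comparison, here is a small C# sketch of the lookahead variant, again on hypothetical sample markup and assuming C# 9 or later for the top-level statements. RegexOptions.Singleline is assumed so that the dot can cross line breaks between elements.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical sample input, shortened to two values per section.
string markup = @"<Title>Section1</Title>
<attributeArray><name>Name1</name><value>Value1-1</value></attributeArray>
<Title>Section2</Title>
<attributeArray><name>Name1</name><value>Value2-1</value></attributeArray>
<attributeArray><name>Name2</name><value>Value2-2</value></attributeArray>
<Title>Section3</Title>
<attributeArray><name>Name1</name><value>Value3-1</value></attributeArray>";

string pattern = @"
    <attributeArray>
        <name>(?<name>[^<]*)</name>
        <value>(?<value>[^<]*)</value>
    </attributeArray>
    (?=
        (?:(?!<Title>).)*          # no other title between here...
        <Title>Section3</Title>    # ...and the title of the next section
    )";

var options = RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline;

var values = new List<string>();
foreach (Match m in Regex.Matches(markup, pattern, options))
    values.Add(m.Groups["value"].Value);

Console.WriteLine(string.Join(", ", values));
```

Because the pattern keys on the title that follows the wanted section, it cannot select the last section of the file, which has no following title.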
I recommend using a CaptureCollection:
string s = @"<Title>Section1</Title>
<attributeArray><name>Name1</name><value>Value1-1</value></attributeArray>
<attributeArray><name>Name2</name><value>Value1-2</value></attributeArray>
<attributeArray><name>Name3</name><value>Value1-3</value></attributeArray>
<attributeArray><name>Name4</name><value>Value1-4</value></attributeArray>
<Title>Section2</Title>
<attributeArray><name>Name1</name><value>Value2-1</value></attributeArray>
<attributeArray><name>Name2</name><value>Value2-2</value></attributeArray>
<attributeArray><name>Name3</name><value>Value2-3</value></attributeArray>
<attributeArray><name>Name4</name><value>Value2-4</value></attributeArray>
<Title>Section3</Title>
<attributeArray><name>Name1</name><value>Value3-1</value></attributeArray>
<attributeArray><name>Name2</name><value>Value3-2</value></attributeArray>
<attributeArray><name>Name3</name><value>Value3-3</value></attributeArray>
<attributeArray><name>Name4</name><value>Value3-4</value></attributeArray>";
Regex r = new Regex(
    @"<Title>(Section2)</Title>(?:\s*<attributeArray>.*?<value>(.*?)</value></attributeArray>)+");
Match m = r.Match(s);
if (m.Success)
{
    string section = m.Groups[1].Value;
    int i = 0;
    // Group 2 sits inside the repeated (?:...)+ group, so it records one Capture per repetition.
    foreach (Capture c in m.Groups[2].Captures)
    {
        Console.WriteLine("match{0}\ngroup1 = {1}\ngroup2 = {2}\n",
            ++i, section, c.Value);
    }
}
m.Groups[2].Value would return Value2-4, the last thing to be captured in group #2. But all the intermediate captures are retained, and can be accessed through the Captures property.