Regular Expression - Applied to a Text File
I have a text file with the following structure:
KEYWORD0 DataKey01-DataValue01 DataKey02-DataValue02 ... DataKey0N-DataValue0N
KEYWORD1 DataKey11-DataValue11 DataKey12-DataValue12 DataKey13-DataValue13
_________DataKey14-DataValue14 DataKey1N-DataValue1N (1)
// It is significant that the additional datakeys are on a new line
(1) the underline is not part of the data. I used it to align the data.
Question: How do I use a regex to convert my data to this format?
<KEYWORD0>
<DataKey00>DataValue00</DataKey00>
<DataKey01>DataValue01</DataKey01>
<DataKey02>DataValue02</DataKey02>
<DataKey0N>DataValue0N</DataKey0N>
</KEYWORD0>
<KEYWORD1>
<DataKey10>DataValue10</DataKey10>
<DataKey11>DataValue11</DataKey11>
<DataKey12>DataValue12</DataKey12>
<DataKey13>DataValue13</DataKey13>
<DataKey14>DataValue14</DataKey14>
<DataKey1N>DataValue1N</DataKey1N>
</KEYWORD1>
Regex is for masochists. Here is a very simple text parser in VB.NET (converted from C#, so check for bugs):
Imports System.IO
Imports System.Xml

Public Class MyFileConverter
    Public Sub Parse(inputFilename As String, outputFilename As String)
        Using reader As New StreamReader(inputFilename)
            Using writer As New StreamWriter(outputFilename)
                Parse(reader, writer)
            End Using
        End Using
    End Sub

    Public Sub Parse(reader As TextReader, writer As TextWriter)
        Dim line As String = Nothing
        ' 0 = no keyword element open, 1 = inside a keyword element
        Dim state As Integer = 0
        Dim xmlWriter As New XmlTextWriter(writer)
        xmlWriter.WriteStartDocument()
        ' Root element required for conformance
        xmlWriter.WriteStartElement("Keywords")
        While (InlineAssignHelper(line, reader.ReadLine())) IsNot Nothing
            ' A blank line closes the current keyword element
            If line.Length = 0 Then
                If state > 0 Then
                    xmlWriter.WriteEndElement()
                End If
                state = 0
                Continue While
            End If
            ' A line that starts in column 0 begins a new keyword,
            ' so close the previous keyword element first
            If state > 0 AndAlso Not [Char].IsWhiteSpace(line(0)) Then
                xmlWriter.WriteEndElement()
                state = 0
            End If
            ' Split(predicate, options) is a custom String.Split extension (not shown)
            Dim parts As String() = line.Split(Function(c) [Char].IsWhiteSpace(c), StringSplitOptions.RemoveEmptyEntries)
            Dim index As Integer = 0
            If state = 0 Then
                state = 1
                ' The first token is the keyword (parts[index++] in the C# original)
                xmlWriter.WriteStartElement(parts(index))
                index += 1
            End If
            ' Every remaining token is a key-value pair
            While index < parts.Length
                Dim keyvalue As String() = parts(index).Split("-"C)
                xmlWriter.WriteStartElement(keyvalue(0))
                xmlWriter.WriteString(keyvalue(1))
                xmlWriter.WriteEndElement()
                index += 1
            End While
        End While
        ' Close the last keyword element if one is still open
        If state > 0 Then
            xmlWriter.WriteEndElement()
        End If
        xmlWriter.WriteEndElement()
        xmlWriter.WriteEndDocument()
        xmlWriter.Flush()
    End Sub

    ' Emulates the C# idiom "(line = reader.ReadLine()) != null"
    Private Shared Function InlineAssignHelper(Of T)(ByRef target As T, value As T) As T
        target = value
        Return value
    End Function
End Class
Note that I added a root element to the XML because the .NET XML classes only like reading and writing well-formed XML.
Also note that the code uses an extension I wrote for String.Split.
^(\w)\s*((\w)\s*)(\r\n^\s+(\w)\s*)*
This is starting to get in the neighborhood but I think this is just easier to do in a programming language... just process the file line by line...
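For illustration, the line-by-line approach could look roughly like the Python sketch below. It is only a sketch under two assumptions: a continuation line starts with whitespace, and every data token contains a single '-' separating key from value. The function name convert is made up for this example.

```python
from xml.sax.saxutils import escape


def convert(lines):
    """Convert 'KEYWORD key-value ...' lines into nested XML-like text."""
    out = []
    current = None  # keyword whose element is currently open
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        tokens = line.split()
        # Assumption: a continuation line starts with whitespace,
        # while a new record starts in column 0 with its keyword.
        if not line[0].isspace():
            if current is not None:
                out.append(f"</{current}>")
            current = tokens.pop(0)
            out.append(f"<{current}>")
        for token in tokens:
            key, _, value = token.partition("-")  # split on the first '-'
            out.append(f"<{key}>{escape(value)}</{key}>")
    if current is not None:
        out.append(f"</{current}>")
    return "\n".join(out)
```

Calling convert(open(path)) on the input file (path being whatever your file is called) returns the converted text; keys are used as tag names unchanged, so they have to be valid XML names.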
You need to use the Groups and Captures features of Regex in .NET and apply something like:
([A-Z\d]+)(\s([A-Za-z\d]+)\-([A-Za-z\d]+))*
1. Find a Match and select the first Group to get the KEYWORD
2. Loop through the Captures of Groups 3 and 4 to get each DataKey and DataValue for that KEYWORD
3. Go to 1
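As a rough illustration of those steps, here is a Python sketch. Python's re module keeps only the last capture of a repeated group, so instead of walking captures as you would in .NET, the sketch swaps in a second pattern for the key-value pairs. The names KEYWORD_RE, PAIR_RE and parse_record are invented for this example, and it assumes one logical record (keyword line plus any continuation lines) is passed in as a single string.

```python
import re

# First group of the original pattern: the KEYWORD
KEYWORD_RE = re.compile(r"([A-Z\d]+)")
# Groups 3 and 4 of the original pattern: DataKey and DataValue
PAIR_RE = re.compile(r"\s([A-Za-z\d]+)-([A-Za-z\d]+)")


def parse_record(record):
    """Return (keyword, [(key, value), ...]) for one logical record."""
    head = KEYWORD_RE.match(record)
    if head is None:
        raise ValueError(f"no keyword at start of {record!r}")
    # Collect every key-value pair that follows the keyword
    pairs = PAIR_RE.findall(record, head.end())
    return head.group(1), pairs
```

From the returned keyword and pairs it is a short loop to write the opening tag, one element per pair, and the closing tag.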
If the DataKey and DataValue items can't contain <, > or '-' chars or spaces, you can do something like this:
Read your file into a string and do a replaceAll with a regex similar to this: ([^- \t]+)-([^- \t]+) and use this as the replacement: <$1>$2</$1>. This will convert something like DataKey01-DataValue01 into something like <DataKey01>DataValue01</DataKey01>.
After that you need to run another global replace, this time with the regex ^([^ \t]+)(\s+(?:<[^>]+>[^<]+</[^>]+>[\s\n]*)+) and replace with <$1>$2</$1> again.
This should do the trick.
I don't program in VB.NET so I have no idea if the actual syntax is correct (you might need to double or quadruple the \ in some cases). You should make sure to enable the Multiline option for the second pass.
To explain:
([^- \t]+)-([^- \t]+)
- ([^- \t]+) will match any string of chars not containing a space, - or \t. This is marked as $1 (notice the parentheses around it)
- - will match the - char
- ([^- \t]+) will again match any string of chars not containing a space, - or \t. This is marked as $2 (notice the parentheses around it)
- The replacement will just convert a matched ab-cd string into <ab>cd</ab>
After this step the file looks like:
KEYWORD0 <DataKey00>DataValue00</DataKey00> <DataKey01>DataValue01</DataKey01>
<DataKey02>DataValue02</DataKey02> <DataKey0N>DataValue0N</DataKey0N>
KEYWORD1 <DataKey10>DataValue10</DataKey10> <DataKey11>DataValue11</DataKey11>
<DataKey12>DataValue12</DataKey12> <DataKey13>DataValue13</DataKey13>
<DataKey14>DataValue14</DataKey14> <DataKey1N>DataValue1N</DataKey1N>
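The first pass could be tried out in Python roughly as follows. This is a sketch, not a tested conversion of the real file: the character class uses \s instead of a space and \t, an adjustment made here so that a match cannot run across a line break, and the names PAIR_RE and wrap_pairs are made up.

```python
import re

# Same idea as ([^- \t]+)-([^- \t]+), but \s also excludes line breaks
# (an adjustment for this sketch, not part of the regex given above).
PAIR_RE = re.compile(r"([^-\s]+)-([^-\s]+)")


def wrap_pairs(text):
    """Rewrite every key-value token as <key>value</key>."""
    return PAIR_RE.sub(r"<\1>\2</\1>", text)
```

Keywords are left alone by this pass because they are not followed by a '-'.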
^([^ \t]+)(\s+(?:<[^>]+>[^<]+</[^>]+>[\s\n]*)+)
- ^([^ \t]+) mark and match any string of chars that are not a space or \t, beginning at the start of the line (this is $1)
- ( begin a mark
- \s+ white space
- (?: non marked group starting here
- <[^>]+> match an open xml tag: <ab>
- [^<]+ match the inside of a tag: bc
- </[^>]+> match a closing tag: </ab>
- [\s\n]* some optional white space or newlines
- )+ close the non marked group and repeat at least one time
- ) close the mark (this is $2)
The replacement is straightforward now.
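A Python sketch of this second pass, applied to text that has already been through the first one, might look like this. The regex is the one explained above; the names RECORD_RE and wrap_keywords are invented, and re.MULTILINE plays the role of the Multiline option.

```python
import re

# The second-pass regex from above; MULTILINE lets ^ match at every line start
RECORD_RE = re.compile(
    r"^([^ \t]+)(\s+(?:<[^>]+>[^<]+</[^>]+>[\s\n]*)+)",
    re.MULTILINE,
)


def wrap_keywords(text):
    """Wrap each keyword and its run of tags in <KEYWORD>...</KEYWORD>."""
    return RECORD_RE.sub(r"<\1>\2</\1>", text)
```

Because [\s\n]* also swallows the line break after the last tag of a record, the closing keyword tag ends up on its own line, directly followed by the next opening keyword tag. The result is still valid markup, just not pretty-printed.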
Hope it helps.
But you should probably try to write a simple parser if this is not a one-off job :)