Regular Expression - Applied to a Text File

I have a text file with the following structure:

KEYWORD0 DataKey01-DataValue01 DataKey02-DataValue02 ... DataKey0N-DataValue0N

KEYWORD1 DataKey11-DataValue11 DataKey12-DataValue12 DataKey13-DataValue13
_________DataKey14-DataValue14 DataKey1N-DataValue1N (1)

// It is significant that the additional datakeys are on a new line

(1) The underscores are not part of the data. I used them to align the lines.

Question: How do I use a regex to convert my data to this format?

<KEYWORD0>
    <DataKey00>DataValue00</DataKey00>
    <DataKey01>DataValue01</DataKey01>
    <DataKey02>DataValue02</DataKey02>
    <DataKey0N>DataValue0N</DataKey0N>
</KEYWORD0>
<KEYWORD1>
    <DataKey10>DataValue10</DataKey10>
    <DataKey11>DataValue11</DataKey11>
    <DataKey12>DataValue12</DataKey12>
    <DataKey13>DataValue13</DataKey13>
    <DataKey14>DataValue14</DataKey14>
    <DataKey1N>DataValue1N</DataKey1N>
</KEYWORD1>


Regex is for masochists. This is a job for a very simple text parser; here is one in VB.NET (converted from C#, so check for bugs):

Imports System
Imports System.IO
Imports System.Xml

Public Class MyFileConverter
    Public Sub Parse(inputFilename As String, outputFilename As String)
        Using reader As New StreamReader(inputFilename)
            Using writer As New StreamWriter(outputFilename)
                Parse(reader, writer)
            End Using
        End Using
    End Sub

    Public Sub Parse(reader As TextReader, writer As TextWriter)
        Dim line As String
        Dim state As Integer = 0

        Dim xmlWriter As New XmlTextWriter(writer)
        xmlWriter.WriteStartDocument()
        xmlWriter.WriteStartElement("Keywords")
        ' Root element required for conformance
        While (InlineAssignHelper(line, reader.ReadLine())) IsNot Nothing
            ' A blank (or whitespace-only) line ends the current record
            If line.Trim().Length = 0 Then
                If state > 0 Then
                    xmlWriter.WriteEndElement()
                End If
                state = 0
                Continue While
            End If

            Dim parts As String() = line.Split(Function(c) [Char].IsWhiteSpace(c), StringSplitOptions.RemoveEmptyEntries)
            Dim index As Integer = 0

            If state = 0 Then
                ' The first token on a record's first line is the keyword
                state = 1
                xmlWriter.WriteStartElement(parts(index))
                index += 1
            End If

            While index < parts.Length
                Dim keyvalue As String() = parts(index).Split("-"C)
                xmlWriter.WriteStartElement(keyvalue(0))
                xmlWriter.WriteString(keyvalue(1))
                xmlWriter.WriteEndElement()
                index += 1
            End While
        End While

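        ' Close the last record if the file did not end with a blank line
        If state > 0 Then
            xmlWriter.WriteEndElement()
        End If
        ' Close the root element and push any buffered output to the writer
        xmlWriter.WriteEndElement()
        xmlWriter.WriteEndDocument()
        xmlWriter.Flush()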
    End Sub
    Private Shared Function InlineAssignHelper(Of T)(ByRef target As T, value As T) As T
        target = value
        Return value
    End Function
End Class

Note that I added a root element to the XML because the .NET XML classes only like reading and writing well-formed XML, and a well-formed document has exactly one root element.

Also note that the code uses an extension method I wrote for String.Split, one that takes a predicate instead of a list of separator characters.
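
A minimal sketch of what such an extension might look like, assuming only the signature implied by the call in Parse (a Func(Of Char, Boolean) predicate plus a StringSplitOptions value). The module name StringSplitExtensions is made up for illustration, and this is not the answerer's actual code:

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Runtime.CompilerServices
Imports System.Text

' Hypothetical module name; only the method signature is implied by the post.
Public Module StringSplitExtensions
    ' Splits text at every character for which isSeparator returns True.
    <Extension()>
    Public Function Split(text As String,
                          isSeparator As Func(Of Char, Boolean),
                          options As StringSplitOptions) As String()
        Dim dropEmpty As Boolean = options.HasFlag(StringSplitOptions.RemoveEmptyEntries)
        Dim parts As New List(Of String)()
        Dim current As New StringBuilder()

        For Each c As Char In text
            If isSeparator(c) Then
                ' A separator ends the current piece
                If current.Length > 0 OrElse Not dropEmpty Then
                    parts.Add(current.ToString())
                End If
                current.Clear()
            Else
                current.Append(c)
            End If
        Next

        ' Emit whatever follows the last separator
        If current.Length > 0 OrElse Not dropEmpty Then
            parts.Add(current.ToString())
        End If

        Return parts.ToArray()
    End Function
End Module
```

Because no built-in String.Split overload accepts a lambda, the compiler should fall through to the extension method for a call such as line.Split(Function(c) Char.IsWhiteSpace(c), StringSplitOptions.RemoveEmptyEntries).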


^(\w)\s*((\w)\s*)(\r\n^\s+(\w)\s*)*

This is starting to get into the neighborhood, but I think this is just easier to do in a programming language: just process the file line by line.


You need to use the Groups and Captures features of Regex in .NET and apply something like:

([A-Z\d]+)(\s+([A-Za-z\d]+)\-([A-Za-z\d]+))*
  1. Find a Match and read its first Group to get the KEYWORD
  2. Loop through the Captures of Groups 3 and 4 to get each DataKey and DataValue for that KEYWORD
  3. Go to 1 for the next Match
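
In .NET, every value a repeated group matched is kept in Group.Captures, which is what makes the loop above possible. A rough sketch of those three steps follows. It assumes the whole file has already been read into a string, anchors the keyword at the start of a line with RegexOptions.Multiline (my addition, so that a DataKey is never mistaken for a keyword), and does not XML-escape the values. The names CaptureConverter and ToXml are made up for illustration:

```vbnet
Imports System
Imports System.Text
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

' Hypothetical module; a sketch, not a hardened converter.
Public Module CaptureConverter
    ' ^ is anchored per line (Multiline), so only a token in column 0 can be
    ' a keyword. \s+ also matches line breaks, which lets indented
    ' continuation lines belong to the same record.
    Private ReadOnly RecordPattern As New Regex(
        "^([A-Z\d]+)(\s+([A-Za-z\d]+)-([A-Za-z\d]+))*",
        RegexOptions.Multiline)

    Public Function ToXml(input As String) As String
        Dim output As New StringBuilder()

        ' Steps 1 and 3: one Match per record
        For Each m As Match In RecordPattern.Matches(input)
            Dim keyword As String = m.Groups(1).Value
            output.Append("<" & keyword & ">" & vbLf)

            ' Step 2: groups 3 and 4 captured once per key-value pair
            Dim keys As CaptureCollection = m.Groups(3).Captures
            Dim values As CaptureCollection = m.Groups(4).Captures
            For i As Integer = 0 To keys.Count - 1
                output.Append("    <" & keys(i).Value & ">" &
                              values(i).Value &
                              "</" & keys(i).Value & ">" & vbLf)
            Next

            output.Append("</" & keyword & ">" & vbLf)
        Next

        Return output.ToString()
    End Function
End Module
```

The result is a sequence of sibling elements with no root, matching the format asked for in the question; wrap it in a root element if it has to be loaded as an XML document.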


If the DataValue and DataKey items can't contain <, > or - characters or whitespace, you can do something like this:

Read your file into a string and do a replaceAll with a regex similar to this: ([^-\s]+)-([^-\s]+) and use this as the replacement: <$1>$2</$1>. This will convert something like DataKey01-DataValue01 into <DataKey01>DataValue01</DataKey01>.

After that you need to run another global replace, this time with the regex ^([^ \t]+)(\s+(?:<[^>]+>[^<]+</[^>]+>[\s\n]*)+), again replacing with <$1>$2</$1>.

This should do the trick.

I don't program in VB.NET, so I have no idea if the actual syntax is correct (you might need to double or quadruple the \ in some cases). Make sure to enable the Multiline option for the second pass.

To explain:

([^-\s]+)-([^-\s]+)
  • ([^-\s]+) will match any run of characters that contains no - and no whitespace. Excluding all whitespace, not just space and \t, keeps a value at the end of a line from running on into the next line. This is captured as $1 (notice the parentheses around it)
  • - will match the - character
  • ([^-\s]+) will again match any run of characters that contains no - and no whitespace. This is captured as $2
  • The replacement just converts a matched ab-cd string into <ab>cd</ab>
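
As a hedged illustration of this first pass in VB.NET: the character class below is written as [^-\s], which excludes line breaks as well as spaces and tabs, so that a value at the end of a line cannot absorb the keyword on the next one. FirstPass and WrapPairs are made-up names:

```vbnet
Imports System
Imports System.Text.RegularExpressions

' Hypothetical module illustrating the first replace only.
Public Module FirstPass
    ' Group 1 is the key, group 2 the value; neither may contain "-"
    ' or any whitespace (including line breaks).
    Private ReadOnly PairPattern As New Regex("([^-\s]+)-([^-\s]+)")

    Public Function WrapPairs(input As String) As String
        ' $1 and $2 refer to the two capture groups
        Return PairPattern.Replace(input, "<$1>$2</$1>")
    End Function
End Module
```

VB.NET string literals do not treat \ as an escape character, so the pattern can be written exactly as it appears in the explanation.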

After this step the file looks like:

KEYWORD0 <DataKey00>DataValue00</DataKey00> <DataKey01>DataValue01</DataKey01>
   <DataKey02>DataValue02</DataKey02> <DataKey0N>DataValue0N</DataKey0N>

KEYWORD1 <DataKey10>DataValue10</DataKey10> <DataKey11>DataValue11</DataKey11>
   <DataKey12>DataValue12</DataKey12> <DataKey13>DataValue13</DataKey13>
   <DataKey14>DataValue14</DataKey14> <DataKey1N>DataValue1N</DataKey1N>

^([^ \t]+)(\s+(?:<[^>]+>[^<]+</[^>]+>[\s\n]*)+)

  • ^([^ \t]+) matches and captures any run of characters other than space or \t, starting at the beginning of a line (this is $1)
  • ( begins a capture group
    • \s+ white space
    • (?: a non-capturing group starts here
      • <[^>]+> matches an opening XML tag: <ab>
      • [^<]+ matches the content of the element: bc
      • </[^>]+> matches a closing tag: </ab>
      • [\s\n]* some optional white space or newlines
    • )+ closes the non-capturing group and repeats it at least once
  • ) closes the capture group (this is $2)

The replacement is straightforward now.
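
A possible VB.NET rendering of this second pass, taking text that already has its key-value pairs wrapped in tags. SecondPass and WrapRecords are made-up names, and the pattern is the one explained above, unchanged:

```vbnet
Imports System
Imports System.Text.RegularExpressions

' Hypothetical module illustrating the second replace only.
Public Module SecondPass
    ' Multiline makes ^ match at the start of every line rather than only
    ' at the start of the string, so each keyword line can begin a match.
    Private ReadOnly RecordPattern As New Regex(
        "^([^ \t]+)(\s+(?:<[^>]+>[^<]+</[^>]+>[\s\n]*)+)",
        RegexOptions.Multiline)

    Public Function WrapRecords(taggedText As String) As String
        ' $1 is the keyword, $2 is everything up to the next keyword
        Return RecordPattern.Replace(taggedText, "<$1>$2</$1>")
    End Function
End Module
```

One thing to be aware of: $2 includes the white space and blank lines that follow a record, so each closing keyword tag lands directly in front of the next opening one. The markup nests correctly but is not pretty-printed, and it still has no single root element.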

Hope it helps.

But you should probably write a simple parser if this is not a one-off job :)
