Errors processing bullet point via regex replace in VB to clean up XML file

I'm trying to clean up an XML file so that it contains only valid UTF-8 characters, but I'm having trouble with a bullet point. The files contain a bullet character; if I remove it by hand, the rest of the regex replace works fine, but the regex doesn't seem to replace this specific character. In a hex viewer it is 0x07 (\u0007 in Unicode notation), but matching on either form did not resolve the error ("hexadecimal value 0x07, is an invalid character").

Here is some of the regex replace code (a VB script in SSIS) I'm using, with several iterations I've tried. Any help would be greatly appreciated.

    XMLString = FileIO.FileSystem.ReadAllText(filelocation)
    'Dim rgx As Regex = New Regex("[\x00-\x08\x0B-\x0C\x0E-\x1F\u0000-\u0007]", RegexOptions.None)
    'Dim rgx As Regex = New Regex("[^0-9a-zA-Z]", RegexOptions.None)
    'Dim rgx As Regex = New Regex("[[:^print:]]", RegexOptions.None)
    'Dim rgx As Regex = New Regex("[[:^print:][\u0007]]", RegexOptions.None)
    Dim rgx As Regex = New Regex("[^\x09\x0A\x0D\x20-\xD7FF\xE000-\xFFFD\x10000-x10FFFF]", RegexOptions.None)
    'Dim rgx As Regex = New Regex("[\x00-\x1F\x7F-\xFF]+", RegexOptions.None)
    rgx.Replace(XMLString, "")

thanks


One thing you need to know is whether your regular expression is being applied against a string of bytes or a string of characters. (In Perl there is an explicit difference; I'm not too sure about VB, where it's usually controlled by the way you read the data in.) The two points below are not "rules" as such, more good form.

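  • If running against bytes, then you should only use the \xXX escape sequences (and XX can only be 2 hex digits).
  • If running against characters, then you should use the \uXXXX escape sequences. (\xXXXX is the same thing in many languages, but not in .NET, where \x takes exactly two hex digits, so \xD7FF is read as \xD7 followed by the literal text "FF".)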
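To see which case applies, it can help to read the same file both ways. The following is only a sketch: the temp file and its five bytes are made up for illustration, and it assumes the real input is UTF-8.

```vb
Imports System
Imports System.IO
Imports System.Text

Module ReadDemo
    Sub Main()
        ' Hypothetical temp file standing in for the real XML input:
        ' "a", the BEL control byte 0x07, and U+2022 (bullet) encoded as UTF-8.
        Dim tempFile As String = System.IO.Path.GetTempFileName()
        File.WriteAllBytes(tempFile, New Byte() {&H61, &H7, &HE2, &H80, &HA2})

        ' Bytes: nothing is decoded, the bullet is three separate bytes.
        Dim raw As Byte() = File.ReadAllBytes(tempFile)

        ' Characters: the bytes are decoded as UTF-8 into a .NET (UTF-16) string.
        Dim decoded As String = File.ReadAllText(tempFile, Encoding.UTF8)

        Console.WriteLine(raw.Length)      ' 5
        Console.WriteLine(decoded.Length)  ' 3

        File.Delete(tempFile)
    End Sub
End Module
```

A .NET Regex always runs against a String, so in VB the pattern is applied to characters, and the encoding used when reading the file decides what those characters are.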

Looking at your uncommented regex, it seems you're working with characters. This would imply the file must already be in some valid character encoding (probably one of UTF-8, UTF-16LE, or cp1252). So all this regex is doing is stripping out valid characters which are not allowed according to the XML spec: http://www.w3.org/TR/xml/#charsets . That should be fine.

But if your string is a stream of bytes, and you are trying to ensure it is valid UTF-8, then that is harder to do with a regex. Other than stripping everything non-ASCII, I don't know how.
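As a rough sketch of that idea in VB.NET, the pattern below uses \u escapes only. It assumes the text is already decoded into a .NET string, where characters above U+FFFF appear as surrogate pairs, so it removes unpaired surrogates rather than trying to express the U+10000 to U+10FFFF range directly. The module and function names are made up for the example.

```vb
Imports System
Imports System.Text.RegularExpressions

Module XmlCleanup
    ' Matches what XML 1.0 does not allow in a UTF-16 string:
    ' control characters other than tab/LF/CR, U+FFFE/U+FFFF,
    ' and surrogates that are not part of a valid pair.
    Private ReadOnly InvalidXmlChars As New Regex( _
        "[\u0000-\u0008\u000B\u000C\u000E-\u001F\uFFFE\uFFFF]" & _
        "|[\uD800-\uDBFF](?![\uDC00-\uDFFF])" & _
        "|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]")

    Function StripInvalidXmlChars(ByVal source As String) As String
        ' Regex.Replace returns a new string; the argument is not modified.
        Return InvalidXmlChars.Replace(source, "")
    End Function

    Sub Main()
        ' "a", BEL (U+0007), "b", bullet (U+2022), "c"
        Dim dirty As String = "a" & ChrW(7) & "b" & ChrW(&H2022) & "c"
        Dim clean As String = StripInvalidXmlChars(dirty)

        ' BEL is removed; the bullet is a legal XML character and is kept.
        Console.WriteLine(clean.Length)  ' 4
    End Sub
End Module
```

Note that U+2022 survives here because it is a legal XML character; only the 0x07 control character is dropped.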

One other point: shouldn't you be setting the Global attribute of your regex before doing the replace? Could this be your problem? Is it fixing the first occurrence but not the whole file?
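For what it's worth, .NET has no Global flag: Regex.Replace replaces every match by default. What it does not do is change the string you pass in, because .NET strings are immutable, so the result has to be assigned back. A minimal sketch with a made-up input string:

```vb
Imports System
Imports System.Text.RegularExpressions

Module ReplaceDemo
    Sub Main()
        Dim rgx As New Regex("\u0007")
        Dim sample As String = "x" & ChrW(7) & "y" & ChrW(7) & "z"

        ' Result discarded: sample still contains both BEL characters.
        rgx.Replace(sample, "")
        Console.WriteLine(sample.Length)  ' 5

        ' Assign the result back; all matches are replaced in one call.
        sample = rgx.Replace(sample, "")
        Console.WriteLine(sample)         ' xyz
    End Sub
End Module
```

If the Replace call in the question is used as written, with its return value ignored, XMLString would stay unchanged no matter which pattern is used.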


With PowerShell I used the following regex:

    -replace "\u2022", "" `

as @Brian Reichle mentioned in your comment
