Errors processing bullet point via regex replace in VB to clean up XML file

2023-01-31 07:44

I'm trying to clean up an XML file so that it contains only valid UTF-8 characters, but I'm having trouble with a bullet point. The files contain a bullet character; if I remove it by hand, the rest of the regex replace works fine, but the replace itself doesn't seem to remove this specific character. In a hex viewer it is 0x07 (\u0007 in Unicode), but neither form resolved the error ("hexadecimal value 0x07, is an invalid character").

Here is some of the regex replace code I'm using (VB script in SSIS), along with several iterations I've tried. Any help would be greatly appreciated.

    XMLString = FileIO.FileSystem.ReadAllText(filelocation)
    'Dim rgx As Regex = New Regex("[\x00-\x08\x0B-\x0C\x0E-\x1F\u0000-\u0007]", RegexOptions.None)
    'Dim rgx As Regex = New Regex("[^0-9a-zA-Z]", RegexOptions.None)
    'Dim rgx As Regex = New Regex("[[:^print:]]", RegexOptions.None)
    'Dim rgx As Regex = New Regex("[[:^print:][\u0007]]", RegexOptions.None)
    Dim rgx As Regex = New Regex("[^\x09\x0A\x0D\x20-\xD7FF\xE000-\xFFFD\x10000-x10FFFF]", RegexOptions.None)
    'Dim rgx As Regex = New Regex("[\x00-\x1F\x7F-\xFF]+", RegexOptions.None)
    rgx.Replace(XMLString, "")

thanks

One thing you need to know is whether your regular expression is being applied to a string of bytes or a string of characters. (In Perl there is an explicit difference; I'm not too sure about VB, where it's usually controlled by the way you read the data in.) The two points below are not "rules" as such, more good form.

If running against bytes, you should only use the \xXX escape sequences (XX can only be 2 hex digits).
If running against characters, you should use the \uXXXX escape sequences. (Some languages also accept a longer \x form, but .NET reads exactly two hex digits after \x, so \xD7FF there means \xD7 followed by the literal text "FF".)
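To make the difference between the two escape forms concrete, here is a minimal sketch for the .NET regex engine. The module name `EscapeDemo` and the helper `MatchesWhole` are hypothetical names introduced only for this illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module EscapeDemo
    ' Hypothetical helper: True when the pattern matches the entire input.
    Function MatchesWhole(pattern As String, input As String) As Boolean
        Return Regex.IsMatch(input, "^(?:" & pattern & ")$")
    End Function

    Sub Main()
        ' \x takes exactly two hex digits, so "\xD7FF" is U+00D7 followed by "FF".
        Console.WriteLine(MatchesWhole("\xD7FF", ChrW(&HD7) & "FF"))   ' True
        Console.WriteLine(MatchesWhole("\xD7FF", ChrW(&HD7FF)))        ' False

        ' \u takes exactly four hex digits and names a single UTF-16 code unit.
        Console.WriteLine(MatchesWhole("\uD7FF", ChrW(&HD7FF)))        ' True

        ' For a code point below 0x100, such as 0x07, both forms are equivalent.
        Console.WriteLine(MatchesWhole("\x07", ChrW(7)))               ' True
        Console.WriteLine(MatchesWhole("\u0007", ChrW(7)))             ' True
    End Sub
End Module
```

If this holds for your runtime, a range such as `\x20-\xD7FF` inside a character class does not mean "U+0020 to U+D7FF"; it means "U+0020 to U+00D7", followed by two literal `F` characters.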

Looking at your uncommented regex, it seems you're working with characters. That implies the file has already been decoded from some valid character encoding (probably UTF-8, UTF-16LE, or cp1252), so all this regex is meant to do is strip out characters that are valid in that encoding but not allowed by the XML spec: http://www.w3.org/TR/xml/#charsets. As an approach, that should be fine.
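As a rough sketch of that approach in VB.NET, the pattern below targets the characters that the XML 1.0 `Char` production excludes, written with `\u` escapes and with surrogate pairs handled explicitly because .NET strings are UTF-16. The names `XmlCleaner`, `InvalidXmlChars`, and `StripInvalidXmlChars` are hypothetical, and the pattern is one possible reading of the spec rather than a canonical one.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module XmlCleaner
    ' Matches UTF-16 code units that XML 1.0 does not allow:
    '   - C0 control characters other than tab, LF and CR
    '   - U+FFFE and U+FFFF
    '   - a high surrogate not followed by a low surrogate
    '   - a low surrogate not preceded by a high surrogate
    Private ReadOnly InvalidXmlChars As New Regex( _
        "[\x00-\x08\x0B\x0C\x0E-\x1F\uFFFE\uFFFF]" & _
        "|[\uD800-\uDBFF](?![\uDC00-\uDFFF])" & _
        "|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]")

    ' Returns a copy of source with the disallowed characters removed.
    Function StripInvalidXmlChars(source As String) As String
        Return InvalidXmlChars.Replace(source, "")
    End Function

    Sub Main()
        Dim dirty As String = "<item>" & ChrW(7) & " first" & vbCrLf & ChrW(7) & " second</item>"
        Dim clean As String = StripInvalidXmlChars(dirty)
        Console.WriteLine(clean.IndexOf(ChrW(7)))   ' -1: no BEL characters remain
    End Sub
End Module
```

Valid surrogate pairs (code points U+10000 and above) are left alone; only unpaired surrogates are removed.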

But if your string is a stream of bytes and you are trying to ensure it is valid UTF-8, that is harder to do with a regex. Other than rejecting everything non-ASCII, I don't know how.
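One way around the regex difficulty, if the data really is raw bytes, is to skip the regex for this step and let a strict decoder do the validation. The sketch below swaps in that technique; it uses the `UTF8Encoding` constructor with `throwOnInvalidBytes` set to `True`, and the names `Utf8Check` and `TryDecodeUtf8` are hypothetical.

```vbnet
Imports System
Imports System.Text

Module Utf8Check
    ' Tries to decode bytes as strict UTF-8.
    ' Returns False (and sets decoded to Nothing) if any byte sequence is invalid.
    Function TryDecodeUtf8(bytes As Byte(), ByRef decoded As String) As Boolean
        ' Second argument True = throw on invalid bytes instead of substituting U+FFFD.
        Dim strict As New UTF8Encoding(False, True)
        Try
            decoded = strict.GetString(bytes)
            Return True
        Catch ex As DecoderFallbackException
            decoded = Nothing
            Return False
        End Try
    End Function

    Sub Main()
        Dim good As Byte() = {&HE2, &H80, &HA2}   ' U+2022 BULLET encoded as UTF-8
        Dim bad As Byte() = {&H41, &H95, &H42}    ' 0x95 is the cp1252 bullet; invalid on its own in UTF-8
        Dim text As String = Nothing
        Console.WriteLine(TryDecodeUtf8(good, text))   ' True
        Console.WriteLine(TryDecodeUtf8(bad, text))    ' False
    End Sub
End Module
```

A file that fails this check was probably written in another encoding (for example cp1252) and would need to be read with that encoding before any character-level cleanup.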

One other point: shouldn't you be setting the global attribute of your regex before doing the replace? Could this be your problem, that it's fixing the first occurrence but not the whole file? (In .NET, Regex.Replace replaces every match by default, so this may not apply to VB.)
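On the replace call itself, two .NET behaviours are worth checking, sketched below: `Regex.Replace` replaces every match without any global flag, and because strings are immutable it returns the cleaned string rather than modifying its argument. The posted code calls `rgx.Replace(XMLString, "")` without using the result, which may be part of the problem. The names `ReplaceDemo` and `RemoveBell` are hypothetical.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ReplaceDemo
    ' Returns a copy of source with every BEL (0x07) character removed.
    Function RemoveBell(source As String) As String
        Dim rgx As New Regex("\x07")
        Return rgx.Replace(source, "")
    End Function

    Sub Main()
        Dim xmlString As String = "a" & ChrW(7) & "b" & ChrW(7) & "c"
        Dim rgx As New Regex("\x07")

        ' The return value is discarded here, so xmlString is unchanged.
        rgx.Replace(xmlString, "")
        Console.WriteLine(xmlString.Length)   ' 5

        ' Assigning the result keeps the cleaned string; both matches are removed.
        xmlString = rgx.Replace(xmlString, "")
        Console.WriteLine(xmlString)          ' abc
    End Sub
End Module
```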

With PowerShell I used the following regex:

    -replace "\u2022", "" `

as @Brian Reichle mentioned in a comment.
