How do i remove — – special characters from my XML files

2023-01-09 06:02 问答作者：

this is a sample of the xml file

<row tnote="0">
<entry namest="col2" nameend="col4" us="none" emph="bld"><blst>
<li><text>Single, head of household, or qualifying widow(er)&#x2014;$55,000</text></li>
<li><text>Married filing jointly&#x2014;$115,000</text></li>
</blst></entry>
<entry colname="col6" ldr="1" valign="middle">&#x2002;</entry>
<entry c开发者_开发问答olname="col7" valign="middle"> 5.</entry>
</row>

the — etc represent HTML 4.0 entities. i want to store each line's text as an element of an array, but not if the line is just  

if e.text.strip =~ /^&#x20[0-9][0-9];$/ then
next
else
subLines << e.text
end

but it doesn't seem to be working...is my regEx incorrect?

&#x...; isn't an entity reference, it's a character reference. To an XML parser, — is absolutely identical to the raw character —, so when you look at the DOM produced by an XML parser through a property such as element.text you won't see anything with an ampersand in it, but a simple — character.

So in principle, you'd match it with a regex something like /[—– ]/. However, if you are using Ruby 1.8, you've got the problem that the language itself doesn't have support for Unicode, so the character group in /[—– ]/ won't quite work properly: it'll try to remove every byte in the UTF-8 representation of –, — and , which will likely mangle any other characters.

A simple string replace for each target character would work correctly, as that doesn't require special character handling. (Naturally if you included characters like — directly in the source code you'd also have to get the file encoding of that script right, so probably easier to use a string literal escape like "\xe2\x80\x94".)

Because your regex is of the form /^...$/, it will only match against the entire string. You will only skip text that consists entirely of one HTML entity.

继续阅读：parsing regex ruby xml

How do i remove — – special characters from my XML files

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？