How do I remove &#x2002;, &#x2014; and &#x2013; special characters from my XML files?

This is a sample of the XML file:

<row tnote="0">
<entry namest="col2" nameend="col4" us="none" emph="bld"><blst>
<li><text>Single, head of household, or qualifying widow(er)&#x2014;$55,000</text></li>
<li><text>Married filing jointly&#x2014;$115,000</text></li>
</blst></entry>
<entry colname="col6" ldr="1" valign="middle">&#x2002;</entry>
<entry colname="col7" valign="middle"> 5.</entry>
</row>

The &#x2014; etc. represent HTML 4.0 entities. I want to store each line's text as an element of an array, but not if the line is just &#x2002;.

if e.text.strip =~ /^&#x20[0-9][0-9];$/ then
  next
else
  subLines << e.text
end

But it doesn't seem to be working. Is my regex incorrect?


&#x...; isn't an entity reference, it's a character reference. To an XML parser, &#x2014; is absolutely identical to the raw em dash character (U+2014), so when you look at the DOM produced by an XML parser through a property such as element.text you won't see anything with an ampersand in it, just the single character.
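To see this for yourself, here is a minimal sketch. It assumes REXML (which ships with standard Ruby installs) as the parser; the question does not say which parser is in use, and parsed_text is a hypothetical helper name.

```ruby
require 'rexml/document'

# One element copied from the sample in the question.
SAMPLE = '<text>Married filing jointly&#x2014;$115,000</text>'

# Hypothetical helper: parse a one-element fragment and return its text.
def parsed_text(xml)
  REXML::Document.new(xml).root.text
end

text = parsed_text(SAMPLE)
puts text.include?('&')              # is any ampersand left after parsing?
puts text.include?("\xe2\x80\x94")   # is the decoded em dash (UTF-8 bytes) there?
```

Since the parser has already decoded the reference, a pattern that looks for the literal characters &#x20..; has nothing to match in element.text.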

So in principle, you'd match it with a regex something like /[—– ]/ (em dash, en dash, en space). However, if you are using Ruby 1.8, you've got the problem that the language itself doesn't have support for Unicode, so the character group in /[—– ]/ won't quite work properly: it'll try to remove every byte in the UTF-8 representations of the em dash, en dash and en space, which will likely mangle any other characters.

A simple string replace for each target character would work correctly, as that doesn't require special character handling. (Naturally if you included characters like the em dash directly in the source code you'd also have to get the file encoding of that script right, so it's probably easier to use a string literal escape like "\xe2\x80\x94".)
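A minimal sketch of that per-character replace, assuming the text is UTF-8. The constant names and strip_special are made up for illustration; the byte sequences are the UTF-8 encodings of U+2014, U+2013 and U+2002.

```ruby
# UTF-8 byte sequences for the three characters from the question,
# written as escapes so the script's own file encoding doesn't matter.
EM_DASH  = "\xe2\x80\x94" # U+2014
EN_DASH  = "\xe2\x80\x93" # U+2013
EN_SPACE = "\xe2\x80\x82" # U+2002

# Hypothetical helper: remove each target with a plain string replace,
# one target at a time, instead of a regex character class.
def strip_special(text)
  [EM_DASH, EN_DASH, EN_SPACE].inject(text) do |result, target|
    result.gsub(target, '')
  end
end

puts strip_special("Married filing jointly\xe2\x80\x94$115,000")
```

Each gsub call matches the whole three-byte sequence, so bytes shared with other characters (such as the leading \xe2\x80) are never removed on their own.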


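Because your regex is of the form /^...$/, it will only match when the pattern covers the entire string (strictly speaking, in Ruby ^ and $ anchor at line boundaries, so an entire line). You will only skip text that consists entirely of one such reference.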
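A small sketch of how those anchors behave, run against raw (unparsed) strings. QUESTION_PATTERN is the regex from the question; STRICT_PATTERN and only_reference? are hypothetical names added for comparison.

```ruby
# The pattern from the question: ^ and $ anchor at line boundaries in Ruby.
QUESTION_PATTERN = /^&#x20[0-9][0-9];$/
# Variant anchored to the whole string with \A and \z.
STRICT_PATTERN   = /\A&#x20[0-9][0-9];\z/

# Hypothetical helper: does the raw text consist of just one reference?
def only_reference?(raw, pattern = QUESTION_PATTERN)
  !(raw =~ pattern).nil?
end

puts only_reference?('&#x2002;')                        # the reference alone
puts only_reference?('Married filing jointly&#x2014;')  # reference inside other text
puts only_reference?("note\n&#x2002;")                  # reference alone on a second line
puts only_reference?("note\n&#x2002;", STRICT_PATTERN)  # same input, whole-string anchors
```

Either way, text that merely contains a reference alongside other characters is not skipped, which matches the stated goal of dropping only the lines that are nothing but &#x2002;.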
