Ruby (1.8.7): How to get rid of non-printable chars while scraping?

I'm trying to parse an HTML page with Nokogiri, but I'm having trouble with the text: I cannot get rid of unwanted characters. Whenever I obtain a String while parsing, I try to clean it as much as possible by collapsing runs of whitespace and non-printable characters into a single space. After many modifications, this is the method I use, still without success:

def clear_string(str)
  CGI::unescapeHTML(str).gsub(/\s+/mu," ").strip
end

For instance, suppose we have this HTML fragment (copy-pasted from http://www.gisa.cat/gisa/servlet/HomeLicitation?licitationID=1061525):

<tr>
    <td><span class="linkred2">Tramitaci&oacute;:</span></td>
    <td>&nbsp;ordinària </td>
</tr>

Some intermediate example outputs shown by NetBeans 7.0, using Nokogiri and clear_string (the method defined above):

row.at("td[1]").text # => "Tramitació:"
row.at("td[2]").text # => " ordinària "
clear_string(row.at("td[2]").text) # => " ordinària"
row.at("td[2]").text.scan(/./mu) # => ["\302\240", "o", "r", "d", "i", "n", "\303\240", "r", "i", "a", " "]

I don't know why strip doesn't remove the leading space. Moreover, the result of clear_string is dumped into a YAML file using YAML::dump. For the two texts above, the file contents are, respectively:

"Tramitaci\xC3\xB3:"
!binary |
  wqBvcmRpbsOgcmlh

The first one looks more or less OK, but I don't know how to fix the second case.
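
A note on the leading character: the "\302\240" at the start of the scan output is the UTF-8 encoding of U+00A0, the non-breaking space that &nbsp; decodes to (the !binary value also begins with those two bytes). Neither \s nor strip appears to treat it as whitespace, which would explain why it survives clear_string. Below is a minimal sketch of one workaround, assuming Ruby 1.9 or later (for the \u escape) and UTF-8 input. normalize_whitespace is a hypothetical name, not part of Nokogiri or CGI.

```ruby
# encoding: utf-8

# Hypothetical helper: collapse runs of whitespace into one space,
# explicitly counting U+00A0 (non-breaking space) as whitespace,
# then trim both ends.
def normalize_whitespace(str)
  str.gsub(/[\s\u00A0]+/, " ").strip
end

# Same characters as the second table cell in the question:
# NBSP, "ordinària", trailing space.
cell = "\u00A0ordinària "
puts normalize_whitespace(cell)  # prints "ordinària"
```

Because the non-breaking space is first turned into an ordinary space, the final strip can remove it.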


One way to translate characters from one character set to another is to use Iconv. For example, if all you need is to convert UTF-8 to ASCII, you could do something like this:

require 'iconv'

s = "ordinària"
Iconv.conv('ASCII//TRANSLIT', 'UTF8', s)
=> "ordinaria"

The TRANSLIT switch tells Iconv to try to transliterate (approximately match) unconvertible characters. If you instead want to drop unconvertible characters entirely, you can use the IGNORE switch:

Iconv.conv('ASCII//IGNORE', 'UTF8', s)
=> "ordinria"

Note that with TRANSLIT alone, Iconv raises an exception if it finds a character it can't convert. To avoid that, you can combine TRANSLIT and IGNORE like so:

Iconv.conv('ASCII//TRANSLIT//IGNORE', 'UTF8', s)
=> "ordinaria"
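
Iconv was deprecated in Ruby 1.9.3 and removed from the standard library in Ruby 2.0, so the calls above are tied to older interpreters (or to the separate iconv gem). A rough equivalent using only core String methods, assuming Ruby 2.2 or later (for unicode_normalize), might look like the sketch below. The helper names ascii_translit and ascii_ignore are made up for illustration, and dropping combining marks only approximates what TRANSLIT does: characters without a decomposition are simply discarded.

```ruby
# encoding: utf-8

# Hypothetical helper, roughly like ASCII//TRANSLIT//IGNORE:
# decompose accented letters (NFD), drop the combining marks,
# then discard anything that still has no ASCII equivalent.
def ascii_translit(str)
  str.unicode_normalize(:nfd).
      gsub(/\p{Mn}/, "").
      encode("US-ASCII", :invalid => :replace, :undef => :replace, :replace => "")
end

# Hypothetical helper, roughly like ASCII//IGNORE:
# discard every character that has no ASCII equivalent.
def ascii_ignore(str)
  str.encode("US-ASCII", :invalid => :replace, :undef => :replace, :replace => "")
end

s = "ordinària"
puts ascii_translit(s)  # prints "ordinaria"
puts ascii_ignore(s)    # prints "ordinria"
```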