ruby (1.8.7): How to get rid of non-printable chars while scraping?
I'm trying to parse an HTML page with Nokogiri, but I'm having some issues with text. Mainly, I cannot get rid of unwanted chars. Whenever I obtain a String while parsing, I try to clean it as much as possible by collapsing runs of whitespace and non-printable chars into a single space. After many modifications, this method still doesn't work:
require 'cgi'

# Unescape HTML entities, collapse whitespace runs into one space, trim the ends.
def clear_string(str)
  CGI::unescapeHTML(str).gsub(/\s+/mu, " ").strip
end
For instance, suppose this HTML fragment (copy-pasted from http://www.gisa.cat/gisa/servlet/HomeLicitation?licitationID=1061525):
<tr>
<td><span class="linkred2">Tramitació:</span></td>
<td> ordinària </td>
</tr>
Some intermediate example outputs shown by NetBeans 7.0, using Nokogiri and clear_string (the method defined above):
row.at("td[1]").text # => "Tramitació:"
row.at("td[2]").text # => " ordinària "
clear_string(row.at("td[2]").text) # => " ordinària"
row.at("td[2]").text.scan(/./mu) # => ["\302\240", "o", "r", "d", "i", "n", "\303\240", "r", "i", "a", " "]
I don't know why strip doesn't get rid of the leading space. Moreover, the parsing result after applying clear_string is dumped into a YAML file using YAML::dump. Its contents are, respectively, for both texts:
"Tramitaci\xC3\xB3:"
!binary |
wqBvcmRpbsOgcmlh
The first one seems barely OK, but I don't know how to fix the second case.
One way to translate characters from one character set to another is to use Iconv. For example, if all you are looking for is converting UTF8 to ASCII, you could do something like this:
require 'iconv'
s = "ordinària"
Iconv.conv('ASCII//TRANSLIT', 'UTF8', s)
=> "ordinaria"
The TRANSLIT switch tells Iconv to try to transliterate (approximately match) unconvertible characters. If you instead want to drop unconvertible characters entirely, you can use the IGNORE switch:
Iconv.conv('ASCII//IGNORE', 'UTF8', s)
=> "ordinria"
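As a side note for readers on newer interpreters: Iconv was deprecated in Ruby 1.9.3 and dropped from the standard library in 2.0, so the calls here apply to 1.8.7 as in the question. On Ruby 1.9 or later, a roughly comparable "ignore" behaviour can be sketched with String#encode. The helper name ascii_only is hypothetical and only for illustration.

```ruby
# Sketch for Ruby 1.9+ only; String#encode does not exist on 1.8.7.
# ascii_only is a hypothetical helper name, not part of any library.
def ascii_only(str)
  # Drop every character that is invalid or has no ASCII equivalent.
  str.encode("ASCII", :invalid => :replace, :undef => :replace, :replace => "")
end

ascii_only("ordin\u00E0ria")  # => "ordinria"
```

Unlike TRANSLIT, this does not try to approximate the dropped characters.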
Note that Iconv will raise an exception with TRANSLIT if it finds something it can't convert. To avoid that, you can combine IGNORE and TRANSLIT like so:
Iconv.conv('ASCII//TRANSLIT//IGNORE', 'UTF8', s)
=> "ordinaria"
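As for the leading character that strip leaves behind: the bytes "\302\240" in the scan output are the UTF-8 encoding of U+00A0, the no-break space. Ruby's \s and String#strip only treat ASCII whitespace as whitespace, so they skip it. A minimal sketch of one way to handle it is to replace those two bytes with a plain space before squeezing. The names NBSP and clear_string_nbsp are hypothetical, and the CGI unescaping step from the question is left out to keep the example short.

```ruby
# Hypothetical helper, not part of the original question or answer.
NBSP = "\302\240"  # the two bytes of U+00A0 in UTF-8, as seen in the scan output

def clear_string_nbsp(str)
  # Turn no-break spaces into plain spaces first, then squeeze and trim as before.
  str.gsub(NBSP, " ").gsub(/\s+/, " ").strip
end

# The leading no-break space and the trailing space are both removed.
cleaned = clear_string_nbsp("\302\240ordin\303\240ria ")
```

Because the pattern is a plain byte string, this should not depend on $KCODE or on the regexp /u flag.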