Strip Font Tags from XML with nokogiri
I'm trying to clean up some xml by removing Font Tags. Here's what I start with:
<?xml version="1.0"?>
<Worksheet ss:Name="Subtitles">
<Table ss:ExpandedColumnCount="3" ss:ExpandedRowCount="53" x:FullColumns="1"
x:FullRows="1" ss:DefaultRowHeight="13.5">
<Column ss:StyleID="s62" ss:Width="80.25" ss:Span="1"/>
<Column ss:Index="3" ss:StyleID="s63" ss:Width="249.75"/>
<Row ss:AutoFitHeight="0">
<Cell><Data ss:Type="String">00:00:11:09</Data></Cell>
<Cell><Data ss:Type="String">00:00:13:06</Data></Cell>
<Cell><ss:Data ss:Type="String" xmlns="http://www.w3.org/TR/REC-html40">안녕하세요<Font
html:Face="Arial" html:Color="#000000">, </Font><Font html:Face="돋움"
html:Color="#000000">저는</Font><Font html:Face="Arial" html:Color="#000000"> </Font><Font
html:Face="돋움" html:Color="#000000">잭</Font><Font html:Face="Arial"
html:Color="#000000">, 9</Font><Font html:Face="돋움" html:Color="#000000">살</Font><Font
html:Face="Arial" html:Color="#000000"> </Font><Font htm开发者_JAVA百科l:Face="돋움"
html:Color="#000000">입니다</Font><Font html:Face="Arial" html:Color="#000000">. </Font></ss:Data></Cell>
</Row>
</Table>
</Worksheet>
</Workbook>
And here's what I want:
<?xml version="1.0"?>
<Worksheet ss:Name="Subtitles">
<Table ss:ExpandedColumnCount="3" ss:ExpandedRowCount="53" x:FullColumns="1"
x:FullRows="1" ss:DefaultRowHeight="13.5">
<Column ss:StyleID="s62" ss:Width="80.25" ss:Span="1"/>
<Column ss:Index="3" ss:StyleID="s63" ss:Width="249.75"/>
<Row ss:AutoFitHeight="0">
<Cell><Data ss:Type="String">00:00:11:09</Data></Cell>
<Cell><Data ss:Type="String">00:00:13:06</Data></Cell>
<Cell><Data ss:Type="String">안녕하세요, 저는잭, 9살 입니다.</Data></Cell>
</Row>
</Table>
</Worksheet>
</Workbook>
I'm trying to accomplish this with nokogiri. Here's what I've got so far:
require 'nokogiri'
f = File.open("junk_from_excel.xml")
doc = Nokogiri::XML(f)
# <Font html:Face="돋움" html:Color="#000000">, </Font>
doc.xpath('//@Face').each(&:remove)
# becomes: <Font html:Color="#000000">, </Font>
doc.xpath('//@Color').each(&:remove)
# becomes: <Font>, </Font>
puts doc
I'm not sure that helps at all. I still need to turn:
<Font>, </Font>
into:
,
Any help is much appreciated!
Thank you all for your help. I got it working with this hackish gsub stuff at the end:
require 'nokogiri'
f = File.open("full_file_from_excel.xml")
doc = Nokogiri::XML(f)
def font_killer(children)
children.each do |c|
if(c.name == 'Font')
c.replace(font_killer(c.children))
else
font_killer(c.children)
end
end
children
end
doc = Nokogiri::XML(open('full_file_from_excel.xml').read)
doc.encoding = 'utf-8'
doc.xpath('//ss:Data[@ss:Type="String"]|//Data[@ss:Type="String"]').each { |n| font_killer(n.children) }
#save to file:
open('full_file_from_excel_fixed.xml', 'w') { |f|
f.puts doc
}
# few more file cleanup find and replace
text = File.read("full_file_from_excel_fixed.xml")
replace = text.gsub(" </ss:Data>","</Data>")
replace = replace.gsub(" </ss:Data>","</Data>")
replace = replace.gsub("</ss:Data>","</Data>")
replace = replace.gsub("<ss:Data xmlns=\"http://www.w3.org/TR/REC-html40\" ss:Type=\"String\">", "<Data ss:Type=\"String\">")
File.open("full_file_from_excel_fixed.xml", "w") {|file| file.puts replace}
If the "junk from excel" is that simple and the ss:Type="String"
elements only contain text or the <Font>
nonsense then:
doc.encoding = 'utf-8'
doc.xpath('//Data[@Type="String"]').each { |n| n.content = n.text }
should remove the ugliness. That gives me this output:
<?xml version="1.0" encoding="utf-8"?>
<Worksheet Name="Subtitles">
<Table ExpandedColumnCount="3" ExpandedRowCount="53" FullColumns="1" FullRows="1" DefaultRowHeight="13.5">
<Column StyleID="s62" Width="80.25" Span="1"/>
<Column Index="3" StyleID="s63" Width="249.75"/>
<Row AutoFitHeight="0">
<Cell><Data Type="String">00:00:11:09</Data></Cell>
<Cell><Data Type="String">00:00:13:06</Data></Cell>
<Cell><Data xmlns="http://www.w3.org/TR/REC-html40" Type="String">안녕하세요, 저는 잭, 9살 입니다. </Data></Cell>
</Row>
</Table>
</Worksheet>
I don't know how to convince Nokogiri to remove that stray xmlns
though.
If you're not sure about what's inside the <Data>
, then you can remove only the <Font>
stuff with something like this:
def font_killer(children)
children.each do |c|
if(c.name == 'Font')
c.replace(font_killer(c.children))
else
font_killer(c.children)
end
end
children
end
doc = Nokogiri::XML(open('junk_from_excel.xml').read)
doc.encoding = 'utf-8'
doc.xpath('//Data[@Type="String"]').each { |n| font_killer(n.children) }
I tested this a bit but I'd recommend that you test it a bit more on your real data.
In either case, you lose the ss
and x
namespace prefixes; this is fair since the XML doesn't doesn't properly declare namespaces so Nokogiri pretends that they don't exist. If you want the namespaces, then add these attributes
xmlns:x="urn:schemas-microsoft-com:office:excel"
xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
to the <Worksheet>
element before the ss:Name
attribute and then update the XPath expressions to include the namespace:
doc.xpath('//ss:Data[@ss:Type="String"]|//Data[@ss:Type="String"]')
There's probably a better way of expressing that but my XPath-Fu isn't that strong; that should get the job done though.
You could extract the contents before removing them like this:
doc.xpath('//*[@Face]').each do |node|
node.children.each do |child|
node.parent << child
end
node.remove
end
source: http://rubyforge.org/pipermail/nokogiri-talk/2009-June/000333.html
精彩评论