Strip Font Tags from XML with nokogiri

2023-04-05 05:10 问答作者：

I'm trying to clean up some xml by removing Font Tags. Here's what I start with:

<?xml version="1.0"?>
 <Worksheet ss:Name="Subtitles">
  <Table ss:ExpandedColumnCount="3" ss:ExpandedRowCount="53" x:FullColumns="1"
   x:FullRows="1" ss:DefaultRowHeight="13.5">
   <Column ss:StyleID="s62" ss:Width="80.25" ss:Span="1"/>
   <Column ss:Index="3" ss:StyleID="s63" ss:Width="249.75"/>
   <Row ss:AutoFitHeight="0">
    <Cell><Data ss:Type="String">00:00:11:09</Data></Cell>
    <Cell><Data ss:Type="String">00:00:13:06</Data></Cell>
    <Cell><ss:Data ss:Type="String" xmlns="http://www.w3.org/TR/REC-html40">안녕하세요<Font
       html:Face="Arial" html:Color="#000000">, </Font><Font html:Face="돋움"
       html:Color="#000000">저는</Font><Font html:Face="Arial" html:Color="#000000"> </Font><Font
       html:Face="돋움" html:Color="#000000">잭</Font><Font html:Face="Arial"
       html:Color="#000000">, 9</Font><Font html:Face="돋움" html:Color="#000000">살</Font><Font
       html:Face="Arial" html:Color="#000000"> </Font><Font htm开发者_JAVA百科l:Face="돋움"
       html:Color="#000000">입니다</Font><Font html:Face="Arial" html:Color="#000000">. </Font></ss:Data></Cell>
   </Row>
  </Table>
 </Worksheet>
</Workbook>

And here's what I want:

<?xml version="1.0"?>
 <Worksheet ss:Name="Subtitles">
  <Table ss:ExpandedColumnCount="3" ss:ExpandedRowCount="53" x:FullColumns="1"
   x:FullRows="1" ss:DefaultRowHeight="13.5">
   <Column ss:StyleID="s62" ss:Width="80.25" ss:Span="1"/>
   <Column ss:Index="3" ss:StyleID="s63" ss:Width="249.75"/>
   <Row ss:AutoFitHeight="0">
    <Cell><Data ss:Type="String">00:00:11:09</Data></Cell>
    <Cell><Data ss:Type="String">00:00:13:06</Data></Cell>
    <Cell><Data ss:Type="String">안녕하세요, 저는잭, 9살 입니다.</Data></Cell>
   </Row>
  </Table>
 </Worksheet>
</Workbook>

I'm trying to accomplish this with nokogiri. Here's what I've got so far:

require 'nokogiri'
f = File.open("junk_from_excel.xml")
doc = Nokogiri::XML(f)

# <Font html:Face="돋움" html:Color="#000000">, </Font>
doc.xpath('//@Face').each(&:remove)
# becomes: <Font html:Color="#000000">, </Font>
doc.xpath('//@Color').each(&:remove)
# becomes: <Font>, </Font>

puts doc

I'm not sure that helps at all. I still need to turn:

<Font>, </Font>

into:

Any help is much appreciated!

Thank you all for your help. I got it working with this hackish gsub stuff at the end:

require 'nokogiri'

f = File.open("full_file_from_excel.xml")
doc = Nokogiri::XML(f)

def font_killer(children)
    children.each do |c|
        if(c.name == 'Font')
            c.replace(font_killer(c.children))
        else
            font_killer(c.children)
        end
    end
    children
end
doc = Nokogiri::XML(open('full_file_from_excel.xml').read)
doc.encoding = 'utf-8'
doc.xpath('//ss:Data[@ss:Type="String"]|//Data[@ss:Type="String"]').each { |n| font_killer(n.children) }

#save to file:
open('full_file_from_excel_fixed.xml', 'w') { |f|
  f.puts doc
}

# few more file cleanup find and replace
text = File.read("full_file_from_excel_fixed.xml")
replace = text.gsub("  </ss:Data>","</Data>")
replace = replace.gsub(" </ss:Data>","</Data>")
replace = replace.gsub("</ss:Data>","</Data>")
replace = replace.gsub("<ss:Data xmlns=\"http://www.w3.org/TR/REC-html40\" ss:Type=\"String\">", "<Data ss:Type=\"String\">")
File.open("full_file_from_excel_fixed.xml", "w") {|file| file.puts replace}

If the "junk from excel" is that simple and the ss:Type="String" elements only contain text or the <Font> nonsense then:

doc.encoding = 'utf-8'
doc.xpath('//Data[@Type="String"]').each { |n| n.content = n.text }

should remove the ugliness. That gives me this output:

<?xml version="1.0" encoding="utf-8"?>
<Worksheet Name="Subtitles">
  <Table ExpandedColumnCount="3" ExpandedRowCount="53" FullColumns="1" FullRows="1" DefaultRowHeight="13.5">
   <Column StyleID="s62" Width="80.25" Span="1"/>
   <Column Index="3" StyleID="s63" Width="249.75"/>
   <Row AutoFitHeight="0">
    <Cell><Data Type="String">00:00:11:09</Data></Cell>
    <Cell><Data Type="String">00:00:13:06</Data></Cell>
    <Cell><Data xmlns="http://www.w3.org/TR/REC-html40" Type="String">안녕하세요, 저는 잭, 9살 입니다. </Data></Cell>
   </Row>
  </Table>
 </Worksheet>

I don't know how to convince Nokogiri to remove that stray xmlns though.

If you're not sure about what's inside the <Data>, then you can remove only the <Font> stuff with something like this:

def font_killer(children)
    children.each do |c|
        if(c.name == 'Font')
            c.replace(font_killer(c.children))
        else
            font_killer(c.children)
        end
    end
    children
end
doc = Nokogiri::XML(open('junk_from_excel.xml').read)
doc.encoding = 'utf-8'
doc.xpath('//Data[@Type="String"]').each { |n| font_killer(n.children) }

I tested this a bit but I'd recommend that you test it a bit more on your real data.

In either case, you lose the ss and x namespace prefixes; this is fair since the XML doesn't doesn't properly declare namespaces so Nokogiri pretends that they don't exist. If you want the namespaces, then add these attributes

xmlns:x="urn:schemas-microsoft-com:office:excel"  
xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"

to the <Worksheet> element before the ss:Name attribute and then update the XPath expressions to include the namespace:

doc.xpath('//ss:Data[@ss:Type="String"]|//Data[@ss:Type="String"]')

There's probably a better way of expressing that but my XPath-Fu isn't that strong; that should get the job done though.

You could extract the contents before removing them like this:

doc.xpath('//*[@Face]').each do |node|
  node.children.each do |child|
    node.parent << child
  end
  node.remove
end

source: http://rubyforge.org/pipermail/nokogiri-talk/2009-June/000333.html

继续阅读：nokogiri ruby xml

Strip Font Tags from XML with nokogiri

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？