XML API response to JSON, or hash?

So, I'm using an API which happens to only return XML, and that sucks. What I want to do is create a database entry for each record that gets returned from the API, but I'm not sure how.

The XML that gets returned is huge and has lots of whitespace characters in it... is that normal? Here is a sample of some of the XML.

<!-- ... -->
        <attribute name="item_date">May 17, 2011</attribute>
        <attribute name="external_url">http://missionlocal.org/2011/05/rain-camioneta-part-i/</attribute>
            <attribute name="source" id="2478">Mission Loc@l</attribute>
            <attribute name="excerpt"></attribute>
    </attributes>
</newsitem>

<newsitem
    id="5185807"
    title="Lost Chrome messenger PBR bag and contents (marina / cow hollow)"
    url="http://sf.everyblock.com/lost-and-found/by-date/2011/5/17/5185807/"
    location_name="Van Ness and Filbert"
    schema="lost-and-found"
    schema_id="7"
    pub_date="May 17, 2011, 12:15 p.m."
    longitude="-122.424129925"
    latitude="37.7995100578"
>
    <attributes>
        <attribute name="item_date">May 17, 2011</attribute>
        <attribute name="external_url">http://sfbay.craigslist.org/sfc/laf/2386709187.html</attribute>
    </attributes>
</newsitem>

<newsitem
    id="5185808"
    title="Plywood Update: Dumplings &amp; Buns Aims To Be &quot;Beard Papa Of Chinese Buns&quot;"
    url="http://sf.everyblock.com/news-articles/by-date/2011/5/17/5185808/"
    location_name="2411 California Street"
    schema="news-articles"
    schema_id="5"
    pub_date="May 17, 2011, 12:15 p.m."
    longitude="-122.434000442"
    latitude="37.7888985667"
>
    <attributes>
        <attribute name="item_date">May 17, 2011</attribute>
        <attribute name="external_url">http://sf.eater.com/archives/2011/05/17/dumplings_buns_aims_to_be_beard_papa_of_chinese_buns.php</attribute>
            <attribute name="source" id="2155">Eater SF</attribute>
            <attribute name="excerpt"></attribute>
    </attributes>
</newsitem>

<newsitem
    id="5185809"
    title="Freebies: This week, Piazza D&#39;Angelo (22 Miller..."
    url="http://sf.everyblock.com/news-articles/by-date/2011/5/17/5185809/"
    location_name="22 Miller"
    schema="news-articles"
    schema_id="5"
    pub_date="May 17, 2011, 12:15 p.m."
    longitude="-122.408894997"
    latitude="37.7931966922"
>
    <attributes>
        <attribute name="item_date">May 17, 2011</attribute>
        <attribute name="external_url">http://sf.eater.com/archives/2011/05/17/freebies_24.php</attribute>
            <attribute name="source" id="2155">Eater F</attribute>
            <attribute name="excerpt"></attribute>
<!-- ... -->

Any ideas?


That's not quite valid XML; it looks like some sort of escaped-string representation of XML, perhaps console output, and it doesn't seem to be complete. Other than that, it's fairly normal XML. Here's a smaller excerpt, unescaped and formatted:

<newsitem
    id="5185807"
    title="Lost Chrome messenger PBR bag and contents (marina / cow hollow)"
    url="http://sf.everyblock.com/lost-and-found/by-date/2011/5/17/5185807/"
    location_name="Van Ness and Filbert"
    schema="lost-and-found"
    schema_id="7"
    pub_date="May 17, 2011, 12:15 p.m."
    longitude="-122.424129925"
    latitude="37.7995100578">
    <attributes>
        <attribute name="item_date">May 17, 2011</attribute>
        <attribute name="external_url">http://sfbay.craigslist.org/sfc/laf/2386709187.html</attribute>
    </attributes>
</newsitem>

You'll just need to determine what you want to extract and put in the database, and let that drive your DB design decision. Do you need multiple models with relationships intact, or are you just concerned with a subset of the data?
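
If you do want a row per record, here's a rough sketch of what that could look like, assuming the full response wraps the <newsitem> elements in a single root element and that you have some sort of NewsItem model to write to (the model and the column names here are my assumptions, not part of the API):

require 'nokogiri'

doc = Nokogiri::XML(xml) # xml is the raw response body from the API

doc.search('newsitem').each do |item|
  record = {
    title:         item['title'],
    url:           item['url'],
    location_name: item['location_name'],
    schema:        item['schema'],
    pub_date:      item['pub_date'],
    longitude:     item['longitude'].to_f,
    latitude:      item['latitude'].to_f
  }
  ext = item.at('attribute[name="external_url"]')
  record[:external_url] = ext.text if ext

  NewsItem.create(record) # hypothetical ActiveRecord model; use whatever persistence you have
end

That also covers the hash-or-JSON part of the question: once the XML is parsed, you build whatever plain Ruby hash you want, and you can call to_json on it (after require 'json') if you really need JSON.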


Whitespace in XML is normal and doesn't affect the quality of the data it contains. A good parser, which is how you should be processing the XML, won't care; it will give you access to the data whether the whitespace is there or not.

Nokogiri is a favorite of mine, and seems to be the de facto standard XML parser for Ruby nowadays. It is very easy to use, but you will have to learn how to tell it what nodes you want.

To get you going, here is some of the terminology:

  • Node is the term for a tag after it has been parsed.
  • Nodes have attributes, which can be accessed using node_var['attribute'].
  • Node text can be accessed using node_var.text or node_var.content or node_var.inner_text.
  • NodeSet is basically an array of Nodes.
  • at returns the first node matching the accessor you give the parser. % is an alias.
  • search returns a NodeSet of nodes matching the accessor you give the parser. / is an alias.

Here's how we can parse the snippet of XML:

require 'nokogiri'

xml = <<EOT
<newsitem
    id="5185807"
    title="Lost Chrome messenger PBR bag and contents (marina / cow hollow)"
    url="http://sf.everyblock.com/lost-and-found/by-date/2011/5/17/5185807/"
    location_name="Van Ness and Filbert"
    schema="lost-and-found"
    schema_id="7"
    pub_date="May 17, 2011, 12:15 p.m."
    longitude="-122.424129925"
    latitude="37.7995100578">
    <attributes>
        <attribute name="item_date">May 17, 2011</attribute>
        <attribute name="external_url">http://sfbay.craigslist.org/sfc/laf/2386709187.html</attribute>
    </attributes>
</newsitem>
EOT

doc = Nokogiri::XML(xml)
doc.at('newsitem').text # => "\n    \n        May 17, 2011\n        http://sfbay.craigslist.org/sfc/laf/2386709187.html\n    \n"
(doc % 'attribute').content # => "May 17, 2011"
doc.at('attribute[name="external_url"]').inner_text # => "http://sfbay.craigslist.org/sfc/laf/2386709187.html"

doc.at('newsitem')['id'] # => "5185807"

newsitem = doc.at('newsitem')
newsitem['title'] # => "Lost Chrome messenger PBR bag and contents (marina / cow hollow)"

attributes = doc.search('attribute').map{ |n| n.text } 
attributes # => ["May 17, 2011", "http://sfbay.craigslist.org/sfc/laf/2386709187.html"]

attributes = (doc / 'attribute').map{ |n| n.text } 
attributes # => ["May 17, 2011", "http://sfbay.craigslist.org/sfc/laf/2386709187.html"]

All the accesses above use CSS selectors, just like you'd use when writing web pages. CSS is simpler and usually clearer, but Nokogiri also supports XPath, which is very powerful and lets you offload a lot of processing to the underlying libxml2 library, which runs very fast.
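
For comparison, here's what a few of the same lookups look like with XPath, using the doc parsed above (Nokogiri's xpath and at_xpath methods take XPath expressions directly):

doc.at_xpath('//newsitem')['id'] # => "5185807"
doc.at_xpath('//attribute[@name="external_url"]').text # => "http://sfbay.craigslist.org/sfc/laf/2386709187.html"
doc.xpath('//attribute').map{ |n| n.text } # => ["May 17, 2011", "http://sfbay.craigslist.org/sfc/laf/2386709187.html"]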

Nokogiri works very nicely with Ruby's Open-URI, so if you're retrieving the XML from a website, you can do it like this:

require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(open('http://www.example.com'))
doc.to_html.size # => 2825

That's parsing HTML, which Nokogiri also excels at, but the process is the same for XML; just replace Nokogiri::HTML with Nokogiri::XML.
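
For example, a quick sketch (the URL here is only a placeholder, not the real API endpoint; on newer Rubies use URI.open instead of a bare open):

require 'open-uri'
require 'nokogiri'

# Placeholder URL -- substitute the API endpoint you're actually calling.
doc = Nokogiri::XML(open('http://www.example.com/feed.xml'))
doc.search('newsitem').size # number of <newsitem> records in the feed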

See "How to avoid joining all text from Nodes when scraping" also.
