Parsing SEC Edgar XML file using Ruby into Nokogiri
I'm having problems parsing the SEC Edgar files
Here is an example of this file.
The end result开发者_如何学Python is I want the stuff between <XML>
and </XML>
into a format I can access.
Here is my code so far that doesn't work:
scud = open("http://sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt")
full = scud.read
full.match(/<XML>(.*)<\/XML>/)
Ok, there are a couple of things wrong:
- sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt is NOT XML, so Nokogiri will be of no use to you unless you strip off all the garbage from the top of the file, down to where the true XML starts, then trim off the trailing tags to keep the XML correct. So, you need to attack that problem first.
- You don't say what you want from the file. Without that information we can't recommend a real solution. You need to take more time to define the question better.
Here's a quick piece of code to retrieve the page, strip the garbage, and parse the resulting content as XML:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::XML(
open('http://sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt').read.gsub(/\A.+<xml>\n/im, '').gsub(/<\/xml>.+/mi, '')
)
puts doc.at('//schemaVersion').text
# >> X0603
I recommend practicing in IRB and reading the docs for Nokogiri
> require 'nokogiri'
=> true
> require 'open-uri'
=> true
> doc = Nokogiri::HTML(open('http://sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt'))
> doc.xpath('//firstname')
=> [#<Nokogiri::XML::Element:0x80c18290 name="firstname" children=[#<Nokogiri::XML::Text:0x80c18010 "Joshua">]>, #<Nokogiri::XML::Element:0x80c14d48 name="firstname" children=[#<Nokogiri::XML::Text:0x80c14ac8 "Patrick">]>, #<Nokogiri::XML::Element:0x80c11fd0 name="firstname" children=[#<Nokogiri::XML::Text:0x80c11d50 "Brian">]>]
that should get you going
Given this was asked a year back, the answer is probably OBE, but what the fellow should do is examine all of the documents that are on the site, and notice the actual filing details can be found at:
http://sec.gov/Archives/edgar/data/1475481/000147548109000001/0001475481-09-000001-index.htm
Within this, you will see that the XML document is is after is already parsed out ready for further manipulation at:
http://sec.gov/Archives/edgar/data/1475481/000147548109000001/primary_doc.xml
Be warned, however, the actual file name at the end is determined by the submitter of the document, not by the SEC. Therefore, you cannot depend on the document always being 'primary_doc.xml'.
精彩评论