Most efficient way to parse and reformat data with Nokogiri & Sinatra
I'm working on reformatting HTML output from a search query for an inventory manager for a number of car dealers. There's no direct DB access and no information available from the service creators, so I decided to attempt to parse and reformat the data with Nokogiri and generate new pages of results based on the search query.
On first load of the page, I'm just using a default search to generate the first results.
For the search to work, I'm sending the query to a URL like this:
post '/search/?:search_query' do
url = "http://domain.com/v/?DealerId=" + settings.dealer_id + "&maxrows=10&#{params[:search_query]}"
doc = Nokogiri::HTML(open(url))
doc.css("td:nth-child(5) .ForeColor4").each do |msrp|
session["msrp"] = msrp.inner_html
end
doc.css("td:nth-child(4) .ForeColor4").each do |price|
session["price"] = price.inner_html
end
erb :index
end
I know there's got to be a smarter way to do this.
Edit:
An example URL to request data:
http://domain.com/?DealerId=1234&object=list&lang=en&MAKE=&MODEL=&maxrows=50&MinYear=&MaxYear=2011&Type=N&MinPrice=&MaxPrice=&STYLE=&ExtColor=&MaxMiles=&StockNo=
A description of the HTML generated:
Unfortunately, it's old code that's almost entirely table-based, has inline-styles and lacks classes or ids in most areas.
An example of a CSS selector:
td:nth-child(5) .ForeColor4
An XPath selector:
//td[(((count(preceding-sibling::*) + 1) = 5) and parent::*)]//*[contains(concat( " ", @class, " " ), concat( " ", "ForeColor4", " " ))]
I've also looked at mechanize or hpricot as possibilities but I'm not aware of the best tools for the job as I haven't attempted screen-scraping before.
Summary: I want to pull the data from the HTML, temporarily store it in a variable / session / cookie (data changes several times per day), and then be able to reformat the output into my own HTML/CSS styling.
Personally, I'd decouple the scraping from the user action. Have an independent process scrape and fill your database. This will improve performance drastically, as the fetching, creating a DOM, parsing, then rendering output on every action is going to be slow.
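As a rough illustration of that separation, here is a minimal sketch that uses a JSON file as a stand-in for the database. The function names store_listings and load_listings, the file name, and the sample data are all hypothetical. The idea is that a separate process (a cron job, say) calls store_listings after scraping, while the Sinatra route only calls load_listings.

```ruby
require "json"
require "tmpdir"

# Called by the independent scraper process, not by a Sinatra route.
# Writes to a temporary file first, then renames it into place, so a reader
# does not pick up a half-written file (rename is atomic on POSIX filesystems).
def store_listings(path, listings)
  payload = { "fetched_at" => Time.now.to_i, "listings" => listings }
  tmp = "#{path}.tmp"
  File.write(tmp, JSON.generate(payload))
  File.rename(tmp, path)
end

# Called by the web app. Returns an empty array if nothing has been scraped yet.
def load_listings(path)
  return [] unless File.exist?(path)
  JSON.parse(File.read(path)).fetch("listings", [])
end

# Demo with hypothetical data in a temporary directory.
cache_path = File.join(Dir.mktmpdir, "listings.json")
before = load_listings(cache_path)
store_listings(cache_path, [{ "price" => "$18,500", "msrp" => "$21,000" }])
after = load_listings(cache_path)
```

Since the data reportedly changes several times per day, the scraper could run on a schedule, and the stored fetched_at timestamp gives the app a way to tell how stale the results are.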
doc.css("td:nth-child(5) .ForeColor4").each do |msrp|
  session["msrp"] = msrp.inner_html
end
doc.css("td:nth-child(4) .ForeColor4").each do |price|
  session["price"] = price.inner_html
end
You might want to use Nokogiri's at_css() method instead of the regular css(). at_css() finds the first occurrence of your target and returns only that one node, similar to calling .first on the NodeSet that css() returns.
That would simplify your lookups to this form:
session["msrp"] = doc.at_css("td:nth-child(5) .ForeColor4").inner_html
While testing, I'd probably add something like rescue 'msrp lookup failed' to the end of the lookups, just in case you've got bad accessors. Or you could let the code fail when inner_html() is called on the nil that at_css() returns for a selector that matches nothing. The rescue is just a slightly friendlier way to debug.
Otherwise your lookups seem to be decent.