Parsing a document in a table

How do I parse a document stored in a table and send it as a JSON file to another database?

Detailed description: I have crawled websites using Anemone and stored the data in a table. I now need to parse it and transfer it as a JSON file to another server. I think I will first have to convert the document in the table into a Nokogiri document, which can then be parsed and converted to a JSON file. Any idea how I can convert the doc into a Nokogiri document? Or does anyone have another idea for parsing it and sending it as a JSON file?


Nokogiri is your best bet for the HTML parsing, but as for converting it to JSON you're on your own from what I can tell.

Once you have it parsed via Nokogiri it shouldn't be terribly hard to extract the elements you need and generate JSON that represents them. What you're doing isn't a very common task, so you'll have to bridge the gap between Nokogiri and whichever gem you're using to generate the JSON.


Okay, I found the answer a long time ago. I used REST to send the message from one application to another, sending it across as a hash. And, as expected, I used Nokogiri to parse the table.

require 'net/http'
require 'uri'

# Post the page hash to the receiving application as form fields.
def post_me
  @page_hash = page_to_hash
  Net::HTTP.post_form(URI.parse('http://127.0.0.1:3007/element_data/save.json'), @page_hash)
end

The code above sends the hash from one application to another using net/http.
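
Since the question asked for JSON, a variation on that call is to send the hash as a JSON body instead of form fields. This is only a sketch: build_json_post and send_json are hypothetical helper names, and it assumes the receiving endpoint accepts application/json.

```ruby
require 'net/http'
require 'uri'
require 'json'

# Hypothetical helper: build a POST request whose body is the hash as JSON.
# Returns the parsed URI together with the unsent request object.
def build_json_post(url, payload)
  uri = URI.parse(url)
  request = Net::HTTP::Post.new(uri.request_uri, 'Content-Type' => 'application/json')
  request.body = JSON.generate(payload)
  [uri, request]
end

# Hypothetical helper: send the request and return the Net::HTTPResponse.
def send_json(url, payload)
  uri, request = build_json_post(url, payload)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
end

# Usage (needs the receiving application to be running):
# response = send_json('http://127.0.0.1:3007/element_data/save.json', { title: 'Demo' })
# puts response.code
```

Unlike form fields, a JSON body keeps nested values such as arrays of links intact, which may matter for the link arrays in the hash.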

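require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'domainatrix'

def page_to_hash
    #page = self.page.sub(/^<!DOCTYPE html(.*)$/, '<!DOCTYPE html>')
    hash = {}
    # Parse the stored HTML into a Nokogiri document.
    doc = Nokogiri::HTML(self.page)
    # Debugging aid: print the name of every element in the document.
    doc.search('*').each do |n|
      puts n.name
    end
    # ... (excerpt; the method continues)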

This uses Nokogiri to parse the page data in my model. The page table held the whole body of a web page.

    file_type = []

    # Collect hrefs that point at document files (.pdf, .doc, .xls, .csv, .txt).
    file_type_data = doc.xpath('//a/@href[contains(., ".pdf") or contains(., ".doc")
                          or contains(., ".xls") or contains(., ".csv") or contains(., ".txt")]')
    file_type_data.each do |href|
      link = href.value  # href is a Nokogiri attribute node; take its string value
      # Turn site-relative links into absolute ones.
      link = "http://" + website_url + link if link.start_with?("/")
      file_type << link
    end
    file_type_str = file_type.join(",")
    hash = {:head => head, :title => title, :body => self.body,
      :image => images_str, :file_type => file_type_str, :paragraph => para_str, :description => descr_str, :keyword => key_str,
      :page_url => self.url, :website_id => self.parent_request_id, :website_url => website_url,
      :depth => self.depth, :int_links => @int_links_arr, :ext_links => @ext_links_arr
    }

The above is a simple parsing example and shows how I formed my hash.
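
As a possible alternative to prefixing "http://" by hand, the standard uri library can resolve relative links against the page URL. The sketch below is an assumption-laden illustration, not the original method: document_links and DOCUMENT_EXTENSIONS are hypothetical names, the extension list is my own, and it compares the file extension exactly rather than using a contains match anywhere in the URL.

```ruby
require 'uri'

# Assumed list of document extensions for this sketch.
DOCUMENT_EXTENSIONS = %w[.pdf .doc .xls .csv .txt].freeze

# Hypothetical helper: keep only links to document files and make them absolute.
def document_links(hrefs, base_url)
  hrefs.each_with_object([]) do |href, links|
    begin
      absolute = URI.join(base_url, href)
    rescue URI::Error
      next # skip hrefs that cannot be parsed as URIs
    end
    ext = File.extname(absolute.path.to_s).downcase
    links << absolute.to_s if DOCUMENT_EXTENSIONS.include?(ext)
  end
end

hrefs = ['/files/a.pdf', 'report.xls', 'index.html']
puts document_links(hrefs, 'http://example.com/docs/page.html').join(',')
```

URI.join handles both site-relative paths (starting with "/") and paths relative to the current page, and leaves links that are already absolute unchanged.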
