How to parse nested ul/li tags using Hpricot
I have the following HTML structure:
<div id='my_categories'>
  <ul>
    <li><a href="1">Animals, Birds, & Pets</a></li>
    <li><a href="2">Ask the Expert</a>
      <ul>
        <li><a href='21'>Health Care Providers</a></li>
        <li><a href='22'>Influenza</a>
          <ul>
            <li><a href='221'>Flu Viruses (2)</a></li>
            <li><a href='222'>Test</a></li>
          </ul>
        </li>
      </ul>
    </li>
  </ul>
</div>
This is how the web page looks
I have a categories table with the fields category_name, category_url and parent_id.
I need to save each category and sub-category. The parent_id indicates which category a sub-category belongs to; it is null for top-level categories.
How can I parse this HTML structure using Hpricot and save the data to my database? Please help.
My table looks like:
id  category_name           category_url  parent_id
1   Animals, Birds, & Pets  null          null
2   Ask the Expert          null          null
3   Health Care Providers   null          2
4   Influenza               null          2
5   Flu Viruses             null          4
6   Test                    null          4
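To make the parent_id relationship concrete, here is a minimal sketch in plain Ruby that assigns ids and parent ids to a nested list. The TREE constant and the flatten_categories helper are hypothetical names made up for illustration. The tree is written out by hand rather than parsed, and the rows are plain hashes rather than database records.

```ruby
# Hypothetical in-memory copy of the nested list: [name, url, children].
TREE = [
  ["Animals, Birds, & Pets", "1", []],
  ["Ask the Expert", "2", [
    ["Health Care Providers", "21", []],
    ["Influenza", "22", [
      ["Flu Viruses", "221", []],
      ["Test", "222", []]
    ]]
  ]]
]

# Walk the tree depth-first. Each node gets the next free id and records
# the id of the node it is nested under (nil for top-level nodes).
def flatten_categories(nodes, parent_id = nil, rows = [])
  nodes.each do |name, url, children|
    row = { :id => rows.size + 1, :category_name => name,
            :category_url => url, :parent_id => parent_id }
    rows << row
    flatten_categories(children, row[:id], rows)
  end
  rows
end

flatten_categories(TREE).each do |row|
  puts row.values_at(:id, :category_name, :parent_id).inspect
end
```

Because a parent is always appended before its children, the parent's id is known by the time each child row is built.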
Thanks in advance
Below is the code that worked for me...
require 'rubygems'
require 'hpricot'
require 'open-uri' # only needed if categories_page is a URL

doc = Hpricot(open(categories_page).read)

# "ul/li" matches the <li> items at every nesting level, in document order.
doc.search("ul/li").each do |li|
  # Strip a trailing count such as " (2)" from the link text.
  category_name = li.search('a[@href]').first.inner_text.gsub(/ *\(.*?\)/, '')
  category_url = li.search('a').first[:href]
  category = Category.find_or_create_by_name(category_name, :url => category_url)
  puts "---------- #{category.name} ------------"

  # Links of all nested items. Deeper descendants are matched here too, but
  # they are re-parented when their own direct parent <li> is visited later.
  nodes = li.search("ul/li/a")
  unless nodes.empty?
    nodes.each do |node|
      node_name = node.inner_text.gsub(/ *\(.*?\)/, '')
      node_url = node.attributes['href']
      sub_category = Category.find_by_name(node_name)
      if sub_category.blank?
        sub_category = Category.create(:name => node_name, :url => node_url, :parent_category_id => category.id)
        puts " #{sub_category.name}"
      else
        sub_category.update_attribute('parent_category_id', category.id)
        puts " #{category.name} --> #{sub_category.name}"
      end
    end
  end
end
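An alternative is to recurse one level at a time, so that each item is assigned its direct parent exactly once. The sketch below is only an illustration: it uses REXML from the Ruby distribution as a stand-in for Hpricot so that it runs without the gem, the "&" in the sample markup is escaped as "&amp;" because REXML requires well-formed XML, and it collects plain hashes instead of calling the Category model. SAMPLE_HTML and walk_list are hypothetical names.

```ruby
require 'rexml/document'

# Sample markup from the question, with "&" escaped for REXML.
SAMPLE_HTML = <<~MARKUP
  <div id='my_categories'>
    <ul>
      <li><a href="1">Animals, Birds, &amp; Pets</a></li>
      <li><a href="2">Ask the Expert</a>
        <ul>
          <li><a href='21'>Health Care Providers</a></li>
          <li><a href='22'>Influenza</a>
            <ul>
              <li><a href='221'>Flu Viruses (2)</a></li>
              <li><a href='222'>Test</a></li>
            </ul>
          </li>
        </ul>
      </li>
    </ul>
  </div>
MARKUP

# Hypothetical helper: record each direct <li> child of a <ul> together with
# its parent's id, then recurse into any <ul> nested directly inside it.
def walk_list(ul, parent_id, rows)
  ul.get_elements('li').each do |li|
    link = li.elements['a']
    name = link.text.gsub(/ *\(.*?\)/, '') # drop counts such as " (2)"
    rows << { :id => rows.size + 1, :name => name,
              :url => link.attributes['href'], :parent_id => parent_id }
    own_id = rows.last[:id]
    li.get_elements('ul').each { |child| walk_list(child, own_id, rows) }
  end
  rows
end

top_ul = REXML::Document.new(SAMPLE_HTML).root.elements['ul']
walk_list(top_ul, nil, []).each do |row|
  puts row.values_at(:id, :name, :url, :parent_id).inspect
end
```

The same idea should carry over to Hpricot by selecting only the direct li children at each level and creating or updating a Category where this sketch appends a hash.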