Groovy Scraping Google Search Using HttpBuilder - Result doesn't seem to parse as html or xml
I am writing a simple Groovy script to request simple searches from Google Search and then parse the result set. I know that there is the Custom Search API - but that won't work for me, so please don't point me in that direction.
I am using HTTPBuilder to make the request. Every other approach I tried ("string".toURL(), HTMLCleaner, ...) gets an HTTP 403 response. I assume that is because the request headers they send are not acceptable to Google.
With HTTPBuilder I can make the request and get a non-403 response. That said, when I println the "html" variable (see the code snippet below), it does not look like HTML or XML. It looks like plain text.
Here is the HTTPBuilder snippet that gets the response:
//build query
def query = ""
queryTerms.eachWithIndex({term , i -> (i > 0) ? (query += "+" + term) : (query += term)})
def http = new HTTPBuilder(baseUrl)
http.request(Method.GET, ContentType.TEXT) { req ->
    headers.'User-Agent' = 'Mozilla/5.0'
}
def html = http.get(path : searchPath, contentType : ContentType.HTML, query : [q:query])
// println html
assert html instanceof groovy.util.slurpersupport.GPathResult
assert html.HEAD.size() == 1
assert html.BODY.size() == 1
I do get a result back, so I try to parse it as shown below. I give the actual page structure first and then the parsing code. However, every element I try to select comes back empty.
Actual Structure:
html->body#gsr->div#main->div->div#cnt->div#rcnt->div#center_col->div#res.med->div#search->div#ires->ol#rso->
Code:
def mainDiv = html.body.div.findAll {it.@id.text() == 'main'}
println mainDiv
def rcntDiv = mainDiv.div.div.div.findAll { it.@id.text() == 'rcnt' }
println rcntDiv
def searchDiv = rcntDiv.div.findAll { it.@id.text == "center_col" }.div.div.findAll { it.@id.text == "search" }
println searchDiv
searchDiv.div.ol.li.each { println it }
So is this just not possible? Is Google spoofing me and sending back garbage data, or do I need to tune my HTTPBuilder setup some more? Any ideas?
You didn't mention the search URL you were using, so I can't speak to why you were getting 403s. The following code does a search with the standard Google site, and works for me without any Forbidden or other status errors:
@Grab(group='org.codehaus.groovy.modules.http-builder', module='http-builder', version='0.5.1')
import static groovyx.net.http.Method.GET
import static groovyx.net.http.ContentType.*

def http = new groovyx.net.http.HTTPBuilder('http://www.google.com')
def queryTerms = ['queen', 'of', 'hearts']

http.request(GET, HTML) { req ->
    uri.path = '/search'
    uri.query = [q: queryTerms.join('+'), hl: 'en']
    headers.'User-Agent' = 'Mozilla/5.0'

    // On success the body arrives already parsed into a GPathResult
    response.success = { resp, html ->
        println "Site title: ${html.HEAD.TITLE.text()}"
    }
    // Print the status line for any non-success response
    response.failure = { resp ->
        println resp.statusLine
    }
}
It prints the site title, which shows that the HTML is being parsed successfully:
Site title: queen+of+hearts - Google Search
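As for why the selectors in the question come back empty, here is a rough sketch of what I suspect is going on. It is not tested against the live page. The markup is a made-up, trimmed-down stand-in for the results page, and findById is a hypothetical helper, not part of HTTPBuilder. It assumes element names come back in upper case (which is why HEAD and TITLE work above), so html.body.div matches nothing. Also, printing a GPathResult prints only its text content, which would explain why the response "looks like text". Finally, .text without parentheses is read as a child lookup rather than a method call, so the comparison should use .text().

```groovy
// Groovy 3+ import; on Groovy 2.x drop it, since groovy.util.XmlSlurper is imported by default
import groovy.xml.XmlSlurper

// Hypothetical stand-in for the parsed page, with upper-case element names
// to mimic what HTTPBuilder's HTML parser is assumed to produce.
def sample = '''
<HTML>
  <HEAD><TITLE>queen of hearts - Google Search</TITLE></HEAD>
  <BODY id="gsr">
    <DIV id="main">
      <DIV id="search">
        <OL id="rso">
          <LI>first result</LI>
          <LI>second result</LI>
        </OL>
      </DIV>
    </DIV>
  </BODY>
</HTML>
'''

def html = new XmlSlurper().parseText(sample)

// GPath names are case sensitive: the lower-case lookup finds nothing
println "body: ${html.body.size()}, BODY: ${html.BODY.size()}"

// Hypothetical helper: search the whole tree by id instead of
// spelling out every DIV level by hand.
def findById = { root, id ->
    root.depthFirst().find { it.@id.text() == id }
}

def rso = findById(html, 'rso')
def results = rso.LI.collect { it.text() }
println results
```

Searching by id this way also avoids depending on the exact nesting, which Google is free to change at any time.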