Groovy Scraping Google Search Using HttpBuilder - Result doesn't seem to parse as html or xml

2023-04-06 02:31

I am writing a simple Groovy script to request simple searches from Google Search and then parse the result set. I know that there is the Custom Search API - but that won't work for me, so please don't point me in that direction.

I am using HTTPBuilder to make the request. Every other method I tried ("string".toURL(), HTMLCleaner, and so on) gets an HTTP 403 response. I assume this is because the request headers are not ones Google accepts.

I can get HTTPBuilder to make the request without a 403. However, when I println the "html" variable (see the code snippet below), the output does not look like HTML or XML. It looks like plain text.

Here is the HTTPBuilder snippet that gets the response:

    //build query
    def query = ""
    queryTerms.eachWithIndex({ term, i -> (i > 0) ? (query += "+" + term) : (query += term) })

    def http = new HTTPBuilder(baseUrl)

    http.request(Method.GET,ContentType.TEXT) { req ->
        headers.'User-Agent' = 'Mozilla/5.0' }

    def html = http.get(path : searchPath, contentType : ContentType.HTML, query : [q:query])
    // println html
    assert html instanceof groovy.util.slurpersupport.GPathResult
    assert html.HEAD.size() == 1
    assert html.BODY.size() == 1

I do get a result back, so I try to parse it as shown below. I will give the actual structure first and then the parsing code. However, none of the parsed elements contain anything.

Actual Structure:

html->body#gsr->div#main->div->div#cnt->div#rcnt->div#center_col->div#res.med->div#search->div#ires->ol#rso->

Code:

    def mainDiv = html.body.div.findAll {it.@id.text() == 'main'}
    println mainDiv
    def rcntDiv = mainDiv.div.div.div.findAll { it.@id.text() == 'rcnt' }
    println rcntDiv
    def searchDiv = rcntDiv.div.findAll { it.@id.text == "center_col" }.div.div.findAll { it.@id.text == "search" }
    println searchDiv
    searchDiv.div.ol.li.each { println it }

So is this just not possible? Is Google detecting the script and sending me garbage data, or do I need to tune my HTTPBuilder setup some more? Any ideas?

You didn't mention the search URL you were using, so I can't speak to why you were getting 403s. The following code does a search with the standard Google site, and works for me without any Forbidden or other status errors:

    @Grab(group='org.codehaus.groovy.modules.http-builder', module='http-builder', version='0.5.1')

    import static groovyx.net.http.Method.GET
    import static groovyx.net.http.ContentType.*

    def http = new groovyx.net.http.HTTPBuilder('http://www.google.com')

    def queryTerms = ['queen', 'of', 'hearts']

    http.request(GET, HTML) { req ->
        uri.path = '/search'
        uri.query = [q: queryTerms.join('+'), hl: 'en']

        headers.'User-Agent' = 'Mozilla/5.0'

        response.success = { resp, html ->
            println "Site title: ${html.HEAD.TITLE.text()}"
        }
        response.failure = { resp ->
            println resp.statusLine
        }
    }

It prints the site title, which shows that the HTML is being parsed successfully:

    Site title: queen+of+hearts - Google Search
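
As for why the traversal in the question prints nothing, one likely cause (a guess on my part, since I can't see the exact response you got) is element-name case. HTTPBuilder's HTML parser is built on NekoHTML, which by default reports element names in upper case and attribute names in lower case. That is why html.HEAD.TITLE works above, and why html.body.div would match nothing. Below is a minimal sketch that runs offline. It assumes Groovy 3 or later for the groovy.xml.XmlSlurper import, and the sample markup is a made-up, shortened stand-in for the structure in the question, not real Google output.

```groovy
import groovy.xml.XmlSlurper  // Groovy 3+; on older versions use groovy.util.XmlSlurper

// Hypothetical stand-in for the parsed response: upper-case element names,
// lower-case attribute names, as NekoHTML would report them by default.
def sample = '''
<HTML>
  <HEAD><TITLE>queen+of+hearts - Google Search</TITLE></HEAD>
  <BODY id="gsr">
    <DIV id="main"><DIV id="cnt"><DIV id="search">
      <OL id="rso">
        <LI>first result</LI>
        <LI>second result</LI>
      </OL>
    </DIV></DIV></DIV>
  </BODY>
</HTML>
'''.trim()
def html = new XmlSlurper().parseText(sample)

// Lower-case names match nothing, so the lookup silently comes back empty.
def lowerCase = html.body.div.findAll { it.@id.text() == 'main' }

// Upper-case names match what the parser produced.
def upperCase = html.BODY.DIV.findAll { it.@id.text() == 'main' }

// A depth-first search avoids spelling out every intermediate DIV.
def resultList = html.'**'.find { it.@id.text() == 'rso' }
def results = resultList.LI.collect { it.text().trim() }

println "lower-case matches: ${lowerCase.size()}"
println "upper-case matches: ${upperCase.size()}"
results.each { println it }
```

In the real response handler, the same depth-first lookup would go inside response.success, using the html argument. Also note that two of the closures in the question compare it.@id.text without parentheses. That is a property lookup rather than a call to text(), so those comparisons probably never match either.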
