
Please help: How can I scrape this web page?

There's a site that offers a search service. You enter a number, search, and it returns results. What I want to do is run that search programmatically through ColdFusion instead of having to go to the site and search manually.

This is what the form in the web page I'd like to read/scrape looks like (as seen when viewing the page source):

<form id="frmNumID" name="frmNum" action="" method="post">

    <TABLE border=0 cellPadding=0 cellSpacing=0>
     <TR>
      <TD align="center">
         <label class="NumLabel" for="Num" ACCESSKEY="1">ENTER NUM:</label>
        <input class="NumInput" id="Num" name="inputNum"  onfocusin="select()"  title="Num Input" tabindex="1" type="text" value=""  size ="29" maxlength="17" >&nbsp;&nbsp;

      </TD>

      <TD align="center">
         <input class="NumInput" title="Submit Num" tabindex="2" type="image" src="/include/pics/SubmitBtn.jpg" value="submit" ACCESSKEY="2">
      </TD>
     </TR>
     </TABLE>

     <TABLE border=0 cellPadding=0 cellSpacing=0>
     <TR>    
      <TD colspan="2" align="center">

        <input type="radio" name="displayType" value="NONE"   Checked  />No Pictures&nbsp;&nbsp;                          
        <input type="radio" name="displayType" value="STUFF"    /> Other Stuff&nbsp;&nbsp;                
        <input type="radio" name="displayType" value="MORESTUFF"    /> More Other Stuff  
      </TD>
     </TR>

    </TABLE>
    <div id="NUMMsg"></div>

  </form>

The only field I really care about is the Num input field. I want to post a value to that field, run the search, and get the results in my ColdFusion code. This is what I have so far:

<cfhttp url="http://www.someurl.com/"
        method="POST">
    <cfhttpparam name="Num" type="FormField" value="123456789123456" />
</cfhttp>
<cfdump var="#cfhttp.filecontent#" />

But when I go to the page the dump just says "Connection Failure". What am I doing wrong?


It's not clear from your sample code what the form's submit URL is: the action attribute is blank. Maybe the site sets it using JavaScript?

You probably also need to post the displayType form variable, since it is a radio button group and the server probably expects a value for it.
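A minimal sketch of what that might look like. It assumes the server reads fields by their name attributes (so inputNum rather than the id Num) and that the form posts back to the page's own URL; both are assumptions, since the real action URL isn't visible in the markup. The result variable name searchResult is just an illustrative choice.

```cfm
<!--- Sketch only: the URL is the placeholder from the question, and posting
      to it is an assumption because the form's action attribute is blank. --->
<cfhttp url="http://www.someurl.com/" method="POST" result="searchResult">
    <!--- Form fields are submitted by name, so this uses name="inputNum", not the id "Num" --->
    <cfhttpparam type="formfield" name="inputNum" value="123456789123456" />
    <!--- Send the radio button value that is checked by default in the form --->
    <cfhttpparam type="formfield" name="displayType" value="NONE" />
</cfhttp>
<cfdump var="#searchResult.fileContent#" />
```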

Beware of screen scraping: it can be a maintenance nightmare. If there is any way to use an official API they provide, you should, because as soon as they change their code (post URL, markup, etc.) your code could very well break.


Ok, this website suggested a solution: http://australiansearchengine.wordpress.com/2009/09/28/cfhttp-connection-failure/

They suggested adding the following cfhttpparam tags:

<cfhttpparam type="header" name="accept-encoding" value="deflate;q=0">
<cfhttpparam type="header" name="te" value="deflate;q=0"> 

Now I no longer get a connection failure :)
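For reference, a sketch of how the whole request might look with those headers in place. In HTTP terms, q=0 marks deflate-compressed responses as not acceptable. The URL is still the placeholder from the question, the field name inputNum is taken from the form's name attribute (an assumption about what the server expects), and searchResult is a hypothetical variable name.

```cfm
<!--- Sketch only: placeholder URL; timeout value is an arbitrary choice --->
<cfhttp url="http://www.someurl.com/" method="POST" result="searchResult" timeout="30">
    <!--- Tell the server that deflate-compressed responses are not acceptable --->
    <cfhttpparam type="header" name="accept-encoding" value="deflate;q=0" />
    <cfhttpparam type="header" name="te" value="deflate;q=0" />
    <!--- Assumption: the server reads the field by its name attribute --->
    <cfhttpparam type="formfield" name="inputNum" value="123456789123456" />
</cfhttp>

<!--- statusCode holds a string such as "200 OK"; check it before using the body --->
<cfif left(searchResult.statusCode, 3) eq "200">
    <cfdump var="#searchResult.fileContent#" />
<cfelse>
    <cfdump var="#searchResult.statusCode#" />
    <cfdump var="#searchResult.errorDetail#" />
</cfif>
```

Dumping statusCode and errorDetail on failure tends to say more than the bare "Connection Failure" text in fileContent.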
