开发者

Getting the contents of web page and processing it (Print or save to file)

I am a Real Estate appraiser and have some limited experience with vb and .net. I have a task which I perform which requires me to go to the conuty开发者_Python百科 appraisers web site and print a copy (to image bmp or jpg or directly to the default printer) of the current public record info for anywhere from a few pages to 1,000 plus records at a time.

I don't really get paid to do this part of the job so they don't care whether it takes me a few minutes or several hours to do this. I thought there must be a way to automate this process, so last week I started searching and test code snippets.

What I have to date opens an instance of IE; navigates to the reguested page; finds the form elemet for AcctNo; fills it in and submits the form. The page that comes back is formatted for screen presentation and is not suitable to be sent to the printer. There is however a link which when clicked returns a page formatted for pinting. Downside is it also brings up the print dialog which then has to be handled. I was able through several methods to click either the print button or the cancel button which leaves me with a document that is either sent to printer or sitting on screen.

The questions are:

  1. Is there a way to do this without displaying the Print Dialog? Maybe a HTTPRequest or HTTPWebREquest as I have no need to see the screens just need the final page.

  2. The resultant page is typically longer then letter by a few lines and wants to print on two pages. It would be nice to resize the page to fit and typically it will be the same resizing.

  3. If I stick with the print dialog either clicking print or cancel how can I intercept the document and decide by options set in the program wheter to send the file to the printer or save to image?

I am sure I am working too hard to do this and figured there was someone out there who could answer this in a second while I've spent the better part of 3 days trying to figure it out.

I enjoy the challenge of figuring thigs out so pointing me at a class or some site is greatfully appreciated, however example code is helpful as I am not a seasoned programmer and basically take examples and change them to fit my needs.

Thanks


What you are trying to do is called web scraping. While I'm not a VB guy (sorry!) generally I break down web scraping programs like this-

  1. Download an HTML file from a URL, using either GET or POST.
  2. Extract information from that file.
  3. Format and return that information, or repeat, potentially following links found in the HTML.

A Google search for "vb web scraping" offered a number of different techniques, but I'm not sure what you're comfortable with. Ideally, a language that's more web-friendly might be a good idea. I do most of my scraping in Python. Though I used to do it the hard way, I've recently begun experimenting with a library- mechanize- that makes my life a lot easier.

This bit of Python goes to Google's homepage, goes to the "About" link, and saves the HTML to a file.

import mechanize, re

browser = mechanize.Browser()
browser.open("http://google.com")

#find and follow a link with the text "About" in it
about_page = browser.follow_link(text_regex = re.compile("About"))

#open a local html file to save to
output_file = open("about.html","w")

for line in about_page.read():
  output_file.write(line + '\n')
output_file.close()

I know you don't know Python, but it's one of the easiest to learn languages, and seems better suited to this task than VB. Plus, plenty of people on StackOverflow speak it- compare ~14k tags to ~5k.


This sounds like something that could be easy to solve (easier than a programming approach anyway) using a GUI-based Macro Recording software such as AutoHotkey. The only difficulties I can see is finding the right form elements and the print link.


What you could try is instead of programmatically clicking the link to the print page, grab the URL in the same manner. If I recall, you retrieve that through the Document property on the web browser and find the link using the DOM tools, then grabbing the HREF attribute.

Once you have the URL for the link, use an HttpClient to download that URL to a file (or memory stream). Load the file into memory (unless it's in a memory stream, in which case it's already in memory) and remove the scripts that go for the printer (or just disable all scripts by replacing <script with <!--<script and </script> with </script>-->. You might need to do some more logic as most script tags have HTML comment tags within them.

Once you've got this all processed, save it to the disk as a temporary file and navigate your browser control to the file.

If there are images or links that don't load/work, check out adding a <base> tag to the file when you process it. That should fix the URLs.

Hope this helps!


Chances are good that there's some kind of pattern between the URL for the regular (screen display) and printable documents. For example, they may use the same document ID number in the URL.

Once you know that pattern, you can calculate the printable URL and then just save the results of loading that URL into a file.

Just be sure to check for the "page not found" or other error (hack a URL to figure out what their error page is like), so that if they change the format of the printable URL you'll be alerted instead of blindly saving error pages :)

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜