
PHP DOM processing: code review of a small parser program

Many thanks for running this board. I love this site; it has helped me so often. You are great fellows! Today I am working on a little PHP parser.

I need to get all the data out of this site. The target is www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder. I am trying to scrape the data from a web page, and I need to get all the data at this link. I want to store the data in a MySQL database for easier retrieval.

See an example:

Bürgerstiftung Lebensraum Aachen
    rechtsfähige Stiftung des bürgerlichen Rechts
    Ansprechpartner: Hubert Schramm
    Alexanderstr. 69/ 71
    52062 Aachen
    Telefon: 0241 - 4500130
    Telefax: 0241 - 4500131
    Email: info@buergerstiftung-aachen.de
    www.buergerstiftung-aachen.de
    >> Weitere Details zu dieser Stiftung

Bürgerstiftung Achim
    rechtsfähige Stiftung des bürgerlichen Rechts
    Ansprechpartner: Helga Kühn
    Rotkehlchenstr. 72
    28832 Achim
    Telefon: 04202-84981
    Telefax: 04202-955210
    Email: info@buergerstiftung-achim.de
    www.buergerstiftung-achim.de
    >> Weitere Details zu dieser Stiftung 

I need to have the data that is "behind" the link. Is there any way to do this with an easy and understandable parser, one that can be understood and written by a newbie? Well, I could do this with XPath, in PHP or Perl (with Mechanize).


I started with a PHP approach, but if I run the code (see below) I get this result:

PHP Fatal error:  Call to undefined function file_get_html() in /home/martin/perl/foundations/arbie_finder_de.php on line 5
martin@suse-linux:~/perl/foundations> cd foundations

It is caused by this code:

<?php

// Create DOM from URL or file
$html = file_get_html('www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder');

// split it via body, so you only get to the contents inside body tag
$split = split('<body>', $html);
// it is usually in the top of the array but just check to be sure
$body = $split[1];
// split again with, say,<p class="divider">A</p>
$split = split('<p class="divider">A</p>', $body);
// now this should contain just the data table you want to process
$data = $split[1];

// Find all links from original html
foreach($html->find('a') as $element) {
       $link = $element->href;

       // check if this link is in our data table
       if(substr_count($data, $link) > 0) {
           // link is in our data table, follow the link
           $html = file_get_html($link);
          // do what you have to do
       }
}


?>

Well, some musings about my approach.

The standard practice for scraping pages would be:

  1. Read the page into a string (file_get_html or whatever is being used now).
  2. Split the string. This depends on the page structure. First split it via <body>, so one element of the array will contain the body, and so on until we get to our target. I'm guessing the final split would be by <p class="divider">A</p>, since it has the link we described above.
  3. If we wish to follow the link, just repeat the same process, but using the link.
  4. Alternatively, we can search around for a PHP snippet that gets all links in a page. This is better if we have done 1 and 2 already, and we now have only the string inside the tag. Much simpler that way.
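The splitting steps above can be sketched with explode() on the raw page source. Note that split() was deprecated in PHP 5.3 and removed in PHP 7.0, so explode() is used here instead. This is only a sketch: the sample markup and the helper name textAfter() are made up for illustration, and the real page may be structured differently.

```php
<?php
// Hypothetical sample markup standing in for the downloaded page source.
$page = '<html><body><p>intro</p><p class="divider">A</p>'
      . '<a href="/detail/1">Details</a></body></html>';

/**
 * Return the part of $haystack after the first occurrence of $marker,
 * or null when the marker is not present. (Helper name is hypothetical.)
 */
function textAfter(string $haystack, string $marker): ?string
{
    // Limit 2: everything after the first marker stays in one piece.
    $parts = explode($marker, $haystack, 2);
    return count($parts) === 2 ? $parts[1] : null;
}

// Step 2 from the list: first cut at <body>, then at the divider paragraph.
$body = textAfter($page, '<body>');
$data = $body === null ? null : textAfter($body, '<p class="divider">A</p>');

echo $data ?? 'marker not found', PHP_EOL;
```

Checking for a missing marker matters here, because indexing `$split[1]` directly raises a notice or warning as soon as the page layout changes.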

Well, my question is: what could be causing this error? I have no clue. It would be great if you have an idea. I look forward to your answers.

Update: Hmm, I could try this,

admitting that it doesn't get any simpler than using simple_html_dom:

$records = array();
foreach($html->find('#content dl') as $contact) {
   $record = array();
   $record["name"] = $contact->find("dt", 0)->plaintext;
   foreach($contact->find("dd") as $field) {
       /* parse each $field->plaintext in order to obtain $fieldname */
       $record[$fieldname] = $field->plaintext;
   }
   $records[] = $record;
}
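The "parse each field" step left open above could look roughly like the following. This sketch swaps in PHP's built-in DOMDocument and DOMXPath instead of simple_html_dom, so it runs without any extra library. The markup is hypothetical (the `#content dl` / `dt` / `dd` structure is taken from the snippet above and not verified against the real page), and parseRecords() is an illustrative name.

```php
<?php
// Hypothetical markup; the real page structure is an assumption.
$html = <<<'HTML'
<div id="content">
  <dl>
    <dt>Bürgerstiftung Achim</dt>
    <dd>rechtsfähige Stiftung des bürgerlichen Rechts</dd>
    <dd>Ansprechpartner: Helga Kühn</dd>
    <dd>Telefon: 04202-84981</dd>
    <dd>Email: info@buergerstiftung-achim.de</dd>
  </dl>
</div>
HTML;

function parseRecords(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);      // keep markup warnings quiet
    // The XML prefix hints to libxml that the input is UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $records = [];
    foreach ($xpath->query('//*[@id="content"]//dl') as $dl) {
        $dt = $xpath->query('dt', $dl)->item(0);
        $record = ['name' => $dt ? trim($dt->textContent) : ''];
        foreach ($xpath->query('dd', $dl) as $dd) {
            $text = trim($dd->textContent);
            // "Label: value" lines become keyed fields; the rest is collected.
            if (preg_match('/^([^:]+):\s*(.+)$/u', $text, $m)) {
                $record[trim($m[1])] = $m[2];
            } else {
                $record['other'][] = $text;
            }
        }
        $records[] = $record;
    }
    return $records;
}

$records = parseRecords($html);
print_r($records);
```

One caveat with the colon rule: a field such as a full URL starting with `http://` would also match it, so a real parser would probably whitelist the known labels (Ansprechpartner, Telefon, Telefax, Email).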

Well, I will try to work from here. Perhaps I will use a more recent version of PHP to get the jQuery-like syntax... hmm.

Any ideas?


I definitely wanted to point out, before you consider scraping any site, that you need to consider the legal and ethical repercussions of doing so. If this is not your site or if you do not have permission from the owner, you probably shouldn't be scraping. If it's not for personal use, you especially shouldn't be scraping. Just be careful...

First, you need a semicolon (;) after $data = $split[1]. That will get rid of your PHP syntax error. I'm a bit confused by the first error, referring to the *, because you have no *'s anywhere in your code.
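A note on the `Call to undefined function file_get_html()` message quoted in the question: file_get_html() is not a PHP built-in. It is defined by the Simple HTML DOM library (simple_html_dom.php), so the script has to include that file before calling it. A minimal sketch, assuming the library file sits next to the script (that path is an assumption) and that the site is reachable over http:// (the scheme is missing in the question and is also an assumption):

```php
<?php
// Assumed location: simple_html_dom.php saved in the same directory as this script.
$library = __DIR__ . '/simple_html_dom.php';

if (is_file($library)) {
    require_once $library;   // defines file_get_html() and friends
}

$available = function_exists('file_get_html');

if ($available) {
    // A URL without a scheme is treated as a local file path, so the
    // scheme has to be spelled out, for example:
    // $html = file_get_html('http://www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder');
    echo 'file_get_html() is available', PHP_EOL;
} else {
    echo 'simple_html_dom.php not found, file_get_html() is undefined', PHP_EOL;
}
```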

After your syntax errors go away, though, it seems like you'll be on the right track to write a MySQL query and insert your findings.
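For the insert itself, a prepared statement through PDO is one reasonable option. The following is only a sketch: the table name `foundations`, its columns, the record keys (`name`, `Telefon`, `Email`, modelled on the labels in the example data) and the function name saveRecords() are all hypothetical, and the connection details in the usage comment are placeholders.

```php
<?php
/**
 * Insert scraped records using a prepared statement.
 * Table and column names are hypothetical; adjust them to your schema.
 * Returns the number of rows inserted.
 */
function saveRecords(PDO $pdo, array $records): int
{
    $stmt = $pdo->prepare(
        'INSERT INTO foundations (name, phone, email) VALUES (:name, :phone, :email)'
    );

    $saved = 0;
    foreach ($records as $record) {
        // Missing fields are stored as NULL instead of raising notices.
        $stmt->execute([
            ':name'  => $record['name'] ?? null,
            ':phone' => $record['Telefon'] ?? null,
            ':email' => $record['Email'] ?? null,
        ]);
        $saved += $stmt->rowCount();
    }
    return $saved;
}

// Usage (placeholder credentials, requires the pdo_mysql extension):
// $pdo = new PDO('mysql:host=localhost;dbname=scraper;charset=utf8mb4', 'user', 'secret',
//     [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);
// echo saveRecords($pdo, $records), ' rows saved', PHP_EOL;
```

Bound parameters keep scraped text (which may contain quotes) from breaking or altering the SQL.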

You may also consider something like:

foreach($html->find('a') as $element) 
   echo $element->href;
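If simple_html_dom is not at hand, the same link listing can be approximated with the built-in DOMDocument. A small sketch under that assumption; the sample fragment and the name extractLinks() are invented for illustration:

```php
<?php
// Hypothetical fragment; the real link markup on the page may differ.
$html = '<ul>'
      . '<li><a href="/stiftung/aachen">Details</a></li>'
      . '<li><a href="/stiftung/achim">Details</a></li>'
      . '<li><a href="/stiftung/aachen">Details again</a></li>'
      . '<li><a name="top">no href</a></li>'
      . '</ul>';

function extractLinks(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // keep markup warnings quiet
    $doc->loadHTML($html);
    libxml_clear_errors();

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));
        if ($href !== '') {             // skip anchors without an href
            $links[] = $href;
        }
    }
    // Drop duplicates and reindex from 0.
    return array_values(array_unique($links));
}

$links = extractLinks($html);
foreach ($links as $link) {
    echo $link, PHP_EOL;
}
```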

I hope that helps.
