PHP DOM processing: code review of a small parser program
Many thanks for running this board. I love this site, it has helped me so often! You are great fellows. Today I am working on a small PHP parser.
I need to get all the data out of this site. The target is www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder. I am trying to scrape the data from the page, and I need all of the data reachable from it. I want to store the data in a MySQL database for easier retrieval.
Here is an example:
Bürgerstiftung Lebensraum Aachen
rechtsfähige Stiftung des bürgerlichen Rechts
Ansprechpartner: Hubert Schramm
Alexanderstr. 69/ 71
52062 Aachen
Telefon: 0241 - 4500130
Telefax: 0241 - 4500131
Email: info@buergerstiftung-aachen.de
www.buergerstiftung-aachen.de
>> Weitere Details zu dieser Stiftung
Bürgerstiftung Achim
rechtsfähige Stiftung des bürgerlichen Rechts
Ansprechpartner: Helga Kühn
Rotkehlchenstr. 72
28832 Achim
Telefon: 04202-84981
Telefax: 04202-955210
Email: info@buergerstiftung-achim.de
www.buergerstiftung-achim.de
>> Weitere Details zu dieser Stiftung
I need the data that is "behind" each link. Is there a way to do this with a simple, understandable parser, one that a newbie can understand and write? I could do this with XPath, in PHP or in Perl (with Mechanize).
I started with a PHP approach, but when I run the code (see below) I get this result:
PHP Fatal error: Call to undefined function file_get_html() in /home/martin/perl/foundations/arbie_finder_de.php on line 5
martin@suse-linux:~/perl/foundations> cd foundations
It is caused by this code:
<?php
// Create DOM from URL or file
$html = file_get_html('www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder');
// split it via body, so you only get to the contents inside body tag
$split = split('<body>', $html);
// it is usually in the top of the array but just check to be sure
$body = $split[1];
// split again with, say,<p class="divider">A</p>
$split = split('<p class="divider">A</p>', $body);
// now this should contain just the data table you want to process
$data = $split[1];
// Find all links from original html
foreach($html->find('a') as $element) {
    $link = $element->href;
    // check if this link is in our data table
    if(substr_count($data, $link) > 0) {
        // link is in our data table, follow the link
        $html = file_get_html($link);
        // do what you have to do
    }
}
?>
Some musings about my approach.
The standard practice for scraping pages would be:
- Read the page into a string (file_get_html or whatever is being used now).
- Split the string. This depends on the page structure. First split it on <body>, so that one element of the array contains the body, and so on until we reach our target. I'm guessing the final split would be on <p class="divider">A</p>, since that part has the link described above.
- If we wish to follow a link, just repeat the same process using that link.
- Alternatively, we can search around for a PHP snippet that gets all links in a page. This is better if we have done steps 1 and 2 already and now have only the string inside the tag. Much simpler that way.
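The steps above can also be sketched with PHP's built-in DOM extension instead of string splitting. This is only a rough sketch: the inline markup is a made-up stand-in for the downloaded page (not the real structure of the target site), and the helper name linksInside is hypothetical.

```php
<?php
// Made-up sample standing in for the downloaded page (assumption, not the real markup).
$page = <<<HTML
<html><body>
<div id="nav"><a href="/home">Home</a></div>
<div id="content">
  <dl>
    <dt>Example foundation</dt>
    <dd><a href="/detail/achim">Weitere Details zu dieser Stiftung</a></dd>
  </dl>
</div>
</body></html>
HTML;

// Hypothetical helper: return the href of every <a> matched by an XPath query.
function linksInside(string $html, string $xpathQuery): array
{
    // Real-world HTML is rarely valid; keep libxml warnings out of the output.
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $links = [];
    foreach ($xpath->query($xpathQuery) as $a) {
        $links[] = $a->getAttribute('href');
    }
    return $links;
}

// Only links inside the content area are collected; navigation links are skipped.
$detailLinks = linksInside($page, '//div[@id="content"]//a[@href]');
print_r($detailLinks);
```

Restricting the query to one container plays the role of the repeated split calls: it narrows the page down to the data area before the links are collected.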
My question is: what can cause these errors? I have no clue. It would be great if you have an idea. I look forward to your replies.
Update: I could try the following, admitting that it doesn't get any simpler than using simple_html_dom.
$records = array();
foreach($html->find('#content dl') as $contact) {
    $record = array();
    $record["name"] = $contact->find("dt", 0)->plaintext;
    foreach($contact->find("dd") as $field) {
        /* parse each $field->plaintext in order to obtain $fieldname */
        $record[$fieldname] = $field->plaintext;
    }
    $records[] = $record;
}
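The "parse each $field->plaintext" step might look roughly like this. It is a sketch under assumptions: parseField is a hypothetical helper, and it assumes labelled lines have the form "Label: value" as in the example records above, with unlabelled lines (street, city, website) collected separately.

```php
<?php
// Hypothetical helper: split "Telefon: 0241 - 4500130" into [field name, value].
// Returns null when the line is not in "label: value" form.
function parseField(string $text): ?array
{
    // Require whitespace after the colon so that URLs like "http://..." do not match.
    if (preg_match('/^\s*([^:]+?)\s*:\s+(.*\S)\s*$/u', $text, $m) !== 1) {
        return null;
    }
    return [strtolower($m[1]), $m[2]];
}

// Sample lines taken from the example record in the question.
$lines = [
    'Ansprechpartner: Hubert Schramm',
    'Alexanderstr. 69/ 71',
    'Telefon: 0241 - 4500130',
];

$record = [];
foreach ($lines as $line) {
    $parsed = parseField($line);
    if ($parsed !== null) {
        [$fieldname, $value] = $parsed;
        $record[$fieldname] = $value;
    } else {
        // No label found: keep the raw line so nothing is lost.
        $record['unlabeled'][] = $line;
    }
}
print_r($record);
```

In the loop above, $field->plaintext would take the place of $line.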
I will try to work from here. Perhaps I should use a recent version of PHP to get the jQuery-like syntax. Any ideas?
Before you consider scraping any site, I want to point out that you need to consider the legal and ethical repercussions of doing so. If this is not your site, or if you do not have permission from the owner, you probably shouldn't be scraping. If it's not for personal use, you especially shouldn't be scraping. Just be careful...
First, you need a semicolon (;) after $data = $split[1], that'll get rid of your PHP syntax error. I'm a bit confused by the first error, referring to the *, because you have no *'s anywhere in your code.
After your syntax errors go away though it seems like you'll be on the right track to write a MySQL query and insert your findings.
You may also consider something like:
foreach($html->find('a') as $element)
    echo $element->href;
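As for the "Call to undefined function file_get_html()" error: file_get_html() is not built into PHP. It is defined by the Simple HTML DOM library (simple_html_dom.php), so the script has to load that file before calling it. Also, a URL without a scheme is treated as a local file path. Here is a minimal sketch; the library location and the withScheme helper are my assumptions, and I am assuming the site answers over http://.

```php
<?php
// Assumed location of the library file: next to this script.
$library = __DIR__ . '/simple_html_dom.php';
if (is_file($library)) {
    require_once $library;
}

// Hypothetical helper: add a scheme when the URL has none.
function withScheme(string $url, string $scheme = 'http'): string
{
    return preg_match('#^[a-z][a-z0-9+.-]*://#i', $url) === 1
        ? $url
        : $scheme . '://' . $url;
}

$url = withScheme('www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder');

if (function_exists('file_get_html')) {
    $html = file_get_html($url);
    if ($html !== false) {
        foreach ($html->find('a') as $element) {
            echo $element->href, PHP_EOL;
        }
    }
} else {
    // Library not loaded: this is the situation that produces the fatal error.
    echo 'simple_html_dom.php not found, file_get_html() is unavailable', PHP_EOL;
}
```

Note also that split() was removed in PHP 7.0; explode() is the usual replacement for literal delimiters, and it needs a string rather than the object that file_get_html() returns.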
I hope that helps.