How to store crawled data from webpages
I want to build an educational search engine on my web app, so I decided to crawl about 10 websites using PHP from my web page and store the data in my database for later searching. How do I retrieve this data and store it in my database?
You can grab a page with the file_get_contents() function. So you'd have
$homepage = file_get_contents('http://www.example.com/homepage');
This function returns the page as a string.
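Since you also want to store the result in your database, here is a minimal sketch using PDO; the connection details and the pages table are assumptions for illustration, so adjust them to your own schema:

<?php
// Fetch the page as a string; file_get_contents() returns false on failure.
$url = 'http://www.example.com/homepage';
$homepage = file_get_contents($url);

if ($homepage !== false) {
    // Hypothetical connection and table; replace with your own credentials and schema.
    $pdo = new PDO('mysql:host=localhost;dbname=search_engine', 'user', 'password');
    $stmt = $pdo->prepare('INSERT INTO pages (url, content) VALUES (:url, :content)');
    $stmt->execute([':url' => $url, ':content' => $homepage]);
}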
Hope this helps. Cheers
To build a crawler, I would first make the list of URLs to fetch, and then fetch them.
A. Make the list
- Define the starting URLs to crawl.
- Add them to the list of URLs to crawl (the job list).
- Define the maximum crawl depth.
- Parse the first page, find all its href attributes, and extract the links (see the sketch below).
- For each link: if it is from the same domain or relative, add it to the job list.
- Remove the current URL from the job list.
- Move on to the next URL in the job list, if it is non-empty.
For this you could use this class, which makes parsing HTML really easy: https://simplehtmldom.sourceforge.io/
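Here is a minimal sketch of that loop, assuming you have downloaded simple_html_dom.php from the link above; the starting URL and depth limit are placeholders:

<?php
include 'simple_html_dom.php'; // provides file_get_html()

$start    = 'http://www.example.com/';
$host     = parse_url($start, PHP_URL_HOST);
$jobs     = [$start => 0];  // URL => crawl depth
$visited  = [];
$maxDepth = 2;

while ($jobs) {
    // Take the next URL off the job list (array_key_first needs PHP 7.3+).
    $url   = array_key_first($jobs);
    $depth = $jobs[$url];
    unset($jobs[$url]);
    $visited[$url] = true;

    if ($depth >= $maxDepth) {
        continue;
    }

    $html = file_get_html($url);
    if ($html === false) {
        continue; // page could not be fetched or parsed
    }

    // Find every href and queue unseen, same-domain links.
    foreach ($html->find('a') as $a) {
        $link = $a->href;
        if (strpos($link, 'http') !== 0) {
            // Naive relative-URL resolution; a real crawler should resolve
            // against the current page's base URL.
            $link = rtrim($start, '/') . '/' . ltrim($link, '/');
        }
        if (parse_url($link, PHP_URL_HOST) === $host
            && !isset($visited[$link]) && !isset($jobs[$link])) {
            $jobs[$link] = $depth + 1;
        }
    }

    $html->clear(); // free the parser's memory before the next page
}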
B. Get content
Loop over the job list you built and fetch each page's content. file_get_contents will do this for you: https://www.php.net/file-get-contents
This is basically just a starting point. In step A, you should keep a list of already-parsed URLs so each one is checked only once. Query strings are also something to watch out for, so you avoid scanning the same page multiple times under different query strings.
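One way to handle that is to key the visited list on a normalized URL. The helper below is a sketch that drops the query string entirely; whether that is safe depends on the site, since some sites use query strings for genuinely different pages:

<?php
// Normalize a URL so pages differing only by query string are crawled once.
function normalize_url($url) {
    $parts = parse_url($url);
    $host  = strtolower($parts['host'] ?? '');
    $path  = $parts['path'] ?? '/';
    return ($parts['scheme'] ?? 'http') . '://' . $host . $path;
}

$visited = [];
$url = 'http://www.example.com/page?session=42';
$key = normalize_url($url);
if (!isset($visited[$key])) {
    $visited[$key] = true;
    // ... fetch and parse the page here ...
}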