How to store crawled data from webpages
I want to build an educational search engine on my web app, so I decided to crawl about 10 websites using PHP from my web page and store the data in my database for later searching. How do I retrieve this data and store it in my database?
You can grab a page with the file_get_contents() function. So you'd have
$homepage = file_get_contents('http://www.example.com/homepage');
This function returns the page as a string.
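Since you also want to store the result in your database, here is a minimal sketch using PDO; the connection details and the pages table are assumptions for illustration, so adjust them to your own schema:

<?php
// Fetch the page as a string; file_get_contents() returns false on failure.
$url = 'http://www.example.com/homepage';
$homepage = file_get_contents($url);

if ($homepage !== false) {
    // Hypothetical connection and table; replace with your own credentials and schema.
    $pdo = new PDO('mysql:host=localhost;dbname=search_engine', 'user', 'password');
    $stmt = $pdo->prepare('INSERT INTO pages (url, content) VALUES (:url, :content)');
    $stmt->execute([':url' => $url, ':content' => $homepage]);
}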
Hope this helps. Cheers
To build a crawler, I would first make the list of URLs to fetch, and then fetch them.
A. Make the list
- Define the starting URLs to crawl.
- Add them to the list of URLs to crawl (the job list).
- Define the maximum crawl depth.
- Parse the first page, find all its href attributes, and extract the links (see the sketch below).
- For each link: if it is from the same domain or relative, add it to the job list.
- Remove the current URL from the job list.
- Move on to the next URL in the job list, if it is non-empty.
For this you could use this class, which makes parsing HTML really easy: https://simplehtmldom.sourceforge.io/
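Here is a minimal sketch of that loop, assuming you have downloaded simple_html_dom.php from the link above; the starting URL and depth limit are placeholders:

<?php
include 'simple_html_dom.php'; // provides file_get_html()

$start    = 'http://www.example.com/';
$host     = parse_url($start, PHP_URL_HOST);
$jobs     = [$start => 0];  // URL => crawl depth
$visited  = [];
$maxDepth = 2;

while ($jobs) {
    // Take the next URL off the job list (array_key_first needs PHP 7.3+).
    $url   = array_key_first($jobs);
    $depth = $jobs[$url];
    unset($jobs[$url]);
    $visited[$url] = true;

    if ($depth >= $maxDepth) {
        continue;
    }

    $html = file_get_html($url);
    if ($html === false) {
        continue; // page could not be fetched or parsed
    }

    // Find every href and queue unseen, same-domain links.
    foreach ($html->find('a') as $a) {
        $link = $a->href;
        if (strpos($link, 'http') !== 0) {
            // Naive relative-URL resolution; a real crawler should resolve
            // against the current page's base URL.
            $link = rtrim($start, '/') . '/' . ltrim($link, '/');
        }
        if (parse_url($link, PHP_URL_HOST) === $host
            && !isset($visited[$link]) && !isset($jobs[$link])) {
            $jobs[$link] = $depth + 1;
        }
    }

    $html->clear(); // free the parser's memory before the next page
}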
B. Get content
Loop over the job list you built and fetch each page's content. file_get_contents will do this for you: https://www.php.net/file-get-contents
This is basically just a starting point. In step A, you should keep a list of already-parsed URLs so each one is checked only once. Query strings are also something to watch out for, so you avoid scanning the same page multiple times under different query strings.
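One way to handle that is to key the visited list on a normalized URL. The helper below is a sketch that drops the query string entirely; whether that is safe depends on the site, since some sites use query strings for genuinely different pages:

<?php
// Normalize a URL so pages differing only by query string are crawled once.
function normalize_url($url) {
    $parts = parse_url($url);
    $host  = strtolower($parts['host'] ?? '');
    $path  = $parts['path'] ?? '/';
    return ($parts['scheme'] ?? 'http') . '://' . $host . $path;
}

$visited = [];
$url = 'http://www.example.com/page?session=42';
$key = normalize_url($url);
if (!isset($visited[$key])) {
    $visited[$key] = true;
    // ... fetch and parse the page here ...
}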