How to crawl a WordPress blog?

Posted 2023-02-13 13:24

I wrote a C program to crawl blogs. It worked well until it met this blog: www.ipujia.com. I send the HTTP request:

GET http://www.ipujia.com/ HTTP/1.0

to the website and get the response as below:

HTTP/1.1 301 Moved Permanently
Date: Sun, 27 Feb 2011 13:15:26 GMT
Server: Apache/2.2.16 (Unix) mod_ssl/2.2.16 OpenSSL/0.9.8e-fips-rhel5 mod_auth_passthrough/2.1 mod_bwlimited/1.4 FrontPage/5.0.2.2635 mod_perl/2.0.4 Perl/v5.8.8
X-Powered-By: PHP/5.2.14
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Cache-Control: no-cache, must-revalidate, max-age=0
Pragma: no-cache
Last-Modified: Sun, 27 Feb 2011 13:15:27 GMT
Location: http://http/www.ipujia.com/
Content-Length: 0
Connection: close
Content-Type: text/html; charset=UTF-8

This is strange because I cannot get the index page by following the Location header. Does anyone have any ideas?

The Location field in the response contains a malformed URI.

Location: http://http/www.ipujia.com/ (notice the protocol error: "http" appears where the host name should be) should be

Location: http://www.ipujia.com/

Unless you are in control of the server, there is little you can do about the response itself.

To work around it, could you not parse the "Location" header of the response and attempt to extract a valid URI from it?
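
The suggestion above could be sketched in C roughly as follows. This is only a minimal sketch, not a general redirect handler: the function names `extract_location` and `repair_location` are hypothetical, and the repair step assumes the only breakage is a bogus "http" or "https" component sitting where the host name should be, as in the response shown in the question. A server whose host really is named "http" would be rewritten incorrectly, so treat it as a heuristic.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive test that `s` starts with `prefix`. */
static int starts_with_ci(const char *s, const char *prefix)
{
    while (*prefix != '\0') {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix))
            return 0;
        s++;
        prefix++;
    }
    return 1;
}

/*
 * Copy the value of the Location header from a raw HTTP response into `out`.
 * Only the header section (up to the first empty line) is searched.
 * Returns 0 on success, -1 if the header is missing, empty, or does not fit.
 */
int extract_location(const char *response, char *out, size_t out_size)
{
    const char *line = response;

    assert(response != NULL && out != NULL);
    while (line != NULL && *line != '\0' && *line != '\r' && *line != '\n') {
        if (starts_with_ci(line, "Location:")) {
            const char *value = line + strlen("Location:");
            size_t len;

            while (*value == ' ' || *value == '\t')
                value++;                      /* skip leading whitespace */
            len = strcspn(value, "\r\n");     /* value ends at end of line */
            while (len > 0 && (value[len - 1] == ' ' || value[len - 1] == '\t'))
                len--;                        /* drop trailing whitespace */
            if (len == 0 || len >= out_size)
                return -1;
            memcpy(out, value, len);
            out[len] = '\0';
            return 0;
        }
        line = strchr(line, '\n');            /* advance to the next line */
        if (line != NULL)
            line++;
    }
    return -1;
}

/*
 * Heuristic repair (an assumption, see the text above): if the host part of
 * the URI is just "http" or "https", drop that component, so that
 * "http://http/www.example.com/" becomes "http://www.example.com/".
 * Returns 1 if the URI was repaired, 0 if it was copied unchanged,
 * -1 if the result does not fit in `out`.
 */
int repair_location(const char *location, char *out, size_t out_size)
{
    static const char *const schemes[] = { "http://", "https://" };
    static const char *const bogus_hosts[] = { "http/", "https/" };
    size_t i, j, len;

    assert(location != NULL && out != NULL);
    for (i = 0; i < sizeof schemes / sizeof schemes[0]; i++) {
        size_t scheme_len = strlen(schemes[i]);

        if (!starts_with_ci(location, schemes[i]))
            continue;
        for (j = 0; j < sizeof bogus_hosts / sizeof bogus_hosts[0]; j++) {
            const char *tail = location + scheme_len;

            if (!starts_with_ci(tail, bogus_hosts[j]))
                continue;
            tail += strlen(bogus_hosts[j]);
            if (*tail == '\0')
                continue;                     /* nothing left to use as a host */
            len = strlen(tail);
            if (scheme_len + len >= out_size)
                return -1;
            memcpy(out, location, scheme_len);
            memcpy(out + scheme_len, tail, len + 1);
            return 1;
        }
    }
    len = strlen(location);
    if (len >= out_size)
        return -1;
    memcpy(out, location, len + 1);
    return 0;
}
```

A crawler would call `extract_location` on the buffered response headers after seeing a 3xx status, pass the result through `repair_location`, and then issue the next request to the repaired URI. It is probably worth capping the number of redirects followed, since a broken server could also redirect in a loop.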

Tags: network-programming web-crawler wordpress
