Remove session ID from URL

I want to develop a simple web crawler to grab pages from several web sites and keep them up to date. Some of these sites put a session ID on every link; they don't store session IDs in cookies at all. So if I parse a site several times, my parsing table will contain duplicate pages that differ only in the session ID.

So my question is: how can I remove the session ID from all links? Is there some intelligent way to do it? I'm developing in PHP, but solutions for any other platform would be useful, even just an algorithm described in words.


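For example, if you want to use a regex, this removes every name=value pair whose value is exactly 32 characters long (which I guess is the usual length for a session ID):

// Pass '' rather than null as the replacement; null is deprecated there since PHP 8.1.
// Note: this leaves the surrounding '?' or '&' separator in place.
$url = preg_replace('#\w+=\w{32}\b#', '', $url);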
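If the session parameter names are known, the one-liner above can be made a bit stricter so that it also cleans up the leftover separators. The following is only a sketch: the function name strip_session_id and the list of parameter names (PHPSESSID, sid, sessionid) are assumptions for illustration, so adjust them to whatever the crawled sites actually use.

```php
<?php
// Hypothetical helper: strips known session parameters from a URL's query string.
// The parameter names in the pattern are assumed, not taken from any particular site.
function strip_session_id(string $url): string
{
    // Remove "name=value" (plus a following "&") when it directly follows "?" or "&".
    $url = preg_replace('/(?<=[?&])(?:PHPSESSID|sid|sessionid)=[^&#]*&?/i', '', $url) ?? $url;

    // Drop a "?" or "&" left dangling at the end of the query string.
    return preg_replace('/[?&](?=#|$)/', '', $url) ?? $url;
}

echo strip_session_id('http://example.com/page.php?id=5&PHPSESSID=abc123'), "\n";
```

Because the pattern only matches right after a "?" or "&", a parameter such as "pasid" is left alone even though it ends in "sid".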


You can always use a regular expression to match session keys; they're typical most of the time (PHPSESSID). Anyway, if you're crawling something and would like to accept and work with cookies, you should use cURL (see the curl_setopt options CURLOPT_COOKIE, CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR).
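A rough sketch of the cURL side, assuming the curl extension is loaded. The function name fetch_with_cookie_jar, the 30-second timeout and the cookie file path are made up for illustration; only the CURLOPT_* constants are real cURL options.

```php
<?php
// Hypothetical helper: fetches a page while keeping cookies in a file between requests.
function fetch_with_cookie_jar(string $url, string $cookieFile): ?string
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,        // follow redirects
        CURLOPT_COOKIEFILE     => $cookieFile, // read previously stored cookies from here
        CURLOPT_COOKIEJAR      => $cookieFile, // write cookies here when the handle is released
        CURLOPT_TIMEOUT        => 30,          // assumed value
    ]);
    $body = curl_exec($ch);

    // The handle is released when $ch goes out of scope, which is when the jar is written.
    return $body === false ? null : $body;
}
```

Note that this only helps on sites that are willing to keep the session in a cookie; for sites that put the ID on every link regardless, the URLs still need to be cleaned before they are compared.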


You can use parse_str() and http_build_query() to extract, clean and rebuild the URL parameters. Regular expressions work too, but I think it's easier to get an array of the URL params to work with. Note that parse_str() expects only the query string, not the whole URL.

parse_str('session=123445&data=example&action=demo', $url_params);
// $url_params is now an associative array of the url params
unset($url_params['session'], $url_params['action']);
$new_url_param_string = http_build_query($url_params);
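To apply the same idea to a complete URL, the query string first has to be split off with parse_url() and the URL put back together afterwards. Here is a minimal sketch of that; remove_query_params is a hypothetical name, and sorting the remaining parameters with ksort() is my own addition so that two links to the same page compare equal in the crawler's table.

```php
<?php
// Hypothetical helper: removes the named query parameters from a full URL.
// Parameter names are matched case-sensitively.
function remove_query_params(string $url, array $names): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['query'])) {
        return $url; // malformed URL or nothing to strip
    }

    parse_str($parts['query'], $params);
    foreach ($names as $name) {
        unset($params[$name]);
    }
    ksort($params); // assumed normalization step: stable parameter order
    $query = http_build_query($params);

    // Reassemble the URL from the pieces parse_url() returned.
    $result = isset($parts['scheme']) ? $parts['scheme'] . '://' : '';
    if (isset($parts['user'])) {
        $result .= $parts['user'] . (isset($parts['pass']) ? ':' . $parts['pass'] : '') . '@';
    }
    $result .= $parts['host'] ?? '';
    $result .= isset($parts['port']) ? ':' . $parts['port'] : '';
    $result .= $parts['path'] ?? '';
    $result .= $query !== '' ? '?' . $query : '';
    $result .= isset($parts['fragment']) ? '#' . $parts['fragment'] : '';

    return $result;
}

echo remove_query_params('http://example.com/list.php?session=123445&data=example&page=2', ['session']), "\n";
```

One caveat: parse_str() converts dots and spaces in parameter names to underscores, and http_build_query() re-encodes values, so the rebuilt query may not be byte-for-byte identical to the original. For deduplication that is usually fine as long as every URL goes through the same function.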