How can I efficiently group a large list of URLs by their host name in Perl?

I have a text file that contains over one million URLs. I need to process this file and group the URLs by host address:

{
    'http://www.ex1.com' => ['http://www.ex1.com/...', 'http://www.ex1.com/...', ...],
    'http://www.ex2.com' => ['http://www.ex2.com/...', 'http://www.ex2.com/...', ...]
}

My current basic solution takes about 600 MB of RAM to do this (the file itself is about 300 MB). Could you suggest some more efficient approaches?

My current solution simply reads the file line by line, extracts the host address with a regex, and pushes the URL into a hash.

EDIT

Here is my implementation (I've cut out the irrelevant parts):

use strict;
use warnings;
use Storable qw(store);

my %urls;
while (my $line = <STDIN>) {
    chomp $line;
    # Capture the scheme and host: everything up to the first "/" or end of line.
    next unless $line =~ m{^(http://.+?)(/|$)}i;
    push @{ $urls{$1} }, $line;
}

store \%urls, 'out.hash';


One approach you could take is tying your URL hash to a DBM such as BerkeleyDB. You can explicitly give it options for how much memory to use for its cache.
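
A minimal sketch of that idea, assuming the BerkeleyDB CPAN module is available; the file name and cache size are illustrative choices, and because DBM values must be flat strings, each host's URLs are joined with newlines rather than stored as an array reference:

use strict;
use warnings;
use BerkeleyDB;

# Tie the hash to an on-disk file; -Cachesize bounds the RAM BerkeleyDB
# may use for its cache (32 MB here, an arbitrary example value).
tie my %urls, 'BerkeleyDB::Hash',
    -Filename  => 'urls.db',
    -Flags     => DB_CREATE,
    -Cachesize => 32 * 1024 * 1024
    or die "Cannot tie urls.db: $BerkeleyDB::Error";

while (my $line = <STDIN>) {
    chomp $line;
    next unless $line =~ m{^(http://.+?)(/|$)}i;
    # DBM values are plain strings, so append rather than push.
    $urls{$1} = defined $urls{$1} ? "$urls{$1}\n$line" : $line;
}

untie %urls;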


If you read 300 MB from the file and store it all in memory (in the hash), there is not much room for optimization in terms of memory use (short of compressing the data, which is probably not a viable option).

But depending on how you are going to use the data in the hash, it might be worth considering storing the data in a database and querying it for the information you need.
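
For instance, a sketch using DBI with DBD::SQLite; the database file, table, and column names are illustrative:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=urls.db', '', '',
                       { RaiseError => 1, AutoCommit => 0 });
$dbh->do('CREATE TABLE IF NOT EXISTS urls (host TEXT, url TEXT)');

# Insert one row per URL inside a single transaction for speed.
my $sth = $dbh->prepare('INSERT INTO urls (host, url) VALUES (?, ?)');
while (my $line = <STDIN>) {
    chomp $line;
    next unless $line =~ m{^(http://.+?)(/|$)}i;
    $sth->execute($1, $line);
}
$dbh->commit;

# Later, fetch the URLs of one host without loading everything into RAM.
my $rows = $dbh->selectcol_arrayref(
    'SELECT url FROM urls WHERE host = ?', undef, 'http://www.ex1.com');

An index on the host column would speed up the per-host queries at the cost of some insert time.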

EDIT:

Based on the code you posted, a quick optimization would be to store only the relative URL rather than the entire line. After all, you already have the host name as a key in your hash.
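
A sketch of that change against the posted loop; the regex is tightened so that $1 captures the host and $2 the relative part:

my %urls;
while (my $line = <STDIN>) {
    chomp $line;
    # $1 is the scheme and host, $2 the relative URL ("/..." or empty).
    next unless $line =~ m{^(http://[^/]+)(/.*|$)}i;
    push @{ $urls{$1} }, $2;
}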


Other than by storing your data structures to disk (a tied DBM hash as suggested by Leon Timmermans, an SQL database such as SQLite3, etc.), you're not going to be able to reduce memory consumption much. 300 MB of actual data, plus the Perl interpreter, plus the bytecode representation of your program, plus metadata on each of the extracted strings will add up to substantially more than 300 MB of total memory if you keep it all in memory. If anything, I'm mildly surprised that it's only double the size of the input file.

One other thing to consider: if you're going to process the same file more than once, storing the parsed data structure on disk means that you'll never have to re-parse it on future runs of the program.
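
Since the posted code already writes the hash with Storable's store(), a later run can reload it with retrieve() instead of re-parsing; the host key below is just the example from the question:

use strict;
use warnings;
use Storable qw(retrieve);

# Reload the structure written earlier by store(\%urls, 'out.hash').
my $urls = retrieve('out.hash');
print "$_\n" for @{ $urls->{'http://www.ex1.com'} || [] };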


What exactly are you trying to achieve? If you are going for some complex analysis, storing the data in a database is a good idea. If the grouping is just an intermediate step, you might simply sort the text file and then process it sequentially, deriving the results you are looking for directly.
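
A sketch of that sort-and-scan approach, assuming the input has been sorted beforehand (e.g. sort urls.txt > sorted.txt) so that URLs of the same host are adjacent; process_group() is a hypothetical handler:

use strict;
use warnings;

open my $fh, '<', 'sorted.txt' or die "Cannot open sorted.txt: $!";

my ($current_host, @group);
while (my $line = <$fh>) {
    chomp $line;
    next unless $line =~ m{^(http://.+?)(/|$)}i;
    my $host = $1;
    # A new host starts: hand the finished group off, then reset.
    if (defined $current_host && $host ne $current_host) {
        process_group($current_host, \@group);
        @group = ();
    }
    $current_host = $host;
    push @group, $line;
}
process_group($current_host, \@group) if @group;

sub process_group {
    my ($host, $urls) = @_;
    printf "%s: %d URLs\n", $host, scalar @$urls;
}

Only one host's URLs are ever in memory at a time, so the footprint stays small regardless of file size.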
