How can I efficiently group a large list of URLs by their host name in Perl?

I have a text file that contains over one million URLs. I need to process this file and group the URLs by host address:

{
    'http://www.ex1.com' => ['http://www.ex1.com/...', 'http://www.ex1.com/...', ...],
    'http://www.ex2.com' => ['http://www.ex2.com/...', 'http://www.ex2.com/...', ...]
}

My current basic solution takes about 600 MB of RAM to do this (the file itself is about 300 MB). Could you suggest some more efficient approaches?

My current solution simply reads the file line by line, extracts the host address with a regex, and pushes the URL into a hash.

EDIT

Here is my implementation (I've cut out the irrelevant parts):

use strict;
use warnings;
use Storable qw(store);

my %urls;
while (my $line = <STDIN>) {
    chomp $line;
    # Capture the scheme and host: everything up to the first "/" or end of line.
    next unless $line =~ m{^(http://.+?)(/|$)}i;
    push @{ $urls{$1} }, $line;
}

store \%urls, 'out.hash';


One approach you could take is tying your URL hash to a DBM such as BerkeleyDB. You can explicitly give it options for how much memory to use for its cache.
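
A minimal sketch of that idea, assuming the BerkeleyDB CPAN module is available; the file name and cache size are illustrative choices, and because DBM values must be flat strings, each host's URLs are joined with newlines rather than stored as an array reference:

use strict;
use warnings;
use BerkeleyDB;

# Tie the hash to an on-disk file; -Cachesize bounds the RAM BerkeleyDB
# may use for its cache (32 MB here, an arbitrary example value).
tie my %urls, 'BerkeleyDB::Hash',
    -Filename  => 'urls.db',
    -Flags     => DB_CREATE,
    -Cachesize => 32 * 1024 * 1024
    or die "Cannot tie urls.db: $BerkeleyDB::Error";

while (my $line = <STDIN>) {
    chomp $line;
    next unless $line =~ m{^(http://.+?)(/|$)}i;
    # DBM values are plain strings, so append rather than push.
    $urls{$1} = defined $urls{$1} ? "$urls{$1}\n$line" : $line;
}

untie %urls;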


If you read 300 MB from the file and store it all in memory (in the hash), there is not much room for optimization in terms of memory use (short of compressing the data, which is probably not a viable option).

But depending on how you are going to use the data in the hash, it might be worth considering storing the data in a database and querying it for the information you need.
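
For instance, a sketch using DBI with DBD::SQLite; the database file, table, and column names are illustrative:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=urls.db', '', '',
                       { RaiseError => 1, AutoCommit => 0 });
$dbh->do('CREATE TABLE IF NOT EXISTS urls (host TEXT, url TEXT)');

# Insert one row per URL inside a single transaction for speed.
my $sth = $dbh->prepare('INSERT INTO urls (host, url) VALUES (?, ?)');
while (my $line = <STDIN>) {
    chomp $line;
    next unless $line =~ m{^(http://.+?)(/|$)}i;
    $sth->execute($1, $line);
}
$dbh->commit;

# Later, fetch the URLs of one host without loading everything into RAM.
my $rows = $dbh->selectcol_arrayref(
    'SELECT url FROM urls WHERE host = ?', undef, 'http://www.ex1.com');

An index on the host column would speed up the per-host queries at the cost of some insert time.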

EDIT:

Based on the code you posted, a quick optimization would be to store only the relative URL rather than the entire line. After all, you already have the host name as a key in your hash.
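
A sketch of that change against the posted loop; the regex is tightened so that $1 captures the host and $2 the relative part:

my %urls;
while (my $line = <STDIN>) {
    chomp $line;
    # $1 is the scheme and host, $2 the relative URL ("/..." or empty).
    next unless $line =~ m{^(http://[^/]+)(/.*|$)}i;
    push @{ $urls{$1} }, $2;
}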


Other than by storing your data structures to disk (a tied DBM hash as suggested by Leon Timmermans, an SQL database such as SQLite3, etc.), you're not going to be able to reduce memory consumption much. 300 MB of actual data, plus the Perl interpreter, plus the bytecode representation of your program, plus metadata on each of the extracted strings will add up to substantially more than 300 MB of total memory if you keep it all in memory. If anything, I'm mildly surprised that it's only double the size of the input file.

One other thing to consider: if you're going to process the same file more than once, storing the parsed data structure on disk means that you'll never have to re-parse it on future runs of the program.
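
Since the posted code already writes the hash with Storable's store(), a later run can reload it with retrieve() instead of re-parsing; the host key below is just the example from the question:

use strict;
use warnings;
use Storable qw(retrieve);

# Reload the structure written earlier by store(\%urls, 'out.hash').
my $urls = retrieve('out.hash');
print "$_\n" for @{ $urls->{'http://www.ex1.com'} || [] };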


What exactly are you trying to achieve? If you are going for some complex analysis, storing the data in a database is a good idea. If the grouping is just an intermediate step, you might simply sort the text file and then process it sequentially, deriving the results you are looking for directly.
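
A sketch of that sort-and-scan approach, assuming the input has been sorted beforehand (e.g. sort urls.txt > sorted.txt) so that URLs of the same host are adjacent; process_group() is a hypothetical handler:

use strict;
use warnings;

open my $fh, '<', 'sorted.txt' or die "Cannot open sorted.txt: $!";

my ($current_host, @group);
while (my $line = <$fh>) {
    chomp $line;
    next unless $line =~ m{^(http://.+?)(/|$)}i;
    my $host = $1;
    # A new host starts: hand the finished group off, then reset.
    if (defined $current_host && $host ne $current_host) {
        process_group($current_host, \@group);
        @group = ();
    }
    $current_host = $host;
    push @group, $line;
}
process_group($current_host, \@group) if @group;

sub process_group {
    my ($host, $urls) = @_;
    printf "%s: %d URLs\n", $host, scalar @$urls;
}

Only one host's URLs are ever in memory at a time, so the footprint stays small regardless of file size.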
