Joining two files with regular expression in Unix (ideally with perl)

Posted 2023-03-18 17:44

I have the following two files, disconnect.txt and answered.txt:

disconnect.txt

2011-07-08 00:59:06,363 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:459 - AnalyzedInfo had ActCode = Disconnected from: 40397400012 to:40397400032
2011-07-08 00:59:06,363 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:459 - AnalyzedInfo had ActCode = Disconnected from: 4035350012 to:40677400032

answered.txt

2011-07-08 00:59:40,706 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:404 - Normal Call Answered, billingid=2301986 from: 40397643433 to:403###34**
2011-07-08 00:59:40,706 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:404 - Normal Call Answered, billingid=2301986 from: 3455334459 to:1222
2011-07-08 00:59:48,893 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:404 - Normal Call Answered, billingid=2220158 from: 4035350012 to:40677400032

I would like to join these files on the from: and to: fields, and the output should be the matching lines from answered.txt. For example, for the two files above, the output would be:

2011-07-08 00:59:48,893 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:404 - Normal Call Answered, billingid=2220158 from: 4035350012 to:40677400032

I'm currently doing it by comparing each line in file 1 with each line in file 2, but I want to know whether a more efficient way exists (these files will be tens of gigabytes in size).

Thank you

Sounds like you have hundreds of millions of lines?

Unless the files are sorted in such a way that you can expect the order of the from: and to: to at least vaguely correlate, this is a job for a database.

If the files are large the quadratic algorithm will take a lifetime.

Here is a Ruby script that uses just a single hash table lookup per line in answered.txt:

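require 'set'

# Build a join key from the from: and to: fields of a log line.
# Returns nil for lines that lack either field.
def key(line)
  m = line.match(/from:\s*(\S+)\s+to:\s*(\S+)/)
  m && "#{m[1]}.#{m[2]}"
end

# Load every from/to pair found in disconnect.txt into a set.
seen = Set.new
File.foreach('disconnect.txt') do |line|
  k = key(line)
  seen << k if k
end

# Stream answered.txt and print the lines whose pair was seen.
File.foreach('answered.txt') do |line|
  puts line if seen.include?(key(line))
end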

Like ysth says, it all depends on the number of lines in disconnect.txt. If that's a really big¹ number, then you will probably not be able to fit all the keys in memory and you will need a database.

¹ The number of lines in disconnect.txt multiplied by (roughly) 64 should be less than the amount of memory (in bytes) in your machine.
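As a rough worked example of that rule of thumb, here is a small sketch. The 64-bytes-per-key figure is taken from the footnote; the helper names and the 100-million-line count are made up for illustration, and the real overhead per entry depends on the Ruby version and the key length.

```ruby
# Rough estimate of the memory needed to hold one key per line of
# disconnect.txt. bytes_per_key = 64 is the footnote's rule of thumb,
# not a measured value.
def estimated_key_bytes(line_count, bytes_per_key = 64)
  line_count * bytes_per_key
end

# Hypothetical helper: does the estimate fit in the given amount of RAM?
def keys_fit_in_memory?(line_count, ram_bytes, bytes_per_key = 64)
  estimated_key_bytes(line_count, bytes_per_key) < ram_bytes
end

GIB = 1024**3

# 100 million lines need roughly 6.4 billion bytes (a little under 6 GiB).
puts estimated_key_bytes(100_000_000)           # 6400000000
puts keys_fit_in_memory?(100_000_000, 8 * GIB)  # true
puts keys_fit_in_memory?(100_000_000, 4 * GIB)  # false
```

If the estimate comes out close to the machine's memory, treat it as not fitting, since the operating system and the interpreter need room as well.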

First, sort both files on the from: and to: fields if they are not already sorted that way. (In the sample data these values look like phone numbers rather than timestamps, but any consistent ordering of the pair works as a sort key.)

Then take the sorted files and compare the first line of each.

If the keys are equal, you have a match. Hooray! Advance a line in one or both files (depending on your rules for duplicate keys in each) and compare again.
If not, read the next line from whichever file has the smaller key and compare again.

This is the fastest way to compare two (or more) sorted files and it guarantees that no line will be read from disk more than once.
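The comparison loop described above can be sketched in Ruby as follows. This is only an illustration under some assumptions: both inputs are already sorted by the same from/to key using plain string comparison, every line contains both fields, and duplicates are handled by advancing only the answered side on a match. The names merge_join and next_or_nil are invented for this sketch, and the sample lines are abbreviated versions of the ones in the question.

```ruby
# Extract the "from.to" join key; raises if a line lacks the fields.
def key(line)
  m = line.match(/from:\s*(\S+)\s+to:\s*(\S+)/) or
    raise ArgumentError, "no from:/to: fields in #{line.inspect}"
  "#{m[1]}.#{m[2]}"
end

# Return the enumerator's next element, or nil when it is exhausted.
def next_or_nil(enum)
  enum.next
rescue StopIteration
  nil
end

# Merge join over two line sources that are ALREADY sorted by key(line).
# Yields each line of `answered` whose key also occurs in `disconnects`.
def merge_join(disconnects, answered)
  d = disconnects.each
  a = answered.each
  d_line = next_or_nil(d)
  a_line = next_or_nil(a)
  while d_line && a_line
    case key(d_line) <=> key(a_line)
    when 0
      yield a_line
      # Keep d_line so that further answered lines with the same key match too.
      a_line = next_or_nil(a)
    when -1
      d_line = next_or_nil(d)
    else
      a_line = next_or_nil(a)
    end
  end
end

# Abbreviated sample data from the question, sorted by key for the demo.
disconnects = [
  'Disconnected from: 40397400012 to:40397400032',
  'Disconnected from: 4035350012 to:40677400032'
].sort_by { |l| key(l) }

answered = [
  'Answered, billingid=2301986 from: 40397643433 to:403###34**',
  'Answered, billingid=2301986 from: 3455334459 to:1222',
  'Answered, billingid=2220158 from: 4035350012 to:40677400032'
].sort_by { |l| key(l) }

merge_join(disconnects, answered) { |line| puts line }
# prints: Answered, billingid=2220158 from: 4035350012 to:40677400032
```

For real files, passing File.foreach('disconnect.sorted') and File.foreach('answered.sorted') (hypothetical file names) in place of the arrays should stream the data line by line instead of loading it into memory.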

If your files aren't appropriately sorted, then the initial sorting operation may be somewhat expensive on files in the "tens of gigabytes each" size range, but:

You can split the files into arbitrarily sized chunks (ideally small enough for each chunk to fit into memory), sort each chunk independently, and then generalize the above algorithm from two files to as many as are necessary.
Even if you don't do that and you deal with the disk thrashing involved with sorting files larger than the available memory, sorting and then doing a single pass over each file will still be a lot faster than any solution involving a Cartesian join.
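The chunked approach might look roughly like the sketch below: sort fixed-size chunks in memory, spill each to a temporary file, then merge the sorted chunks by repeatedly taking the smallest head line. The name external_sort and the chunk_size default are assumptions made for this example, not something from the answers, and the linear scan over chunk heads would want replacing with a heap if there are many chunks.

```ruby
require 'tempfile'

# Extract the "from.to" sort key; raises if a line lacks the fields.
def key(line)
  m = line.match(/from:\s*(\S+)\s+to:\s*(\S+)/) or
    raise ArgumentError, "no from:/to: fields in #{line.inspect}"
  "#{m[1]}.#{m[2]}"
end

# Sort a large line source in bounded memory and yield the lines in key
# order. chunk_size is the number of lines sorted in memory at once
# (hypothetical default; tune it to the available RAM).
def external_sort(lines, chunk_size: 1_000_000)
  # Phase 1: sort each chunk and write it to its own temp file.
  chunks = lines.each_slice(chunk_size).map do |slice|
    tmp = Tempfile.new('chunk')
    slice.sort_by { |l| key(l) }.each { |l| tmp.puts(l) }
    tmp.rewind
    tmp
  end

  # Phase 2: k-way merge. heads[i] is the next unread line of chunk i,
  # or nil once that chunk is exhausted.
  heads = chunks.map(&:gets)
  loop do
    live = heads.each_index.reject { |j| heads[j].nil? }
    i = live.min_by { |j| key(heads[j]) }
    break if i.nil?
    yield heads[i]
    heads[i] = chunks[i].gets
  end
ensure
  # Close and delete the temp files even if something went wrong.
  chunks&.each(&:close!)
end

# Small demo with a deliberately tiny chunk size.
sample = [
  'c from: 40397643433 to:403###34**',
  'a from: 3455334459 to:1222',
  'b from: 4035350012 to:40677400032',
  'd from: 40397400012 to:40397400032'
]
external_sort(sample, chunk_size: 2) { |line| puts line }
```

Writing the yielded lines to a new file for each input, and then running a single merge pass over the two sorted files, gives the overall sort-then-merge plan described above.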

Or you could just use a database as mentioned in previous answers. The above method will be more efficient in most, if not all, cases, but a database-based solution would be easier to write and would also provide a lot of flexibility for analyzing your data in other ways without needing to do a complete scan through each file every time you need to access anything in it.
