
display records which exist in file2 but not in file1

Log file1 contains records of customers (name, id, date) who visited yesterday; log file2 contains records of customers (name, id, date) who visited today.

How would you display customers who visited yesterday but not today?

Constraint: don't use an auxiliary data structure, because the files contain millions of records (so, no hashes). Is there a way to do this using Unix commands?


comm is made for this. Here's an example, but check the man page of comm for the option you want; to show only the lines unique to the first file (yesterday's log), suppress columns 2 and 3:

comm -23 <(sort -u yesterday) <(sort -u today)

The other tool you can use is diff

diff <(sort -u yesterday) <(sort -u today)


I was personally going to go for building a data structure to record the visits, but I can see how you'd do it another way too.

In pseudocode, that looks something like the Python below, but it could be rewritten in Perl or shell script or ...

import fileinput
import subprocess

for line in fileinput.input(['myfile']):
    # split out data. For the sake of it I'm assuming name\tid\tdate
    fields = line.rstrip("\n").split("\t")
    id = fields[1]

    # grep yesterday's log for this id; empty output means it wasn't there
    grepresult = subprocess.Popen(["grep", id, "file1"],
                                  stdout=subprocess.PIPE).stdout.read()

    if len(grepresult) == 0:
        print(fields)  # it wasn't in file1

That's not perfect and it's not tested, so treat it appropriately, but it gives you the gist of how you'd use Unix commands. That said, as sfussenegger points out, C/C++ (if that's what you're using) should be able to handle pretty large files.

Disclaimer: this is a not-so-neat solution (repeatedly calling grep), written to match the requirements of the question. If I were doing it myself, I would use C.


Is a customer identified by id? Is it an int or a long? If the answer to both questions is yes, an array of 10,000,000 ints shouldn't take more than 10M * 4 bytes = 40 MB of memory - not a big deal on decent hardware. Simply sort both arrays and compare them.

btw, sorting an array with 10M random ints takes less than 2 seconds on my machine - again, nothing to be afraid of.

Here's some very simple Java code:

public static void main(final String args[]) throws Exception {

    // elements in each log file
    int count = 10000000;

    // "read" our log file
    Random r = new Random();
    int[] a1 = new int[count];
    int[] a2 = new int[count];
    for (int i = 0; i < count; i++) {
        a1[i] = Math.abs(r.nextInt());
        a2[i] = Math.abs(r.nextInt());
    }

    // start timer
    long start = System.currentTimeMillis();

    // sort logs
    Arrays.sort(a1);
    Arrays.sort(a2);

    // counters for each array
    int i1 = 0, i2 = 0, i3 = 0;

    // result array
    int[] a3 = new int[count];

    // walk both sorted arrays in step
    while (i1 < count && i2 < count) {
        if (a1[i1] == a2[i2]) {
            // we found a match, save value if unique and advance both counters
            if (i3 == 0 || a3[i3 - 1] != a1[i1]) a3[i3++] = a1[i1];
            i1++;
            i2++;
        } else if (a1[i1] < a2[i2]) {
            // a1's value is lower, advance its counter
            i1++;
        } else {
            // a2's value is lower, advance its counter
            i2++;
        }
    }

    // we found our results
    System.out.println(i3 + " common clients");
    System.out.println((System.currentTimeMillis() - start) + "ms");

}

result

// sample output on my machine:
46308 common clients
3643ms

As you can see, that's quite efficient for 10M records in each log.
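
The code above counts the customers common to both logs; the question as asked wants those who visited yesterday but not today. The same sorted-merge walk handles that with the match branch inverted: keep an id only when it appears in the first array and not in the second. A minimal sketch of that variant (the difference method below is illustrative, not part of the original answer; it assumes a1 and a2 have already been sorted as above):

static int difference(final int[] a1, final int[] a2, final int[] out) {
    int i1 = 0, i2 = 0, i3 = 0;
    while (i1 < a1.length) {
        // advance a2 until it reaches or passes the current a1 value
        while (i2 < a2.length && a2[i2] < a1[i1]) i2++;
        if (i2 < a2.length && a2[i2] == a1[i1]) {
            // present in both logs - not part of the result
            i1++;
        } else {
            // present yesterday only - save it unless it duplicates the last saved value
            if (i3 == 0 || out[i3 - 1] != a1[i1]) out[i3++] = a1[i1];
            i1++;
        }
    }
    return i3; // number of "yesterday but not today" customers
}

Dropping that into the same class and calling int n = difference(a1, a2, new int[count]); right after the two Arrays.sort calls would give the count the question actually asks for.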

