开发者

How can I return a random line from a file? Interview Question [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
开发者_JS百科

Want to improve this question? Update the question so it focuses on one problem only by editing this post.

Closed 6 years ago.

Improve this question

I am preparing for a phone interview. I came upon these questions on the internet. Can anyone tell me some good answers for these?

  1. Suppose I give you a text file and ask you a to write a program that will return a random line from the file (all lines must have equal probability to be returned)

  2. Same as part 1, except this time the entire text file cannot fit into main memory

  3. Same as part 2, except now you have a stream instead of a file.

Please help.

Ok...@Everyone, I really had a few ideas in my mintod before asking this...Seeing the relentless attack by my fellow SOers, I am posting my answers. Please feel free to attack them too...

1: Count the number of '\n' in the file. Generate a random number between 1 and the number and return the line after the number-1 '\n'.

2: Bring the file into main memory part by part and follow the above procedure.

3: I dont have much idea about this and would appreciate any inputs.

Its wonderful that you guys really give an inspiration to push forward.....


  1. Read all lines into an array, return a random line in the range of 1 and the amount of lines.

  2. Most simple: Count the lines, choose a line number at random, go through the file a second time and return it.

  3. You just have to remember one line. Each new line has a probability of 1/N (N being lines read).

    Pseudocode:

    i = 1
    chosen_line = ""
    for line in lines:
        if random() < 1/i:    # random returns a uniform random number in [0,1)
            chosen_line = line
        i += 1
    return chosen_line
    

Algorithm number 3 could be used for 1 and 2 too.


You can do this without having to read all the lines in memory, thus working well for huge files. Pseudocode:

linenum := 0
ret := ''
while more lines to read:
    line := readline()
    linenum := linenum + 1
    r := uniform_random(0, linenum)
    if r < 1:
        ret := line

return ret

Proof: We begin by noting that we always save the first line in ret. If the file has one line, you are going to choose it, and you're done.

For two-line file, ret will save the first line 100% of the times, and the second line will be saved in ret 50% of the time during the second iteration of the loop. Thus, each line has a probability of 0.5 of being selected.

Now, let's assume that this method works for files of ≤N lines. To prove that this means it works for N+1, in the (N+1)th iteration of the loop, there is a probability of 1/(N+1) that the last line will be selected (random(0, N+1) < 1 has that probability). Thus, the last line has 1/(N+1) probability of being selected. The probability of all other lines being selected is still going to be equal to each other, let's call this x. Then, N*x + 1/(N+1) == 1, which means that x = 1/(N+1).

Proof by induction is complete.

Edit: Oops, didn't see the first answer's third method before replying. Still, I will keep this post here if only for the proof, and the opportunity for other people to correct it if there are any errors in it.


Re 1: Use solution to 2

Re 2: You would want to scan the whole file using a RandomAccessFile access to count the number of lines and (possibly) cache the file pointers for each start of line. Then you could choose one at random (I'm assuming this question is not about how to generate random numbers) and move back to that start point, read the line and return it. If you want it fast then make sure you are buffering the reads (raf is v slow otherwise).

Re 3: If the stream doesn't fit in memory (i.e. you cannot cache the whole thing) and you don't know how many lines are in the stream without reading the whole stream (assuming you only get to read it once) then I cannot see a solution. I too wait for answers...


#3: write the stream to a file on disk and use solution 2. Not the most efficient solution, of course, but very simple.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜