Finding duplicate lines in a text file where separate records can be identical to each other
I've built a system that fills a database by reading a file. More data may be added to this file at a later stage, so the same file sometimes has to be read again.
Each line of the file represents one record, and the tough part is finding the unique ones. Here's why.
The file may look like this:
123 20110101 4123 Hello
123 20110101 4123 Hello
124 20110102 6133 Hello again
125 20110103 6425 Yes
The real problem here is that the first two lines are identical but are not duplicates: they are separate records, so the system reads both of them into the database.
As I said earlier, this file may be added to at a later stage, so we have to read it again. I didn't know how new text was added to the file, so I assumed that new data would be appended to the end. I therefore stored the file row number with each row in the database to make lines unique. However, I was wrong...
As it turns out, data was inserted in the middle of the file as well.
This means we now may have the following file:
123 20110101 4123 Hello
123 20110101 4123 Hello
124 20110102 6133 Hello again
123 20110101 4123 Hello
125 20110103 6425 Yes
Now we are about to read the file a second time. In this case I only want to read the fourth line, as it is the only new line. How can I find the new line and skip the others?
Save the old version of the file, then run a diff on the old version and the new version. That will give you the newly added lines.
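As a rough sketch of that idea, here is one way to do the comparison in Python with the standard-library difflib module rather than an external diff tool. The function name added_lines and the in-memory lists are assumptions made for illustration; in practice the two lists would come from the saved copy and the current file. When a new line lands next to identical older lines, a diff cannot say which copy is the new one, but since the text is the same that should not matter for the import.

```python
import difflib


def added_lines(old_lines, new_lines):
    """Return (line_number, text) pairs for lines that only the new version has.

    Line numbers are 1-based positions in the new version.
    """
    # autojunk=False keeps difflib from treating frequently repeated
    # lines as junk, which matters when many records are identical.
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    added = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" covers pure additions; "replace" covers changed
        # stretches, whose new side we also treat as new records.
        if tag in ("insert", "replace"):
            for j in range(j1, j2):
                added.append((j + 1, new_lines[j]))
    return added


# The two versions of the file from the question, as lists of lines.
old = [
    "123 20110101 4123 Hello",
    "123 20110101 4123 Hello",
    "124 20110102 6133 Hello again",
    "125 20110103 6425 Yes",
]
new = [
    "123 20110101 4123 Hello",
    "123 20110101 4123 Hello",
    "124 20110102 6133 Hello again",
    "123 20110101 4123 Hello",
    "125 20110103 6425 Yes",
]

print(added_lines(old, new))  # [(4, '123 20110101 4123 Hello')]
```

After each successful import, the current file would be saved as the new "old" version, so the next run only sees what was added since then.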