Handling large String lists in java

2023-04-10 04:51 问答作者：

I've got a task, where I've got to go through several billion string lines and check, whether each of those is unique. All the lines themselves cannot be accommodated within the RAM memory of the PC. Also, th开发者_如何转开发e number of lines is likely to be larger than Integer.MAX_VALUE.

I'm assuming that the best way to handle this amount of data is to put hash codes of each of the strings into some sort of HashTable.

So, here are my questions:

What should I use instead of String.hashCode()? (the return value is int, but I probably will need long)
What is the fastest way/framework for working with lists of this size? What I mostly need is an ability to quickly check if the list contains an element or not

You are over thinking the problem, this can all be done very simply with one MySQL table which saves data to the disk instead of holding everything in memory. That much data was never meant to be efficiently handled by a standalone application.

CREATE TABLE TONS_OF_STRINGS
(
  unique_string varchar(255) NOT NULL,
  UNIQUE (unique_string)
)

Just loop through the values (assuming a comma separated list here) and try to insert each token. Each failed token is a duplicate.

public static void main(args) {
  Connection con = DriverManager.getConnection("jdbc:mysql://localhost/database","username","password");
  FileReader file = new FileReader("SomeGiantFile.csv");
  Scanner scan = new Scanner(file);
  scan.useDelimiter(",");
  String token;
  while ( scan.hasNext() ) {
    token = scan.next();
    try {
      PreparedStatement ps = con.prepareStatement("Insert into TONS_OF_STRING (UNIQUE_STRING) values (?)");
      ps.setString(1, token);
      ps.executeUpdate();
    } catch (SQLException e) {
      System.out.println("Found duplicate: " + token );
    }
  }
  con.close();
  System.out.println("Well that was easy, I'm all done!");
  return 0;
}

Don't forget to clear the table when you are done though, thats a lot of data.

It is not sufficient to simply store 32 or 64 bit hashcodes because two distinct strings (out of a few billion) can easily have the same hashcode. Once you have two strings with the same hashcode, you need to compare the actual strings to see if they are actually equal.

Here's the way I'd solve this problem:

Read the file / stream of strings:
1. Read each line
2. Calculate the hash code for the line
3. Write the hashcode and the string to a temporary file with a suitable field separator in between
Use a decent external sort program to sort the temporary file using the hashcode field as the primary sort key and the string field as the secondary sort key.
Read the temporary file a line at a time. If two successive lines have the same hashcode field and different string fields then you've found a duplicate string.

Note: This approach will work equally well with 32 or 64 bit hashcodes.

继续阅读：bigdata data-structures hashset

Handling large String lists in java

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？