开发者

List<T> -- Proper Collection to Use for Large Amount of Items?

Edit: Ok, since it's clear I'm taking the wrong approach with this, I'll explain what I was intending to do. The overall intent is to (as an exercise) verify all valid email addresses according to spec. This portion was to generate a portion of the data-set to verify the algorithm against.


As an exercise, I'm writing a program that will generate all possible email addresses. This will result in 808165 ≈ 1.4e122 possible items. I'm currently using List<T>s to store the generated items but my understanding is that it has a maximum capacity of Int32.MaxValue. I'm guessing a proper solution isn't going to involve Lists of Lists of Lists. This is what I have so far.

private void GenerateLocalPart()
{
    List<string> validLocalSymbols = new List<string>()
    {
        ".", "!", "#", "$", "%", "&", "*", "+", "-",
        "/", "^", "_", "`", "{", "|", "}", "~", "\"",
    };
    List<string> validLocalNumbers = new List<string>()
    {
        "0", "1", "2", "3", "4", "5", "6", "7", "8", "9",
    };
    List<string> validLocalLowercase = new List<string>()
    {
        "a", "b", "c", "d", "e", "f", "g", "h", "i", "j",
        "k", "l", "m", "n", "o", "p", "q", "r", "s", "t",
        "u", "v", "w", "x", "y", "z",
    };
    List<string> validLocalUppercase = new List<string>()
    {
        "A", "B", "C", "D", "E", "F", "G", "H", "I", "J",
        "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T",
        "U", "V", "W", "X", "Y", "Z",
    };

    List<string> validLocalPartCharacters = new List<string>();
    validLocalPartCharacters.AddRange(validLocalSymbols);
    validLocalPartCharacters.AddRange(validLocalNumbers);
    validLocalPar开发者_高级运维tCharacters.AddRange(validLocalLowercase);
    validLocalPartCharacters.AddRange(validLocalUppercase);

    List<string> targetSequence           = validLocalLowercase;
    int lengthOfStringToGenerate          = 5;
    int numberOfDifferentSourceCharacters = targetSequence.Count;
    List<List<string>> localPart          = new List<List<string>>();
    List<string> localPartSeed            = new List<string>();

    localPart.Add(localPartSeed);
    foreach (string character in targetSequence)
        localPartSeed.Add(character);

    for (int i = 1; i < lengthOfStringToGenerate; i++)
    {
        List<string> bufferList = new List<string>();
        localPart.Add(bufferList);
        foreach (string lastListString in localPart[i - 1])
            foreach (string character in targetSequence)
                bufferList.Add(lastListString + character);
    }

    Console.WriteLine("Break here.");
}

lengthOfStringToGenerate is a maximum length of the strings (so it generates all combinations from 1 to lengthOfStringToGenerate). localPart will end up with an amount of Lists equivalent to the lengthOfStringToGenerate. Is there a different type of collection that I should be using? Is there a different overall approach I should be taking?


Where were you expecting to store all this data? List<T> will always store its values in memory... but even if you write something to store the results to disk, you're still not going to be able to hold 1.4e122 items. Have you really taken in just how big that number is? Even at a single bit per item, that's way more than the capacity of the universe, if the whole of the universe was one big hard disk.

The largest unit of data I've ever heard of being talked about in a meaningful way is an exabyte, which is 1018 bytes. For most people, a petabyte (1015 bytes) is a pretty huge amount of data. What you're considering makes those quantities seem microscopically small.

What are you trying to do with the data afterwards? And when would you expect such an algorithm to ever actually finish?


Your project might even be a bigger threat to reality than the LHC and possible micro black-holes, see Arthur C. Clarke’s ‘The Nine Billion Names of God’

Actually, given that your algorithm is entirely deterministic why store the values in the first place? Why not represent them as a in-memory IEnumerable stream that generates the next possible email address each time Next is called. You could store your progress so far (good luck with this) and use the “current position” (or rather “position reached”) as an input to the generator for subsequent runs. As Jon points out, it’s not just a space problem – there’s the total runtime to contend with too (another reason perhaps to use a non-persistent, pull based architecture – assuming you can use the results as they become available then at least this way you’ll derive some useful effect early).

Cute project, one does naturally wonder: why.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜