I need to generate a CSV file with 50 million rows of random data: how can I optimize this program?

The program below generates random data according to some specs (the example here uses 2 columns).

It works for a few hundred thousand lines on my PC (the limit probably depends on RAM). I need to scale to dozens of millions of rows.

How can I optimize the program to write directly to disk? Secondarily, how can I "cache" the parse rule execution, since the same pattern is repeated 50 million times?

Note: to use the program below, call generate-blocks with the number of rows, then save-blocks with the output file name (the examples here use db.txt).

Rebol []

specs: [
    [3 digits 4 digits 4 letters]
    [2 letters 2 digits]
]

;====================================================================================================================


digits: charset "0123456789"
letters: charset "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
separator: charset ";"

block-letters: [A B C D E F G H I J K L M N O P Q R S T U V W X Y Z]

blocks: copy []

generate-row: func [] [
    foreach spec specs [
        ; the same rule is matched against every spec in turn
        rule: [
            any [
                set times integer! [
                    'digits (
                        ; (random 10) - 1 yields 0-9; plain random 9 never produces 0
                        repeat n times [block: rejoin [block (random 10) - 1]]
                    )
                    |
                    'letters (
                        ; block-letters holds 26 entries, so pick with random 26
                        repeat n times [
                            block: rejoin [block to-string pick block-letters random 26]
                        ]
                    )
                ]
                |
                {"} any separator {"}
            ]
            to end
        ]
        block: copy ""
        parse spec rule
        append blocks block
    ]
]

generate-blocks: func [m] [
    loop m [generate-row]
]

quote: func [string] [
    rejoin [{"} string {"}]
]

save-blocks: func [file] [
    file: to-rebol-file file
    if exists? file [
        answer: ask rejoin ["delete " file "? (Y/N): "]
        if answer = "Y" [delete file]    ; delete the file passed in, not a hardcoded name
    ]
    foreach [field1 field2] blocks [
        write/lines/append file rejoin [quote field1 ";" quote field2]
    ]
]
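
A note on the second question: in Rebol a literal block inside a function body is created once, when the function is defined, so the rule above is not really rebuilt on every call; the assignment only re-points rule at the same block. Hoisting the rule to a top-level value makes that explicit. A minimal sketch of that variant (behavior assumed unchanged, untested at this scale):

; Hypothetical variant: define the rule once at load time and reuse it.
row-rule: [
    any [
        set times integer! [
            'digits (repeat n times [block: rejoin [block (random 10) - 1]])
            |
            'letters (repeat n times [
                block: rejoin [block to-string pick block-letters random 26]
            ])
        ]
        |
        {"} any separator {"}
    ]
    to end
]

generate-row: func [] [
    foreach spec specs [
        block: copy ""
        parse spec row-rule
        append blocks block
    ]
]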


Use open with the /direct and /lines refinements to write directly to the file without buffering the content:

file: open/direct/lines/write %myfile.txt
loop 1000 [
    t: random "abcdefghi"    ; random shuffles the string, giving a random line
    append file t            ; each append goes straight to disk as one line
]
close file

This will write 1000 random lines without buffering. You can also prepare a block of lines (let's say 10000 rows) and then append the whole block to the file at once; this is faster than writing line by line.

file: open/direct/lines/write %myfile.txt
loop 100 [
    b: copy []
    loop 1000 [append b random "abcdef"]
    append file b    ; one write call per 1000-line batch
]
close file

This will be much faster: 100000 rows in less than a second. Hope this helps.

Note that you can change the numbers 100 and 1000 according to your needs and the memory of your PC, and that using b: make block! 1000 instead of b: copy [] will be faster, since the block is preallocated at its final size.
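
Putting the two tips together with the question's generator: the sketch below is untested and uses a hypothetical make-line that returns one quoted CSV line (reusing row-rule from the sketch above) instead of appending to the global blocks. It streams rows to disk in preallocated 10000-row batches, so the 50 million rows never have to sit in memory at once:

; make-line builds one quoted, semicolon-separated CSV line.
; block stays global on purpose: the parens in row-rule were bound
; to the global block when the rule was defined.
make-line: func [/local fields] [
    fields: copy []
    foreach spec specs [
        block: copy ""
        parse spec row-rule
        append fields quote block
    ]
    rejoin [fields/1 ";" fields/2]    ; two columns, as in save-blocks
]

save-direct: func [file total /local out batch] [
    out: open/direct/lines/write file
    loop (total / 10000) [            ; assumes total is a multiple of 10000
        batch: make block! 10000      ; preallocated, per the tip above
        loop 10000 [append batch make-line]
        append out batch              ; one write per batch
    ]
    close out
]

save-direct %db.txt 50000000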
