I need to generate a 50-million-row CSV file with random data: how can I optimize this program?
The program below generates random data according to some specs (the example here is for 2 columns).
It works with a few hundred thousand lines on my PC (the limit likely depends on RAM). I need to scale to tens of millions of rows.
How can I optimize the program to write directly to disk? As a secondary question, how can I "cache" the parsing rule execution, since the same pattern is repeated 50 million times?
Note: to use the program below, just call generate-blocks and then save-blocks; the output will be db.txt (see the example session after the listing).
Rebol []

specs: [
    [3 digits 4 digits 4 letters]
    [2 letters 2 digits]
]

;===================================================================
digits: charset "0123456789"
letters: charset "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
separator: charset ";"
block-letters: [A B C D E F G H I J K L M N O P Q R S T U V W X Y Z]
blocks: copy []

generate-row: func [] [
    foreach spec specs [
        rule: [
            any [
                set times integer! [
                    'digits (
                        ; -1 + random 10 yields 0-9; random 9 would never produce 0
                        repeat n times [block: rejoin [block -1 + random 10]]
                    )
                    |
                    'letters (
                        ; random 26 covers the whole alphabet; random 24 missed Y and Z
                        repeat n times [block: rejoin [block to-string pick block-letters random 26]]
                    )
                ]
                |
                {"} any separator {"}
            ]
            to end
        ]
        block: copy ""
        parse spec rule
        append blocks block
    ]
]
generate-blocks: func [m] [
    repeat num m [generate-row]
]
quote: func [string] [
    rejoin [{"} string {"}]
]
save-blocks: func [file] [
    if exists? to-rebol-file file [
        answer: ask rejoin ["delete " file "? (Y/N): "]
        if answer = "Y" [
            delete to-rebol-file file    ; delete the requested file, not a hardcoded %db.txt
        ]
    ]
    foreach [field1 field2] blocks [
        write/lines/append to-rebol-file file rejoin [quote field1 ";" quote field2]
    ]
]
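For reference, an example session (the row count here is just illustrative) might be:
generate-blocks 1000     ; build 1000 rows in memory (two fields per row, one per spec)
save-blocks "db.txt"     ; write them out as quoted, semicolon-separated lines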
Use open with the /direct and /lines refinements to write directly to the file without buffering the content:
file: open/direct/lines/write %myfile.txt
loop 1000 [
    t: random copy "abcdefghi"    ; copy so the literal string is not shuffled in place
    append file t                 ; each append goes straight to disk as one line
]
close file
This will write 1000 random lines without buffering. You can also prepare a block of lines (let's say 10000 rows) and then write it to the file in one call; this is faster than writing line by line.
file: open/direct/lines/write %myfile.txt
loop 100 [
    b: copy []
    loop 1000 [append b random copy "abcdef"]    ; copy, otherwise every entry is the same (mutated) string
    append file b                                ; one disk write per batch of 1000 lines
]
close file
This will be much faster: 100,000 rows in less than a second. Hope this helps.
Note that you can change the numbers 100 and 1000 according to your needs and your PC's memory. Also, using b: make block! 1000 instead of b: copy [] preallocates the block and is faster.
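Putting the batching and preallocation hints together with the question's quoted, semicolon-separated output format, a minimal sketch (Rebol 2 semantics assumed; the two random fields are placeholders for your generated values) could look like this:
file: open/direct/lines/write %db.txt
loop 100 [                               ; scale the loop counts up for 50 million rows
    b: make block! 1000                  ; preallocated batch block
    loop 1000 [
        field1: form random 9999999      ; placeholder for a generated field
        field2: form random 99           ; placeholder for a generated field
        append b rejoin [{"} field1 {";"} field2 {"}]
    ]
    append file b                        ; one disk write per 1000-row batch
]
close file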