How to read Stata binary format from a compressed file

The Stata knowledge base includes a note on reading ASCII data from a pipe, which would let one read a file without storing the decompressed version on disk. We have never had success with the method shown there: we always get the error message "mypipe.pip: not found". We have terabytes of data that compress very well, so this has always been a disappointment. We would be interested in hearing whether it works for anyone else.

While investigating this we found a workaround that seems much better. Unlike the knowledge base suggestion, it works with .dta files as well as ASCII files, which is far more interesting to us. It relies on the ability of the use command to read from an http URL.

Our first try was to add the file test.cgi to our web server's cgi-bin directory:

#!/bin/sh
echo Content-type: application/x-stata
# CGI requires an empty line between the headers and the body
echo
/usr/bin/zcat /data/sample.dta.gz

and we find that

use http://www.nber.org/test

works from Stata. It involved a lot of overhead, though, as the file whipped around the LAN several times, so we haven't pursued taking the file name from the URL or otherwise making this practical.
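Taking the file name from the URL might look something like the sketch below. It is untested against a real web server and rests on assumptions of ours: the script name dta.cgi, the DATA_DIR variable, and the convention that a request for sample.dta is answered from sample.dta.gz are all hypothetical. The server is assumed to pass the part of the URL after the script name in PATH_INFO, as CGI servers generally do.

```shell
#!/bin/sh
# Hypothetical CGI sketch (script name dta.cgi and DATA_DIR are assumptions).
# Intended request:  http://<host>/cgi-bin/dta.cgi/sample.dta
DATA_DIR=${DATA_DIR:-/data}

# Print the requested name if it is a plain *.dta file name made of
# letters, digits, dot, underscore and hyphen; fail otherwise.
# Slashes are rejected, so "../" and absolute paths cannot get through.
safe_name() {
    name=${1#/}                          # drop the leading slash of PATH_INFO
    case $name in
        ""|*[!A-Za-z0-9._-]*|.*) return 1 ;;
        *.dta) printf '%s\n' "$name" ;;
        *) return 1 ;;
    esac
}

# Send the CGI header and the decompressed file, or a 404 response.
serve() {
    if name=$(safe_name "$1") && [ -r "$DATA_DIR/$name.gz" ]; then
        printf 'Content-type: application/x-stata\n\n'
        gzip -dc "$DATA_DIR/$name.gz"    # same effect as zcat on a .gz file
    else
        printf 'Status: 404 Not Found\nContent-type: text/plain\n\nnot found\n'
    fi
}

# Only act when a web server has set PATH_INFO.
if [ -n "${PATH_INFO:-}" ]; then
    serve "$PATH_INFO"
fi
```

With this in cgi-bin, the Stata side would be something like use http://www.nber.org/cgi-bin/dta.cgi/sample.dta. The LAN overhead described above would remain.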

We are developing an alternative that requires neither an actual web server nor root permissions. It uses the nc command, which ships with most Linux distributions and is also available for Windows. At the Stata prompt, run the compound command:

.! (echo -ne "HTTP/1.0 200 OK\r\n\r\n"; zcat /data/sample.dta.gz;) | nc -l 8080 &

This command sets up the computer to transmit a header and the decompressed file to the first process that connects to port 8080. Since 8080 is a high port, no special permission is required to use it. The command won't finish until the file has been read from that port, at which point it shows the exact request Stata sent. Because of the &, Stata continues while nc waits. Then

. use http://127.0.0.1:8080

Note that you can't use "localhost" instead of 127.0.0.1 because the use command won't accept one-part host names.
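The listener can be checked without Stata. The sketch below wraps the header and the decompression in a function, here called http_payload (our name, not part of the original command), and uses printf rather than echo -ne because the handling of echo options varies between shells. The curl and cmp lines in the comments are an assumption about how one might verify the transfer; we have not run them against every nc variant.

```shell
#!/bin/sh
# Emit what the listener sends: a bare HTTP/1.0 status line,
# an empty line, then the decompressed file.
http_payload() {
    printf 'HTTP/1.0 200 OK\r\n\r\n'
    gzip -dc "$1"              # same effect as zcat on a .gz file
}

# Possible check from an ordinary shell (nc blocks until one client connects):
#   http_payload /data/sample.dta.gz | nc -l 8080 &
#   curl -s -o /tmp/check.dta http://127.0.0.1:8080    # stands in for Stata's use
#   gzip -dc /data/sample.dta.gz | cmp - /tmp/check.dta
```

If cmp reports no difference, the bytes that reach the client are the same as the decompressed file.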

If there is no nc on your machine, look for ncat, netcat or socat. Some versions will require a '-p' before the port number. You can install nc on a Windows machine and should be able to do the same thing, but we haven't tried it.
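A script could look for whichever of these programs is installed. This is a minimal sketch; the function name find_netcat is ours. It only reports which program is on the PATH and does not work out whether that program wants '-p' before the port number. socat uses a different command syntax altogether, so finding it is only a starting point.

```shell
#!/bin/sh
# Print the name of the first netcat-like program found on PATH.
# Returns non-zero if none of them is installed.
find_netcat() {
    for prog in nc ncat netcat socat; do
        if command -v "$prog" >/dev/null 2>&1; then
            printf '%s\n' "$prog"
            return 0
        fi
    done
    return 1
}

# Usage:
#   if NC=$(find_netcat); then echo "using $NC"; else echo "no netcat found"; fi
```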

This could also be used for ASCII files, encrypted files, split files, and perhaps other types. If only Stat/Transfer would write to the standard output!

There is a security issue: the file is served to whichever process connects to the port first, so the read restrictions in the Unix permission bits no longer protect it. It is also slower than reading the uncompressed file from disk, but still fast enough for us.

We have been trying to package this into an ado file, but without much success. A user-friendly ado program would need to find an available port by itself, which we haven't seen a good way to do yet, and to communicate that port back to the use command, for which we are also at a loss. I was hoping someone on the list might be inspired to suggest a method, or that StataCorp might just incorporate decompression into the use command.
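One possible approach to the port problem, offered as an untested sketch rather than a solution: list the TCP ports that already have a listener and take the first port in a range that is not among them. The function names used_ports and first_free_port are ours, and used_ports assumes a Linux system with the ss utility. There is an unavoidable gap between checking a port and listening on it, so another process could still claim the port in between.

```shell
#!/bin/sh
# List TCP ports that currently have a listener, one per line.
# Assumes Linux with ss (iproute2); column 4 is "Local Address:Port".
used_ports() {
    ss -ltnH | awk '{ n = split($4, a, ":"); print a[n] }'
}

# Read used ports on stdin and print the first port in START..END
# that is not among them. Exits non-zero if the whole range is taken.
# Usage: used_ports | first_free_port 8080 8180
first_free_port() {
    awk -v lo="$1" -v hi="$2" '
        { used[$1] = 1 }
        END {
            for (p = lo; p <= hi; p++)
                if (!(p in used)) { print p; exit 0 }
            exit 1
        }'
}
```

If the ado program picked the range and ran something like this through the shell, the port would still have to get back into Stata, for example through a temporary file. Alternatively the ado program could choose the candidate port itself and pass it to the shell command, so that nothing needs to come back. We have tried neither.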

Daniel Feenberg feenberg@nber.org
