zcat on Amazon S3
Is it possible to cat a gzip file stored on Amazon S3? Maybe using some streaming client?
We are looking for an operation similar to zcat s3://bucket_name/your_file | grep "log_id"
Found this thread today and liked Keith's answer. Fast forward to today's AWS CLI, and it's done with:
aws s3 cp s3://some-bucket/some-file.bz2 - | bzcat -c | mysql -uroot some_db
Might save someone else a tiny bit of time.
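For the gzip case from the question, the same idea can be wrapped in a small shell function. This is only a sketch: s3_zgrep is a made-up name, and it assumes the AWS CLI is installed and configured with credentials that can read the object. It uses gzip -dc, which does the same job as zcat when reading from a pipe.

```shell
# s3_zgrep PATTERN S3_URI
# Hypothetical helper: stream a gzip object from S3 and grep it
# without writing a temporary file.
s3_zgrep() {
    # "-" as the destination makes `aws s3 cp` write the object to stdout.
    aws s3 cp "$2" - | gzip -dc | grep -- "$1"
}
```

Usage would then look like: s3_zgrep "log_id" s3://bucket_name/your_file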
From S3 REST API » Operations on Objects » GET Object:
To use GET, you must have READ access to the object. If you grant READ access to the anonymous user, you can return the object without using an authorization header.
If that's the case, you can use:
$ curl <url-of-your-object> | zcat | grep "log_id"
or
$ wget -O- <url-of-your-object> | zcat | grep "log_id"
However, if you haven't granted anonymous READ access on the object, you need to create and send the authorization header as part of the GET request, and that becomes somewhat tedious to do with curl/wget. Lucky for you, someone has already done it: that's the Perl aws script by Tim Kay, as recommended by Hari. Note that you don't have to put Tim Kay's script on your path or otherwise install it (except making it executable), as long as you use the command versions which start with aws, e.g.
$ ./aws cat BUCKET/OBJECT | zcat | grep "log_id"
You could also use s3cat, part of Tim Kay's command-line toolkit for AWS:
http://timkay.com/aws/
To get the equivalent of zcat FILENAME | grep "log_id", you'd do:
> s3cat BUCKET/OBJECT | zcat - | grep "log_id"
Not exactly a zcat, but one way to download large files from S3 in parallel is Hadoop's distcp: http://hadoop.apache.org/common/docs/current/distcp.html
hadoop distcp s3://YOUR_BUCKET/your_file /tmp/your_file
or
hadoop distcp s3://YOUR_BUCKET/your_file hdfs://master:8020/your_file
Maybe from this point you can pipe a zcat...
To add your credentials, you have to edit the core-site.xml file to contain:
<configuration>
<property>
<name>fs.s3.awsAccessKeyId</name>
<value>YOUR_KEY</value>
</property>
<property>
<name>fs.s3.awsSecretAccessKey</name>
<value>YOUR_SECRET_KEY</value>
</property>
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>YOUR_KEY</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>YOUR_SECRET_KEY</value>
</property>
</configuration>
If your OS supports it (likely), you can use /dev/fd/1 as the target for aws s3 cp:
aws s3 cp s3://bucket_name/your_file /dev/fd/1 | zcat | grep log_id
There seem to be some trailing bytes after EOF, but zcat and bzcat conveniently just write a warning to STDERR.
I just confirmed that this works by loading some DB dumps straight from S3 like this:
aws s3 cp s3://some_bucket/some_file.sql.bz2 /dev/fd/1 | bzcat -c | mysql -uroot some_db
All this with nothing but the stuff already on your computer and the official AWS CLI tools. Win.
You could try s3streamcat; it supports bzip, gzip and xz compressed files.
Install with
sudo pip install s3streamcat
Usage:
s3streamcat s3://bucketname/dir/file_path
s3streamcat s3://bucketname/dir/file_path | more
s3streamcat s3://bucketname/dir/file_path | grep something
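If you'd rather not install anything beyond the official AWS CLI, roughly the same behaviour can be sketched in shell by picking the decompressor from the file suffix. This is not how s3streamcat itself works, just an approximation; pick_decompressor and s3_anycat are made-up names, and it assumes gzip, bzip2 and xz are installed and the AWS CLI is configured.

```shell
# pick_decompressor NAME
# Print the command that decompresses NAME to stdout, judged by suffix only.
pick_decompressor() {
    case "$1" in
        *.gz)  echo "gzip -dc" ;;
        *.bz2) echo "bzip2 -dc" ;;
        *.xz)  echo "xz -dc" ;;
        *)     echo "cat" ;;   # unknown suffix: pass the bytes through unchanged
    esac
}

# s3_anycat S3_URI
# Hypothetical helper: stream an object from S3 and decompress it on the fly.
s3_anycat() {
    # The command substitution is left unquoted on purpose so that
    # "gzip -dc" is split into a command and its option.
    aws s3 cp "$1" - | $(pick_decompressor "$1")
}
```

It can be piped the same way, e.g. s3_anycat s3://bucketname/dir/file_path | grep something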