
zcat on amazon s3

Is it possible to cat a gzip file stored on Amazon S3? Maybe using some streaming client?

We are looking for an operation similar to zcat s3://bucket_name/your_file | grep "log_id"


Found this thread today and liked Keith's answer. Fast forward to today's aws cli, and it's done with:

aws s3 cp s3://some-bucket/some-file.bz2 - | bzcat -c | mysql -uroot some_db

Might save someone else a tiny bit of time.
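
For the gzip case in the original question, the same streaming trick looks like this (a sketch using the question's bucket/key, assuming the object is gzip-compressed):

aws s3 cp s3://bucket_name/your_file - | zcat | grep "log_id"   # "-" streams the object to stdout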


From S3 REST API » Operations on Objects » GET Object:

To use GET, you must have READ access to the object. If you grant READ access to the anonymous user, you can return the object without using an authorization header.

If that's the case, you can use:

$ curl <url-of-your-object> | zcat | grep "log_id"

or

$ wget -O- <url-of-your-object> | zcat | grep "log_id"

However, if you haven't granted anonymous READ access on the object, you need to create and send the authorization header as part of the GET request, and that becomes somewhat tedious to do with curl/wget. Luckily, someone has already done it: the Perl aws script by Tim Kay, as recommended by Hari. Note that you don't have to put Tim Kay's script on your path or otherwise install it (apart from making it executable), as long as you use the command versions that start with aws, e.g.

$ ./aws cat BUCKET/OBJECT | zcat | grep "log_id"
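
If you have a recent AWS CLI configured with credentials, another way around hand-crafting the authorization header is to generate a presigned URL and hand it to curl. A sketch (the URL is temporary, one hour by default):

$ url=$(aws s3 presign s3://BUCKET/OBJECT)   # pre-authorized, time-limited URL
$ curl -s "$url" | zcat | grep "log_id"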


You could also use s3cat, part of Tim Kay's command-line toolkit for AWS:

http://timkay.com/aws/

To get the equivalent of zcat FILENAME | grep "log_id", you'd do:

> s3cat BUCKET/OBJECT | zcat - | grep "log_id"


Not exactly a zcat, but one way to use Hadoop to download large files in parallel from S3 is distcp: http://hadoop.apache.org/common/docs/current/distcp.html

hadoop distcp s3://YOUR_BUCKET/your_file /tmp/your_file

or

hadoop distcp s3://YOUR_BUCKET/your_file hdfs://master:8020/your_file

Maybe from this point you can pipe a zcat...
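
For example, once the copy has landed you could run something like this (assuming the /tmp and HDFS paths from the commands above):

zcat /tmp/your_file | grep "log_id"
# or, if the copy went into HDFS:
hadoop fs -cat hdfs://master:8020/your_file | zcat | grep "log_id"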

To add your credentials, you have to edit the core-site.xml file with:

<configuration>
  <property>
    <name>fs.s3.awsAccessKeyId</name>
    <value>YOUR_KEY</value>
  </property>
  <property>
    <name>fs.s3.awsSecretAccessKey</name>
    <value>YOUR_KEY</value>
  </property>
  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>YOUR_KEY</value>
  </property>
  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>YOUR_KEY</value>
  </property>
</configuration>
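
If you'd rather not edit core-site.xml, the same properties can usually be passed on the command line through Hadoop's generic -D options (a sketch; whether this works depends on your Hadoop version, since distcp has to pick up generic options):

hadoop distcp -Dfs.s3.awsAccessKeyId=YOUR_KEY -Dfs.s3.awsSecretAccessKey=YOUR_KEY s3://YOUR_BUCKET/your_file /tmp/your_file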


If your OS supports it (it likely does), you can use /dev/fd/1 as the target for aws s3 cp:

aws s3 cp s3://bucket_name/your_file /dev/fd/1 | zcat | grep log_id

There seem to be some trailing bytes after EOF, but zcat and bzcat conveniently just write a warning to STDERR.

I just confirmed that this works by loading some DB dumps straight from S3 like this:

aws s3 cp s3://some_bucket/some_file.sql.bz2 /dev/fd/1 | bzcat -c | mysql -uroot some_db

All this with nothing but the stuff already on your computer and the official AWS CLI tools. Win.


You could try s3streamcat; it supports bzip2-, gzip- and xz-compressed files.

Install with:

sudo pip install s3streamcat

Usage:

s3streamcat s3://bucketname/dir/file_path
s3streamcat s3://bucketname/dir/file_path | more
s3streamcat s3://bucketname/dir/file_path | grep something
