
zcat on amazon s3

Is it possible to cat a gzip file stored on Amazon S3? Maybe using some streaming client?

We are looking for an operation similar to zcat s3://bucket_name/your_file | grep "log_id"


Found this thread today and liked Keith's answer. Fast forward to today's aws cli, and it's done with:

aws s3 cp s3://some-bucket/some-file.bz2 - | bzcat -c | mysql -uroot some_db

Might save someone else a tiny bit of time.
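
For the gzip case in the original question, the same streaming trick looks like this (a sketch using the question's bucket/key, assuming the object is gzip-compressed):

aws s3 cp s3://bucket_name/your_file - | zcat | grep "log_id"   # "-" streams the object to stdout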


From S3 REST API » Operations on Objects » GET Object:

To use GET, you must have READ access to the object. If you grant READ access to the anonymous user, you can return the object without using an authorization header.

If that's the case, you can use:

$ curl <url-of-your-object> | zcat | grep "log_id"

or

$ wget -O- <url-of-your-object> | zcat | grep "log_id"

However, if you haven't granted anonymous READ access on the object, you need to create and send the authorization header as part of the GET request, and that becomes somewhat tedious to do with curl/wget. Luckily, someone has already done it: the Perl aws script by Tim Kay, as recommended by Hari. Note that you don't have to put Tim Kay's script on your path or otherwise install it (apart from making it executable), as long as you use the command versions that start with aws, e.g.

$ ./aws cat BUCKET/OBJECT | zcat | grep "log_id"
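
If you have a recent AWS CLI configured with credentials, another way around hand-crafting the authorization header is to generate a presigned URL and hand it to curl. A sketch (the URL is temporary, one hour by default):

$ url=$(aws s3 presign s3://BUCKET/OBJECT)   # pre-authorized, time-limited URL
$ curl -s "$url" | zcat | grep "log_id"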


You could also use s3cat, part of Tim Kay's command-line toolkit for AWS:

http://timkay.com/aws/

To get the equivalent of zcat FILENAME | grep "log_id", you'd do:

> s3cat BUCKET/OBJECT | zcat - | grep "log_id"


Not exactly a zcat, but one way to use Hadoop to download large files in parallel from S3 is distcp: http://hadoop.apache.org/common/docs/current/distcp.html

hadoop distcp s3://YOUR_BUCKET/your_file /tmp/your_file

or

hadoop distcp s3://YOUR_BUCKET/your_file hdfs://master:8020/your_file

Maybe from this point you can pipe a zcat...
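
For example, once the copy has landed you could run something like this (assuming the /tmp and HDFS paths from the commands above):

zcat /tmp/your_file | grep "log_id"
# or, if the copy went into HDFS:
hadoop fs -cat hdfs://master:8020/your_file | zcat | grep "log_id"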

To add your credentials, you have to edit the core-site.xml file with:

<configuration>
  <property>
    <name>fs.s3.awsAccessKeyId</name>
    <value>YOUR_KEY</value>
  </property>
  <property>
    <name>fs.s3.awsSecretAccessKey</name>
    <value>YOUR_KEY</value>
  </property>
  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>YOUR_KEY</value>
  </property>
  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>YOUR_KEY</value>
  </property>
</configuration>
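
If you'd rather not edit core-site.xml, the same properties can usually be passed on the command line through Hadoop's generic -D options (a sketch; whether this works depends on your Hadoop version, since distcp has to pick up generic options):

hadoop distcp -Dfs.s3.awsAccessKeyId=YOUR_KEY -Dfs.s3.awsSecretAccessKey=YOUR_KEY s3://YOUR_BUCKET/your_file /tmp/your_file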


If your OS supports it (it likely does), you can use /dev/fd/1 as the target for aws s3 cp:

aws s3 cp s3://bucket_name/your_file /dev/fd/1 | zcat | grep log_id

There seem to be some trailing bytes after EOF, but zcat and bzcat conveniently just write a warning to STDERR.

I just confirmed that this works by loading some DB dumps straight from S3 like this:

aws s3 cp s3://some_bucket/some_file.sql.bz2 /dev/fd/1 | bzcat -c | mysql -uroot some_db

All this with nothing but the stuff already on your computer and the official AWS CLI tools. Win.


You could try s3streamcat; it supports bzip2-, gzip- and xz-compressed files.

Install with:

sudo pip install s3streamcat

Usage:

s3streamcat s3://bucketname/dir/file_path
s3streamcat s3://bucketname/dir/file_path | more
s3streamcat s3://bucketname/dir/file_path | grep something
