Is there a way to load a Gzipped file from Amazon S3 into Pentaho (PDI / Spoon / Kettle)?
There is a "Text File Input" that has a Compression attribute that supports Gzip, but this module can't connect to S3 as a source.
There is an "S3 CSV Input" module, but no Compression attribute, so it can't decompress the Gzipped content 开发者_开发技巧into tabular form.
Also, there is no way to save the data from S3 to a local file. The downloaded content can only be "hopped" to another Step, but no Step can read gzipped data from a previous Step; the Gzip-compatible steps all read only from files.
So, I can get gzipped data from S3, but I can't send that data anywhere that can consume it.
Am I missing something? Is there a way to decompress gzipped data coming from a non-file source?
Kettle uses VFS (Virtual File System) when working with files. Therefore, you can fetch a file through http, ssh, ftp, zip, and so on, and use it as a regular local file in all the steps that read files; just use the right URL. The Pentaho documentation covers Kettle's VFS support in more detail. Also, check out the VFS transformation examples that come with Kettle.
This is the URL template for S3: s3://<Access Key>:<Secret Access Key>@s3<file path>
In your case, you would use "Text file input" with the compression setting you mentioned, and the selected file would be:
s3://aCcEsSkEy:SecrEttAccceESSKeeey@s3/your-s3-bucket/your_file.gzip
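Under the hood this goes through Apache Commons VFS, so you can also sanity-check the URL outside of Spoon. Below is a minimal sketch, assuming the pentaho-s3-vfs provider jar (and jets3t) is on the classpath; it uses the Commons VFS 2 API, whereas older PDI releases bundle VFS 1 (package org.apache.commons.vfs), and the credentials, bucket, and file name are the same placeholders as above:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.util.zip.GZIPInputStream;

    import org.apache.commons.vfs2.FileObject;
    import org.apache.commons.vfs2.FileSystemManager;
    import org.apache.commons.vfs2.VFS;

    public class S3GzipRead {
        public static void main(String[] args) throws Exception {
            // Placeholder credentials and path -- same layout as the URL template above.
            String url = "s3://aCcEsSkEy:SecrEttAccceESSKeeey@s3/your-s3-bucket/your_file.gzip";

            // Kettle resolves file URLs through Commons VFS; the s3 scheme only
            // works if the pentaho-s3-vfs provider jar is on the classpath.
            FileSystemManager fsManager = VFS.getManager();
            FileObject file = fsManager.resolveFile(url);

            // VFS hands back the raw (still compressed) stream; wrap it in a
            // GZIPInputStream to decompress, just as "Text file input" does
            // when its Compression attribute is set to GZip.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(
                            new GZIPInputStream(file.getContent().getInputStream())))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }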
I don't know the exact steps, but if you really need this you can look into using S3 through the VFS capabilities that Pentaho Data Integration provides. In my PDI CE distribution I can see a vfs-providers.xml, inside ../data-integration/libext/pentaho/pentaho-s3-vfs-1.0.1.jar, with the following content:
    <providers>
        <provider class-name="org.pentaho.s3.vfs.S3FileProvider">
            <scheme name="s3"/>
            <if-available class-name="org.jets3t.service.S3Service"/>
        </provider>
    </providers>
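If the s3 scheme doesn't seem to work, a quick way to check whether the provider from that XML was actually registered is Commons VFS's hasProvider. A minimal sketch, again assuming the Commons VFS 2 API and that the jars above are on the classpath:

    import org.apache.commons.vfs2.FileSystemManager;
    import org.apache.commons.vfs2.VFS;

    public class CheckS3Provider {
        public static void main(String[] args) throws Exception {
            FileSystemManager fsManager = VFS.getManager();
            // "s3" is the scheme registered by S3FileProvider in vfs-providers.xml;
            // this prints false if the provider jar or jets3t is missing.
            System.out.println("s3 scheme registered: " + fsManager.hasProvider("s3"));
        }
    }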
You can also try the GZIP CSV Input step in Pentaho Kettle; it is there.