How does Pig use Hadoop Globs in a 'load' statement?
As I've noted previously, Pig doesn't cope well with empty (0-byte) files. Unfortunately, there are lots of ways that these files can be created (even within Hadoop utilities).
I thought that I could work around this problem by explicitly loading only files that match a given naming convention in the LOAD statement using Hadoop's glob syntax. Unfortunately, this doesn't seem to work, as even when I use a glob to filter down to known-good input files, I still run into the 0-byte failure mentioned earlier.
Here's an example: Assume I have the following files in S3:
- mybucket/a/b/ (0 bytes)
- mybucket/a/b/myfile.log (>0 bytes)
- mybucket/a/b/yourfile.log (>0 bytes)
If I use a LOAD statement like this in my pig script:
myData = load 's3://mybucket/a/b/*.log' as ( ... );
I would expect that Pig would not choke on the 0-byte file, but it still does. Is there a trick to getting Pig to actually only look at files that match the expected glob pattern?
This is a fairly ugly solution, but globs that don't rely on the * wildcard syntax appear to work. So, in our workflow (before calling our Pig script), we list all of the files below the prefix we're interested in, and then create a specific glob that consists of only the paths we want.
In the example above, we list "mybucket/a":
hadoop fs -lsr s3://mybucket/a
This returns a list of files, plus other metadata. We can then create the glob from that data:
myData = load 's3://mybucket/a/b{/myfile.log,/yourfile.log}' as ( ... );
This requires a bit more front-end work, but it allows us to specifically target the files we're interested in and avoid 0-byte files.
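As a rough illustration, here is a minimal shell sketch of how that front-end step might be automated, assuming the resulting glob is handed to the Pig script as a parameter (the parameter name INPUT_GLOB and the script name myscript.pig are placeholders, not part of our actual workflow; exact `hadoop fs -ls` output fields can vary by version):

# Build a brace-style glob from the non-empty .log files under a prefix.
PREFIX='s3://mybucket/a/b'

# In typical `hadoop fs -ls` output, field 5 is the file size and the last
# field is the full path; keep only .log paths whose size is greater than zero.
FILES=$(hadoop fs -ls "$PREFIX" | awk '$5 > 0 && $NF ~ /\.log$/ {print $NF}')

# Strip the common prefix and join the remaining names with commas, giving
# something like: s3://mybucket/a/b/{myfile.log,yourfile.log}
NAMES=$(echo "$FILES" | sed "s|^$PREFIX/||" | paste -sd, -)
GLOB="$PREFIX/{$NAMES}"

# Pass the glob to Pig as a parameter.
pig -param INPUT_GLOB="$GLOB" myscript.pig

The Pig script would then refer to the parameter instead of a hard-coded path:

myData = load '$INPUT_GLOB' as ( ... );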
Update: Unfortunately, I've found that this solution fails when the glob pattern gets long; Pig ends up throwing an "Unable to create input slice" exception.