How to Get Pig to Work with lzo Files?

2023-04-01 05:26 问答作者：

So, I've seen a couple of tutorials for this online, but each seems to say to do something different. Also, each of them doesn't seem to specify whether you're trying to get things to work on a remote cluster, or to locally interact with a remote cluster, etc...

That said, my goal is just to get my local computer (a mac) to make pig work with lzo compressed files that exist on a Hadoop cluster that's already been setup to work with lzo files. I already have Hadoop installed locally and can 开发者_如何学运维get files from the cluster with hadoop fs -[command].

I also already have pig installed locally and communicating with the hadoop cluster when I run scripts or when I just run stuff through grunt. I can load and play around with non-lzo files just fine. My problem is only in terms of figuring out a way to load lzo files. Maybe I can just process them through the cluster's instance of ElephantBird? I have no idea, and have only found minimal information online.

So, any sort of short tutorial or answer for this would be awesome, and would hopefully help more people than just me.

I recently got this to work and wrote up a wiki on it for my coworkers. Here's an excerpt detailing how to get PIG to work with lzos. Hope this helps someone!

NOTE: This is written with a Mac in mind. The steps will be almost identical for other OS', and this should definitely give you what you need to know to configure on Windows or Linux, but you will need to extrapolate a bit (obviously, change Mac-centric folders to whatever OS you're using, etc...).

Hooking PIG up to be able to work with LZOs

This was by far the most annoying and time-consuming part for me-- not because it's difficult, but because there are 50 different tutorials online, none of which are all that helpful. Anyway, what I did to get this working is:

Clone hadoop-lzo from github at https://github.com/kevinweil/hadoop-lzo.
Compile it to get a hadoop-lzo*.jar and the native *.o libraries. You'll need to compile this on a 64bit machine.
Copy the native libs to $HADOOP_HOME/lib/native/Mac_OS_X-x86_64-64/.
Copy the java jar to $HADOOP_HOME/lib and $PIG_HOME/lib
Then configure hadoop and pig to have the property java.library.path point to the lzo native libraries. You can do this in $HADOOP_HOME/conf/mapred-site.xml with:
```
<property>
    <name>mapred.child.env</name>
    <value>JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native/Mac_OS_X-x86_64-64/</value>
</property>
```
Now try out grunt shell by running pig again, and make sure everything still works. If it doesn't, you probably messed up something in mapred-site.xml and you should double check it.
Great! We're almost there. All you need to do now is install elephant-bird. You can get that from https://github.com/kevinweil/elephant-bird (clone it).
Now, in order to get elephant-bird to work, you'll need quite a few pre-reqs. These are listed on the page mentioned above, and might change, so I won't specify them here. What I will mention is that the versions on these are very important. If you get an incorrect version and try running ant, you will get errors. So, don't try grabbing the pre-reqs from brew or macports as you'll likely get a newer version. Instead, just download tarballs and build for each.
command: ant in the elephant-bird folder in order to create a jar.
For simplicity's sake, move all relevant jars (hadoop-lzo-x.x.x.jar and elephant-bird-x.x.x.jar) that you'll need to register frequently somewhere you can easily find them. /usr/local/lib/hadoop/... works nicely.
Try things out! Play around with loading normal files and lzos in grunt shell. Register the relevant jars mentioned above, try loading a file, limiting output to a manageable number, and dumping it. This should all work fine whether you're using a normal text file or an lzo.

继续阅读：apache-pig lzo

How to Get Pig to Work with lzo Files?

Hooking PIG up to be able to work with LZOs

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

Hooking PIG up to be able to work with LZOs

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集 河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？