
Nutch on EMR problem reading from S3

Hi, I am trying to run Apache Nutch 1.2 on Amazon's EMR.

To do this, I specify an input directory from S3. I get the following error:

Fetcher: java.lang.IllegalArgumentException:
    This file system object (hdfs://ip-11-202-55-144.ec2.internal:9000)
    does not support access to the request path 
    's3n://crawlResults2/segments/20110823155002/crawl_fetch'
    You possibly called FileSystem.get(conf) when you should have called
    FileSystem.get(uri, conf) to obtain a file system supporting your path.

I understand the difference between FileSystem.get(uri, conf) and FileSystem.get(conf). If I were writing this myself, I would call FileSystem.get(uri, conf); however, I am trying to use the existing Nutch code.
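For reference, a minimal sketch of that difference, assuming the Hadoop 0.20-era API that Nutch 1.2 builds against (the bucket path is taken from the error message above):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path s3Path = new Path("s3n://crawlResults2/segments/20110823155002/crawl_fetch");

        // FileSystem.get(conf) returns the *default* filesystem -- whatever
        // fs.default.name points at, which on EMR is hdfs://... -- so handing
        // it an s3n:// path triggers the IllegalArgumentException shown above.
        FileSystem defaultFs = FileSystem.get(conf);

        // FileSystem.get(uri, conf) resolves the filesystem from the path's
        // own scheme, so s3n:// paths work regardless of the default.
        FileSystem s3Fs = FileSystem.get(s3Path.toUri(), conf);
    }
}
```

Since the Nutch code calls the single-argument form, the practical workaround is to change what the default filesystem is, as the answer below suggests.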

I asked this question before, and someone told me that I needed to modify hadoop-site.xml to include the following properties: fs.default.name, fs.s3.awsAccessKeyId, and fs.s3.awsSecretAccessKey. I updated these properties in core-site.xml (hadoop-site.xml does not exist), but that didn't make a difference. Does anyone have any other ideas? Thanks for the help.


Try specifying the following in

hadoop-site.xml

<property>
  <name>fs.default.name</name>
  <value>s3n://crawlResults2</value>
</property>

Note that fs.default.name takes a filesystem URI, not a class name, so point it at your S3 bucket (here s3n://crawlResults2, the bucket from your error message). This tells Nutch that S3 should be used as the default filesystem.

The properties

fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey

need to be specified only when your S3 objects require authentication (an S3 object can be readable by all users, or only by authenticated ones).
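Putting the credential part together, the relevant entries might look like this. The key values are placeholders; also note that paths using the s3n:// scheme (as in your error message) read the fs.s3n.* credential keys, while the older s3:// block-store scheme reads fs.s3.*:

```xml
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_AWS_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_AWS_SECRET_ACCESS_KEY</value>
</property>
```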

