How can I inspect a Hadoop SequenceFile for which I lack full schema information?

2023-04-08 00:10 Q&A author:

I have a compressed Hadoop SequenceFile from a customer which I'd like to inspect. I do not have full schema information at this time (which I'm working on separately).

But in the interim (and in the hopes of a generic solution), what are my options for inspecting the file?

I found a tool called forqlift: http://www.exmachinatech.net/01/forqlift/

I tried 'forqlift list' on the file. It complains that it can't load the classes for the custom Writable subclasses the file uses, so I will need to track down those implementations.

But is there any other option available in the meantime? I understand that I most likely can't extract the data, but is there some tool for scanning how many key/value pairs there are and of what types?

From shell:

$ hdfs dfs -text /user/hive/warehouse/table_seq/000000_0

or directly from hive (which is much faster for small files, because it is running in an already started JVM)

hive> dfs -text /user/hive/warehouse/table_seq/000000_0

works for sequence files.

Check the SequenceFileReadDemo class in the sample code for 'Hadoop: The Definitive Guide'. Sequence files have the key/value types embedded in them. Use SequenceFile.Reader.getKeyClass() and SequenceFile.Reader.getValueClass() to get the type information.

My first thought would be to use the Java API for sequence files to try to read them. Even if you don't know which Writable is used by the file, you can guess and check the error messages (there may be a better way that I don't know).

For example:

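import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

private void readSeqFile(Path pathToFile) throws IOException {
  Configuration conf = new Configuration();
  FileSystem fs = FileSystem.get(conf);

  SequenceFile.Reader reader = new SequenceFile.Reader(fs, pathToFile, conf);
  try {
    Text key = new Text(); // this could be the wrong type
    Text val = new Text(); // also could be wrong

    while (reader.next(key, val)) {
      System.out.println(key + ":" + val);
    }
  } finally {
    reader.close(); // release the file handle even if next() throws
  }
}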

This program would crash if those are the wrong types, but the Exception should say which Writable type the key and value actually are.

Edit: Actually, if you run less file.seq you can usually read part of the header and see what the Writable types are, because the key and value class names are stored there. On one file, for example, I see:

SEQ^F^Yorg.apache.hadoop.io.Text"org.apache.hadoop.io.BytesWritable
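The header shown above can also be decoded without any Hadoop classes on the classpath, which avoids the problem of missing custom Writables. Below is a minimal sketch in plain Java, assuming a version 6 SequenceFile (the ^F byte in the sample). The class name SeqHeaderPeek and its fields are hypothetical names chosen for illustration. The layout it assumes is: the magic bytes SEQ, a version byte, the key and value class names each prefixed by a variable-length integer, two compression flags, and a codec class name when the file is compressed.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical helper: reads only the SequenceFile header, no Hadoop jars needed. */
public class SeqHeaderPeek {
    public final int version;
    public final String keyClass;
    public final String valueClass;
    public final boolean compressed;
    public final boolean blockCompressed;
    public final String codecClass; // null when the file is not compressed

    private SeqHeaderPeek(int version, String keyClass, String valueClass,
                          boolean compressed, boolean blockCompressed, String codecClass) {
        this.version = version;
        this.keyClass = keyClass;
        this.valueClass = valueClass;
        this.compressed = compressed;
        this.blockCompressed = blockCompressed;
        this.codecClass = codecClass;
    }

    /** Decodes Hadoop's variable-length integer encoding. */
    static int readVInt(DataInputStream in) throws IOException {
        byte first = in.readByte();
        if (first >= -112) {
            return first; // values from -112 to 127 fit in a single byte
        }
        boolean negative = first < -120;
        int extra = negative ? -120 - first : -112 - first; // bytes that follow
        long value = 0;
        for (int i = 0; i < extra; i++) {
            value = (value << 8) | (in.readByte() & 0xFF);
        }
        return (int) (negative ? ~value : value);
    }

    /** Reads a string stored as a vint byte count followed by UTF-8 bytes. */
    static String readString(DataInputStream in) throws IOException {
        int length = readVInt(in);
        if (length < 0 || length > (1 << 20)) {
            throw new IOException("implausible string length " + length);
        }
        byte[] utf8 = new byte[length];
        in.readFully(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static SeqHeaderPeek parse(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        byte[] magic = new byte[3];
        in.readFully(magic);
        if (magic[0] != 'S' || magic[1] != 'E' || magic[2] != 'Q') {
            throw new IOException("not a SequenceFile (bad magic bytes)");
        }
        int version = in.readByte();
        if (version != 6) {
            // Assumption: older header layouts differ and are not handled here.
            throw new IOException("this sketch only handles version 6, got " + version);
        }
        String keyClass = readString(in);
        String valueClass = readString(in);
        boolean compressed = in.readBoolean();
        boolean blockCompressed = in.readBoolean();
        String codecClass = compressed ? readString(in) : null;
        return new SeqHeaderPeek(version, keyClass, valueClass,
                                 compressed, blockCompressed, codecClass);
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            SeqHeaderPeek h = parse(in);
            System.out.println("key class:   " + h.keyClass);
            System.out.println("value class: " + h.valueClass);
            System.out.println("compressed:  " + h.compressed
                    + " (block: " + h.blockCompressed + ", codec: " + h.codecClass + ")");
        }
    }
}
```

This only works on a local copy of the file (for example one fetched with hdfs dfs -get), and it reports the types without counting records, since counting would require walking the record or block structure.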

I'm not a Java or Hadoop programmer, so my way of solving the problem may not be the best one, but here it is anyway.

I spent two days solving the problem of reading a SequenceFile locally (Linux Debian amd64) without installing Hadoop.

The provided sample

while (reader.next(key, val)) {
  System.out.println(key + ":" + val);
}

works well for Text, but didn't work for compressed BytesWritable input data.

What did I do? I downloaded this utility for writing Hadoop SequenceFiles, github_com/shsdev/sequencefile-utility/archive/master.zip , got it working, and then modified it to read Hadoop SequenceFiles.

Instructions for running this utility from scratch on Debian:

sudo apt-get install maven2
sudo mvn install
sudo apt-get install openjdk-7-jdk

Edit /usr/bin/mvn (sudo vi /usr/bin/mvn) and
change `which java` to `which /usr/lib/jvm/java-7-openjdk-amd64/bin/java`

I also added the following to ~/.bashrc (probably not required):

PATH="/home/mine/perl5/bin${PATH+:}${PATH}:/usr/lib/jvm/java-7-openjdk-amd64/bin"; export PATH;

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/
export JAVA_VERSION=1.7


Then build and run:
sudo mvn install
~/hadoop_tools/sequencefile-utility/sequencefile-utility-master$ /usr/lib/jvm/java-7-openjdk-amd64/bin/java -jar ./target/sequencefile-utility-1.0-jar-with-dependencies.jar

This doesn't break the default Java 1.6 installation that is required for Firefox etc.

To resolve SequenceFile compatibility errors (e.g. "Unable to load native-hadoop library for your platform... using builtin-java classes where applicable"), I used the native libraries from the Hadoop master server as is (a kind of hack):

scp root@10.15.150.223:/usr/lib/libhadoop.so.1.0.0 ~/
sudo cp ~/libhadoop.so.1.0.0 /usr/lib/
scp root@10.15.150.223:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/amd64/server/libjvm.so ~/
sudo cp ~/libjvm.so /usr/lib/
sudo ln -s /usr/lib/libhadoop.so.1.0.0 /usr/lib/libhadoop.so.1
sudo ln -s /usr/lib/libhadoop.so.1.0.0 /usr/lib/libhadoop.so

After one night of drinking coffee, I had written this code for reading Hadoop SequenceFile input files (I run it with "/usr/lib/jvm/java-7-openjdk-amd64/bin/java -jar ./target/sequencefile-utility-1.3-jar-with-dependencies.jar -d test/ -c NONE"):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.ValueBytes;

import java.io.DataOutputStream;
import java.io.FileOutputStream;

// fs, conf and logger are taken from the surrounding SequenceFileWriter class
Path file = new Path("/home/mine/mycompany/task13/data/2015-08-30");
SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
try {
  logger.info("GO from pos " + reader.getPosition());

  int DEFAULT_BUFFER_SIZE = 1024 * 1024;
  DataOutputBuffer kobuf = new DataOutputBuffer(DEFAULT_BUFFER_SIZE);
  ValueBytes rawValue = reader.createValueBytes();

  int rl;
  // nextRaw() returns the record length, or -1 at end of file
  while ((rl = reader.nextRaw(kobuf, rawValue)) >= 0) {
    // only the first getLength() bytes of the key buffer are valid
    String key = new String(kobuf.getData(), 0, kobuf.getLength());
    logger.info("read len " + rl + ", key " + key + " (keylen " + kobuf.getLength()
        + ") and data " + rawValue.getSize());

    // each record overwrites /home/mine/outb, so only the last value remains
    DataOutputStream dos = new DataOutputStream(new FileOutputStream("/home/mine/outb"));
    try {
      rawValue.writeUncompressedBytes(dos); // decompresses the value if needed
    } finally {
      dos.close();
    }
    kobuf.reset(); // nextRaw() appends to the key buffer, so clear it per record
  }
} finally {
  reader.close();
}

I've just added this chunk of code to file src/main/java/eu/scape_project/tb/lsdr/seqfileutility/SequenceFileWriter.java just after the line

writer = SequenceFile.createWriter(fs, conf, path, keyClass, valueClass, CompressionType.get(pc.getCompressionType()));
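With nextRaw the key buffer holds the serialized Writable rather than a plain string, so printing it directly also prints the length prefix. The following is a minimal sketch in plain Java of how such buffers could be interpreted, assuming the key is an org.apache.hadoop.io.Text and the uncompressed value is an org.apache.hadoop.io.BytesWritable (check the file header for the actual types). The class name RawWritableDecode and its method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for interpreting buffers filled by nextRaw. */
public class RawWritableDecode {

    /** Serialized Text: a vint byte count, then that many UTF-8 bytes. */
    public static String decodeText(byte[] buf, int length) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(buf, 0, length));
        int n = readVInt(in);
        if (n < 0 || n > length) {
            throw new IOException("bad Text length " + n);
        }
        byte[] utf8 = new byte[n];
        in.readFully(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    /** Serialized BytesWritable: a 4-byte big-endian length, then the payload. */
    public static byte[] decodeBytesWritable(byte[] buf, int length) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(buf, 0, length));
        int n = in.readInt();
        if (n < 0 || n > length - 4) {
            throw new IOException("bad BytesWritable length " + n);
        }
        byte[] payload = new byte[n];
        in.readFully(payload);
        return payload;
    }

    /** Decodes Hadoop's variable-length integer encoding. */
    static int readVInt(DataInputStream in) throws IOException {
        byte first = in.readByte();
        if (first >= -112) {
            return first; // values from -112 to 127 fit in a single byte
        }
        boolean negative = first < -120;
        int extra = negative ? -120 - first : -112 - first; // bytes that follow
        long value = 0;
        for (int i = 0; i < extra; i++) {
            value = (value << 8) | (in.readByte() & 0xFF);
        }
        return (int) (negative ? ~value : value);
    }
}
```

In the loop above, the key could then be printed with RawWritableDecode.decodeText(kobuf.getData(), kobuf.getLength()). For the value, one option is to call rawValue.writeUncompressedBytes on a second DataOutputBuffer and pass its getData() and getLength() to decodeBytesWritable.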

Thanks to these sources of info:

If using hadoop-core instead of mahout, you will have to download asm-3.1.jar manually:
search_maven_org/remotecontent?filepath=org/ow2/util/asm/asm/3.1/asm-3.1.jar
search_maven_org/#search|ga|1|asm-3.1

The list of available mahout repos: repo1_maven_org/maven2/org/apache/mahout/
Intro to Mahout: mahout_apache_org/

Good resource for learning interfaces and sources of Hadoop Java classes (I used it for writing my own code for reading FileSeq): http://grepcode.com/file/repo1.maven.org/maven2/com.ning/metrics.action/0.2.7/org/apache/hadoop/io/BytesWritable.java

Sources of project tb-lsdr-seqfilecreator that I used for creating my own project FileSeq reader: www_javased_com/?source_dir=scape/tb-lsdr-seqfilecreator/src/main/java/eu/scape_project/tb/lsdr/seqfileutility/ProcessParameters.java

stackoverflow_com/questions/5096128/sequence-files-in-hadoop - the same example (reading with next(key, value), which doesn't work here)

https://github.com/twitter/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapreduce/input/RawSequenceFileRecordReader.java - this one helped me (I used reader.nextRaw the same way as in nextKeyValue() and the other methods)

I also changed ./pom.xml to use the native apache.hadoop instead of mahout.hadoop, but this is probably not required, because the bugs in reader.next(key, value) are the same for both, so I had to use reader.nextRaw(keyRaw, valueRaw) instead:

diff ../../sequencefile-utility/sequencefile-utility-master/pom.xml ./pom.xml 
9c9
<     <version>1.0</version>
---
>     <version>1.3</version>
63c63
<             <version>2.0.1</version>
---
>             <version>2.4</version>
85c85
<             <groupId>org.apache.mahout.hadoop</groupId>
---
>             <groupId>org.apache.hadoop</groupId>
87c87
<             <version>0.20.1</version>
---
>             <version>1.1.2</version>
93c93
<             <version>1.1</version>
---
>             <version>1.1.3</version>

I was just playing with Dumbo. When you run a Dumbo job on a Hadoop cluster, the output is a sequence file. I used the following to dump out an entire Dumbo-generated sequence file as plain text:

$ bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar \
    -input totals/part-00000 \
    -output unseq \
    -inputformat SequenceFileAsTextInputFormat
$ bin/hadoop fs -cat unseq/part-00000

I got the idea from here.

Incidentally, Dumbo can also output plain text.

Following the answer of Praveen Sripati, here is a small example using SequenceFileReadDemo.java from Hadoop: The Definitive Guide by Tom White.

The data is in HDFS under user/hduser/output-hashsort/ and the file is part-r-00001. In Eclipse, in the Arguments tab, I entered this string:

[image not preserved: program arguments in Eclipse]

and this is part of the output, with the debugger:

[images not preserved: debugger output]
