I'm using Hadoop for data processing with python, what file format should be used?
I have a project with a substantial number of text files.
Each text file has some header information that I need to preserve during processing; however, I don't want the headers to interfere with the clustering algorithms.
I'm using Python on Hadoop (or is there a subpackage better suited?).
How should I format my text files, and store those text files in Hadoop for processing?
1) Files
If you use Hadoop Streaming, you have to use line-based text files; by default, everything up to the first tab of an output line is treated as the key and the rest as the value.
Just look at the documentation for streaming.
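For illustration, a minimal Streaming mapper sketch in Python (word count standing in for your real logic; the tab separates the key from the value in each output line):

```python
#!/usr/bin/env python
# Minimal Streaming mapper sketch: read lines from stdin and emit
# tab-separated key/value pairs on stdout.
import sys

for line in sys.stdin:
    for word in line.split():
        # Everything before the first tab becomes the key.
        print("%s\t1" % word)
```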
You can also put your input files into HDFS, which would be advisable for big files. Just look at the "Large Files" section in the above link.
2) Metadata-preservation
The problem I see is that your header information (metadata) will just be treated as ordinary data, so you have to filter it out yourself (first step). Passing it along is more difficult, as the data of all input files is simply merged after the map step.
You will have to attach the metadata to the data itself (second step) to be able to relate them later. You could emit (key, data+metadata) for each data line of a file and thus preserve the metadata for every data line. That might be a huge overhead, but we are talking MapReduce, which means: pfffrrrr ;)
Now comes the part where I don't know how much Streaming really differs from a Java-implemented job. IF Streaming invokes one mapper per file, you could spare yourself the following trouble: just take the first input line of map() as metadata and add it (or a placeholder) to all following data emits, as in the sketch below.
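A hedged sketch of that idea as a Streaming mapper in Python; it is only valid under the assumption that a mapper really receives exactly one complete file, which you would have to guarantee yourself (e.g. with unsplittable or small files):

```python
#!/usr/bin/env python
# Sketch of the "first line is metadata" approach. Only valid if a
# mapper is guaranteed to receive one complete file from start to end,
# which is an assumption you must enforce yourself.
import sys

header = None
for line in sys.stdin:
    line = line.rstrip("\n")
    if header is None:
        # First line of the file: keep it as metadata, don't emit it,
        # so it can't interfere with the clustering later on.
        header = line
        continue
    # Emit (data, metadata) so every data line carries its header
    # (or substitute a shorter placeholder for the full header).
    print("%s\t%s" % (line, header))
```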
If Streaming does not work that way, the next part is about Java jobs: at least with a JAR mapper you can relate the data to its input file (see here). But you would have to extract the metadata first, as the map function might be invoked on a partition of the file that does not contain the metadata. I'd propose something like this:
- create a metadata file beforehand, containing a placeholder index: keyx: filex, metadatax
- put this metadata index into HDFS
- use a JAR mapper and load the metadata index file during setup()
- see org.apache.hadoop.hdfs.DFSClient
- match filex and set keyx for this mapper
- add the matched keyx to each data line emitted in map()
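If you'd rather stay in Python, here is a hedged Streaming sketch of the same idea. Assumptions: the index is shipped alongside the job (e.g. via -files) as a hypothetical metadata_index.tsv with tab-separated keyx/filex/metadatax lines, and your Hadoop version exposes the current input file to Streaming mappers as the map_input_file or mapreduce_map_input_file environment variable:

```python
#!/usr/bin/env python
# Hypothetical Streaming mapper that mimics the JAR-mapper approach:
# look up the current input file in a shipped metadata index and tag
# every emitted line with the matching key.
import os
import sys

# metadata_index.tsv (shipped with the job) is assumed to hold lines
# of the form: keyx <tab> filex <tab> metadatax
index = {}
with open("metadata_index.tsv") as f:
    for entry in f:
        keyx, filex, metadatax = entry.rstrip("\n").split("\t", 2)
        index[filex] = keyx

# Streaming exposes job configuration as environment variables with
# dots turned into underscores; the property name depends on the
# Hadoop version.
input_file = os.environ.get("mapreduce_map_input_file",
                            os.environ.get("map_input_file", ""))
keyx = index.get(os.path.basename(input_file), "UNKNOWN")

for line in sys.stdin:
    # Tag each data line with the key of the file it came from.
    print("%s\t%s" % (keyx, line.rstrip("\n")))
```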
If you're using Hadoop Streaming, your input can be in any line-based format; your mapper and reducer read their input from sys.stdin, which you parse any way you want. You don't need to use the default tab-delimited fields (although in my experience, one format should be used across all tasks for consistency when possible).
However, with the default splitter and partitioner, you cannot control how your input and output is partitioned or sorted, so your mappers and reducers must decide whether any particular line is a header line or a data line using only that line - they won't know the original file boundaries.
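For example, assuming your headers are recognisable from the line itself (here a leading '#', which is an assumption about your file format, not a Hadoop convention), a mapper could tag lines like this:

```python
#!/usr/bin/env python
# Per-line decision sketch: classify each line as header or data using
# only the line itself, since file boundaries are not visible.
import sys

for line in sys.stdin:
    line = line.rstrip("\n")
    if line.startswith("#"):
        # Header line: tag it so a later task can route it separately.
        print("HEADER\t%s" % line.lstrip("# "))
    else:
        print("DATA\t%s" % line)
```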
You may be able to specify a partitioner which lets a mapper assume that the first input line is the first line in a file, or even move away from a line-based format. This was hard to do the last time I tried with Streaming, and in my opinion mapper and reducer tasks should be input agnostic for efficiency and reusability - it's best to think of a stream of input records, rather than keeping track of file boundaries.
Another option with Streaming is to ship header information in a separate file, which is included with your data. It will be available to your mappers and reducers in their working directories. One idea would be to associate each line with the appropriate header information in an initial task, perhaps by using three fields per line instead of two, rather than associating them by file.
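A hedged sketch of such an initial association task, assuming a hypothetical headers.tsv shipped with the job (so it appears in the task's working directory) that maps a document id to its header, and data lines that already start with that id:

```python
#!/usr/bin/env python
# Hypothetical initial task: join each data line with its header so
# that downstream tasks see three tab-separated fields (id, header,
# text) and never need to know about file boundaries.
import sys

# headers.tsv is assumed to be shipped with the job and to contain
# lines of the form: doc_id <tab> header
headers = {}
with open("headers.tsv") as f:
    for entry in f:
        doc_id, header = entry.rstrip("\n").split("\t", 1)
        headers[doc_id] = header

for line in sys.stdin:
    doc_id, text = line.rstrip("\n").split("\t", 1)
    # Three fields per line instead of two; clustering steps can simply
    # ignore the middle field.
    print("%s\t%s\t%s" % (doc_id, headers.get(doc_id, ""), text))
```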
In general, try and treat the input as a stream and don't rely on file boundaries, input size, or order. All of these restrictions can be implemented, but at the cost of complexity. If you do need to implement them, do so at the beginning or end of your task chain.
If you're using Jython or SWIG, you may have other options, but I found those harder to work with than Streaming.