creating a masked array from text fields

2023-01-16 00:18

The numpy documentation shows an example of masking existing values with ma.masked a posteriori (after array creation), or creating a masked array from a list of what seem to be valid data types (integers if dtype=int). I am trying to read in data from a file (which requires some text manipulation), but at some point I will have a list of lists (or tuples) of strings from which I want to make a numeric (float) array.

An example of the data might be textdata='1\t2\t3\n4\t\t6' (typical flat text format after cleaning).

One problem I have is that missing values may be encoded as '', and trying to convert these to float using the dtype argument gives me

ValueError: setting an array element with a sequence.
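A minimal reproduction of the failure, assuming the rows from the example data above (the exact wording of the message may differ between NumPy versions, so only the exception type is relied on here):

```python
import numpy as np

# Rows of strings as they might come out of the cleaned text file;
# the empty string stands for a missing value.
rows = [('1', '2', '3'), ('4', '', '6')]

try:
    np.array(rows, dtype=float)
    raised = False
except ValueError as exc:
    # The empty string cannot be parsed as a float.
    raised = True
    print("conversion failed:", exc)
```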

So I've created this function

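import numpy.ma as ma

def makemaskedarray(X, missing='', fillvalue='-999.', dtype=float):
    # Substitute a parseable fill value for each missing-value marker
    arr = lambda x: fillvalue if x == missing else x
    # Flag the positions that held the marker
    mask = lambda x: x == missing
    # list(...) is needed on Python 3, where map and zip return iterators
    rows = [(list(map(arr, x)), list(map(mask, x))) for x in X]
    triple = dict(zip(('data', 'mask', 'dtype'),
                      list(zip(*rows)) + [dtype]))
    return ma.array(**triple)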

which seems to do the trick:

>>> makemaskedarray([('1','2','3'),('4','','6')])
masked_array(data =
 [[1.0 2.0 3.0]
 [4.0 -- 6.0]],
             mask =
 [[False False False]
 [False  True False]],
       fill_value = 1e+20)

Is this the way to do it? Or is there a built-in function?

The way you're doing it is fine (though, in my opinion, you could make it a bit more readable by not building the temporary "triple" dict only to expand it a step later).
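As a rough sketch of that simplification (assuming Python 3 and rows of equal length; the name make_masked_array and its parameter names are only illustrative), the data and mask can be built directly and handed to ma.array:

```python
import numpy.ma as ma

def make_masked_array(rows, missing='', fill_value='-999.', dtype=float):
    """Build a masked array from rows of strings (illustrative helper)."""
    # Swap the missing-value marker for a placeholder that can be parsed as a float.
    data = [[fill_value if x == missing else x for x in row] for row in rows]
    # True wherever the original field held the marker.
    mask = [[x == missing for x in row] for row in rows]
    return ma.array(data, mask=mask, dtype=dtype)

result = make_masked_array([('1', '2', '3'), ('4', '', '6')])
print(result.mask.tolist())  # [[False, False, False], [False, True, False]]
```

The placeholder value never shows up in results, since the positions where it was substituted are exactly the ones that get masked.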

The built-in way is to use numpy.genfromtxt. Depending on how much pre-processing your text file needs, it may or may not do what you want. However, here is a basic example (using StringIO to simulate a file):

from io import StringIO  # on Python 2: from StringIO import StringIO
import numpy as np

txt_data = """
1\t2\t3
4\t\t6
7\t8\t9"""

infile = StringIO(txt_data)
data = np.genfromtxt(infile, usemask=True, delimiter='\t')

Which yields:

masked_array(data =
 [[1.0 2.0 3.0]
 [4.0 -- 6.0]
 [7.0 8.0 9.0]],
             mask =
 [[False False False]
 [False  True False]
 [False False False]],
       fill_value = 1e+20)

One word of caution: if you use tabs as your delimiter and an empty string as your missing-value marker, you'll have issues with missing values at the start of a line, because genfromtxt essentially calls line.strip().split(delimiter), which discards leading tabs. You'd be better off using something like "xxx" as a marker for missing values, if you can.
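A small sketch of the explicit-marker approach, assuming the file can be cleaned so that missing fields read "xxx" (the marker itself is arbitrary); genfromtxt accepts it through its missing_values argument:

```python
from io import StringIO
import numpy as np

# "xxx" marks missing fields, including one at the start of a line.
txt_data = "xxx\t2\t3\n4\txxx\t6"

data = np.genfromtxt(StringIO(txt_data), delimiter='\t',
                     missing_values='xxx', usemask=True)
print(data)
```

Because the marker is not whitespace, a missing value in the first column survives any stripping of the line and ends up masked like the others.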
