Adding textual column and row headers to numpy array
I am creating a 2d summary matrix from a 3d array using the following code:
numTests=len(TestIDs)
numColumns=11
numRows=6
SummaryMeansArray = p.array([])
summary3dArray = ma.zeros((numTests,numColumns,numRows))
j=0
for j in range(0,len(TestIDs)):
    print 'j is: ',j
    TestID=str(TestIDs[j])
    print 'TestID is: ',TestID
    reader=csv.reader(inputfile)
    m=1
    for row in reader:
        if row[0]!='TestID':
            summary3dArray[j,1,m] =row[2]
            summary3dArray[j,2,m] =row[3]
            summary3dArray[j,3,m] =row[4]
            summary3dArray[j,4,m] =row[5]
            summary3dArray[j,5,m] =row[6]
            summary3dArray[j,6,m] =row[7]
            summary3dArray[j,7,m] =row[8]
            summary3dArray[j,8,m] =row[9]
            summary3dArray[j,9,m] =row[10]
            summary3dArray[j,10,m] =row[11]
            m+=1
    inputfile.close()
outputfile=open(outputFileName, "wb")
writer = csv.writer(outputfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
outputfile.close()
smith='test'
summary3dArray.mask = (summary3dArray.data == 0) # mask all data equal to zero
summaryMeansArray = mean(summary3dArray, axis=0) # the returned shape is (numColumns,numRows)
print 'SummaryMeansArray is: ',summaryMeansArray
The data returned by printing the 2d matrix is:
SummaryMeansArray is: [[-- -- -- -- -- --]
[-- 0.872486111111 0.665114583333 0.578107142857 0.495854166667 0.531722222222]
[-- 69.6520408802 91.3136933451 106.82865123 125.834593798 112.847127834]
[-- 1.26883876577 1.64726525154 1.82965948427 1.93913919335 1.81572414167]
[-- 0.0707222222222 0.0696458333333 0.0654285714286 0.06196875 0.0669444444444]
[-- 0.219861111055 0.195958333333 0.179925 0.1641875 0.177]
[-- 0.290583333278 0.265604166667 0.245353571429 0.22615625 0.243944444444]
[-- 24.1924238322 23.4668576333 23.2784801383 22.8667912971 21.0416383955]
[-- 90.7234287345 108.496149905 112.364863351 113.57480005 144.061033524]
[-- 6.16448575902 9.7494285825 11.6270150699 13.5876342704 16.2569218735]
[-- 0.052665615304 0.069989497088 0.0783212378582 0.0846757181338 0.0862920065249]]
I have two questions:
1.) I want to add textual row headers and column headers to summaryMeansArray, but I am getting error messages when I try to do this now. What is the proper syntax for adding row headers and column headers in this code?
2.) Is summaryMeansArray set up to have 11 columns and 6 rows? My understanding is that the proper syntax is columns,rows. However, it seems to be printing out 11 rows and 6 columns above. Is this just because python groups each column's data within its own brackets by convention? Or did I mess up the syntax?
1.) I would recommend storing the column and row header information in a separate data structure. Although numpy arrays can hold mixed data types (in this case strings and floats, for example with an object dtype), I try to avoid it. Mixing data types is messy and seems inefficient to me. If you want to, you can make your own class that holds both the matrix data and the header information. That seems like a cleaner solution to me.
2.) No, summaryMeansArray is set up to have 11 rows and 6 columns. The first dimension of a 2d array is the number of rows. You can get the transpose of summaryMeansArray with summaryMeansArray.T. When you take the mean of summary3dArray over axis 0, that axis is removed, so the next axis becomes the rows and the one after that becomes the columns.
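To illustrate the point about axes, here is a minimal sketch written for Python 3 (the question's code is Python 2). The sizes mirror the question, but the fill values are made up and the names `data`, `stack` and `means` are only for illustration. It uses `ma.masked_equal` as one way to mask the zeros.

```python
import numpy as np
import numpy.ma as ma

# Hypothetical stand-in for summary3dArray: 3 tests, shape (tests, 11, 6).
data = np.zeros((3, 11, 6))
for t in range(3):
    # Index 0 on both trailing axes is left at zero, as in the question.
    data[t, 1:, 1:] = t + 1

stack = ma.masked_equal(data, 0.0)  # mask every entry equal to zero
means = stack.mean(axis=0)          # axis 0 (tests) is averaged away

print(means.shape)    # (11, 6): 11 rows, 6 columns
print(means.T.shape)  # (6, 11): the transpose swaps rows and columns
```

If 6 rows and 11 columns are what is wanted, either transpose the result or allocate the 3d array as (numTests, numRows, numColumns) and swap the two indices when filling it.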
Edit: As requested, you can create a Python list from a numpy array with the tolist() method. For instance,
newMeansArray = summaryMeansArray.tolist()
Then you can insert the column headers using
newMeansArray.insert(0,headers)
Inserting the row headers can be done with:
newMeansArray[i].insert(0,rowheader)
for each row i. Of course, if you've already inserted the column headers, then the counting for i starts with 1 rather than 0.
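Putting the steps above together, a rough Python 3 sketch might look like the following. The header strings and the small `means` array are invented for the example, and the CSV is written to an in-memory buffer rather than a real file. Inserting the row headers first avoids the off-by-one counting mentioned above.

```python
import csv
import io

import numpy as np

# Hypothetical stand-in for summaryMeansArray (3 rows, 2 columns).
means = np.array([[0.87, 0.67], [69.65, 91.31], [1.27, 1.65]])
col_headers = ["c1", "c2"]        # assumed column labels
row_headers = ["r1", "r2", "r3"]  # assumed row labels

table = means.tolist()  # nested Python lists, one inner list per row

# Row headers first, so row i of the list still matches row i of the array.
for label, row in zip(row_headers, table):
    row.insert(0, label)

# Then the column headers, with an empty corner cell above the row labels.
table.insert(0, [""] + col_headers)

buffer = io.StringIO()
csv.writer(buffer, quoting=csv.QUOTE_ALL).writerows(table)
print(buffer.getvalue())
```

Note that for a masked array, tolist() by default turns masked entries into None, so those cells may need separate handling before writing.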
I agree with Justin Peel's answer, regarding question #1 (row/header labels).
I created my own class that allows me to decorate a matrix with extra data necessary to my task at hand (for example: row and column labels, a descriptive text for each row, or numerical properties of a row that are external to or independent of the matrix values).
My first solution, which I used for almost 2 years, was to have an object for each matrix row, where I would store each row's matrix values in a dictionary, with the dictionary key (ID) providing the second piece of information for that pair's matrix value. This was quite useful, especially for non-square matrices, and matrix manipulations and output were isolated cleanly.
However, I ran into a problem with this design: scalability. When using square, symmetric matrices, I needed 91 MB of memory for a 1000x1000 matrix, 327 MB of memory for a 2000x2000 matrix, and 1900 MB of memory for a 5000x5000 matrix. For my recent project that works on the order of 20000x20000 matrix entries, I will quickly and disastrously use up all of my workstation's 8GB of RAM and more.
My second solution was to have a single dictionary of (ID1,ID2)-->value mappings. Compared to my first solution, a 1000x1000 matrix required only 20 MB of memory. This solution also fails miserably in the scalability department, but in a different way, because the time to create and store C(1000+1,2)=500500 mappings was over 3 minutes, compared to 0.88 seconds when using my first design.
My third and current solution was to create a mapping between the numpy matrix row/column index and a matrix row/column label. Using numpy directly with a 5000x5000 matrix required 202 MB of memory on my system, a 10000x10000 matrix required 774 MB, and a 20000x20000 matrix required 3000 MB. A mapping of 20000 IDs to row/column indexes required 5 MB of memory on my system, which is negligible compared to the value matrix itself.
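A minimal Python 3 sketch of this kind of label-to-index mapping is shown below. It is not the actual class described here: the labels and the helper names `index_of`, `set_value` and `get_value` are hypothetical. The last lines compute the raw size of a 5000x5000 float64 array without allocating it, which is in line with the roughly 202 MB quoted above.

```python
import numpy as np

labels = ["a", "b", "c"]  # hypothetical row/column IDs
index_of = {label: i for i, label in enumerate(labels)}  # label -> index

# The values live in a plain numpy array; the labels live in the dict.
values = np.zeros((len(labels), len(labels)))

def set_value(id1, id2, value):
    """Store a value addressed by labels instead of integer indexes."""
    values[index_of[id1], index_of[id2]] = value

def get_value(id1, id2):
    """Look up a value addressed by labels."""
    return values[index_of[id1], index_of[id2]]

set_value("a", "c", 2.5)
print(get_value("a", "c"))  # 2.5

# Raw data size of a 5000x5000 float64 matrix, computed without allocating it.
size_bytes = 5000 * 5000 * np.dtype(np.float64).itemsize
print(size_bytes)  # 200000000
```

The dictionary grows with the number of labels, while the array grows with its square, which is why the mapping overhead stays small next to the matrix.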
If one is processing only small matrices less than 100x100 elements, then my first solution will be quick and the implemented data structure will be easy to manipulate and extend. However, if you are thinking of large-scale processing, then I recommend the third solution.