Python "input data"
I have a file, *.data, which contains data in this order:
2.5,10,U1
3,4.5,U1
3,9,U1
3.5,5.5,U1
3.5,8,U1
4,7.5,U1
4.5,3.5,U1
4.5,4.5,U1
4.5,6,U1
5,5,U1
5,7,U1
7,6.5,U1
3.5,9.5,U2
3.5,10.5,U2
4.5,8,U2
4.5,10.5,U2
5,9,U2
5.5,5.5,U2
5.5,7.5,U2
This data has 2 classes, U1 and U2, and every row holds 2 values for its class. (I have different types of data; this is just an example with only 2 classes.) I need to read the data and separate it by class, in this case into U1 and U2. After that I need to take 2/3 of each class's rows into one variable (learning_set) and the other 1/3 into another variable (test_set).
I started with this code:
data = open('set.data', 'rt')
data_list=[]
border=2./3
data_list = [line.strip().split(',') for line in data]
learning_set=data_list[:int(round(len(data_list)*border))]
test_set=data_list[int(round(len(data_list)*border)):]
But this takes 2/3 and 1/3 of all the data, not of each class.
Many thanks for your help.
You can filter your list after reading into two distinct subsets:
data_list_1 = [(x,y,c) for (x,y,c) in data_list if c=='U1']
data_list_2 = [(x,y,c) for (x,y,c) in data_list if c=='U2']
Afterwards you can construct the learning set and test set as before, but on the filtered lists, e.g.
learning_set = data_list_1[:int(round(len(data_list_1)*border))] + data_list_2[:int(round(len(data_list_2)*border))]
and the same for test_set, using the slices from that index onward.
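For completeness, here is a minimal sketch of the two-class version with both sets spelled out. It assumes the rows are already parsed into lists of strings as in the question; the six inline rows are just a sample taken from the data above, and the names cut_1 and cut_2 are made up for illustration.

```python
# Sample rows from the question, already split on commas.
data_list = [
    ['2.5', '10', 'U1'], ['3', '4.5', 'U1'], ['3', '9', 'U1'],
    ['3.5', '9.5', 'U2'], ['3.5', '10.5', 'U2'], ['4.5', '8', 'U2'],
]
border = 2.0 / 3

data_list_1 = [row for row in data_list if row[-1] == 'U1']
data_list_2 = [row for row in data_list if row[-1] == 'U2']

# One cut index per class, so each class is split 2/3 : 1/3 on its own.
cut_1 = int(round(len(data_list_1) * border))
cut_2 = int(round(len(data_list_2) * border))

learning_set = data_list_1[:cut_1] + data_list_2[:cut_2]
test_set = data_list_1[cut_1:] + data_list_2[cut_2:]
```

With three rows per class, each class contributes two rows to learning_set and one to test_set.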
Update: If you don't know the classes beforehand, you can use the following code to first detect all classes and then loop over them.
classes = set([t[-1] for t in data_list])
learning_set = []
test_set = []
for cl in classes:
    data_list_filtered = [t for t in data_list if t[-1]==cl]
    learning_set += data_list_filtered[:int(round(len(data_list_filtered)*border))]
    test_set += data_list_filtered[int(round(len(data_list_filtered)*border)):]
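The loop above can also be wrapped in a function. This is only a sketch: the name stratified_split is made up for illustration, and the classes are sorted because a set has no guaranteed iteration order, which could otherwise change the order of rows in the result from run to run.

```python
def stratified_split(rows, border=2.0 / 3):
    """Split rows into (learning_set, test_set), class by class.

    The class label is assumed to be the last item of every row.
    """
    learning_set, test_set = [], []
    for cl in sorted(set(row[-1] for row in rows)):
        rows_of_class = [row for row in rows if row[-1] == cl]
        cut = int(round(len(rows_of_class) * border))
        learning_set += rows_of_class[:cut]
        test_set += rows_of_class[cut:]
    return learning_set, test_set


# Made-up rows with the classes interleaved.
rows = [['1', '1', 'U2'], ['2', '2', 'U1'], ['3', '3', 'U1'],
        ['4', '4', 'U2'], ['5', '5', 'U1'], ['6', '6', 'U2']]
learning_set, test_set = stratified_split(rows)
```

Every input row ends up in exactly one of the two returned lists, and rows of the same class keep their original relative order.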
Ah, you want itertools.groupby:
import itertools
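# groupby only groups adjacent rows, so sort by class first; turn each group
# into a list right away, because the group iterators are invalidated as groupby advances
key = lambda x: x[-1]
class_dict = {name: list(group) for name, group in itertools.groupby(sorted(data_list, key=key), key=key)}
class_names = list(class_dict.keys())
class_lists = list(class_dict.values())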
Then just slice each list in class_lists appropriately and extend learning_set and test_set with the results.
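One thing to keep in mind with groupby is that it only groups consecutive rows with equal keys, so the input usually needs to be sorted by class first. A small sketch with made-up rows (the names rows and label are just for illustration):

```python
import itertools

rows = [['1', 'U1'], ['2', 'U2'], ['3', 'U1']]
label = lambda row: row[-1]

# Unsorted input: 'U1' shows up as two separate groups.
unsorted_keys = [key for key, _ in itertools.groupby(rows, key=label)]

# Sorted input: one group per class.
sorted_keys = [key for key, _ in itertools.groupby(sorted(rows, key=label), key=label)]
```

Here unsorted_keys is ['U1', 'U2', 'U1'], while sorted_keys is ['U1', 'U2'].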
Here's a full solution:
data_list = [line.strip().split(',') for line in data]
data_list.sort(key=lambda x: x[-1])

def bisect_list(split_list, fraction):
    split_index = int(fraction * len(split_list))
    return split_list[:split_index], split_list[split_index:]

learning_set, test_set = [], []
for key, group in itertools.groupby(data_list, key=lambda x: x[-1]):
    l, t = bisect_list(list(group), 0.66)
    learning_set.extend(l)
    test_set.extend(t)
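Note that int(0.66 * n) truncates, whereas the question's int(round(n * 2./3)) rounds, so the two can cut at different places. A quick check for the class sizes in the sample data (12 rows of U1, 7 rows of U2); the dictionary names are made up for illustration:

```python
sizes = {'U1': 12, 'U2': 7}  # class sizes in the question's sample data

# Cut index from truncating with 0.66, as in bisect_list.
truncated = {cl: int(0.66 * n) for cl, n in sizes.items()}

# Cut index from rounding with 2/3, as in the question.
rounded = {cl: int(round(n * 2.0 / 3)) for cl, n in sizes.items()}
```

With these sizes, truncating puts 7 and 4 rows in the learning set, while rounding puts 8 and 5 there. Which one is preferable depends on how strictly the 2/3 share should be met.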
For what it's worth (and because I've typed it out already), I'd accomplish this with something like...
from itertools import groupby
from operator import attrgetter
from collections import namedtuple
row_container = namedtuple('row', 'val1,val2,klass')
def process_row(row):
    """Return a named tuple"""
    return row_container(float(row[0]), float(row[1]), row[2])

def bisect_list(split_list, fraction):
    split_index = int(fraction * len(split_list))
    return split_list[:split_index], split_list[split_index:]
data = open('test.csv', 'rt')
## Parse & process each line
data = (row.strip().split(',') for row in data)
data = (process_row(row) for row in data)
## Sort & group the data by class
sorted_data = sorted(data, key=attrgetter('klass'))
grouped_data = groupby(sorted_data, attrgetter('klass'))
## For each class, create learning and test sets
final_data = {}
for klass, class_rows in grouped_data:
    learning_set, test_set = bisect_list(list(class_rows), 0.66)
    final_data[klass] = dict(learning=learning_set, test=test_set)
The method of operation is similar to the other answers already provided, but this one uses namedtuple. bisect_list() is lifted from @senderle.
I would use a defaultdict to collect the entries into separate lists.
from collections import defaultdict
data = open(r'C:\Documents and Settings\Administrator\Desktop\set.data', 'r')
data_lists = defaultdict(list)
border = 2.0 / 3
for line in data:
    entries = line.strip().split(',')
    data_lists[entries[-1]].append(entries[:-1])
learning_sets = {}
test_sets = {}
for cls, values in data_lists.items():
    pos = int(round(len(values) * border))
    learning_sets[cls] = values[:pos]
    test_sets[cls] = values[pos:]
# print() calls (Python 3) replace the original Python 2 print statements
for cls in learning_sets:
    print("for class", cls)
    print("\tlearning set is", learning_sets[cls])
    print("\ttest set is", test_sets[cls])
    print()
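The lists built above hold the values as strings. If numbers are needed, a variant along the same lines could use the standard csv module and convert while reading. This is a sketch only; io.StringIO stands in for the real file so the snippet runs on its own, and the three rows are a sample from the question.

```python
import csv
import io
from collections import defaultdict

# Stand-in for the opened data file; three rows taken from the question.
data = io.StringIO("2.5,10,U1\n3,4.5,U1\n3.5,9.5,U2\n")

data_lists = defaultdict(list)
for row in csv.reader(data):
    if not row:  # skip blank lines
        continue
    values, cls = row[:-1], row[-1]
    data_lists[cls].append([float(v) for v in values])
```

After this, data_lists['U1'] is [[2.5, 10.0], [3.0, 4.5]] and can be sliced into learning and test parts exactly as before.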
Consider using a dict/hash instead of a list.
I'd write more, but I am having trouble understanding what you want to do afterwards.