Python "input data"

2023-03-06 05:42 问答作者：

I have file *.data, which include data in this order:

2.5,10,U1
3,4.5,U1
3,9,U1
3.5,5.5,U1
3.5,8,U1
4,7.5,U1
4.5,3.5,U1
4.5,4.5,U1
4.5,6,U1
5,5,U1
5,7,U1
7,6.5,U1
3.5,9.5,U2
3.5,10.5,U2
4.5,8,U2
4.5,10.5,U2
5,9,U2
5.5,5.5,U2
5.5,7.5,U2

In this data(I have different types of data, this is just example where are just 2 classes...), is 2 classes: U1 and U2, and for every class there is 2 values... What I need is to read this data and separate them to classes, in this case to U1 and U2.... Then after that I need to take from every class 2/3 data to new value(learning_set), a开发者_StackOverflownd other 1/3 to other value(test_set).

I started with this code:

data = open('set.data', 'rt')                             
data_list=[]                                                   
border=2./3                                                  
data_list = [line.strip().split(',') for line in data]

learning_set=data_list[:int(round(len(data_list)*border))]
test_set=data_list[int(round(len(data_list)*border)):]

But there I take from all data 2/3 and 1/3, not from every class.

Many thanks for help

You can filter your list after reading into two distinct subsets:

data_list_1 = [(x,y,c) for (x,y,c) in data_list if c=='U1']
data_list_2 = [(x,y,c) for (x,y,c) in data_list if c=='U2']

Afterwards you can then construct two different learing sets and test sets as before but on the filtered lists, e.g.

learning_set = data_list_1[:int(round(len(data_list_1)*border))] + data_list_2[:int(round(len(data_list_2)*border))]

and same for test_set.

Update: If you don't know the classes before you can use the following code to first detect all classes and then loop over them.

classes = set([t[-1] for t in data_list])

learning_set = []
test_set = []

for cl in classes:
    data_list_filtered = [t for t in data_list if t[-1]==cl]

    learning_set += data_list_filtered[:int(round(len(data_list_filtered)*border))]
    test_set += data_list_filtered[int(round(len(data_list_filtered)*border)):]

Ah, you want itertools.groupby:

import itertools
class_dict = dict(itertools.groupby(data_list, key=lambda x: x[-1]))
class_names = class_dict.keys()
class_lists = [list(group) for group in class_dict.values()]

Then just slice each list in class_lists appropriately and extend learning_set and test_set with the results.

Here's a full solution:

data_list = [line.strip().split(',') for line in data]
data_list.sort(key=lambda x: x[-1])

def bisect_list(split_list, fraction):
    split_index = int(fraction * len(split_list))
    return split_list[:split_index], split_list[split_index:]

learning_set, test_set = [], []
for key, group in itertools.groupby(data_list, key=lambda x: x[-1]):
    l, t = bisect_list(list(group), 0.66)
    learning_set.extend(l)
    test_set.extend(t)

For what it's worth (and because I've typed it out already), I'd accomplish this with something like...

from itertools import groupby
from operator import attrgetter
from collections import namedtuple

row_container = namedtuple('row', 'val1,val2,klass')

def process_row(row):
    """Return a named tuple"""
    return row_container(float(row[0]), float(row[1]), row[2])

def bisect_list(split_list, fraction):
    split_index = int(fraction * len(split_list))
    return split_list[:split_index], split_list[split_index:]


data = open('test.csv', 'rt')

## Parse & process each line
data = (row.strip().split(',') for row in data)
data = (process_row(row) for row in data)

## Sort & group the data by class
sorted_data = sorted(data, key=attrgetter('klass'))
grouped_data = groupby(sorted_data, attrgetter('klass'))

## For each class, create learning and test sets
final_data = {}
for klass, class_rows in grouped_data:
    learning_set, test_set = bisect_list(list(class_rows), 0.66)
    final_data[klass] = dict(learning=learning_set, test=test_set)

Method of operation is similar to other answers already provided. Uses namedtuple. bisectlist() lifted from @senderle

I would use a defaultdict to collect the entries into separate lists.

from collections import defaultdict

data = open(r'C:\Documents and Settings\Administrator\Desktop\set.data', 'r')
data_lists = defaultdict(list)
border = 2.0 / 3
for line in data:
    entries = line.strip().split(',')
    data_lists[entries[-1]].append(entries[ : -1])

learning_sets = {}
test_sets = {}
for cls, values in data_lists.items():
    pos = int(round(len(values) * border))
    learning_sets[cls] = values[ : pos]
    test_sets[cls] = values[pos : ]

for cls in learning_sets:
    print "for class", cls
    print "\tlearning set is", learning_sets[cls]
    print "\ttest set is", test_sets[cls]
    print

consider using a dict/hash instead of a list.

i'd write more, but I am having trouble comprehending what you want to do afterwards.

继续阅读：python

Python "input data"

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？