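Machine Learning: Data Cleaning and a Summary of Six Ways to Handle Missing Values
Contents
- 1 Data cleaning
- 1.1 Concept
- 1.2 Importance
- 1.3 Considerations
- 2 Finding missing values, type conversion, and standardization
- 2.1 Missing values stored as empty (NaN)
- 2.2 Missing values stored as a placeholder
- 2.3 Type conversion and standardization
- 3 Six ways to handle missing values
- 3.1 About the data
- 3.2 Imports and variables
- 3.3 Deleting rows with missing values
- 3.4 Mean fill
- 3.5 Median fill
- 3.6 Mode fill
- 3.7 Linear regression fill
- 3.8 Random forest fill
- 4 Saving the data
- 5 Putting the code together and testing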
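1 Data cleaning
1.1 Concept
Data cleaning is the process of identifying, then correcting or removing, data in a dataset that is inaccurate, incomplete, inconsistently formatted, or in conflict with business rules. It is a core part of data preprocessing. Its purpose is to improve data quality and to ensure consistency and accuracy, so that later work such as data analysis, data mining, and machine learning rests on a reliable foundation. Data cleaning is iterative: it may take several rounds of adjustment to reach a satisfactory result.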
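1.2 Importance
Data cleaning turns raw data into a usable, reliable, and meaningful form for further analysis and mining.
It directly affects the accuracy and reliability of every downstream result: unclean data can lead to wrong conclusions and wrong decisions.
1.3 Considerations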
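- 1. Completeness: check whether individual records contain missing values and whether all expected fields are present.
- 2. Comprehensiveness: look at all values in a column; comparing the maximum, minimum, mean, and the column's definition helps judge whether the data covers what it should.
- 3. Validity: check that the type, content, and magnitude of each value follow the preset rules. For example, a human age above 1000 years is invalid.
- 4. Uniqueness: check for duplicate records, for example one person's data recorded several times.
- 5. Class reliability: check whether the class labels can be trusted.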
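The first four checks above can be sketched with pandas as follows. This is only an illustration on a made-up table: the column names `name` and `age` and the plausible age range of 0 to 150 are assumptions, not part of the article's dataset.

```python
import numpy as np
import pandas as pd

# Toy data; the column names and values are hypothetical.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", "Cid"],
    "age": [34, 1200, 1200, np.nan],
})

# Completeness: number of missing values in each column.
missing_per_column = df.isnull().sum()

# Comprehensiveness: summary statistics (min, max, mean, ...) of one column.
age_summary = df["age"].describe()

# Validity: rows whose age falls outside an assumed plausible range.
invalid_age = df[(df["age"] < 0) | (df["age"] > 150)]

# Uniqueness: number of rows that exactly repeat an earlier row.
duplicate_count = int(df.duplicated().sum())

print(missing_per_column)
print(age_summary)
print(invalid_age)
print(duplicate_count)
```

Note that the missing age is not flagged by the validity check, because comparisons with NaN evaluate to False; completeness and validity have to be checked separately.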
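2 Finding missing values, type conversion, and standardization
2.1 Missing values stored as empty (NaN)
- null_num = data.isnull() tests every cell; cells that are missing are marked True.
- null_all = null_num.sum() counts the missing values in each column.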
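As a minimal sketch of these two calls, assuming a small made-up DataFrame in place of the real data:

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (values are hypothetical).
data = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
})

# Boolean mask: True where a cell is missing.
null_num = data.isnull()

# Summing the mask column by column counts the missing values per column.
null_all = null_num.sum()

print(null_num)
print(null_all)
```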
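2.2 Missing values stored as a placeholder
- If missing values are stored as a placeholder such as the string 'NA', isnull() does not detect them. Replace the placeholder with a real missing value first, data.replace('NA', np.nan, inplace=True), then compute null_num = data.isnull() and null_all = null_num.sum() as before. Replacing with an empty string '' is not enough, because isnull() does not treat '' as missing.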
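The effect of the replacement can be seen on a small example. The table below is made up, and `'NA'` is assumed to be the only placeholder in use:

```python
import numpy as np
import pandas as pd

# Toy data in which missing entries were written as the string 'NA'.
data = pd.DataFrame({
    "a": ["1.5", "NA", "2.0"],
    "b": ["NA", "3", "4"],
})

# Before the replacement, isnull() finds nothing: 'NA' is an ordinary string.
before = data.isnull().sum()

# Replace the placeholder with a real missing value.
data = data.replace("NA", np.nan)

# Now the missing entries are counted.
after = data.isnull().sum()

print(before)
print(after)
```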
2.2.1 Results
Debug output (figures): null_num = data.isnull(), null_all, and the original data after processing.
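2.3 Type conversion and standardization
- Convert feature columns to numeric: pd.to_numeric(column, errors='coerce'). Values that cannot be parsed become NaN.
- Standardize: scaler = StandardScaler(), then x_all_z = scaler.fit_transform(x_all).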
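Both steps can be sketched on a toy table. The column names `f1` and `f2` and the unparseable entry `"bad"` are made up for illustration; StandardScaler ignores NaN when fitting and leaves it as NaN in the output, so standardization can run before the missing values are filled.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy features; "bad" stands for any entry that is not a number.
x_all = pd.DataFrame({
    "f1": ["1", "2", "bad", "4"],
    "f2": [10.0, 20.0, 30.0, 40.0],
})

# Convert every column to numeric; unparseable entries become NaN.
for column_name in x_all.columns:
    x_all[column_name] = pd.to_numeric(x_all[column_name], errors="coerce")

# Standardize each column to zero mean and unit variance.
scaler = StandardScaler()
x_all_z = scaler.fit_transform(x_all)

# fit_transform returns a NumPy array, so rebuild the DataFrame.
x_all = pd.DataFrame(x_all_z, columns=x_all.columns)
print(x_all)
```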
Debug output (figure): x_all_z
3 Six ways to handle missing values
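3.1 About the data
A sample of the data (figure). The first column is the serial number (序号), the last column is the class label (矿物类型, mineral type), and the remaining columns are feature variables.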
3.2 Imports and variables
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
train_data, train_label, test_data, and test_label are the training features, training labels, test features, and test labels. The label column is named 矿物类型.
3.3 Deleting rows with missing values
# Delete rows that contain missing values (complete-case analysis)
def cca_train_fill(train_data, train_label):
    data = pd.concat([train_data, train_label], axis=1)
    # reset_index() renumbers the rows from 0
    data = data.reset_index(drop=True)
    # dropna() removes every row that contains a missing value
    df_filled = data.dropna()
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('矿物类型', axis=1), df_filled.矿物类型

def cca_test_fill(train_data, train_label, test_data, test_label):
    # train_data and train_label are not used here; they are kept so that
    # every *_test_fill function has the same signature
    data = pd.concat([test_data, test_label], axis=1)
    data = data.reset_index(drop=True)
    df_filled = data.dropna()
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('矿物类型', axis=1), df_filled.矿物类型
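The reason for concatenating features and labels before calling dropna() is that both must lose the same rows. A small illustration on made-up data (the feature names `f1` and `f2` are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy training data; the second row has a missing feature.
train_data = pd.DataFrame({"f1": [1.0, np.nan, 3.0], "f2": [4.0, 5.0, 6.0]})
train_label = pd.Series([0, 1, 2], name="矿物类型")

# Join features and label so that dropna() removes them together.
data = pd.concat([train_data, train_label], axis=1).reset_index(drop=True)
df_filled = data.dropna().reset_index(drop=True)

x_clean = df_filled.drop("矿物类型", axis=1)
y_clean = df_filled["矿物类型"]

print(x_clean)
print(y_clean.tolist())
```

The label of the dropped row disappears along with its features, so `x_clean` and `y_clean` stay aligned.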
3.4 Mean fill
# Mean fill
# Training set: fill each column with that column's mean
def mean_train_method(data):
    # Column means
    fill_values = data.mean()
    # fillna() replaces missing values with the matching column mean
    return data.fillna(fill_values)

def mean_train_fill(train_data, train_label):
    data = pd.concat([train_data, train_label], axis=1)
    data = data.reset_index(drop=True)
    # Split by class so each class is filled with its own means
    A = data[data['矿物类型'] == 0]
    B = data[data['矿物类型'] == 1]
    C = data[data['矿物类型'] == 2]
    D = data[data['矿物类型'] == 3]
    A = mean_train_method(A)
    B = mean_train_method(B)
    C = mean_train_method(C)
    D = mean_train_method(D)
    df_filled = pd.concat([A, B, C, D])
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('矿物类型', axis=1), df_filled.矿物类型

# Test set: fill with the means of the training set, not of the test set
def mean_test_method(train_data, test_data):
    fill_values = train_data.mean()
    return test_data.fillna(fill_values)

def mean_test_fill(train_data, train_label, test_data, test_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    test_data_all = pd.concat([test_data, test_label], axis=1)
    test_data_all = test_data_all.reset_index(drop=True)
    A_train = train_data_all[train_data_all['矿物类型'] == 0]
    B_train = train_data_all[train_data_all['矿物类型'] == 1]
    C_train = train_data_all[train_data_all['矿物类型'] == 2]
    D_train = train_data_all[train_data_all['矿物类型'] == 3]
    A_test = test_data_all[test_data_all['矿物类型'] == 0]
    B_test = test_data_all[test_data_all['矿物类型'] == 1]
    C_test = test_data_all[test_data_all['矿物类型'] == 2]
    D_test = test_data_all[test_data_all['矿物类型'] == 3]
    # Fill each test-set class with the training-set means of the same class
    A = mean_test_method(A_train, A_test)
    B = mean_test_method(B_train, B_test)
    C = mean_test_method(C_train, C_test)
    D = mean_test_method(D_train, D_test)
    df_filled = pd.concat([A, B, C, D])
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('矿物类型', axis=1), df_filled.矿物类型
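The per-class training fill above spells out the four classes by hand. An equivalent and more compact formulation, shown here only as a sketch on made-up data, uses groupby().transform() to compute the class mean for every row. The feature name `f1` is hypothetical, and unlike the functions above this version keeps the original row order.

```python
import numpy as np
import pandas as pd

# Toy training data with two classes (values are hypothetical).
data = pd.DataFrame({
    "f1": [1.0, np.nan, 3.0, 10.0, np.nan],
    "矿物类型": [0, 0, 0, 1, 1],
})
feature_cols = ["f1"]

# For every row, the mean of its own class (NaN is skipped when averaging).
class_means = data.groupby("矿物类型")[feature_cols].transform("mean")

# Fill each missing cell with the mean of the row's class.
filled = data.copy()
filled[feature_cols] = data[feature_cols].fillna(class_means)

print(filled)
```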
3.5 Median fill
# Median fill
def median_train_method(data):
    # Column medians
    fill_values = data.median()
    return data.fillna(fill_values)

def median_train_fill(train_data, train_label):
    data = pd.concat([train_data, train_label], axis=1)
    data = data.reset_index(drop=True)
    # Split by class so each class is filled with its own medians
    A = data[data['矿物类型'] == 0]
    B = data[data['矿物类型'] == 1]
    C = data[data['矿物类型'] == 2]
    D = data[data['矿物类型'] == 3]
    A = median_train_method(A)
    B = median_train_method(B)
    C = median_train_method(C)
    D = median_train_method(D)
    df_filled = pd.concat([A, B, C, D])
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('矿物类型', axis=1), df_filled.矿物类型

# Test set: fill with the medians of the training set
def median_test_method(train_data, test_data):
    fill_values = train_data.median()
    return test_data.fillna(fill_values)

def median_test_fill(train_data, train_label, test_data, test_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    test_data_all = pd.concat([test_data, test_label], axis=1)
    test_data_all = test_data_all.reset_index(drop=True)
    A_train = train_data_all[train_data_all['矿物类型'] == 0]
    B_train = train_data_all[train_data_all['矿物类型'] == 1]
    C_train = train_data_all[train_data_all['矿物类型'] == 2]
    D_train = train_data_all[train_data_all['矿物类型'] == 3]
    A_test = test_data_all[test_data_all['矿物类型'] == 0]
    B_test = test_data_all[test_data_all['矿物类型'] == 1]
    C_test = test_data_all[test_data_all['矿物类型'] == 2]
    D_test = test_data_all[test_data_all['矿物类型'] == 3]
    A = median_test_method(A_train, A_test)
    B = median_test_method(B_train, B_test)
    C = median_test_method(C_train, C_test)
    D = median_test_method(D_train, D_test)
    df_filled = pd.concat([A, B, C, D])
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('矿物类型', axis=1), df_filled.矿物类型
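The median fill differs from the mean fill only in the statistic used. A tiny example with made-up numbers shows why that can matter: a single extreme value shifts the mean a lot and the median not at all.

```python
import pandas as pd

# Toy column with one extreme value (numbers are hypothetical).
values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

mean_value = values.mean()      # pulled upward by the outlier
median_value = values.median()  # the middle value, unaffected by it

print(mean_value, median_value)
```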
3.6 Mode fill
# Mode fill
def mode_train_method(data):
    # apply() runs the function on every column.
    # A column may have several modes; use the first one, or None if there is none.
    fill_values = data.apply(lambda x: x.mode().iloc[0] if len(x.mode()) > 0 else None)
    return data.fillna(fill_values)

def mode_train_fill(train_data, train_label):
    data = pd.concat([train_data, train_label], axis=1)
    data = data.reset_index(drop=True)
    # Split by class so each class is filled with its own modes
    A = data[data['矿物类型'] == 0]
    B = data[data['矿物类型'] == 1]
    C = data[data['矿物类型'] == 2]
    D = data[data['矿物类型'] == 3]
    A = mode_train_method(A)
    B = mode_train_method(B)
    C = mode_train_method(C)
    D = mode_train_method(D)
    df_filled = pd.concat([A, B, C, D])
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('矿物类型', axis=1), df_filled.矿物类型

# Test set: fill with the modes of the training set
def mode_test_method(train_data, test_data):
    fill_values = train_data.apply(lambda x: x.mode().iloc[0] if len(x.mode()) > 0 else None)
    return test_data.fillna(fill_values)

def mode_test_fill(train_data, train_label, test_data, test_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    test_data_all = pd.concat([test_data, test_label], axis=1)
    test_data_all = test_data_all.reset_index(drop=True)
    A_train = train_data_all[train_data_all['矿物类型'] == 0]
    B_train = train_data_all[train_data_all['矿物类型'] == 1]
    C_train = train_data_all[train_data_all['矿物类型'] == 2]
    D_train = train_data_all[train_data_all['矿物类型'] == 3]
    A_test = test_data_all[test_data_all['矿物类型'] == 0]
    B_test = test_data_all[test_data_all['矿物类型'] == 1]
    C_test = test_data_all[test_data_all['矿物类型'] == 2]
    D_test = test_data_all[test_data_all['矿物类型'] == 3]
    A = mode_test_method(A_train, A_test)
    B = mode_test_method(B_train, B_test)
    C = mode_test_method(C_train, C_test)
    D = mode_test_method(D_train, D_test)
    df_filled = pd.concat([A, B, C, D])
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('矿物类型', axis=1), df_filled.矿物类型
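The `.iloc[0]` in the lambda matters when a column has more than one mode. Series.mode() returns all modes in sorted order, so the smallest one is used. A short sketch on made-up data (column names `f1` and `f2` are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy data: f1 has a single mode (2.0); f2 has two modes (3.0 and 7.0).
data = pd.DataFrame({
    "f1": [2.0, 2.0, 5.0, np.nan],
    "f2": [7.0, 3.0, 3.0, 7.0],
})

# Same expression as in mode_train_method: first mode of each column.
fill_values = data.apply(lambda x: x.mode().iloc[0] if len(x.mode()) > 0 else None)

filled = data.fillna(fill_values)

print(fill_values)
print(filled)
```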
3.7 Linear regression fill
# Linear regression fill
# Note: the loop needs at least one column without missing values to start from.
def lr_train_fill(train_data, train_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    train_data_x = train_data_all.drop('矿物类型', axis=1)
    # Count the missing values in each column
    null_num = train_data_x.isnull().sum()
    # Order the columns from fewest to most missing values
    null_num_sorted = null_num.sort_values(ascending=True)
    filling_feature = []
    for i in null_num_sorted.index:
        filling_feature.append(i)
        # Only columns that contain missing values need filling
        if null_num_sorted[i] != 0:
            # x: the columns handled so far (already complete), without column i
            x = train_data_x[filling_feature].drop(i, axis=1)
            # y: column i, including its missing entries
            y = train_data_x[i]
            # Row indices where column i is missing
            row_numbers_null_list = train_data_x[train_data_x[i].isnull()].index.tolist()
            # Fit on the rows where column i is present
            x_train = x.drop(row_numbers_null_list)
            y_train = y.drop(row_numbers_null_list)
            # Predict for the rows where column i is missing
            x_test = x.iloc[row_numbers_null_list]
            lr = LinearRegression()
            lr.fit(x_train, y_train)
            y_pr = lr.predict(x_test)
            train_data_x.loc[row_numbers_null_list, i] = y_pr
            print(f'Finished cleaning column {i} in the training set')
    return train_data_x, train_data_all.矿物类型

def lr_test_fill(train_data, train_label, test_data, test_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    test_data_all = pd.concat([test_data, test_label], axis=1)
    test_data_all = test_data_all.reset_index(drop=True)
    train_data_x = train_data_all.drop('矿物类型', axis=1)
    test_data_x = test_data_all.drop('矿物类型', axis=1)
    null_num = test_data_x.isnull().sum()
    null_num_sorted = null_num.sort_values(ascending=True)
    filling_feature = []
    for i in null_num_sorted.index:
        filling_feature.append(i)
        if null_num_sorted[i] != 0:
            # The model is fitted on the (already filled) training set
            x_train = train_data_x[filling_feature].drop(i, axis=1)
            y_train = train_data_x[i]
            x_test = test_data_x[filling_feature].drop(i, axis=1)
            row_numbers_null_list = test_data_x[test_data_x[i].isnull()].index.tolist()
            x_test = x_test.iloc[row_numbers_null_list]
            lr = LinearRegression()
            # Fill the test-set gaps with a model trained on the training set
            lr.fit(x_train, y_train)
            y_pr = lr.predict(x_test)
            test_data_x.loc[row_numbers_null_list, i] = y_pr
            print(f'Finished cleaning column {i} in the test set')
    return test_data_x, test_data_all.矿物类型
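The core step of the loop above (fit on rows where the column is present, predict the rows where it is missing) can be seen in isolation on a two-column toy table. The data is made up so that `f2 = 2 * f1 + 1` holds exactly on the known rows, which makes the filled values easy to check; this is an illustration, not the article's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data: f1 is complete, f2 has two gaps. Known rows follow f2 = 2 * f1 + 1.
df = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [3.0, 5.0, np.nan, 9.0, np.nan],
})

known_rows = df.index[df["f2"].notnull()]
missing_rows = df.index[df["f2"].isnull()]

# Fit on the rows where f2 is present, using the complete column f1 as input.
model = LinearRegression()
model.fit(df.loc[known_rows, ["f1"]], df.loc[known_rows, "f2"])

# Predict f2 for the rows where it is missing and write the values back.
df.loc[missing_rows, "f2"] = model.predict(df.loc[missing_rows, ["f1"]])

print(df)
```

In the full functions the same step is repeated column by column, and each newly filled column becomes an input for the next one.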
3.8 Random forest fill
# Random forest fill: same procedure as the linear regression fill,
# with RandomForestRegressor as the model
def Random_train_fill(train_data, train_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    train_data_x = train_data_all.drop('矿物类型', axis=1)
    null_num = train_data_x.isnull().sum()
    null_num_sorted = null_num.sort_values(ascending=True)
    filling_feature = []
    for i in null_num_sorted.index:
        filling_feature.append(i)
        if null_num_sorted[i] != 0:
            x = train_data_x[filling_feature].drop(i, axis=1)
            y = train_data_x[i]
            row_numbers_null_list = train_data_x[train_data_x[i].isnull()].index.tolist()
            x_train = x.drop(row_numbers_null_list)
            y_train = y.drop(row_numbers_null_list)
            x_test = x.iloc[row_numbers_null_list]
            rf = RandomForestRegressor(n_estimators=100, max_features=0.8, random_state=314, n_jobs=-1)
            rf.fit(x_train, y_train)
            y_pr = rf.predict(x_test)
            train_data_x.loc[row_numbers_null_list, i] = y_pr
            print(f'Finished cleaning column {i} in the training set')
    return train_data_x, train_data_all.矿物类型

def Random_test_fill(train_data, train_label, test_data, test_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    test_data_all = pd.concat([test_data, test_label], axis=1)
    test_data_all = test_data_all.reset_index(drop=True)
    train_data_x = train_data_all.drop('矿物类型', axis=1)
    test_data_x = test_data_all.drop('矿物类型', axis=1)
    null_num = test_data_x.isnull().sum()
    null_num_sorted = null_num.sort_values(ascending=True)
    filling_feature = []
    for i in null_num_sorted.index:
        filling_feature.append(i)
        if null_num_sorted[i] != 0:
            # The model is fitted on the (already filled) training set
            x_train = train_data_x[filling_feature].drop(i, axis=1)
            y_train = train_data_x[i]
            x_test = test_data_x[filling_feature].drop(i, axis=1)
            row_numbers_null_list = test_data_x[test_data_x[i].isnull()].index.tolist()
            x_test = x_test.iloc[row_numbers_null_list]
            rf = RandomForestRegressor(n_estimators=100, max_features=0.8, random_state=314, n_jobs=-1)
            rf.fit(x_train, y_train)
            y_pr = rf.predict(x_test)
            test_data_x.loc[row_numbers_null_list, i] = y_pr
            print(f'Finished cleaning column {i} in the test set')
    return test_data_x, test_data_all.矿物类型
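The same single-column step with a random forest, on made-up non-linear data. This is a sketch: the relationship `f2 = f1 ** 2`, the row positions of the gaps, and the forest settings (defaults apart from `n_estimators` and `random_state`) are assumptions chosen for the example. One property worth noting is that a random forest averages training targets, so its fills always lie within the range of the observed values.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy data with a non-linear relationship between the two columns.
rng = np.random.default_rng(0)
f1 = rng.uniform(0, 10, size=50)
df = pd.DataFrame({"f1": f1, "f2": f1 ** 2})

# Knock out a few entries of f2 to simulate missing values.
df.loc[[3, 7, 11], "f2"] = np.nan

known = df["f2"].notnull()
y_min = df.loc[known, "f2"].min()
y_max = df.loc[known, "f2"].max()

# Fit on the rows where f2 is present, then predict the missing rows.
model = RandomForestRegressor(n_estimators=100, random_state=314)
model.fit(df.loc[known, ["f1"]], df.loc[known, "f2"])
df.loc[~known, "f2"] = model.predict(df.loc[~known, ["f1"]])

print(df.loc[~known])
```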
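4 Saving the data
Save the output of each fill method separately; only the text inside [ ] in the file name needs to change.
Code (ov_x_train and ov_y_train are the oversampled training data produced in section 5):
# sample(frac=1) shuffles the rows before saving
data_train = pd.concat([ov_x_train, ov_y_train], axis=1).sample(frac=1, random_state=4)
data_test = pd.concat([x_test_fill, y_test_fill], axis=1).sample(frac=1, random_state=4)
data_train.to_excel(r'./data_train_test//训练数据集[随机森林回归].xlsx', index=False)
data_test.to_excel(r'./data_train_test//测试数据集[随机森林回归].xlsx', index=False)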
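To avoid editing the file name by hand for every method, the saving step could be wrapped in a small helper. The sketch below is hypothetical: the function `save_split`, its parameters, and the use of CSV in a temporary directory (rather than the article's Excel files) are assumptions made so the example runs anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_split(df, out_dir, split_name, method_name):
    """Shuffle df and save it as '<split_name>[<method_name>].csv' in out_dir.

    Hypothetical helper; the article itself writes .xlsx files with to_excel().
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{split_name}[{method_name}].csv"
    # sample(frac=1) shuffles the rows, as in the original saving code.
    df.sample(frac=1, random_state=4).to_csv(path, index=False)
    return path


# Toy data standing in for a filled training set.
demo = pd.DataFrame({"f1": [1.0, 2.0, 3.0], "矿物类型": [0, 1, 2]})

with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_split(demo, tmp, "train", "mean")
    reloaded = pd.read_csv(saved_path)

print(saved_path.name)
print(reloaded)
```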
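5 Putting the code together and testing
For convenience, the fill functions above are placed in a separate module, file_data, and imported from there.
Full code: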
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import file_data
from imblearn.over_sampling import SMOTE

data = pd.read_excel('矿物数据.xls')
# Drop the rows of class 'E'
data = data[data['矿物类型'] != 'E']
# Missing values are NaN
null_num = data.isnull()
# Count the missing values in each column
null_all = null_num.sum()

x_all = data.drop('矿物类型', axis=1).drop('序号', axis=1)
y_all = data.矿物类型

# Encode the class labels as integers
label_dict = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
encod_labels = [label_dict[label] for label in y_all]
# Convert the labels to a Series
y_all = pd.Series(encod_labels, name='矿物类型')

# Convert the feature columns to numeric
for column_name in x_all.columns:
    x_all[column_name] = pd.to_numeric(x_all[column_name], errors='coerce')

# Standardize
scaler = StandardScaler()
x_all_z = scaler.fit_transform(x_all)
x_all = pd.DataFrame(x_all_z, columns=x_all.columns)

x_train, x_test, y_train, y_test = \
    train_test_split(x_all, y_all, test_size=0.3, random_state=50000)

### Uncomment one pair at a time to try each fill method
# Delete rows with missing values (cca)
# x_train_fill, y_train_fill = file_data.cca_train_fill(x_train, y_train)
# x_test_fill, y_test_fill = file_data.cca_test_fill(x_train_fill, y_train_fill, x_test, y_test)
# Mean
# x_train_fill, y_train_fill = file_data.mean_train_fill(x_train, y_train)
# x_test_fill, y_test_fill = file_data.mean_test_fill(x_train_fill, y_train_fill, x_test, y_test)
# Median
# x_train_fill, y_train_fill = file_data.median_train_fill(x_train, y_train)
# x_test_fill, y_test_fill = file_data.median_test_fill(x_train_fill, y_train_fill, x_test, y_test)
# Mode
# x_train_fill, y_train_fill = file_data.mode_train_fill(x_train, y_train)
# x_test_fill, y_test_fill = file_data.mode_test_fill(x_train_fill, y_train_fill, x_test, y_test)
# Linear regression
# x_train_fill, y_train_fill = file_data.lr_train_fill(x_train, y_train)
# x_test_fill, y_test_fill = file_data.lr_test_fill(x_train_fill, y_train_fill, x_test, y_test)
# Random forest regression
x_train_fill, y_train_fill = file_data.Random_train_fill(x_train, y_train)
x_test_fill, y_test_fill = file_data.Random_test_fill(x_train_fill, y_train_fill, x_test, y_test)

# Oversample the minority classes of the training set with SMOTE
oversampler = SMOTE(k_neighbors=1, random_state=42)
ov_x_train, ov_y_train = oversampler.fit_resample(x_train_fill, y_train_fill)

# Shuffle the rows with sample(frac=1) and save
data_train = pd.concat([ov_x_train, ov_y_train], axis=1).sample(frac=1, random_state=4)
data_test = pd.concat([x_test_fill, y_test_fill], axis=1).sample(frac=1, random_state=4)
data_train.to_excel(r'./data_train_test//训练数据集[随机森林回归].xlsx', index=False)
data_test.to_excel(r'./data_train_test//测试数据集[随机森林回归].xlsx', index=False)
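The script above fits StandardScaler on all rows before splitting, so the test rows contribute to the scaling statistics. The fill functions already follow the opposite rule (test data is filled from training statistics), and the same rule can be applied to standardization. The following is a sketch of that variant on made-up data, not the article's pipeline; the feature names and random values are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy features and labels standing in for the mineral data.
rng = np.random.default_rng(0)
x_all = pd.DataFrame({
    "f1": rng.normal(5.0, 2.0, 100),
    "f2": rng.normal(-1.0, 3.0, 100),
})
y_all = pd.Series(rng.integers(0, 4, 100), name="矿物类型")

# Split first ...
x_train, x_test, y_train, y_test = train_test_split(
    x_all, y_all, test_size=0.3, random_state=0
)

# ... then fit the scaler on the training rows only,
scaler = StandardScaler()
x_train_z = pd.DataFrame(
    scaler.fit_transform(x_train), columns=x_train.columns, index=x_train.index
)
# and reuse the training mean and standard deviation for the test rows.
x_test_z = pd.DataFrame(
    scaler.transform(x_test), columns=x_test.columns, index=x_test.index
)

print(x_train_z.mean())
print(x_test_z.mean())
```

The training columns come out with mean 0 exactly; the test columns are close to 0 but not exactly 0, because they are scaled with statistics they did not contribute to.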
Output of running each method in turn (figures).