A Summary of Ways to Handle Missing Values in Python Data Preprocessing
Contents
- 1. Fixed-value fill
- 2. Mean fill
- 3. Median fill
- 4. Interpolation fill
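When doing data analysis in Python, the data usually has to be cleaned into a consistent form first, and handling missing values is one of the most common steps.
In general, missing values are handled in one of two ways: drop the rows that contain them, or fill the empty cells.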
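As a rough illustration of those two options, the sketch below applies both to a tiny made-up frame. The frame `toy` and its columns `score` and `grade` are hypothetical and are not part of the article's data set.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with one missing cell in each column.
toy = pd.DataFrame({
    'score': [90.0, np.nan, 85.0],
    'grade': ['A', 'B', np.nan],
})

# Option 1: drop every row that contains at least one missing value.
dropped = toy.dropna()

# Option 2: keep every row and fill the missing cells of one column instead.
filled_score = toy['score'].fillna(0.0)

print(len(toy), len(dropped))
print(filled_score.tolist())
```

Dropping is simple but discards the whole row, including its valid cells, which is why filling is often preferred on small data sets.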
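This article covers the second option: filling missing cells with a fixed value, the mean, the median, or an interpolated value so that the data set is complete.
The examples use the DataFrame object from the pandas module. If pandas is not installed, install it with pip.
    pip install pandas
Below is the source data to be processed. It is local test data, so the data set is small.
The read_excel function in pandas reads the whole source file and returns a DataFrame. (Reading .xlsx files also requires an Excel engine such as openpyxl to be installed.)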
    # Import the pandas module under the alias pd.
    import pandas as pd

    # Read the Excel file into a DataFrame.
    data_frame = pd.read_excel('D:/test-data-work/data.xlsx')

    # Print the DataFrame.
    # Columns: 姓名 (name), 年龄 (age), 班级 (class), 成绩 (score), 表现 (performance)
    print(data_frame)

    #     姓名          年龄    班级      成绩     表现
    # 0   Python集中营  10.0   1210.0    99.0   A
    # 1   Python集中营  11.0   1211.0   100.0   A
    # 2   Python集中营  12.0   1212.0   101.0   A
    # 3   Python集中营  13.0   1213.0   102.0   A
    # 4   Python集中营  14.0   1214.0   103.0   NaN
    # 5   Python集中营  15.0   1215.0   104.0   A
    # 6   Python集中营  16.0   1216.0   105.0   A
    # 7   Python集中营  17.0      NaN   106.0   A
    # 8   Python集中营  18.0   1218.0     NaN   A
    # 9   Python集中营  19.0   1219.0   108.0   A
    # 10  Python集中营   NaN   1220.0   109.0   NaN
    # 11  Python集中营   NaN      NaN   110.0   A
    # 12  Python集中营   NaN   1222.0     NaN   A
    # 13  Python集中营  23.0   1223.0   112.0   A
    # 14  Python集中营  24.0   1224.0   113.0   A
    # 15  Python集中营  25.0      NaN     NaN   NaN
    # 16  Python集中营   NaN   1226.0   115.0   A
    # 17  Python集中营  27.0   1227.0     NaN   A
    # 18  Python集中营  10.0   1210.0    99.0   NaN
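The Excel file lives on the author's machine, so the sketch below rebuilds the same 19 rows inline as a stand-in for read_excel. It then counts the missing cells per column with isna().sum(). The inline construction and the name `missing_per_column` are assumptions made for illustration; the values are copied from the printed output above.

```python
import numpy as np
import pandas as pd

nan = np.nan

# Stand-in for pd.read_excel(...): the same rows as the printed table.
data_frame = pd.DataFrame({
    '姓名': ['Python集中营'] * 19,
    '年龄': [10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
           nan, nan, nan, 23, 24, 25, nan, 27, 10],
    '班级': [1210, 1211, 1212, 1213, 1214, 1215, 1216, nan, 1218, 1219,
           1220, nan, 1222, 1223, 1224, nan, 1226, 1227, 1210],
    '成绩': [99, 100, 101, 102, 103, 104, 105, 106, nan, 108,
           109, 110, nan, 112, 113, nan, 115, nan, 99],
    '表现': ['A', 'A', 'A', 'A', nan, 'A', 'A', 'A', 'A', 'A',
           nan, 'A', 'A', 'A', 'A', nan, 'A', 'A', nan],
})

# Count the missing cells in each column before deciding how to fill them.
missing_per_column = data_frame.isna().sum()
print(missing_per_column)
```

Checking the counts first shows which columns need attention at all: here '姓名' has no gaps, while the other four columns each have three or four.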
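With the source data loaded, the next sections fill the missing values column by column using four common methods.
1. Fixed-value fill
Fixed-value fill is the simplest and one of the most commonly used methods: every missing cell in a column is replaced with a value of your choice.
Here all missing cells in the '表现' (performance) column are filled with 'B'. The fillna function replaces missing values with the value passed to it.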
    # Replace all NaN values in the column '表现' with the value 'B'.
    data_frame['表现'] = data_frame['表现'].fillna('B')

    # Print the DataFrame.
    print(data_frame)

    #     姓名          年龄    班级      成绩     表现
    # 0   Python集中营  10.0   1210.0    99.0   A
    # 1   Python集中营  11.0   1211.0   100.0   A
    # 2   Python集中营  12.0   1212.0   101.0   A
    # 3   Python集中营  13.0   1213.0   102.0   A
    # 4   Python集中营  14.0   1214.0   103.0   B
    # 5   Python集中营  15.0   1215.0   104.0   A
    # 6   Python集中营  16.0   1216.0   105.0   A
    # 7   Python集中营  17.0      NaN   106.0   A
    # 8   Python集中营  18.0   1218.0     NaN   A
    # 9   Python集中营  19.0   1219.0   108.0   A
    # 10  Python集中营   NaN   1220.0   109.0   B
    # 11  Python集中营   NaN      NaN   110.0   A
    # 12  Python集中营   NaN   1222.0     NaN   A
    # 13  Python集中营  23.0   1223.0   112.0   A
    # 14  Python集中营  24.0   1224.0   113.0   A
    # 15  Python集中营  25.0      NaN     NaN   B
    # 16  Python集中营   NaN   1226.0   115.0   A
    # 17  Python집중营  27.0   1227.0     NaN   A
    # 18  Python集中营  10.0   1210.0    99.0   B
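When several columns each need their own constant, fillna also accepts a dict that maps column names to fill values. The sketch below is a minimal illustration on a made-up frame; the `remark` column and the filler 'none' are hypothetical and do not appear in the article's data.

```python
import numpy as np
import pandas as pd

# Small made-up frame: '表现' mirrors the article's column,
# 'remark' is a hypothetical second column added for illustration.
sample = pd.DataFrame({
    '表现': ['A', np.nan, 'A'],
    'remark': [np.nan, 'ok', np.nan],
})

# One call, one constant per column. fillna returns a new frame,
# so `sample` itself is left unchanged.
filled = sample.fillna({'表现': 'B', 'remark': 'none'})

print(filled)
```

Because the result is a new object, the original frame keeps its gaps until the result is assigned back, which is what the article does with `data_frame['表现'] = ...`.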
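2. Mean fill
Mean fill computes the mean of the column that contains the missing values and writes that result into each missing cell.
This only works when a mean can be computed for the column. The '成绩' (score) column is entirely numeric, so its mean can be computed with the mean function, which skips NaN cells by default.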
    # Replace all NaN values in the column '成绩' with the mean of that column.
    data_frame['成绩'] = data_frame['成绩'].fillna(data_frame['成绩'].mean())

    # Print the DataFrame.
    print(data_frame)

    #     姓名          年龄    班级      成绩          表现
    # 0   Python集中营  10.0   1210.0    99.000000   A
    # 1   Python集中营  11.0   1211.0   100.000000   A
    # 2   Python集中营  12.0   1212.0   101.000000   A
    # 3   Python集中营  13.0   1213.0   102.000000   A
    # 4   Python集中营  14.0   1214.0   103.000000   B
    # 5   Python集中营  15.0   1215.0   104.000000   A
    # 6   Python集中营  16.0   1216.0   105.000000   A
    # 7   Python集中营  17.0      NaN   106.000000   A
    # 8   Python集中营  18.0   1218.0   105.733333   A
    # 9   Python集中营  19.0   1219.0   108.000000   A
    # 10  Python集中营   NaN   1220.0   109.000000   B
    # 11  Python集中营   NaN      NaN   110.000000   A
    # 12  Python集中营   NaN   1222.0   105.733333   A
    # 13  Python集中营  23.0   1223.0   112.000000   A
    # 14  Python集中营  24.0   1224.0   113.000000   A
    # 15  Python集中营  25.0      NaN   105.733333   B
    # 16  Python集中营   NaN   1226.0   115.000000   A
    # 17  Python集中营  27.0   1227.0   105.733333   A
    # 18  Python集中营  10.0   1210.0    99.000000   B
The computed mean is 105.733333 (the 15 known scores sum to 1586), and it has been written into every missing cell of the '成绩' column.
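The value 105.733333 can be checked by hand. The sketch below copies the '成绩' values from the source table into a standalone Series and compares mean() with a manual sum divided by the count of non-missing cells. The names `scores`, `mean_score`, and `manual_mean` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

nan = np.nan

# The '成绩' column as it appears in the source data (4 missing cells).
scores = pd.Series([99, 100, 101, 102, 103, 104, 105, 106, nan, 108,
                    109, 110, nan, 112, 113, nan, 115, nan, 99], name='成绩')

# mean() skips NaN by default, so only the 15 known scores are averaged.
mean_score = scores.mean()

# The same computation written out by hand: 1586 / 15.
manual_mean = scores.dropna().sum() / scores.count()

filled = scores.fillna(mean_score)

print(round(mean_score, 6))  # 105.733333
```

Filling with the mean leaves the column mean unchanged, but it pulls the spread of the column toward the center, so it is best used when only a small share of the cells are missing.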
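3. Median fill
Median fill works almost the same way as mean fill. The difference is that the median function is used to compute the median of the column that contains the missing values, here the '年龄' (age) column.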
    # Replace all NaN values in the column '年龄' with the median of that column.
    data_frame['年龄'] = data_frame['年龄'].fillna(data_frame['年龄'].median())

    # Print the DataFrame.
    print(data_frame)

    #     姓名          年龄    班级      成绩          表现
    # 0   Python集中营  10.0   1210.0    99.000000   A
    # 1   Python集中营  11.0   1211.0   100.000000   A
    # 2   Python集中营  12.0   1212.0   101.000000   A
    # 3   Python集中营  13.0   1213.0   102.000000   A
    # 4   Python集中营  14.0   1214.0   103.000000   B
    # 5   Python集中营  15.0   1215.0   104.000000   A
    # 6   Python集中营  16.0   1216.0   105.000000   A
    # 7   Python集中营  17.0      NaN   106.000000   A
    # 8   Python集中营  18.0   1218.0   105.733333   A
    # 9   Python集中营  19.0   1219.0   108.000000   A
    # 10  Python集中营  16.0   1220.0   109.000000   B
    # 11  Python集中营  16.0      NaN   110.000000   A
    # 12  Python集中营  16.0   1222.0   105.733333   A
    # 13  Python集中营  23.0   1223.0   112.000000   A
    # 14  Python集中营  24.0   1224.0   113.000000   A
    # 15  Python集中营  25.0      NaN   105.733333   B
    # 16  Python集中营  16.0   1226.0   115.000000   A
    # 17  Python集中营  27.0   1227.0   105.733333   A
    # 18  Python集中营  10.0   1210.0    99.000000   B
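One common reason to pick the median over the mean is that it is less sensitive to outliers. The sketch below shows this on a made-up age column; the Series `ages` and the outlier value 95 are hypothetical and are not taken from the article's data.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing cell and one obvious outlier (95).
ages = pd.Series([10, 11, 12, np.nan, 95])

mean_age = ages.mean()      # pulled upward by the outlier
median_age = ages.median()  # middle of the sorted known values

filled = ages.fillna(median_age)

print(mean_age, median_age)  # 32.0 11.5
print(filled.tolist())
```

Here the mean (32.0) would be an implausible age for this group, while the median (11.5) stays close to the typical value.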
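4. Interpolation fill
Interpolation fill derives the value for a missing cell from the previous and next known values in the same column. Called with no arguments, the interpolate function uses linear interpolation.
For example, if the previous '班级' (class) value is 1220 and the next one is 1222, the missing cell between them is filled with 1221.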
    # Replace all NaN values in the column '班级' with interpolated values.
    data_frame['班级'] = data_frame['班级'].interpolate()

    # Print the DataFrame.
    print(data_frame)

    #     姓名          年龄    班级      成绩          表现
    # 0   Python集中营  10.0   1210.0    99.000000   A
    # 1   Python集中营  11.0   1211.0   100.000000   A
    # 2   Python集中营  12.0   1212.0   101.000000   A
    # 3   Python集中营  13.0   1213.0   102.000000   A
    # 4   Python集中营  14.0   1214.0   103.000000   B
    # 5   Python集中营  15.0   1215.0   104.000000   A
    # 6   Python集中营  16.0   1216.0   105.000000   A
    # 7   Python集中营  17.0   1217.0   106.000000   A
    # 8   Python集中营  18.0   1218.0   105.733333   A
    # 9   Python集中营  19.0   1219.0   108.000000   A
    # 10  Python集中营  16.0   1220.0   109.000000   B
    # 11  Python集中营  16.0   1221.0   110.000000   A
    # 12  Python集中营  16.0   1222.0   105.733333   A
    # 13  Python集中营  23.0   1223.0   112.000000   A
    # 14  Python集中营  24.0   1224.0   113.000000   A
    # 15  Python集中营  25.0   1225.0   105.733333   B
    # 16  Python集中营  16.0   1226.0   115.000000   A
    # 17  Python集中营  27.0   1227.0   105.733333   A
    # 18  Python集中营  10.0   1210.0    99.000000   B
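Two details of interpolate are easy to miss: a run of consecutive gaps is filled with evenly spaced values, and with the default settings a gap at the very start of the column is left as NaN because there is no earlier value to interpolate from. The sketch below shows both on a made-up Series; `classes` and its values are hypothetical, chosen to resemble the '班级' column.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical column: a leading gap, a single gap, and a two-cell gap.
classes = pd.Series([np.nan, 1220.0, np.nan, 1222.0, np.nan, np.nan, 1225.0])

# Default: linear interpolation, filling in the forward direction only.
linear = classes.interpolate()

# limit_direction='both' also fills the leading gap, using the first known value.
both = classes.interpolate(limit_direction='both')

print(linear.tolist())
print(both.tolist())
```

In the article's data the '班级' column starts with a known value, so the default call is enough there. For columns that begin with missing cells, limit_direction='both' or a separate fillna step is one way to close the remaining gap.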