A Summary of Different Ways to Handle Missing Values in Python Data Preprocessing

2022-12-22 09:01 Development Author: Python 集中营

# Import the pandas module and give it the alias pd.
import pandas as pd

# Read the Excel file and store it in a DataFrame.
data_frame = pd.read_excel('D:/test-data-work/data.xlsx')

# Print the DataFrame.
print(data_frame)

#           姓名    年龄      班级     成绩   表现
# 0   Python集中营  10.0  1210.0   99.0    A
# 1   Python集中营  11.0  1211.0  100.0    A
# 2   Python集中营  12.0  1212.0  101.0    A
# 3   Python集中营  13.0  1213.0  102.0    A
# 4   Python集中营  14.0  1214.0  103.0  NaN
# 5   Python集中营  15.0  1215.0  104.0    A
# 6   Python集中营  16.0  1216.0  105.0    A
# 7   Python集中营  17.0     NaN  106.0    A
# 8   Python集中营  18.0  1218.0    NaN    A
# 9   Python集中营  19.0  1219.0  108.0    A
# 10  Python集中营   NaN  1220.0  109.0  NaN
# 11  Python集中营   NaN     NaN  110.0    A
# 12  Python集中营   NaN  1222.0    NaN    A
# 13  Python集中营  23.0  1223.0  112.0    A
# 14  Python集中营  24.0  1224.0  113.0    A
# 15  Python集中营  25.0     NaN    NaN  NaN
# 16  Python集中营   NaN  1226.0  115.0    A
# 17  Python集中营  27.0  1227.0    NaN    A
# 18  Python集中营  10.0  1210.0   99.0  NaN
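
Readers who do not have the Excel file can rebuild the same table in memory. The following is a minimal sketch, assuming the printout above is the complete data set; the name `missing_per_column` is introduced here for illustration and is not part of the original article. It also counts the missing values in each column, which is a useful check before choosing a filling strategy.

```python
import numpy as np
import pandas as pd

# Rebuild the printed table in memory
# (assumption: the printout above is the full data set).
nan = np.nan
data_frame = pd.DataFrame({
    '姓名': ['Python集中营'] * 19,
    '年龄': [10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
             nan, nan, nan, 23, 24, 25, nan, 27, 10],
    '班级': [1210, 1211, 1212, 1213, 1214, 1215, 1216, nan, 1218, 1219,
             1220, nan, 1222, 1223, 1224, nan, 1226, 1227, 1210],
    '成绩': [99, 100, 101, 102, 103, 104, 105, 106, nan, 108,
             109, 110, nan, 112, 113, nan, 115, nan, 99],
    '表现': ['A', 'A', 'A', 'A', nan, 'A', 'A', 'A', 'A', 'A',
             nan, 'A', 'A', 'A', 'A', nan, 'A', 'A', nan],
})

# Count the missing values in each column (illustrative helper name).
missing_per_column = data_frame.isna().sum()
print(missing_per_column)
```

With this data, '年龄', '成绩' and '表现' each have 4 missing values, '班级' has 3, and '姓名' has none.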

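The source data has now been loaded. Next, we use four common approaches to handling missing values to fill the data in bulk.

1. Fixed-value filling

Fixed-value filling is a simple and commonly used approach: you fill the missing cells of a column with whatever value you choose.

Here we fill every missing value in the '表现' (performance) column with 'B'. The fillna function is what fills missing values.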

# Replace all the NaN values in the column '表现' with the value 'B'.
data_frame['表现'] = data_frame['表现'].fillna('B')

# Print the DataFrame.
print(data_frame)

#           姓名    年龄      班级     成绩  表现
# 0   Python集中营  10.0  1210.0   99.0   A
# 1   Python集中营  11.0  1211.0  100.0   A
# 2   Python集中营  12.0  1212.0  101.0   A
# 3   Python集中营  13.0  1213.0  102.0   A
# 4   Python集中营  14.0  1214.0  103.0   B
# 5   Python集中营  15.0  1215.0  104.0   A
# 6   Python集中营  16.0  1216.0  105.0   A
# 7   Python集中营  17.0     NaN  106.0   A
# 8   Python集中营  18.0  1218.0    NaN   A
# 9   Python集中营  19.0  1219.0  108.0   A
# 10  Python集中营   NaN  1220.0  109.0   B
# 11  Python集中营   NaN     NaN  110.0   A
# 12  Python集中营   NaN  1222.0    NaN   A
# 13  Python集中营  23.0  1223.0  112.0   A
# 14  Python集中营  24.0  1224.0  113.0   A
# 15  Python集中营  25.0     NaN    NaN   B
# 16  Python集中营   NaN  1226.0  115.0   A
# 17  Python集中营  27.0  1227.0    NaN   A
# 18  Python集中营  10.0  1210.0   99.0   B
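
When several columns each need their own fixed value, fillna also accepts a dict that maps column names to fill values. The following is a minimal sketch on a small made-up frame; the values, the fill value 0 for '班级', and the names `df` and `filled` are assumptions for illustration, not the article's data.

```python
import numpy as np
import pandas as pd

# A small made-up frame (hypothetical values, not the article's data).
df = pd.DataFrame({
    '班级': [1210.0, np.nan, 1212.0],
    '表现': ['A', np.nan, 'A'],
})

# Fill each column with its own fixed value.
# Columns that are not listed in the dict are left untouched.
filled = df.fillna({'班级': 0, '表现': 'B'})
print(filled)
```

fillna returns a new object by default, so `df` still contains its missing values after this call; assign the result back if you want to keep it.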

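2. Mean filling

Mean filling computes the mean of the column that contains the missing values, then writes that result into the missing cells.

It requires that the column's data support a mean calculation. For example, the '成绩' (score) column is entirely numeric, so its mean can be computed with the mean function.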

# Replace all the NaN values in the column '成绩' with the mean of the column '成绩'.
data_frame['成绩'] = data_frame['成绩'].fillna(data_frame['成绩'].mean())

# Print the DataFrame.
print(data_frame)

#           姓名    年龄      班级          成绩  表现
# 0   Python集中营  10.0  1210.0   99.000000   A
# 1   Python集中营  11.0  1211.0  100.000000   A
# 2   Python集中营  12.0  1212.0  101.000000   A
# 3   Python集中营  13.0  1213.0  102.000000   A
# 4   Python集中营  14.0  1214.0  103.000000   B
# 5   Python集中营  15.0  1215.0  104.000000   A
# 6   Python集中营  16.0  1216.0  105.000000   A
# 7   Python集中营  17.0     NaN  106.000000   A
# 8   Python集中营  18.0  1218.0  105.733333   A
# 9   Python集中营  19.0  1219.0  108.000000   A
# 10  Python集中营   NaN  1220.0  109.000000   B
# 11  Python集中营   NaN     NaN  110.000000   A
# 12  Python集中营   NaN  1222.0  105.733333   A
# 13  Python集中营  23.0  1223.0  112.000000   A
# 14  Python集中营  24.0  1224.0  113.000000   A
# 15  Python集中营  25.0     NaN  105.733333   B
# 16  Python集中营   NaN  1226.0  115.000000   A
# 17  Python集中营  27.0  1227.0  105.733333   A
# 18  Python集中营  10.0  1210.0   99.000000   B

The computed mean is 105.733333, and it has been written into every missing cell of the '成绩' column.
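
The figure 105.733333 can be checked independently. The following is a minimal sketch that copies the '成绩' values from the source table into a Series; the names `scores`, `mean_score` and `filled_scores` are introduced here for illustration. It relies on the fact that mean() skips NaN by default, so the average is taken over the 15 known values only.

```python
import numpy as np
import pandas as pd

# The '成绩' column as printed in the source table.
scores = pd.Series(
    [99, 100, 101, 102, 103, 104, 105, 106, np.nan, 108,
     109, 110, np.nan, 112, 113, np.nan, 115, np.nan, 99],
    name='成绩',
)

# mean() ignores NaN by default, so this is 1586 / 15.
mean_score = scores.mean()

# Write the mean into the four missing cells.
filled_scores = scores.fillna(mean_score)

print(round(mean_score, 6))        # 105.733333
print(filled_scores.isna().sum())  # 0
```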

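3. Median filling

Median filling works almost the same way as mean filling. The difference is that the median function is used to compute the median of the column that contains the missing values.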

# Replace all the NaN values in the column '年龄' with the median of the column '年龄'.
data_frame['年龄'] = data_frame['年龄'].fillna(data_frame['年龄'].median())

# Print the DataFrame.
print(data_frame)

#           姓名    年龄      班级          成绩  表现
# 0   Python集中营  10.0  1210.0   99.000000   A
# 1   Python集中营  11.0  1211.0  100.000000   A
# 2   Python集中营  12.0  1212.0  101.000000   A
# 3   Python集中营  13.0  1213.0  102.000000   A
# 4   Python集中营  14.0  1214.0  103.000000   B
# 5   Python集中营  15.0  1215.0  104.000000   A
# 6   Python集中营  16.0  1216.0  105.000000   A
# 7   Python集中营  17.0     NaN  106.000000   A
# 8   Python集中营  18.0  1218.0  105.733333   A
# 9   Python集中营  19.0  1219.0  108.000000   A
# 10  Python集中营  16.0  1220.0  109.000000   B
# 11  Python集中营  16.0     NaN  110.000000   A
# 12  Python集中营  16.0  1222.0  105.733333   A
# 13  Python集中营  23.0  1223.0  112.000000   A
# 14  Python集中营  24.0  1224.0  113.000000   A
# 15  Python集中营  25.0     NaN  105.733333   B
# 16  Python集中营  16.0  1226.0  115.000000   A
# 17  Python集中营  27.0  1227.0  105.733333   A
# 18  Python集中营  10.0  1210.0   99.000000   B
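
The fill value 16.0 in the '年龄' (age) column can be reproduced in isolation. The following is a minimal sketch that copies the '年龄' values from the source table; the names `ages`, `median_age` and `filled_ages` are introduced here for illustration. Like mean(), median() skips NaN by default, so it is computed over the 15 known ages.

```python
import numpy as np
import pandas as pd

# The '年龄' column as printed in the source table.
ages = pd.Series(
    [10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
     np.nan, np.nan, np.nan, 23, 24, 25, np.nan, 27, 10],
    name='年龄',
)

# median() ignores NaN by default: the middle of the 15 sorted known values.
median_age = ages.median()

# Write the median into the four missing cells.
filled_ages = ages.fillna(median_age)

print(median_age)  # 16.0
```

For this column the mean of the known values is 254 / 15, roughly 16.93, so the two strategies would fill slightly different numbers.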

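4. Interpolation filling

Interpolation filling derives the value for a missing cell from the previous and next values in the same column. The interpolate function uses linear interpolation by default.

For example, if the previous '班级' (class) value is 1220 and the next one is 1222, the value inserted into the cell is 1221.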

# Replace all the NaN values in the column '班级' with the interpolated values of the column '班级'.
data_frame['班级'] = data_frame['班级'].interpolate()

# Print the DataFrame.
print(data_frame)

#           姓名    年龄      班级          成绩  表现
# 0   Python集中营  10.0  1210.0   99.000000   A
# 1   Python集中营  11.0  1211.0  100.000000   A
# 2   Python集中营  12.0  1212.0  101.000000   A
# 3   Python集中营  13.0  1213.0  102.000000   A
# 4   Python集中营  14.0  1214.0  103.000000   B
# 5   Python集中营  15.0  1215.0  104.000000   A
# 6   Python集中营  16.0  1216.0  105.000000   A
# 7   Python集中营  17.0  1217.0  106.000000   A
# 8   Python集中营  18.0  1218.0  105.733333   A
# 9   Python集中营  19.0  1219.0  108.000000   A
# 10  Python集中营  16.0  1220.0  109.000000   B
# 11  Python集中营  16.0  1221.0  110.000000   A
# 12  Python集中营  16.0  1222.0  105.733333   A
# 13  Python集中营  23.0  1223.0  112.000000   A
# 14  Python集中营  24.0  1224.0  113.000000   A
# 15  Python集中营  25.0  1225.0  105.733333   B
# 16  Python集中营  16.0  1226.0  115.000000   A
# 17  Python集中营  27.0  1227.0  105.733333   A
# 18  Python集中营  10.0  1210.0   99.000000   B
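
Linear interpolation also handles runs of consecutive missing cells by spacing the values evenly between the two known neighbors. The following is a minimal sketch on made-up '班级' values; the data and the names `classes` and `interpolated` are assumptions for illustration, not taken from the article.

```python
import numpy as np
import pandas as pd

# Made-up values with one single gap and one two-cell gap (hypothetical data).
classes = pd.Series([1220.0, np.nan, 1222.0, np.nan, np.nan, 1228.0])

# interpolate() defaults to method='linear', which treats rows as equally spaced.
interpolated = classes.interpolate()

print(interpolated.tolist())
# [1220.0, 1221.0, 1222.0, 1224.0, 1226.0, 1228.0]
```

Because the result depends on row order, this approach only makes sense when neighboring rows are related, as with the sequential class numbers in the table above.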
