dropna()
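The heading above names pandas' dropna() without showing it in use. Below is a minimal sketch of the usual ways to call it; the DataFrame and its column names (`id`, `age`, `city`) are made up for illustration and do not come from the original post.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "age" and "city" each contain one missing entry.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "age": [25.0, np.nan, 40.0],
    "city": ["Beijing", "Shanghai", None],
})

rows_kept = df.dropna()                 # drop every row that has any missing value
cols_kept = df.dropna(axis=1)           # drop every column that has any missing value
age_known = df.dropna(subset=["age"])   # only require "age" to be present

print(rows_kept)
print(cols_kept)
print(age_known)
```

Dropping is the simplest strategy, but it discards data; the `subset` argument limits the check to the columns that actually matter for the task.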
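scikit-learn's Imputer class provides basic strategies for imputing missing values: a missing entry can be replaced with the mean, the median, or the most frequent value of the row or column it sits in. The class also supports different encodings of missing values. (Imputer was deprecated in scikit-learn 0.20 and removed in 0.22, so the example below only runs on older releases.)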
>>> import numpy as np
>>> from sklearn.preprocessing import Imputer
>>> imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
>>> imp.fit([[1, 2], [np.nan, 3], [7, 6]])
Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)
>>> X = [[np.nan, 2], [6, np.nan], [7, 6]]
>>> print(imp.transform(X))
[[ 4.          2.        ]
 [ 6.          3.666...]
 [ 7.          6.        ]]
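On current scikit-learn releases the same column-mean imputation can be sketched with sklearn.impute.SimpleImputer. This is a substitute for the Imputer call above, not the original author's code: SimpleImputer has no `axis` parameter (it always works column by column) and expects `np.nan` rather than the string `'NaN'`. The variable name `X_filled` is introduced here for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Learn the per-column means from the training data, ignoring NaN entries.
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
imp.fit([[1, 2], [np.nan, 3], [7, 6]])

# Replace each NaN in new data with the mean learned for its column.
X = [[np.nan, 2], [6, np.nan], [7, 6]]
X_filled = imp.transform(X)
print(X_filled)
```

The first column's mean is (1 + 7) / 2 = 4 and the second column's is (2 + 3 + 6) / 3 ≈ 3.667, which matches the output shown in the older example.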
fillna()
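pandas' fillna() replaces missing entries instead of dropping them. The sketch below shows a constant fill and a per-column mean fill; the DataFrame and its column names (`age`, `score`) are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing entry per column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0],
    "score": [np.nan, 3.0, 5.0],
})

zero_filled = df.fillna(0)           # replace every NaN with the constant 0
mean_filled = df.fillna(df.mean())   # replace each NaN with its column's mean

print(zero_filled)
print(mean_filled)
```

A constant fill is cheap but can distort the distribution of a column; filling with a per-column statistic keeps the column mean unchanged.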
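import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# df_all, X_all, df_te and cfg are defined elsewhere in the original project.
# Missing labels are encoded as -1; a classifier trained on the known rows
# predicts the missing ones.

# Impute 'Education' (column 0).
lb = 'Education'
idx = 0
tr = np.where(df_all[lb] != -1)[0]  # rows with a known value
va = np.where(df_all[lb] == -1)[0]  # rows with a missing value
df_all.iloc[va, idx] = LogisticRegression(C=1).fit(X_all[tr], df_all.iloc[tr, idx]).predict(X_all[va])

# Impute 'age' (column 2).
lb = 'age'
idx = 2
tr = np.where(df_all[lb] != -1)[0]
va = np.where(df_all[lb] == -1)[0]
df_all.iloc[va, idx] = LogisticRegression(C=2).fit(X_all[tr], df_all.iloc[tr, idx]).predict(X_all[va])

# Impute 'gender' (column 3).
lb = 'gender'
idx = 3
tr = np.where(df_all[lb] != -1)[0]
va = np.where(df_all[lb] == -1)[0]
df_all.iloc[va, idx] = LogisticRegression(C=2).fit(X_all[tr], df_all.iloc[tr, idx]).predict(X_all[va])

# Append the test set, fill whatever is still missing with 0, and save.
df_all = pd.concat([df_all, df_te]).fillna(0)
df_all.to_csv(cfg.data_path + 'all_v2.csv', index=None)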
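The three imputation steps above repeat the same pattern, so they could be folded into one helper. The following is only a sketch under stated assumptions: the function name `impute_with_classifier`, its parameters, and the toy data (`X_demo`, `df_demo`) are hypothetical and not part of the original code, and the column is looked up by name rather than by a hard-coded position.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression


def impute_with_classifier(df, X, column, C=1.0, missing=-1):
    """Return a copy of df where entries of `column` equal to `missing`
    are replaced by predictions of a LogisticRegression fit on the known rows."""
    known = np.where(df[column] != missing)[0]
    unknown = np.where(df[column] == missing)[0]
    if len(unknown) == 0:
        return df.copy()
    model = LogisticRegression(C=C).fit(X[known], df[column].iloc[known])
    out = df.copy()
    out.iloc[unknown, out.columns.get_loc(column)] = model.predict(X[unknown])
    return out


# Hypothetical toy data: one feature that cleanly separates the two classes;
# the last two labels are missing (encoded as -1).
X_demo = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2], [0.05], [5.05]])
df_demo = pd.DataFrame({"gender": [0, 0, 0, 1, 1, 1, -1, -1]})

df_filled = impute_with_classifier(df_demo, X_demo, "gender", C=2)
print(df_filled)
```

With such a helper, each column in the original script would reduce to a single call with its own `C` value, and the input frame is left unmodified because the function works on a copy.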
Reposted from: http://xnoji.baihongyu.com/