Data Analysis: Titanic Survival Prediction (Python)

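I did this as a course project at school, but only half understood the workflow. Now that I have finished the book 机器学习实战, I am redoing it with my own understanding.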

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Loading the data

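Inspecting the data shows that Age and Cabin have many missing values (Embarked is also missing 2), and that Name, Sex, Ticket, Cabin and Embarked are of object dtype, so they need to be converted in the later processing steps.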

data_train = pd.read_csv(r'C:/Users/train.csv')
data_train.info()


RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
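The same information that info() prints can be collected into a table that is easier to sort and compare. The following is a minimal sketch on a toy frame; the helper name summarize_missing and the toy values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def summarize_missing(df):
    """Return dtype, missing count and missing fraction for each column."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isnull().sum(),
        "frac_missing": df.isnull().mean().round(3),
    })
    # Columns with the most missing values come first
    return summary.sort_values("n_missing", ascending=False)

# Toy data standing in for the Titanic training set
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Sex": ["male", "female", "female", "male"],
})
missing_summary = summarize_missing(toy)
print(missing_summary)
```

Applied to data_train, a helper like this would put Cabin and Age at the top of the table.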

Now look at the test set.

data_test= pd.read_csv(r'test.csv')
data_test.info()


RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         418 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB

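Set the index to the passenger ID. Note that test_process is only created in the missing-value step further down, so this cell has to be run after that step and after the name features have been added.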

test_process = test_process.set_index(['PassengerId'])
test_process

The test set now looks like this:

	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Embarked	Called	Name_length	First_name
PassengerId
892	3	Kelly, Mr. James	male	34	0	0	330911	7.8292	Q	Mr	16	Kelly
893	3	Wilkes, Mrs. James (Ellen Needs)	female	47	1	0	363272	7.0000	S	Mr	32	Wilkes
894	2	Myles, Mr. Thomas Francis	male	62	0	0	240276	9.6875	Q	Mr	25	Myles
895	3	Wirz, Mr. Albert	male	27	0	0	315154	8.6625	S	Mr	16	Wirz
896	3	Hirvonen, Mrs. Alexander (Helga E Lindqvist)	female	22	1	1	3101298	12.2875	S	Mr	44	Hirvonen
...	...	...	...	...	...	...	...	...	...	...	...	...
1305	3	Spector, Mr. Woolf	male	25	0	0	A.5. 3236	8.0500	S	Mr	18	Spector
1306	1	Oliva y Ocana, Dona. Fermina	female	39	0	0	PC 17758	108.9000	C	NaN	28	Oliva y Ocana
1307	3	Saether, Mr. Simon Sivertsen	male	38	0	0	SOTON/O.Q. 3101262	7.2500	S	Mr	28	Saether
1308	3	Ware, Mr. Frederick	male	25	0	0	359309	8.0500	S	Mr	19	Ware
1309	3	Peter, Master. Michael J	male	22	1	1	2668	22.3583	C	NaN	24	Peter

418 rows × 12 columns

Data processing

Handling missing values

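The missing values in this dataset are assumed to be missing completely at random, that is, independent of the other fully observed variables, so both deletion and imputation are acceptable.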

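Cabin has too many missing values (only 204 of 891 rows are non-null), so the feature is dropped outright. If that feels too hasty, compute some correlations or draw a plot to check first.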
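One quick check before dropping Cabin is to compare survival rates between passengers with and without a recorded cabin. This is only a sketch on made-up rows; the helper name survival_by_cabin_known is hypothetical, and the column names Cabin and Survived are the ones from the training set.

```python
import numpy as np
import pandas as pd

def survival_by_cabin_known(df):
    """Survival rate and row count, split by whether Cabin is recorded."""
    cabin_known = df["Cabin"].notnull().rename("Cabin_known")
    return df.groupby(cabin_known)["Survived"].agg(["mean", "count"])

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "Cabin": [np.nan, "C85", np.nan, "E46", np.nan, np.nan],
    "Survived": [0, 1, 1, 1, 0, 0],
})
cabin_check = survival_by_cabin_known(toy)
print(cabin_check)
```

If the two groups differ a lot on the real data, a binary "cabin recorded" flag may be worth keeping even when the cabin string itself is discarded.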

# Drop Cabin
train_process = data_train.drop(['Cabin'], axis=1)

# Impute the missing Age values
from sklearn.ensemble import RandomForestRegressor
# .copy() avoids pandas' SettingWithCopyWarning when Age is written back below
Age_df = train_process[['Age','Survived','Pclass','SibSp','Parch','Fare']].copy()
UnknowAge = Age_df[Age_df.Age.isnull()].values
KnowAge = Age_df[Age_df.Age.notnull()].values
# y is the target (Age); x holds the known attributes
y_train = KnowAge[:,0]
x_train = KnowAge[:,1:]
rfr = RandomForestRegressor(n_estimators=500, random_state=42)
rfr.fit(x_train, y_train)
predictedAges = rfr.predict(UnknowAge[:,1:])
Age_df.loc[Age_df.Age.isnull(), 'Age'] = predictedAges
# astype(int) truncates fractional ages
train_process['Age'] = Age_df.Age.astype(int)

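Missing ages are imputed with a random forest: a regressor is fitted on the rows where Age is known and then used to predict the missing values.

The test set also needs the Cabin column dropped and its missing ages imputed.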

# Test set
test_process = data_test.drop(['Cabin'],axis=1)
test_process.info()


RangeIndex: 418 entries, 0 to 417
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         418 non-null    float64
 9   Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(4)
memory usage: 32.8+ KB

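# The test set has no Survived column, so Age is predicted from the other four features
# .copy() avoids pandas' SettingWithCopyWarning when Age is written back below
Age_df = test_process[['Age','Pclass','SibSp','Parch','Fare']].copy()
UnknowAge = Age_df[Age_df.Age.isnull()].values
KnowAge = Age_df[Age_df.Age.notnull()].values
# y is the target (Age); x holds the known attributes
y_train = KnowAge[:,0]
x_train = KnowAge[:,1:]
rfr = RandomForestRegressor(n_estimators=500, random_state=42)
rfr.fit(x_train, y_train)
predictedAges = rfr.predict(UnknowAge[:,1:])
Age_df.loc[Age_df.Age.isnull(), 'Age'] = predictedAges
# astype(int) truncates fractional ages
test_process['Age'] = Age_df.Age.astype(int)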
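The training-set and test-set imputation steps repeat the same logic, so they could be wrapped in one function. The sketch below is one possible shape for such a helper, run on synthetic data; the name impute_age, its parameters, and the toy frame are assumptions for illustration and do not come from the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def impute_age(df, feature_cols, n_estimators=100, random_state=42):
    """Return a copy of df with missing Age predicted from feature_cols."""
    out = df.copy()
    missing = out["Age"].isnull()
    if not missing.any():
        return out
    model = RandomForestRegressor(n_estimators=n_estimators,
                                  random_state=random_state)
    # Fit on rows with a known Age, predict the rows without one
    model.fit(out.loc[~missing, feature_cols], out.loc[~missing, "Age"])
    out.loc[missing, "Age"] = model.predict(out.loc[missing, feature_cols])
    return out

# Synthetic stand-in for the Titanic columns
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=40),
    "SibSp": rng.integers(0, 3, size=40),
    "Fare": rng.uniform(5, 100, size=40),
})
toy["Age"] = 20 + 10 * (3 - toy["Pclass"]) + rng.normal(0, 2, size=40)
toy.loc[[3, 7, 11], "Age"] = np.nan

filled = impute_age(toy, ["Pclass", "SibSp", "Fare"])
print(filled["Age"].isnull().sum())
```

Because the function works on a copy, the input frame is left unchanged and no chained-assignment warning is raised.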

Text feature processing

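The Name text column is processed by extracting the title, the length of the name, and the part before the comma (the family name, stored as First_name); the raw Name column is discarded afterwards.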

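def change(df):
    # Adds the new columns to df in place.
    # Note: 'Mr' also matches the start of 'Mrs', and titles such as
    # 'Master' or 'Dona' match nothing and become NaN.
    df['Called'] = df['Name'].str.findall('Miss|Mr|Ms').str[0]
    df['Name_length'] = df['Name'].str.len()
    # The part before the comma is the family name
    df['First_name'] = df['Name'].str.split(',').str[0]
    # Name itself stays in df: rebinding df to df.drop(...) inside the function
    # would not affect the caller, so the column is dropped after encoding.

change(train_process)
change(test_process)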
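The pattern 'Miss|Mr|Ms' labels "Mrs" as "Mr" and leaves titles such as "Master" empty, which is visible in the Called column of the tables in this post. An alternative, shown here only as a sketch, is to capture whatever sits between the comma and the first period. The helper name extract_title is hypothetical, and the regular expression assumes names follow the "Surname, Title. Given names" layout seen in this dataset.

```python
import pandas as pd

def extract_title(names):
    """Capture the text between the comma and the first period."""
    return names.str.extract(r",\s*([^.,]+)\.", expand=False).str.strip()

names = pd.Series([
    "Kelly, Mr. James",
    "Wilkes, Mrs. James (Ellen Needs)",
    "Peter, Master. Michael J",
    "Oliva y Ocana, Dona. Fermina",
])

titles = extract_title(names)
# The pattern used in change(), for comparison
original = names.str.findall("Miss|Mr|Ms").str[0]

print(pd.DataFrame({"extract_title": titles, "original": original}))
```

Switching to this extraction would change the encoded values of Called, so the numbers shown elsewhere in this post would no longer match.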

TargetEncoder

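Encode the remaining object-type columns.

scikit-learn and related packages offer many encoding schemes. As a rule of thumb, target encoding suits features with no inherent order and more than about 4 categories, while one-hot encoding suits features with no inherent order and fewer than about 4 categories. The TargetEncoder used here comes from the category_encoders package.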
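To make the difference between the two schemes concrete, here is a small hand-rolled sketch using only pandas. It shows plain, unsmoothed mean encoding; category_encoders' TargetEncoder additionally blends each category mean with the global mean, so its output will not be identical. The toy values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Target encoding: replace each category by the mean of the target in it
category_means = toy.groupby("Sex")["Survived"].mean()
toy["Sex_target"] = toy["Sex"].map(category_means)

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(toy["Sex"], prefix="Sex")

print(toy)
print(one_hot)
```

Target encoding keeps a single column no matter how many categories there are, which is why it is attractive for columns such as Ticket. The price is that it uses the labels, so high-cardinality columns can overfit.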

from category_encoders import TargetEncoder
# Columns 0 and 1 are PassengerId and Survived; the remaining columns are the features
X_train = train_process.iloc[:,2:]
y_train = train_process.iloc[:,1]
tar_encoder1 = TargetEncoder(cols=['Sex','Ticket','Embarked','Called','Name_length','First_name'],
                             handle_missing='value',
                             handle_unknown='value')

tar_encoder1.fit(X_train,y_train)

TargetEncoder(cols=['Sex', 'Ticket', 'Embarked', 'Called', 'Name_length',
                    'First_name'])

X_train_encoded = tar_encoder1.transform(X_train)

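# Name is not encoded; drop it so that only numeric columns remain
X_train_encoded = X_train_encoded.drop(['Name'],axis=1)
X_train_encoded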

	Pclass	Sex	Age	SibSp	Parch	Ticket	Fare	Embarked	Called	Name_length	First_name
0	3	0.188908	22.0	1	0	0.383838	7.2500	0.336957	0.283721	0.282051	0.103230
1	1	0.742038	38.0	1	0	0.383838	71.2833	0.553571	0.283721	0.998476	0.383838
2	3	0.742038	26.0	0	0	0.383838	7.9250	0.336957	0.697802	0.315789	0.383838
3	1	0.742038	35.0	1	0	0.468759	53.1000	0.336957	0.283721	0.999439	0.468759
4	3	0.188908	35.0	0	0	0.383838	8.0500	0.336957	0.283721	0.372093	0.468759
...	...	...	...	...	...	...	...	...	...	...	...
886	2	0.188908	27.0	0	0	0.383838	13.0000	0.336957	0.492063	0.325000	0.383838
887	1	0.742038	19.0	0	0	0.383838	30.0000	0.336957	0.697802	0.372093	0.632953
888	3	0.742038	NaN	1	2	0.103230	23.4500	0.336957	0.697802	0.428461	0.103230
889	1	0.188908	26.0	0	0	0.383838	30.0000	0.553571	0.283721	0.325000	0.383838
890	3	0.188908	32.0	0	0	0.383838	7.7500	0.389610	0.283721	0.234375	0.383838

891 rows × 11 columns

X_test = test_process

X_test.drop(['Name'],axis=1)

	Pclass	Sex	Age	SibSp	Parch	Ticket	Fare	Embarked	Called	Name_length	First_name
PassengerId
892	3	male	34	0	0	330911	7.8292	Q	Mr	16	Kelly
893	3	female	47	1	0	363272	7.0000	S	Mr	32	Wilkes
894	2	male	62	0	0	240276	9.6875	Q	Mr	25	Myles
895	3	male	27	0	0	315154	8.6625	S	Mr	16	Wirz
896	3	female	22	1	1	3101298	12.2875	S	Mr	44	Hirvonen
...	...	...	...	...	...	...	...	...	...	...	...
1305	3	male	25	0	0	A.5. 3236	8.0500	S	Mr	18	Spector
1306	1	female	39	0	0	PC 17758	108.9000	C	NaN	28	Oliva y Ocana
1307	3	male	38	0	0	SOTON/O.Q. 3101262	7.2500	S	Mr	28	Saether
1308	3	male	25	0	0	359309	8.0500	S	Mr	19	Ware
1309	3	male	22	1	1	2668	22.3583	C	NaN	24	Peter

418 rows × 11 columns

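# Encode with the encoder fitted on the training set, then drop the raw Name column
X_test_encoded = tar_encoder1.transform(X_test).drop(['Name'],axis=1)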

Feature scaling

Several models are compared below, so the continuous features are standardized first.

import sklearn.preprocessing as preprocessing
scaler = preprocessing.StandardScaler()
# Fit on the training set only; a second fit on the test set would
# overwrite the training statistics
scaler.fit(X_train_encoded[['Age','Fare']])

StandardScaler()

X_train_encoded[['Age','Fare']] = scaler.transform(X_train_encoded[['Age','Fare']])
X_test_encoded[['Age','Fare']] = scaler.transform(X_test_encoded[['Age','Fare']])
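A point that is easy to miss with StandardScaler is that each call to fit replaces the previously learned mean and standard deviation rather than combining them. The toy arrays below are invented to show this, along with the usual pattern of fitting on the training data and reusing those statistics for the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[100.0], [200.0]])

scaler = StandardScaler()
scaler.fit(train)
mean_after_train_fit = scaler.mean_[0]

scaler.fit(test)  # a second fit discards what was learned from train
mean_after_second_fit = scaler.mean_[0]

# Usual pattern: learn the statistics from train, apply them to both sets
scaler = StandardScaler().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

print(mean_after_train_fit, mean_after_second_fit)
print(test_scaled.ravel())
```

With the statistics taken from the training set, the scaled test values need not have zero mean, and that is expected.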

Model prediction

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC

X_train_encoded
X_test_encoded

	Pclass	Sex	Age	SibSp	Parch	Ticket	Fare	Embarked	Called	Name_length	First_name
PassengerId
892	3	0.188908	0.325138	0	0	0.383838	-0.497063	0.389610	0.283721	0.230769	0.732634
893	3	0.742038	1.326156	1	0	0.383838	-0.511926	0.336957	0.283721	0.565217	0.383838
894	2	0.188908	2.481178	0	0	0.383838	-0.463754	0.389610	0.283721	0.327273	0.383838
895	3	0.188908	-0.213872	0	0	0.383838	-0.482127	0.336957	0.283721	0.230769	0.383838
896	3	0.742038	-0.598880	1	1	0.383838	-0.417151	0.336957	0.283721	0.999439	0.383838
...	...	...	...	...	...	...	...	...	...	...	...
1305	3	0.188908	-0.367875	0	0	0.383838	-0.493105	0.336957	0.283721	0.200000	0.383838
1306	1	0.742038	0.710145	0	0	0.468759	1.314557	0.553571	0.492063	0.372093	0.383838
1307	3	0.188908	0.633143	0	0	0.383838	-0.507445	0.336957	0.283721	0.372093	0.383838
1308	3	0.188908	-0.367875	0	0	0.383838	-0.493105	0.336957	0.283721	0.234375	0.383838
1309	3	0.188908	-0.598880	1	1	0.834289	-0.236640	0.553571	0.492063	0.372093	0.834289

418 rows × 11 columns

X_train_encoded.info()


RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Pclass       891 non-null    int64  
 1   Sex          891 non-null    float64
 2   Age          891 non-null    int32  
 3   SibSp        891 non-null    int64  
 4   Parch        891 non-null    int64  
 5   Ticket       891 non-null    float64
 6   Fare         891 non-null    float64
 7   Embarked     891 non-null    float64
 8   Called       891 non-null    float64
 9   Name_length  891 non-null    float64
 10  First_name   891 non-null    float64
dtypes: float64(7), int32(1), int64(3)
memory usage: 73.2 KB

Voting

Start with a voting ensemble.

lr_clf = LogisticRegression(penalty='l1',solver='saga',n_jobs=-1,max_iter=20000)
rnd_clf = RandomForestClassifier(n_estimators=300,max_depth=8,min_samples_leaf=1,min_samples_split=5,random_state=42)
svm_clf = SVC(C=2,kernel='poly',random_state=42,probability=True)
voting_clf = VotingClassifier(estimators=[('lr',lr_clf),('rf',rnd_clf),('scv',svm_clf)],voting='soft')
voting_clf.fit(X_train_encoded,y_train)

  VotingClassifier(estimators=[('lr',
                                  LogisticRegression(max_iter=20000, n_jobs=-1,
                                                     penalty='l1', solver='saga')),
                                 ('rf',
                                  RandomForestClassifier(max_depth=8,
                                                         min_samples_split=5,
                                                         n_estimators=300,
                                                         random_state=42)),
                                 ('scv',
                                  SVC(C=2, kernel='poly', probability=True,
                                      random_state=42))],
                     voting='soft')
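With voting='soft', the ensemble averages the class probabilities predicted by its members and picks the class with the highest average, whereas hard voting counts each member's predicted label. The arithmetic can be sketched with numpy alone; the probabilities below are invented for illustration and do not come from the fitted models.

```python
import numpy as np

# Hypothetical P(Survived = 1) from three models for four passengers
proba = np.array([
    [0.90, 0.40, 0.20, 0.55],  # model A
    [0.80, 0.60, 0.10, 0.55],  # model B
    [0.70, 0.30, 0.40, 0.10],  # model C
])

# Soft voting: average the probabilities, then threshold at 0.5
soft_avg = proba.mean(axis=0)
soft_vote = (soft_avg > 0.5).astype(int)

# Hard voting: each model casts a 0/1 vote, the majority wins
hard_vote = ((proba > 0.5).sum(axis=0) >= 2).astype(int)

print(soft_avg.round(3))
print(soft_vote, hard_vote)
```

The last passenger shows how the two rules can disagree: two lukewarm "yes" votes win under hard voting, while one confident "no" pulls the average below 0.5 under soft voting. Soft voting needs every member to output probabilities, which is why SVC is created with probability=True.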

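# gender_submission.csv is Kaggle's sample submission (it predicts that all
# women survive and all men do not), not the true test labels, so the scores
# below measure agreement with that baseline.
y_test = pd.read_csv(r'C:/Users/gender_submission.csv')

y_test = y_test['Survived']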

from sklearn.metrics import accuracy_score

for clf in (lr_clf,rnd_clf,svm_clf,voting_clf):
    clf.fit(X_train_encoded,y_train)
    y_pred = clf.predict(X_test_encoded)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.6961722488038278
RandomForestClassifier 0.80622009569378
SVC 0.6363636363636364
VotingClassifier 0.8110047846889952
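Since gender_submission.csv is Kaggle's sample submission rather than the true test labels, a more reliable way to compare models locally is cross-validation on the training set, using the cross_val_score function imported earlier. The sketch below uses synthetic data from make_classification so that it runs on its own; with the real data, X_train_encoded and y_train would take the place of X_demo and y_demo.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the encoded training features and labels
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     random_state=42)

clf = RandomForestClassifier(n_estimators=50, random_state=42)
# 5-fold cross-validation: each fold is held out once for scoring
scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="accuracy")

print(scores)
print(scores.mean())
```

One caveat for this post's pipeline: the target encoder is fitted on all training labels before any split, so cross-validated scores on X_train_encoded would still be optimistic unless the encoder is fitted inside each fold.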

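Next, try XGBoost. I expected it to perform better, but note that the code below fits a regressor (XGBRFRegressor) and reports mean squared error, so the number is not directly comparable with the accuracies above.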

XGBoost

import xgboost
from sklearn.metrics import mean_squared_error
# XGBRFRegressor is xgboost's random-forest-style regressor, so the
# predictions are continuous scores rather than 0/1 labels
xgb_reg = xgboost.XGBRFRegressor(random_state=42)
xgb_reg.fit(X_train_encoded, y_train)
y_pred = xgb_reg.predict(X_test_encoded)
val_error = mean_squared_error(y_test, y_pred)
print("Validation MSE:", val_error)

Validation MSE: 0.5023153196818051
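To compare a regressor's output with the classifiers' accuracy, the continuous scores can be thresholded into labels. For 0/1 targets it also helps to know that always predicting 0.5 gives an MSE of 0.25, which serves as a reference point when reading an MSE value. The numbers below are invented to show the arithmetic.

```python
import numpy as np

# Invented labels and regression outputs
y_true = np.array([0, 1, 1, 0, 1])
y_score = np.array([0.2, 0.7, 0.4, 0.6, 0.9])

# Turn continuous scores into class labels with a 0.5 threshold
y_label = (y_score >= 0.5).astype(int)
accuracy = (y_label == y_true).mean()

# Mean squared error of the raw scores
mse = ((y_score - y_true) ** 2).mean()

# Reference point: always predicting 0.5 on 0/1 labels
baseline_mse = ((0.5 - y_true) ** 2).mean()

print(accuracy, mse, baseline_mse)
```

An alternative is to fit a classifier such as xgboost.XGBClassifier and score it with accuracy_score like the other models; that was not done in the run reported here.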

Sharing is welcome. When reposting, please credit the source: 内存溢出

Original URL: http://outofmemory.cn/langs/580511.html
