First Kaggle experience: Titanic survival prediction
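This was my first time using Kaggle, and I ran into three mishaps. First, I could not export the code, so I had to retype all of it in the notebook. Then I carelessly deleted the first two parts of the assignment, and because I had already closed the notebook I could not recover them. Last came posting to Zhihu: I hope this version, which contains only half of the content, goes through.
My biggest takeaway from this stage: the road ahead is long. Programming is a skill that comes with practice, and I need to practice a lot more.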
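The code that follows maps the title found in each passenger name to a category, but the step that pulls the title out of the Name column did not survive in this post. Here is a minimal sketch of that step, assuming names follow the "Surname, Title. Given names" pattern used in the Titanic data. `get_title` is a hypothetical helper, not the original code.

```python
def get_title(name):
    """Extract the title from a name shaped like 'Surname, Title. Given names'."""
    after_comma = name.split(",", 1)[1]      # ' Mr. Owen Harris'
    title = after_comma.split(".", 1)[0]     # ' Mr'
    return title.strip()                     # 'Mr'

# Three names in the format used by the Titanic data
names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
]
titles = [get_title(n) for n in names]
print(titles)
```

With pandas, the same helper could presumably be applied as `titleDf['Title'] = full['Name'].map(get_title)` before the category mapping is run.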
import pandas as pd

# Mapping from the title string found in each name to a title category
title_mapDict = {
    "Capt": "Officer",
    "Col": "Officer",
    "Major": "Officer",
    "Jonkheer": "Royalty",
    "Don": "Royalty",
    "Sir": "Royalty",
    "Dr": "Royalty",
    "Rev": "Officer",
    "the Countess": "Royalty",
    "Dona": "Royalty",
    "Mme": "Mrs",
    "Mlle": "Miss",
    "Ms": "Mrs",
    "Mr": "Mr",
    "Mrs": "Mrs",
    "Miss": "Miss",
    "Master": "Master",
    "Lady": "Royalty",
}
# map: look up every value of the Series in the mapping
titleDf['Title'] = titleDf['Title'].map(title_mapDict)
# One-hot encode with get_dummies
titleDf = pd.get_dummies(titleDf['Title'])
titleDf.head()
   Master  Miss  Mr  Mrs  Officer  Royalty
0       0     0   1    0        0        0
1       0     0   0    1        0        0
2       0     1   0    0        0        0
3       0     0   0    1        0        0
4       0     0   1    0        0        0

# Add the dummy variables produced by one-hot encoding to the Titanic dataset
full = pd.concat([full, titleDf], axis=1)
# Drop the Name column, which is no longer needed
full.drop('Name', axis=1, inplace=True)
full.head()
    Age  Cabin  Embarked     Fare  Parch  PassengerId  Pclass     Sex  SibSp  Survived  ...  Embarked_S  Pclass_1  Pclass_2  Pclass_3  Master  Miss  Mr  Mrs  Officer  Royalty
0  22.0      U         S   7.2500      0            1       3    male      1       0.0  ...           1         0         0         1       0     0   1    0        0        0
1  38.0    C85         C  71.2833      0            2       1  female      1       1.0  ...           0         1         0         0       0     0   0    1        0        0
2  26.0      U         S   7.9250      0            3       3  female      0       1.0  ...           1         0         0         1       0     1   0    0        0        0
3  35.0   C123         S  53.1000      0            4       1  female      1       1.0  ...           1         1         0         0       0     0   0    1        0        0
4  35.0      U         S   8.0500      0            5       3    male      0       0.0  ...           1         0         0         1       0     0   1    0        0        0
5 rows × 23 columns

# Look at the cabin numbers
full['Cabin'].head()

0       U
1     C85
2       U
3    C123
4       U
Name: Cabin, dtype: object

# Anonymous function syntax: lambda arg1, arg2: expression
# Define an anonymous function that adds two numbers
# (named add rather than sum so the built-in sum is not shadowed)
add = lambda a, b: a + b
# Call the function
add(10, 20)

30

# Holds the extracted feature
cabinDf = pd.DataFrame()
# The category of a cabin number is its first letter, e.g. C85 maps to C
full['Cabin'] = full['Cabin'].map(lambda c: c[0])
# One-hot encode with get_dummies; the column prefix is Cabin
cabinDf = pd.get_dummies(full['Cabin'], prefix='Cabin')
cabinDf.head()
   Cabin_A  Cabin_B  Cabin_C  Cabin_D  Cabin_E  Cabin_F  Cabin_G  Cabin_T  Cabin_U
0        0        0        0        0        0        0        0        0        1
1        0        0        1        0        0        0        0        0        0
2        0        0        0        0        0        0        0        0        1
3        0        0        1        0        0        0        0        0        0
4        0        0        0        0        0        0        0        0        1

# Add the dummy variables produced by one-hot encoding to the Titanic dataset
full = pd.concat([full, cabinDf], axis=1)
# Drop the Cabin column
full.drop('Cabin', axis=1, inplace=True)
full.head()
    Age  Embarked     Fare  Parch  PassengerId  Pclass     Sex  SibSp  Survived            Ticket  ...  Royalty  Cabin_A  Cabin_B  Cabin_C  Cabin_D  Cabin_E  Cabin_F  Cabin_G  Cabin_T  Cabin_U
0  22.0         S   7.2500      0            1       3    male      1       0.0         A/5 21171  ...        0        0        0        0        0        0        0        0        0        1
1  38.0         C  71.2833      0            2       1  female      1       1.0          PC 17599  ...        0        0        0        1        0        0        0        0        0        0
2  26.0         S   7.9250      0            3       3  female      0       1.0  STON/O2. 3101282  ...        0        0        0        0        0        0        0        0        0        1
3  35.0         S  53.1000      0            4       1  female      1       1.0            113803  ...        0        0        0        1        0        0        0        0        0        0
4  35.0         S   8.0500      0            5       3    male      0       0.0            373450  ...        0        0        0        0        0        0        0        0        0        1
5 rows × 31 columns

# Holds the family information
familyDf = pd.DataFrame()
# Family size = parents/children + siblings/spouses + the passenger
familyDf['FamilySize'] = full['Parch'] + full['SibSp'] + 1
# Family_Single: family size == 1
# Family_Small:  2 <= family size <= 4
# Family_Large:  family size >= 5
# "x if condition else y" returns x when the condition is true, otherwise y (here 0)
familyDf['Family_Single'] = familyDf['FamilySize'].map(lambda s: 1 if s == 1 else 0)
familyDf['Family_Small'] = familyDf['FamilySize'].map(lambda s: 1 if 2 <= s <= 4 else 0)
familyDf['Family_Large'] = familyDf['FamilySize'].map(lambda s: 1 if 5 <= s else 0)
familyDf.head()
   FamilySize  Family_Single  Family_Small  Family_Large
0           2              0             1             0
1           2              0             1             0
2           1              1             0             0
3           2              0             1             0
4           1              1             0             0

# Add the family-size dummy variables to the Titanic dataset full
full = pd.concat([full, familyDf], axis=1)
full.head()
    Age  Embarked     Fare  Parch  PassengerId  Pclass     Sex  SibSp  Survived            Ticket  ...  Cabin_D  Cabin_E  Cabin_F  Cabin_G  Cabin_T  Cabin_U  FamilySize  Family_Single  Family_Small  Family_Large
0  22.0         S   7.2500      0            1       3    male      1       0.0         A/5 21171  ...        0        0        0        0        0        1           2              0             1             0
1  38.0         C  71.2833      0            2       1  female      1       1.0          PC 17599  ...        0        0        0        0        0        0           2              0             1             0
2  26.0         S   7.9250      0            3       3  female      0       1.0  STON/O2. 3101282  ...        0        0        0        0        0        1           1              1             0             0
3  35.0         S  53.1000      0            4       1  female      1       1.0            113803  ...        0        0        0        0        0        0           2              0             1             0
4  35.0         S   8.0500      0            5       3    male      0       0.0            373450  ...        0        0        0        0        0        1           1              1             0             0
5 rows × 35 columns

full.shape

(1309, 35)

# Correlation matrix of the numeric columns
# (numeric_only=True skips string columns such as Ticket; recent pandas versions require it)
corrDf = full.corr(numeric_only=True)
corrDf
                     Age       Fare      Parch  PassengerId     Pclass      SibSp   Survived  Embarked_C  Embarked_Q  Embarked_S  ...    Cabin_D    Cabin_E    Cabin_F    Cabin_G    Cabin_T    Cabin_U  FamilySize  Family_Single  Family_Small  Family_Large
Age             1.000000   0.171521  -0.130872     0.025731  -0.366371  -0.190747  -0.070323    0.076179   -0.012718   -0.059153  ...   0.132886   0.106600  -0.072644  -0.085977   0.032461  -0.271918   -0.196996       0.116675     -0.038189     -0.161210
Fare            0.171521   1.000000   0.221522     0.031416  -0.558477   0.160224   0.257307    0.286241   -0.130054   -0.169894  ...   0.072737   0.073949  -0.037567  -0.022857   0.001179  -0.507197    0.226465      -0.274826      0.197281      0.170853
Parch          -0.130872   0.221522   1.000000     0.008942   0.018322   0.373587   0.081629   -0.008635   -0.100943    0.071881  ...  -0.027385   0.001084   0.020481   0.058325  -0.012304  -0.036806    0.792296      -0.549022      0.248532      0.624627
PassengerId     0.025731   0.031416   0.008942     1.000000  -0.038354  -0.055224  -0.005007    0.048101    0.011585   -0.049836  ...   0.000549  -0.008136   0.000306  -0.045949  -0.023049   0.000208   -0.031437       0.028546      0.002975     -0.063415
Pclass         -0.366371  -0.558477   0.018322    -0.038354   1.000000   0.060832  -0.338481   -0.269658    0.230491    0.091320  ...  -0.265341  -0.225649   0.013122   0.052133  -0.042750   0.713857    0.050027       0.147393     -0.218303      0.127306
SibSp          -0.190747   0.160224   0.373587    -0.055224   0.060832   1.000000  -0.035322   -0.048396   -0.048678    0.073709  ...  -0.015727  -0.027180  -0.008619   0.006015  -0.013247   0.009064    0.861952      -0.591077      0.253590      0.699681
Survived       -0.070323   0.257307   0.081629    -0.005007  -0.338481  -0.035322   1.000000    0.168240    0.003650   -0.149683  ...   0.150716   0.145321   0.057935   0.016040  -0.026456  -0.316912    0.016639      -0.203367      0.279855     -0.125147
Embarked_C      0.076179   0.286241  -0.008635     0.048101  -0.269658  -0.048396   0.168240    1.000000   -0.164166   -0.778262  ...   0.107782   0.027566  -0.020010  -0.031566  -0.014095  -0.258257   -0.036553      -0.107874      0.159594     -0.092825
Embarked_Q     -0.012718  -0.130054  -0.100943     0.011585   0.230491  -0.048678   0.003650   -0.164166    1.000000   -0.491656  ...  -0.061459  -0.042877  -0.020282  -0.019941  -0.008904   0.142369   -0.087190       0.127214     -0.122491     -0.018423
Embarked_S     -0.059153  -0.169894   0.071881    -0.049836   0.091320   0.073709  -0.149683   -0.778262   -0.491656    1.000000  ...  -0.056023   0.002960   0.030575   0.040560   0.018111   0.137351    0.087771       0.014246     -0.062909      0.093671
Pclass_1        0.362587   0.599956  -0.013033     0.026495  -0.884911  -0.034256   0.285904    0.325722   -0.166101   -0.181800  ...   0.275698   0.242963  -0.073083  -0.035441   0.048310  -0.776987   -0.029656      -0.126551      0.165965     -0.067523
Pclass_2       -0.014193  -0.121372  -0.010057     0.022714  -0.182413  -0.052419   0.093349   -0.134675   -0.121973    0.196532  ...  -0.037929  -0.050210   0.127371  -0.032081  -0.014325   0.176485   -0.039976      -0.035075      0.097270     -0.118495
Pclass_3       -0.302093  -0.419616   0.019521    -0.041544   0.915201   0.072610  -0.322308   -0.171430    0.243706   -0.003805  ...  -0.207455  -0.169063  -0.041178   0.056964  -0.030057   0.527614    0.058430       0.138250     -0.223338      0.155560
Master         -0.363923   0.011596   0.253482     0.002254   0.095257   0.329171   0.085221   -0.014172   -0.009091    0.018297  ...  -0.042192   0.001860   0.058311  -0.013690  -0.006113   0.041178    0.355061      -0.265355      0.120166      0.301809
Miss           -0.254146   0.092051   0.066473    -0.050027   0.024487   0.077564   0.332795   -0.014351    0.198804   -0.113886  ...  -0.012516   0.008700  -0.003088   0.061881  -0.013832  -0.004364    0.087350      -0.023890     -0.018085      0.083422
Mr              0.165476  -0.192192  -0.304780     0.014116   0.121492  -0.243104  -0.549199   -0.065538   -0.080224    0.108924  ...  -0.030261  -0.032953  -0.026403  -0.072514   0.023611   0.131807   -0.326487       0.386262     -0.300872     -0.194207
Mrs             0.199221   0.141701   0.216271     0.032793  -0.181733   0.063941   0.341994    0.100960   -0.106723   -0.021186  ...   0.081539   0.046493   0.013975   0.042987  -0.011673  -0.165224    0.160264      -0.359571      0.365759      0.014048
Officer         0.147577   0.013263  -0.023024     0.007199  -0.097900  -0.024008  -0.046444   -0.001667   -0.010073    0.007884  ...  -0.020547  -0.019360  -0.013748  -0.006667  -0.002977  -0.045001   -0.028376       0.013900     -0.000116     -0.027833
Royalty         0.094315   0.040143  -0.037685    -0.001710  -0.143020   0.000114   0.027909    0.057126   -0.008031   -0.045317  ...   0.020490  -0.018697  -0.013276  -0.006438  -0.002875  -0.086119   -0.020522       0.008363      0.005137     -0.026879
Cabin_A         0.125177   0.020094  -0.030707    -0.002831  -0.202143  -0.039808   0.022287    0.094914   -0.042105   -0.056984  ...  -0.024952  -0.023510  -0.016695  -0.008096  -0.003615  -0.242399   -0.042967       0.045227     -0.029546     -0.033799
Cabin_B         0.113458   0.393743   0.073051     0.015895  -0.353414  -0.011569   0.175095    0.161595   -0.073613   -0.095790  ...  -0.043624  -0.041103  -0.029188  -0.014154  -0.006320  -0.423794    0.032318      -0.087912      0.084268      0.013470
Cabin_C         0.167993   0.401370   0.009601     0.006092  -0.430044   0.048616   0.114652    0.158043   -0.059151   -0.101861  ...  -0.053083  -0.050016  -0.035516  -0.017224  -0.007691  -0.515684    0.037226      -0.137498      0.141925      0.001362
Cabin_D         0.132886   0.072737  -0.027385     0.000549  -0.265341  -0.015727   0.150716    0.107782   -0.061459   -0.056023  ...   1.000000  -0.034317  -0.024369  -0.011817  -0.005277  -0.353822   -0.025313      -0.074310      0.102432     -0.049336
Cabin_E         0.106600   0.073949   0.001084    -0.008136  -0.225649  -0.027180   0.145321    0.027566   -0.042877    0.002960  ...  -0.034317   1.000000  -0.022961  -0.011135  -0.004972  -0.333381   -0.017285      -0.042535      0.068007     -0.046485
Cabin_F        -0.072644  -0.037567   0.020481     0.000306   0.013122  -0.008619   0.057935   -0.020010   -0.020282    0.030575  ...  -0.024369  -0.022961   1.000000  -0.007907  -0.003531  -0.236733    0.005525       0.004055      0.012756     -0.033009
Cabin_G        -0.085977  -0.022857   0.058325    -0.045949   0.052133   0.006015   0.016040   -0.031566   -0.019941    0.040560  ...  -0.011817  -0.011135  -0.007907   1.000000  -0.001712  -0.114803    0.035835      -0.076397      0.087471     -0.016008
Cabin_T         0.032461   0.001179  -0.012304    -0.023049  -0.042750  -0.013247  -0.026456   -0.014095   -0.008904    0.018111  ...  -0.005277  -0.004972  -0.003531  -0.001712   1.000000  -0.051263   -0.015438       0.022411     -0.019574     -0.007148
Cabin_U        -0.271918  -0.507197  -0.036806     0.000208   0.713857   0.009064  -0.316912   -0.258257    0.142369    0.137351  ...  -0.353822  -0.333381  -0.236733  -0.114803  -0.051263   1.000000   -0.014155       0.175812     -0.211367      0.056438
FamilySize     -0.196996   0.226465   0.792296    -0.031437   0.050027   0.861952   0.016639   -0.036553   -0.087190    0.087771  ...  -0.025313  -0.017285   0.005525   0.035835  -0.015438  -0.014155    1.000000      -0.688864      0.302640      0.801623
Family_Single   0.116675  -0.274826  -0.549022     0.028546   0.147393  -0.591077  -0.203367   -0.107874    0.127214    0.014246  ...  -0.074310  -0.042535   0.004055  -0.076397   0.022411   0.175812   -0.688864       1.000000     -0.873398     -0.318944
Family_Small   -0.038189   0.197281   0.248532     0.002975  -0.218303   0.253590   0.279855    0.159594   -0.122491   -0.062909  ...   0.102432   0.068007   0.012756   0.087471  -0.019574  -0.211367    0.302640      -0.873398      1.000000     -0.183007
Family_Large   -0.161210   0.170853   0.624627    -0.063415   0.127306   0.699681  -0.125147   -0.092825   -0.018423    0.093671  ...  -0.049336  -0.046485  -0.033009  -0.016008  -0.007148   0.056438    0.801623      -0.318944     -0.183007      1.000000
32 rows × 32 columns

# Correlation of every feature with survival (Survived);
# ascending=False sorts in descending order
corrDf['Survived'].sort_values(ascending=False)

Survived         1.000000
Mrs              0.341994
Miss             0.332795
Pclass_1         0.285904
Family_Small     0.279855
Fare             0.257307
Cabin_B          0.175095
Embarked_C       0.168240
Cabin_D          0.150716
Cabin_E          0.145321
Cabin_C          0.114652
Pclass_2         0.093349
Master           0.085221
Parch            0.081629
Cabin_F          0.057935
Royalty          0.027909
Cabin_A          0.022287
FamilySize       0.016639
Cabin_G          0.016040
Embarked_Q       0.003650
PassengerId     -0.005007
Cabin_T         -0.026456
SibSp           -0.035322
Officer         -0.046444
Age             -0.070323
Family_Large    -0.125147
Embarked_S      -0.149683
Family_Single   -0.203367
Cabin_U         -0.316912
Pclass_3        -0.322308
Pclass          -0.338481
Mr              -0.549199
Name: Survived, dtype: float64

# Feature selection
full_X = pd.concat([titleDf,       # title
                    pclassDf,      # passenger class
                    familyDf,      # family size
                    full['Fare'],  # ticket fare
                    cabinDf,       # cabin letter
                    embarkedDf,    # port of embarkation
                    full['Sex']    # sex
                    ], axis=1)
full_X.head()
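   Master  Miss  Mr  Mrs  Officer  Royalty  Pclass_1  Pclass_2  Pclass_3  FamilySize  ...  Cabin_D  Cabin_E  Cabin_F  Cabin_G  Cabin_T  Cabin_U  Embarked_C  Embarked_Q  Embarked_S  Sex
0       0     0   1    0        0        0         0         0         1           2  ...        0        0        0        0        0        1           0           0           1    1
1       0     0   0    1        0        0         1         0         0           2  ...        0        0        0        0        0        0           1           0           0    0
2       0     1   0    0        0        0         0         0         1           1  ...        0        0        0        0        0        1           0           0           1    0
3       0     0   0    1        0        0         1         0         0           2  ...        0        0        0        0        0        0           0           0           1    0
4       0     0   1    0        0        0         0         0         1           1  ...        0        0        0        0        0        1           0           0           1    1
5 rows × 27 columns

# The first 891 rows are the original (labelled) Kaggle training data
sourceRow = 891
# Original dataset: features
source_X = full_X.loc[0:sourceRow-1, :]
# Original dataset: labels
source_y = full.loc[0:sourceRow-1, 'Survived']
# Prediction dataset: features
pred_X = full_X.loc[sourceRow:, :]
print('Rows in the original dataset:', source_X.shape[0])
print('Rows in the prediction dataset:', pred_X.shape[0])

Rows in the original dataset: 891
Rows in the prediction dataset: 418

# train_test_split lives in sklearn.model_selection
# (the old sklearn.cross_validation module was removed from scikit-learn)
from sklearn.model_selection import train_test_split
# Split the original data into a training set and a test set for building the model
train_X, test_X, train_y, test_y = train_test_split(source_X, source_y, train_size=0.8)
# Print the dataset sizes
print('Original features:', source_X.shape,
      'Training features:', train_X.shape,
      'Test features:', test_X.shape)
print('Original labels:', source_y.shape,
      'Training labels:', train_y.shape,
      'Test labels:', test_y.shape)

Original features: (891, 27) Training features: (712, 27) Test features: (179, 27)
Original labels: (891,) Training labels: (712,) Test labels: (179,)

# Look at the original labels
source_y.head()

0    0.0
1    1.0
2    1.0
3    1.0
4    0.0
Name: Survived, dtype: float64

# Choose a machine learning algorithm; beginners are advised to start with logistic regression
# Step 1: import the algorithm
from sklearn.linear_model import LogisticRegression
# Step 2: create the logistic regression model
model = LogisticRegression()
# Step 3: train the model
model.fit(train_X, train_y)
# For a classification problem, score returns the model's accuracy
model.score(test_X, test_y)

0.8324022346368715

# Putting the model to use
# Predict survival for the prediction dataset with the trained model
pred_Y = model.predict(pred_X)
# The predictions are floats (0.0, 1.0), but Kaggle requires integers (0, 1),
# so the data type has to be converted
pred_Y = pred_Y.astype(int)
# Passenger ids
passenger_id = full.loc[sourceRow:, 'PassengerId']
# Data frame: passenger id and predicted survival
predDf = pd.DataFrame({'PassengerId': passenger_id,
                       'Survived': pred_Y})
predDf.shape
predDf.head()
# Save the result
predDf.to_csv('titanic_pred.csv', index=False)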
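Since Kaggle expects integer 0/1 values and one row per passenger in the prediction set, it can help to sanity-check the CSV before uploading. The following is only a sketch under those assumptions: `check_submission` is a hypothetical helper, and the two-row strings are made-up stand-ins for the real 418-row file.

```python
import csv
import io

def check_submission(csv_text, expected_rows=418):
    """Return a list of problems found in a submission CSV (empty list = looks fine)."""
    problems = []
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header != ["PassengerId", "Survived"]:
        problems.append(f"unexpected header: {header}")
    rows = list(reader)
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    # Data rows start on line 2 of the file, after the header
    for line_no, row in enumerate(rows, start=2):
        if len(row) != 2 or row[1] not in ("0", "1"):
            problems.append(f"line {line_no}: Survived must be 0 or 1, got {row}")
    return problems

# Hand-made examples (hypothetical data, two rows instead of 418)
good = "PassengerId,Survived\n892,0\n893,1\n"
bad = "PassengerId,Survived\n892,0.0\n893,1\n"   # float left unconverted
print(check_submission(good, expected_rows=2))
print(check_submission(bad, expected_rows=2))
```

For the real file, the text could be read with `open('titanic_pred.csv').read()` and checked with the default `expected_rows=418`.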