Titanic Survival Prediction Exercise (2)
3.2.1 Categorical data: string types
String types: we may be able to extract features from these columns, so they are also grouped under categorical data. The string columns here are:

- Passenger name (Name)
- Cabin number (Cabin)
- Ticket number (Ticket)
Extracting titles from passenger names: looking at the Name column, every name shares a distinctive pattern: it contains a specific form of address, or title. Extracting this information gives us a very useful new feature for prediction. For example:

- Braund, Mr. Owen Harris
- Heikkinen, Miss. Laina
- Oliva y Ocana, Dona. Fermina
- Peter, Master. Michael J

```python
full['Name'].head()
```

```
0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
Name: Name, dtype: object
```

Practice: extract the title (e.g. "Mr") from a string. `split` splits a string on a separator and returns a list. In a name like "Braund, Mr. Owen Harris", the part before the comma is the surname, and the part after the comma is "title. given name".

```python
name1 = 'Braund, Mr. Owen Harris'

# Split on ',' -> ['Braund', ' Mr. Owen Harris']; take the second element
str1 = name1.split(',')[1]    # ' Mr. Owen Harris'

# Split on '.' -> [' Mr', ' Owen Harris']; take the element at index 0,
# which is the part containing the title
str2 = str1.split('.')[0]     # ' Mr'

# strip() removes leading/trailing characters (whitespace by default)
str3 = str2.strip()           # 'Mr'
```

```python
# Define a function that extracts the title from a name
def getTitle(name):
    str1 = name.split(',')[1]
    str2 = str1.split('.')[0]
    str3 = str2.strip()
    return str3

# DataFrame to hold the extracted feature
titleDf = pd.DataFrame()
# map applies the given function to every element of the Series
titleDf['Title'] = full['Name'].map(getTitle)
titleDf.head()
```

```
  Title
0    Mr
1   Mrs
2  Miss
3   Mrs
4    Mr
```

We define the following title categories:

- Officer: government official
- Royalty: royalty/nobility
- Mr: married man
- Mrs: married woman
- Miss: young unmarried woman
- Master: skilled person / teacher

```python
# Mapping from the title strings found in names to the title categories
title_mapDict = {
    "Capt":         "Officer",
    "Col":          "Officer",
    "Major":        "Officer",
    "Jonkheer":     "Royalty",
    "Don":          "Royalty",
    "Sir":          "Royalty",
    "Dr":           "Officer",
    "Rev":          "Officer",
    "the Countess": "Royalty",
    "Dona":         "Royalty",
    "Mme":          "Mrs",
    "Mlle":         "Miss",
    "Ms":           "Mrs",
    "Mr":           "Mr",
    "Mrs":          "Mrs",
    "Miss":         "Miss",
    "Master":       "Master",
    "Lady":         "Royalty"
}

# map applies the mapping to every element of the Series
titleDf['Title'] = titleDf['Title'].map(title_mapDict)

# One-hot encode with get_dummies
titleDf = pd.get_dummies(titleDf['Title'])
titleDf.head()
```

```
   Master  Miss  Mr  Mrs  Officer  Royalty
0       0     0   1    0        0        0
1       0     0   0    1        0        0
2       0     1   0    0        0        0
3       0     0   0    1        0        0
4       0     0   1    0        0        0
```

```python
# Append the one-hot dummy variables to the full Titanic dataset
full = pd.concat([full, titleDf], axis=1)
# Drop the Name column
full.drop('Name', axis=1, inplace=True)
full.head()
```

```
    Age Cabin     Fare  Parch  PassengerId  Sex  SibSp  Survived            Ticket  Emabrked_C  ...  Pclass_3  Master  Miss  Mr  Mrs  Officer  Royalty
0  22.0     U   7.2500      0            1    1      1       0.0         A/5 21171           0  ...         1       0     0   1    0        0        0
1  38.0   C85  71.2833      0            2    0      1       1.0          PC 17599           1  ...         0       0     0   0    1        0        0
2  26.0     U   7.9250      0            3    0      0       1.0  STON/O2. 3101282           0  ...         1       0     1   0    0        0        0
3  35.0  C123  53.1000      0            4    0      1       1.0            113803           0  ...         0       0     0   0    1        0        0
4  35.0     U   8.0500      0            5    1      0       0.0            373450           0  ...         1       0     0   1    0        0        0
```

Extracting the cabin category from the cabin number:

```python
# Anonymous functions: Python uses lambda to create a function without the
# standard def statement:
#     lambda arg1, arg2: expression
# Example: a lambda that adds two numbers (named add here rather than sum,
# to avoid shadowing the built-in)
add = lambda a, b: a + b
print('Sum:', add(10, 20))
```

The first letter of the cabin number is the cabin category.

```python
# Look at the Cabin column
full['Cabin'].head()
```

```
0       U
1     C85
2       U
3    C123
4       U
Name: Cabin, dtype: object
```

```python
# DataFrame to hold the cabin features
cabinDf = pd.DataFrame()

# The cabin category is the first letter of the cabin number,
# e.g. C85 maps to category C
full['Cabin'] = full['Cabin'].map(lambda c: c[0])

# One-hot encode with get_dummies, prefixing the column names with 'Cabin'
cabinDf = pd.get_dummies(full['Cabin'], prefix='Cabin')
cabinDf.head()
```

```
   Cabin_A  Cabin_B  Cabin_C  Cabin_D  Cabin_E  Cabin_F  Cabin_G  Cabin_T  Cabin_U
0        0        0        0        0        0        0        0        0        1
1        0        0        1        0        0        0        0        0        0
2        0        0        0        0        0        0        0        0        1
3        0        0        1        0        0        0        0        0        0
4        0        0        0        0        0        0        0        0        1
```

```python
# Append the one-hot dummy variables to full
full = pd.concat([full, cabinDf], axis=1)
# Drop the Cabin column
full.drop('Cabin', axis=1, inplace=True)
full.head()
```

```
    Age     Fare  Parch  PassengerId  Sex  SibSp  Survived            Ticket  Emabrked_C  Emabrked_Q  ...  Cabin_C  Cabin_D  Cabin_E  Cabin_F  Cabin_G  Cabin_T  Cabin_U
0  22.0   7.2500      0            1    1      1       0.0         A/5 21171           0           0  ...        0        0        0        0        0        0        1
1  38.0  71.2833      0            2    0      1       1.0          PC 17599           1           0  ...        1        0        0        0        0        0        0
2  26.0   7.9250      0            3    0      0       1.0  STON/O2. 3101282           0           0  ...        0        0        0        0        0        0        1
3  35.0  53.1000      0            4    0      1       1.0            113803           0           0  ...        1        0        0        0        0        0        0
4  35.0   8.0500      0            5    1      0       0.0            373450           0           0  ...        0        0        0        0        0        0        1
```
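As a quick sanity check of the title-extraction logic above, the sketch below runs `getTitle` on a few hand-picked names (a small hypothetical sample, not the full dataset):

```python
import pandas as pd

# Same extraction logic as getTitle above: split on ',', then on '.',
# then strip the surrounding whitespace.
def getTitle(name):
    return name.split(',')[1].split('.')[0].strip()

sample = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Oliva y Ocana, Dona. Fermina',
    'Peter, Master. Michael J',
])
titles = sample.map(getTitle)
print(titles.tolist())  # ['Mr', 'Miss', 'Dona', 'Master']
```

Note that rarer titles such as "Dona" come out as-is; that is why the `title_mapDict` step afterwards is needed to fold them into the six categories.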
Creating family size and family category features:
### Creating family size and family category

```python
familyDf = pd.DataFrame()

# Family size = number of parents/children aboard (Parch)
#             + number of siblings/spouses aboard (SibSp)
#             + 1 (the passenger is also a member of their own family)
familyDf['FamilySize'] = full['Parch'] + full['SibSp'] + 1
```

Family categories:

- Family_Single (single): family size = 1
- Family_Small (small family): 2 <= family size <= 4
- Family_Large (large family): family size >= 5

```python
# The conditional expression returns 1 when the condition holds, else 0
familyDf['Family_Single'] = familyDf['FamilySize'].map(lambda s: 1 if s == 1 else 0)
familyDf['Family_Small']  = familyDf['FamilySize'].map(lambda s: 1 if 2 <= s <= 4 else 0)
familyDf['Family_Large']  = familyDf['FamilySize'].map(lambda s: 1 if s >= 5 else 0)
familyDf.head()
```

```
   FamilySize  Family_Single  Family_Small  Family_Large
0           2              0             1             0
1           2              0             1             0
2           1              1             0             0
3           2              0             1             0
4           1              1             0             0
```

```python
# Append the dummy variables to the full Titanic dataset
full = pd.concat([full, familyDf], axis=1)
full.head()
```

```
    Age     Fare  Parch  PassengerId  Sex  SibSp  Survived            Ticket  Emabrked_C  Emabrked_Q  ...  Cabin_T  Cabin_U  FamilySize  Family_Single  Family_Small  Family_Large
0  22.0   7.2500      0            1    1      1       0.0         A/5 21171           0           0  ...        0        1           2              0             1             0
1  38.0  71.2833      0            2    0      1       1.0          PC 17599           1           0  ...        0        0           2              0             1             0
2  26.0   7.9250      0            3    0      0       1.0  STON/O2. 3101282           0           0  ...        0        1           1              1             0             0
3  35.0  53.1000      0            4    0      1       1.0            113803           0           0  ...        0        0           2              0             1             0
4  35.0   8.0500      0            5    1      0       0.0            373450           0           0  ...        0        1           1              1             0             0
```

```python
# How many features do we have by now?
full.shape
```

```
(1309, 33)
```
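The bucketing rules above should put every passenger into exactly one family category. The sketch below checks that property on made-up Parch/SibSp values (not the real dataset):

```python
import pandas as pd

# Toy Parch/SibSp values chosen to hit all three buckets
df = pd.DataFrame({'Parch': [0, 2, 0, 5], 'SibSp': [0, 1, 3, 3]})
df['FamilySize'] = df['Parch'] + df['SibSp'] + 1   # 1, 4, 4, 9

# Same bucketing lambdas as in the tutorial
df['Family_Single'] = df['FamilySize'].map(lambda s: 1 if s == 1 else 0)
df['Family_Small']  = df['FamilySize'].map(lambda s: 1 if 2 <= s <= 4 else 0)
df['Family_Large']  = df['FamilySize'].map(lambda s: 1 if s >= 5 else 0)

# Each row should fall into exactly one category
one_hot_ok = (df[['Family_Single', 'Family_Small', 'Family_Large']]
              .sum(axis=1) == 1).all()
print(one_hot_ok)  # True
```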
3.3 Feature selection
Correlation coefficients: compute the correlation of each feature with the others.
```python
# Correlation matrix
corrDf = full.corr()
corrDf
```

```
                    Age     Parch  PassengerId       Sex     SibSp  Survived  ...  FamilySize  Family_Single  Family_Small  Family_Large
Age            1.000000 -0.130872     0.025731  0.057397 -0.190747 -0.070323  ...   -0.196996       0.116675     -0.038189     -0.161210
Parch         -0.130872  1.000000     0.008942 -0.213125  0.373587  0.081629  ...    0.792296      -0.549022      0.248532      0.624627
PassengerId    0.025731  0.008942     1.000000  0.013406 -0.055224 -0.005007  ...   -0.031437       0.028546      0.002975     -0.063415
Sex            0.057397 -0.213125     0.013406  1.000000 -0.109609 -0.543351  ...   -0.188583       0.284537     -0.255196     -0.077748
SibSp         -0.190747  0.373587    -0.055224 -0.109609  1.000000 -0.035322  ...    0.861952      -0.591077      0.253590      0.699681
Survived      -0.070323  0.081629    -0.005007 -0.543351 -0.035322  1.000000  ...    0.016639      -0.203367      0.279855     -0.125147
...                 ...       ...          ...       ...       ...       ...  ...         ...            ...           ...           ...
Family_Large  -0.161210  0.624627    -0.063415 -0.077748  0.699681 -0.125147  ...    0.801623      -0.318944     -0.183007      1.000000

[31 rows x 31 columns]
```

Look at each feature's correlation with the survival outcome (Survived); `ascending=False` sorts in descending order.

```python
corrDf['Survived'].sort_values(ascending=False)
```

```
Survived         1.000000
Mrs              0.344935
Miss             0.332795
Pclass_1         0.285904
Family_Small     0.279855
Cabin_B          0.175095
Emabrked_C       0.168240
Cabin_D          0.150716
Cabin_E          0.145321
Cabin_C          0.114652
Pclass_2         0.093349
Master           0.085221
Parch            0.081629
Cabin_F          0.057935
Royalty          0.033391
Cabin_A          0.022287
FamilySize       0.016639
Cabin_G          0.016040
Emabrked_Q       0.003650
PassengerId     -0.005007
Cabin_T         -0.026456
Officer         -0.031316
SibSp           -0.035322
Age             -0.070323
Family_Large    -0.125147
Emabrked_S      -0.149683
Family_Single   -0.203367
Cabin_U         -0.316912
Pclass_3        -0.322308
Sex             -0.543351
Mr              -0.549199
Name: Survived, dtype: float64
```
Based on the magnitude of each feature's correlation with the survival outcome (Survived), we choose these features as the model's input: title, passenger class, family size, ticket fare, cabin category, port of embarkation, and sex.
```python
# Feature selection
full_X = pd.concat([titleDf,        # title
                    pclassDf,       # passenger class
                    familyDf,       # family size
                    full['Fare'],   # ticket fare
                    cabinDf,        # cabin category
                    embarkedDf,     # port of embarkation
                    full['Sex']     # sex
                   ], axis=1)
full_X.head()
```

```
   Master  Miss  Mr  Mrs  Officer  Royalty  Pclass_1  Pclass_2  Pclass_3  FamilySize  ...  Cabin_U  Emabrked_C  Emabrked_Q  Emabrked_S  Sex
0       0     0   1    0        0        0         0         0         1           2  ...        1           0           0           1    1
1       0     0   0    1        0        0         1         0         0           2  ...        0           1           0           0    0
2       0     1   0    0        0        0         0         0         1           1  ...        1           0           0           1    0
3       0     0   0    1        0        0         1         0         0           2  ...        0           0           0           1    0
4       0     0   1    0        0        0         0         0         1           1  ...        1           0           0           1    1
```
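The selection above is done by eye. As a sketch (my addition, not from the original), the same idea can be made programmatic: keep every feature whose absolute correlation with Survived exceeds a threshold. The toy Series below stands in for a few entries of `corrDf['Survived']`:

```python
import pandas as pd

# A handful of the correlation values listed above, as a stand-in
# for the full corrDf['Survived'] Series
corr_with_survived = pd.Series({
    'Mrs': 0.344935, 'Miss': 0.332795, 'Pclass_1': 0.285904,
    'Family_Small': 0.279855, 'PassengerId': -0.005007,
    'Cabin_U': -0.316912, 'Sex': -0.543351, 'Mr': -0.549199,
})

# Keep features with |correlation| above an (arbitrarily chosen) threshold
threshold = 0.1
selected = corr_with_survived[corr_with_survived.abs() > threshold]
print(selected.index.tolist())
```

Note that a strong negative correlation (e.g. Sex, Mr) is just as informative as a strong positive one, which is why the filter uses the absolute value.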
4. Building the model
Use the training data and a machine learning algorithm to obtain a model, then use the test data to evaluate it.
4.1 Creating the training and test datasets
1) Kaggle's Titanic test dataset is what we ultimately submit; it contains no survival values, so it cannot be used to evaluate the model. We call the test data that the Kaggle Titanic project gives us the prediction dataset (pred, short for "predict"): the dataset whose survival outcomes our model will predict.

2) We use the training dataset from the Kaggle Titanic project as our source dataset (source), and split it into a training dataset (train, used for model training) and a test dataset (test, used for model evaluation).

```python
# The source dataset has 891 rows
sourceRow = 891
```

We know sourceRow = 891 from before we merged the datasets at the very beginning: the source dataset has 891 records in total. When extracting the first 891 rows from the feature set full_X, we subtract 1 because row labels start at 0.

```python
# Source dataset: features
source_X = full_X.loc[0:sourceRow-1, :]
# Source dataset: labels
source_y = full.loc[0:sourceRow-1, 'Survived']
# Prediction dataset: features
pred_X = full_X.loc[sourceRow:, :]
```

Make sure the source dataset really contains the first 891 rows of data; otherwise the model built later will be wrong.

```python
# How many rows in the source dataset
print('Rows in the source dataset:', source_X.shape[0])
# How many rows in the prediction dataset
print('Rows in the prediction dataset:', pred_X.shape[0])
```

Split the source dataset (source) into a training dataset (train, for model training) and a test dataset (test, for model evaluation). `train_test_split` is a commonly used function for this: it randomly splits the samples into train data and test data by a given proportion.

- first argument: the sample features to split
- second argument: the sample labels to split
- train_size: the proportion of samples used for training (or, if an integer, the number of samples)

```python
# In older scikit-learn versions this lived in sklearn.cross_validation;
# in current versions it is in sklearn.model_selection
from sklearn.model_selection import train_test_split

# Build the training and test datasets
train_X, test_X, train_y, test_y = train_test_split(source_X,
                                                    source_y,
                                                    train_size=0.8)

# Print the dataset sizes
print('Source features:', source_X.shape,
      'Train features:', train_X.shape,
      'Test features:', test_X.shape)
print('Source labels:', source_y.shape,
      'Train labels:', train_y.shape,
      'Test labels:', test_y.shape)
```

```python
# Look at the source labels
source_y.head()
```

```
0    0.0
1    1.0
2    1.0
3    1.0
4    0.0
Name: Survived, dtype: float64
```
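Two optional refinements to the split (my suggestions, not from the original): fixing `random_state` makes the split reproducible, and `stratify` keeps the survival ratio the same in train and test. The toy data below stands in for `source_X` / `source_y`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins: 100 samples, 60% label 0 and 40% label 1
source_X = pd.DataFrame({'f1': range(100)})
source_y = pd.Series([0] * 60 + [1] * 40)

train_X, test_X, train_y, test_y = train_test_split(
    source_X, source_y,
    train_size=0.8,
    random_state=42,      # reproducible split
    stratify=source_y)    # preserve the 60/40 label ratio in both parts

print(train_X.shape, test_X.shape)   # (80, 1) (20, 1)
print(train_y.mean(), test_y.mean()) # both 0.4
```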
4.2 Choosing a machine learning algorithm
Choose a machine learning algorithm to train the model with.
```python
# Step 1: import the algorithm
from sklearn.linear_model import LogisticRegression
# Step 2: create the model: logistic regression
model = LogisticRegression()
```
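A single train/test split can give a noisy accuracy estimate. As a sketch (my addition), cross-validation averages the score over several folds and gives a more stable picture; synthetic data stands in for the Titanic features here:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary-classification data as a stand-in for source_X / source_y
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5)  # 5-fold accuracy scores
print(scores.mean())
```

The same call could be used to compare several candidate algorithms before settling on one.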
4.3 Training the model
```python
# Step 3: train the model
model.fit(train_X, train_y)
```
5. Evaluating the model
```python
# For a classification problem, score returns the model's accuracy
model.score(test_X, test_y)
```
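Accuracy alone can hide which kinds of mistakes the model makes. As a sketch (my addition, not part of the original tutorial), a confusion matrix breaks the errors down by class; the toy labels below stand in for `test_y` and `model.predict(test_X)`:

```python
from sklearn.metrics import confusion_matrix

# Toy true labels and predictions
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
cm = confusion_matrix(y_true, y_pred)
print(cm)
```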
6. Deploying the solution
Predict results on the prediction dataset and save them to a CSV file.
```python
# Use the model to predict survival for the prediction dataset
pred_Y = model.predict(pred_X)

# The predictions come back as floats (0.0, 1.0), but Kaggle requires
# integers (0, 1), so convert the data type
pred_Y = pred_Y.astype(int)

# Passenger ids
passenger_id = full.loc[sourceRow:, 'PassengerId']
# DataFrame with the passenger id and the predicted survival value
predDf = pd.DataFrame(
    {'PassengerId': passenger_id,
     'Survived': pred_Y})
predDf.shape
predDf.head()

# Save the results
predDf.to_csv('titanic_pred.csv', index=False)
```
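Before uploading, it is worth sanity-checking the submission file format: exactly the two columns Kaggle expects, with integer 0/1 labels. This is a sketch (my addition), using a few toy rows in place of the real `predDf` and an in-memory buffer in place of the CSV on disk:

```python
import io
import pandas as pd

# Toy stand-in for the real predDf
predDf = pd.DataFrame({'PassengerId': [892, 893, 894],
                       'Survived': [0, 1, 0]})

# Round-trip through CSV, as the saved submission would be read by Kaggle
buf = io.StringIO()
predDf.to_csv(buf, index=False)
buf.seek(0)
check = pd.read_csv(buf)

print(list(check.columns))                 # ['PassengerId', 'Survived']
print(check['Survived'].isin([0, 1]).all())  # True
```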