What Do UFOs Look Like? Python Data Analysis Will Tell You

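Preface

Honestly, in all my years I have never seen what a UFO looks like. I happened to come across the detailed UFO sighting records kept by the US National UFO Reporting Center (NUFORC) and wondered what interesting information they contain, which led to this analysis.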

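The raw data holds a lot of information, and I have only done a fairly simple analysis here. Interested readers are welcome to dig into it too, and please share what you find.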

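This analysis covers the following questions:

  • What do UFOs look like?

  • Where are UFOs reported most often?

  • In which years are UFOs reported most often?

  • A heatmap showing which states and which years have the most UFO reports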

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

1 Data Preparation and Cleaning

df = pd.read_csv('nuforc_events.csv')

print(df.shape)  # number of rows and columns
print(df.head())

(110265, 13)
             Event_Time  Event_Date    Year  Month   Day  Hour  Minute
0  2017-04-20T14:15:00Z  2017-04-20  2017.0    4.0  20.0  14.0    15.0
1  2017-04-20T04:56:00Z  2017-04-20  2017.0    4.0  20.0   4.0    56.0
2  2017-04-19T23:55:00Z  2017-04-19  2017.0    4.0  19.0  23.0    55.0
3  2017-04-19T23:50:00Z  2017-04-19  2017.0    4.0  19.0  23.0    50.0
4  2017-04-19T23:29:00Z  2017-04-19  2017.0    4.0  19.0  23.0    29.0

         City  State     Shape     Duration
0     Palmyra     NJ     Other    5 minutes
1  Bridgeview     IL     Light   20 seconds
2      Newton     AL  Triangle    5 seconds
3      Newton     AL  Triangle  5-6 minutes
4      Denver     CO     Light       1 hour

                                             Summary
0    I observed an aircraft that seemed to look odd.
1   Bridgeview, IL, blue light. ((anonymous report))
2                               Silent triangle UFO.
3  My friend and I stepped outside hoping to catc...
4  Moved slow but made quick turns staying and ci...

                                           Event_URL
0  http://www.nuforc.org/webreports/133/S133726.html
1  http://www.nuforc.org/webreports/133/S133720.html
2  http://www.nuforc.org/webreports/133/S133724.html
3  http://www.nuforc.org/webreports/133/S133723.html
4  http://www.nuforc.org/webreports/133/S133721.html

  • Many records contain NaN values, so before analyzing the data we use dropna() to remove every row that contains a NaN

df_clean = df.dropna()
print(df_clean.shape)  # how many rows remain after dropping NaN
print(df_clean.head())

(95004, 13)
             Event_Time  Event_Date    Year  Month   Day  Hour  Minute
0  2017-04-20T14:15:00Z  2017-04-20  2017.0    4.0  20.0  14.0    15.0
1  2017-04-20T04:56:00Z  2017-04-20  2017.0    4.0  20.0   4.0    56.0
2  2017-04-19T23:55:00Z  2017-04-19  2017.0    4.0  19.0  23.0    55.0
3  2017-04-19T23:50:00Z  2017-04-19  2017.0    4.0  19.0  23.0    50.0
4  2017-04-19T23:29:00Z  2017-04-19  2017.0    4.0  19.0  23.0    29.0

         City  State     Shape     Duration
0     Palmyra     NJ     Other    5 minutes
1  Bridgeview     IL     Light   20 seconds
2      Newton     AL  Triangle    5 seconds
3      Newton     AL  Triangle  5-6 minutes
4      Denver     CO     Light       1 hour

                                             Summary
0    I observed an aircraft that seemed to look odd.
1   Bridgeview, IL, blue light. ((anonymous report))
2                               Silent triangle UFO.
3  My friend and I stepped outside hoping to catc...
4  Moved slow but made quick turns staying and ci...

                                           Event_URL
0  http://www.nuforc.org/webreports/133/S133726.html
1  http://www.nuforc.org/webreports/133/S133720.html
2  http://www.nuforc.org/webreports/133/S133724.html
3  http://www.nuforc.org/webreports/133/S133723.html
4  http://www.nuforc.org/webreports/133/S133721.html
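Dropping every row with any NaN removed about 15,000 of the 110,265 records. Before doing that, it can help to see which columns the missing values come from. The sketch below uses a small made-up frame (not the NUFORC data) to show isna().sum() and the subset= argument of dropna(); the column choice in subset= is only an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "Year": [2017.0, np.nan, 2016.0, 2015.0],
    "State": ["NJ", "IL", None, "CO"],
    "Shape": ["Other", "Light", "Triangle", None],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() removes every row that has at least one missing value ...
all_clean = toy.dropna()
# ... while subset= only checks the columns a given analysis needs.
year_state_clean = toy.dropna(subset=["Year", "State"])

print(len(toy), len(all_clean), len(year_state_clean))
```

With subset=, a row that only lacks a Shape would still count toward the per-state and per-year totals.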

  • There are very few records from before 1900, so we keep only the data from 1900 onwards:

df_clean = df_clean[df_clean['Year'] >= 1900]  # keep records from 1900 onwards

  • Check the data type of each imported column. The output shows that the "Event_Date" column is not a date type, so it needs to be converted.

  • pd.to_datetime() can be used for this

df_clean.dtypes

Event_Time     object
Event_Date     object
Year          float64
Month         float64
Day           float64
Hour          float64
Minute        float64
City           object
State          object
Shape          object
Duration       object
Summary        object
Event_URL      object
dtype: object

  • Use pd.to_datetime() to convert the str dates to a datetime type

pd.to_datetime(df_clean['Event_Date'])  # raises: the date 1061-12-31 cannot be represented
# OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1061-12-31 00:00:00
df_clean.dtypes  # the conversion failed, so Event_Date is still object

Event_Time     object
Event_Date     object
Year          float64
Month         float64
Day           float64
Hour          float64
Minute        float64
City           object
State          object
Shape          object
Duration       object
Summary        object
Event_URL      object
dtype: object
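One possible way around the failed conversion, not used in the rest of this post, is the errors="coerce" option of pd.to_datetime(), which turns values it cannot convert into NaT instead of raising. The sketch below uses a made-up sample. The string "unknown" is a hypothetical bad value added for illustration. On pandas versions that store timestamps at nanosecond resolution, 1061-12-31 should be coerced to NaT as well.

```python
import pandas as pd

# Hypothetical sample; "1061-12-31" is the date from the error message above,
# "unknown" is a made-up value that can never be parsed as a date.
dates = pd.Series(["2017-04-20", "1061-12-31", "unknown"])

# errors="coerce" replaces values that cannot be converted with NaT.
parsed = pd.to_datetime(dates, errors="coerce")
print(parsed)

# The original strings that failed to convert can then be inspected or dropped.
failed = dates[parsed.isna()]
print(failed)
```

Assigning the result back (for example to a new column) is what actually changes the dtype; calling pd.to_datetime() without keeping the return value leaves the DataFrame untouched.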

2 What Do UFOs Look Like?

  • Group the records by reported shape and count how many times each shape appears

s_shape = df_clean.groupby('Shape')['Event_Date'].count()
print(type(s_shape))
s_shape.sort_values(inplace=True)  # ascending by count
s_shape

<class 'pandas.core.series.Series'>

Shape
Changed          1
Hexagon          1
Pyramid          1
Flare            1
Round            2
Crescent         2
Delta            7
Cross          287
Cone           383
Egg            842
Teardrop       866
Chevron       1187
Diamond       1405
Cylinder      1495
Rectangle     1620
Flash         1717
Cigar         2313
Changing      2378
Formation     3070
Oval          4332
Disk          5841
Sphere        6482
Other         6658
Unknown       6887
Fireball      7785
Triangle      9358
Circle        9818
Light        20254
Name: Event_Date, dtype: int64

Removing special cases

  • Remove shapes that were reported 10 times or fewer

  • Remove the "Unknown" and "Other" types

s_shape_normal = s_shape[s_shape.values > 10]  # keep shapes with more than 10 reports
s_shape_normal

Shape
Cross          287
Cone           383
Egg            842
Teardrop       866
Chevron       1187
Diamond       1405
Cylinder      1495
Rectangle     1620
Flash         1717
Cigar         2313
Changing      2378
Formation     3070
Oval          4332
Disk          5841
Sphere        6482
Other         6658
Unknown       6887
Fireball      7785
Triangle      9358
Circle        9818
Light        20254
Name: Event_Date, dtype: int64

# drop the two catch-all categories
s_shape_normal = s_shape_normal[~s_shape_normal.index.isin(['Unknown', 'Other'])]
s_shape_normal

Shape
Cross          287
Cone           383
Egg            842
Teardrop       866
Chevron       1187
Diamond       1405
Cylinder      1495
Rectangle     1620
Flash         1717
Cigar         2313
Changing      2378
Formation     3070
Oval          4332
Disk          5841
Sphere        6482
Fireball      7785
Triangle      9358
Circle        9818
Light        20254
Name: Event_Date, dtype: int64
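The same counting and filtering can also be written with value_counts(), which counts and sorts in one step. This is only an alternative sketch on a made-up shape column; the names shapes, min_reports and named are hypothetical, and the threshold is lowered to suit the toy data.

```python
import pandas as pd

# Hypothetical shape column (made-up rows, not the real data).
shapes = pd.Series(
    ["Light"] * 5 + ["Circle"] * 3 + ["Unknown"] * 2 + ["Other"] * 2 + ["Hexagon"]
)

counts = shapes.value_counts()      # one count per shape, largest first
min_reports = 2                     # hypothetical threshold for this toy data
frequent = counts[counts >= min_reports]

# Drop the catch-all categories; errors="ignore" tolerates a missing label.
named = frequent.drop(["Unknown", "Other"], errors="ignore")
print(named)
```

Note that value_counts() sorts in descending order, whereas the sort_values() call used in this post sorts ascending, so the slice order in a pie chart would be reversed.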

from matplotlib import font_manager as fm
from matplotlib import cm

labels = s_shape_normal.index
sizes = s_shape_normal.values

# "explode": pull the first and the last slice out of the pie
explode = [0.1] + [0] * (len(sizes) - 2) + [0.1]

fig, axes = plt.subplots(figsize=(10, 5), ncols=2)  # size of the plotting area
ax1, ax2 = axes.ravel()

colors = cm.rainbow(np.arange(len(sizes)) / len(sizes))  # colormaps: Paired, autumn, rainbow, gray, spring, Darks
patches, texts, autotexts = ax1.pie(sizes, labels=labels, autopct='%1.0f%%', explode=explode,
                                    shadow=False, startangle=150, colors=colors,
                                    labeldistance=1.2, pctdistance=1.05, radius=0.95)
# labeldistance: position of the labels
# pctdistance: position of the percentages
# radius: radius of the pie (explode controls how far a slice sticks out)

ax1.axis('equal')

# reset the font size
proptease = fm.FontProperties()
proptease.set_size('xx-small')
# font sizes include: 'xx-small', 'x-small', 'small', 'medium', 'large', 'x-large', 'xx-large' or a number, e.g. 12
plt.setp(autotexts, fontproperties=proptease)
plt.setp(texts, fontproperties=proptease)

ax1.set_title('Shapes', loc='center')

# ax2 only shows the legend
ax2.axis('off')
ax2.legend(patches, labels, loc='center left', fontsize=9)

# plt.tight_layout()
# plt.savefig("pie_shape_ufo.png", bbox_inches='tight')
plt.savefig('ufo_shapes.jpg')
plt.show()

The result is a pie chart of the share of each shape (saved as ufo_shapes.jpg).

3 In Which US States Are UFOs Reported Most Often?

Group by "State" and count the number of UFO reports in each state

s_state = df_clean.groupby('State')['Event_Date'].count()
print(type(s_state))
s_state.head()

<class 'pandas.core.series.Series'>

State
AB     438
AK     472
AL     930
AR     791
AZ    3488
Name: Event_Date, dtype: int64

Visualize the result as follows:

fig, ax1 = plt.subplots(figsize=(12, 8))

width = 0.5
state = s_state.index
x_pos1 = np.arange(len(state))
y1 = s_state.values
ax1.bar(x_pos1, y1, color='#4F81BD', align='center', width=width, label='Amounts', linewidth=0)
ax1.set_title('Amount of reporting UFO events by State')
ax1.set_xlim(-1, len(state))
ax1.set_xticks(x_pos1)
ax1.set_xticklabels(state, rotation=-90)
ax1.set_ylabel('Amount')

fig.savefig('ufo_state.jpg')
plt.show()

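The result is a bar chart of the number of reports per state (saved as ufo_state.jpg).

The chart shows that California (CA) has clearly more UFO reports in total than anywhere else. Could it be that UFOs have a soft spot for Californians?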
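To back the visual impression with numbers, the per-state Series can be ranked directly. The sketch below uses made-up counts (s_state_toy is a hypothetical stand-in for s_state) and shows nlargest() together with each state's share of all reports.

```python
import pandas as pd

# Hypothetical per-state counts (made-up numbers, not the real totals).
s_state_toy = pd.Series({"CA": 50, "WA": 20, "FL": 18, "TX": 10, "AK": 2})

top3 = s_state_toy.nlargest(3)        # three largest counts, in descending order
share = top3 / s_state_toy.sum()      # fraction of all reports per state

print(top3)
print(share.round(2))
```

Applied to the real s_state, the same two lines would list the leading states and how much of the total each one accounts for.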


4 In Which Years Are UFOs Reported Most Often?

Group by "Year" and count the number of UFO reports in each year

# df_clean['Year'].astype(int)
s_year = df_clean.groupby(df_clean['Year'].astype(int))['Event_Date'].count()
print(type(s_year))
s_year.head()

<class 'pandas.core.series.Series'>

Year
1905    1
1910    2
1920    1
1925    1
1929    1
Name: Event_Date, dtype: int64

Visualize the result as follows:

fig, ax = plt.subplots(figsize=(12, 20))
# fig, ax1 = plt.subplots(figsize=(12,8))
# fig, axes = plt.subplots(nrows=2, figsize=(12,8))
# fig, axes = plt.subplots(ncols=2, figsize=(18,4))

year = s_year.index
y_pos = np.arange(len(year))
x_value = s_year.values
ax.barh(y_pos, x_value, color='#4F81BD', align='center', label='Amounts', linewidth=0)
ax.set_title('Amount of reporting UFO events by Year')
ax.set_ylim(-0.5, len(year) - 0.5)
ax.set_yticks(y_pos)
ax.set_yticklabels(year, rotation=0, fontsize=6)
ax.set_xlabel('Amount')

plt.savefig('ufo_year.jpg')
plt.show()

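The result is a horizontal bar chart of the number of reports per year (saved as ufo_year.jpg).

The chart shows that the number of UFO reports is highest in recent years.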
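With one bar per year the chart gets very tall. A coarser view, shown here only as a sketch on made-up numbers, is to sum the yearly counts by decade; s_year_toy is a hypothetical stand-in for s_year.

```python
import pandas as pd

# Hypothetical per-year counts (made-up numbers).
s_year_toy = pd.Series({1995: 3, 1999: 7, 2003: 20, 2008: 30, 2012: 40})

# Integer division maps each year to the first year of its decade.
decade = (s_year_toy.index // 10) * 10
per_decade = s_year_toy.groupby(decade).sum()
print(per_decade)
```

The resulting Series can be passed to the same barh() call as before, with far fewer tick labels to fit.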


5 UFO Events Since 1997

  • The analysis above shows that relatively few UFO sightings were reported before 1997, so the rest of the analysis focuses on 1997 onwards

df_97 = df_clean[df_clean['Year'] >= 1997].copy()  # copy to avoid SettingWithCopyWarning
df_97['Year'] = df_97['Year'].astype(int)
# df_97.astype({'Year': int})
print(df_97.shape)
print(df_97.head())

(86041, 13)
             Event_Time  Event_Date  Year  Month   Day  Hour  Minute
0  2017-04-20T14:15:00Z  2017-04-20  2017    4.0  20.0  14.0    15.0
1  2017-04-20T04:56:00Z  2017-04-20  2017    4.0  20.0   4.0    56.0
2  2017-04-19T23:55:00Z  2017-04-19  2017    4.0  19.0  23.0    55.0
3  2017-04-19T23:50:00Z  2017-04-19  2017    4.0  19.0  23.0    50.0
4  2017-04-19T23:29:00Z  2017-04-19  2017    4.0  19.0  23.0    29.0

         City  State     Shape     Duration
0     Palmyra     NJ     Other    5 minutes
1  Bridgeview     IL     Light   20 seconds
2      Newton     AL  Triangle    5 seconds
3      Newton     AL  Triangle  5-6 minutes
4      Denver     CO     Light       1 hour

                                             Summary
0    I observed an aircraft that seemed to look odd.
1   Bridgeview, IL, blue light. ((anonymous report))
2                               Silent triangle UFO.
3  My friend and I stepped outside hoping to catc...
4  Moved slow but made quick turns staying and ci...

                                           Event_URL
0  http://www.nuforc.org/webreports/133/S133726.html
1  http://www.nuforc.org/webreports/133/S133720.html
2  http://www.nuforc.org/webreports/133/S133724.html
3  http://www.nuforc.org/webreports/133/S133723.html
4  http://www.nuforc.org/webreports/133/S133721.html

Group the data by "Year" and "State" as follows:

df_amount_year = df_97.groupby(['Year', 'State'])['Event_Date'].size().reset_index()
df_amount_year.columns = ['Year', 'State', 'Amount']
print(df_amount_year.head())

   Year State  Amount
0  1997    AB       6
1  1997    AK       5
2  1997    AL       8
3  1997    AR      10
4  1997    AZ     127

import seaborn as sns

# reshape to a State x Year table for the heatmap
df_pivot = df_amount_year.pivot_table(index='State', columns='Year', values='Amount')

f, ax = plt.subplots(figsize=(10, 15))
cmap = sns.cubehelix_palette(start=1, rot=3, gamma=0.8, as_cmap=True)
sns.heatmap(df_pivot, cmap=cmap, linewidths=0.05, ax=ax)

ax.set_title('Amounts per State and Year since Year 1997')
ax.set_xlabel('Year')
ax.set_ylabel('State')
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
f.savefig('ufo_per_year_state.jpg')

  • In the heatmap, the darker a cell is, the more UFO reports there were for that state and year.

The raw data comes from the US National UFO Reporting Center. The data link is:

  • Data source: Home | data.world

For more content, follow the WeChat official account 「Python數據之道」.
