Kaggle Case Study: The 2016 US Presidential Election
Project URL: https://www.kaggle.com/fivethirtyeight/2016-election-polls
Dataset Information
This dataset is a collection of state and national polls conducted from November 2015-November 2016 on the 2016 presidential election. Data on the raw and weighted poll results by state, date, pollster, and pollster ratings are included.
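We use the 2016 US presidential election dataset for a simple analysis of the poll results for Trump and Hillary Clinton. The goal is to get familiar with processing data in Python: extracting columns, converting data formats, and visualizing the results.
The figure below shows an excerpt of the data:
First, we extract the columns enddate, rawpoll_clinton, rawpoll_trump, adjpoll_clinton, and adjpoll_trump. These hold, respectively, the poll end date, Clinton's raw poll result, Trump's raw poll result, Clinton's adjusted poll result, and Trump's adjusted poll result. We then plot the results.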
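The column-selection step can be sketched in isolation as follows. This is a minimal illustration using only the standard library, not the article's NumPy-based approach: the sample text, its values, and the helper name select_columns are made up for the example, and only the five column names are taken from the dataset.

```python
import csv
import io

# Made-up sample standing in for presidential_polls.csv (values are illustrative only).
SAMPLE_CSV = (
    "state,enddate,rawpoll_clinton,rawpoll_trump,adjpoll_clinton,adjpoll_trump\n"
    "U.S.,11/6/2016,47.0,43.0,45.2,41.7\n"
    "Ohio,10/30/2016,44.0,46.0,43.9,45.1\n"
)

USE_COLS = ["enddate", "rawpoll_clinton", "rawpoll_trump",
            "adjpoll_clinton", "adjpoll_trump"]


def select_columns(csv_text, wanted):
    """Return (indices, rows), where each row holds only the wanted columns."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)                              # first line = column names
    indices = [header.index(name) for name in wanted]  # position of each wanted column
    rows = [[row[i] for i in indices] for row in reader]
    return indices, rows


indices, rows = select_columns(SAMPLE_CSV, USE_COLS)
print(indices)
print(rows[0])
```

Looking the indices up from the header, rather than hard-coding them, keeps the selection correct even if the column order in the file changes.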
The complete implementation is as follows:
# -*- coding: utf-8 -*-
"""
Created on Tue Oct 3 10:18:26 2017
"""
# Import the required libraries
import numpy as np
import datetime
import matplotlib.pyplot as plt

# Path to the data file
filename = './presidential_polls.csv'

#### Column-name preprocessing ####
# Read the column names, i.e. the first line of the file
with open(filename, 'r') as f:
    col_names_str = f.readline()[:-1]  # [:-1] drops the trailing newline

# Split the string into a list of column names
col_name_lst = col_names_str.split(',')

# Columns we want to use
use_col_name_lst = ['enddate', 'rawpoll_clinton', 'rawpoll_trump', 'adjpoll_clinton', 'adjpoll_trump']

# Look up the index of each of those columns
use_col_index_lst = [col_name_lst.index(use_col_name) for use_col_name in use_col_name_lst]

# Read the data
data_array = np.loadtxt(filename,                   # file name
                        delimiter=',',              # field separator
                        skiprows=1,                 # skip the first line (the column names)
                        dtype=str,                  # read everything as str; dates are converted later
                        usecols=use_col_index_lst)  # read only the selected columns
# print(data_array, data_array.shape)  # inspect the data

#### Data processing ####
# Handle the date column
enddate_idx = use_col_name_lst.index('enddate')
# Slice out the date column and convert the array to a list for easier handling
enddate_lst = data_array[:, enddate_idx].tolist()

# Normalize the date strings to the form 'mm/dd/yyyy'.
# Older NumPy versions may leave a b'...' wrapper around each string; strip it if present.
enddate_lst = [enddate.replace('-', '/') for enddate in enddate_lst]
enddate_lst = [enddate.replace("b'", '') for enddate in enddate_lst]
enddate_lst = [enddate.replace("'", '') for enddate in enddate_lst]

# Convert the date strings to datetime objects
date_lst = [datetime.datetime.strptime(enddate, '%m/%d/%Y') for enddate in enddate_lst]

# Build the list of 'year-month' labels
month_lst = ['%d-%02d' % (date_obj.year, date_obj.month) for date_obj in date_lst]
# print(month_lst)
month_array = np.array(month_lst)
months = np.unique(month_array)
# print(months)

#### Data analysis ####
# Aggregate the poll numbers
# Clinton
# Raw data (rawpoll)
rawpoll_clinton_idx = use_col_name_lst.index('rawpoll_clinton')
rawpoll_clinton_data = data_array[:, rawpoll_clinton_idx]
rawpoll_clinton_data = [rawpoll_clinton.replace("b'", '') for rawpoll_clinton in rawpoll_clinton_data]
rawpoll_clinton_data = [rawpoll_clinton.replace("'", '') for rawpoll_clinton in rawpoll_clinton_data]
rawpoll_clinton_data = np.array(rawpoll_clinton_data)

# Adjusted data (adjpoll)
adjpoll_clinton_idx = use_col_name_lst.index('adjpoll_clinton')
adjpoll_clinton_data = data_array[:, adjpoll_clinton_idx]
adjpoll_clinton_data = [adjpoll_clinton.replace("b'", '') for adjpoll_clinton in adjpoll_clinton_data]
adjpoll_clinton_data = [adjpoll_clinton.replace("'", '') for adjpoll_clinton in adjpoll_clinton_data]
adjpoll_clinton_data = np.array(adjpoll_clinton_data)

# Trump
# Raw data (rawpoll)
rawpoll_trump_idx = use_col_name_lst.index('rawpoll_trump')
rawpoll_trump_data = data_array[:, rawpoll_trump_idx]
rawpoll_trump_data = [rawpoll_trump.replace("b'", '') for rawpoll_trump in rawpoll_trump_data]
rawpoll_trump_data = [rawpoll_trump.replace("'", '') for rawpoll_trump in rawpoll_trump_data]
rawpoll_trump_data = np.array(rawpoll_trump_data)

# Adjusted data (adjpoll)
adjpoll_trump_idx = use_col_name_lst.index('adjpoll_trump')
adjpoll_trump_data = data_array[:, adjpoll_trump_idx]
adjpoll_trump_data = [adjpoll_trump.replace("b'", '') for adjpoll_trump in adjpoll_trump_data]
adjpoll_trump_data = [adjpoll_trump.replace("'", '') for adjpoll_trump in adjpoll_trump_data]
adjpoll_trump_data = np.array(adjpoll_trump_data)

# Collected results
results = []


def is_convert_float(s):
    """
    Check whether a string can be converted to float
    """
    try:
        float(s)
    except ValueError:
        return False
    return True


def get_sum(str_array):
    """
    Return the sum of the numeric entries in an array of strings
    """
    # Drop entries that cannot be converted to a number
    cleaned_data = list(filter(is_convert_float, str_array))
    # Convert the data type (np.float was removed from recent NumPy; use the builtin float)
    float_array = np.array(cleaned_data, dtype=float)
    return np.sum(float_array)


for month in months:
    # Clinton
    # Raw data (rawpoll)
    rawpoll_clinton_month_data = rawpoll_clinton_data[month_array == month]
    # Sum of the month's raw poll values
    rawpoll_clinton_month_sum = get_sum(rawpoll_clinton_month_data)

    # Adjusted data (adjpoll)
    adjpoll_clinton_month_data = adjpoll_clinton_data[month_array == month]
    # Sum of the month's adjusted poll values
    adjpoll_clinton_month_sum = get_sum(adjpoll_clinton_month_data)

    # Trump
    # Raw data (rawpoll)
    rawpoll_trump_month_data = rawpoll_trump_data[month_array == month]
    # Sum of the month's raw poll values
    rawpoll_trump_month_sum = get_sum(rawpoll_trump_month_data)

    # Adjusted data (adjpoll)
    adjpoll_trump_month_data = adjpoll_trump_data[month_array == month]
    # Sum of the month's adjusted poll values
    adjpoll_trump_month_sum = get_sum(adjpoll_trump_month_data)

    results.append((month, rawpoll_clinton_month_sum, adjpoll_clinton_month_sum,
                    rawpoll_trump_month_sum, adjpoll_trump_month_sum))
# print(results)

months, raw_clinton_sum, adj_clinton_sum, raw_trump_sum, adj_trump_sum = zip(*results)

#### Visualize the results ####
plt.subplots(2, 2, figsize=(15, 10))

# Trend of the raw data (red = Clinton, green = Trump)
plt.subplot(221)
plt.plot(raw_clinton_sum, color='r')
plt.plot(raw_trump_sum, color='g')

plt.subplot(222)
width = 0.3
x = np.arange(len(months))
plt.bar(x, raw_clinton_sum, width, color='r')
plt.bar(x + width, raw_trump_sum, width, color='g')

# Trend of the adjusted data
plt.subplot(223)
plt.plot(adj_clinton_sum, color='r')
plt.plot(adj_trump_sum, color='g')

plt.subplot(224)
width = 0.3
x = np.arange(len(months))
plt.bar(x, adj_clinton_sum, width, color='r')
plt.bar(x + width, adj_trump_sum, width, color='g')

plt.subplots_adjust(wspace=0.2)
plt.show()
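The month-by-month aggregation is the core of the listing, so it may help to see it on a tiny example. The sketch below uses plain Python instead of NumPy boolean masks so the logic is easy to follow; the sample records and the helper name monthly_sums are made up for illustration, and the empty string stands in for a missing value.

```python
import datetime
from collections import defaultdict

# Made-up (enddate, rawpoll_clinton) pairs; "" stands for a missing value.
SAMPLE = [
    ("10/30/2016", "44.0"),
    ("10/31/2016", "46.0"),
    ("11/6/2016", "47.0"),
    ("11/7/2016", ""),
]


def monthly_sums(records):
    """Sum the numeric values per 'YYYY-MM' key, skipping non-numeric entries."""
    totals = defaultdict(float)
    for date_str, value_str in records:
        date_obj = datetime.datetime.strptime(date_str, "%m/%d/%Y")
        key = "%d-%02d" % (date_obj.year, date_obj.month)
        try:
            totals[key] += float(value_str)
        except ValueError:
            # Unusable entry: keep the month in the result but add nothing.
            totals[key] += 0.0
    return dict(sorted(totals.items()))


print(monthly_sums(SAMPLE))  # {'2016-10': 90.0, '2016-11': 47.0}
```

Note that these are sums of poll values, so a month with more polls yields a larger total. Dividing each sum by the number of valid entries would give a monthly average instead.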
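The resulting figure is shown below:
The charts show that, over the course of the campaign, Clinton's and Trump's poll results were roughly level with each other.
The code above is fairly general and illustrates the typical steps of a data-processing workflow.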