Exploratory Data Analysis (EDA) with Python

A while back I shared an English-language article on Python-based EDA. Since it's the weekend, I took some time to read it carefully and jot down the useful points.

Because this piece is fairly long, I'll split it across several posts. Also, not all of the code appears in the body — if I pasted it all in, the article would get super long. If you're interested in the code, reply with the keyword eda in the backend of my WeChat public account (SAMshare) to get the full version.


0. EDA Definition

Exploratory Data Analysis (EDA) is an approach to analyzing existing data — especially raw data from surveys or observations — under as few prior assumptions as possible, exploring the structure and patterns in the data through plotting, tabulation, curve fitting, computing summary statistics, and similar means.

— Baidu Baike
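The definition above boils down to a few concrete pandas operations. A minimal sketch, using a made-up toy DataFrame (not the Yelp data):

```python
import pandas as pd

# toy data standing in for any raw survey/observation table
df = pd.DataFrame({
    "city": ["Las Vegas", "Phoenix", "Las Vegas", "Edinburgh"],
    "stars": [4.0, 3.5, 5.0, 4.0],
})

# tabulation: frequency of each value
print(df["city"].value_counts())

# characteristic statistics: count, mean, std, quartiles
print(df["stars"].describe())
```

The rest of this article is essentially these two moves — tabulate and summarize — applied column by column, plus plots.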

1. Dataset Introduction

This dataset is a subset of Yelp's businesses, reviews, and user data. It was originally put together for the Yelp Dataset Challenge, which is a chance for students to conduct research or analysis on Yelp's data and share their discoveries. In the dataset you'll find information about businesses across 11 metropolitan areas in four countries.

In short, it's a collection of Yelp business, review, and user data, covering businesses in 11 metropolitan areas across 4 countries.

(There are 7 datasets in total; 2 of them are over 2 GB each, which took ages to download. I've uploaded the datasets to the backend of my WeChat public account (SAMshare) — if you're interested, reply eda there to get them.)

Dataset overview — downloadable via the WeChat public account backend (SAMshare, keyword: eda).

2. Environment and Libraries

I'm using Python 3 with Anaconda. Although Anaconda pre-installs many libraries, this analysis needs a few that slip through the net, so you'll have to install them yourself.

As for which libraries you're missing — Python will tell you as soon as you try to import them.

Here I'll share how to install the basemap library (because it's notoriously hard to install).

Other libraries can be installed with a simple pip install <library name>, which is super convenient, but basemap is a pain — the installation is fairly convoluted and took me a while to figure out. Tutorials:

1) How to install Python's Basemap on Mac

2) Installing basemap for Python on Windows
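Rather than waiting for an ImportError, you can also check up front which packages are missing. A small sketch — the package list below is simply what this article imports later; adjust it to your needs (note basemap installs under `mpl_toolkits.basemap`, which a plain top-level check won't see):

```python
import importlib.util

# packages this article imports later
required = ["numpy", "pandas", "matplotlib", "seaborn",
            "plotly", "folium", "imageio", "networkx"]

# find_spec returns None when a top-level package is not installed
missing = [pkg for pkg in required if importlib.util.find_spec(pkg) is None]

if missing:
    print("pip install " + " ".join(missing))
else:
    print("All packages present.")
```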

3. Import the Libraries

# package imports
# basics
import numpy as np
import pandas as pd

# misc
import gc
import time
import warnings

# viz
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.gridspec as gridspec

# graph viz
import plotly.offline as pyo
import plotly.graph_objs as go

# map section
import imageio
import folium
import folium.plugins as plugins
from mpl_toolkits.basemap import Basemap

# graph section
import networkx as nx
import heapq  # for getting top n number of things from list, dict

# settings
start_time = time.time()
color = sns.color_palette()
sns.set_style("dark")
warnings.filterwarnings("ignore")
pyo.init_notebook_mode()
%matplotlib inline

4. Import All Datasets

# importing every dataset
business = pd.read_csv("/Users/yongsenlin/Desktop/yelp_business.csv")
business_attributes = pd.read_csv("/Users/yongsenlin/Desktop/yelp_business_attributes.csv")
business_hours = pd.read_csv("/Users/yongsenlin/Desktop/yelp_business_hours.csv")
check_in = pd.read_csv("/Users/yongsenlin/Desktop/yelp_checkin.csv")
reviews = pd.read_csv("/Users/yongsenlin/Desktop/yelp_review.csv")
tip = pd.read_csv("/Users/yongsenlin/Desktop/yelp_tip.csv")
user = pd.read_csv("/Users/yongsenlin/Desktop/yelp_user.csv")
end_time = time.time()
print("Took", end_time - start_time, "s")

5. A First Look at the Data

# run these one at a time to get a feel for each table
business.head()
business_attributes.head()
business_hours.head()
check_in.head()
reviews.head()
tip.head()
user.head()

The business table (the other tables are omitted here).
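Beyond `.head()`, a couple of other pandas calls give a fast structural overview of a table. A sketch on a stand-in frame (the real `business` table has far more rows and columns):

```python
import pandas as pd

business = pd.DataFrame({  # stand-in for the real yelp_business.csv
    "business_id": ["a1", "b2", "c3"],
    "city": ["Las Vegas", "Phoenix", "Stuttgart"],
    "stars": [4.5, 3.0, 4.0],
    "review_count": [120, 30, 55],
})

print(business.shape)        # (rows, columns)
business.info()              # dtypes and non-null counts per column
print(business.describe())   # summary stats for the numeric columns
```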

6. Exploring Distributions

1) Distribution of business star ratings

# Get the distribution of the ratings
x = business['stars'].value_counts()
x = x.sort_index()

# plot
plt.figure(figsize=(8,4))
ax = sns.barplot(x.index, x.values, alpha=0.8)
plt.title("Star Rating Distribution ——by Sam")
plt.ylabel('# of businesses', fontsize=12)
plt.xlabel('Star Ratings——by Sam', fontsize=12)

# adding the text labels
rects = ax.patches
labels = x.values
for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width()/2, height + 5, label, ha='center', va='bottom')
plt.show()

Distribution of customers' ratings of businesses

We can see how customer ratings are distributed: most cluster around 4.0 stars, and the distribution clearly leans toward the higher end — overall, customer satisfaction with businesses looks moderately high.
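The lean of the distribution can be checked numerically rather than eyeballed: `pandas.Series.skew` returns a signed skewness, and a negative value means the tail stretches toward the low ratings while the mass sits at the high end. A sketch on toy ratings shaped like the chart (not the real data):

```python
import pandas as pd

# toy star ratings resembling the chart: mass near 4.0, thin tail toward 1.0
stars = pd.Series([1.0, 2.0, 3.0, 3.5, 3.5, 4.0, 4.0, 4.0, 4.5, 4.5, 5.0])

print(stars.skew())  # negative: long tail on the low side, mass on the high side
```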

2) Business categories

business_cats = ';'.join(business['categories'])
cats = pd.DataFrame(business_cats.split(';'), columns=['category'])
x = cats.category.value_counts()
print("There are ", len(x), " different types/categories of Businesses in Yelp!")

# prep for chart
x = x.sort_values(ascending=False)
x = x.iloc[0:20]

# chart
plt.figure(figsize=(16,4))
ax = sns.barplot(x.index, x.values, alpha=0.8)  # ,color=color[5])
plt.title("What are the top categories? ——by Sam", fontsize=25)
locs, labels = plt.xticks()
plt.setp(labels, rotation=80)
plt.ylabel('# businesses', fontsize=12)
plt.xlabel('Category', fontsize=12)

# adding the text labels
rects = ax.patches
labels = x.values
for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width()/2, height + 5, label, ha='center', va='bottom')
plt.show()

Distribution of business categories

The chart shows that businesses are concentrated in categories such as restaurants, shopping, food, and home services. Nightlife also ranks quite high — it seems folks abroad really know how to have fun.
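The join-then-split trick flattens every business's semicolon-delimited `categories` string into one long list. An equivalent, arguably cleaner route is pandas' `str.split` plus `explode`, sketched here on toy data:

```python
import pandas as pd

business = pd.DataFrame({  # stand-in for the real table
    "categories": ["Restaurants;Nightlife", "Shopping", "Restaurants;Food"],
})

# one row per (business, category) pair
cats = business["categories"].str.split(";").explode()
print(cats.value_counts())  # Restaurants appears twice, the rest once
```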

7. Exploring the Geographic Dimension

Here we explore the data geographically, trying to find patterns in where these businesses are located.

  1. First, look at the global geographic distribution of all businesses in the Yelp dataset;
  2. Then zoom in on the two regions where businesses are most concentrated (North America and Europe);
  3. Explore the basics of the cities with the most reviews;
  4. Visualize those cities.

1) The global view

# basic basemap of the world
plt.figure(1, figsize=(15,6))

# use ortho projection for the globe type version
m1 = Basemap(projection='ortho', lat_0=20, lon_0=-50)

# hex codes from google maps color palette: http://www.color-hex.com/color-palette/9261
# add continents
m1.fillcontinents(color='#bbdaa4', lake_color='#4a80f5')
# add the oceans
m1.drawmapboundary(fill_color='#4a80f5')
# draw the boundaries of the countries
m1.drawcountries(linewidth=0.1, color="black")

# add the scatter points to indicate the locations of the businesses
mxy = m1(business["longitude"].tolist(), business["latitude"].tolist())
m1.scatter(mxy[0], mxy[1], s=3, c="orange", lw=3, alpha=1, zorder=5)
plt.title("World-wide Yelp Reviews")
plt.show()

We can see that businesses are concentrated mainly in North America and Europe, so let's zoom in on those two regions.

2) North America and Europe

# Sample it down to only the North America region
lon_min, lon_max = -132.714844, -59.589844
lat_min, lat_max = 13.976715, 56.395664

# create the selector
idx_NA = (business["longitude"] > lon_min) & (business["longitude"] < lon_max) & \
         (business["latitude"] > lat_min) & (business["latitude"] < lat_max)
# apply the selector to subset
NA_business = business[idx_NA]

# initiate the figure
plt.figure(figsize=(12,6))
m2 = Basemap(projection='merc', llcrnrlat=lat_min, urcrnrlat=lat_max,
             llcrnrlon=lon_min, urcrnrlon=lon_max, lat_ts=35, resolution='i')
m2.fillcontinents(color='#191919', lake_color='#000000')  # dark grey land, black lakes
m2.drawmapboundary(fill_color='#000000')                  # black background
m2.drawcountries(linewidth=0.1, color="w")                # thin white country borders

# plot the data
mxy = m2(NA_business["longitude"].tolist(), NA_business["latitude"].tolist())
m2.scatter(mxy[0], mxy[1], s=5, c="#1292db", lw=0, alpha=0.05, zorder=5)
plt.title("North America Region")

# Sample it down to only the Eurozone + Britain :p
lon_min, lon_max = -8.613281, 16.699219
lat_min, lat_max = 40.488737, 59.204064

# create the selector
idx_euro = (business["longitude"] > lon_min) & (business["longitude"] < lon_max) & \
           (business["latitude"] > lat_min) & (business["latitude"] < lat_max)
# apply the selector to subset
euro_business = business[idx_euro]

# initiate the figure
plt.figure(figsize=(12,6))
m3 = Basemap(projection='merc', llcrnrlat=lat_min, urcrnrlat=lat_max,
             llcrnrlon=lon_min, urcrnrlon=lon_max, lat_ts=35, resolution='i')
m3.fillcontinents(color='#191919', lake_color='#000000')  # dark grey land, black lakes
m3.drawmapboundary(fill_color='#000000')                  # black background
m3.drawcountries(linewidth=0.1, color="w")                # thin white country borders

# plot the data
mxy = m3(euro_business["longitude"].tolist(), euro_business["latitude"].tolist())
m3.scatter(mxy[0], mxy[1], s=5, c="#1292db", lw=0, alpha=0.05, zorder=5)
plt.title("Europe Region")
plt.show()

Business distribution across North America

Business distribution across Europe

As we can see, the businesses cluster in a handful of big cities; most other areas have no business data (or perhaps simply no businesses at all). In any case, the goal of EDA is to build a rough picture of the data, and this does the job.
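The regional subsetting used above is just a combined boolean mask over longitude and latitude. A self-contained sketch on toy coordinates:

```python
import pandas as pd

business = pd.DataFrame({  # toy coordinates
    "name": ["in_box", "too_far_east", "too_far_south"],
    "longitude": [-100.0, 10.0, -100.0],
    "latitude": [40.0, 40.0, -10.0],
})

lon_min, lon_max = -132.7, -59.6   # rough North America window
lat_min, lat_max = 14.0, 56.4

# combine the four comparisons into one mask with &
idx = ((business["longitude"] > lon_min) & (business["longitude"] < lon_max) &
       (business["latitude"] > lat_min) & (business["latitude"] < lat_max))
print(business[idx]["name"].tolist())  # ['in_box']
```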

3) The city view

Now let's switch to the city level and take a rough look at the cities with the most reviews (i.e., the most attention).

# Get the distribution of reviews by city
x = business['city'].value_counts()
x = x.sort_values(ascending=False)
x = x.iloc[0:20]

plt.figure(figsize=(16,4))
ax = sns.barplot(x.index, x.values, alpha=0.8, color=color[3])
plt.title("Which city has the most reviews?")
locs, labels = plt.xticks()
plt.setp(labels, rotation=45)
plt.ylabel('# businesses', fontsize=12)
plt.xlabel('City', fontsize=12)

# adding the text labels
rects = ax.patches
labels = x.values
for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width()/2, height + 5, label, ha='center', va='bottom')
plt.show()

Cities with the most reviews

The chart highlights four cities — Las Vegas, Phoenix, Stuttgart, and Edinburgh — so let's spend more time here.

# get all ratings data
rating_data = business[['latitude', 'longitude', 'stars', 'review_count']]
# create a custom "popularity" column: stars * number of reviews
rating_data['popularity'] = rating_data['stars'] * rating_data['review_count']

f, (ax1, ax2) = plt.subplots(1, 2, figsize=(15,7))

# a random point inside Vegas
lat = 36.207430
lon = -115.268460
# some adjustments to get the right pic
lon_min, lon_max = lon - 0.3, lon + 0.5
lat_min, lat_max = lat - 0.4, lat + 0.5
# subset for Vegas
ratings_data_vegas = rating_data[(rating_data["longitude"] > lon_min) & (rating_data["longitude"] < lon_max) &
                                 (rating_data["latitude"] > lat_min) & (rating_data["latitude"] < lat_max)]
# facet scatter plot
ratings_data_vegas.plot(kind='scatter', x='longitude', y='latitude',
                        color='yellow', s=.02, alpha=.6, subplots=True, ax=ax1)
ax1.set_title("Las Vegas")
ax1.set_facecolor('black')

# a random point inside Phoenix
lat = 33.435463
lon = -112.006989
# some adjustments to get the right pic
lon_min, lon_max = lon - 0.3, lon + 0.5
lat_min, lat_max = lat - 0.4, lat + 0.5
# subset for Phoenix
ratings_data_phoenix = rating_data[(rating_data["longitude"] > lon_min) & (rating_data["longitude"] < lon_max) &
                                   (rating_data["latitude"] > lat_min) & (rating_data["latitude"] < lat_max)]
# plot Phoenix
ratings_data_phoenix.plot(kind='scatter', x='longitude', y='latitude',
                          color='yellow', s=.02, alpha=.6, subplots=True, ax=ax2)
ax2.set_title("Phoenix")
ax2.set_facecolor('black')
f.show()

The plots reveal the distinctive layouts of these cities' businesses: the block/grid structure characteristic of American cities versus the more fluid layouts of the other cities.

8. Digging into User Reviews

1) The 10 users with the most reviews

user_agg = reviews.groupby('user_id').agg({'review_id': ['count'],
                                           'date': ['min', 'max'],
                                           'useful': ['sum'],
                                           'funny': ['sum'],
                                           'cool': ['sum'],
                                           'stars': ['mean']})
user_agg = user_agg.sort_values([('review_id', 'count')], ascending=False)
print("Top 10 Users in Yelp")
user_agg.head(10)

The 10 most active users

Here are the top 10 users by review count, along with related information (earliest review date, most recent review date, and some review statistics).

2) Review counts per user

Anyway, let's focus on the users at the top of the review counts and look at some summary statistics. (For a cleaner chart, review counts are capped at 30.)

# Cap max reviews to 30 for better visuals
user_agg[('review_id', 'count')].loc[user_agg[('review_id', 'count')] > 30] = 30

plt.figure(figsize=(12,5))
plt.suptitle("User Deep dive", fontsize=20)
gridspec.GridSpec(1,2)
plt.subplot2grid((1,2), (0,0))

# density of reviews per user
ax = sns.kdeplot(user_agg[('review_id', 'count')], shade=True, color='r')
plt.title("How many reviews does an average user give?", fontsize=15)
plt.xlabel('# of reviews given', fontsize=12)
plt.ylabel('# of users', fontsize=12)

# cumulative distribution
plt.subplot2grid((1,2), (0,1))
sns.distplot(user_agg[('review_id', 'count')], kde_kws=dict(cumulative=True))
plt.title("Cumulative dist. of user reviews", fontsize=15)
plt.ylabel('Cumulative perc. of users', fontsize=12)
plt.xlabel('# of reviews given', fontsize=12)
plt.show()

end_time = time.time()
print("Took", end_time - start_time, "s")

Most users have written only 2-3 reviews, and nearly 80% of users have written fewer than 5.
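That "~80% under 5 reviews" figure can be computed directly instead of read off the CDF plot: a comparison mask's `.mean()` is exactly the fraction of rows satisfying it. A sketch on a toy per-user count series (not the real aggregates):

```python
import pandas as pd

# toy per-user review counts, heavily weighted toward 2-3 reviews
counts = pd.Series([1, 2, 2, 2, 3, 3, 3, 3, 4, 25])

# fraction of users with fewer than 5 reviews
frac_under_5 = (counts < 5).mean()
print(frac_under_5)  # 0.9
```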

9. Exploring Check-in Data

# define a helper that highlights the max cell
def highlight_max(data, color='yellow'):
    '''highlight the maximum in a Series or DataFrame'''
    attr = 'background-color: {}'.format(color)
    if data.ndim == 1:  # Series from .apply(axis=0) or axis=1
        is_max = data == data.max()
        return [attr if v else '' for v in is_max]
    else:  # from .apply(axis=None)
        is_max = data == data.max().max()
        return pd.DataFrame(np.where(is_max, attr, ''),
                            index=data.index, columns=data.columns)

Exploring patterns in check-in times and counts

# check-ins exploration
df = check_in.groupby(['weekday', 'hour'])['checkins'].sum()
df = df.reset_index()
df = df.pivot(index='hour', columns='weekday')[['checkins']]
df.columns = df.columns.droplevel()
df = df.reset_index()

# workaround for not being able to sort the values by hour
df.hour = df.hour.apply(lambda x: str(x).split(':')[0])
df.hour = df.hour.astype(int)

# sort the hour column
df = df.sort_values('hour')
df = df[['hour', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']]
# df = df.set_index('hour')

cm = sns.light_palette("orange", as_cmap=True)
# highlight the max of each column
df.style.apply(highlight_max, color='darkorange', axis=0)

Check-in totals for each hour of each day of the week

# https://python-graph-gallery.com/125-small-multiples-for-line-chart/ -- this is a goldmine
# initialize the figure
plt.style.use('seaborn-darkgrid')

# create a color palette
palette = plt.get_cmap('Set1')
plt.figure(figsize=(10,10))
plt.suptitle("Checkins variation across time", fontsize=20)
gridspec.GridSpec(3,3)
plt.subplots_adjust(hspace=0.4)

# multiple line plot
num = 0
for column in df.drop('hour', axis=1):
    num += 1
    # find the right spot on the plot
    if num == 7:  # adjustment to fit Sunday
        plt.subplot2grid((3,3), (2,0), colspan=3)
    else:
        plt.subplot(3, 3, num)
    # plot every group, but discreetly, in grey
    for v in df.drop('hour', axis=1):
        plt.plot(df['hour'], df[v], marker='', color='grey', linewidth=0.6, alpha=0.3)
    # plot the highlighted line
    plt.plot(df['hour'], df[column], marker='', color=palette(num), linewidth=2.4, alpha=0.9, label=column)
    # same limits for everybody!
    plt.xlim(0, 24)
    plt.ylim(-2, 260000)
    # not ticks everywhere
    if num in range(4):
        plt.tick_params(labelbottom=False)
    if num not in [1, 4, 7]:
        plt.tick_params(labelleft=False)
    # add title
    plt.title(column, loc='left', fontsize=12, fontweight=0, color=palette(num))

The chart shows that check-ins peak from around 8 p.m. until the small hours, with Saturday being the busiest day.
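That peak can also be pulled out programmatically with `idxmax` on the hour × weekday pivot, instead of reading it off the small multiples. A sketch with toy check-in totals (only two days and three hours, to keep it small):

```python
import pandas as pd

# toy hour x weekday check-in totals (index: hour of day)
df = pd.DataFrame(
    {"Fri": [10, 80, 200], "Sat": [15, 90, 260]},
    index=pd.Index([12, 20, 23], name="hour"),
)

print(df.max().idxmax())   # weekday whose peak is highest -> 'Sat'
print(df["Sat"].idxmax())  # busiest hour on that day -> 23
```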

10. To be continued...

The code and datasets for this article are available in the backend of my WeChat public account (SAMshare) — just enter the keyword: eda.

------- divider -------

Feel free to follow my WeChat public account: SAMshare

There I share quality articles on data analysis, common algorithms, my journey through Python data-analysis libraries, SQL, SAS, and data modeling — welcome aboard!

weixin.qq.com/r/3jo4INX

