TMDb Movie Data: Visualization and Analysis

05-18

From the column "Switching Careers to Data Analysis from Scratch"

Kaggle project: https://www.kaggle.com/tmdb/tmdb-movie-metadata

Data source: https://www.kaggle.com/tmdb/tmdb-movie-metadata/data

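The client wants its films to succeed so that the new company can gain a foothold in the market. They have asked for help understanding trends in the movie market so that they can make sound decisions. Their questions are:

How movie genres have changed over time;
How many films each genre has produced;
How Universal Pictures and Paramount Pictures compare;
How adapted films compare with original films;
How movie directors compare.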

1 Film output by genre

2 How movie genres have changed over time

Before 1980, output in every genre grew only slowly. From 1980 to 1990, Drama, Comedy, Thriller, Horror, Action, and Adventure all trended upward.

After 1990, Drama output rose the most, peaking in 2006 at roughly 135 films. Output of Comedy, Thriller, Action, and Adventure films also kept rising, though more gently.
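The yearly per-genre counts behind these trends come down to a simple tally. The sketch below is a minimal, standard-library illustration rather than the notebook's pandas code: the three sample rows are made up, and `count_genres_by_year` is a hypothetical helper name. It assumes each film has a `release_date` string and a `genres` field stored as a JSON list, as in the Kaggle CSV.

```python
import json
from collections import Counter

# Hypothetical sample rows that mimic the layout of tmdb_5000_movies.csv.
sample_movies = [
    {"release_date": "2006-05-19",
     "genres": '[{"id": 18, "name": "Drama"}, {"id": 53, "name": "Thriller"}]'},
    {"release_date": "2006-11-17",
     "genres": '[{"id": 18, "name": "Drama"}]'},
    {"release_date": "1985-07-03",
     "genres": '[{"id": 35, "name": "Comedy"}]'},
]


def count_genres_by_year(movies):
    """Return a Counter keyed by (year, genre name)."""
    counts = Counter()
    for movie in movies:
        year = int(movie["release_date"][:4])  # dates are YYYY-MM-DD
        for genre in json.loads(movie["genres"]):
            # A film with several genres is counted once per genre.
            counts[(year, genre["name"])] += 1
    return counts


genre_year_counts = count_genres_by_year(sample_movies)
print(genre_year_counts[(2006, "Drama")])
```

Because a film is counted once for each of its genres, the per-genre totals for a year can add up to more than the number of films released that year.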

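3 Relationships among rating, budget, revenue, and popularity

Budget has a fairly strong linear relationship with both revenue and popularity;
Revenue and popularity also have a fairly strong linear relationship;
Rating has only a weak linear relationship with budget, revenue, and popularity.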
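"Linear relationship" here refers to the Pearson correlation coefficient, which is what the correlation heatmap displays. As a rough illustration of what that number measures, the sketch below computes it by hand in plain Python. The budget, revenue, and rating figures are invented for the example and `pearson_r` is a hypothetical helper, so the resulting values say nothing about the real dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Invented figures (millions of USD for budget and revenue).
budgets = [10, 50, 100, 150, 200]
revenues = [25, 110, 260, 330, 520]
ratings = [7.1, 5.9, 6.8, 6.0, 6.5]

r_budget_revenue = pearson_r(budgets, revenues)
r_budget_rating = pearson_r(budgets, ratings)
print(round(r_budget_revenue, 2), round(r_budget_rating, 2))
```

A coefficient near 1 or -1 indicates a strong linear relationship, while a value near 0 indicates a weak one. It does not rule out a non-linear relationship.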

4 Universal Pictures vs. Paramount Pictures

The 5 films co-produced by Universal Pictures and Paramount Pictures
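Co-productions are the films whose `production_companies` field lists both studios. The sketch below shows one way to detect them using only the standard library. The sample rows, film titles, and company ids are made up, and `is_coproduction` is a hypothetical helper.

```python
import json

TARGET_STUDIOS = ("Universal Pictures", "Paramount Pictures")

# Made-up rows; the ids are placeholders, not real TMDb company ids.
sample_movies = [
    {"original_title": "Film A",
     "production_companies": '[{"name": "Universal Pictures", "id": 1}, '
                             '{"name": "Paramount Pictures", "id": 2}]'},
    {"original_title": "Film B",
     "production_companies": '[{"name": "Universal Pictures", "id": 1}]'},
    {"original_title": "Film C",
     "production_companies": '[]'},
]


def company_names(raw_json):
    """Parse the JSON list and return the set of company names."""
    return {company["name"] for company in json.loads(raw_json)}


def is_coproduction(movie):
    """True when every target studio is credited on the film."""
    names = company_names(movie["production_companies"])
    return all(studio in names for studio in TARGET_STUDIOS)


coproductions = [m["original_title"] for m in sample_movies if is_coproduction(m)]
print(coproductions)
```

This version matches company names exactly. The notebook code instead uses a substring match on the joined company names, which would also count any company whose name merely contains "Universal Pictures" or "Paramount Pictures".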

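Universal Pictures vs. Paramount Pictures: number of films

Universal Pictures vs. Paramount Pictures: ratings

The average ratings are close;
Paramount Pictures has more films rated 6 or higher;
Both the highest-rated and the lowest-rated films belong to Paramount Pictures.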
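The three observations above correspond to a handful of per-studio summary statistics: mean, minimum, maximum, and the number of films at or above a cutoff. A minimal sketch follows, assuming invented ratings and a hypothetical `summarize` helper. Treating a rating of exactly 6 as "6 or higher" is also an assumption.

```python
from statistics import mean

# Invented ratings, for illustration only.
ratings_by_studio = {
    "Universal Pictures": [6.0, 6.4, 5.8, 7.0],
    "Paramount Pictures": [8.4, 6.2, 6.6, 4.0, 6.3],
}


def summarize(ratings, threshold=6.0):
    """Summary statistics for one studio's ratings."""
    return {
        "mean": mean(ratings),
        "min": min(ratings),
        "max": max(ratings),
        "count_at_or_above": sum(1 for r in ratings if r >= threshold),
    }


summary = {studio: summarize(r) for studio, r in ratings_by_studio.items()}
for studio, stats in summary.items():
    print(studio, stats)
```

The same helper works for budget, revenue, and popularity by swapping the input lists and the threshold.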

Universal Pictures vs. Paramount Pictures: budgets

Paramount Pictures has the higher average budget;
Paramount Pictures has more films with budgets above USD 160 million.

Universal Pictures vs. Paramount Pictures: revenue

Paramount Pictures has the higher average revenue;
Paramount Pictures has more films with revenue above USD 160 million.

Universal Pictures vs. Paramount Pictures: popularity

Universal Pictures has the higher average popularity;
The most popular film was produced by Universal Pictures.

Universal Pictures vs. Paramount Pictures: number of films per genre

Universal Pictures has produced more Drama, Comedy, Romance, Horror, Fantasy, and Family films;
Paramount Pictures has produced more Thriller, Action, Adventure, Science Fiction, and Mystery films.

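5 Adapted films vs. original films

Adapted vs. original films: number of films

Original films far outnumber adapted films.

Adapted vs. original films: average rating

Adapted films have a higher average rating than original films.

Adapted films also have far higher budgets, revenue, profit, and popularity than original films.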
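In the notebook code, a film counts as adapted when its keywords include "based on novel", and profit is defined as revenue minus budget. The sketch below restates that rule with the standard library; the sample rows and the `label` helper are hypothetical.

```python
import json
from statistics import mean

ADAPTED_KEYWORD = "based on novel"

# Made-up rows; budget and revenue are in millions of USD.
sample_movies = [
    {"keywords": '[{"id": 1, "name": "based on novel"}]', "budget": 100, "revenue": 250},
    {"keywords": '[{"id": 1, "name": "based on novel"}, {"id": 2, "name": "magic"}]',
     "budget": 50, "revenue": 100},
    {"keywords": '[{"id": 3, "name": "heist"}]', "budget": 30, "revenue": 50},
    {"keywords": '[]', "budget": 20, "revenue": 10},
    {"keywords": '[{"id": 4, "name": "road trip"}]', "budget": 5, "revenue": 10},
]


def label(movie):
    """Classify a film as adapted or original from its keyword list."""
    names = {keyword["name"] for keyword in json.loads(movie["keywords"])}
    return ADAPTED_KEYWORD if ADAPTED_KEYWORD in names else "original"


profits = {ADAPTED_KEYWORD: [], "original": []}
for movie in sample_movies:
    profits[label(movie)].append(movie["revenue"] - movie["budget"])

mean_profit = {group: mean(values) for group, values in profits.items()}
print(mean_profit)
```

One consequence of this rule is that films without the "based on novel" keyword, including films with no keywords at all, fall into the "original" group, which may partly explain why original films far outnumber adapted ones.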

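5 Top 20 directors by number of films

Steven Spielberg is the most prolific director.

Spielberg's agreement with the DreamWorks studio includes a clause stating that, even during shooting, he must be given a fixed time every evening to go home and have dinner with his family.

Going home for dinner every night while still leading the list suggests he works very efficiently.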

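6 Top 20 directors by number of highly rated films

We define a highly rated film as one with a rating of 8 or higher.

Christopher Nolan has 5 highly rated films, more than any other director.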
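Counting highly rated films per director takes two steps: pull the director's name out of each film's `crew` JSON, then tally the films rated 8 or higher. A minimal sketch follows, using invented rows and placeholder names ("Director A", "Director B"). It mirrors the notebook's rule of taking the first crew member whose job is "Director".

```python
import json
from collections import Counter

HIGH_RATING = 8.0  # threshold stated in the article


def find_director(crew_json):
    """Return the first crew member with job 'Director', or None."""
    for member in json.loads(crew_json):
        if member.get("job") == "Director":
            return member.get("name")
    return None


# Invented rows for illustration.
sample_movies = [
    {"vote_average": 8.3,
     "crew": '[{"job": "Producer", "name": "Producer X"}, '
             '{"job": "Director", "name": "Director A"}]'},
    {"vote_average": 8.1, "crew": '[{"job": "Director", "name": "Director A"}]'},
    {"vote_average": 7.9, "crew": '[{"job": "Director", "name": "Director A"}]'},
    {"vote_average": 8.0, "crew": '[{"job": "Director", "name": "Director B"}]'},
    {"vote_average": 9.0, "crew": '[]'},  # no director credited, so skipped
]

high_rated_counts = Counter()
for movie in sample_movies:
    name = find_director(movie["crew"])
    if name is not None and movie["vote_average"] >= HIGH_RATING:
        high_rated_counts[name] += 1

print(high_rated_counts.most_common(20))
```

Films with more than one credited director are attributed only to the first one found, so co-directors are undercounted under this rule.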

Nolan's 5 highly rated films

If Nolan made it, it is bound to be good.

7 Top 20 directors by average revenue

8 Top 20 directors by average popularity

9 Code
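The rankings by average revenue and by average popularity both reduce to a per-director mean followed by a descending sort. The sketch below shows that computation in plain Python. The rows are invented and `top_directors_by_mean` is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import mean


def top_directors_by_mean(rows, metric, n):
    """Rank directors by the mean of `metric` and return the top n pairs."""
    values = defaultdict(list)
    for row in rows:
        values[row["director"]].append(row[metric])
    averages = {director: mean(v) for director, v in values.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)[:n]


# Invented rows for illustration.
sample_rows = [
    {"director": "Director A", "popularity": 20.0},
    {"director": "Director A", "popularity": 40.0},
    {"director": "Director B", "popularity": 90.0},
    {"director": "Director C", "popularity": 10.0},
    {"director": "Director C", "popularity": 12.0},
]

top_two = top_directors_by_mean(sample_rows, "popularity", 2)
print(top_two)
```

Note that a director with a single, very successful film can outrank a prolific director on an average-based list, as "Director B" does in this toy data.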

import json

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib.font_manager import FontProperties

# Jupyter magic: render figures inline (notebook only).
%matplotlib inline

sns.set()
sns.set_palette('Blues_d')
# CJK-capable font, needed only if plot labels contain Chinese text.
myfont = FontProperties(fname=r'C:\coben\SimHei-windows.ttf', size=14)
sns.set(font=myfont.get_name())

# Load the data
movies = pd.read_csv(r'C:\coben\tmdb_5000_movies.csv', encoding='utf-8')
credits = pd.read_csv(r'C:\coben\tmdb_5000_credits.csv', encoding='utf-8')

# Inspect the data
movies.head(3)
credits.head(3)

# Clean the data: drop columns that are not needed
del movies['homepage']
del movies['overview']
del movies['title']
movies.head(1)
del credits['title']
credits.head(1)

# Merge the two tables into one
full = pd.merge(movies, credits, left_on='id', right_on='movie_id')
full.head(1)
full.shape
# Output (original run): (4803, 20)

# Parse the JSON-encoded columns into Python lists of dicts
jsonColumns = ['genres', 'keywords', 'production_companies',
               'production_countries', 'spoken_languages', 'cast', 'crew']
for i in jsonColumns:
    full[i] = full[i].apply(json.loads)
full.head(1)

full.info()
print('------------------------------------')
full.isnull().sum()
# Output (original run): 4803 rows, 20 columns.
# Missing values: release_date 1, runtime 2, tagline 844; all other columns 0.

# Handle missing data
full[full.release_date.isnull()]
del full['tagline']
value1 = {'release_date': '2017-11-01'}
full.fillna(value=value1, limit=1, inplace=True)
full.loc[4553]
# Output (original run): row 4553 is "America Is Still the Place"
# (id 380097), now with release_date 2017-11-01.

full[full.runtime.isnull()]
value2 = {'runtime': 98.0}
value3 = {'runtime': 81.0}
# limit=1 fills one missing runtime per call, in row order.
full.fillna(value=value2, limit=1, inplace=True)
full.fillna(value=value3, limit=1, inplace=True)
full.loc[2656]
# Output (original run): row 2656 is "Chiamatemi Francesco - Il Papa
# della gente" (id 370980), now with runtime 98.

full.info()
# Output (original run): 19 columns, each with 4803 non-null values.

# Time series handling: extract the release year
full['release_date'] = pd.to_datetime(full['release_date'], format='%Y-%m-%d')
full['year'] = full['release_date'].dt.year
full['year'].head()
# Output (original run): 2009, 2007, 2015, 2012, 2012
full.head(1)

# Select the feature columns used in the analysis
anaDf = full[['original_title', 'year', 'runtime', 'vote_average',
              'vote_count', 'budget', 'revenue', 'popularity']].reset_index(drop=True)
anaDf.head()

# Correlation between the numeric features
corr = anaDf.drop(columns='original_title').corr()
fig, ax = plt.subplots(figsize=(8, 8))
sns.heatmap(corr, annot=True, ax=ax)
plt.xticks(fontsize=12)
plt.yticks(rotation='horizontal', fontsize=12)

# `size` was renamed to `height` in newer seaborn releases.
sns.lmplot(x='budget', y='revenue', data=anaDf, height=8)
plt.title('Budget vs. revenue')


def pipe_flatten_names(i):
    """Join the 'name' field of each dict in a list with '|'."""
    return '|'.join([x['name'] for x in i])


# Genre features
full['genres'] = full['genres'].apply(pipe_flatten_names)
full.head(1)

# A set removes duplicate genre names automatically
genresSet = set()
for s in full['genres'].str.split('|'):
    genresSet = set().union(s, genresSet)
genresList = list(genresSet)
genresList.remove('')  # films without any genre produce an empty string
genresList
# Output (original run): ['Science Fiction', 'Music', 'Drama', 'Thriller',
# 'Comedy', 'Horror', 'Romance', 'History', 'Animation', 'Adventure',
# 'TV Movie', 'Foreign', 'Western', 'Mystery', 'Fantasy', 'Family',
# 'Crime', 'Action', 'War', 'Documentary']

# Add one 0/1 indicator column per genre
for i in genresList:
    anaDf[i] = full['genres'].str.contains(i).apply(lambda x: 1 if x else 0)
anaDf.head(3)

# Number of films per genre
genrescountSer = anaDf[genresList].sum().sort_values(ascending=False)
genrescountSer
# Output (original run):
# Drama 2297, Comedy 1722, Thriller 1274, Action 1154, Romance 894,
# Adventure 790, Crime 696, Science Fiction 535, Horror 519, Family 513,
# Fantasy 424, Mystery 348, Animation 234, History 197, Music 185,
# War 144, Documentary 110, Western 82, Foreign 34, TV Movie 8

fig, ax = plt.subplots(figsize=(10, 8))
sns.barplot(x=genrescountSer.values, y=genrescountSer.index, ax=ax)
# Axis labels and font sizes
plt.ylabel('Genre', fontsize=12)
plt.xlabel('Number of films', fontsize=12)
# Keep the x-axis tick labels horizontal
plt.xticks(rotation='horizontal')
plt.yticks(fontsize=12)

# Year vs. genre
anaDf = anaDf.sort_values(by='year', axis=0, ascending=True)
anaDf.reset_index(drop=True).head(3)  # display only; anaDf keeps its index
genres_yearDf = anaDf.groupby('year')[genresList].sum()
genres_yearDf.head()
full['year'].max()
# Output (original run): 2017

for start, end in [(1916, 1990), (1990, 2017)]:
    genres_yearDf.loc[start:end].plot(figsize=(20, 10), marker='.')
    plt.title('Year vs. genre', fontsize=22)
    plt.xlabel('Year', fontsize=15)
    plt.ylabel('Number of films', fontsize=15)
    plt.xticks(fontsize=15)
    plt.grid(True)

# Universal Pictures vs. Paramount Pictures
full['production_companies'] = full['production_companies'].apply(pipe_flatten_names)
full.head(3)
compList = ['Universal Pictures', 'Paramount Pictures']
for i in compList:
    anaDf[i] = full['production_companies'].str.contains(i).apply(lambda x: 1 if x else 0)
anaDf.head(1)

is_universal = anaDf['Universal Pictures'] == 1
is_paramount = anaDf['Paramount Pictures'] == 1

# Keep only films made by at least one of the two studios.
# Boolean masks replace the original range(4802) loops, which skipped the
# last row (index 4802), and DataFrame.append, which newer pandas removed.
uni_paraDf = anaDf[is_universal | is_paramount].copy()
uni_paraDf['company'] = np.where(uni_paraDf['Universal Pictures'] == 1,
                                 'Universal Pictures', 'Paramount Pictures')
uni_paraDf.shape
# Output (original run): (595, 31); the count may differ by one here
# because of the off-by-one loop in the original code.

# Five films are co-productions. They are labelled Universal above, so a
# second copy labelled Paramount is added to count them for both studios.
bothDf = anaDf[is_universal & is_paramount].copy()
bothDf['company'] = 'Paramount Pictures'
bothDf
uni_paraDf = pd.concat([uni_paraDf, bothDf]).reset_index(drop=True)
uni_paraDf.head(3)

fig, ax = plt.subplots(figsize=(10, 2))
sns.countplot(y='company', data=uni_paraDf, ax=ax)
plt.title('Number of films by studio')

# Mean (bar plot) and distribution (violin plot) of each metric by studio
metrics = [('vote_average', 'Rating comparison'),
           ('budget', 'Budget comparison'),
           ('revenue', 'Revenue comparison'),
           ('popularity', 'Popularity comparison')]
for col, title in metrics:
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
    sns.barplot(x='company', y=col, data=uni_paraDf, ci=None, ax=ax1)
    sns.violinplot(x='company', y=col, data=uni_paraDf, ax=ax2)
    fig.suptitle(title)

# Number of films per genre for each studio (12 most common genres)
topGenres = ['Drama', 'Comedy', 'Thriller', 'Action', 'Romance', 'Adventure',
             'Crime', 'Science Fiction', 'Horror', 'Family', 'Fantasy', 'Mystery']
fig, axes = plt.subplots(12, 1, figsize=(8, 36))
for ax, genre in zip(axes, topGenres):
    sns.barplot(y='company', x=genre, data=uni_paraDf,
                estimator=np.sum, ci=None, ax=ax)

# Original vs. adapted
full['keywords'] = full['keywords'].apply(pipe_flatten_names)
full.head(3)
anaDf['original_or_not'] = full['keywords'].str.contains('based on novel').apply(
    lambda x: 'based on novel' if x else 'original')
anaDf['profit'] = anaDf['revenue'] - anaDf['budget']
anaDf.head(3)

fig, ax = plt.subplots(figsize=(3, 6))
sns.countplot(x='original_or_not', data=anaDf, ax=ax)

fig, ax = plt.subplots(figsize=(3, 6))
sns.barplot(x='original_or_not', y='vote_average', data=anaDf, ci=None, ax=ax)

fig, axes = plt.subplots(1, 4, figsize=(15, 8))
for ax, col in zip(axes, ['budget', 'revenue', 'profit', 'popularity']):
    sns.barplot(x='original_or_not', y=col, data=anaDf, ci=None, ax=ax)


# Directors
def director(x):
    """Return the first crew member whose job is 'Director' (None if absent)."""
    for i in x:
        if i['job'] == 'Director':
            return i['name']


def plot_top20(series, title, figsize=(10, 10)):
    """Horizontal bar chart of the 20 largest values in a Series."""
    top = series.sort_values(ascending=False).head(20)
    fig, ax = plt.subplots(figsize=figsize)
    sns.barplot(y=top.index, x=top.values, ax=ax)
    plt.title(title)


anaDf['director'] = full['crew'].apply(director)
anaDf.head(1)

director_count = anaDf.groupby('director')['original_title'].count()
plot_top20(director_count, 'Top 20 directors by number of films')

vote_8_count = anaDf[anaDf['vote_average'] >= 8].groupby('director')['original_title'].count()
plot_top20(vote_8_count, 'Top 20 directors by number of highly rated films', figsize=(8, 10))

ave_revenue = anaDf.groupby('director')['revenue'].mean()
plot_top20(ave_revenue, 'Top 20 directors by average revenue')

ave_popularity = anaDf.groupby('director')['popularity'].mean()
plot_top20(ave_popularity, 'Top 20 directors by average popularity')

10 References

TMDB Means per genre (www.kaggle.com)

hanajya: Thinking Like a Producer: Movie Data Analysis (zhuanlan.zhihu.com)