Kaggle: Movie Data Analysis

04-30

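This article analyzes movie data to answer the following questions:

1. The relationship between box-office revenue and budget

2. Number of releases for the top ten genres

3. The countries that produce the most films

4. Yearly revenue trend

5. Number of films released each year

6. Paramount Pictures versus Universal Pictures

7. How genres change over time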

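Understanding the dataset:

The data is in movies.csv. Each field (column) means the following:

● id: identifier

● imdb_id: IMDB identifier

● popularity: relative number of page views on The Movie Database

● budget: budget (US dollars)

● revenue: revenue (US dollars)

● original_title: film title

● cast: list of actors, separated by |, at most 5 actors

● homepage: URL of the film's home page

● director: list of directors, separated by |, at most 5 directors

● tagline: the film's tagline

● keywords: keywords related to the film, separated by |, at most 5 keywords

● overview: plot summary

● runtime: running time of the film

● genres: list of genres, separated by |, at most 5 genres

● production_companies: list of production companies, separated by |, at most 5 companies

● release_date: date of first release

● vote_count: number of votes

● vote_average: average rating

● release_year: year of release

● budget_adj: budget adjusted for inflation (2010 US dollars)

● revenue_adj: revenue adjusted for inflation (2010 US dollars)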
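
Several of these columns (cast, director, keywords, genres, production_companies) pack multiple values into one |-separated string. As a minimal sketch of how such a column can be unpacked, the snippet below uses pandas' Series.str.get_dummies on a three-row sample. The sample titles and genres are invented for illustration and are not taken from movies.csv.

```python
import pandas as pd

# Hypothetical sample rows; the real values come from movies.csv.
sample = pd.DataFrame({
    "original_title": ["Film A", "Film B", "Film C"],
    "genres": ["Drama|Comedy", "Drama", "Action|Comedy"],
})

# One indicator column per distinct genre (1 if the film has it, else 0).
genre_flags = sample["genres"].str.get_dummies(sep="|")

# Summing each indicator column gives the number of films per genre.
films_per_genre = genre_flags.sum().sort_values(ascending=False)
print(films_per_genre)
```

The indicator table keeps one row per film, so it can be joined back to the other columns when a per-film view is needed.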

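1. The relationship between box-office revenue and budget

The scatter plot shows that revenue rises roughly in proportion to budget (the correlation output further down gives a coefficient of about 0.73 between budget and gross): films with larger budgets generally earn more. Life works the same way: effort tends to bring some return, and no effort brings none.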
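
To put a number on a relationship like this, one option is the Pearson correlation coefficient. The sketch below computes it with pandas on invented budget and gross figures; the numbers are placeholders, not values from the dataset.

```python
import pandas as pd

# Invented figures (US dollars), purely for illustration.
toy = pd.DataFrame({
    "budget": [10e6, 50e6, 100e6, 150e6, 200e6],
    "gross": [30e6, 90e6, 250e6, 300e6, 600e6],
})

# Pearson correlation: +1 is a perfect increasing line, 0 means no linear relation.
r = toy["budget"].corr(toy["gross"])
print(round(r, 3))
```

A coefficient close to 1 says the points lie near a rising straight line; it does not by itself say that a larger budget causes higher revenue.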

2. Number of releases for the top ten genres

Among the top ten genres, Drama has the most films, followed by Comedy.

3. The countries that produce the most films

4. Yearly revenue trend

5. Number of films released each year

6. Paramount Pictures versus Universal Pictures

7. How genres change over time

The code is as follows:

import numpy as np
import pandas as pd
import json
import matplotlib.pyplot as plt
import seaborn as sns

def load_tmdb_movies(path):
    df = pd.read_csv(path)
    df['release_date'] = pd.to_datetime(df['release_date']).apply(lambda x: x.date())
    # These columns are stored as JSON strings; parse them into Python objects
    json_columns = ['genres', 'keywords', 'production_countries', 'production_companies', 'spoken_languages']
    for column in json_columns:
        df[column] = df[column].apply(json.loads)
    return df

def load_tmdb_credits(path):
    df = pd.read_csv(path)
    json_columns = ['cast', 'crew']
    for column in json_columns:
        df[column] = df[column].apply(json.loads)
    return df
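
In the TMDB 5000 CSV files, columns such as genres and cast hold JSON text, which is why the loaders call json.loads on them. Here is a minimal sketch of what that conversion does for one cell. The cell value is a made-up example in the same shape, an assumption rather than a row copied from the file.

```python
import json

# Hypothetical cell value shaped like a TMDB 5000 `genres` entry.
raw_cell = '[{"id": 18, "name": "Drama"}, {"id": 35, "name": "Comedy"}]'

# json.loads turns the string into a list of dictionaries.
parsed = json.loads(raw_cell)

# The names can then be joined with "|", which is what pipe_flatten_names does.
flattened = "|".join(item["name"] for item in parsed)
print(flattened)  # Drama|Comedy
```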

# Map TMDB column names to their IMDB-style equivalents
TMDB_TO_IMDB_SIMPLE_EQUIVALENCIES = {
    'budget': 'budget',
    'genres': 'genres',
    'revenue': 'gross',
    'title': 'movie_title',
    'runtime': 'duration',
    'original_language': 'language',
    'keywords': 'plot_keywords',
    'vote_count': 'num_voted_users',
}

IMDB_COLUMNS_TO_REMAP = {'imdb_score': 'vote_average'}

def safe_access(container, index_values):
    # Follow a chain of indexes/keys; return NaN if any step is missing
    result = container
    try:
        for idx in index_values:
            result = result[idx]
        return result
    except (IndexError, KeyError):
        return np.nan
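
When one handler has to catch more than one exception type, as safe_access does, the types must be given as a tuple. Writing `except IndexError or KeyError` would catch only IndexError, because `or` returns its first truthy operand. The sketch below shows the tuple form in a standalone function; `safe_get` is a hypothetical name, and it returns None rather than NaN to stay free of dependencies.

```python
def safe_get(container, path):
    """Follow a sequence of indexes/keys; return None if any step is missing."""
    result = container
    try:
        for step in path:
            result = result[step]
        return result
    except (IndexError, KeyError):  # a tuple catches either exception type
        return None


# Invented cast list: the second entry has no "name" key.
cast = [{"name": "Actor One"}, {"character": "no name field"}]
print(safe_get(cast, [0, "name"]))  # Actor One
print(safe_get(cast, [1, "name"]))  # None (KeyError caught)
print(safe_get(cast, [5, "name"]))  # None (IndexError caught)
```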

def get_director(crew_data):
    directors = [x['name'] for x in crew_data if x['job'] == 'Director']
    return safe_access(directors, [0])

def pipe_flatten_names(keywords):
    # Join the 'name' field of each entry into one |-separated string
    return '|'.join([x['name'] for x in keywords])

def convert_to_original_format(movies, credits):
    tmdb_movies = movies.copy()
    tmdb_movies.rename(columns=TMDB_TO_IMDB_SIMPLE_EQUIVALENCIES, inplace=True)
    tmdb_movies['title_year'] = pd.to_datetime(tmdb_movies['release_date']).apply(lambda x: x.year)
    # Use the first listed production country and spoken language
    tmdb_movies['country'] = tmdb_movies['production_countries'].apply(lambda x: safe_access(x, [0, 'name']))
    tmdb_movies['language'] = tmdb_movies['spoken_languages'].apply(lambda x: safe_access(x, [0, 'name']))
    tmdb_movies['director_name'] = credits['crew'].apply(get_director)
    # Note: list indexes start at 0, so these pick the 2nd to 4th listed cast members
    tmdb_movies['actor_1_name'] = credits['cast'].apply(lambda x: safe_access(x, [1, 'name']))
    tmdb_movies['actor_2_name'] = credits['cast'].apply(lambda x: safe_access(x, [2, 'name']))
    tmdb_movies['actor_3_name'] = credits['cast'].apply(lambda x: safe_access(x, [3, 'name']))
    # Flatten the parsed JSON lists into |-separated strings
    tmdb_movies['genres'] = tmdb_movies['genres'].apply(pipe_flatten_names)
    tmdb_movies['plot_keywords'] = tmdb_movies['plot_keywords'].apply(pipe_flatten_names)
    tmdb_movies['production_companies'] = tmdb_movies['production_companies'].apply(pipe_flatten_names)
    return tmdb_movies

movies = load_tmdb_movies(r"C:\Users\Administrator\Desktop\TMDB\tmdb_5000_movies.csv")
credits = load_tmdb_credits(r"C:\Users\Administrator\Desktop\TMDB\tmdb_5000_credits.csv")
original_format = convert_to_original_format(movies, credits)

# Pairwise correlations between the numeric columns only
corrdf = original_format.select_dtypes(include='number').corr()
corrdf


                    budget         id  popularity      gross   duration  vote_average  num_voted_users  title_year
budget            1.000000  -0.089377    0.505414   0.730823   0.269851      0.093146         0.593180    0.168990
id               -0.089377   1.000000    0.031202  -0.050425  -0.153536     -0.270595        -0.004128    0.434943
popularity        0.505414   0.031202    1.000000   0.644724   0.225502      0.273952         0.778130    0.101998
gross             0.730823  -0.050425    0.644724   1.000000   0.251093      0.197150         0.781487    0.090192
duration          0.269851  -0.153536    0.225502   0.251093   1.000000      0.375046         0.271944   -0.166849
vote_average      0.093146  -0.270595    0.273952   0.197150   0.375046      1.000000         0.312997   -0.198499
num_voted_users   0.593180  -0.004128    0.778130   0.781487   0.271944      0.312997         1.000000    0.114212
title_year        0.168990   0.434943    0.101998   0.090192  -0.166849     -0.198499         0.114212    1.000000

# 1. Relationship between budget and revenue
dfmoney = original_format[['budget', 'gross']]
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
x = dfmoney['budget']
y = dfmoney['gross']
plt.scatter(x, y, color='blue')
plt.title('relationship between gross and budget')
plt.xlabel('budget')
plt.ylabel('gross')
plt.show()

# 2. Top ten genres by number of films
# .copy() avoids pandas' SettingWithCopyWarning on the assignment below
df_style = original_format[['genres', 'title_year']].copy()
df_style['title_year'] = df_style['title_year'].fillna(2009)
genre_lists = df_style['genres'].str.split('|')
# Collect every genre occurrence, then count the most frequent ones
all_genres = []
for i in genre_lists:
    all_genres.extend(i)
genre_counts = pd.Series(all_genres).value_counts()[0:10].sort_values(ascending=True)
plt.subplots(figsize=(8, 5))
ax3 = genre_counts.plot(kind='barh', width=0.9)
# Write each count inside its bar
for i, v in enumerate(genre_counts.values):
    ax3.text(1, i, v, fontsize=15, color='white', weight='bold')
plt.savefig(r'C:\Users\Administrator\Desktop\TMDB\movies style.jpg')
plt.show()
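
Assigning to a column of a frame that was selected from original_format can make pandas emit a SettingWithCopyWarning, because it cannot tell whether the original frame was meant to change as well. One common way around it, sketched here on invented data, is to take an explicit copy before modifying the subset.

```python
import numpy as np
import pandas as pd

# Invented stand-in for original_format.
films = pd.DataFrame({
    "genres": ["Drama", "Comedy", "Action"],
    "title_year": [2009.0, np.nan, 2015.0],
    "gross": [1.0, 2.0, 3.0],
})

# .copy() makes the subset an independent DataFrame, so later
# assignments are unambiguous and leave `films` untouched.
subset = films[["genres", "title_year"]].copy()
subset["title_year"] = subset["title_year"].fillna(2009)

print(subset["title_year"].tolist())
print(films["title_year"].isna().sum())  # the original still has its gap
```

The same pattern applies wherever a two-column slice of original_format is filled or otherwise modified.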

# 3. Share of films per country, shown as a pie chart
original_format['country'].unique()
# Fill missing values with the mode (the most frequent country)
original_format['country'].value_counts()
original_format['country'] = original_format['country'].fillna('United States of America')
df = original_format
df_countries = df['title_year'].groupby(df['country']).count()
df_countries = df_countries.reset_index()
df_countries.rename(columns={'title_year': 'count'}, inplace=True)
df_countries = df_countries.sort_values('count', ascending=False)
df_countries.reset_index(drop=True, inplace=True)
sns.set_context('poster', font_scale=0.6)
plt.rc('font', weight='bold')
f, ax = plt.subplots(figsize=(11, 6))
# Label only countries with more than 80 films
labels = [s['country'] if s['count'] > 80 else '' for index, s in df_countries[['country', 'count']].iterrows()]
sizes = df_countries['count'].values
explode = [0.0] * len(df_countries)  # no slice is pulled out of the pie
ax.pie(sizes, explode=explode, labels=labels,
       autopct=lambda x: '{:1.0f}%'.format(x) if x > 1 else '',
       shadow=False, startangle=45)
ax.axis('equal')
ax.set_title('% of films per country', bbox={'facecolor': 'k', 'pad': 5}, color='w', fontsize=16)
plt.savefig(r'C:\Users\Administrator\Desktop\TMDB\per country.jpg')
plt.show()

# 4. Box-office revenue per year
# .copy() avoids pandas' SettingWithCopyWarning on the assignment below
df_gross = original_format[['gross', 'title_year']].copy()
# Find the mode of title_year
df_gross['title_year'].mode()
# Fill missing years with the mode
df_gross['title_year'] = df_gross['title_year'].fillna(2009)
y = df_gross.groupby(['title_year'])['gross'].sum()
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(1, 1, 1)
plt.plot(y, marker='.')
plt.title('annual gross')
plt.xlabel('year')
plt.ylabel('gross')
# plt.savefig(r'C:\Users\Administrator\Desktop\TMDB\gross.jpg')  # save the figure
plt.show()
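
The fill value 2009 is typed in by hand after inspecting mode(). Here is a small sketch, on invented years, of passing the computed mode straight to fillna instead. Series.mode returns a Series because ties are possible, so the first entry is taken.

```python
import numpy as np
import pandas as pd

# Invented release years with two gaps.
years = pd.Series([2009, 2009, 2014, np.nan, 2006, np.nan])

# mode() ignores NaN and returns the most frequent value(s) in sorted order.
most_common_year = years.mode().iloc[0]

filled = years.fillna(most_common_year)
print(most_common_year, filled.isna().sum())
```

Computing the value this way keeps the fill correct if the underlying data changes.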

# 5. Number of films released per year
df_summv = original_format[['movie_title', 'title_year']]
y1 = df_summv.groupby('title_year')['movie_title'].count()
fig = plt.figure()
ax1 = fig.add_subplot(1, 1, 1)
plt.plot(y1, linestyle='--')
plt.title('the number of movies per year')
plt.xlabel('year')
plt.ylabel('the number of movies')
plt.savefig(r'C:\Users\Administrator\Desktop\TMDB\the number of movies per year.jpg')
plt.show()

# 6. Paramount Pictures versus Universal Pictures
# production_companies was already flattened to a |-separated string in
# convert_to_original_format, so it is used directly here.
companies = pd.DataFrame()
companies['production_companies'] = original_format['production_companies']
companies.head()


list_com = companies['production_companies'].str.split('|')
list1 = []
for i in list_com:
    list1.extend(i)
# Count the films of every company that appears more than once
up = {}
for i in list1:
    if list1.count(i) > 1:
        up[i] = list1.count(i)
a = ['Paramount Pictures', 'Universal Pictures']
# up.get('Paramount Pictures') returns the count stored for that company
y = [up.get('Paramount Pictures'), up.get('Universal Pictures')]
plt.bar(range(2), y, align='center', color='blue')
plt.xticks(range(2), a)
# Print each count at the top of its bar
for i, j in enumerate(y):
    plt.text(i, j, '%.0f' % j, ha='center')
plt.savefig(r'C:\Users\Administrator\Desktop\TMDB\paramount &universal.jpg')
plt.show()
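
Calling list.count inside a loop rescans the whole list for every element. As an alternative sketch, collections.Counter from the standard library tallies everything in a single pass. The company strings below are invented sample rows, not data from the file.

```python
from collections import Counter

# Invented |-separated rows in the shape of production_companies.
rows = [
    "Paramount Pictures|Universal Pictures",
    "Universal Pictures",
    "Paramount Pictures|Some Other Studio",
    "Universal Pictures|Some Other Studio",
]

# Split every row and count each company once per film it appears in.
company_counts = Counter(name for row in rows for name in row.split("|"))

print(company_counts["Paramount Pictures"], company_counts["Universal Pictures"])
```

A Counter returns 0 for a company that never appears, so a lookup for a missing name does not need special handling.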

# 7. Genre trends over time
# .copy() avoids pandas' SettingWithCopyWarning on the assignments below
df = original_format[['title_year', 'genres']].copy()
df['genres'] = df['genres'].fillna('Comedy')
df['title_year'] = df['title_year'].fillna(df['title_year'].mean())
# Drop films that list no genre
df = df[df['genres'] != '']
min_year = df['title_year'].min()
max_year = df['title_year'].max()
# Collect the set of distinct genres
list_genres = set()
for s in df['genres'].str.split('|'):
    list_genres = set().union(s, list_genres)
list_genres = sorted(list_genres)
# One row per year, one column per genre, all counts starting at 0
genres_df = pd.DataFrame(0, index=range(int(min_year), int(max_year) + 1), columns=list_genres)
for genre, title_year in zip(df['genres'], df['title_year']):
    for j in genre.split('|'):
        # int() because a year filled with the mean is not a whole number
        genres_df.loc[int(title_year), j] += 1
genres_df.plot(figsize=(18, 9))
plt.xlabel('time')
plt.ylabel('number')
plt.title('movie type over time')
plt.savefig(r'C:\Users\Administrator\Desktop\TMDB\movie type over time.jpg')
plt.show()
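
The year-by-genre table can also be built without an explicit loop. The sketch below, on invented rows and assuming pandas 0.25 or later for DataFrame.explode, splits the genres, gives each genre its own row, and then cross-tabulates year against genre.

```python
import pandas as pd

# Invented rows in the shape of the title_year and genres columns.
films = pd.DataFrame({
    "title_year": [2000, 2000, 2001, 2001],
    "genres": ["Drama|Comedy", "Drama", "Action", "Comedy|Action"],
})

# One row per (film, genre) pair.
long_form = films.assign(genres=films["genres"].str.split("|")).explode("genres")

# Rows are years, columns are genres, cells are film counts.
genre_by_year = pd.crosstab(long_form["title_year"], long_form["genres"])
print(genre_by_year)
```

Calling genre_by_year.plot() on the result would draw one line per genre, as in the loop-based version.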

Because my knowledge is still limited in several areas, the analysis above leaves room for improvement.