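Kaggle: Movie Data Analysis
This article analyzes movie data to answer the following questions:
1. The relationship between box-office revenue and production budget
2. Number of releases for the ten most common genres
3. Countries that produce the most films
4. Yearly trend in film revenue
5. Number of films released each year
6. Paramount Pictures versus Universal Pictures
7. How film genres have changed over time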
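Understanding the dataset:
The data is in movies.csv. Each field (column) means the following:
● id: identifier
● imdb_id: IMDB identifier
● popularity: relative number of page views on Movie Database
● budget: budget (US dollars)
● revenue: revenue (US dollars)
● original_title: film title
● cast: list of actors, separated by |, at most 5 actors
● homepage: URL of the film's home page
● director: list of directors, separated by |, at most 5 directors
● tagline: the film's tagline
● keywords: keywords related to the film, separated by |, at most 5 keywords
● overview: plot summary
● runtime: running time of the film
● genres: list of genres, separated by |, at most 5 genres
● production_companies: list of production companies, separated by |, at most 5 companies
● release_date: date of first release
● vote_count: number of votes
● vote_average: average rating
● release_year: year of release
● budget_adj: budget adjusted for inflation (in 2010 US dollars)
● revenue_adj: revenue adjusted for inflation (in 2010 US dollars)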
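Several of the columns above (cast, director, keywords, genres, production_companies) pack multiple values into one |-separated string. A minimal sketch of unpacking such a column with pandas follows; the two rows are made-up sample data, not rows from movies.csv.

```python
import pandas as pd

# Made-up sample rows that mimic the |-separated layout described above.
sample = pd.DataFrame({
    "original_title": ["Movie A", "Movie B"],
    "genres": ["Action|Adventure|Science Fiction", "Drama"],
})

# Split each string on "|" so every row holds a list of genres.
sample["genre_list"] = sample["genres"].str.split("|")

# Number of genres per film and the first genre listed.
sample["n_genres"] = sample["genre_list"].str.len()
sample["first_genre"] = sample["genre_list"].str[0]

print(sample[["original_title", "n_genres", "first_genre"]])
```

The same pattern should apply to the other |-separated columns.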
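1. The relationship between box-office revenue and production budget
The scatter plot shows a positive relationship between budget and gross: films with larger budgets generally earn more. The correlation matrix in the code section gives a coefficient of about 0.73 between budget and gross. The same tends to hold in life: effort usually brings some return, and without effort there is no return.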
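One way to put a number on this kind of relationship is the Pearson correlation coefficient. Here is a minimal sketch with NumPy; the budget and gross figures are invented for illustration and are not taken from the dataset.

```python
import numpy as np

# Invented budget/gross pairs in millions of dollars (illustration only).
budget = np.array([10.0, 50.0, 100.0, 150.0, 200.0])
gross = np.array([30.0, 90.0, 260.0, 310.0, 620.0])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the Pearson coefficient between the two series.
r = np.corrcoef(budget, gross)[0, 1]

print(f"Pearson r = {r:.3f}")
```

A value near 1 means the points lie close to a rising straight line. It says nothing about causation.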
2. Number of releases for the ten most common genres
The chart of the ten most common genres shows that Drama has the most films, followed by Comedy.
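The ranking behind such a chart can be sketched as follows, assuming a genres column of |-separated strings. The four rows are made up, so the counts are not the dataset's.

```python
import pandas as pd

# Made-up genre strings, one per film.
genres = pd.Series([
    "Drama|Romance",
    "Comedy",
    "Drama|Comedy",
    "Drama|Thriller",
])

# One row per (film, genre) pair, then count how often each genre appears.
genre_counts = genres.str.split("|").explode().value_counts()

# Keep the most frequent genres; head(10) would give a top ten on real data.
top = genre_counts.head(2)
print(top)
```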
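3. Countries that produce the most films
4. Yearly trend in film revenue
5. Number of films released each year
6. Paramount Pictures versus Universal Pictures
7. How film genres have changed over time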
The code follows:
import json

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


def load_tmdb_movies(path):
    df = pd.read_csv(path)
    df['release_date'] = pd.to_datetime(df['release_date']).apply(lambda x: x.date())
    # These columns hold JSON-encoded lists of dicts.
    json_columns = ['genres', 'keywords', 'production_countries',
                    'production_companies', 'spoken_languages']
    for column in json_columns:
        df[column] = df[column].apply(json.loads)
    return df


def load_tmdb_credits(path):
    df = pd.read_csv(path)
    json_columns = ['cast', 'crew']
    for column in json_columns:
        df[column] = df[column].apply(json.loads)
    return df


TMDB_TO_IMDB_SIMPLE_EQUIVALENCIES = {
    'budget': 'budget',
    'genres': 'genres',
    'revenue': 'gross',
    'title': 'movie_title',
    'runtime': 'duration',
    'original_language': 'language',
    'keywords': 'plot_keywords',
    'vote_count': 'num_voted_users',
}

IMDB_COLUMNS_TO_REMAP = {'imdb_score': 'vote_average'}


def safe_access(container, index_values):
    # Return a missing value instead of raising when an index or key is absent.
    result = container
    try:
        for idx in index_values:
            result = result[idx]
        return result
    except (IndexError, KeyError):  # "except IndexError or KeyError" catches only IndexError
        return np.nan               # pd.np is no longer available in recent pandas


def get_director(crew_data):
    directors = [x['name'] for x in crew_data if x['job'] == 'Director']
    return safe_access(directors, [0])


def pipe_flatten_names(keywords):
    return '|'.join([x['name'] for x in keywords])


def convert_to_original_format(movies, credits):
    tmdb_movies = movies.copy()
    tmdb_movies.rename(columns=TMDB_TO_IMDB_SIMPLE_EQUIVALENCIES, inplace=True)
    tmdb_movies['title_year'] = pd.to_datetime(tmdb_movies['release_date']).apply(lambda x: x.year)
    tmdb_movies['country'] = tmdb_movies['production_countries'].apply(lambda x: safe_access(x, [0, 'name']))
    tmdb_movies['language'] = tmdb_movies['spoken_languages'].apply(lambda x: safe_access(x, [0, 'name']))
    tmdb_movies['director_name'] = credits['crew'].apply(get_director)
    # Indices 1 to 3 are kept as in the original code; note that list position 0 is the first cast entry.
    tmdb_movies['actor_1_name'] = credits['cast'].apply(lambda x: safe_access(x, [1, 'name']))
    tmdb_movies['actor_2_name'] = credits['cast'].apply(lambda x: safe_access(x, [2, 'name']))
    tmdb_movies['actor_3_name'] = credits['cast'].apply(lambda x: safe_access(x, [3, 'name']))
    tmdb_movies['genres'] = tmdb_movies['genres'].apply(pipe_flatten_names)
    tmdb_movies['plot_keywords'] = tmdb_movies['plot_keywords'].apply(pipe_flatten_names)
    tmdb_movies['production_companies'] = tmdb_movies['production_companies'].apply(pipe_flatten_names)
    return tmdb_movies


movies = load_tmdb_movies(r"C:\Users\Administrator\Desktop\TMDB\tmdb_5000_movies.csv")
credits = load_tmdb_credits(r"C:\Users\Administrator\Desktop\TMDB\tmdb_5000_credits.csv")
original_format = convert_to_original_format(movies, credits)

# numeric_only=True (pandas 1.5 or later) restricts the correlation to numeric columns.
corrdf = original_format.corr(numeric_only=True)
corrdf

Output of corrdf:
                    budget         id  popularity      gross   duration  vote_average  num_voted_users  title_year
budget            1.000000  -0.089377    0.505414   0.730823   0.269851      0.093146         0.593180    0.168990
id               -0.089377   1.000000    0.031202  -0.050425  -0.153536     -0.270595        -0.004128    0.434943
popularity        0.505414   0.031202    1.000000   0.644724   0.225502      0.273952         0.778130    0.101998
gross             0.730823  -0.050425    0.644724   1.000000   0.251093      0.197150         0.781487    0.090192
duration          0.269851  -0.153536    0.225502   0.251093   1.000000      0.375046         0.271944   -0.166849
vote_average      0.093146  -0.270595    0.273952   0.197150   0.375046      1.000000         0.312997   -0.198499
num_voted_users   0.593180  -0.004128    0.778130   0.781487   0.271944      0.312997         1.000000    0.114212
title_year        0.168990   0.434943    0.101998   0.090192  -0.166849     -0.198499         0.114212    1.000000

# 1. Relationship between budget and gross
dfmoney = original_format[['budget', 'gross']]
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
x = dfmoney['budget']
y = dfmoney['gross']
plt.scatter(x, y, color='blue')
plt.title('relationship between gross and budget')
plt.xlabel('budget')
plt.ylabel('gross')
plt.show()
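The safe_access helper above is meant to return a missing value when either an index lookup or a key lookup fails. In Python, `except IndexError or KeyError` evaluates the `or` expression first and therefore catches only IndexError, whereas a tuple catches both. Here is a minimal stand-alone sketch. It uses math.nan instead of NumPy so that it runs without dependencies, and the sample record is made up.

```python
import math


def safe_access(container, index_values):
    """Follow a chain of indexes/keys; return NaN if any step is missing."""
    result = container
    try:
        for idx in index_values:
            result = result[idx]
        return result
    except (IndexError, KeyError):  # a tuple catches either exception
        return math.nan


# Made-up record shaped like one parsed production_countries entry.
countries = [{"iso_3166_1": "US", "name": "United States of America"}]

print(safe_access(countries, [0, "name"]))   # path exists
print(safe_access([], [0, "name"]))          # empty list: IndexError is caught
print(safe_access(countries, [0, "title"]))  # missing key: KeyError is caught
```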
# 2. Number of releases for the ten most common genres
df_style = original_format[['genres', 'title_year']].copy()  # .copy() avoids SettingWithCopyWarning
df_style['title_year'] = df_style['title_year'].fillna(2009)
genre_lists = df_style['genres'].str.split('|')

# Flatten the per-film lists and count how often each genre appears.
all_genres = []
for i in genre_lists:
    all_genres.extend(i)
genre_counts = pd.Series(all_genres).value_counts().head(10).sort_values(ascending=True)

plt.subplots(figsize=(8, 5))
ax3 = genre_counts.plot(kind='barh', width=0.9)
for i, v in enumerate(genre_counts.values):
    ax3.text(1, i, v, fontsize=15, color='white', weight='bold')
plt.savefig(r'C:\Users\Administrator\Desktop\TMDB\movies style.jpg')
plt.show()
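Assigning to a column of a DataFrame that was itself selected from another DataFrame, as in the title_year fill above, can trigger pandas' SettingWithCopyWarning. One common way to avoid it is to take an explicit copy before assigning. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up rows; one film has no release year.
movies = pd.DataFrame({
    "genres": ["Drama", "Comedy", "Action"],
    "title_year": [2009.0, None, 2015.0],
    "gross": [1.0, 2.0, 3.0],
})

# .copy() makes the subset an independent DataFrame, so the assignment
# below is unambiguous and does not touch `movies`.
subset = movies[["genres", "title_year"]].copy()
subset["title_year"] = subset["title_year"].fillna(2009)

print(subset)
```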
# 3. Share of films per country, shown as a pie chart
original_format['country'].unique()
# Fill missing values with the mode
original_format['country'].value_counts()
original_format['country'] = original_format['country'].fillna('United States of America')

df = original_format
df_countries = df['title_year'].groupby(df['country']).count()
df_countries = df_countries.reset_index()
df_countries.rename(columns={'title_year': 'count'}, inplace=True)
df_countries = df_countries.sort_values('count', ascending=False)
df_countries.reset_index(drop=True, inplace=True)

sns.set_context('poster', font_scale=0.6)
plt.rc('font', weight='bold')
f, ax = plt.subplots(figsize=(11, 6))
# Label only countries with more than 80 films to keep the chart readable.
labels = [s['country'] if s['count'] > 80 else ''
          for index, s in df_countries[['country', 'count']].iterrows()]
sizes = df_countries['count'].values
explode = [0.0 for i in range(len(df_countries))]  # no wedge is pulled out
ax.pie(sizes, explode=explode, labels=labels,
       autopct=lambda x: '{:1.0f}%'.format(x) if x > 1 else '',
       shadow=False, startangle=45)
ax.axis('equal')
ax.set_title('% of films per country', bbox={'facecolor': 'k', 'pad': 5}, color='w', fontsize=16)
plt.savefig(r'C:\Users\Administrator\Desktop\TMDB\per country.jpg')
plt.show()
# 4. Gross per year
df_gross = original_format[['gross', 'title_year']].copy()  # .copy() avoids SettingWithCopyWarning
# Find the mode
df_gross['title_year'].mode()
# Fill missing values with the mode
df_gross['title_year'] = df_gross['title_year'].fillna(2009)
y = df_gross.groupby(['title_year'])['gross'].sum()
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(1, 1, 1)
plt.plot(y, marker='.')
plt.title('annual gross')
plt.xlabel('year')
plt.ylabel('gross')
# plt.savefig(r'C:\Users\Administrator\Desktop\TMDB\gross.jpg')  # save the figure
plt.show()
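# 5. Number of films per year
df_summv = original_format[['movie_title', 'title_year']]
# Count titles per year; selecting one column gives a single line in the plot.
y1 = df_summv.groupby('title_year')['movie_title'].count()
fig = plt.figure()  # start a new figure; the previous one has already been shown
ax1 = fig.add_subplot(1, 1, 1)
plt.plot(y1, linestyle='--')
plt.title('the number of movies per year')
plt.xlabel('year')
plt.ylabel('the number of movies')
plt.savefig(r'C:\Users\Administrator\Desktop\TMDB\the number of movies per year.jpg')
plt.show()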
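Steps 4 and 5 both group the films by release year, once to sum gross and once to count titles. As a possible shortcut, both figures can come from a single groupby with named aggregation (pandas 0.25 or later). A minimal sketch on made-up rows; the output column names total_gross and n_films are chosen here for illustration:

```python
import pandas as pd

# Made-up films; the numbers are for illustration only.
films = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D"],
    "title_year": [2008, 2008, 2009, 2009],
    "gross": [100.0, 50.0, 200.0, 25.0],
})

# One row per year with the summed gross and the number of titles.
per_year = films.groupby("title_year").agg(
    total_gross=("gross", "sum"),
    n_films=("movie_title", "count"),
)

print(per_year)
```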
# 6. Paramount Pictures versus Universal Pictures
companies = pd.DataFrame()

def get_companies(production_companies):
    return '|'.join(x['name'] for x in production_companies)

# original_format['production_companies'] has already been flattened to a string,
# so the parsed lists are read from movies instead.
companies['production_companies'] = movies['production_companies'].apply(get_companies)
companies.head()

Output of companies.head():
   production_companies
0  Ingenious Film Partners|Twentieth Century Fox ...
1  Walt Disney Pictures|Jerry Bruckheimer Films|S...
2  Columbia Pictures|Danjaq|B24
3  Legendary Pictures|Warner Bros.|DC Entertainme...
4  Walt Disney Pictures

list_com = companies['production_companies'].str.split('|')
list1 = []
for i in list_com:
    list1.extend(i)

# Number of films per company; value_counts replaces the repeated list.count() calls.
up = pd.Series(list1).value_counts().to_dict()

a = ['Paramount Pictures', 'Universal Pictures']
# up.get('Paramount Pictures') returns the value stored in the dict for that key
y = [up.get('Paramount Pictures'), up.get('Universal Pictures')]
plt.bar(range(2), y, align='center', color='blue')
plt.xticks(range(2), a)
for i, j in enumerate(y):
    plt.text(i, j, '%.0f' % j, ha='center')
plt.savefig(r'C:\Users\Administrator\Desktop\TMDB\paramount &universal.jpg')
plt.show()
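Counting how often each studio appears can also be done in a single pass with collections.Counter from the standard library. A minimal sketch on made-up |-separated strings; the "Example Studio" names and the counts are placeholders, not dataset values:

```python
from collections import Counter

# Made-up production_companies strings, one per film.
production_companies = [
    "Paramount Pictures|Example Studio A",
    "Universal Pictures",
    "Universal Pictures|Example Studio B",
    "Paramount Pictures",
    "Universal Pictures|Example Studio A",
    "",  # a film with no listed company
]

# Split every entry on "|" and tally each company name once per film.
counts = Counter(
    name
    for entry in production_companies
    for name in entry.split("|")
    if name  # skip the empty string left by films with no company
)

for studio in ("Paramount Pictures", "Universal Pictures"):
    print(studio, counts[studio])
```

A Counter returns 0 for names it has never seen, so a missing studio does not raise an error.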
# 7. How film genres have changed over time
df = original_format[['title_year', 'genres']].copy()  # .copy() avoids SettingWithCopyWarning
df['genres'] = df['genres'].fillna('Comedy')  # capitalised to match the dataset's genre labels
df['title_year'] = df['title_year'].fillna(df['title_year'].mean())
# Drop films with no genre listed.
df = df[df['genres'] != '']

min_year = df['title_year'].min()
max_year = df['title_year'].max()

# Collect every distinct genre name.
list_genres = set()
for s in df['genres'].str.split('|'):
    list_genres = set().union(s, list_genres)
list_genres = list(list_genres)

# One row per year, one column per genre, all counts starting at zero.
genres_df = pd.DataFrame(0, index=range(int(min_year), int(max_year) + 1), columns=list_genres)

for genre, title_year in zip(df['genres'], df['title_year']):
    split_genre = genre.split('|')
    for j in split_genre:
        # .loc replaces the removed .ix indexer; int() converts the float year to the integer index.
        genres_df.loc[int(title_year), j] += 1

genres_df.plot(figsize=(18, 9))
plt.xlabel('time')
plt.ylabel('number')
plt.title('movie type over time')
plt.savefig(r'C:\Users\Administrator\Desktop\TMDB\movie type over time.jpg')
plt.show()

My knowledge is still lacking in several areas, so the content above has room for improvement.
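As one possible refinement of the per-year genre tally above, the nested loop can be replaced by explode and pd.crosstab. This is a sketch on made-up rows, assuming genres is a |-separated string column and title_year holds integer years:

```python
import pandas as pd

# Made-up films; genres are |-separated strings.
films = pd.DataFrame({
    "title_year": [2008, 2008, 2009, 2009],
    "genres": ["Drama|Comedy", "Drama", "Action|Drama", "Comedy"],
})

# One row per (year, genre) pair.
pairs = films.assign(genre=films["genres"].str.split("|")).explode("genre")

# Rows are years, columns are genres, cells are film counts.
genre_by_year = pd.crosstab(pairs["title_year"], pairs["genre"])

print(genre_by_year)
```

Calling genre_by_year.plot() would then draw one line per genre.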