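How can you tell whether a movie is worth watching?
Before a movie is released, how can we tell whether it will be great? Many people rely on critics to gauge a film's quality, while others trust their own instincts. But reasonable reviews take time to accumulate after release, and human instinct is sometimes unreliable. With thousands of films produced every year, is there a better way to judge a film's quality without depending on critics or our own intuition?
This article uses the IMDB 5000 Movie Dataset on Kaggle, together with the R we have learned so far, to explore how to better judge whether a movie is worth watching.
I. Data import: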
rm(list = ls())  # clear the workspace
install.packages("dplyr")
install.packages("ggplot2")
library(dplyr)
library(ggplot2)
# Author's local path; point it at your own copy of movie_metadata.csv
Oridata <- read.csv("C:/Users/Administrator.PC-20150613LPHI/Desktop/大數據社群資料/dataExample/IMDB 5000 Movie Dataset/movie_metadata.csv", header = TRUE)
After import, the following fields are available for analysis:
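The field list can be reproduced in the console with base R. The sketch below runs on a small stand-in data frame so that it works without the CSV; the column names are ones used later in this article, but the values are made up. On the real data, replace `demo` with `Oridata`.

```r
# Stand-in for Oridata: hypothetical values, column names from the dataset
demo <- data.frame(
  movie_title = c("A", "B", "C"),
  imdb_score  = c(7.1, 8.3, NA),
  duration    = c(95, 142, 101),
  stringsAsFactors = FALSE
)

names(demo)   # column names
str(demo)     # type and first values of every column

# Number of missing values per column, useful before filtering on is.na()
na_counts <- colSums(is.na(demo))
na_counts
```

Counting missing values first shows how many rows the later `filter(!is.na(...))` calls are likely to drop.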
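II. Questions:
1. Which has the greater influence on a movie: the popularity of the director or the popularity of the lead actors?
2. Which indicators matter most for a high or low movie score?
3. Is the number of faces on a movie poster related to the movie's aspect ratio (the aspect_ratio column)?
4. Rank the movies most worth watching along different dimensions.
III. Director and movie score data: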
mymovies <- select(Oridata,
                   director_facebook_likes,
                   movie_facebook_likes,
                   imdb_score,
                   budget,
                   title_year)
# Rename the columns
mymovies <- rename(mymovies,
                   directorlikes = director_facebook_likes,
                   movielikes = movie_facebook_likes,
                   scores = imdb_score,
                   year = title_year)
mymovies
# Drop rows with missing values
mymovies <- filter(mymovies,
                   !is.na(directorlikes),
                   !is.na(movielikes),
                   !is.na(scores),
                   !is.na(budget),
                   !is.na(year))
# Sort by scores, descending
mymovies <- arrange(mymovies, desc(scores))
# Summarise per release year
by_year <- group_by(mymovies, year)
by_year
comp <- summarise(by_year,
                  count = n(),
                  mean_scores = mean(scores, na.rm = TRUE),
                  mean_directorlikes = mean(directorlikes, na.rm = TRUE),
                  mean_movielikes = mean(movielikes, na.rm = TRUE),
                  mean_budget = mean(budget, na.rm = TRUE))
comp
library(ggplot2)
# Yearly mean director likes against yearly mean movie likes
ggplot(data = comp) +
  geom_point(mapping = aes(x = mean_movielikes, y = mean_directorlikes)) +
  geom_smooth(mapping = aes(x = mean_movielikes, y = mean_directorlikes))
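The output is shown below:
As the plot of yearly averages shows, director popularity and movie popularity are broadly positively correlated, so choosing a movie by its director is a reasonable strategy.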
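A scatter plot of yearly means does not say how strong the relationship is for individual movies. One way to put a number on it is a correlation coefficient. The sketch below uses base R's `cor()` on made-up values (not taken from the dataset), with the column names from the renamed table above. Reporting Spearman's rank correlation next to Pearson's is my own choice here, on the assumption that like counts are heavily skewed.

```r
# Hypothetical values standing in for mymovies
demo <- data.frame(
  directorlikes = c(0, 120, 560, 13000, 22000, 480),
  movielikes    = c(150, 900, 4000, 83000, 175000, 2000)
)

# Pearson measures linear association; Spearman uses only the ranks,
# so a few blockbuster outliers influence it less
r_pearson  <- cor(demo$directorlikes, demo$movielikes, method = "pearson")
r_spearman <- cor(demo$directorlikes, demo$movielikes, method = "spearman")
c(pearson = r_pearson, spearman = r_spearman)
```

On the real data the same call would be `cor(mymovies$directorlikes, mymovies$movielikes, method = "spearman")`, run after the missing values have been filtered out.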
IV. Actor and movie score data:
disactor <- select(Oridata,
                   actor_3_facebook_likes,
                   actor_2_facebook_likes,
                   actor_1_facebook_likes,
                   movie_facebook_likes,
                   imdb_score,
                   budget,
                   title_year)
# Rename the columns
disactor <- rename(disactor,
                   actor1likes = actor_1_facebook_likes,
                   actor2likes = actor_2_facebook_likes,
                   actor3likes = actor_3_facebook_likes,
                   movielikes = movie_facebook_likes,
                   scores = imdb_score,
                   year = title_year)
disactor
# Drop rows with missing values
disactor <- filter(disactor,
                   !is.na(actor1likes),
                   !is.na(actor2likes),
                   !is.na(actor3likes),
                   !is.na(movielikes),
                   !is.na(scores),
                   !is.na(budget),
                   !is.na(year))
# Sort by scores, descending
disactor <- arrange(disactor, desc(scores))
# Summarise per release year
by_actoryear <- group_by(disactor, year)
by_actoryear
actorcomp <- summarise(by_actoryear,
                       count = n(),
                       mean_actor1likes = mean(actor1likes, na.rm = TRUE),
                       mean_actor2likes = mean(actor2likes, na.rm = TRUE),
                       mean_actor3likes = mean(actor3likes, na.rm = TRUE),
                       mean_scores = mean(scores, na.rm = TRUE),
                       mean_movielikes = mean(movielikes, na.rm = TRUE),
                       mean_budget = mean(budget, na.rm = TRUE))
actorcomp
library(ggplot2)
# One plot per billed actor: yearly mean actor likes against yearly mean movie likes
ggplot(data = actorcomp) +
  geom_point(mapping = aes(x = mean_movielikes, y = mean_actor1likes)) +
  geom_smooth(mapping = aes(x = mean_movielikes, y = mean_actor1likes)) +
  ggtitle("Effect of lead actor 1 on movie popularity")
ggplot(data = actorcomp) +
  geom_point(mapping = aes(x = mean_movielikes, y = mean_actor2likes)) +
  geom_smooth(mapping = aes(x = mean_movielikes, y = mean_actor2likes)) +
  ggtitle("Effect of lead actor 2 on movie popularity")
ggplot(data = actorcomp) +
  geom_point(mapping = aes(x = mean_movielikes, y = mean_actor3likes)) +
  geom_smooth(mapping = aes(x = mean_movielikes, y = mean_actor3likes)) +
  ggtitle("Effect of lead actor 3 on movie popularity")
The output is shown below:
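To compare the director and the three lead actors directly, which is what question 1 asks, the same correlation can be computed for each popularity column against the movie's own likes. This is only a sketch: the data frame is hypothetical (made-up values, with the renamed column names from the director and actor code above combined into one table), and Spearman correlation is my choice of measure, not something the original analysis used.

```r
# Hypothetical per-movie values; not taken from the dataset
demo <- data.frame(
  directorlikes = c(10, 300, 50, 8000, 1200, 0),
  actor1likes   = c(1000, 12000, 400, 26000, 9000, 700),
  actor2likes   = c(300, 900, 150, 11000, 2000, 640),
  actor3likes   = c(100, 500, 90, 700, 850, 300),
  movielikes    = c(500, 20000, 0, 150000, 33000, 1000)
)

predictors <- c("directorlikes", "actor1likes", "actor2likes", "actor3likes")

# Spearman correlation of each popularity measure with movie likes
cors <- sapply(predictors, function(col) {
  cor(demo[[col]], demo$movielikes, method = "spearman")
})

# Largest correlation first
sort(cors, decreasing = TRUE)
```

The column with the largest coefficient is the one most closely associated with movie popularity in the sample; a correlation alone does not show which one causes a movie to be popular.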
V. Factors that influence the movie score:
scoresfactor <- select(Oridata,
                       duration,
                       genres,
                       language,
                       country,
                       budget,
                       title_year,
                       imdb_score,
                       aspect_ratio,
                       color)
# Drop rows with missing values
scoresfactor <- filter(scoresfactor,
                       !is.na(duration),
                       !is.na(genres),
                       !is.na(language),
                       !is.na(country),
                       !is.na(title_year),
                       !is.na(imdb_score),
                       !is.na(aspect_ratio),
                       !is.na(color))
library(ggplot2)
ggplot(data = scoresfactor) +
  geom_point(mapping = aes(x = duration, y = imdb_score)) +
  geom_smooth(mapping = aes(x = duration, y = imdb_score)) +
  ggtitle("Running time vs. score")
ggplot(data = scoresfactor) +
  geom_point(mapping = aes(x = title_year, y = imdb_score)) +
  geom_smooth(mapping = aes(x = title_year, y = imdb_score)) +
  ggtitle("Release year vs. score")
ggplot(data = scoresfactor) +
  geom_point(mapping = aes(x = aspect_ratio, y = imdb_score)) +
  geom_smooth(mapping = aes(x = aspect_ratio, y = imdb_score)) +
  ggtitle("Aspect ratio vs. score")
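The output is shown below:
According to the plots, movies with a running time of 90-120 minutes score relatively well, while much shorter or much longer running times show no strong effect on the score.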
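A statement about the 90-120 minute range can be checked numerically by binning the running time and comparing the mean score per bin. The sketch below does this with base R's `cut()` and `tapply()`. The values are made up and say nothing about the real dataset, and the bin edges are my own assumption based on the range mentioned above.

```r
# Hypothetical running times (minutes) and scores standing in for scoresfactor
demo <- data.frame(
  duration   = c(75, 88, 95, 104, 118, 125, 150, 190),
  imdb_score = c(5.9, 6.1, 6.4, 6.8, 7.0, 6.9, 7.4, 7.8)
)

# Right-closed bins: (0, 90], (90, 120], (120, Inf)
demo$duration_band <- cut(
  demo$duration,
  breaks = c(0, 90, 120, Inf),
  labels = c("<=90", "90-120", ">120")
)

band_n    <- table(demo$duration_band)                          # movies per band
band_mean <- tapply(demo$imdb_score, demo$duration_band, mean)  # mean score per band
band_n
band_mean
```

Looking at the counts next to the means matters, because a band containing only a handful of movies gives an unreliable average.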
VI. Number of faces on the poster versus aspect ratio:
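# Build the table from Oridata so that facenumber_in_poster is available
facefactor <- filter(Oridata,
                     !is.na(aspect_ratio),
                     !is.na(facenumber_in_poster),
                     aspect_ratio < 4)
ggplot(data = facefactor) +
  geom_point(mapping = aes(x = aspect_ratio, y = facenumber_in_poster)) +
  geom_smooth(mapping = aes(x = aspect_ratio, y = facenumber_in_poster))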
The output is shown below:
VII. Movie rankings:
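# Start from the imported data and drop rows with missing values
rankmovies <- filter(Oridata,
                     !is.na(movie_facebook_likes),
                     !is.na(genres),
                     !is.na(budget),
                     !is.na(color),
                     !is.na(language),
                     !is.na(country),
                     !is.na(title_year),
                     !is.na(imdb_score),
                     !is.na(aspect_ratio),
                     !is.na(movie_title))
# Use exactly one of the filters below; both conditions must hold, hence "&".
# trimws() guards against stray whitespace around the color label.
rankmovies <- filter(rankmovies, imdb_score > 8.0 & trimws(color) == "Black and White")  # ten most popular black-and-white movies
#rankmovies <- filter(rankmovies, imdb_score > 8.0 & country == "USA")  # ten most popular US movies
#rankmovies <- filter(rankmovies, imdb_score > 8.0 & country == "UK")  # ten most popular UK movies
#rankmovies <- filter(rankmovies, imdb_score > 8.0 & budget > 100000000)  # ten most popular movies with a budget over 100 million US dollars
# Sort descending so the most popular movies come first
rankmovies <- arrange(rankmovies, desc(movie_facebook_likes))
#rankmovies <- arrange(rankmovies, desc(imdb_score))  # ten highest-rated movies
head(rankmovies, 10)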
Ten most popular black-and-white movies:
- Forrest Gump
- Schindler's List
- Memento
- 12 Angry Men
- Amélie
- American History X
- The Artist
- The Pianist
- Psycho
- Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb
Ten most popular US movies:
- Interstellar
- Django Unchained
- The Revenant
- Inception
- The Dark Knight Rises
- The Martian
- The Grand Budapest Hotel
- Her
- Gone Girl
- The Wolf of Wall Street
Ten most popular UK movies:
- The Imitation Game
- Rush
- The King's Speech
- In Bruges
- Snatch
- 2001: A Space Odyssey
- Alien
- Trainspotting
- Lock, Stock and Two Smoking Barrels
Ten most popular movies with a budget over 100 million US dollars:
- Interstellar
- Django Unchained
- The Revenant
- Inception
- The Dark Knight Rises
- The Martian
- The Wolf of Wall Street
- The Avengers
- Mad Max: Fury Road
Ten highest-rated movies:
- Towering Inferno
- The Shawshank Redemption
- The Godfather
- Dekalog
- Kickboxer: Vengeance
- The Dark Knight
- The Godfather: Part II
- Fargo
- Pulp Fiction
- Schindler's List
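The rankings above are produced by commenting filter lines in and out by hand. A small helper can make each ranking a single call. The function below is a hypothetical sketch in base R (the name `top_movies` and its arguments are mine, not part of the original code or of any package), and the demo rows are made up.

```r
# Return the n highest rows of `data` by `rank_col`, restricted to rows where `keep` is TRUE
top_movies <- function(data, rank_col, keep = rep(TRUE, nrow(data)), n = 10) {
  keep[is.na(keep)] <- FALSE                 # a condition that is NA drops the row
  rows <- data[keep, , drop = FALSE]
  # na.last = NA removes rows whose ranking value is missing
  ord <- order(rows[[rank_col]], decreasing = TRUE, na.last = NA)
  head(rows[ord, , drop = FALSE], n)
}

# Hypothetical rows using the dataset's column names
demo <- data.frame(
  movie_title          = c("Film A", "Film B", "Film C", "Film D", "Film E"),
  country              = c("USA", "UK", "USA", "USA", "UK"),
  imdb_score           = c(8.4, 8.1, 7.2, 8.8, 6.5),
  movie_facebook_likes = c(5000, 12000, 90000, NA, 300),
  stringsAsFactors = FALSE
)

# Most-liked US movies scoring above 8.0
usa_top <- top_movies(demo, "movie_facebook_likes",
                      keep = demo$country == "USA" & demo$imdb_score > 8.0,
                      n = 2)

# Highest-rated movies overall
by_score <- top_movies(demo, "imdb_score", n = 3)
by_score$movie_title
```

Passing the condition as a logical vector keeps the helper free of dplyr, and the same call pattern covers the black-and-white, UK and budget rankings by changing only `keep`.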