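How do you decide whether a movie is worth watching?

How can we tell how good a movie is before it is released? Many people rely on critics to gauge a film's quality, while others trust their instincts. But it takes time after release for a reasonable number of reviews to accumulate, and human instinct is sometimes unreliable. Given that thousands of movies are produced every year, is there a better way to judge a movie without relying on critics or on our own instincts?

This article uses the IMDB 5000 Movie Dataset from Kaggle and applies basic R to explore how to better judge whether a movie is worth watching.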

I. Data import:

rm(list = ls())  # clear the workspace
install.packages("dplyr")    # only needed once
install.packages("ggplot2")  # only needed once
library(dplyr)
library(ggplot2)
# Path to movie_metadata.csv from the Kaggle download; adjust it to your local copy
Oridata <- read.csv("movie_metadata.csv", header = TRUE)

After import, the following columns are available for analysis:
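The available columns can also be listed directly in R. The following is a minimal sketch in base R, using a small made-up data frame `toy` in place of `Oridata`; the four columns are names used elsewhere in this article, and the values are assumptions for illustration only.

```r
# Stand-in for Oridata: column names taken from this article, values made up
toy <- data.frame(
  movie_title = c("A", "B", "C", "D"),
  imdb_score  = c(7.1, 8.3, NA, 6.4),
  duration    = c(95, 142, 110, NA),
  budget      = c(1.0e7, NA, 5.0e6, 2.0e8),
  stringsAsFactors = FALSE
)

# Column names available for analysis
column_names <- names(toy)
print(column_names)

# Column types and a preview of the values
str(toy)

# Number of missing values per column, useful before the is.na() filtering steps
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)
```

Running the same three calls on `Oridata` gives the full column list and shows which columns will lose rows once missing values are filtered out.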

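II. Questions:

1. Which matters more to a movie: the popularity of its director or the popularity of its lead actors?

2. Which variables have the strongest influence on a movie's rating?

3. Is the number of faces on a movie poster related to the film's aspect ratio (the aspect_ratio column, the width-to-height ratio of the picture)?

4. Rank the films most worth watching along several dimensions.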

III. Director and movie rating data:

mymovies <- select(Oridata,
                   director_facebook_likes,
                   movie_facebook_likes,
                   imdb_score,
                   budget,
                   title_year)
# Rename the columns
mymovies <- rename(mymovies,
                   directorlikes = director_facebook_likes,
                   movielikes = movie_facebook_likes,
                   scores = imdb_score,
                   year = title_year)
mymovies
# Drop rows with missing values
mymovies <- filter(mymovies,
                   !is.na(directorlikes),
                   !is.na(movielikes),
                   !is.na(scores),
                   !is.na(budget),
                   !is.na(year))
# Sort by scores in descending order
mymovies <- arrange(mymovies, desc(scores))
# Compute per-year summaries
by_year <- group_by(mymovies, year)
by_year
comp <- summarise(by_year,
                  count = n(),
                  mean_scores = mean(scores, na.rm = TRUE),
                  mean_directorlikes = mean(directorlikes, na.rm = TRUE),
                  mean_movielikes = mean(movielikes, na.rm = TRUE),
                  mean_budget = mean(budget, na.rm = TRUE))
comp
library(ggplot2)
ggplot(data = comp) +
  geom_point(mapping = aes(x = mean_movielikes, y = mean_directorlikes)) +
  geom_smooth(mapping = aes(x = mean_movielikes, y = mean_directorlikes))

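The result is shown below:

As the plot shows, the yearly average of director likes and the yearly average of movie likes are broadly positively correlated, so choosing a movie by its director is a reasonable approach.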
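To put a number on the relationship described above, one option is to compute a correlation coefficient on the yearly averages. The sketch below uses base R only; the data frame `toy` is a made-up stand-in for `mymovies`, and `aggregate()` is used here as a base R counterpart of the `group_by()` and `summarise()` steps, not as the author's method.

```r
# Made-up stand-in for mymovies (year, directorlikes, movielikes); not real data
toy <- data.frame(
  year          = c(2010, 2010, 2011, 2011, 2012, 2012, 2013, 2013),
  directorlikes = c(100, 300, 500, 700, 200, 400, 900, 1100),
  movielikes    = c(1000, 3000, 4000, 8000, 2000, 5000, 9000, 15000)
)

# Yearly means, the base R counterpart of group_by(year) + summarise(mean(...))
yearly <- aggregate(cbind(directorlikes, movielikes) ~ year, data = toy, FUN = mean)

# Pearson correlation of the yearly means (sensitive to outliers)
r_pearson <- cor(yearly$directorlikes, yearly$movielikes)

# Spearman rank correlation, which only uses the ordering of the values
r_spearman <- cor(yearly$directorlikes, yearly$movielikes, method = "spearman")

print(yearly)
print(c(pearson = r_pearson, spearman = r_spearman))
```

Because the averages are taken per year, a high coefficient here describes how the yearly averages move together; it says less about individual films.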

IV. Actor and movie rating data:

disactor <- select(Oridata,
                   actor_3_facebook_likes,
                   actor_2_facebook_likes,
                   actor_1_facebook_likes,
                   movie_facebook_likes,
                   imdb_score,
                   budget,
                   title_year)
# Rename the columns
disactor <- rename(disactor,
                   actor1likes = actor_1_facebook_likes,
                   actor2likes = actor_2_facebook_likes,
                   actor3likes = actor_3_facebook_likes,
                   movielikes = movie_facebook_likes,
                   scores = imdb_score,
                   year = title_year)
disactor
# Drop rows with missing values
disactor <- filter(disactor,
                   !is.na(actor1likes),
                   !is.na(actor2likes),
                   !is.na(actor3likes),
                   !is.na(movielikes),
                   !is.na(scores),
                   !is.na(budget),
                   !is.na(year))
# Sort by scores in descending order
disactor <- arrange(disactor, desc(scores))
# Compute per-year summaries
by_actoryear <- group_by(disactor, year)
by_actoryear
actorcomp <- summarise(by_actoryear,
                       count = n(),
                       mean_actor1likes = mean(actor1likes, na.rm = TRUE),
                       mean_actor2likes = mean(actor2likes, na.rm = TRUE),
                       mean_actor3likes = mean(actor3likes, na.rm = TRUE),
                       mean_scores = mean(scores, na.rm = TRUE),
                       mean_movielikes = mean(movielikes, na.rm = TRUE),
                       mean_budget = mean(budget, na.rm = TRUE))
actorcomp
library(ggplot2)
ggplot(data = actorcomp) +
  geom_point(mapping = aes(x = mean_movielikes, y = mean_actor1likes)) +
  geom_smooth(mapping = aes(x = mean_movielikes, y = mean_actor1likes)) +
  ggtitle("Effect of lead actor 1 on movie popularity")
ggplot(data = actorcomp) +
  geom_point(mapping = aes(x = mean_movielikes, y = mean_actor2likes)) +
  geom_smooth(mapping = aes(x = mean_movielikes, y = mean_actor2likes)) +
  ggtitle("Effect of lead actor 2 on movie popularity")
ggplot(data = actorcomp) +
  geom_point(mapping = aes(x = mean_movielikes, y = mean_actor3likes)) +
  geom_smooth(mapping = aes(x = mean_movielikes, y = mean_actor3likes)) +
  ggtitle("Effect of lead actor 3 on movie popularity")

The result is shown below:

V. Factors affecting movie ratings:

scoresfactor <- select(Oridata,
                       duration,
                       genres,
                       language,
                       country,
                       budget,
                       title_year,
                       imdb_score,
                       aspect_ratio,
                       color)
# Drop rows with missing values
scoresfactor <- filter(scoresfactor,
                       !is.na(duration),
                       !is.na(genres),
                       !is.na(language),
                       !is.na(country),
                       !is.na(title_year),
                       !is.na(imdb_score),
                       !is.na(aspect_ratio),
                       !is.na(color))
library(ggplot2)
ggplot(data = scoresfactor) +
  geom_point(mapping = aes(x = duration, y = imdb_score)) +
  geom_smooth(mapping = aes(x = duration, y = imdb_score)) +
  ggtitle("Running time vs. rating")
ggplot(data = scoresfactor) +
  geom_point(mapping = aes(x = title_year, y = imdb_score)) +
  geom_smooth(mapping = aes(x = title_year, y = imdb_score)) +
  ggtitle("Release year vs. rating")
ggplot(data = scoresfactor) +
  geom_point(mapping = aes(x = aspect_ratio, y = imdb_score)) +
  geom_smooth(mapping = aes(x = aspect_ratio, y = imdb_score)) +
  ggtitle("Aspect ratio vs. rating")

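The result is shown below:

The plots show that films running 90 to 120 minutes tend to receive relatively high ratings, while a much shorter or much longer running time has little effect on the rating.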
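A claim about a running-time range can be cross-checked by binning `duration` and comparing the mean rating per bin. The sketch below uses base R with a made-up data frame `toy` standing in for `scoresfactor`; the bin edges at 90 and 120 minutes follow the range named above, and the values are assumptions chosen only to make the example run.

```r
# Made-up stand-in for scoresfactor (duration in minutes, imdb_score); not real data
toy <- data.frame(
  duration   = c(70, 85, 95, 100, 110, 118, 125, 150, 180),
  imdb_score = c(5.5, 6.0, 6.8, 7.0, 7.2, 6.9, 6.5, 7.5, 8.0)
)

# Bins: up to 90 minutes, above 90 up to 120, and above 120
toy$duration_bin <- cut(
  toy$duration,
  breaks = c(0, 90, 120, Inf),
  labels = c("<=90", "90-120", ">120")
)

# Number of films and mean rating per bin
bin_counts <- table(toy$duration_bin)
bin_means  <- tapply(toy$imdb_score, toy$duration_bin, mean)

print(bin_counts)
print(round(bin_means, 2))
```

Reporting the count next to each mean matters, because bins with only a handful of films give unstable averages.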

VI. Number of faces on the poster vs. aspect ratio:

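# facenumber_in_poster is not among the columns selected into scoresfactor,
# so filter Oridata directly
facedata <- filter(Oridata,
                   !is.na(aspect_ratio),
                   !is.na(facenumber_in_poster),
                   aspect_ratio < 4)  # drop outlying aspect ratios
ggplot(data = facedata) +
  geom_point(mapping = aes(x = aspect_ratio, y = facenumber_in_poster)) +
  geom_smooth(mapping = aes(x = aspect_ratio, y = facenumber_in_poster))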

The result is shown below:

VII. Movie rankings:

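# Start from the full dataset and drop rows with missing values
rankmovies <- filter(Oridata,
                     !is.na(movie_facebook_likes),
                     !is.na(genres),
                     !is.na(budget),
                     !is.na(color),
                     !is.na(language),
                     !is.na(country),
                     !is.na(title_year),
                     !is.na(imdb_score),
                     !is.na(aspect_ratio),
                     !is.na(movie_title))
# Ten most popular black-and-white films; both conditions must hold, hence &
# (trimws() guards against padded values in the color column)
rankmovies <- filter(rankmovies, imdb_score > 8.0 & trimws(color) == "Black and White")
# rankmovies <- filter(rankmovies, imdb_score > 8.0 & country == "USA")    # ten most popular US films
# rankmovies <- filter(rankmovies, imdb_score > 8.0 & country == "UK")     # ten most popular UK films
# rankmovies <- filter(rankmovies, imdb_score > 8.0 & budget > 100000000)  # ten most popular films with a budget over USD 100 million
# Most Facebook likes first
rankmovies <- arrange(rankmovies, desc(movie_facebook_likes))
# rankmovies <- arrange(rankmovies, desc(imdb_score))  # ten highest-rated films
head(rankmovies, 10)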

Ten most popular black-and-white films:

  • Forrest Gump
  • Schindler's List
  • Memento
  • 12 Angry Men
  • Amélie
  • American History X
  • The Artist
  • The Pianist
  • Psycho
  • Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb

Ten most popular US films:

  • Interstellar
  • Django Unchained
  • The Revenant
  • Inception
  • The Dark Knight Rises
  • The Martian
  • The Grand Budapest Hotel
  • Her
  • Gone Girl
  • The Wolf of Wall Street

Ten most popular UK films:

  • The Imitation Game
  • Rush
  • The King's Speech
  • In Bruges
  • Snatch
  • 2001: A Space Odyssey
  • Alien
  • Trainspotting
  • Lock, Stock and Two Smoking Barrels

Ten most popular films with a budget over USD 100 million:

  • Interstellar
  • Django Unchained
  • The Revenant
  • Inception
  • The Dark Knight Rises
  • The Martian
  • The Wolf of Wall Street
  • The Avengers
  • Mad Max: Fury Road

Ten highest-rated films:

  • Towering Inferno
  • The Shawshank Redemption
  • The Godfather
  • Dekalog
  • Kickboxer: Vengeance
  • The Dark Knight
  • The Godfather: Part II
  • Fargo
  • Pulp Fiction
  • Schindler's List
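Rankings like the ones above all follow the same pattern: keep the rows that meet every condition, sort in descending order, and take the first few rows. The sketch below shows that pattern in base R. The helper `top_n_by()` and the data frame `toy` are hypothetical, with placeholder titles and made-up values, and are not part of the original analysis.

```r
# Made-up stand-in for rankmovies; titles and values are placeholders, not real data
toy <- data.frame(
  movie_title          = c("A", "B", "C", "D", "E"),
  imdb_score           = c(8.5, 7.9, 8.2, 8.8, 8.1),
  color                = c("Color", " Black and White", " Black and White",
                           "Color", "Black and White"),
  movie_facebook_likes = c(5000, 90000, 12000, 30000, 40000),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows where `keep` is TRUE, sort descending by column `by`,
# and return the first n rows
top_n_by <- function(df, keep, by, n = 10) {
  subset_df <- df[keep & !is.na(keep), , drop = FALSE]
  ordered   <- subset_df[order(subset_df[[by]], decreasing = TRUE), , drop = FALSE]
  head(ordered, n)
}

# Both conditions must hold (&): rating above 8.0 and black-and-white
is_bw  <- trimws(toy$color) == "Black and White"
top_bw <- top_n_by(toy, toy$imdb_score > 8.0 & is_bw, "movie_facebook_likes", n = 2)
print(top_bw$movie_title)

# With | instead of &, a row meeting either condition gets through,
# including film "B" whose rating is below 8.0
either <- top_n_by(toy, toy$imdb_score > 8.0 | is_bw, "movie_facebook_likes")
print(either$movie_title)
```

The second call illustrates why the choice between `&` and `|` matters: with `|`, the ranking is no longer restricted to highly rated films.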
