一眨眼《復聯3》來了,你知道每個復仇者的口頭禪嗎?

不用說你也知道,「復仇者聯盟」里每個成員都性格迥異,所以說話用詞都有各自鮮明的特點。那他們說話都愛用哪些詞兒?

國外有幾位漫威的鐵杆粉絲把每個復仇者的說話習慣用 R 語言可視化了出來,圖中每個詞對應的條形長度,代表了他比其他復仇者更愛說這個詞的程度。

我們可以看到,美隊老愛喊別人名字,特別是托尼(emmmmmm...);黑豹經常念叨一些很高大上的詞(比如朋友,國王),不像蜘蛛俠滿嘴嗯啊個不停(比如嘿,啊,呃),還跟個孩子似的;浩克和鷹眼說的最多的是黑寡婦,不過兩人喊得稱呼卻不同(原因你猜);幻視和緋紅女巫很有共同話題,所以這是倆人互生愛慕的原因?果然,雷神念叨最多的還是老弟洛基,而且老是想著「宇宙大事」,說的話都和第三部《無限戰爭》緊密相關;至於洛基嘛,意料之中的經常嗶嗶「權力」「王位」這些,但是跟洛基一樣也渴望權力的奧創卻說話不一樣,人家說的詞就很有詩意。

這麼有意思的可視化圖形是怎麼做出來的呢?秘笈如下:

首先我們會用到以下 R 語言包:

library(dplyr)library(grid)library(gridExtra)library(ggplot2)library(reshape2)library(cowplot)library(jpeg)library(extrafont)

有些人可能認為使用「清除所有」代碼行很不好,但是在腳本頂部用它可以確保在執行腳本時,腳本不會依賴不小心遺留在工作區內的任何對象。

rm(list = ls())

這是包含所有復仇者圖像的文件夾:

dir_images <- "C:\Users\Matt\Documents\R\Avengers"setwd(dir_images)

設置字體

windowsFonts(Franklin=windowsFont("Franklin Gothic Demi"))

各個復仇者名字的簡化版

character_names <- c("black_panther","black_widow","bucky","captain_america", "falcon","hawkeye","hulk","iron_man", "loki","nick_fury","rhodey","scarlet_witch", "spiderman","thor","ultron","vision")image_filenames <- paste0(character_names, ".jpg")

讀取和簡化版復仇者名字對應的圖像文件的函數

read_image <- function(filename){ char_name <- gsub(pattern = "\.jpg$", "", filename) img <- jpeg::readJPEG(filename) return(img)}

將所有圖像讀取為一個列表

all_images <- lapply(image_filenames, read_image)

為這列圖像分配名字,這樣後面就可以被字元檢索到了

names(all_images) <- character_names

其實使用圖像名字很簡單,比如下面這個例子

# clear the plot windowgrid.newpage()# draw to the plot windowgrid.draw(rasterGrob(all_images[[vision]]))

獲取文本數據

這幾位漫威粉並沒有將他們自己的電影台詞數據集分享出來,不過我們可以在 IMSDB 上下載(imsdb.com/scripts/),然後用文本分析技術稍作處理。如果原作者後面將自己的數據集公開,我們會第一時間分享。

載入本地數據集。

修正人物名字的大小寫

capitalize <- Vectorize(function(string){ substr(string,1,1) <- toupper(substr(string,1,1)) return(string)})proper_noun_list <- c("clint","hydra","steve","tony", "sam","stark","strucker","nat","natasha", "hulk","tesseract", "vision", "loki","avengers","rogers", "cap", "hill")# Run the capitalization functionword_data <- word_data %>% mutate(word = ifelse(word %in% proper_noun_list, capitalize(word), word)) %>% mutate(word = ifelse(word == "jarvis", "JARVIS", word))

注意前面的簡化版人物名字,不要匹配文本數據框中已經處理好格式的人物名字。

unique(word_data$Speaker)## [1] "Black Panther" "Black Widow" "Bucky"## [4] "Captain America" "Falcon" "Hawkeye"## [7] "Hulk" "Iron Man" "Loki"## [10] "Nick Fury" "Rhodey" "Scarlet Witch"## [13] "Spiderman" "Thor" "Ultron"## [16] "Vision"

製作一個查詢表,將簡寫的文件名轉換為美觀的人物名字

character_labeler <- c(`black_panther` = "Black Panther", `black_widow` = "Black Widow", `bucky` = "Bucky", `captain_america` = "Captain America", `falcon` = "Falcon", `hawkeye` = "Hawkeye", `hulk` = "Hulk", `iron_man` = "Iron Man", `loki` = "Loki", `nick_fury` = "Nick Fury", `rhodey` = "Rhodey",`scarlet_witch` ="Scarlet Witch", `spiderman`="Spiderman", `thor`="Thor", `ultron` ="Ultron", `vision` ="Vision")

獲得兩個不同版本的人物名字

其中一個版本用來展示(因為美觀),另一個版本用於簡單的組織和引用圖像文件(因為簡單)。

convert_pretty_to_simple <- Vectorize(function(pretty_name){ # pretty_name = "Vision" simple_name <- names(character_labeler)[character_labeler==pretty_name] # simple_name <- as.vector(simple_name) return(simple_name)})# convert_pretty_to_simple(c("Vision","Thor"))# just for fun, the inverse of that functionconvert_simple_to_pretty <- function(simple_name){ # simple_name = "vision" pretty_name <- character_labeler[simple_name] %>% as.vector() return(pretty_name)}# exampleconvert_simple_to_pretty(c("vision","black_panther"))## [1] "Vision" "Black Panther"

為文本數據框添加簡化版人物名字。

word_data$character <- convert_pretty_to_simple(word_data$Speaker)

為每個人物分配一個主要顏色。

character_palette <- c(`black_panther` = "#51473E", `black_widow` = "#89B9CD", `bucky` = "#6F7279", `captain_america` = "#475D6A", `falcon` = "#863C43", `hawkeye` = "#84707F", `hulk` = "#5F5F3F", `iron_man` = "#9C2728", `loki` = "#3D5C25", `nick_fury` = "#838E86", `rhodey` = "#38454E",`scarlet_witch` ="#620E1B", `spiderman`="#A23A37", `thor`="#323D41", `ultron` ="#64727D", `vision` ="#81414F" )

製作水平方向的條形圖

avengers_bar_plot <- word_data %>% group_by(Speaker) %>% top_n(5, amount) %>% ungroup() %>% mutate(word = reorder(word, amount)) %>% ggplot(aes(x = word, y = amount, fill = character))+ geom_bar(stat = "identity", show.legend = FALSE)+ scale_fill_manual(values = character_palette)+ scale_y_continuous(name ="Log Odds of Word", breaks = c(0,1,2)) + theme(text = element_text(family = "Franklin"), # axis.title.x = element_text(size = rel(1.5)), panel.grid = element_line(colour = NULL), panel.grid.major.y = element_blank(), panel.grid.minor = element_blank(), panel.background = element_rect(fill = "white", colour = "white"))+ # theme(strip.text.x = element_text(size = rel(1.5)))+ xlab("")+ coord_flip()+ facet_wrap(~Speaker, scales = "free_y")avengers_bar_plot

看起來很不錯。

但是我們想畫個更酷炫的圖:用每個復仇者的照片來填充條形圖

也就是說我們只在條形圖區域內展示出復仇者的照片,在條形區域以外的地方則不展示(如下圖所示)。

如果想做到這點,我們需要顯示一個透明的條形,然後在條形的末尾畫一個白色的條形,延伸至圖像邊緣覆蓋人物照片的剩餘部分。

在數據框部分,我們現在想用所需的值的餘數來補充數字值,以實現整體最大化,這樣當把值和餘數相加時,所有數值都會增加到同一最大數值,以同樣的格式將不同行組合到一起。

max_amount <- max(word_data$amount)word_data$remainder <- (max_amount - word_data$amount) + 0.2

只提取每個復仇者說的最多的5個詞

word_data_top5 <- word_data %>% group_by(character) %>% arrange(desc(amount)) %>% slice(1:5) %>% ungroup()

將數量&餘數轉換為長格式

這樣能保證每個人物和所說詞語的匹配關係有兩個 entry,一個用以真實數量(「amount」),一個用以選擇在哪裡結束,達到常見的最大值(「remainder」)。

這會將「amount」和「remainder」重疊為一個單獨的列稱為「variable」,表示是什麼值,而另一個列「value」包含來自這些值中每一個值的數字。

word_data_top5_m <- melt(word_data_top5, measure.vars = c("amount","remainder"))

Variable 是一個值是真實數量還是補充數量的標記。

現在我們按順序將它們放在一起,和在melt函數中的確定它們的順序相反。否則「amount」和「remainder」會以相反的順序展現在圖形中。

word_data_top5_m$variable2 <- factor(word_data_top5_m$variable, levels = rev(levels(word_data_top5_m$variable)))

為一個人物展示前 5 個詞語數據的函數

以簡單的形式聲明人物名字,比如用 black_panther 而不是 Black Panther。

plot_char <- function(character_name){ # example: character_name = "black_panther" # plot details that we might want to fiddle with # thickness of lines between bars bar_outline_size <- 0.5 # transparency of lines between bars bar_outline_alpha <- 0.25 # # The function takes the simple character name, # but here, we convert it to the pretty name, # because well want to use that on the plot. pretty_character_name <- convert_simple_to_pretty(character_name) # Get the image for this character, # from the list of all images. temp_image <- all_images[character_name] # Make a data frame for only this character temp_data <- word_data_top5_m %>% dplyr::filter(character == character_name) %>% mutate(character = character_name) # order the words by frequency # First, make an ordered vector of the most common words # for this character ordered_words <- temp_data %>% mutate(word = as.character(word)) %>% dplyr::filter(variable == "amount") %>% arrange(value) %>% `[[`(., "word") # order the words in a factor, # so that they plot in this order, # rather than alphabetical order temp_data$word = factor(temp_data$word, levels = ordered_words) # Get the max value, # so that the image scales out to the end of the longest bar max_value <- max(temp_data$value) fill_colors <- c(`remainder` = "white", `value` = "white") # Make a grid object out of the characters image character_image <- rasterGrob(all_images[[character_name]], width = unit(1,"npc"), height = unit(1,"npc")) # make the plot for this character output_plot <- ggplot(temp_data)+ aes(x = word, y = value, fill = variable2)+ # add image # draw it completely bottom to top (x), # and completely from left to the the maximum log-odds value (y) # note that x and y are flipped here, # in prep for the coord_flip() annotation_custom(character_image, xmin = -Inf, xmax = Inf, ymin = 0, ymax = max_value) + geom_bar(stat = "identity", color = alpha("white", bar_outline_alpha), size = bar_outline_size, width = 1)+ scale_fill_manual(values = fill_colors)+ theme_classic()+ coord_flip(expand = FALSE)+ # use a facet strip, # to serve as a title, but with color facet_grid(. ~ character, labeller = labeller(character = character_labeler))+ # figure out color swatch for the facet strip fill # using character name to index the color palette # color= NA means theres no outline color. theme(strip.background = element_rect(fill = character_palette[character_name], color = NA))+ # other theme elements theme(strip.text.x = element_text(size = rel(1.15), color = "white"), text = element_text(family = "Franklin"), legend.position = "none", panel.grid = element_blank(), axis.text.x = element_text(size = rel(0.8)))+ # omit the axis title for the individual plot, # because well have one for the entire ensemble theme(axis.title = element_blank()) return(output_plot)}

將 X 軸名稱用為所有復仇者主圖像的名稱。

plot_x_axis_text <- paste("Tendency to use this word more than other characters do", "(units of log odds ratio)", sep = "
")

下面是函數在這裡的工作示例

sample_plot <- plot_char("black_panther")+ theme(axis.title = element_text())+ # x lab is still declared as y lab # because of coord_flip() ylab(plot_x_axis_text)sample_plot

為何我們這裡的水平軸上還帶著非常奇怪的「對數差異比」?

因為隨著數字增大,差異也會隨之增大(具體數學知識這裡不再講述);將它們轉換為對數尺度,可以約束變化幅度的大小,方便我們在屏幕上展示。

如果想將這些對數差異轉化為簡單的概率形式,可以用如下函數:

logit2prob <- function(logit){ odds <- exp(logit) prob <- odds / (1 + odds) return(prob)}

這樣處理後水平軸會如下所示:

logit2prob(seq(0, 2.5, 0.5))## [1] 0.5000000 0.6224593 0.7310586 0.8175745 0.8807971 0.9241418

注意此序列中連續項目之間的差異在慢慢消失:

diff(logit2prob(seq(0, 2.5, 0.5)))## [1] 0.12245933 0.10859925 0.08651590 0.06322260 0.04334474

Okay,現在我們製作出了一個圖···

我們接著將函數應用到列表中所有復仇者身上,將所有繪圖放入一個列表對象。

all_plots <- lapply(character_names, plot_char)

從繪圖中提取軸名稱的函數

不僅僅是文本,還有其它畫出的信息。

你可以選擇提取 X 軸名稱還是 Y 軸名稱:

get_axis_grob <- function(plot_to_pick, which_axis){ # plot_to_pick <- sample_plot tmp <- ggplot_gtable(ggplot_build(plot_to_pick)) # tmp$grobs # find the grob that looks like # it would be the x axis axis_x_index <- which(sapply(tmp$grobs, function(x){ # for all the grobs, # return the index of the one # where you can find the text # "axis.title.x" or "axis.title.y" # based on input argument `which_axis` grepl(paste0("axis.title.",which_axis), x)} )) axis_grob <- tmp$grobs[[axis_x_index]] return(axis_grob)}

提取軸名稱 Grob

px_axis_x <- get_axis_grob(sample_plot, "x")px_axis_y <- get_axis_grob(sample_plot, "y")

下面是使用這些提取的軸的方法:

grid.newpage()grid.draw(px_axis_x)

將所有繪圖排成一個對象

big_plot <- arrangeGrob(grobs = all_plots)

將 X 軸嵌入繪圖的底部,因為每個圖並沒有 X 軸,而我們想讓它們都有 X 軸。

注意這時繪圖會看著很不協調,高度差不多是寬度的十倍。

big_plot_w_x_axis_title <- arrangeGrob(big_plot, px_axis_x, heights = c(10,1))grid.newpage()grid.draw(big_plot_w_x_axis_title)

繪圖所佔的空間大小不一,因為每個圖的辭彙長度不同。

這樣看起來有些混亂。

通常我們會用 facet_grid() 或 facet_wrap() 來確保繪圖整潔有序,但這裡卻不能使用因為每個圖的背景圖各不相同,無法像數據框中的其它列一樣映射到平面上(因為背景圖像實際上並非數據框的一部分)。

使用 cowplot 而非 arrangeGrob

這樣繪圖的軸會垂直對齊:

big_plot_aligned <- cowplot::plot_grid(plotlist = all_plots, align = v, nrow = 4)

和之前一樣,將X軸名稱添加至繪圖對齊後網格的下方。

big_plot_w_x_axis_title_aligned <- arrangeGrob(big_plot_aligned, px_axis_x, heights = c(10,1))

下面是將整體效果圖繪製在屏幕上的方法:

grid.newpage()grid.draw(big_plot_w_x_axis_title_aligned)

很好!

保存最終圖像:

ggsave(big_plot_w_x_axis_title_aligned, file = "Avengers_Word_Usage.png", width = 12, height = 6.3)

這樣,我們就可視化出了《復聯》中各個復仇者都最愛說那些話!


參考資料:

mattwinn.com/Plot_Aveng

推薦閱讀:

數據可視化的開源方案: Superset vs Redash vs Metabase (二)
藉助WebGL三維可視化技術檢索3D動態圖像
人民日報中央廚房獲評「2017年大數據優秀應用案例」
【R圖秀-7】中美俄軍事實力對比
R語言互動式可視化包CanvasXpress

TAG:復仇者聯盟電影 | 景略集智 | 數據可視化 |