Kaggle: A Brief Analysis of the New York City Taxi Trip Data

01-28

Preface

The data in this article comes from a playground-level Kaggle competition that is still running; see

New York City Taxi Trip Duration

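I use the roughly 1.45 million records in train.csv for a basic data analysis exercise. The tool is R.

I drew on the ideas and methods of several experts in the competition's Kernels, and I share the code and some of the charts here.

There are bound to be mistakes and gaps. Comments from friends and teachers are welcome; feel free to message me, and thanks for the likes.

Overview

First, the official field descriptions for these 1.45 million records: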

id - a unique identifier for each trip
vendor_id - a code indicating the provider associated with the trip record
pickup_datetime - date and time when the meter was engaged
dropoff_datetime - date and time when the meter was disengaged
passenger_count - the number of passengers in the vehicle (driver entered value)
pickup_longitude - the longitude where the meter was engaged
pickup_latitude - the latitude where the meter was engaged
dropoff_longitude - the longitude where the meter was disengaged
dropoff_latitude - the latitude where the meter was disengaged
store_and_fwd_flag - This flag indicates whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server - Y=store and forward; N=not a store and forward trip
trip_duration - duration of the trip in seconds

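There are 11 fields in total, covering the pickup and dropoff times, the pickup and dropoff coordinates, the trip duration, the passenger count, how the record was sent (stored and forwarded, or sent directly), and the ID of the data provider.

Analysis

# Packages used

library(dplyr)
library(lubridate)
library(data.table)
library(geosphere)
library(ggplot2)
library(gridExtra)

Part 1: Data overview and preprocessing

# 1. Read the data

train <- as_tibble(fread("train.csv"))

Because the dataset is large, it is read with fread and then converted to a tbl (tibble) object. as_tibble replaces the older tbl_df, which is deprecated in current dplyr.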

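# 2. Basic characteristics of the data

class(train)

glimpse(train)

summary(train)

Observations:
(1) The pickup and dropoff times are of type chr and need to be converted to a datetime type.
(2) vendor_id appears to be either 1 or 2; perhaps this field can be read as a taxi company code (company 1, company 2).
(3) The longest trip duration is 3526282 seconds. Good grief, nearly 41 days??
(4) The median passenger count is 1 and the mean is 1.665, so a first guess is that single-passenger rides are the most common case.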

# 3. Check for missing values

sum(is.na(train))

This is great news: a dataset with no missing values is a lovely thing.
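sum(is.na(train)) gives a single total for the whole table. Had it been non-zero, a per-column count would show where the gaps are. Below is a minimal sketch in base R on a small stand-in data frame; the name `toy` and its values are illustrative assumptions, not rows from train.csv.

```r
# Toy data frame standing in for train; the values are illustrative only.
toy <- data.frame(
  passenger_count = c(1, 2, NA, 1),
  trip_duration   = c(455, NA, 2124, NA)
)

# Total number of missing cells, the same check used on train.
total_na <- sum(is.na(toy))

# Number of missing cells in each column.
na_by_column <- colSums(is.na(toy))

print(total_na)
print(na_by_column)
```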

# 4. Convert the column types of the pickup and dropoff times, the vendor ID, and the passenger count

train <- train %>%
  mutate(pickup_datetime = ymd_hms(pickup_datetime),
         dropoff_datetime = ymd_hms(dropoff_datetime),
         vendor_id = factor(vendor_id),
         passenger_count = factor(passenger_count))

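# 5. Check whether trip_duration agrees with the value computed from the pickup and dropoff times

train %>%
  mutate(check = abs(int_length(interval(pickup_datetime, dropoff_datetime)) - trip_duration) == 0) %>%
  select(check, pickup_datetime, dropoff_datetime, trip_duration) %>%
  group_by(check) %>%
  count()

The result shows that the recorded duration matches the value computed from the pickup and dropoff times.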
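The same consistency check can be sketched in base R with difftime, which may help make clear what is being compared. The three rows below are illustrative stand-ins, and the third one carries a deliberately wrong duration so that the check has something to flag.

```r
# Illustrative records; the third trip_duration is deliberately wrong.
toy <- data.frame(
  pickup_datetime  = c("2016-03-14 17:24:55", "2016-06-12 00:43:35", "2016-01-19 11:35:24"),
  dropoff_datetime = c("2016-03-14 17:32:30", "2016-06-12 00:54:38", "2016-01-19 12:10:48"),
  trip_duration    = c(455, 663, 2000)
)

# Parse the character timestamps as date-times.
pickup  <- as.POSIXct(toy$pickup_datetime,  tz = "UTC")
dropoff <- as.POSIXct(toy$dropoff_datetime, tz = "UTC")

# Elapsed time in seconds between pickup and dropoff.
computed <- as.numeric(difftime(dropoff, pickup, units = "secs"))

# TRUE where the recorded duration matches the computed one.
toy$check <- computed == toy$trip_duration

print(table(toy$check))
```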

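# 6. Add some fields

# (1) From the pickup time we can extract the day of the week, the month, and the hour; add these three fields

train <- train %>%
  mutate(weekday = wday(pickup_datetime, label = TRUE),
         month = month(pickup_datetime, label = TRUE),
         hour = hour(pickup_datetime))

# (2) From the pickup and dropoff coordinates we can derive the distance between the two points (km); dividing it by the trip duration gives the speed; add these two fields

pickuplocation <- train %>%
  select(pickup_longitude, pickup_latitude)
dropofflocation <- train %>%
  select(dropoff_longitude, dropoff_latitude)

# Distance, in km

train <- train %>%
  mutate(distance = distHaversine(pickuplocation, dropofflocation) / 1000)

# Speed, in km/h

train <- train %>%
  mutate(speed = distance / trip_duration * 3600)

Note that the distance here is computed with distHaversine, which returns the shortest (great-circle) distance between the two points under the haversine formula, not the distance actually driven. I won't go into the details here; interested readers can look it up.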
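For readers who want to see what the haversine formula does, here is a minimal base R sketch. The function name `haversine_m` and the sample trip are assumptions made for illustration; the default radius of 6378137 m is, as far as I know, the same default that geosphere::distHaversine uses.

```r
# Haversine great-circle distance in metres between (lon1, lat1) and (lon2, lat2).
# Inputs are in degrees; r is the sphere radius in metres.
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin guards against rounding pushing sqrt(a) slightly above 1.
  2 * r * asin(pmin(1, sqrt(a)))
}

# Hypothetical trip: the coordinates and the duration are made up.
distance_km   <- haversine_m(-73.98, 40.75, -73.96, 40.77) / 1000
trip_duration <- 600                                  # seconds
speed_kmh     <- distance_km / trip_duration * 3600   # km per hour

print(distance_km)
print(speed_kmh)
```

Dividing kilometres by seconds and multiplying by 3600 is the same conversion used for the speed column above.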

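Part 2: Data visualization (single fields)

After the basic inspection and preprocessing in Part 1, we move on to visualizing the data. First, each field on its own:

(1) Look at vendor_id

p1 <- train %>%
  ggplot(aes(vendor_id, fill = vendor_id)) +
  geom_bar() +
  theme(legend.position = "none")

As assumed earlier, there are only two values, and vendor_id 2 has about 100,000 more records than vendor_id 1.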

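(2) Look at the distributions of the pickup and dropoff times, stacked for comparison

p2 <- train %>%
  ggplot(aes(pickup_datetime)) +
  geom_histogram(bins = 250, fill = "orange")
p3 <- train %>%
  ggplot(aes(dropoff_datetime)) +
  geom_histogram(bins = 250, fill = "skyblue")
p23 <- grid.arrange(p2, p3, nrow = 2)

Findings:
(1) All trip records fall between January and June.
(2) Overall the distribution is fairly even, but there is a very clear dip between late January and early February. Zooming in on that interval:

Around the 23rd to the 25th, the number of trip records drops sharply. Taking a taxi is an everyday thing, so the daily number of trips should be roughly stable. Why the drop?
Did fares go up on those days? Was there some campaign, like an Earth Hour for not taking taxis? Or a natural or man-made disaster?
A quick Google search gives the answer: a blizzard!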
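The zoom-in on that dip can be approximated by counting pickups per calendar day inside a date window. The sketch below uses base R on a handful of made-up timestamps; on the real data the vector would be the pickup_datetime column, and the window bounds are my own choice rather than anything taken from the original plot.

```r
# Made-up pickup timestamps standing in for the pickup_datetime column.
pickups <- as.POSIXct(c(
  "2016-01-22 08:15:00", "2016-01-22 19:40:00", "2016-01-22 23:05:00",
  "2016-01-23 10:30:00",
  "2016-01-24 09:00:00", "2016-01-24 18:20:00",
  "2016-02-10 12:00:00"
), tz = "UTC")

# Keep only pickups inside the window of interest (bounds are an assumption).
in_window <- pickups >= as.POSIXct("2016-01-20", tz = "UTC") &
  pickups < as.POSIXct("2016-01-28", tz = "UTC")

# Count pickups per calendar day.
daily <- table(as.Date(pickups[in_window], tz = "UTC"))

print(daily)
```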

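(3) Distribution of the passenger count

p4 <- train %>%
  ggplot(aes(passenger_count, fill = passenger_count)) +
  geom_bar() +
  theme(legend.position = "none") +
  scale_y_sqrt()

Findings:
(1) Single-passenger rides are the most common: of the 1.45 million records, more than 1 million are single-passenger rides.
(2) There are trip records with 0 passengers.
(3) It turns out that US taxis can carry more than four passengers, and trips with 5 or 6 passengers are not rare. Are there taxis in China that seat more than four?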

(4) Distribution of whether the record was sent to the server right away

p5 <- train %>%
  ggplot(aes(store_and_fwd_flag, fill = store_and_fwd_flag)) +
  geom_bar() +
  theme(legend.position = "none") +
  scale_y_log10()

The "not a store and forward trip" case accounts for the large majority.

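(5) Distribution of the trip duration

p6 <- train %>%
  ggplot(aes(trip_duration)) +
  geom_histogram(bins = 10000, fill = "red") +
  coord_cartesian(xlim = c(1, 6000))

Most durations cluster around 1,000 seconds, roughly a quarter of an hour. There are also some trips that are extremely short or extremely long.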
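To put numbers on "extremely short or extremely long", one option is to look at quantiles and flag values outside chosen thresholds. This is only a sketch in base R: the durations are made up (apart from the 3526282-second maximum reported by summary earlier), and the cut-offs of 10 seconds and 24 hours are my own assumptions, not thresholds from the competition.

```r
# Made-up durations in seconds; the last value is the maximum seen in summary(train).
trip_duration <- c(2, 310, 455, 663, 900, 1020, 1225, 2124, 86390, 3526282)

# Median and tail quantiles describe the bulk of the distribution.
duration_quantiles <- quantile(trip_duration, c(0.05, 0.5, 0.95))

# Flag suspicious trips; both thresholds are assumptions for illustration.
too_short <- trip_duration < 10
too_long  <- trip_duration > 24 * 3600

# Convert the longest trip from seconds to days.
longest_days <- max(trip_duration) / 86400

print(duration_quantiles)
print(c(short = sum(too_short), long = sum(too_long)))
print(longest_days)
```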

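(6) Next, look at rides over time. Which hours see the most rides? On which day of the week are taxis busiest?

# Trend in the number of rides by day of the week

p7 <- train %>%
  group_by(weekday) %>%
  count() %>%
  ggplot(aes(weekday, n, group = 1)) +
  geom_line(size = 1.5, color = "lightblue") +
  geom_point(size = 1.5, shape = 17)

# Trend in the number of rides by month

p8 <- train %>%
  group_by(month) %>%
  count() %>%
  ggplot(aes(month, n, group = 1)) +
  geom_line(size = 2, color = "lightblue") +
  geom_point(size = 2, shape = 17)

# Trend in the number of rides by hour of the day

p9 <- train %>%
  mutate(hour = hour(pickup_datetime)) %>%
  group_by(hour) %>%
  count() %>%
  ggplot(aes(hour, n)) +
  geom_line(size = 2, color = "lightblue") +
  geom_point(size = 2, shape = 17)

p789 <- grid.arrange(p7, p8, p9, ncol = 1)

It appears that:
(1) NYC taxis are busiest on Fridays and Saturdays.
(2) March to May has the most rides. Could this be related to the tourist season?
(3) Within a day, the pattern matches everyday intuition: from midnight to 5 a.m. fewer and fewer people take taxis, and in the evening, well, call a taxi, get high!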
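The group-and-count step behind the hourly plot can also be sketched in base R, which may help when checking a single number by hand. The timestamps below are made up, and fixing the factor levels to 0:23 is my own choice so that hours with no rides still show up as zero.

```r
# Made-up pickup timestamps standing in for the pickup_datetime column.
pickups <- as.POSIXct(c(
  "2016-03-14 00:10:00", "2016-03-14 08:05:00", "2016-03-14 08:45:00",
  "2016-03-14 18:30:00", "2016-03-15 18:55:00", "2016-03-15 18:59:00"
), tz = "UTC")

# Extract the hour of day (0 to 23) from each timestamp.
hour <- as.integer(format(pickups, "%H"))

# Count rides per hour, keeping empty hours as zero counts.
trips_per_hour <- table(factor(hour, levels = 0:23))

# Hour with the largest count.
busiest_hour <- as.integer(names(which.max(trips_per_hour)))

print(trips_per_hour)
print(busiest_hour)
```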

(7) Distributions of trip distance and speed, combined

p10 <- train %>%
  ggplot(aes(distance)) +
  geom_histogram(bins = 4000, fill = "red") +
  coord_cartesian(xlim = c(0, 30))
p11 <- train %>%
  ggplot(aes(speed)) +
  geom_histogram(bins = 3000, fill = "orange") +
  coord_cartesian(xlim = c(0, 100))
p1011 <- grid.arrange(p10, p11, ncol = 1)

Findings:
(1) Most trip distances fall in the 1 to 3 km range.
(2) Speeds cluster around 13 to 15 km/h. Good grief, is traffic that bad?

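Part 3: Data visualization (combined fields)

Next, let's combine some fields in the plots and see what else can be learned.

(1) Rides by hour for each of the 7 weekdays, and rides by hour for each of the 6 months, shown together

p12 <- train %>%
  group_by(weekday, hour) %>%
  count() %>%
  ggplot(aes(hour, n, color = weekday)) +
  geom_line(size = 2) +
  scale_color_brewer(palette = "Set1")
p13 <- train %>%
  group_by(month, hour) %>%
  count() %>%
  ggplot(aes(hour, n, color = month)) +
  geom_line(size = 2) +
  scale_color_brewer(palette = "Set2")
p1213 <- grid.arrange(p12, p13, nrow = 2)

Observations:
(1) In the early hours of Saturday and Sunday there are noticeably more rides than on weekdays, while late Sunday night has the fewest rides of the whole week. Settling down for work the next day?
(2) On regular weekdays, there are clearly more morning rides than on weekends, which fits normal commuting patterns.
(3) In the second plot, apart from March having more rides than the other months, there is no other major difference.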

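(2) Trip duration distribution, grouped by passenger count

p14 <- train %>%
  ggplot(aes(passenger_count, trip_duration, color = passenger_count)) +
  geom_boxplot() +
  scale_y_log10() +
  theme(legend.position = "none")

Setting aside the categories 0, 7, 8, and 9, which have very little data,
nothing much stands out... trip duration does not appear to depend on the number of passengers.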
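A boxplot comparison like this can be backed by a per-group median, which is easier to read off than box positions on a log scale. The sketch below uses base R tapply on a made-up data frame; the values are illustrative and chosen so the groups look similar, mirroring the observation above rather than proving it.

```r
# Toy data standing in for train; the values are illustrative only.
toy <- data.frame(
  passenger_count = factor(c(1, 1, 1, 2, 2, 5)),
  trip_duration   = c(400, 600, 800, 500, 700, 650)
)

# Median trip duration (seconds) within each passenger_count group.
median_by_count <- tapply(toy$trip_duration, toy$passenger_count, median)

print(median_by_count)
```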

(3) Relationship between store_and_fwd_flag and vendor_id

p15 <- train %>%
  ggplot(aes(vendor_id, fill = store_and_fwd_flag)) +
  geom_bar(position = "dodge")

The plot shows that vendor_id 2 has no store and forward trips.
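Because a very small bar is easy to miss in a dodged bar chart, a cross-tabulation is a useful way to confirm a "zero records" reading. Here is a minimal base R sketch on made-up rows; the toy data is constructed so that vendor 2 has no "Y" flags, which mirrors the plot and is not taken from train.csv.

```r
# Toy data standing in for train; constructed for illustration only.
toy <- data.frame(
  vendor_id          = factor(c(1, 1, 1, 2, 2, 2)),
  store_and_fwd_flag = c("N", "Y", "N", "N", "N", "N")
)

# Counts for every vendor_id / store_and_fwd_flag combination.
counts <- table(toy$vendor_id, toy$store_and_fwd_flag)

print(counts)

# TRUE when vendor 2 has no store and forward trips.
vendor2_has_no_y <- counts["2", "Y"] == 0
print(vendor2_has_no_y)
```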

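Summary

This is my first hands-on write-up after studying on Kaggle. It borrows code and ideas from a number of other experts, and it mainly records my practice with the R packages dplyr, lubridate, and ggplot2.

I am keenly aware of my own gaps and limits, and I cannot yet carry out a deeper, more systematic analysis.

I hope to meet more people on Zhihu who are learning data analysis. Feel free to message me, and thanks for the likes.

Finally: the road ahead is long, so keep a steady mind!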
