【數據分析·實戰】北京的霧霾是大風吹走的嗎
前言
2013年9月,國務院專門出台治理大氣污染的條例,王安順代表北京市與中央簽訂責任狀,立下壯士斷腕的決心。「也是生死狀,因為中央領導說,2017年實現不了空氣治理就『提頭來見』。既是玩笑話,也說明了這句話的分量很重。」
1月3日,北京市環保局發布了2016年北京市全年空氣質量狀況報告,「2016年我市空氣質量達標天數198天,其中,一級優68天,二級良130天,達標天數較2015年增加12天;2016年共發生重污染39天,其中O3重污染1天,PM2.5重污染38天,較2015年減少7天。」
網友們看到這份報告紛紛調侃,「不過是今年多颳了幾天風罷了」。
空氣質量的改善是政府的功績,還是「風」的垂憐呢?本文將基於天氣以及空氣質量數據,用R語言一探究竟!
一、準備工作
library(RCurl)#現成的數據找不到,需要用爬蟲爬網頁上的數據library(XML)#提取網頁中的表格library(dplyr)#數據清理library(ggplot2)#畫圖library(stringr)#文本處理
二、數據收集
本文的空氣質量數據來自2016年12月北京空氣質量指數查詢(AQI)_12月份北京PM2.5歷史數據查詢_天氣後報,天氣數據來自北京歷史天氣預報查詢_2016年12月份北京天氣記錄_北京2016年12月份天氣情況_天氣後報。
時間跨度為2014-01-01至2016-12-31,總共應該是1096天,但2014年的空氣質量數據有4天缺失,本文根據2014年08月北京空氣質量指數AQI_PM2.5日歷史數據_中國空氣質量在線監測分析平台歷史數據提供的數據予以補全(分別是2014-01-23、2014-03-24、2014-08-08、2014-08-22)。
空氣質量數據和天氣數據的爬蟲代碼:
#Air Qualityairdata <- matrix(data = NA, nrow = 1, ncol = 10)airdata <- data.frame(airdata)colnames(airdata) <- c("date", "airrank", "aqi", "aqirank", "pm2.5", "pm10","no2","so2", "co", "o3")months <- sprintf("%02d", 1:12)years <- 2014:2016for(i in years) { for(j in months) { print(paste(i, j, sep = "")) url_a="http://www.tianqihoubao.com/aqi/beijing-" url_b=".html" url_all=paste(url_a, i, j, url_b, sep = "") temp <- getURL(url_all, .encoding="GBK") temp2 <- iconv(temp,"gb2312","UTF-8") doc <- htmlParse(temp2, asText=T, encoding="UTF-8") tables <-readHTMLTable(doc) airdatatemp <- as.data.frame(tables) colnames(airdatatemp) <- c("date", "airrank", "aqi", "aqirank", "pm2.5", "pm10","no2","so2", "co", "o3") airdata <- rbind(airdata, airdatatemp) }}airdata1 <- c("2014-01-23", "重度污染", "271", "182", "261","263","125", "118", "4.06", "10")airdata2 <- c("2014-03-24", "重度污染", "270", "188", "220","249","93", "53", "1.99", "144")airdata3 <- c("2014-08-08", "輕度污染", "107", "178", "64", "103","45", "8", "1.08", "220")airdata4 <- c("2014-08-22", "良", "80", "98", "51", "80", "58", "3", "0.80", "93")airdata <- rbind(airdata, airdata1, airdata2, airdata3, airdata4)#Windwinddata <- matrix(data = NA, nrow = 1, ncol = 4)winddata <- data.frame(winddata)colnames(winddata) <- c("date", "weather", "temperature","wind")months <- sprintf("%02d", 1:12)years <- 2014:2016for(i in years) { for(j in months) { url_a="http://www.tianqihoubao.com/lishi/beijing/month/" url_b=".html" url_all=paste(url_a, i, j, url_b, sep = "") temp <- getURL(url_all, .encoding="GBK") temp2 <- iconv(temp,"gb2312","UTF-8") doc <- htmlParse(temp2, asText=T, encoding="UTF-8") tables <-readHTMLTable(doc) winddatatemp <- as.data.frame(tables) colnames(winddatatemp) <- c("date", "weather", "temperature","wind") winddata <- rbind(winddata, winddatatemp) }}
三、數據清洗與整理
#根據日期合併兩組數據airdata$date <- as.Date(airdata$date)winddata$date <- as.Date(winddata$date, "%Y年%m月%d日")mydata <- merge(airdata, winddata, by="date")mydata <- na.omit(mydata)mydata <- mydata[-which(duplicated(mydata$date, )),]airdata$date <- as.Date(airdata$date)mydata[,3:10] <- as.numeric(unlist(mydata[,3:10]))#日期等數據整理mydata$weekday <- weekdays(mydata$date, abbreviate = T)mydata$weekday <- factor(mydata$weekday, levels = c("周一","周二","周三","周四","周五","周六","周日"))mydata$month <- months(mydata$date, abbreviate = T)mydata$month <- factor(x = mydata$month, levels = c("1月","2月","3月","4月","5月","6月","7月","8月","9月","10月","11月","12月"))mydata$year <- format(mydata$date, "%Y")mydata$airrank <- factor(x = mydata$airrank, levels =c("優","良","輕度污染","中度污染","重度污染","嚴重污染"))#氣溫數據整理mydata$temp.max <- substr(mydata$temperature, start = 1, stop=str_locate(mydata$temperature, "/")[,1]-1)mydata$temp.min <- substr(mydata$temperature, start = str_locate(mydata$temperature, "/")[,1]+1, stop = nchar(mydata$temperature))mydata$temp.max <- as.numeric(substr(mydata$temp.max, start = 1, stop = nchar(mydata$temp.max)-43))mydata$temp.min <- as.numeric(substr(mydata$temp.min, start = 43, stop =nchar(mydata$temp.min)-1))mydata$temp.avg <- (mydata$temp.max + mydata$temp.min) / 2 #風速數據整理mydata$wind.day <- substr(mydata$wind, start = 1, stop=str_locate(mydata$wind, "/")[,1]-1)mydata$wind.night <- substr(mydata$wind, start =str_locate(mydata$wind, "/")[,1]+1, stop = nchar(mydata$wind))mydata$wind.day[which(is.na(mydata$wind.day))] <- mydata$wind[which(is.na(mydata$wind.day))]mydata$wind.night[which(is.na(mydata$wind.night))] <- mydata$wind[which(is.na(mydata$wind.night))]mydata$wind.day.direction <- str_split_fixed(mydata$wind.day, pattern = " ", n = 2)[,1]mydata$wind.night.direction <- str_split_fixed(mydata$wind.night, pattern = " ", n = 2)[,1]mydata$wind.day.speed <- str_split_fixed(mydata$wind.day, pattern = " ", n = 2)[,2]mydata$wind.day.speed <- str_split_fixed(mydata$wind.day.speed, pattern = "
", n = 2)[,1]mydata$wind.day.speed[which(mydata$wind.day.speed=="")] <- "≤3級"mydata$wind.night.speed <- str_split_fixed(mydata$wind.night, pattern = " ", n = 2)[,2]mydata$wind.night.speed <- str_split_fixed(mydata$wind.night.speed, pattern = "
", n = 2)[,1]mydata$wind.night.speed[which(mydata$wind.night.speed=="")] <- "≤3級"
四、繪圖與分析
首先,我們根據所得到的AQI(空氣質量指數)數據繪製了如下圖所示的箱線圖+小提琴圖
ggplot(data = mydata, aes(x = year, y = aqi)) + geom_violin(fill="lightblue") + geom_boxplot(fill="lightgreen", width_=.2) + labs(x="",y="", title="北京AQI分布2014-2016") + theme(plot.title = element_text(hjust=0.5, size = 15, face = "bold"))
可以很明顯地看到,2014-2016年北京的空氣質量整體上有所改善,AQI四分位線、平均值線均向下移動,總體的分布也逐漸向下偏。
參與空氣質量評價的主要污染物為細顆粒物、可吸入顆粒物、二氧化硫、二氧化氮、臭氧、一氧化碳等。其中PM2.5最受關注,秋冬時節PM2.5爆表的新聞常見於報端,根據同樣的方法我們構造了PM2.5的趨勢圖,結果與AQI類似。
ggplot(data = mydata, aes(x = year, y = pm2.5)) + geom_violin(fill="lightblue") + geom_boxplot(fill="lightgreen", width_=.2) + labs(x="",y="", title="北京PM2.5分布2014-2016") + theme(plot.title = element_text(hjust=0.5, size = 15, face = "bold"))
那我們不禁要問了,近年來北京空氣質量的改善,「風」——這一因素到底起了多大的作用呢?
首先我們考察不同風速情況下的PM2.5分布,看看大風是不是真的吹走了霧霾。顯然,空氣污染(PM2.5>100)主要分布在風速小於等於3級的氣象條件下,風力大於3級的日子空氣污染較少。但如果從各風速看,白天和夜間呈現不同的特徵,白天風速越大,PM2.5均值水平越低;夜間風速越大,PM2.5均值水平變化不太大(5-6級、6-7級均值水平較高,可能與樣本太少有關)。這其中的道理,不太清楚。
ggplot(data = mydata, aes(x = wind.day.speed, y = pm2.5)) + geom_boxplot(fill="cornflowerblue", col=1, na.rm = T) + geom_point(position = "jitter", alpha=0.3, col="blue", na.rm = T) + labs(x="",y="", title="白天風速與PM2.5") + theme(plot.title = element_text(hjust=0.5, size = 15, face = "bold"))ggplot(data = mydata, aes(x = wind.night.speed, y = pm2.5)) + geom_boxplot(fill="cornflowerblue", col=1, na.rm = T) + geom_point(position = "jitter", alpha=0.3, col="blue", na.rm = T) + labs(x="",y="", title="夜間風速與PM2.5") + theme(plot.title = element_text(hjust=0.5, size = 15, face = "bold"))
進一步地我們對比下不同年份各空氣質量等級天數與各風速天數。可以看到北京的空氣質量整體向好,且趨勢較為明顯:空氣質量為優的天數明顯增加,輕度污染的天數有所增加;良、中度污染的天數大致不變;中度、嚴重污染的天數則有所減少。而反觀214-2016年的整體風速情況,則無法看到的較為明顯的趨勢,相對來說,2014年的整體風速較小,而2015年和2016年則大致相當。
ggplot(data = mydata, aes(x = year, fill=airrank)) + geom_bar(col=1) + scale_fill_brewer(palette = "YlOrRd") + labs(x="",y="", title="空氣質量2014-2016") + theme(plot.title = element_text(hjust=0.5, size = 15, face = "bold")) ggplot(data = mydata, aes(x = year, fill=wind.day.speed)) + geom_bar(col=1, width = 0.4) + scale_fill_brewer() + labs(x="",y="", title="白天風速2014-2016") + theme(plot.title = element_text(hjust=0.5, size = 15, face = "bold")) ggplot(data = mydata, aes(x = year, fill=wind.night.speed)) + geom_bar(col=1, width = 0.4) + scale_fill_brewer() + labs(x="",y="", title="夜間風速2014-2016") + theme(plot.title = element_text(hjust=0.5, size = 15, face = "bold"))
綜合本文的分析來看,2014-2016年北京的空氣質量確實是在逐漸改善,而且這其中「風」並沒有扮演十分重要的角色,所以呢...
推薦閱讀:
※R語言可視化學習筆記之添加p-value和顯著性標記
※左手用R右手Python系列——任務進度管理
※流量結構分布圖——炫酷和弦圖
※[原]海納百川 有容乃大:SparkR與Docker的機器學習實戰