從頭學習大數據培訓課程 數據倉儲工具 hive(五)hive 的 grouping sets、排序、窗口函數用法
1. GROUPING SETS
語法
SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS ( (a, b), a, b, () )等於SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, bUNION ALLSELECT a, null, SUM( c ) FROM tab1 GROUP BY a
UNION ALLSELECT null, b, SUM( c ) FROM tab1 GROUP BY bUNION ALLSELECT null, null, SUM( c ) FROM tab1示例
select country,gpcategory,count(1) from user_install_status_mid where dt=20141228group by country,gpcategorygrouping sets((country,gpcategory),(country),(gpcategory),()) limit 10;等於select * from (
select country,gpcategory,count(1) from user_install_status_mid where dt=20141228group by country,gpcategoryunion allselect country,null,count(1) from user_install_status_mid where dt=20141228group by countryunion allselect null,gpcategory,count(1) from user_install_status_mid where dt=20141228group by ,gpcategoryunion allselect null,null,count(1) from user_install_status_mid where dt=20141228
) a limit 10;grouping sets加上if,case用法
selectif(country is not null,country,ALL) as country,case when gpcategory is not null then gpcategory else ALL end as gpcategory,count(1) as numfrom user_install_status_mid where dt=20141228group by country,gpcategorygrouping sets((country,gpcategory),(country),(gpcategory),()) limit 100;
with cube是group by中所有key的組合
selectif(country is not null,country,ALL) as country,case when gpcategory is not null then gpcategory else ALL end as gpcategory,count(1) as num,from user_install_status_mid where dt=20141228group by country,gpcategorywith cube limit 100;等於
selectif(country is not null,country,ALL) as country,case when gpcategory is not null then gpcategory else ALL end as gpcategory,count(1) as numfrom user_install_status_mid where dt=20141228group by country,gpcategorygrouping sets((country,gpcategory),(country),(gpcategory),()) limit 100;with rollup是按右側遞減的順序組合
selectif(country is not null,country,ALL) as country,
case when gpcategory is not null then gpcategory else ALL end as gpcategory,count(1) as numfrom user_install_status_mid where dt=20141228group by country,gpcategorywith ROLLUP;等於selectif(country is not null,country,ALL) as country,case when gpcategory is not null then gpcategory else ALL end as gpcategory,count(1) as num
from user_install_status_mid where dt=20141228group by country,gpcategorygrouping sets((country,gpcategory),(country)) limit 100;例如group by a,b,c,d,e,f,g
就等於grouping sets((a,b,c,d,e,f,g),(a,b,c,d,e,f),(a,b,c,d,e),(a,b,c,d),(a,b,c),(a,b),(a))2. 排序
order by 會對輸入做全局排序,因此只有一個reducer
sort by不是全局排序,其在數據進入reducer前完成排序。 因此,如果用sort by進行排序,並且設置mapred.reduce.tasks>1, 則sort by只保證每個reducer的輸出有序,不保證全局有序。distribute by (類似於分桶),就是把相同的key分到一個reducer中,根據distribute by指定的欄位對數據進行劃分到不同的輸出reduce 文件中。CLUSTER BY等於distribute by加上sort by
示例
order by只能有一個reducer,設置了reducer也不起作用set mapred.reduce.tasks=42;select * from user_install_status where dt=20141228 order by uptime limit 100;設置的reduce數量不起作用sort by只能保證單個文件內有序,如果設置成一個reducer那作用和order是一樣的
set mapred.reduce.tasks=42;insert overwrite directory /user/hainiu/user_install_status_select_sortselect * from user_install_status where dt=20141228 sort by uptime;
distribute by and sort by = cluster by
set mapred.reduce.tasks=42;insert overwrite directory /user/hainiu/user_install_status_select_clusterselect from user_install_status where dt=20141228 cluster by country;等於select from user_install_status where dt=20141228 distribute byh country sort by country;cat ./000010_0_select_cluster |awk -F 01 {print $5}
在select結果中cluster by可以保證組內是有序的
在桶表中clustered by組內是否有序要根據sort by決定
桶表的3. 窗口函數
ROW_NUMBER()統計當前行號,從1開始;
示例
insert overwrite directory /user/hainiu/user_install_status_over_partitionselect country,row_number() over(partition by country order by uptime) as rnfrom user_install_status where dt=20141228;select a.country,a.gpcategory from
(select from user_install_status where dt=20141228) ainner join(select country,min(uptime) as minu form user_install_status where dt=20141228 group by country) bon a.country = b.country and a.uptime = b.minu;等於select from(select country,gpcategory,row_number() over(partition by country order by uptime) as rnfrom user_install_status where dt=20141228) awhere a.rn=1;數據集
建表
查詢
select country,gpcategory,n1,n2,row_number() over(partition by country order by n1 desc) as rn,row_number() over(partition by country order by n2 desc) as rn1from test;結果over中partition by和distribute by區別
示例insert overwrite directory /user/hainiu/user_install_status_over_distributeselect country,row_number() over(distribute by country sort by uptime) as rnfrom user_install_status where dt=20141228;partition by [key..] order by [key..]只能在窗口函數中使用,而distribute by [key...] sort by [key...]在窗口函數和select中都可以使用
窗口函數中兩者是沒有區別的where後面不能用partition by示例select * from user_install_status where dt=20141228 partition by country order by uptime;RANK()為排名統計,若進行排名的值相同則後面的值和前面的值保持一樣,當出現不同時就從真實的行數開始計算
DENSE_RANK()與RANK()類似,只是出現不同時按位數計算結果示例select country,n,RANK() over (partition by country order by n) as rank,DENSE_RANK() over (partition by country order by n) as drankfrom(select country,count(1) n from user_install_status_mid where dt=20141228 group by country,gpcategory) a;NTILE()分片函數,將數據等分為N份,示例如下,將整個表按照count從大到小進行排序並切分成5份。
示例select country,n,NTILE(5) over (order by n desc) t from(select country,count(1) n from user_install_status_mid where dt=20141228 group by country) a;驗證是否等分
示例select t,count(1) from(select country,n,NTILE(5) over (order by n desc) t from(select country,count(1) n from user_install_status_mid where dt=20141228 group by country) a) b group by t;加上partition就是在窗口內進行分片
示例select country,n,NTILE(2) over (partition by country order by n desc) t,gpcategory from(select country,gpcategory,count(1) n from user_install_status_mid where dt=20141228 group by country,gpcategory) a;CUME_DIST和PERCENT_RANK函數,用來計算分組中的位置,數值介於0和1之間。分組歸一化
CUME_DIST的計算方式為分組所在位置/分組大小。PERCENT_RANK的計算方式為(分組中順序-1)/(分組大小-1)ROWS窗口函數中的行選擇器 rows between [n|unbounded preceding]|[n|unbounded following]|[current row] and [n|unbounded preceding]|[n|unbounded following]|[current row]
參數解釋:n行數unbounded不限行數preceding在前N行following在後N行current row當前行示例
select country,n,sum(n) over (partition by country order by n asc ROWS BETWEEN 1 preceding AND 1 following) t,gpcategory from(select country,gpcategory,count(1) n from user_install_status_mid where dt=20141228 group by country,gpcategory) a;select country,n,sum(n) over (partition by country order by n asc ROWS between unbounded preceding and current row) t,gpcategory from
(select country,gpcategory,count(1) n from user_install_status_mid where dt=20141228 group by country,gpcategory) a where a.country=CN;select country,n,sum(n) over (partition by country order by n desc ROWS between current row and unbounded following) t,gpcategory from
(select country,gpcategory,count(1) n from user_install_status_mid where dt=20141228 group by country,gpcategory) a where a.country=CN;select country,n,sum(n) over (partition by country order by n desc ROWS between unbounded preceding and current row) t,gpcategory from
(select country,gpcategory,count(1) n from user_install_status_mid where dt=20141228 group by country,gpcategory) a where a.country=CN;推薦閱讀:
※大數據雙創行動丨校園裡誕生的大數據創客
※定位大數據分析平台,Kyligence憑開源優勢謀全球業務擴張
※下一次工業革命來了,你知道他是誰么?
※大數據的狂妄和南牆 | 方承志
※別把我們的勞動,當作你們的資本——對用戶數據性質的新探討