從頭學習大數據培訓課程 數據倉儲工具 hive(五)hive 的 grouping sets、排序、窗口函數用法

1. GROUPING SETS

語法

SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS ( (a, b), a, b, () )

等於

SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b

UNION ALL

SELECT a, null, SUM( c ) FROM tab1 GROUP BY a

UNION ALL

SELECT null, b, SUM( c ) FROM tab1 GROUP BY b

UNION ALL

SELECT null, null, SUM( c ) FROM tab1

示例

select country,gpcategory,count(1) from user_install_status_mid where dt=20141228

group by country,gpcategory

grouping sets((country,gpcategory),(country),(gpcategory),()) limit 10;

等於

select * from (

select country,gpcategory,count(1) from user_install_status_mid where dt=20141228

group by country,gpcategory

union all

select country,null,count(1) from user_install_status_mid where dt=20141228

group by country

union all

select null,gpcategory,count(1) from user_install_status_mid where dt=20141228

group by ,gpcategory

union all

select null,null,count(1) from user_install_status_mid where dt=20141228

) a limit 10;

grouping sets加上if,case用法

select

if(country is not null,country,ALL) as country,

case when gpcategory is not null then gpcategory else ALL end as gpcategory,

count(1) as num

from user_install_status_mid where dt=20141228

group by country,gpcategory

grouping sets((country,gpcategory),(country),(gpcategory),()) limit 100;

with cube是group by中所有key的組合

select

if(country is not null,country,ALL) as country,

case when gpcategory is not null then gpcategory else ALL end as gpcategory,

count(1) as num,

from user_install_status_mid where dt=20141228

group by country,gpcategory

with cube limit 100;

等於

select

if(country is not null,country,ALL) as country,

case when gpcategory is not null then gpcategory else ALL end as gpcategory,

count(1) as num

from user_install_status_mid where dt=20141228

group by country,gpcategory

grouping sets((country,gpcategory),(country),(gpcategory),()) limit 100;

with rollup是按右側遞減的順序組合

select

if(country is not null,country,ALL) as country,

case when gpcategory is not null then gpcategory else ALL end as gpcategory,

count(1) as num

from user_install_status_mid where dt=20141228

group by country,gpcategory

with ROLLUP;

等於

select

if(country is not null,country,ALL) as country,

case when gpcategory is not null then gpcategory else ALL end as gpcategory,

count(1) as num

from user_install_status_mid where dt=20141228

group by country,gpcategory

grouping sets((country,gpcategory),(country)) limit 100;

例如group by a,b,c,d,e,f,g

就等於

grouping sets((a,b,c,d,e,f,g),(a,b,c,d,e,f),(a,b,c,d,e),(a,b,c,d),(a,b,c),(a,b),(a))

2. 排序

order by 會對輸入做全局排序,因此只有一個reducer

sort by不是全局排序,其在數據進入reducer前完成排序。 因此,如果用sort by進行排序,並且設置mapred.reduce.tasks>1, 則sort by只保證每個reducer的輸出有序,不保證全局有序。

distribute by (類似於分桶),就是把相同的key分到一個reducer中,根據distribute by指定的欄位對數據進行劃分到不同的輸出reduce 文件中。

CLUSTER BY等於distribute by加上sort by

示例

order by只能有一個reducer,設置了reducer也不起作用

set mapred.reduce.tasks=42;

select * from user_install_status where dt=20141228 order by uptime limit 100;

設置的reduce數量不起作用

sort by只能保證單個文件內有序,如果設置成一個reducer那作用和order是一樣的

set mapred.reduce.tasks=42;

insert overwrite directory /user/hainiu/user_install_status_select_sort

select * from user_install_status where dt=20141228 sort by uptime;

distribute by and sort by = cluster by

set mapred.reduce.tasks=42;

insert overwrite directory /user/hainiu/user_install_status_select_cluster

select from user_install_status where dt=20141228 cluster by country;

等於

select from user_install_status where dt=20141228 distribute byh country sort by country;

cat ./000010_0_select_cluster |awk -F 01 {print $5}

在select結果中cluster by可以保證組內是有序的

在桶表中clustered by組內是否有序要根據sort by決定

桶表的

3. 窗口函數

ROW_NUMBER()統計當前行號,從1開始;

示例

insert overwrite directory /user/hainiu/user_install_status_over_partition

select country,row_number() over(partition by country order by uptime) as rn

from user_install_status where dt=20141228;

select a.country,a.gpcategory from

(select from user_install_status where dt=20141228) a

inner join

(select country,min(uptime) as minu form user_install_status where dt=20141228 group by country) b

on a.country = b.country and a.uptime = b.minu;

等於

select from(

select country,gpcategory,row_number() over(partition by country order by uptime) as rn

from user_install_status where dt=20141228) a

where a.rn=1;

數據集

建表

查詢

select country,gpcategory,n1,n2,

row_number() over(partition by country order by n1 desc) as rn,

row_number() over(partition by country order by n2 desc) as rn1

from test;

結果

over中partition by和distribute by區別

示例

insert overwrite directory /user/hainiu/user_install_status_over_distribute

select country,row_number() over(distribute by country sort by uptime) as rn

from user_install_status where dt=20141228;

partition by [key..] order by [key..]只能在窗口函數中使用,而distribute by [key...] sort by [key...]在窗口函數和select中都可以使用

窗口函數中兩者是沒有區別的

where後面不能用partition by

示例

select * from user_install_status where dt=20141228 partition by country order by uptime;

RANK()為排名統計,若進行排名的值相同則後面的值和前面的值保持一樣,當出現不同時就從真實的行數開始計算

DENSE_RANK()與RANK()類似,只是出現不同時按位數計算

結果示例

select country,n,

RANK() over (partition by country order by n) as rank,

DENSE_RANK() over (partition by country order by n) as drank

from(

select country,count(1) n from user_install_status_mid where dt=20141228 group by country,gpcategory

) a;

NTILE()分片函數,將數據等分為N份,示例如下,將整個表按照count從大到小進行排序並切分成5份。

示例

select country,n,NTILE(5) over (order by n desc) t from

(select country,count(1) n from user_install_status_mid where dt=20141228 group by country) a;

驗證是否等分

示例

select t,count(1) from

(

select country,n,NTILE(5) over (order by n desc) t from

(select country,count(1) n from user_install_status_mid where dt=20141228 group by country) a

) b group by t;

加上partition就是在窗口內進行分片

示例

select country,n,NTILE(2) over (partition by country order by n desc) t,gpcategory from

(select country,gpcategory,count(1) n from user_install_status_mid where dt=20141228 group by country,gpcategory) a;

CUME_DIST和PERCENT_RANK函數,用來計算分組中的位置,數值介於0和1之間。分組歸一化

CUME_DIST的計算方式為分組所在位置/分組大小。

PERCENT_RANK的計算方式為(分組中順序-1)/(分組大小-1)

ROWS窗口函數中的行選擇器 rows between [n|unbounded preceding]|[n|unbounded following]|[current row] and [n|unbounded preceding]|[n|unbounded following]|[current row]

參數解釋:

n行數

unbounded不限行數

preceding在前N行

following在後N行

current row當前行

示例

select country,n,sum(n) over (partition by country order by n asc ROWS BETWEEN 1 preceding AND 1 following) t,gpcategory from

(select country,gpcategory,count(1) n from user_install_status_mid where dt=20141228 group by country,gpcategory) a;

select country,n,sum(n) over (partition by country order by n asc ROWS between unbounded preceding and current row) t,gpcategory from

(select country,gpcategory,count(1) n from user_install_status_mid where dt=20141228 group by country,gpcategory) a where a.country=CN;

select country,n,sum(n) over (partition by country order by n desc ROWS between current row and unbounded following) t,gpcategory from

(select country,gpcategory,count(1) n from user_install_status_mid where dt=20141228 group by country,gpcategory) a where a.country=CN;

select country,n,sum(n) over (partition by country order by n desc ROWS between unbounded preceding and current row) t,gpcategory from

(select country,gpcategory,count(1) n from user_install_status_mid where dt=20141228 group by country,gpcategory) a where a.country=CN;


推薦閱讀:

大數據雙創行動丨校園裡誕生的大數據創客
定位大數據分析平台,Kyligence憑開源優勢謀全球業務擴張
下一次工業革命來了,你知道他是誰么?
大數據的狂妄和南牆 | 方承志
別把我們的勞動,當作你們的資本——對用戶數據性質的新探討

TAG:Hive | Hadoop | 大數據 |