從頭學習大數據培訓課程數據倉儲工具 hive（三）hive 的 orc 文件、bucket 使用、表操作、關聯查詢

02-09

1. ORC文件

一、定義

　　ORC File，它的全名是Optimized Row Columnar (ORC) file，其實就是對RCFile做了一些優化。據官方文檔介紹，這種文件格式可以提供一種高效的方法來存儲Hive數據。它的設計目標是來克服Hive其他格式的缺陷。運用ORC File可以提高Hive的讀、寫以及處理數據的性能。

和RCFile格式相比，ORC File格式最主要的優點是在文件中存儲了一些輕量級的索引數據

二、ORC File文件結構

　　ORC File包含一組組的行數據，稱為stripes，除此之外，ORC File的file footer還包含一些額外的輔助信息。在ORC File文件的最後，有一個被稱為postscript的區，它主要是用來存儲壓縮參數及壓縮頁腳的大小。在默認情況下，一個stripe的大小為250MB。大尺寸的stripes使得從HDFS讀數據更高效。

　　在file footer裡面包含了該ORC File文件中stripes的信息，每個stripe中有多少行，以及每列的數據類型。當然，它裡面還包含了列級別的一些聚合的結果，比如：count, min, max, and sum。下圖顯示出可ORC File文件結構：

ORC File Format

三、Stripe結構

　　從上圖我們可以看出，每個Stripe都包含index data、row data以及stripe footer。Stripe footer包含流位置的目錄；Row data在表掃描的時候會用到。

　　Index data包含每列的最大和最小值以及每列所在的行。行索引裡面提供了偏移量，它可以跳到正確的壓縮塊位置。具有相對頻繁的行索引，使得在stripe中快速讀取的過程中可以跳過很多行，儘管這個stripe的大小很大。在默認情況下，最大可以跳過10000行。

建ORC表

CREATE EXTERNAL TABLE user_install_status_other(

aid string COMMENT from deserializer,

pkgname string COMMENT from deserializer,

uptime bigint COMMENT from deserializer,

type int COMMENT from deserializer,

country string COMMENT from deserializer,

gpcategory string COMMENT from deserializer)

PARTITIONED BY (dt string)

ROW FORMAT SERDE org.apache.hadoop.hive.ql.io.orc.OrcSerde

STORED AS INPUTFORMAT org.apache.hadoop.hive.ql.io.orc.OrcInputFormat

OUTPUTFORMAT org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat

LOCATION hdfs://ns1/data/hainiu/user_install_status_other

TBLPROPERTIES (orc.compress=SNAPPY, orc.create.index=true);

導入數據

alter table user_install_status_other add if not exists partition (dt=20141228);

insert overwrite table user_install_status_other partition(dt=20141228)

select

aid,

pkgname,

uptime,

type,

country,

gpcategory

from user_install_status

where dt=20141228;

再建一個txt格式的表進行速度對比

CREATE EXTERNAL TABLE user_install_status_txt(

aid string,

pkgname string,

uptime bigint,

type int,

country string,

gpcategory string)

PARTITIONED BY (dt string)

STORED AS TEXTFILE

LOCATION hdfs://ns1/data/hainiu/user_install_status_txt;

alter table user_install_status_txt add if not exists partition (dt=20141228);

insert overwrite table user_install_status_txt partition(dt=20141228)

select

aid,

pkgname,

uptime,

type,

country,

gpcategory

from user_install_status

where dt=20141228;

查詢對比

select * from user_install_status_other where country = CN;

select * from user_install_status_txt where country = CN;

容量對比

2. Bucket表使用

CREATE EXTERNAL TABLE buckets_table(

col ....)

COMMENT This is the buckets_table table

PARTITIONED BY (dt string)

CLUSTERED BY(col1) [SORTED BY(col2 [asc|desc])] INTO 2 BUCKETS

LOCATION .....

select country,count(1) n from user_install_status_other where dt=20141228 group by country order by n desc;

示例：

CREATE EXTERNAL TABLE user_install_status_buckets(

aid string,

pkgname string,

uptime bigint,

type int,

country string,

gpcategory string)

COMMENT This is the buckets_table table

PARTITIONED BY (dt string)

CLUSTERED BY(country) SORTED BY(uptime desc) INTO 42 BUCKETS

LOCATION hdfs://ns1/data/hainiu/user_install_status_buckets;

alter table user_install_status_buckets add if not exists partition (dt=20141228);

insert overwrite table user_install_status_buckets partition(dt=20141228)

select

aid,

pkgname,

uptime,

type,

country,

gpcategory

from user_install_status

where dt=20141228;

生成42個reducer因為桶的數設置的是42

任務會成形成42個文件

拿其中最小的看一下，是否包含了某個國家的全部數據

比如GM的全部條數就都在這個桶文件中

注意

Hive 0.6.0以前

set hive.enforce.bucketing=true;

可以根據bucket的個數的設置reducer的數量

set hive.enforce.sorting=true;

有排序的桶需要設置一下

列印全部參數用set

抽樣函數用法：

table_sample: TABLESAMPLE (BUCKET x OUT OF y [ON colname])

select * from buckets_table tablesample(bucket 11 out of 42 on country)

注意：

TABLESAMPLE語句可以跟在select…from…之後

桶的編號x從1開始計數

舉例：

select count(1) from user_install_status_buckets tablesample(bucket 11 out of 42 on country);

select count(1) from user_install_status_txt tablesample(bucket 11 out of 42 on country);

3. 表SQL操作

通過select數據集創建表

create table table_name stored as orc

select ......

只能是內部表，不支持分區，分桶

示例：

create table user_install_status_txt1 as

select

aid,

pkgname,

uptime,

type,

country,

gpcategory

from user_install_status

where dt=20141228;

select count(1) from user_install_status_txt1;

複製一個空表

CREATE TABLE empty_key_value_store

LIKE key_value_store;

顯示所有表：

SHOW TABLES;

按正條件（正則表達式）顯示錶，

SHOW TABLES .*s;

刪除表

DROP TABLE [IF EXISTS] table_name [PURGE];

drop table user_install_status_txt2;

對於內部表來說，刪除表的操作的本質是刪除表的元數據，然後hadoop fs -rm(將表數據挪到回收站目錄， .Trash/Current)，如果增加PURGE則不進回收站數據直接刪除，不能恢復；對外部表而言只刪除元數據，不刪除數據目錄，加PURGE也不刪。

清除表

TRUNCATE TABLE table_name [PARTITION partition_spec];

可以刪除表和刪除分區數據，和drop的區別是不刪除元數據，只刪除數據,外部表是不能truncate操作的。

刪除partitions

truncate table user_install_status_txt partition(dt=20141228);

外部表不支持使用TRUNCATE語句

常用修改表操作

改表名

ALTER TABLE table_name RENAME TO new_table_name;

ALTER TABLE user_install_status_txt1 RENAME TO user_install_status_txt2;

內部表修改了表名之後，表對應的存儲文件地址也跟著改

ALTER TABLE user_install_status_txt RENAME TO user_install_status_txt1;

外部表不會改對應的location地址

修改表屬性

ALTER TABLE table_name SET TBLPROPERTIES table_properties;

示例：

alter table user_install_status_txt1 set tblproperties (comment = txt table);

增加表分區

ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec [LOCATION location1] partition_spec [LOCATION location2] ...;

分區表如果不是sql load數據，數據目錄在另一個環節產生則需要手工增加分區操作。比如user_dat 每天由ETL服務生成數據，放到相應日期目錄，需要sql增加分區才能查詢使用。

alter table user_install_status_txt1 add if not exists partition (dt=20170602) location 20170602 partition (dt=20170601) location 20170601;

刪除分區

ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec[, PARTITION partition_spec, ...];

alter table user_install_status_txt1 drop if exists partition (dt=20170602), partition (dt=20170601);

外部表刪除分區不刪除分區對應的目錄

修改表或分區的路徑

ALTER TABLE table_name [PARTITION partition_spec] SET LOCATION "new location";

alter table user_install_status_txt1 PARTITION (dt=20141228) set location "hdfs://ns1/data/hainiu/user_install_status_other/dt=20141228";

通過查看hive在mysql的元數據發現分區對應的HDFS地址已經改變

select t.TBL_NAME,p.PART_NAME,s.LOCATION,d.NAME as dbname from PARTITIONS p

inner join SDS s on s.SD_ID = p.SD_ID

inner join TBLS t on p.TBL_ID=t.TBL_ID

inner join DBS d on t.DB_ID = d.DB_ID

where s.LOCATION like %other%;

因為這個表的formatclass是text再，新location目錄下的是orc格式的，所以數據讀出來格式是亂的

分區重命名

ALTER TABLE table_name PARTITION partition_spec RENAME TO PARTITION partition_spec;

alter table user_install_status_txt1 partition (dt=20141228) rename to partition (dt=20170602);

增加欄位

ALTER TABLE table_name [PARTITION partition_spec] ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)

alter table user_install_status_txt1 add columns (test_col string);

ADD COLUMNS 可以在表列的最後和分區欄位前面增加欄位

REPLACE COLUMNS 使用新的列集合替換現有的數據列

alter table user_install_status_txt1 REPLACE columns (test_col string);

alter table user_install_status_txt1 REPLACE columns (aid string,pkgname string,uptime bigint,typeint,country string,gpcategory string);

修改欄位

ALTER TABLE table_name [PARTITION partition_spec] CHANGE [COLUMN] col_old_name col_new_name column_type

[COMMENT col_comment] [FIRST|AFTER column_name]

alter table user_install_status_txt1 CHANGE COLUMN aid new_iad string;

視圖操作

CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT column_comment], ...) ]

[COMMENT view_comment]

[TBLPROPERTIES (property_name = property_value, ...)]

AS SELECT ...;

視圖列和類型可以不指定，如果不指定則有select集合列名和類型決定。視圖只能有隻讀操作不能進行 LOAD/INSERT/ALTER操作。 alter view 和重建效果一致。

示例：

創建視圖

CREATE VIEW user_install_status_view

select

aid,

pkgname,

uptime,

type,

country,

gpcategory

from user_install_status

where dt=20141228;

SQL1：select from user_install_status_view limit 10;

SQL2：select from (

select

aid,

pkgname,

uptime,

type,

country,

gpcategory

from user_install_status

where dt=20141228) limit 10;

SQL1使用視圖查詢與SQL2的含義是一樣的

數據載入

LOAD DATA [LOCAL] INPATH filepath [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]

hive 的數據載入不會對數據文件做任何處理，只是將文件或目錄中的所有文件拷貝到表定義的數據目錄。

注意事項：

filepath 不能包含子目錄

指定LOCAL本地文件上傳，如果沒有指定LOCAL，則使用表一樣的文件系統 (本地文件不刪除)

文件載入hive沒有做嚴格校驗，文件格式和壓縮選項等匹配需要用戶自己保證

分區表要指定具體載入數據分區

如果指定OVERWRITE會覆蓋相應表數據或分區數據，相當於 rm 原有目錄數據，然後上傳新數據文件

HDFS執行MV操作

load data INPATH hdfs://ns1/data/hainiu/user_install_status_other/dt=20141228 overwrite into table user_install_status_txt1 PARTITION (dt=20150506);

load data LOCAL INPATH /home/qingniu/000010_0 overwrite into table user_install_status_txt1 PARTITION (dt=20150507);

通過select載入數據

INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;

示例：

當加了 IF NOT EXISTS，如果存在分區就跳過下面的select語句

insert overwrite table user_install_status_txt1 partition(dt=20170602) IF NOT EXISTS

select

aid,

pkgname,

uptime,

type,

country,

gpcategory

from user_install_status

where dt=20141228;

INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;

insert into table user_install_status_txt1 partition(dt=20170602)

select

aid,

pkgname,

uptime,

type,

country,

gpcategory

from user_install_status

where dt=20141228;

INSERT OVERWRITE 會覆蓋表或分區數據，但覆蓋分區時增加IF NOT EXISTS，如果分區已經存在則不會覆蓋。 INSERT INTO 向表或分區追加數據，不影響歷史數據。

動態分區

set hive.exec.dynamic.partition=true;

set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table table1 partition (ds, hr)

select key, value, ds, hr FROM table2 WHERE ds is not nullj

將數據寫入文件系統

INSERT OVERWRITE [LOCAL] DIRECTORY directory1

[ROW FORMAT row_format] [STORED AS file_format]

SELECT ... FROM ...

示例：

insert overwrite directory hdfs://ns1/data/hainiu/OVERWRITE_DIR STORED AS AVRO

select

aid,

pkgname,

uptime,

type,

country,

gpcategory

from user_install_status

where dt=20141228;

變成avro的格式

insert overwrite local directory /home/qingniu/insert/avro STORED AS AVRO

select

aid,

pkgname,

uptime,

type,

country,

gpcategory

from user_install_status

where dt=20141228;

多層目錄會自動幫你創建

from user_install_status

insert overwrite directory hdfs://ns1/data/hainiu/OVERWRITE_DIR STORED AS AVRO

select

aid,

pkgname,

uptime,

type,

country,

gpcategory

where dt=20141228

insert overwrite local directory /home/qingniu/insert/avro STORED AS AVRO

select

aid,

pkgname,

uptime,

type,

country,

gpcategory

where dt=20141228;

Hive extension (multiple inserts):

FROM from_statement

INSERT OVERWRITE [LOCAL] DIRECTORY directory1 row_format

select_statement1 where

[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 row_format

select_statement2 where ] ...

row_format

: DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]

[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]

[NULL DEFINED AS char]

導出到文件系統的數據都序列化成text，非原始類型欄位會序列化成json，導出文件以^A分隔
結尾的文本數據。

INSERT OVERWRITE 到HDFS目錄，可以通過MR job實現並行寫入。這樣在集群上抽取數據不僅速度快，而且還很方便。

4. 關聯查詢

select示例

計算新增用戶

select count(1) from

(select aid from user_install_status where dt=20141117 group by aid) a

right join

(select aid from user_install_status where dt=20141228 group by aid) b

on b.aid=a.aid

where a.aid is null;

算每個國家記錄數的百分比

select a.country,a.cn,round(http://a.cn/b.cn*100,6) from

(select country,count(1) cn,nn as joinc from user_install_status where dt=20141228 group by country) a

inner join

(select count(1) cn,nn as joinc from user_install_status where dt=20141228) b on a.joinc=b.joinc;

從頭學習大數據培訓課程 數據倉儲工具 hive（三）hive 的 orc 文件、bucket 使用、表操作、關聯查詢

1. ORC文件

2. Bucket表使用

3. 表SQL操作

4. 關聯查詢

從頭學習大數據培訓課程數據倉儲工具 hive（三）hive 的 orc 文件、bucket 使用、表操作、關聯查詢