Learning Big Data from Scratch training course: the data warehouse tool Hive (7), custom Hive UDTFs

02-25

1. UDTF

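Extend org.apache.hadoop.hive.ql.udf.generic.GenericUDTF and implement three methods: initialize, process, and close.

Hive first calls initialize, which returns information about the rows the UDTF will output (the number of columns and their types).

After initialization, Hive calls process, where the actual work happens. Each call to forward() inside process emits one row; to emit several columns, put the column values into an array and pass that array to forward().

Finally, close() is called to release anything that needs cleaning up.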
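The three-step lifecycle described above can be sketched in plain Java. This is only a model of the call order, not the Hive API: the class name SplitUdtfSketch, the Consumer used in place of Hive's collector, and the input format "name:age,name:age" are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Plain-Java model of the UDTF lifecycle; it does NOT extend GenericUDTF.
// Assumption: each input value looks like "name:age,name:age".
class SplitUdtfSketch {
    private final Consumer<Object[]> collector; // stands in for Hive's row collector
    private boolean closed = false;

    SplitUdtfSketch(Consumer<Object[]> collector) {
        this.collector = collector;
    }

    // Step 1: declare the output columns (here two columns, name and age).
    List<String> initialize() {
        return Arrays.asList("name", "age");
    }

    // Step 2: called once per input row; may emit zero or more output rows.
    void process(String value) {
        if (value == null || value.isEmpty()) {
            return; // nothing to emit for an empty input
        }
        for (String pair : value.split(",")) {
            String[] parts = pair.split(":");
            if (parts.length != 2) {
                continue; // skip malformed pieces
            }
            // One forward() call produces one output row; columns travel as an array.
            forward(new Object[] {parts[0], Integer.parseInt(parts[1].trim())});
        }
    }

    private void forward(Object[] row) {
        collector.accept(row);
    }

    // Step 3: release anything the function was holding.
    void close() {
        closed = true;
    }

    boolean isClosed() {
        return closed;
    }

    public static void main(String[] args) {
        List<Object[]> rows = new ArrayList<>();
        SplitUdtfSketch udtf = new SplitUdtfSketch(rows::add);
        System.out.println(udtf.initialize());
        udtf.process("tom:20,amy:31"); // one input row, two output rows
        udtf.close();
        for (Object[] row : rows) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

In a real UDTF the same three methods are overridden on GenericUDTF, and initialize returns an object inspector describing the output columns rather than a list of names.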

Test data

Create a table based on this data

create table udtftest (id int,strsplit string)

row format delimited fields terminated by ;

Copy the data into the internal table

Create the function

CREATE TEMPORARY FUNCTION hainiu_udtf_split AS 'com.hainiu.hive.function.UDTFHainiuSplit';

Test

set hive.cli.print.header=true;

select hainiu_udtf_split(strsplit) from udtftest;

select hainiu_udtf_split(strsplit) as (name,age) from udtftest;

select id,hainiu_udtf_split(strsplit) from udtftest;

select udtftest.id,udtffunction.name,udtffunction.age from udtftest lateral view hainiu_udtf_split(strsplit) udtffunction as name,age;

select sum(age),count(id),count(distinct id) from (

select udtftest.id as id,udtffunction.name,udtffunction.age as age from udtftest lateral view hainiu_udtf_split(strsplit) udtffunction as name,age) a;
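The lateral view queries above pair each source row with every row the UDTF generates from it, which is why id can appear next to name and age. The following plain-Java sketch models that join under stated assumptions: the class LateralViewSketch, the helper split, and the "name:age,name:age" input format are hypothetical and are not part of Hive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Conceptual model of "source LATERAL VIEW udtf(col) alias AS ...".
class LateralViewSketch {

    // Joins every source row with each row the table function produces for it.
    static List<Object[]> lateralView(List<Object[]> source, int column,
                                      Function<String, List<Object[]>> udtf) {
        List<Object[]> out = new ArrayList<>();
        for (Object[] src : source) {
            for (Object[] generated : udtf.apply((String) src[column])) {
                Object[] joined = new Object[src.length + generated.length];
                System.arraycopy(src, 0, joined, 0, src.length);
                System.arraycopy(generated, 0, joined, src.length, generated.length);
                out.add(joined);
            }
        }
        return out;
    }

    // Hypothetical splitter: "name:age,name:age" becomes rows of (name, age).
    static List<Object[]> split(String value) {
        List<Object[]> rows = new ArrayList<>();
        if (value == null || value.isEmpty()) {
            return rows;
        }
        for (String pair : value.split(",")) {
            String[] parts = pair.split(":");
            if (parts.length == 2) {
                rows.add(new Object[] {parts[0], Integer.parseInt(parts[1].trim())});
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Object[]> source = new ArrayList<>();
        source.add(new Object[] {1, "tom:20,amy:31"});
        source.add(new Object[] {2, "bob:45"});
        // Each joined row holds: id, strsplit, name, age.
        for (Object[] row : lateralView(source, 1, LateralViewSketch::split)) {
            System.out.println(row[0] + "\t" + row[2] + "\t" + row[3]);
        }
    }
}
```

In this model a source row whose value yields no generated rows disappears from the result, which matches a plain LATERAL VIEW; Hive's LATERAL VIEW OUTER keeps such rows and fills the generated columns with NULL.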

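When one value expands into multiple rows, the split depends on the content of each row's data

Change the data to:

Result:

Usage rules:

(1). Used directly in select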

select hainiu_split(valuse) from src;

select hainiu_split(valuse) as (col1,col2) from src;

(2). Used together with lateral view

select udtf_test.valuse, mytable.col1, mytable.col2 from udtf_test lateral view hainiu_split(valuse) mytable as col1, col2;

(3). Unsupported scenarios

Other columns cannot be selected alongside the UDTF

select valuse, hainiu_split(valuse) as (col1,col2) from udtf_test ;

Nested calls are not allowed

select hainiu_split(hainiu_split(valuse)) from udtf_test;

It cannot be combined with group by/cluster by/distribute by/sort by

select hainiu_split(valuse) as (col1,col2) from udtf_test group by col1, col2;

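2. HBase overview:

HBase is a distributed, column-oriented open-source database modeled on Google's Bigtable. It stores data as key-value pairs and does not natively support standard SQL.

key column_family1 column_family2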

1 a b c e f g ee aa bb dd 234 44 54 7563

2 1:a d c

2:a

3:c

3 e t y

4 a

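Unlike a typical relational database, HBase is suited to storing unstructured data, and its model is column-based rather than row-based.

Main features

Large: a table can have billions of rows and millions of columns

Schema-free: every row has a sortable primary key (the row key) and any number of columns; columns can be added dynamically as needed, and different rows in the same table can have completely different columns

Column-oriented: storage and access control are organized by column (family), and column (families) are retrieved independently

Sparse: empty (null) columns take up no storage space, so tables can be designed to be very sparse

Multi-versioned data: each cell can hold multiple versions of its data; by default the version number is assigned automatically and is the timestamp at which the cell was inserted

Single data type: all data in HBase is stored as uninterpreted byte arrays (in practice often treated as strings); there are no column data types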
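The features listed above (sortable row keys, sparse columns, multiple timestamped versions, untyped values) can be pictured as nested sorted maps. The sketch below is a conceptual model only, written with standard Java collections; the class SparseTableSketch and its methods are hypothetical and do not use or imitate the HBase client API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Conceptual model: row key -> "family:qualifier" -> timestamp -> value bytes.
class SparseTableSketch {
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, byte[]>>> rows =
            new TreeMap<>(); // rows stay sorted by row key

    void put(String rowKey, String family, String qualifier, long timestamp, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            // Versions are kept newest-first, so the first entry is the latest one.
            .computeIfAbsent(family + ":" + qualifier,
                    k -> new TreeMap<Long, byte[]>(Comparator.<Long>reverseOrder()))
            .put(timestamp, value.getBytes(StandardCharsets.UTF_8)); // values are just bytes
    }

    // Returns the newest version of a cell, or null if the cell was never written.
    String get(String rowKey, String family, String qualifier) {
        NavigableMap<Long, byte[]> versions = versions(rowKey, family, qualifier);
        if (versions == null || versions.isEmpty()) {
            return null;
        }
        return new String(versions.firstEntry().getValue(), StandardCharsets.UTF_8);
    }

    // Number of stored versions for one cell (0 if the cell does not exist).
    int versionCount(String rowKey, String family, String qualifier) {
        NavigableMap<Long, byte[]> versions = versions(rowKey, family, qualifier);
        return versions == null ? 0 : versions.size();
    }

    private NavigableMap<Long, byte[]> versions(String rowKey, String family, String qualifier) {
        NavigableMap<String, NavigableMap<Long, byte[]>> row = rows.get(rowKey);
        return row == null ? null : row.get(family + ":" + qualifier);
    }

    public static void main(String[] args) {
        SparseTableSketch table = new SparseTableSketch();
        table.put("1", "cf1", "a", 1L, "old");
        table.put("1", "cf1", "a", 2L, "new"); // second version of the same cell
        table.put("2", "cf2", "b", 1L, "x");   // row 2 has a different column set
        System.out.println(table.get("1", "cf1", "a"));
        System.out.println(table.versionCount("1", "cf1", "a"));
        System.out.println(table.get("2", "cf1", "a")); // absent cell, nothing stored
    }
}
```

An absent cell simply has no map entry, which is the sense in which null columns cost no storage in this model.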
