Visitor Path Analysis: A Druid Implementation

05-11

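I. Background

Visitor path analysis is a common form of data analysis. As the figure above (Google Analytics) shows, it presents in a fairly intuitive way how users drop off along each navigation path after arriving at a site, which helps site owners optimize the site and reduce the drop-off rate.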

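Visitor path analysis has the following key points:

User paths usually span multiple levels. By default, 5 levels are expanded, including the landing page, and each subsequent click expands one more level (up to 10 levels; anything beyond that adds little value).
Each level shows only the top 5 pages by visit count, and the connecting lines between pages in adjacent levels represent transitions.
The metrics are the session count and drop-off count of each top 5 page, plus the session count of the remaining pages.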

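Based on the points above, implementing visitor path analysis requires the following work:

Compute the total session count across all pages at each level.
Compute the top 5 pages by session count at each level.
Compute the number of transitions between each pair of pages in adjacent levels.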

This article proposes a Druid-based implementation that maps the 3 computations above to Druid's Timeseries (totals), TopN (top 5), and GroupBy (pairwise relationships) queries.

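II. Technical Design

Data cleansing (ETL) aggregates users' page-view (PV) records into sessions. Within each session, the page views are sorted by time, and the first 11 are stored in the dimensions landing_page through path10. An example of the table after ETL processing is shown below: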
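As a rough illustration of this sessionization step (not the author's ETL job), the sketch below collapses page views into one row per session. The input keys `session_id`, `host`, `time`, and `url` are assumptions made for the example; the article does not say how sessions are keyed. It also assumes every page view in a session shares the same host.

```python
from collections import defaultdict

# Dimension names used by the article: landing_page, path1 ... path10.
PATH_DIMENSIONS = ["landing_page"] + [f"path{i}" for i in range(1, 11)]


def build_session_rows(page_views):
    """Collapse page-view records into one row per session.

    Each record is a dict with the hypothetical keys
    'session_id', 'host', 'time' and 'url'.
    """
    sessions = defaultdict(list)
    for pv in page_views:
        sessions[pv["session_id"]].append(pv)

    rows = []
    for views in sessions.values():
        views.sort(key=lambda pv: pv["time"])  # order visits by time
        first = views[0]
        row = {"time": first["time"], "host": first["host"]}
        # Keep the first 11 pages; levels the visitor never reached stay None.
        for index, dimension in enumerate(PATH_DIMENSIONS):
            row[dimension] = views[index]["url"] if index < len(views) else None
        rows.append(row)
    return rows
```

A session with two page views fills landing_page and path1 and leaves path2 through path10 as None, which is what the null filters in the queries rely on.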

The data is then loaded into Druid for querying. The schema is designed as follows:

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": ""
      }
    },
    "dataSchema": {
      "dataSource": "",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": {"type": "period", "period": "P1D", "timeZone": "Asia/Shanghai"},
        "queryGranularity": {"type": "period", "period": "P1D", "timeZone": "Asia/Shanghai"},
        "intervals": []
      },
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "dimensionsSpec": {
            "dimensions": ["host", "landing_page", "path1", ... "path10"]
          },
          "timestampSpec": {
            "format": "auto",
            "column": "time"
          }
        }
      },
      "metricsSpec": [
        {"name": "count", "type": "count"}
      ]
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "type": "hashed",
        "targetPartitionSize": 5000000
      },
      "indexSpec": {
        "bitmap": {"type": "roaring"},
        "dimensionCompression": "LZ4",
        "metricCompression": "LZ4",
        "longEncoding": "auto"
      }
    }
  }
}

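III. Implementation in Practice

Example queries

Compute the total session count across all pages at each level (the first 5 levels are shown by default), filtering out null values (a null means the user exited at the previous level).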

{
  "queryType": "timeseries",
  "dataSource": "visit_path_analysis",
  "granularity": "all",
  "filter": {
    "type": "and",
    "fields": [
      {"type": "selector", "dimension": "host", "value": "www.xxx.com"}
    ]
  },
  "aggregations": [
    {
      "type": "filtered",
      "filter": {"type": "not", "field": {"type": "selector", "dimension": "landing_page", "value": null}},
      "aggregator": {"type": "longSum", "name": "count0", "fieldName": "count"}
    },
    {
      "type": "filtered",
      "filter": {"type": "not", "field": {"type": "selector", "dimension": "path1", "value": null}},
      "aggregator": {"type": "longSum", "name": "count1", "fieldName": "count"}
    },
    {
      "type": "filtered",
      "filter": {"type": "not", "field": {"type": "selector", "dimension": "path2", "value": null}},
      "aggregator": {"type": "longSum", "name": "count2", "fieldName": "count"}
    },
    {
      "type": "filtered",
      "filter": {"type": "not", "field": {"type": "selector", "dimension": "path3", "value": null}},
      "aggregator": {"type": "longSum", "name": "count3", "fieldName": "count"}
    },
    {
      "type": "filtered",
      "filter": {"type": "not", "field": {"type": "selector", "dimension": "path4", "value": null}},
      "aggregator": {"type": "longSum", "name": "count4", "fieldName": "count"}
    }
  ],
  "intervals": []
}
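Because the five filtered aggregators differ only in the dimension name and the output name, the query can be generated instead of written by hand. The following is a minimal sketch under that assumption; the helper names `level_dimension` and `build_level_totals_query` and the interval string are placeholders invented for the example, not part of the original setup.

```python
import json


def level_dimension(level):
    """Level 0 is the landing page; level n (n >= 1) is path<n>."""
    return "landing_page" if level == 0 else f"path{level}"


def build_level_totals_query(data_source, host, intervals, levels=5):
    """Build a timeseries query with one filtered longSum per path level."""
    aggregations = []
    for level in range(levels):
        aggregations.append({
            "type": "filtered",
            # Skip sessions that ended before reaching this level.
            "filter": {
                "type": "not",
                "field": {
                    "type": "selector",
                    "dimension": level_dimension(level),
                    "value": None,  # serialized as JSON null
                },
            },
            "aggregator": {
                "type": "longSum",
                "name": f"count{level}",
                "fieldName": "count",
            },
        })
    return {
        "queryType": "timeseries",
        "dataSource": data_source,
        "granularity": "all",
        "filter": {
            "type": "and",
            "fields": [{"type": "selector", "dimension": "host", "value": host}],
        },
        "aggregations": aggregations,
        "intervals": intervals,
    }


# Placeholder interval; the article leaves "intervals" empty.
query = build_level_totals_query(
    "visit_path_analysis", "www.xxx.com", ["2020-01-01/2020-01-02"]
)
payload = json.dumps(query)
```

Passing `levels=6` and up covers the case where the user expands additional levels, up to the 11 dimensions stored per session.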

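Compute the top 5 pages by session count at each level, again filtering out null values (the user exited at the previous level). The example queries the landing_page level; other levels use the corresponding path dimension.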

{
  "queryType": "topN",
  "dataSource": "visit_path_analysis",
  "granularity": "all",
  "dimension": "landing_page",
  "filter": {
    "type": "and",
    "fields": [
      {"type": "selector", "dimension": "host", "value": "www.xxx.com"},
      {"type": "not", "field": {"type": "selector", "dimension": "landing_page", "value": null}}
    ]
  },
  "threshold": 5,
  "metric": {"type": "numeric", "metric": "count"},
  "aggregations": [
    {"type": "longSum", "name": "count", "fieldName": "count"}
  ],
  "intervals": []
}
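One TopN query is needed per level, so a small builder keeps the five default queries consistent. This is a sketch only; `build_top_pages_query` and the interval value are hypothetical names and placeholders introduced for illustration.

```python
def build_top_pages_query(data_source, host, intervals, level, threshold=5):
    """Build the topN query for one path level (0 = landing page)."""
    dimension = "landing_page" if level == 0 else f"path{level}"
    return {
        "queryType": "topN",
        "dataSource": data_source,
        "granularity": "all",
        "dimension": dimension,
        "filter": {
            "type": "and",
            "fields": [
                {"type": "selector", "dimension": "host", "value": host},
                # Drop sessions that never reached this level.
                {
                    "type": "not",
                    "field": {
                        "type": "selector",
                        "dimension": dimension,
                        "value": None,
                    },
                },
            ],
        },
        "threshold": threshold,
        "metric": {"type": "numeric", "metric": "count"},
        "aggregations": [
            {"type": "longSum", "name": "count", "fieldName": "count"}
        ],
        "intervals": intervals,
    }


# One query per default level; the interval is a placeholder.
top_queries = [
    build_top_pages_query(
        "visit_path_analysis", "www.xxx.com", ["2020-01-01/2020-01-02"], level
    )
    for level in range(5)
]
```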

Compute the number of transitions between pages in adjacent levels. A null value at the later level is used to count drop-offs.

{
  "queryType": "groupBy",
  "dataSource": "visit_path_analysis",
  "granularity": "all",
  "dimensions": ["landing_page", "path1"],
  "filter": {
    "type": "and",
    "fields": [
      {"type": "selector", "dimension": "host", "value": "www.xxx.com"},
      {"type": "in", "dimension": "landing_page", "values": ["/a", "/b", "/c", "/d", "/e"]},
      {"type": "in", "dimension": "path1", "values": ["/f", "/g", "/h", "/i", "/j", null]}
    ]
  },
  "aggregations": [
    {"type": "longSum", "name": "count", "fieldName": "count"}
  ],
  "intervals": []
}
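To turn the GroupBy result into chart data, the rows can be split into page-to-page transitions and per-page drop-offs. The sketch below assumes each response row carries its grouped values under an `event` key, as native groupBy responses do; the helper name `split_transitions` and the sample rows are made up for illustration.

```python
def split_transitions(group_by_rows, from_dimension, to_dimension):
    """Separate groupBy rows into transitions and drop-offs.

    A row whose later-level dimension is null means the session
    ended at the earlier page, so it counts as a drop-off.
    """
    transitions = {}
    drop_offs = {}
    for row in group_by_rows:
        event = row["event"]
        source = event[from_dimension]
        target = event.get(to_dimension)
        count = event["count"]
        if target is None:
            drop_offs[source] = drop_offs.get(source, 0) + count
        else:
            key = (source, target)
            transitions[key] = transitions.get(key, 0) + count
    return transitions, drop_offs


# Invented sample rows, shaped like a groupBy response.
sample_rows = [
    {"event": {"landing_page": "/a", "path1": "/f", "count": 120}},
    {"event": {"landing_page": "/a", "path1": None, "count": 80}},
    {"event": {"landing_page": "/b", "path1": "/f", "count": 30}},
]
transitions, drop_offs = split_transitions(sample_rows, "landing_page", "path1")
```

Note that this query returns only transitions into the next level's top 5 pages (plus drop-offs), because of the `in` filter on the later level.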

IV. Summary and Analysis

The Druid-based visitor path analysis proposed in this article takes multiple requests to complete.

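For the default view, the per-level session totals and the per-level top 5 pages can be requested from Druid in parallel. Subtracting the top 5 session counts from a level's total gives the session count of the remaining pages.
Once the top 5 pages of each level are known, a GroupBy query over each pair of adjacent levels yields the transition counts and drop-off counts.
To show one more level of the path, only a GroupBy between the top 5 of the current last level and the top 5 of the next level is needed.
In terms of data distribution, most traffic is concentrated in the first few steps; later steps are smaller by orders of magnitude.
The biggest challenge of this approach comes from concurrent requests to Druid: rendering a single page fans out into multiple concurrent Druid queries.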
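The request flow summarized above could be orchestrated roughly as follows. This is a sketch under stated assumptions: `run_query` is a hypothetical callable that sends one query to Druid and returns the decoded response, and `fetch_default_view` and `remaining_sessions` are illustrative names, not part of any Druid client.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_default_view(run_query, totals_query, top_queries):
    """Run the totals query and the per-level topN queries in parallel.

    run_query is a hypothetical callable: it takes one query dict and
    returns whatever the Druid endpoint responds with.
    """
    with ThreadPoolExecutor(max_workers=len(top_queries) + 1) as pool:
        totals_future = pool.submit(run_query, totals_query)
        top_futures = [pool.submit(run_query, q) for q in top_queries]
        totals = totals_future.result()
        tops = [future.result() for future in top_futures]
    return totals, tops


def remaining_sessions(level_total, top_counts):
    """Sessions on pages outside the top 5 of one level."""
    return level_total - sum(top_counts)
```

For example, a level with 1000 sessions in total whose top 5 pages account for 300, 200, 150, 100, and 50 sessions leaves `remaining_sessions(1000, [300, 200, 150, 100, 50])`, that is 200, for all other pages. Capping `max_workers` is one simple way to limit the concurrent load a single page view places on Druid.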