Low-Latency Video Semantic Segmentation

05-25

From the column: Computer Vision and Deep Learning

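A CVPR 2018 Spotlight paper.

This paper focuses on the speed of video segmentation, i.e. low latency. It has two components:

1) A feature propagation module based on adaptive, spatially variant convolution, which fuses in features from the previous key frame and reduces per-frame computation;

2) Adaptive key-frame scheduling based on accuracy prediction.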

Both components work together to ensure low latency while maintaining high segmentation quality.

The paper does not use optical flow between frames, because it is costly to obtain. Instead it makes full use of low-level features, which are cheap to obtain and still carry rich information.

pipeline

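The network is split into a lower part Sl and a higher part Sh. The low-level features extracted by Sl are used to select key frames and to control how the high-level features are propagated. In short, the low-level features decide whether the current frame becomes a key frame. If it does, the high-level semantic features are computed; if not, the features of the previous key frame are propagated directly through a special convolution. The paper also reports the cost of the latter path: "Note that the combined cost of kernel prediction and spatially variant convolution is dramatically lower than computing the high-level features from Fl (38 ms vs 299 ms)."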
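The per-frame flow described above can be sketched roughly as follows. This is only a minimal illustration, not the authors' code: `run_pipeline` and its callback parameters (`lower`, `higher`, `predict_deviation`, `propagate`) are hypothetical names, and the toy lambdas stand in for the real networks.

```python
def run_pipeline(frames, lower, higher, predict_deviation, propagate, threshold):
    """Return a list of (high_level_features, is_key_frame), one per frame.

    All callbacks are hypothetical stand-ins:
      lower(frame)                 -> low-level features (Sl, runs on every frame)
      higher(low)                  -> high-level features (Sh, expensive)
      predict_deviation(k_low, low)-> predicted deviation from the key frame
      propagate(k_high, k_low, low)-> high-level features propagated from the key frame
    """
    key_low = None
    key_high = None
    outputs = []
    for frame in frames:
        low = lower(frame)
        # The first frame is always a key frame; later ones depend on the threshold.
        is_key = key_low is None or predict_deviation(key_low, low) > threshold
        if is_key:
            high = higher(low)          # expensive path
            key_low, key_high = low, high
        else:
            high = propagate(key_high, key_low, low)  # cheap path
        outputs.append((high, is_key))
    return outputs


# Toy usage: frames are plain numbers and the "networks" are trivial functions.
results = run_pipeline(
    frames=[0, 1, 2, 10, 11],
    lower=lambda frame: frame,
    higher=lambda low: low * 10,
    predict_deviation=lambda key_low, low: abs(low - key_low),
    propagate=lambda key_high, key_low, low: key_high,
    threshold=5,
)
key_flags = [is_key for _, is_key in results]
```

In this toy run only the first frame and the frame with value 10 take the expensive path; the others reuse the cached key-frame features.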

Adaptive Selection of Key Frames

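Most earlier strategies select key frames at a fixed frequency. An adaptive strategy saves more time when the video contains many static scenes. As described above, the criterion used here is the deviation (distance) between the segmentation map of the current frame and that of the previous key frame. A large deviation means the scene has changed significantly, so the current frame should be set as a key frame. (The deviation is computed from the low-level features; the authors show experimentally that low-level features can stand in for this measure.) A predefined threshold is used: when the deviation exceeds the threshold, the frame is set as a key frame.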
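To make the thresholding concrete, here is a small sketch. It assumes the deviation is measured as the fraction of pixels whose labels differ between two segmentation maps, which is one plausible choice and is not stated in these notes; the names `label_deviation` and `should_be_key_frame` are hypothetical. In the paper the deviation is predicted from low-level features rather than computed from the maps, so this only illustrates the quantity being thresholded.

```python
def label_deviation(seg_a, seg_b):
    """Fraction of pixels whose labels differ between two same-sized label maps.

    Assumption for illustration: maps are lists of rows of integer class ids.
    """
    total = 0
    differing = 0
    for row_a, row_b in zip(seg_a, seg_b):
        for label_a, label_b in zip(row_a, row_b):
            total += 1
            if label_a != label_b:
                differing += 1
    return differing / total


def should_be_key_frame(deviation, threshold):
    """A frame becomes a key frame when its deviation exceeds the threshold."""
    return deviation > threshold


key_seg = [[0, 0], [1, 1]]
cur_seg = [[0, 1], [1, 1]]
dev = label_deviation(key_seg, cur_seg)        # one of four pixels differs
is_key = should_be_key_frame(dev, threshold=0.2)
```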

Adaptive Feature Propagation

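Earlier work often propagates information between frames with optical flow. The paper points out two drawbacks:

1. It is expensive to obtain.

2. Point-to-point mapping is too restrictive.

Inter-frame propagation can also be done with translation-invariant convolution.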

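Here the paper proposes spatially variant convolution, in which each location (i, j) has its own kernel W(i,j).

The parameters of W(i,j) are predicted with Fl(k) and Fl(t) as input.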
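A minimal single-channel sketch of spatially variant convolution follows, in plain Python for readability. It assumes zero padding and that the per-location kernels are already given; in the paper they would come from a kernel predictor fed with Fl(k) and Fl(t), which is not modeled here. The function name `spatially_variant_conv` is hypothetical.

```python
def spatially_variant_conv(feat, kernels):
    """Apply a different K x K kernel at every location of a 2-D feature map.

    feat:    H x W list of numbers (one channel of the key-frame features).
    kernels: H x W list of K x K kernels (K odd), one kernel per location.
    out[i][j] = sum over (u, v) of kernels[i][j][u + r][v + r] * feat[i - u][j - v],
    with r = K // 2 and out-of-range positions treated as zero (an assumption).
    """
    h, w = len(feat), len(feat[0])
    r = len(kernels[0][0]) // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for u in range(-r, r + 1):
                for v in range(-r, r + 1):
                    y, x = i - u, j - v
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernels[i][j][u + r][v + r] * feat[y][x]
            out[i][j] = acc
    return out


feat = [[1, 2, 3],
        [4, 5, 6]]
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]   # copies the same location
shift = [[0, 0, 0], [0, 0, 1], [0, 0, 0]]      # copies the left neighbour

kernels = [[identity for _ in range(3)] for _ in range(2)]
kernels[0][2] = shift                          # only location (0, 2) uses a different kernel
out = spatially_variant_conv(feat, kernels)
```

With a translation-invariant convolution every location would share one kernel; here location (0, 2) pulls its value from its left neighbour while all other locations copy their own value, which is the extra flexibility over a point-to-point mapping.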

low-latency scheduling

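A fast track is introduced at key frames. Suppose frame t is a key frame: its output is still produced by feature propagation, while another thread runs the low- and high-level networks to compute the features and then replaces the cached features of the previous key frame. The two run in parallel, so the main thread is not blocked. It is a neat trick.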
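The fast track can be sketched with a background thread as below. This is an illustrative sketch only, not the paper's implementation: `KeyFrameCache`, `handle_key_frame`, and the toy `higher` / `propagate` callbacks are hypothetical names.

```python
import threading


class KeyFrameCache:
    """Thread-safe holder for the cached key-frame (low, high) features."""

    def __init__(self, low, high):
        self._lock = threading.Lock()
        self._low = low
        self._high = high

    def get(self):
        with self._lock:
            return self._low, self._high

    def set(self, low, high):
        with self._lock:
            self._low = low
            self._high = high


def handle_key_frame(low, cache, higher, propagate):
    """Return the fast-track output at once and refresh the cache in the background."""
    key_low, key_high = cache.get()
    # Fast track: the output still comes from cheap feature propagation.
    fast_output = propagate(key_high, key_low, low)

    def refresh():
        # Slow track: compute the real high-level features and replace the cache.
        cache.set(low, higher(low))

    worker = threading.Thread(target=refresh)
    worker.start()
    return fast_output, worker


cache = KeyFrameCache(low=1, high=10)
fast_output, worker = handle_key_frame(
    low=5,
    cache=cache,
    higher=lambda low: low * 10,
    propagate=lambda key_high, key_low, low: key_high,
)
worker.join()  # joined here only so the toy example can inspect the refreshed cache
refreshed = cache.get()
```

The caller gets `fast_output` without waiting for `higher`; later frames then propagate from the refreshed cache once the background thread has finished.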

Finally, a look at the results: