My solutions for `Google TensorFlow Speech Recognition Challenge`

Some time ago I spent my spare time on the TensorFlow Speech Recognition Challenge hosted by Google Brain on Kaggle, and finished 58th out of 1315 teams:

Snapshot of the leaderboard Top 10..

This competition is not a speech recognition task in the usual sense; the more precise term is keyword spotting. Judging from the teams' solutions, and from the academic papers published on this dataset, the vast majority treat it as a typical classification task in machine learning: given a fixed-length input (1 second of audio), predict its class: one of the keywords yes, no, up, down, left, right, on, off, stop, go, plus silence and unknown.

TensorFlow Speech Recognition Challenge - Data (www.kaggle.com)

The training data contains more than 60,000 audio clips of the keywords; the LeaderBoard data contains 150,000 clips (differing from the training data in speakers and recording environments).

Based on speech_commands, I implemented the models from papers [2, 3, 4, 5], with some adjustments and optimizations for the speech setting; the code is on GitHub: lifeiteng/TF_SpeechRecoChallenge

install

(sudo) pip install -e .

model

A typical classification pipeline:

audio -> Mfcc / Fbank -> [FeatureScale] -> deep neural networks [dnn / convnets ] -> class labels
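As a rough illustration of this pipeline, here is my own sketch of the feature-extraction step using tf.signal (not the code from the repo; window sizes and mel parameters are assumptions):

```python
import tensorflow as tf

def fbank_and_mfcc(waveform, sample_rate=16000, num_mel_bins=80, num_mfcc=40):
    """Turn a 1-second waveform into log-Mel filterbank (Fbank) and MFCC features."""
    # 25 ms window, 10 ms hop -- common defaults, assumed here.
    stft = tf.signal.stft(waveform, frame_length=400, frame_step=160, fft_length=512)
    spectrogram = tf.abs(stft)                                # [frames, 257]
    mel_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=num_mel_bins, num_spectrogram_bins=257,
        sample_rate=sample_rate, lower_edge_hertz=20.0, upper_edge_hertz=7600.0)
    log_mel = tf.math.log(tf.matmul(spectrogram, mel_matrix) + 1e-6)   # Fbank
    mfcc = tf.signal.mfccs_from_log_mel_spectrograms(log_mel)[..., :num_mfcc]
    return log_mel, mfcc
```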


model - baseline conv LeaderBoard Accuracy 0.77

[1] Sainath T N, Parada C. Convolutional neural networks for small-footprint keyword spotting[C]//Sixteenth Annual Conference of the International Speech Communication Association. 2015.

The baseline system obtained by running speech_commands.


model - resnet LeaderBoard Accuracy from 0.85 -> 0.89

[2] Tang R, Lin J. Deep Residual Learning for Small-Footprint Keyword Spotting[J]. arXiv preprint arXiv:1710.10361, 2017.

  • No FeatureScale + BatchNorm (the paper's architecture) + Mfcc 40: LB < 0.1
  • No FeatureScale + remove BatchNorm + Mfcc 40: LB 0.84
  • No FeatureScale + Add BN after input + BatchNorm + Mfcc 40: LB 0.86

Without FeatureScale, the model written according to paper [2] does not work at all on the LB dataset (while eval / test accuracy on splits held out from the training data reaches about 95%). After posting a frustrated Weibo about it, I realized I could add a BatchNorm layer right after the input to emulate FeatureScale.
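A minimal sketch of that trick (tf.keras, my own illustration; layer sizes are placeholders, not the repo's exact model):

```python
import tensorflow as tf

def resnet_input(feature_shape=(98, 40, 1)):
    """Put a BatchNorm directly after the input so the network learns a global
    mean/variance normalization, emulating offline FeatureScale."""
    inputs = tf.keras.Input(shape=feature_shape)        # e.g. 98 frames x 40 MFCCs
    x = tf.keras.layers.BatchNormalization()(inputs)    # stands in for FeatureScale
    x = tf.keras.layers.Conv2D(45, 3, padding='same', use_bias=False)(x)
    # ... residual blocks from [2] follow ...
    return inputs, x
```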

  • FeatureScale(centered mean) + BatchNorm + Mfcc 40: LB 0.85
    • Implement Feature Scale

big improvement (cmvn + Fbank80):

  • FeatureScale(cmvn) + BatchNorm + Fbank80: LB 0.89333

cmvn: cepstral mean and variance normalization (subtract the per-dimension mean, normalize to unit variance)
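Per-utterance CMN / CMVN is just a few lines (a numpy sketch, assuming features shaped [frames, dims]):

```python
import numpy as np

def cmn(features):
    """Centered mean: subtract the per-dimension mean over time."""
    return features - features.mean(axis=0, keepdims=True)

def cmvn(features, eps=1e-8):
    """Cepstral mean and variance normalization: zero mean, unit variance per dimension."""
    return cmn(features) / (features.std(axis=0, keepdims=True) + eps)
```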

Other modifications: LB 0.88617 (during the competition only 0.88 was shown; the exact score became visible only after it ended, and the config above turned out to be the best)

commit: 922aa0d85 (improve resnet)

  + Fbank 80
  + dropout
  + feature scale: centered mean (e.g. cmn)
  + [First Conv: kernel_size=(3, 10), strides=(1, 4)](github.com/lifeiteng/TF)
  + use [MaxPool + AvgPool](github.com/lifeiteng/TF) (the last two changes are sketched below)
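A sketch of the first-conv and pooling changes from the list above (tf.keras; my reading of "MaxPool + AvgPool" is concatenating global max- and average-pooled features, which may differ from the repo):

```python
import tensorflow as tf

def stem_and_pool(inputs, filters=45):
    """Wide first conv that downsamples the frequency axis, then combine
    global MaxPool and AvgPool instead of using only one of them."""
    # inputs: [batch, frames, freq_bins, 1], e.g. Fbank 80
    x = tf.keras.layers.Conv2D(filters, kernel_size=(3, 10), strides=(1, 4),
                               padding='same', use_bias=False)(inputs)
    # ... residual blocks ...
    max_pool = tf.keras.layers.GlobalMaxPooling2D()(x)
    avg_pool = tf.keras.layers.GlobalAveragePooling2D()(x)
    return tf.keras.layers.Concatenate()([max_pool, avg_pool])
```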


model - densenet LeaderBoard Accuracy from 0.86 -> 0.88

[3] Huang G, Liu Z, Weinberger K Q, et al. Densely connected convolutional networks[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2017, 1(2): 3.

With Fbank 80 + CMN, Accuracy improved from 0.86 to 0.88.
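For reference, a minimal densely connected block in the spirit of [3] (tf.keras; growth rate and depth are assumptions, not the repo's configuration):

```python
import tensorflow as tf

def dense_block(x, num_layers=4, growth_rate=12):
    """Each layer receives the concatenation of all previous feature maps."""
    for _ in range(num_layers):
        y = tf.keras.layers.BatchNormalization()(x)
        y = tf.keras.layers.ReLU()(y)
        y = tf.keras.layers.Conv2D(growth_rate, 3, padding='same', use_bias=False)(y)
        x = tf.keras.layers.Concatenate()([x, y])
    return x
```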


model - ensemble LB 0.89745

I did not spend much time on this. In practice, Kaggle competitions come down to ensembling for the final gap; the winner of this competition also relied on ensembling.

Without Distilling (using ensemble probabilities as labels) or Ensembling, the single-model accuracies reported by other teams on the forum were around 0.86.
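The ensembling itself amounts to averaging the models' class probabilities (a sketch; variable names are hypothetical):

```python
import numpy as np

def ensemble_predict(prob_matrices, labels):
    """Average per-model probabilities over clips, then take the argmax.
    prob_matrices: list of arrays shaped [num_clips, num_classes]."""
    avg_probs = np.mean(np.stack(prob_matrices, axis=0), axis=0)
    return [labels[i] for i in avg_probs.argmax(axis=1)]
```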


data augmentation

[4] Jaitly N, Hinton G E. Vocal tract length perturbation (VTLP) improves speech recognition[C]//Proc. ICML Workshop on Deep Learning for Audio, Speech and Language. 2013, 117.

[5] Ko T, Peddinti V, Povey D, et al. Audio augmentation for speech recognition[C]//Sixteenth Annual Conference of the International Speech Communication Association. 2015.

  • Tried changing speed / pitch, but got only a small improvement (a sketch follows below).
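Something along these lines (librosa here is my own illustration of the technique; the competition scripts may implement it differently):

```python
import numpy as np
import librosa

def perturb(waveform, sample_rate=16000, speed=1.1, semitones=1.0):
    """Speed and pitch perturbation for a 1-second keyword clip."""
    y = librosa.effects.time_stretch(waveform, rate=speed)                 # change speed
    y = librosa.effects.pitch_shift(y, sr=sample_rate, n_steps=semitones)  # change pitch
    # Pad or crop back to exactly 1 second so the classifier input stays fixed-length.
    if len(y) < sample_rate:
        y = np.pad(y, (0, sample_rate - len(y)))
    return y[:sample_rate]
```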

Summary:

  1. The absolute > 10% accuracy gain from resnet / densenet made me feel how relatively conservative the speech community is, even though by 2016 it was already envious of the rapid progress of ConvNets in computer vision [6, 7, ...]; of course, ASR is not a simple classification task either;
  2. The productivity TensorFlow brings: implementing resnet / densenet / mobilenet did not take much time; after reading a paper, an initial version can be written in 1-2 hours (subsequent tuning still requires rereading the paper carefully to check details);
  3. Ensembling is still the silver bullet of Kaggle competitions;
  4. I did not explore the classic keyword spotting approaches [8, 9, ...] enough; on the last day I wrote a CTC-based keyword spotting system from scratch, but judging from the score statistics it was basically unreliable..

[6] Xiong W, Droppo J, Huang X, et al. Achieving human parity in conversational speech recognition[J]. arXiv preprint arXiv:1610.05256, 2016.

[7] Yu D, Li J. Recent progresses in deep learning based acoustic models[J]. IEEE/CAA Journal of Automatica Sinica, 2017, 4(3): 396-409.

[8] Lengerich C, Hannun A. An end-to-end architecture for keyword spotting and voice activity detection[J]. arXiv preprint arXiv:1611.09405, 2016.

[9] He Y, Prabhavalkar R, Rao K, et al. Streaming Small-Footprint Keyword Spotting using Sequence-to-Sequence Models[J]. arXiv preprint arXiv:1710.09617, 2017.
