A Systematic Survey of Deep Learning

04-18

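Before we begin

This is one of the eight academic reports my school requires for graduation (and the only one I wrote myself). I drew on a great many blog posts and Zhihu articles by people far more expert than me, then worked through the material again with my own reasoning and understanding. Most of those sources I read about half a year ago and only remember in outline, so I no longer recall where I saw what, haha. If I have infringed on anyone's rights, please contact me and I will add the reference link.

Additional reference links:

(None yet)

Purpose: apart from the school requirement, this is mainly to make it easier to look up papers later, and to serve as material compiled by one novice (me) for other newcomers. Perhaps it can help everyone better understand the past and present of deep learning and get a rough idea of what deep learning can do. (shy)

Intended readers: newcomers, and me, currently writing the introduction of a deep learning paper

Ad space: an earlier Zhihu article of mine on understanding LSTM: 對LSTM中M（Memory）的再思考 (Rethinking the M (Memory) in LSTM)

Notice: please let me know before reposting or quoting.

Main text follows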

Drawing on the historical literature, this systematic survey report mainly covers the history of deep learning and its application prospects.

1. History of Deep Learning

1.1 History of Artificial Neural Networks

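Deep learning, or more specifically deep neural networks, grew out of research on artificial neural networks (ANNs). The early development of ANNs can be divided into four stages. The enlightenment period began in 1890 with the well-known American psychologist W. James's study of the structure and function of the human brain; in 1943 McCulloch and Pitts [1] created the "M-P neuron model", which gives a simple representation of a neuron in the form of an activation function. However, in the 1969 book Perceptrons, Minsky and Papert [2] proved mathematically, in detail, that the perceptron cannot solve even a classification task as simple as XOR. The book's enormous influence and pessimistic outlook sent neural network research into a low period. That period lasted until 1982, when J. J. Hopfield's groundbreaking paper "Neural Networks and Physical Systems" proposed a single-layer feedback neural network (the Hopfield net); in 1985 the network was successfully realized in analog electronic circuits, and neural networks entered a revival period. In 1986 the research group led by D. E. Rumelhart and J. L. McClelland published Parallel Distributed Processing [3], which pushed neural network research as a whole toward a peak, and the first international conference on artificial neural networks was held in 1987. With that, ANN research entered its boom period.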
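To make the XOR limitation concrete, here is a minimal sketch of an M-P style threshold unit. The function names, the weight grid, and the hand-picked weights in `xor_two_layer` are illustrative assumptions, not taken from [1] or [2]. A grid search is of course not a proof (the real argument is that XOR is not linearly separable), but it shows the pattern: a single unit can realize AND and OR, no single unit on the grid realizes XOR, and stacking two layers of the same unit does.

```python
import itertools

def mp_neuron(inputs, weights, threshold):
    """M-P style unit: output 1 when the weighted input sum reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND = [0, 0, 0, 1]
OR = [0, 1, 1, 1]
XOR = [0, 1, 1, 0]

def realizes(weights, threshold, targets):
    """True if one unit with these parameters reproduces the whole truth table."""
    return all(mp_neuron(x, weights, threshold) == t for x, t in zip(INPUTS, targets))

def search_single_unit(targets, grid):
    """Return the first (weights, threshold) on the grid that realizes targets, else None."""
    for w1, w2, th in itertools.product(grid, repeat=3):
        if realizes((w1, w2), th, targets):
            return (w1, w2), th
    return None

# Hypothetical search grid: -4.0 ... 4.0 in steps of 0.5.
GRID = [i / 2 for i in range(-8, 9)]

and_solution = search_single_unit(AND, GRID)   # some valid parameters exist
or_solution = search_single_unit(OR, GRID)     # some valid parameters exist
xor_solution = search_single_unit(XOR, GRID)   # nothing on the grid works

def xor_two_layer(a, b):
    """XOR built from three units in two layers: AND(OR(a, b), NAND(a, b))."""
    h_or = mp_neuron((a, b), (1, 1), 1)
    h_nand = mp_neuron((a, b), (-1, -1), -1)  # fires unless both inputs are 1
    return mp_neuron((h_or, h_nand), (1, 1), 2)

print(and_solution is not None, or_solution is not None, xor_solution)
print([xor_two_layer(a, b) for a, b in INPUTS])
```

The two-layer version hints at why multi-layer networks later became the focus: the hidden units re-encode the inputs so that the final unit faces a linearly separable problem.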

1.2 The Dawn of Deep Learning

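Around 2000, shallow learning algorithms such as SVMs and decision trees were producing results; everyone "saw the light" and moved to other areas of machine learning, and artificial neural networks fell out of favor. In 2006, Geoffrey Hinton published a paper [4] that used neural networks for dimensionality reduction and proposed the autoencoder network. In the same year, the paper "A fast learning algorithm for deep belief nets" [5], published in Neural Computation, proposed a fast, greedy layer-wise training algorithm together with a deep neural network model, the deep belief network (DBN), which reached a striking 1.25% error rate on the well-known MNIST handwritten digit dataset (SVM: 1.4%; ANN: 2.95%). With that, deep learning stepped onto the stage.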

1.3 The Current State of Deep Learning

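In 2012 and 2013, deep convolutional neural networks (DCNNs) and deep neural networks for acoustic modeling, followed by deep recurrent neural networks (DRNNs), were applied successfully to image recognition and speech recognition respectively [6][7], establishing new approaches to image and speech processing. From 2012 on, research on deep learning algorithms sprang up like mushrooms after rain. Hinton and colleagues proposed Dropout in 2012 [8] and 2014 [9] to prevent the overfitting that comes with small datasets; in 2015 Ioffe and Szegedy [10] proposed batch normalization (BN) to keep gradients from vanishing during training, and the paper was regarded as one of the outstanding works of 2015; in 2016 Hinton's group [11] proposed layer normalization as a refinement of the BN approach. On the gradient descent side, Sutskever et al. [12] studied momentum-based optimization, and Kingma and Ba [13] proposed Adam, which is currently one of the most widely used gradient descent optimizers.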
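The techniques named above are easy to mix up, so here is a minimal NumPy sketch of each. It is written from the standard textbook formulations rather than copied from [8]-[13]; the function names, the toy data, and the hyperparameters are assumptions made for illustration, and the batch normalization shown uses training-mode batch statistics only (no running averages).

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero each activation with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return x
    keep = 1.0 - rate
    mask = rng.random(x.shape) < keep
    return x * mask / keep

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature (column) using statistics over the batch dimension."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each sample (row) using statistics over its own features."""
    mean = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns the new parameter and the two moment estimates."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (momentum-like average)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (squared-gradient average)
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 4))   # toy batch: 8 samples, 4 features

bn = batch_norm(x, gamma=1.0, beta=0.0)   # every column now has mean ~0
ln = layer_norm(x, gamma=1.0, beta=0.0)   # every row now has mean ~0
dropped = dropout(np.ones(1000), rate=0.5, rng=rng)

# Toy optimization problem: minimize f(w) = w^2, whose gradient is 2w.
w, m, v = np.array(5.0), 0.0, 0.0
for t in range(1, 2001):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.05)
```

The only difference between `batch_norm` and `layer_norm` here is the axis the statistics are taken over, which is why layer normalization does not depend on the batch size.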

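In terms of model architecture, there are two main camps: convolutional neural networks and recurrent neural networks. The convolutional line starts with AlexNet, proposed by Alex Krizhevsky et al. in [6]; 2014 brought the deeper VGGNet [14]; in 2015 Google [15] presented the Inception model GoogLeNet and showed that deeper networks can extract better features; and ResNet, proposed in 2015 [16] in a paper that went on to win the CVPR 2016 best paper award, uses very deep networks with a special residual structure and is still very widely used for image classification. If convolutional neural networks do well on spatial-domain image input, recurrent neural networks have the edge when the input is a time-domain sequence. The long short-term memory network (LSTM) [17], which grew out of recurrent neural networks, achieves striking results by adding the constraint of gate units. In 2014 a batch of work on sequence-to-sequence learning appeared [18][19], and it is widely used in text processing applications such as machine translation [20] and chatbots [21].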
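As a rough sketch of the two structural ideas mentioned above, the NumPy code below shows a residual connection and one LSTM time step. It follows the commonly taught formulations rather than the exact architectures in [16] or [17]: the residual block uses plain matrix multiplications instead of convolutions, and the function names, gate ordering, and toy sizes are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def residual_block(x, w1, w2):
    """y = relu(x + F(x)): the identity shortcut lets x bypass the transform F."""
    hidden = np.maximum(0.0, x @ w1)            # F, first layer with ReLU
    return np.maximum(0.0, x + hidden @ w2)     # add the shortcut, then ReLU

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step; W maps [h_prev, x] to four stacked gate pre-activations."""
    n = h_prev.shape[-1]
    z = np.concatenate([h_prev, x], axis=-1) @ W + b
    i = sigmoid(z[..., 0 * n:1 * n])   # input gate: how much new content to write
    f = sigmoid(z[..., 1 * n:2 * n])   # forget gate: how much old memory to keep
    o = sigmoid(z[..., 2 * n:3 * n])   # output gate: how much memory to expose
    g = np.tanh(z[..., 3 * n:4 * n])   # candidate cell content
    c = f * c_prev + i * g             # updated cell state (the "memory")
    h = o * np.tanh(c)                 # updated hidden state
    return h, c

rng = np.random.default_rng(0)

# Residual block on a toy batch: 2 samples, 4 features.
x = rng.normal(size=(2, 4))
y = residual_block(x, rng.normal(size=(4, 4)) * 0.1, rng.normal(size=(4, 4)) * 0.1)

# Run an LSTM with 3 hidden units over a toy sequence of five 2-dimensional inputs.
hidden_size, input_size = 3, 2
W = rng.normal(size=(hidden_size + input_size, 4 * hidden_size)) * 0.1
b = np.zeros(4 * hidden_size)
h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):
    h, c = lstm_step(x_t, h, c, W, b)
```

Two properties are worth noticing. If the weights of F are zero, the residual block simply passes non-negative inputs through, which is part of why very deep stacks of such blocks remain trainable. And when the forget gate saturates near 1 while the input gate is near 0, the cell state is carried forward almost unchanged, which is how the gates let an LSTM hold information across many time steps.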

2. Application Prospects of Deep Learning

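Visual tracking: visual tracking means detecting, extracting, recognizing, and tracking moving targets in an image sequence and obtaining their motion parameters, such as position, velocity, acceleration, and trajectory, so that further processing and analysis can lead to an understanding of the targets' behavior and support higher-level detection tasks. In 2013 the first paper to use deep learning for visual tracking [22] was published; it designed and implemented the DLT tracker for detecting targets in image sequences. In 2015 the same authors built on that work to design the improved SO-DLT tracker [23]; in the same year, tracking with fully convolutional networks (FCNT) was also proposed [24]. In 2016 SiameseFC [25] appeared as the state of the art in real-time object tracking at the time, and it is worth mentioning that TCNN [26] was a winning entry at VOT2016.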

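Object detection: object detection is the task of determining whether a certain kind of object is present in an image. In 2013 Szegedy et al. [27] first proposed using deep learning for object detection; 2014 brought the influential R-CNN [28] and SPPNet [29], and in 2015 the R-CNN model was further improved into Fast R-CNN [30] and Faster R-CNN [31]. The YOLO model [32], an outstanding piece of work proposed the same year, is of great practical value for object detection.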

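Image captioning: the image captioning task is mainly about describing the content of an image with a piece of text. It covers two directions: generating text from images (image understanding) and generating images from text (image generation). The mainstream approach first extracts features from the image with a deep convolutional neural network, then uses a recurrent neural network to turn the extracted features into text. Representative works include [33][34][35][36], and Fei-Fei Li's group [37][38] made milestone contributions; notably, a 2015 paper [39] brought attention models into image captioning and achieved good results.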

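Natural language processing: natural language processing (NLP) is an important direction within computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language. Research on this front began with [40]; in 2013 Mikolov et al. [41] proposed distributed representations of words, turning NLP problems into sequence learning problems; in 2014 Sutskever et al. [42] applied sequence-to-sequence neural network learning to NLP; 2015 brought NLP based on dynamic memory networks [43] and character-aware language models [44]. The same year also saw some application-level papers: [45] used attention-based neural network readers to tackle cloze-style questions built from CNN and Daily Mail news articles. In 2016, [46] proposed a state-of-the-art technique for text classification.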

References

[1] McCulloch W S, Pitts W. A logical calculus of the ideas immanent in nervous activity[J]. The bulletin of mathematical biophysics, 1943, 5(4): 115-133.

[2] Minsky M, Papert S. Perceptrons[M]. Cambridge, MA: MIT Press, 1969.

[3] Rumelhart D E, Hinton G E, Williams R J. Learning representations by back-propagating errors[J]. Cognitive modeling, 1988, 5(3): 1.

[4] Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. "Reducing the dimensionality of data with neural networks." Science 313.5786 (2006): 504-507.

[5] Hinton, Geoffrey E., Simon Osindero, and Yee-Whye Teh. "A fast learning algorithm for deep belief nets." Neural computation 18.7 (2006): 1527-1554.

[6] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems. 2012.

[7] Hinton, Geoffrey, et al. "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups." IEEE Signal Processing Magazine 29.6 (2012).

[8] Hinton, Geoffrey E., et al. "Improving neural networks by preventing co-adaptation of feature detectors." arXiv preprint arXiv:1207.0580 (2012).

[9] Srivastava, Nitish, et al. "Dropout: a simple way to prevent neural networks from overfitting." Journal of Machine Learning Research 15.1 (2014).

[10] Ioffe, Sergey, and Christian Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift." arXiv preprint arXiv:1502.03167 (2015).

[11] Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E. Hinton. "Layer normalization." arXiv preprint arXiv:1607.06450 (2016).

[12] Sutskever, Ilya, et al. "On the importance of initialization and momentum in deep learning." ICML (3) 28 (2013): 1139-1147.

[13] Kingma, Diederik, and Jimmy Ba. "Adam: A method for stochastic optimization." arXiv preprint arXiv:1412.6980 (2014).

[14] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).

[15] Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.

[16] He, Kaiming, et al. "Deep residual learning for image recognition." arXiv preprint arXiv:1512.03385 (2015).

[17] Graves, Alex. "Generating sequences with recurrent neural networks." arXiv preprint arXiv:1308.0850 (2013).

[18] Graves, Alex. "Generating sequences with recurrent neural networks." arXiv preprint arXiv:1308.0850 (2013).

[19] Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems. 2014.

[20] Bahdanau, Dzmitry, KyungHyun Cho, and Yoshua Bengio. "Neural Machine Translation by Jointly Learning to Align and Translate." arXiv preprint arXiv:1409.0473 (2014).

[21] Vinyals, Oriol, and Quoc Le. "A neural conversational model." arXiv preprint arXiv:1506.05869 (2015).

[22] Wang, Naiyan, and Dit-Yan Yeung. "Learning a deep compact image representation for visual tracking." Advances in neural information processing systems. 2013.

[23] Wang, Naiyan, et al. "Transferring rich feature hierarchies for robust visual tracking." arXiv preprint arXiv:1501.04587 (2015).

[24] Wang, Lijun, et al. "Visual tracking with fully convolutional networks." Proceedings of the IEEE International Conference on Computer Vision. 2015.

[25] Bertinetto, Luca, et al. "Fully-Convolutional Siamese Networks for Object Tracking." arXiv preprint arXiv:1606.09549 (2016).

[26] Nam, Hyeonseob, Mooyeol Baek, and Bohyung Han. "Modeling and Propagating CNNs in a Tree Structure for Visual Tracking." arXiv preprint arXiv:1608.07242 (2016).

[27] Szegedy, Christian, Alexander Toshev, and Dumitru Erhan. "Deep neural networks for object detection." Advances in Neural Information Processing Systems. 2013.

[28] Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." Proceedings of the IEEE conference on computer vision and pattern recognition. 2014.

[29] He, Kaiming, et al. "Spatial pyramid pooling in deep convolutional networks for visual recognition." European Conference on Computer Vision. Springer International Publishing, 2014.

[30] Girshick, Ross. "Fast r-cnn." Proceedings of the IEEE International Conference on Computer Vision. 2015.

[31] Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in neural information processing systems. 2015.

[32] Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." arXiv preprint arXiv:1506.02640 (2015).

[33] Farhadi, Ali, et al. "Every picture tells a story: Generating sentences from images". In Computer Vision, ECCV 2010. Springer Berlin Heidelberg: 15-29, 2010.

[34] Kulkarni, Girish, et al. "Baby talk: Understanding and generating image descriptions". In Proceedings of the 24th CVPR, 2011.

[35] Vinyals, Oriol, et al. "Show and tell: A neural image caption generator". In arXiv preprint arXiv:1411.4555, 2014.

[36] Donahue, Jeff, et al. "Long-term recurrent convolutional networks for visual recognition and description". In arXiv preprint arXiv:1411.4389 ,2014.

[37] Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions". In arXiv preprint arXiv:1412.2306, 2014.

[38] Karpathy, Andrej, Armand Joulin, and Fei Fei F. Li. "Deep fragment embeddings for bidirectional image sentence mapping". In Advances in neural information processing systems, 2014.

[39] Xu, Kelvin, et al. "Show, attend and tell: Neural image caption generation with visual attention". In arXiv preprint arXiv:1502.03044, 2015.

[40] Bordes, Antoine, et al. "Joint learning of words and meaning representations for open-text semantic parsing." AISTATS. 2012.

[41] Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in neural information processing systems. 2013: 3111-3119.

[42] Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems. 2014.

[43] Kumar, Ankit, et al. "Ask me anything: Dynamic memory networks for natural language processing." arXiv preprint arXiv:1506.07285 (2015).

[44] Kim, Yoon, et al. "Character-aware neural language models." arXiv preprint arXiv:1508.06615 (2015).

[45] Hermann, Karl Moritz, et al. "Teaching machines to read and comprehend." arXiv preprint arXiv:1506.03340 (2015).

[46] Conneau, Alexis, et al. "Very deep convolutional networks for natural language processing." arXiv preprint arXiv:1606.01781 (2016).