How do you reconstruct 3D structure from 2D images with a DNN?
From the column: One Paper a Week
Today's paper from DeepMind.
The video version is here:
https://deepmind.com/blog/neural-scene-representation-and-rendering/
The original paper is here:
https://deepmind.com/documents/211/Neural_Scene_Representation_and_Rendering_preprint.pdf
To this end, we introduce the Generative Query Network (GQN), a framework within which machines learn to represent scenes using only their own sensors
Given a handful of views of a scene from different viewpoints, the model reconstructs the whole scene in 3D from just those views. Note that this is unlike all the previous point-cloud algorithms: the depth information is worked out by the neural network itself.
Modern artificial vision systems are based on deep neural networks that consume large, labelled datasets to learn functions that map images to human-generated scene descriptions.
This part of the paper cites many earlier works on 3D scene reconstruction; see the original if you are interested. The DeepMind authors' view is that since humans can reconstruct 3D scenes with a neural network (the brain), a model should be able to as well.
To that end, we present the Generative Query Network (GQN). In this framework, as an agent navigates a 3D scene i, it collects K images from 2D viewpoints .... The agent passes these observations to a GQN composed of two main parts: a representation network f and a generation network g
f compresses the input images into a representation and feeds it into g. g contains a rendering component that outputs predictions of the scene from different viewpoints.
The two networks are trained jointly, in an end-to-end fashion, to maximize the likelihood of generating the ground-truth image that would be observed from the query viewpoint. ... [GQN] will produce scene representations that contain all information necessary for the generator to make accurate image predictions (e.g., capturing object identities, positions, colours, counts and room layout).
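To make the f/g split concrete, here is a minimal sketch (my own, not the paper's code) of what a GQN-style forward pass could look like in PyTorch: each observation is encoded by f, the per-view codes are summed into one scene representation r, and g renders the scene from a query viewpoint given r. The layer sizes, the 7-dimensional viewpoint vector, and the deterministic decoder are all simplifying assumptions; the paper's actual generator is an autoregressive latent-variable model.

```python
import torch
import torch.nn as nn

class RepresentationNet(nn.Module):
    """f: encodes one (image, viewpoint) pair into a scene representation vector."""
    def __init__(self, repr_dim=256, view_dim=7):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64 + view_dim, repr_dim)

    def forward(self, image, viewpoint):
        h = self.conv(image).flatten(1)                    # (B, 64)
        return self.fc(torch.cat([h, viewpoint], dim=-1))  # (B, repr_dim)

class GeneratorNet(nn.Module):
    """g: renders the image seen from a query viewpoint, given the scene representation."""
    def __init__(self, repr_dim=256, view_dim=7):
        super().__init__()
        self.fc = nn.Linear(repr_dim + view_dim, 64 * 8 * 8)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, scene_repr, query_viewpoint):
        h = self.fc(torch.cat([scene_repr, query_viewpoint], dim=-1))
        return self.deconv(h.view(-1, 64, 8, 8))           # (B, 3, 32, 32)

def predict_view(f, g, context_images, context_viewpoints, query_viewpoint):
    # Per-view codes are summed into a single scene representation r (order-invariant),
    # then the generator renders the scene from the query viewpoint.
    r = sum(f(img, vp) for img, vp in zip(context_images, context_viewpoints))
    return g(r, query_viewpoint)
```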
The paper also specifically discusses how this differs from point clouds:
the GQN will learn by itself what these factors are, as well as how to extract them from pixels. Moreover, the generator internalizes any statistical regularities that are common across different scenes .... In contrast, voxel (12–15) or point-cloud (16) methods (as typically obtained by classical structure-from-motion) employ literal representations and therefore typically scale poorly with scene complexity and size and are also difficult to apply to non-rigid objects (e.g., animals, vegetation, or cloth).
The training procedure has the agent take photos from different positions in a space and then try to predict what the scene should look like from another viewpoint. In the first experiment, both the data and the parameters used are very small:
With this representation, which can be as small as 256 dimensions, the generator's predictions at query viewpoints are highly accurate and mostly indistinguishable from ground-truth (Fig. 2A).
Notably, the model only ever observes a small number of images from each scene during training (in this experiment, fewer than 5), yet it is capable of rendering unseen training or test scenes from arbitrary viewpoints.
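The training loop matches that description: for each scene, pick fewer than 5 views as context, encode them with f, and ask g to predict a held-out query view, comparing the prediction against the ground-truth image. Below is a rough sketch under the same assumptions as the model sketch above; I use a plain pixel reconstruction loss as a stand-in, whereas the paper maximizes a variational bound on the likelihood with a latent-variable generator.

```python
import torch
import torch.nn.functional as F

def train_step(f, g, optimizer, scene_batch):
    """One gradient step. scene_batch is a list of (images, viewpoints) per scene,
    with images of shape (K+1, 3, H, W) and viewpoints of shape (K+1, 7)."""
    optimizer.zero_grad()
    loss = 0.0
    for images, viewpoints in scene_batch:
        k = torch.randint(1, 5, ()).item()        # fewer than 5 context views per scene
        r = sum(f(images[i:i+1], viewpoints[i:i+1]) for i in range(k))
        prediction = g(r, viewpoints[k:k+1])      # render the held-out query viewpoint
        # Stand-in reconstruction loss; the paper optimizes an ELBO instead.
        loss = loss + F.mse_loss(prediction, images[k:k+1])
    loss.backward()
    optimizer.step()
    return float(loss)
```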
Why does this model compress so well? The paper probes what the learned scene representations capture by visualizing them with t-SNE:
t-SNE is a method for non-linear dimensionality reduction that approximately preserves the metric properties of the original high-dimensional data. Each dot represents a different view of a different scene, with colour indicating scene identity. Whereas the VAE clusters images mostly on the basis of wall angles, GQN clusters images of the same scene, independent of view (scene representations computed from each image individually).
The original t-SNE paper is here: http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf. The VAE mentioned here was covered in my earlier post on a Google Brain paper; if interested, see "[Google Brain] Can an agent learn inside its own dream?".
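Reproducing this kind of analysis is straightforward with scikit-learn, assuming you have already collected one representation vector per (image, viewpoint) pair together with its scene id; this is just the standard library call, not DeepMind's own plotting code.

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_scene_representations(representations, scene_ids):
    """representations: (N, 256) array of per-image scene codes; scene_ids: (N,) labels."""
    # Project the high-dimensional codes to 2D and colour each point by its scene.
    embedded = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(representations)
    plt.scatter(embedded[:, 0], embedded[:, 1], c=scene_ids, cmap="tab20", s=8)
    plt.title("t-SNE of scene representations (colour = scene identity)")
    plt.show()
```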
The agent is also responsible for controlling the camera, so that it knows when it has collected enough information. One advantage the agent has is that it knows its own position, so the training procedure is:
we first trained a GQN and used it to succinctly represent the observations. A policy was then trained to control the arm directly from these representations. In this setting, the representation network must learn to communicate only the arm's joint angles, the position and colour of the object, and the colours of the walls for the generator to be able to predict new views.
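In code terms the split is simple: the GQN is trained first and then frozen, and a small policy head is trained on top of its representation instead of raw pixels, using whatever RL algorithm you prefer. A hypothetical sketch, where the 256-dimensional representation and the 9-dimensional action are my own placeholder shapes:

```python
import torch
import torch.nn as nn

class ArmPolicy(nn.Module):
    """Policy head mapping a frozen GQN scene representation to arm controls."""
    def __init__(self, repr_dim=256, action_dim=9):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(repr_dim, 128), nn.ReLU(),
            nn.Linear(128, action_dim), nn.Tanh(),
        )

    def forward(self, scene_repr):
        return self.net(scene_repr)

# Usage: r = f(observation, viewpoint) with f frozen; action = policy(r).
# Only the policy's parameters are updated by the RL objective.
```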
The paper also covers how this differs from earlier methods; I am skipping that part, read the original if you are interested. To me the biggest contribution of this paper is being able to render 3D views of a scene in an unsupervised setting, but the scenes are toy examples. Compared with the UW point-cloud reconstruction of the Colosseum I saw before, Building Rome in a Day, I do not know whether it could still render reasonably at the same level of complexity.