For further thought...: October 2016

Koninklijk Theater Carré (in Amsterdam), where the main conference was held, is regularly used for theatrical and circus performances. The main stage was home to all the oral and most of the poster presentations during the week. This meant both that (i) speakers were performers with their audience looming above them from all sides and balconies, and that (ii) poster sessions from a bird's-eye view looked like a simulation of particles moving through a viscous liquid, trapped within the confines of the stage (scroll to the end of this post for a demonstration).

Due to this unusual set-up, audience questions could not be solicited in the usual manner of lining up in front of a microphone (try climbing over all those people, and out of a balcony). Instead, given a tech crowd, it was expected that technology could easily come to the rescue... the results of which can be summarized by comments made on separate occasions by the respective session chairs:

"Please post your questions on twitter and we will ask them on your behalf [...]
But neither of us have twitter, so we will ask our own questions in this session."

"It seems the community is composed of two groups:

those that have questions, and those that know how to use twitter

- we’re still hoping there will be an intersection at some point."

There was little to complain about otherwise: the venue was quite beautiful; there were many comfortable corners all around the building that were quite favorable to getting some paper reading done; the little baked parmesan palmiers that waiters carried around on trays all throughout the day were impeccable; and the city surrounding the conference was bursting with energy and canals.

Main topics:

During the welcome, the general chairs put up some statistics about the topic areas that represented ECCV this year. The top ones include:

deep learning

3D modeling

events, actions

object class detection

semantic image

object tracking

de-blurring

scene understanding

image indexing

face recognition

segmentation

Topics like sparse coding are going down in paper representation. High acceptance rate topics are confounded by the size of those topics: smaller topics have a larger relative percentage of papers that are accepted (e.g. model-based reconstruction, 3D representation, etc.). Popular reviewer subject areas mostly follow the top topic areas above - specifically: 3D modeling, deep learning, object class detection, events, face recognition, scene understanding, etc.

Summary notes:

My summary notes on the presentations that I attended can be found here (covers ~70% of the oral sessions): https://docs.google.com/document/d/175ORVlLMdjOscJ7-93WIt0bieUiu21vtlL7J-7-7qBI/pub

Some general research trends*:

* disclaimer: very biased by my own interests, observations, and opinions

(which tend to revolve around perception, cognition, attention, and language)

for an objective summary, go instead to the summary notes linked above

Nobody asks anymore: "is this done with CNNs too?", and more and more research is digging into the depths of the not-so-black* box of CNNs. The remaining fruits are now a little higher than they were before, and we are beginning to see more reaching - in the form of innovations in architectures, evaluations, internal representations, transfer learning, integration with new sensors/robotics, and unsupervised approaches. More about some of these below.

* With some notable exceptions -> Chair: “did you train with stochastic gradient descent?” Speaker: “we trained with caffe”

We're seeing old ideas come back in new architectural forms: new ways of encoding long-thought-about constraints and relations. If one can open an old vision paper and reformulate the proposed pipeline as an end-to-end network, encode constraints and heuristics as appropriate loss functions, and leverage different task knowledge by designing a corresponding training procedure, then a new paper is in the making (e.g. active vision for recognition).

http://www.eccv2016.org/files/posters/S-1B-05.pdf

Themes that we are beginning to see more and more of: unsupervised learning, semi-supervised learning, and self-supervised learning (with varying degrees of overlap, depending on how you define them). The main idea is that with the deep and powerful architectures we have now, solving each new problem in an end-to-end fashion would require an ImageNet-scale dataset. Because this is not always possible, transferring knowledge, labels, and classifications across tasks, datasets, and individual frames/images is the sought-after approach.

Video is a popular modality: temporal information can provide a strong supervisory signal for propagating labels across frames or for learning to do object detection from unlabeled video (e.g., Walker et al., Long et al.). Key frames of an action or an event can serve as targets for the rest of the frames. For instance, Zhao et al. perform facial expression recognition using peak facial expressions as a supervisory signal, by matching the internal representations (i.e. network features) of peak and non-peak facial expressions in order to build more robustness and invariance into the recognition pipeline. Similarly, photo sequences or collections provide loose temporal relationships that can be harnessed as a self-supervisory cue for predicting relevant/future photos (e.g., Sigurdsson, Chen & Gupta). As a side note, there is a lot more work on multi-dimensional inputs (3D, video, image sequences/collections) than on single images. Even with single images, there is a lot more temporal processing (e.g., via attention modules, more about this below). In other words, tasks that can be summarized as "image in" -> "single-label prediction out" have pretty much been exhausted.

An Uncertain Future: Forecasting from Static Images using Variational Autoencoders, Jacob Walker, Carnegie Mellon University; Carl Doersch, Carnegie Mellon University; Abhinav Gupta, ; Martial Hebert, Carnegie Mellon University

Learning Image Matching by Simply Watching Video, Gucan Long, NUDT; Laurent Kneip, Australian National University; Jose M. Alvarez, Data61 / CSIRO; Hongdong Li, ; Xiaohu Zhang, NUDT; Qifeng Yu, NUDT

Peak-Piloted Deep Network for Facial Expression Recognition, Xiangyun Zhao, University of California, San Diego; Xiaodan Liang, Sun Yat-sen University; Luoqi Liu, Qihoo/360; Teng Li, Anhui University; Yugang Han, 360 AI Institute; Nuno Vasconcelos, ; Shuicheng Yan

Learning Visual Storylines with Skipping Recurrent Neural Networks, Gunnar Sigurdsson, Carnegie Mellon University; Xinlei Chen, CMU; Abhinav Gupta

Language is another powerful supervisory signal: images that share tags or words in their respective descriptions (think also: comments in the context of social media) can be used to train network representations to cluster such images closer together or further apart (e.g., Yang et al.). Some further examples of self-supervision by language include the works of Rohrbach and Lu. Other examples of cues/tasks used as self-supervision to learn useful internal representations for other tasks: co-occurrence, denoising, colorization, sound, egomotion, context, and video. Existing images can also serve as free training data: modify them, then learn the mapping back to the originals (e.g., colorization, discussed more below, or image scrambling: Noroozi & Favaro).

Improving Multi-label Learning with Missing Labels by Structured Semantic Correlations, Hao Yang, NTU; Joey Tianyi Zhou, IHPC; Jianfei Cai, NTU

Grounding of Textual Phrases in Images by Reconstruction, Anna Rohrbach; Marcus Rohrbach, UC Berkeley; Ronghang Hu, UC Berkeley; Trevor Darrell, UC Berkeley; Bernt Schiele

Visual Relationship Detection with Language Priors, Cewu Lu, Stanford University; Ranjay Krishna, Stanford University; Michael Bernstein, Stanford University; Fei-Fei Li, Stanford University

Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles, Mehdi Noroozi, University of Bern; Paolo Favaro

http://www.eccv2016.org/files/posters/O-1B-04.pdf
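To make the "modify the image, then learn the mapping back" idea concrete, here is a minimal sketch of how a scrambled-tiles training sample could be generated from an unlabeled image. It uses plain Python lists as a stand-in for pixel arrays, and the names make_jigsaw_sample and restore are made up for illustration. It is not the pipeline from the jigsaw paper, only the generic idea that the shuffle itself provides the label for free.

```python
import random


def make_jigsaw_sample(image, grid=3, rng=None):
    """Cut a square image (list of rows) into grid x grid tiles and shuffle them.

    Returns (shuffled_tiles, permutation). The permutation is the free label:
    shuffled_tiles[k] is the original tile number permutation[k].
    """
    rng = rng or random.Random()
    size = len(image)
    if size % grid != 0 or any(len(row) != size for row in image):
        raise ValueError("image must be square with a side divisible by grid")
    tile = size // grid
    tiles = []
    for ty in range(grid):
        for tx in range(grid):
            rows = image[ty * tile:(ty + 1) * tile]
            tiles.append([row[tx * tile:(tx + 1) * tile] for row in rows])
    permutation = list(range(grid * grid))
    rng.shuffle(permutation)
    return [tiles[i] for i in permutation], permutation


def restore(shuffled, permutation, grid=3):
    """Undo the shuffle and reassemble the original image."""
    tiles = [None] * len(shuffled)
    for position, source_index in enumerate(permutation):
        tiles[source_index] = shuffled[position]
    tile = len(tiles[0])
    image = []
    for ty in range(grid):
        for r in range(tile):
            row = []
            for tx in range(grid):
                row.extend(tiles[ty * grid + tx][r])
            image.append(row)
    return image


# Toy 6x6 "image" whose pixel values are just their raster index.
toy_image = [[y * 6 + x for x in range(6)] for y in range(6)]
toy_tiles, toy_label = make_jigsaw_sample(toy_image, grid=3, rng=random.Random(0))
```

A network trained on such samples would take the shuffled tiles as input and be asked to predict the permutation; no human annotation is involved at any point.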

Works that demonstrate new unsupervised approaches will typically do evaluation in one of the following ways: (i) show that useful intermediate features emerge, by visualizing what neurons learn to fire on (as in the work of Owens et al. and Zhang et al., based on the approach introduced by Zhou et al.), or (ii) show that the learned internal representation provides good initialization for other tasks - i.e. that it is amenable to transfer learning (see Larsson et al. or Zhang et al. for more examples). A great example of this self-supervised learning approach is the work by Pinto et al., who showed that a physical robot that grasped, pushed, and poked a whole bunch of objects a whole bunch of times could learn useful visual representations for other tasks. Demonstrating that a learned representation is useful can be done by fixing the network and using computed features to directly cluster/retrieve images, or learning a classifier on top of the computed features for a new task, or using the learned representation only as an initialization while retraining with new data. The last approach is especially useful if the task for which the network needs to be retrained does not have enough training data for complete end-to-end learning, and the unsupervised approach can bootstrap some of the feature learning.

Ambient sound provides supervision for visual learning, Andrew Owens, MIT; Jiajun Wu, MIT; Josh Mcdermott, MIT; Antonio Torralba, MIT; William Freeman, MIT

Colorful Image Colorization, Richard Zhang, UC Berkeley; Phillip Isola, MIT; Alexei Efros

Learning Representations for Automatic Colorization, Gustav Larsson, University of Chicago; Michael Maire, Toyota Technological Institute at Chicago; Greg Shakhnarovich, TTI Chicago, USA

The Curious Robot: Learning Visual Representations via Physical Interactions, Lerrel Pinto, Carnegie Mellon University; Dhiraj Gandhi, ; Yuanfeng Han, ; Yong-Lae Park, ; Abhinav Gupta

http://www.eccv2016.org/files/posters/O-4B-04.pdf
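The "fix the network and learn a classifier on top of the computed features" evaluation mentioned above can be sketched in a few lines. Everything here is a stand-in: frozen_features is a hypothetical fixed function playing the role of a pretrained network, and a nearest-centroid rule is swapped in for whatever classifier a given paper actually trains. The point is only that the feature extractor is never updated.

```python
def frozen_features(x):
    """Hypothetical stand-in for a pretrained network with fixed weights."""
    a, b = x
    return [a + b, a - b, a * b]


def fit_centroids(features, labels):
    """Train the only learnable part: one mean feature vector per class."""
    sums, counts = {}, {}
    for feature, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(feature)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], feature)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def predict(centroids, feature):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def squared_distance(centroid):
        return sum((u - v) ** 2 for u, v in zip(feature, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# A tiny labeled set for the *new* task; the features come from the frozen model.
train_x = [(0.0, 0.1), (0.2, 0.0), (1.0, 0.9), (0.8, 1.0)]
train_y = ["low", "low", "high", "high"]
centroids = fit_centroids([frozen_features(x) for x in train_x], train_y)
prediction = predict(centroids, frozen_features((0.9, 0.8)))
```

If a handful of labeled examples is enough to separate the classes in feature space, that is taken as evidence that the representation learned without labels is useful for the new task.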

This also touches on an important trend: we are starting to see more integration with robotics. We are coming back to active vision (e.g., Jayaraman & Grauman). New architectures and compute power are providing us with the capabilities of learning structure from (relatively unstructured) interactions. This area of research will likely see tremendous growth in the next few years.
Deep is coming to a robotics lab near you.

Look-ahead before you leap: end-to-end active recognition by forecasting the effect of motion, Dinesh Jayaraman, UT Austin; Kristen Grauman, University of Texas at Austin

http://www.eccv2016.org/files/posters/P-3B-17.pdf

Language continues to be a hot topic. This includes image captioning (and variants, like Zeng's "title generation"), and related tasks like visual question answering - VQA (e.g., Mallya & Lazebnik, Lin & Parikh), referring expressions (e.g., Hu et al., Yu et al.), explanation generation (Hendricks et al.), semantic tagging, and leveraging language as a supervisory cue for other visual recognition tasks (as discussed above). Attention modules are also beginning to pop up more frequently: here, "attention" is used to refer to a modulation of (visual) features - a reweighting of which features, at which spatial locations, are used most at a given timestep (e.g., Zhang et al.). Often, attention modules go hand-in-hand with recurrent neural networks (RNNs, e.g., LSTMs) that can encode temporal relationships. In this case, processing of the visual input at one time step influences processing at the next time step. For instance, captioning systems may "attend" to different image regions in sequence, while generating caption words sequentially. VQA systems may use a similar iterative procedure to refine the location in the image that can provide an answer to the question or aid with localizing a referring expression (e.g. Rohrbach et al., Xu & Saenko).

Title Generation for User Generated Videos, Kuo-Hao Zeng, National Tsing Hua University; Tseng-Hung Chen, National Tsing Hua University; Juan Carlos Niebles, Stanford University; Min Sun, National Tsing Hua University

Learning Models for Actions and Person-Object Interactions with Transfer to Question Answering, Arun Mallya, UIUC; Svetlana Lazebnik

Leveraging Visual Question Answering for Image-Caption Ranking, Xiao Lin, Virginia Tech; Devi Parikh, Virginia Tech

Segmentation from Natural Language Expressions, Ronghang Hu, UC Berkeley; Marcus Rohrbach, UC Berkeley; Trevor Darrell

Modeling Context in Referring Expressions, Licheng Yu, University of North Carolina; Patrick Poirson, ; Shang Yang, ; Alex Berg, ; Tamara Berg, University of North Carolina

Generating Visual Explanations, Lisa Anne Hendricks, UC Berkeley; Zeynep Akata, ; Marcus Rohrbach, UC Berkeley; Jeff Donahue, UC Berkeley; Bernt Schiele, ; Trevor Darrell

Top-down Neural Attention by Excitation Backprop, Jianming Zhang; Zhe Lin, Adobe Systems, Inc.; Jonathan Brandt; Xiaohui Shen, Adobe; Stan Sclaroff, Boston University

Grounding of Textual Phrases in Images by Reconstruction, Anna Rohrbach; Marcus Rohrbach, UC Berkeley; Ronghang Hu, UC Berkeley; Trevor Darrell, UC Berkeley; Bernt Schiele

Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering, Huijuan Xu, UMass Lowell; Kate Saenko, University of Massachusetts Lowell
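As a rough illustration of attention as "a reweighting of which features, at which spatial locations, are used most at a given timestep", here is a minimal soft-attention sketch. The dot-product scoring and the name attend are assumptions made for this example; the papers listed above each define their own scoring and modulation schemes.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def attend(location_features, query):
    """Reweight per-location features by their relevance to a query vector.

    location_features: one feature vector per spatial location.
    query: e.g. an RNN hidden state or question encoding (assumed here).
    Returns (weights, context), where context is the weighted feature sum.
    """
    scores = [sum(f * q for f, q in zip(feature, query))
              for feature in location_features]
    weights = softmax(scores)
    dim = len(location_features[0])
    context = [sum(w * feature[d] for w, feature in zip(weights, location_features))
               for d in range(dim)]
    return weights, context


# Three spatial locations with 2-D features; the query changes per timestep.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights_t1, context_t1 = attend(features, query=[2.0, 0.0])
weights_t2, context_t2 = attend(features, query=[0.0, 5.0])
```

With a recurrent model, the query at each timestep would come from the previous hidden state, so what was attended to at one step shapes where the model looks next.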

In general, many more works are using RNNs - and this is because some portion of the input or required output can be interpreted as a sequence: e.g. a sequence of frames, a sequence of images in a collection, or a sequence of words (in the input question or output caption). RNNs have also been shown to provide effective iterative refinement (e.g. Liang et al.). An "attention module" can similarly be used to parse an image or image features as a sequence (e.g. Xiao et al., Peng et al., Ye et al.). What this accomplishes is some simulation of bottom-up combined with top-down reasoning. And by the way, we talked a bit about attention and how it can be used to leverage other vision tasks in our Saturday tutorial.

Semantic Object Parsing with Graph LSTM, Xiaodan Liang, Sun Yat-sen University; Xiaohui Shen, Adobe; Jiashi Feng, NUS; Liang Lin, Sun Yat-sen University; Shuicheng Yan, NUS

Robust Facial Landmark Detection via Recurrent Attentive-Refinement Networks, Shengtao Xiao, National University of Singapore; Jiashi Feng, NUS; Junliang Xing, Chinese Academy of Sciences; Hanjiang Lai, SUN YAT-SEN UNIVERSITY; Shuicheng Yan, National University of Singapore; Ashraf Kassim, National University of Singapore

A Recurrent Encoder-Decoder Network for Sequential Face Alignment, Xi Peng, Rutgers University; Rogerio Feris, IBM Research Center, USA; Xiaoyu Wang, Snapchat Research; Dimitris Metaxas, Rutgers University

Spatial Attention Deep Net with Partial PSO for Hierarchical Hybrid Hand Pose Estimation, Qi Ye, ; Shanxin Yuan, Imperial College London; Tae-Kyun Kim, Imperial College London

With regards to image understanding and language, beyond scene recognition and object detection, we are also seeing increasing interest in interaction and relationship detection (e.g. Mallya & Lazebnik, Lu et al., Nagaraja et al.). I also found quite interesting the applications of language to non-natural images - specifically, diagrams (Kembhavi et al., Siegel et al.).

Learning Models for Actions and Person-Object Interactions with Transfer to Question Answering, Arun Mallya, UIUC; Svetlana Lazebnik

Visual Relationship Detection with Language Priors, Cewu Lu, Stanford University; Ranjay Krishna, Stanford University; Michael Bernstein, Stanford University; Fei-Fei Li, Stanford University

Modeling Context Between Objects for Referring Expression Understanding, Varun Nagaraja, University of Maryland; Vlad Morariu, University of Maryland; Larry Davis, University of Maryland

A Diagram Is Worth A Dozen Images, Aniruddha Kembhavi, AI2; Michael Salvato, Allen Institute for Artificial; Eric Kolve, Allen Institute for AI; Minjoon Seo, University of Washington; Hannaneh Hajishirzi, University of Washington; Ali Farhadi, University of Washington

FigureSeer: Parsing Result-Figures in Research Papers, Noah Siegel, ; Zachary Horvitz, ; Roie Levin, ; Santosh Kumar Divvala, Allen Institute for Artificial Intelligence; Ali Farhadi, University of Washington

We continue to see interesting innovations in neural network architectures - for instance, alternatives to convolution filters (Liu et al., Danelljan et al.), integration of CRFs with NNs (Arnab et al., Gadde et al., Chandra & Kokkinos), and nice tricks to facilitate training like stochastic depth (Huang et al.), to mention just a few.

Learning Recursive Filters for Low-Level Vision via a Hybrid Neural Network, Sifei Liu, UC Merced; Jinshan Pan, UC Merced; Ming-Hsuan Yang, UC Merced

Beyond Correlation Filters: Learning Continuous Convolution Operators for Visual Tracking, Martin Danelljan, Linköping University; Andreas Robinson, Linköping University; Fahad Khan, Linköping University, Sweden; Michael Felsberg, Linköping University

Higher Order Conditional Random Fields in Deep Neural Networks, Anurag Arnab, University of Oxford; Sadeep Jayasumana, University of Oxford; Shuai Zheng, University of Oxford; Philip Torr, Oxford University

Superpixel Convolutional Networks using Bilateral Inceptions, Raghudeep Gadde, Ecole des Ponts Paris Tech; Varun Jampani, MPI-IS; Martin Kiefel, MPI for Intelligent Systems; Daniel Kappler, MPI Intelligent Systems; Peter Gehler

Fast, Exact and Multi-Scale Inference for Semantic Image Segmentation with Deep Gaussian CRFs, Siddhartha Chandra, INRIA; Iasonas Kokkinos, INRIA

Deep Networks with Stochastic Depth, Gao Huang, Cornell University; Yu Sun, Cornell University; Zhuang Liu, Tsinghua University; Daniel Sedra, Cornell University; Kilian Weinberger, Cornell University

http://www.eccv2016.org/files/posters/S-3A-08.pdf
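For readers unfamiliar with stochastic depth, the gist as I understand it is that whole residual blocks are randomly skipped during training, and at test time every block is kept but its residual branch is scaled by its survival probability. The sketch below works on plain lists of numbers; residual_block and its parameters are illustrative names, not code from the paper.

```python
import random


def residual_block(x, transform, survival_prob, training, rng):
    """One residual block with stochastic depth (illustrative sketch).

    x: input activations as a list of floats.
    transform: the block's residual branch (any function list -> list).
    survival_prob: probability that the block is kept during training.
    """
    if training:
        if rng.random() < survival_prob:
            # Block survives: the usual identity + residual.
            return [xi + ti for xi, ti in zip(x, transform(x))]
        # Block dropped: only the identity shortcut remains.
        return list(x)
    # Test time: keep every block, scale the residual by its survival probability.
    return [xi + survival_prob * ti for xi, ti in zip(x, transform(x))]


def double(values):
    """Toy residual branch used only for this example."""
    return [2.0 * v for v in values]


rng = random.Random(0)
eval_output = residual_block([1.0, 2.0], double, 0.5, training=False, rng=rng)
```

During training the effective network is therefore shallower on average, which is where the savings come from, while the full depth is still available at test time.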

Among very specific topics with disproportionately many papers this year: 11 papers on person re-identification, 6 papers on object counting (5 of which use CNNs), 3 papers with colorization applications (Zhang, Larsson, Liu), and over 20 papers on segmentation and variations on segmentation (like portrait or scene matting, e.g., Shen et al.). For instance, there were many improvements in semantic segmentation, and some domain-specific (e.g. biomedical) segmentation approaches presented (e.g. Liu et al.).

Colorful Image Colorization, Richard Zhang, UC Berkeley; Phillip Isola, MIT; Alexei Efros

Learning Representations for Automatic Colorization, Gustav Larsson, University of Chicago; Michael Maire, Toyota Technological Institute at Chicago; Greg Shakhnarovich, TTI Chicago, USA

Learning Recursive Filters for Low-Level Vision via a Hybrid Neural Network, Sifei Liu, UC Merced; Jinshan Pan, UC Merced; Ming-Hsuan Yang, UC Merced

SSHMT: Semi-supervised Hierarchical Merge Tree for Electron Microscopy Image Segmentation, Ting Liu, University of Utah; Miaomiao Zhang, MIT; Mehran Javanmardi, University of Utah; Nisha Ramesh, University of Utah; Tolga Tasdizen, University of Utah

Deep Automatic Portrait Matting, Xiaoyong Shen, CUHK; Xin Tao, CUHK; Hongyun Gao, CUHK; Chao Zhou, ; Jiaya Jia, Chinese University of Hong Kong

Interestingly, none of the award-winning papers were about neural networks.

The future of vision conferences?

It is interesting to observe how fast this field evolves, and the impacts this has on researchers, research programs, the publishing pipeline, and the outcome of conferences. In particular, it is now common for papers to be up on arxiv for over half a year before they are presented at a conference. Occasionally this can lead to confusion, with researchers scratching their heads, surprised to stumble upon a particular paper at the conference (hasn't this paper already been published for a while? hasn't it already appeared in the mass media?). By the time the conference rolls around, other researchers may already be familiar with the paper, and may have even innovated on top of it.

With the speed of innovation, at the same conference you might find both papers that build upon previous architectures to address their pitfalls, and other papers that completely replace the original architectures with large performance gains. Small improvements are likely to be quickly overtaken by more significant leaps that leave the small improvements forgotten. Lasting work requires qualitatively new approaches.

It was interesting to see that a number of researchers presented their original published results (from the camera ready version of the paper) alongside new results obtained since, in an attempt to stay current - after all, half a year of additional innovations can change many numbers. Some of these additional innovations are a result of building upon recently-arxived work. Some presenters even explicitly make reference to an extension of the presented work that is already available on arxiv or is published in another venue.

This might explain some of the proliferation of computer vision research to other conferences. To get innovations out fast enough for them to remain relevant, it might make more sense to publish them in the nearest upcoming venue than to wait for the next computer vision conference to roll around. We're seeing related papers pop up in satellite workshops, and other conferences in machine learning, graphics, robotics, and language (take your favorite computer vision researcher and check which venues they've most recently published in).

It has become common to hear: "This was state of the art at the time of submission... But since then, we have been surpassed by multiple methods".

This leads to an interesting conundrum: arxived work is not peer-reviewed, but creeps into presentations of peer-reviewed work at conferences. This is one way that presented work is made more current and relevant. Is this a symptom of the progress in this field outrunning the current conference structure? In some other fields (physics, biology, neuroscience, etc.), conference presentations are based on submitted abstracts, and publications are disentangled from conferences. However, I don't believe there are precedents of a field moving this fast. This is a difficult question.

But on the topic of modernizing conferences, something needs to be done about the overcrowding situation around posters (especially with attendance growing considerably). It's quite hard to find a spot to stand in front of a poster presenter, within audible distance and without occlusion. Up in the balcony of the Theater Carré, filming the craziness below, I daydreamed of staying comfortably seated while flying a drone to a perfectly-selected location in full view of a desired poster and presenter. Perhaps that kind of swarm behavior could be much more efficiently optimized using some future conference logistics software ;) In the meantime, here's my bird's-eye view:

For further thought...

Thursday, 20 October 2016

ECCV in a theatrical setting
