Thursday, 20 October 2016

ECCV in a theatrical setting

Koninklijk Theater Carré (in Amsterdam), where the main conference was held, is regularly used for theatrical and circus performances. The main stage was home to all the oral and most of the poster presentations during the week. This both meant that (i) speakers were performers with their audience looming above them from all sides and balconies, and that (ii) poster sessions from a birds-eye-view looked like a simulation of particles moving through a viscous liquid, trapped within the confines of the stage (scroll to the end of this post for a demonstration).

Due to this unusual set-up, audience questions could not be solicited in the usual manner of line ups in front of a microphone (try climbing over all those people, and out of a balcony). Instead, given a tech crowd, it was expected that technology could easily come to the rescue... the results of which can be summarized by comments made on separate occasions by the respective session chairs:

"Please post your questions on twitter and we will ask them on your behalf [...]
But neither of us have twitter, so we will ask our own questions in this session."

"It seems the community is composed of two groups:
those that have questions, and those that know how to use twitter
- we’re still hoping there will be an intersection at some point."

There was little to complain about otherwise: the venue was quite beautiful; there were many comfortable corners all around the building that were quite favorable to getting some paper reading done; the little baked parmesan palmiers that waiters carried around on trays all throughout the day were impeccable; and the city surrounding the conference was bursting with energy and canals.

Main topics:

During the welcome, the general chairs put up some statistics about the topic areas that represented ECCV this year. The top ones include:

  • deep learning
  • 3D modeling
  • events, actions
  • object class detection
  • semantic image
  • object tracking
  • de-blurring
  • scene understanding
  • image indexing
  • face recognition
  • segmentation

Topics like sparse coding are going down in paper representation. High acceptance rate topics are confounded by the size of those topics: smaller topics have a larger relative percentage of that are accepted (e.g. model-based reconstruction, 3D representation, etc.). Popular reviewer subject areas mostly follow the top topic areas above - specifically: 3D modeling, deep learning, object class detection, events, face recognition, object class detection, scene understanding, etc.

Summary notes:

My summary notes on the presentations that I attended can be found here (covers ~70% of the oral sessions):

Some general research trends*:

* disclaimer: very biased by my own interests, observations, and opinions 
(which tend to revolve around perception, cognition, attention, and language)
for an objective summary, go instead to the summary notes linked above

Nobody asks anymore: "is this done with CNNs too?", and more and more research is digging into the depths of the not-so-black* box of CNNs. The remaining fruits are now a little higher than they were before, and we are beginning to see more reaching - in the form of innovations in architectures, evaluations, internal representations, transfer learning, integration with new sensors/robotics, and unsupervised approaches. More about some of these below.
* With some notable exceptions -> Chair: “did you train with stochastic gradient descent?” Speaker: “we trained with caffe”

We're seeing old ideas come back in new architectural forms: new ways of encoding long-thought-about constraints and relations. If one can open an old vision paper and reformulate the proposed pipeline as an end-to-end network, encode constraints and heuristics as appropriate loss functions, and leverage different task knowledge by designing a corresponding training procedure, then a new paper is in the making (e.g. active vision for recognition).
Themes that we are beginning to see more and more of: unsupervised learning, semi-supervised learning, and self-supervised learning (with varying degrees of overlap, depending on how you define them). The main idea being that with the deep and powerful architectures we have now, solving each new problem in an end-to-end fashion would require an Imagenet-scale dataset. Because this is not always possible, transferring knowledge, labels, and classifications across tasks, datasets, and individual frames/images is the sought-after approach.

Video is a popular modality: temporal information can provide a strong supervisory signal for propagating labels across frames or for learning to do object detection from unlabeled video (e.g., Walker et al., Long et al.). Key frames of an action or an event can serve as targets for the rest of the frames. For instance, Zhao et al. perform facial expression recognition using peak facial expressions as a supervisory signal, by matching the internal representations (i.e. network features) of peak and non-peak facial expressions in order to build more robustness and invariance into the recognition pipeline. Similarly, photo sequences or collections provide loose temporal relationships that can be harnessed as a self-supervisory cue for predicting relevant/future photos (e.g, Sigurdsson, Chen & Gupta). As a side note, there is a lot more work on multi-dimensional inputs (3D, video, image sequences/collections) than single-images. Even with single images, there is a lot more temporal processing (e.g., via attention modules, more about this below). In other words, tasks that can be summarized as "image in" -> "single-label prediction out" have pretty much been exhausted.
Language is another powerful supervisory signal: images that share tags or words in their respective descriptions (think also: comments in the context of social media) can be used to train network representations to cluster such images closer together or further apart (e.g., Yang et al.). Some further examples of self-supervision by language include the works of Rohrbach and Lu. Other examples of cues/tasks used as self-supervision to learn useful internal representations for other tasks: co-occurrence, denoising, colorization, sound, egomotion, context, and video. Ways of leveraging existing images, modifying them and then learning the mapping back to the original images can be used as free training data (e.g., colorization, discussed more below, or image scrambling: Noroozi & Favaro).
Works that demonstrate new unsupervised approaches will typically do evaluation in one of the following ways: (i) show that useful intermediate features emerge, by visualizing what neurons learn to fire on (as in Owens', Zhang's work, based on the approach introduced by Zhou et al.), or (ii) show that the learned internal representation provides good initialization for other tasks - i.e. that it is amenable to transfer learning (see Larsson et al. or Zhang's work for more examples). A great example of this self-supervised learning approach is the work by Pinto et al. who showed that a physical robot that grasped, pushed, and poked a whole bunch of objects a whole bunch of times could learn useful visual representations for other tasks. Demonstrating that a learned representation is useful can be done by fixing the network and using computed features to directly cluster/retrieve images, or learning a classifier on top of the computed features for a new task, or using the learned representation only as an initialization while retraining with new data. The latter approach is especially useful if the task for which the network needs to be retrained does not have enough training data for complete end-to-end learning, and the unsupervised approach can bootstrap some of the feature learning.
This also touches on an important trend: we are starting to see more integration with robotics. We are coming back to active vision (e.g., Jayaraman & Grauman). New architectures and compute power are providing us with the capabilities of learning structure from (relatively unstructured) interactions. This area of research will likely see tremendous growth in the next few years.
Deep is coming to a robotics lab near you.
Language continues to be a hot topic. This includes image captioning (and variants, like Zeng's "title generation"), and related tasks like visual question answering - VQA (e.g., Mallya & Lazebnik, Lin & Parikh), referring expressions (e.g., Hu et al., Yu et al.), explanation generation (Hendricks et al.), semantic tagging, and leveraging language as a supervisory cue for other visual recognition tasks (as discussed above). Attention modules are also beginning to pop up more frequently: here, "attention" is used to refer to a modulation of (visual) features - a reweighing of which features, at which spatial locations, are used most at a given timestep (e.g., Zhang et al.). Often, attention modules go hand-in-hand with recurrent neural networks (RNNs, e.g., LSTMs) that can encode temporal relationships. In this case, processing of the visual input at one time step influences processing at the next time step. For instance, captioning systems may "attend" to different image regions in sequence, while generating caption words sequentially. VQA systems may use a similar iterative procedure to refine the location of the image that can provide an answer to the question or aid with localizing a referring expression (e.g. Rohrbach et al., Xu & Saenko).
In general, many more works are using RNNs - and this is because some portion of the input or required output can be interpreted as a sequence: e.g. a sequence of frames, a sequence of images in a collection, or a sequence of words (in the input question or output caption). RNNs have also been shown to provide effective iterative refinement (e.g. Liang et al.). An "attention module" can similarly be used to parse an image or image features as a sequence (e.g. Xiao et al., Peng et al., Ye et al.). What this accomplishes is some simulation of bottom-up combined with top-down reasoning. And by the way, we talked a bit about attention and how it can be used to leverage other vision tasks in our Saturday tutorial.
With regards to image understanding and language, beyond scene recognition and object detection, we are also seeing increasing interest in interaction and relationship detection (e.g. Mallya & Lazebnik, Lu et al., Nagaraja et al.). I also found quite interesting the applications of language to non-natural images - specifically, diagrams (Kembhavi et al., Siegel et al.).
We continue to see interesting innovations in neural network architectures - for instance, alternatives to convolution filters (Liu et al., Danelljan et al.), integration of CRFs with NNs (Arnab et al., Gadde et al., Chandra & Kokkinos), and nice tricks to facilitate training like stochastic depth (Huang et al.), to mention just a few.
Among very specific topics with unproportionally many papers this year: 11 papers on person re-identification, 6 papers on object counting (5 of which use CNNs), 3 papers with colorization applications (ZhangLarsson, Liu), and over 20 papers on segmentation and variations on segmentation (like portrait or scene matting, e.g., Shen et al.). For instance, there were many improvements in semantic segmentation, and some domain-specific (e.g. biomedical) segmentation approaches presented (e.g. Liu et al.).

Interestingly, none of the award-winning papers were about neural networks.

The future of vision conferences?

It is interesting to observe how fast this field evolves, and the impacts this has on researchers, research programs, the publishing pipeline, and the outcome of conferences. In particular, it is now common for papers to be hanging up on arxiv for over half a year before they are presented at a conference. Occasionally this can lead to confusion, with researchers scratching their heads, surprised to stumble upon a particular paper at the conference (hasn't this paper already been published for a while? hasn't it already appeared in the mass media?) By the time the conference rolls around, other researchers may already be familiar with the paper, and may have even innovated on top of it. 

With the speed of innovation, at the same conference you might find both papers that build upon previous architectures to improve their pitfalls, and other papers that completely replace the original architectures with large performance gains. Small improvements are likely to be quickly overstepped by more significant leaps that leave the small improvements quickly forgotten. Lasting work requires qualitatively new approaches. 

It was interesting to see that a number of researchers presented their original published results (from the camera ready version of the paper) alongside new results obtained since, in an attempt to stay current - after all, half a year of additional innovations can change many numbers. Some of these additional innovations are a result of building upon recently-arxived work. Some presenters even explicitly make reference to an extension of the presented work that is already available on arxiv or is published in another venue. 

This might explain some of the proliferation of computer vision research to other conferences. To get innovations out fast enough for them to remain relevant, it might make sense to publish them in the nearest upcoming venue than to wait for the next computer vision conference to roll around. We're seeing related papers pop up in satellite workshops, and other conferences in machine learning, graphics, robotics, and language (take your favorite computer vision researcher and check which venues they've most recently published in). 

It has become common to hear: "This was state of the art at the time of submission... But since then, we have been surpassed by multiple methods".
This leads to an interesting conundrum: arxived work is not peer-reviewed, but creeps into presentations of peer-reviewed work at conferences. This is one way that presented work is made more current and relevant. Is this a symptom of the progress in this field outrunning the current conference structure? In some other fields (physics, biology, neuroscience, etc.), conference presentations are based on submitted abstracts, and publications are disentangled from conferences. However, I don't believe there are precedents of a field moving this fast. This is a difficult question.

But on the topic of modernizing conferences, something needs to be done about the overcrowding situation around posters (especially with attendance growing considerably). It's quite hard to find a spot to stand in front of a poster presenter, within audible distance and without occlusion. Up in the balcony of the Theater Carre, filming the craziness below, I daydreamed of staying comfortably seated while flying a drone to a perfectly-selected location in full view of a desired poster and presenter. Perhaps that kind of swarm behavior could be much more efficiently optimized using some future conference logistics software ;) In the meantime, here's my birds-eye-view: