Due to this unusual set-up, audience questions could not be solicited in the usual manner of lining up in front of a microphone (try climbing over all those people, and out of a balcony). Instead, given a tech crowd, it was expected that technology could easily come to the rescue... the results of which can be summarized by comments made on separate occasions by the respective session chairs:
"Please post your questions on twitter and we will ask them on your behalf [...]
But neither of us have twitter, so we will ask our own questions in this session."
"It seems the community is composed of two groups:
those that have questions, and those that know how to use twitter
- we’re still hoping there will be an intersection at some point."
There was little to complain about otherwise: the venue was quite beautiful; there were many comfortable corners all around the building that were quite favorable to getting some paper reading done; the little baked parmesan palmiers that waiters carried around on trays all throughout the day were impeccable; and the city surrounding the conference was bursting with energy and canals.
Main topics:
During the welcome, the general chairs put up some statistics about the topic areas that represented ECCV this year. The top ones include:
- deep learning
- 3D modeling
- events, actions
- object class detection
- semantic image
- object tracking
- de-blurring
- scene understanding
- image indexing
- face recognition
- segmentation
Topics like sparse coding are going down in paper representation. Acceptance rates by topic are confounded by topic size: smaller topics (e.g. model-based reconstruction, 3D representation, etc.) have a larger relative percentage of accepted papers. Popular reviewer subject areas mostly follow the top topic areas above - specifically: 3D modeling, deep learning, object class detection, events, face recognition, scene understanding, etc.
Summary notes:
My summary notes on the presentations that I attended can be found here (covers ~70% of the oral sessions): https://docs.google.com/document/d/175ORVlLMdjOscJ7-93WIt0bieUiu21vtlL7J-7-7qBI/pub
Some general research trends*:
* disclaimer: very biased by my own interests, observations, and opinions
(which tend to revolve around perception, cognition, attention, and language)
for an objective summary, go instead to the summary notes linked above
Nobody asks anymore: "is this done with CNNs too?", and more and more research is digging into the depths of the not-so-black* box of CNNs. The remaining fruits are now a little higher than they were before, and we are beginning to see more reaching - in the form of innovations in architectures, evaluations, internal representations, transfer learning, integration with new sensors/robotics, and unsupervised approaches. More about some of these below.
* With some notable exceptions -> Chair: “did you train with stochastic gradient descent?” Speaker: “we trained with caffe”
We're seeing old ideas come back in new architectural forms: new ways of encoding long-thought-about constraints and relations. If one can open an old vision paper and reformulate the proposed pipeline as an end-to-end network, encode constraints and heuristics as appropriate loss functions, and leverage different task knowledge by designing a corresponding training procedure, then a new paper is in the making (e.g. active vision for recognition).
http://www.eccv2016.org/files/posters/S-1B-05.pdf
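To make that recipe concrete, here is a minimal sketch (in PyTorch, my choice of framework) of turning a hand-designed constraint into a loss term that can be trained end-to-end. The function name and the spatial-smoothness prior are illustrative stand-ins, not the formulation of any particular paper above:

```python
# Minimal sketch of the "old pipeline as end-to-end network" recipe:
# the supervised task loss is augmented with a differentiable penalty that
# encodes a hand-designed constraint (here, a hypothetical smoothness prior
# standing in for what used to be a CRF/MRF term in the old pipeline).
import torch
import torch.nn.functional as F

def constrained_loss(predictions, targets, lam=0.1):
    """predictions: (N, C, H, W) logits; targets: (N, H, W) class indices."""
    # Standard supervised term.
    task_loss = F.cross_entropy(predictions, targets)
    # Heuristic constraint re-expressed as a loss: penalize label changes
    # between neighboring pixels.
    probs = predictions.softmax(dim=1)
    smoothness = (probs[:, :, 1:, :] - probs[:, :, :-1, :]).abs().mean() + \
                 (probs[:, :, :, 1:] - probs[:, :, :, :-1]).abs().mean()
    return task_loss + lam * smoothness
```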
Video is a popular modality: temporal information can provide a strong supervisory signal for propagating labels across frames or for learning to do object detection from unlabeled video (e.g., Walker et al., Long et al.). Key frames of an action or an event can serve as targets for the rest of the frames. For instance, Zhao et al. perform facial expression recognition using peak facial expressions as a supervisory signal, by matching the internal representations (i.e. network features) of peak and non-peak facial expressions in order to build more robustness and invariance into the recognition pipeline. Similarly, photo sequences or collections provide loose temporal relationships that can be harnessed as a self-supervisory cue for predicting relevant/future photos (e.g., Sigurdsson, Chen & Gupta). As a side note, there is a lot more work on multi-dimensional inputs (3D, video, image sequences/collections) than on single images. Even with single images, there is a lot more temporal processing (e.g., via attention modules, more about this below). In other words, tasks that can be summarized as "image in" -> "single-label prediction out" have pretty much been exhausted.
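As a rough illustration of the peak-piloted idea (not Zhao et al.'s exact formulation - the backbone, classifier, and the choice of a single matched feature layer are assumptions on my part), the matching term can be written as an extra loss between the internal features of peak and non-peak frames:

```python
# Rough sketch: pull the intermediate features of a non-peak frame toward
# those of the peak frame of the same sequence, on top of ordinary
# classification losses for both frames.
import torch
import torch.nn.functional as F

def peak_piloted_loss(backbone, classifier, non_peak, peak, labels, lam=0.5):
    feat_np = backbone(non_peak)              # features of an ordinary frame
    feat_pk = backbone(peak)                  # features of the peak frame
    cls_loss = F.cross_entropy(classifier(feat_np), labels) + \
               F.cross_entropy(classifier(feat_pk), labels)
    # Match internal representations: the peak frame "pilots" the non-peak one,
    # so the matching term does not send gradients through the peak branch.
    match_loss = F.mse_loss(feat_np, feat_pk.detach())
    return cls_loss + lam * match_loss
```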
Language is another powerful supervisory signal: images that share tags or words in their respective descriptions (think also: comments in the context of social media) can be used to train network representations to cluster such images closer together or further apart (e.g., Yang et al.). Some further examples of self-supervision by language include the works of Rohrbach and Lu. Other examples of cues/tasks used as self-supervision to learn useful internal representations for other tasks: co-occurrence, denoising, colorization, sound, egomotion, context, and video. Existing images can also be modified and the mapping back to the originals learned, providing free training data (e.g., colorization, discussed more below, or image scrambling: Noroozi & Favaro).
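A minimal sketch of the tag-sharing idea, assuming a generic embedding network and a standard triplet loss (placeholders for illustration, not the specific setup of Yang et al. or the other works cited):

```python
# Images that share a tag are pulled together in embedding space, images with
# no shared tags are pushed apart - the tags themselves are the only supervision.
import torch
import torch.nn.functional as F

triplet = torch.nn.TripletMarginLoss(margin=0.2)

def tag_supervised_loss(embed_net, anchor_img, same_tag_img, diff_tag_img):
    a = F.normalize(embed_net(anchor_img), dim=1)
    p = F.normalize(embed_net(same_tag_img), dim=1)   # shares a tag with anchor
    n = F.normalize(embed_net(diff_tag_img), dim=1)   # shares no tags with anchor
    return triplet(a, p, n)
```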
- An Uncertain Future: Forecasting from Static Images using Variational Autoencoders, Jacob Walker, Carnegie Mellon University; Carl Doersch, Carnegie Mellon University; Abhinav Gupta; Martial Hebert, Carnegie Mellon University
- Learning Image Matching by Simply Watching Video, Gucan Long, NUDT; Laurent Kneip, Australian National University; Jose M. Alvarez, Data61 / CSIRO; Hongdong Li; Xiaohu Zhang, NUDT; Qifeng Yu, NUDT
- Peak-Piloted Deep Network for Facial Expression Recognition, Xiangyun Zhao, University of California, San Diego; Xiaodan Liang, Sun Yat-sen University; Luoqi Liu, Qihoo/360; Teng Li, Anhui University; Yugang Han, 360 AI Institute; Nuno Vasconcelos; Shuicheng Yan
- Learning Visual Storylines with Skipping Recurrent Neural Networks, Gunnar Sigurdsson, Carnegie Mellon University; Xinlei Chen, CMU; Abhinav Gupta
- Improving Multi-label Learning with Missing Labels by Structured Semantic Correlations, Hao Yang, NTU; Joey Tianyi Zhou, IHPC; Jianfei Cai, NTU
- Grounding of Textual Phrases in Images by Reconstruction, Anna Rohrbach; Marcus Rohrbach, UC Berkeley; Ronghang Hu, UC Berkeley; Trevor Darrell, UC Berkeley; Bernt Schiele
- Visual Relationship Detection with Language Priors, Cewu Lu, Stanford University; Ranjay Krishna, Stanford University; Michael Bernstein, Stanford University; Fei-Fei Li, Stanford University
- Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles, Mehdi Noroozi, University of Bern; Paolo Favaro
http://www.eccv2016.org/files/posters/O-1B-04.pdf
- Ambient sound provides supervision for visual learning, Andrew Owens, MIT; Jiajun Wu, MIT; Josh Mcdermott, MIT; Antonio Torralba, MIT; William Freeman, MIT
- Colorful Image Colorization, Richard Zhang, UC Berkeley; Phillip Isola, MIT; Alexei Efros
- Learning Representations for Automatic Colorization, Gustav Larsson, University of Chicago; Michael Maire, Toyota Technological Institute at Chicago; Greg Shakhnarovich, TTI Chicago, USA
- The Curious Robot: Learning Visual Representations via Physical Interactions, Lerrel Pinto, Carnegie Mellon University; Dhiraj Gandhi; Yuanfeng Han; Yong-Lae Park; Abhinav Gupta
http://www.eccv2016.org/files/posters/O-4B-04.pdf
Deep is coming to a robotics lab near you.
- Look-ahead before you leap: end-to-end active recognition by forecasting the effect of motion, Dinesh Jayaraman, UT Austin; Kristen Grauman, University of Texas at Austin
http://www.eccv2016.org/files/posters/P-3B-17.pdf
In general, many more works are using RNNs - and this is because some portion of the input or required output can be interpreted as a sequence: e.g. a sequence of frames, a sequence of images in a collection, or a sequence of words (in the input question or output caption). RNNs have also been shown to provide effective iterative refinement (e.g. Liang et al.). An "attention module" can similarly be used to parse an image or image features as a sequence (e.g. Xiao et al., Peng et al., Ye et al.). What this accomplishes is some simulation of bottom-up combined with top-down reasoning. And by the way, we talked a bit about attention and how it can be used to leverage other vision tasks in our Saturday tutorial.
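For readers unfamiliar with these attention modules, here is a generic soft spatial-attention sketch (my own simplification, not any particular paper's module): a query vector - e.g. a question encoding or the previous RNN state - scores each location of a CNN feature map, and a softmax-weighted sum of the features is passed on to the rest of the network:

```python
# Generic soft attention over the spatial grid of CNN features,
# treating the H*W locations as a "sequence" to be weighted.
import torch
import torch.nn as nn

class SoftSpatialAttention(nn.Module):
    def __init__(self, feat_dim, query_dim):
        super().__init__()
        self.score = nn.Linear(feat_dim + query_dim, 1)

    def forward(self, feats, query):
        # feats: (N, D, H, W) CNN features; query: (N, Q)
        N, D, H, W = feats.shape
        feats = feats.view(N, D, H * W).transpose(1, 2)           # (N, HW, D)
        q = query.unsqueeze(1).expand(-1, H * W, -1)              # (N, HW, Q)
        scores = self.score(torch.cat([feats, q], dim=2))         # (N, HW, 1)
        weights = torch.softmax(scores, dim=1)
        attended = (weights * feats).sum(dim=1)                   # (N, D)
        return attended, weights.view(N, H, W)
```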
- Title Generation for User Generated Videos, Kuo-Hao Zeng, National Tsing Hua University; Tseng-Hung Chen, National Tsing Hua University; Juan Carlos Niebles, Stanford University; Min Sun, National Tsing Hua University
- Learning Models for Actions and Person-Object Interactions with Transfer to Question Answering, Arun Mallya, UIUC; Svetlana Lazebnik
- Leveraging Visual Question Answering for Image-Caption Ranking, Xiao Lin, Virginia Tech; Devi Parikh, Virginia Tech
- Segmentation from Natural Language Expressions, Ronghang Hu, UC Berkeley; Marcus Rohrbach, UC Berkeley; Trevor Darrell
- Modeling Context in Referring Expressions, Licheng Yu, University of North Carolina; Patrick Poirson; Shang Yang; Alex Berg; Tamara Berg, University of North Carolina
- Generating Visual Explanations, Lisa Anne Hendricks, UC Berkeley; Zeynep Akata; Marcus Rohrbach, UC Berkeley; Jeff Donahue, UC Berkeley; Bernt Schiele; Trevor Darrell
- Top-down Neural Attention by Excitation Backprop, Jianming Zhang; Zhe Lin, Adobe Systems, Inc.; Jonathan Brandt; Xiaohui Shen, Adobe; Stan Sclaroff, Boston University
- Grounding of Textual Phrases in Images by Reconstruction, Anna Rohrbach; Marcus Rohrbach, UC Berkeley; Ronghang Hu, UC Berkeley; Trevor Darrell, UC Berkeley; Bernt Schiele
- Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering, Huijuan Xu, UMass Lowell; Kate Saenko, University of Massachusetts Lowell
With regards to image understanding and language, beyond scene recognition and object detection, we are also seeing increasing interest in interaction and relationship detection (e.g. Mallya & Lazebnik, Lu et al., Nagaraja et al.). I also found the applications of language to non-natural images - specifically, diagrams - quite interesting (Kembhavi et al., Siegel et al.).
- Semantic Object Parsing with Graph LSTM, Xiaodan Liang, Sun Yat-sen University; Xiaohui Shen, Adobe; Jiashi Feng, NUS; Liang Lin, Sun Yat-sen University; Shuicheng Yan, NUS
- Robust Facial Landmark Detection via Recurrent Attentive-Refinement Networks, Shengtao Xiao, National University of Singapore; Jiashi Feng, NUS; Junliang Xing, Chinese Academy of Sciences; Hanjiang Lai, Sun Yat-sen University; Shuicheng Yan, National University of Singapore; Ashraf Kassim, National University of Singapore
- A Recurrent Encoder-Decoder Network for Sequential Face Alignment, Xi Peng, Rutgers University; Rogerio Feris, IBM Research Center, USA; Xiaoyu Wang, Snapchat Research; Dimitris Metaxas, Rutgers University
- Spatial Attention Deep Net with Partial PSO for Hierarchical Hybrid Hand Pose Estimation, Qi Ye; Shanxin Yuan, Imperial College London; Tae-Kyun Kim, Imperial College London
- Learning Models for Actions and Person-Object Interactions with Transfer to Question Answering, Arun Mallya, UIUC; Svetlana Lazebnik
- Visual Relationship Detection with Language Priors, Cewu Lu, Stanford University; Ranjay Krishna, Stanford University; Michael Bernstein, Stanford University; Fei-Fei Li, Stanford University
- Modeling Context Between Objects for Referring Expression Understanding, Varun Nagaraja, University of Maryland; Vlad Morariu, University of Maryland; Larry Davis, University of Maryland
- A Diagram Is Worth A Dozen Images, Aniruddha Kembhavi, AI2; Michael Salvato, Allen Institute for Artificial Intelligence; Eric Kolve, Allen Institute for AI; Minjoon Seo, University of Washington; Hannaneh Hajishirzi, University of Washington; Ali Farhadi, University of Washington
- FigureSeer: Parsing Result-Figures in Research Papers, Noah Siegel; Zachary Horvitz; Roie Levin; Santosh Kumar Divvala, Allen Institute for Artificial Intelligence; Ali Farhadi, University of Washington
We continue to see interesting innovations in neural network architectures - for instance, alternatives to convolution filters (Liu et al., Danelljan et al.), integration of CRFs with NNs (Arnab et al., Gadde et al., Chandra & Kokkinos), and nice tricks to facilitate training like stochastic depth (Huang et al.), to mention just a few.
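Stochastic depth in particular is simple enough to sketch: during training, each residual block is dropped (leaving only the identity shortcut) with some probability, and at test time its output is scaled by its survival probability. The wrapper below is a minimal illustration of that trick, with the residual branch itself left as a placeholder:

```python
# Minimal sketch of the stochastic-depth trick of Huang et al.
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    def __init__(self, block, survival_prob=0.8):
        super().__init__()
        self.block = block      # any residual branch, e.g. conv-bn-relu-conv-bn
        self.p = survival_prob

    def forward(self, x):
        if self.training:
            if torch.rand(1).item() < self.p:
                return x + self.block(x)    # block survives this forward pass
            return x                        # block dropped: identity shortcut only
        return x + self.p * self.block(x)   # expected value at test time
```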
- Learning Recursive Filters for Low-Level Vision via a Hybrid Neural Network, Sifei Liu, UC Merced; Jinshan Pan, UC Merced; Ming-Hsuan Yang, UC Merced
- Beyond Correlation Filters: Learning Continuous Convolution Operators for Visual Tracking, Martin Danelljan, Linköping University; Andreas Robinson, Linköping University; Fahad Khan, Linköping University; Michael Felsberg, Linköping University
- Higher Order Conditional Random Fields in Deep Neural Networks, Anurag Arnab, University of Oxford; Sadeep Jayasumana, University of Oxford; Shuai Zheng, University of Oxford; Philip Torr, Oxford University
- Superpixel Convolutional Networks using Bilateral Inceptions, Raghudeep Gadde, Ecole des Ponts Paris Tech; Varun Jampani, MPI-IS; Martin Kiefel, MPI for Intelligent Systems; Daniel Kappler, MPI Intelligent Systems; Peter Gehler
- Fast, Exact and Multi-Scale Inference for Semantic Image Segmentation with Deep Gaussian CRFs, Siddhartha Chandra, INRIA; Iasonas Kokkinos, INRIA
- Deep Networks with Stochastic Depth, Gao Huang, Cornell University; Yu Sun, Cornell University; Zhuang Liu, Tsinghua University; Daniel Sedra, Cornell University; Kilian Weinberger, Cornell University
http://www.eccv2016.org/files/posters/S-3A-08.pdf
- Colorful Image Colorization, Richard Zhang, UC Berkeley; Phillip Isola, MIT; Alexei Efros
- Learning Representations for Automatic Colorization, Gustav Larsson, University of Chicago; Michael Maire, Toyota Technological Institute at Chicago; Greg Shakhnarovich, TTI Chicago, USA
- Learning Recursive Filters for Low-Level Vision via a Hybrid Neural Network, Sifei Liu, UC Merced; Jinshan Pan, UC Merced; Ming-Hsuan Yang, UC Merced
- SSHMT: Semi-supervised Hierarchical Merge Tree for Electron Microscopy Image Segmentation, Ting Liu, University of Utah; Miaomiao Zhang, MIT; Mehran Javanmardi, University of Utah; Nisha Ramesh, University of Utah; Tolga Tasdizen, University of Utah
- Deep Automatic Portrait Matting, Xiaoyong Shen, CUHK; Xin Tao, CUHK; Hongyun Gao, CUHK; Chao Zhou, ; Jiaya Jia, Chinese University of Hong Kong
Interestingly, none of the award-winning papers were about neural networks.
The future of vision conferences?
It is interesting to observe how fast this field evolves, and the impact this has on researchers, research programs, the publishing pipeline, and the outcome of conferences. In particular, it is now common for papers to sit on arXiv for over half a year before they are presented at a conference. Occasionally this can lead to confusion, with researchers scratching their heads, surprised to stumble upon a particular paper at the conference (hasn't this paper already been published for a while? hasn't it already appeared in the mass media?). By the time the conference rolls around, other researchers may already be familiar with the paper, and may have even innovated on top of it.
With the speed of innovation, at the same conference you might find both papers that build upon previous architectures to address their pitfalls, and papers that completely replace the original architectures with large performance gains. Small improvements are likely to be overtaken by more significant leaps and quickly forgotten. Lasting work requires qualitatively new approaches.
It was interesting to see that a number of researchers presented their original published results (from the camera-ready version of the paper) alongside new results obtained since, in an attempt to stay current - after all, half a year of additional innovation can change many numbers. Some of these additional innovations are a result of building upon recently-arXived work. Some presenters even made explicit reference to an extension of the presented work that was already available on arXiv or published in another venue.
This might explain some of the proliferation of computer vision research into other conferences. To get innovations out fast enough for them to remain relevant, it might make more sense to publish them in the nearest upcoming venue than to wait for the next computer vision conference to roll around. We're seeing related papers pop up in satellite workshops, and in other conferences in machine learning, graphics, robotics, and language (take your favorite computer vision researcher and check which venues they've most recently published in).
It has become common to hear: "This was state of the art at the time of submission... But since then, we have been surpassed by multiple methods".
This leads to an interesting conundrum: arXived work is not peer-reviewed, but creeps into presentations of peer-reviewed work at conferences. This is one way that presented work is made more current and relevant. Is this a symptom of the progress in this field outrunning the current conference structure? In some other fields (physics, biology, neuroscience, etc.), conference presentations are based on submitted abstracts, and publications are disentangled from conferences. However, I don't believe there is precedent for a field moving this fast. This is a difficult question.
But on the topic of modernizing conferences, something needs to be done about the overcrowding around posters (especially with attendance growing considerably). It's quite hard to find a spot to stand in front of a poster presenter, within audible distance and without occlusion. Up in the balcony of the Theater Carré, filming the craziness below, I daydreamed of staying comfortably seated while flying a drone to a perfectly-selected location in full view of a desired poster and presenter. Perhaps that kind of swarm behavior could be much more efficiently optimized using some future conference logistics software ;) In the meantime, here's my bird's-eye view: