Due to this unusual set-up, audience questions could not be solicited in the usual manner of lining up in front of a microphone (try climbing over all those people, and out of a balcony). Instead, given a tech crowd, it was expected that technology could easily come to the rescue... the results of which can be summarized by comments made on separate occasions by the respective session chairs:
"Please post your questions on twitter and we will ask them on your behalf [...]
But neither of us have twitter, so we will ask our own questions in this session."
"It seems the community is composed of two groups:
those that have questions, and those that know how to use twitter
- we’re still hoping there will be an intersection at some point."
There was little to complain about otherwise: the venue was quite beautiful; there were many comfortable corners all around the building that were quite favorable to getting some paper reading done; the little baked parmesan palmiers that waiters carried around on trays all throughout the day were impeccable; and the city surrounding the conference was bursting with energy and canals.
Main topics:
During the welcome, the general chairs put up some statistics about the topic areas that represented ECCV this year. The top ones include:
- deep learning
- 3D modeling
- events, actions
- object class detection
- semantic image
- object tracking
- de-blurring
- scene understanding
- image indexing
- face recognition
- segmentation
Topics like sparse coding are going down in paper representation. Acceptance rates by topic are confounded by topic size: smaller topics (e.g. model-based reconstruction, 3D representation, etc.) have a larger relative percentage of accepted papers. Popular reviewer subject areas mostly follow the top topic areas above - specifically: 3D modeling, deep learning, object class detection, events, face recognition, scene understanding, etc.
Summary notes:
My summary notes on the presentations that I attended can be found here (covers ~70% of the oral sessions): https://docs.google.com/document/d/175ORVlLMdjOscJ7-93WIt0bieUiu21vtlL7J-7-7qBI/pub
Some general research trends*:
* disclaimer: very biased by my own interests, observations, and opinions
(which tend to revolve around perception, cognition, attention, and language)
for an objective summary, go instead to the summary notes linked above
Nobody asks anymore: "is this done with CNNs too?", and more and more research is digging into the depths of the not-so-black* box of CNNs. The remaining fruits are now a little higher than they were before, and we are beginning to see more reaching - in the form of innovations in architectures, evaluations, internal representations, transfer learning, integration with new sensors/robotics, and unsupervised approaches. More about some of these below.
* With some notable exceptions -> Chair: “did you train with stochastic gradient descent?” Speaker: “we trained with caffe”
We're seeing old ideas come back in new architectural forms: new ways of encoding long-thought-about constraints and relations. If one can open an old vision paper and reformulate the proposed pipeline as an end-to-end network, encode constraints and heuristics as appropriate loss functions, and leverage different task knowledge by designing a corresponding training procedure, then a new paper is in the making (e.g. active vision for recognition).
http://www.eccv2016.org/files/posters/S-1B-05.pdf
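To make that recipe concrete, here is a minimal sketch (in PyTorch, my choice of framework) of turning a hand-designed constraint into a loss term that can be trained end-to-end. The function name and the spatial-smoothness prior are illustrative stand-ins, not the formulation of any particular paper above:

```python
# Minimal sketch of the "old pipeline as end-to-end network" recipe:
# the supervised task loss is augmented with a differentiable penalty that
# encodes a hand-designed constraint (here, a hypothetical smoothness prior
# standing in for what used to be a CRF/MRF term in the old pipeline).
import torch
import torch.nn.functional as F

def constrained_loss(predictions, targets, lam=0.1):
    """predictions: (N, C, H, W) logits; targets: (N, H, W) class indices."""
    # Standard supervised term.
    task_loss = F.cross_entropy(predictions, targets)
    # Heuristic constraint re-expressed as a loss: penalize label changes
    # between neighboring pixels.
    probs = predictions.softmax(dim=1)
    smoothness = (probs[:, :, 1:, :] - probs[:, :, :-1, :]).abs().mean() + \
                 (probs[:, :, :, 1:] - probs[:, :, :, :-1]).abs().mean()
    return task_loss + lam * smoothness
```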
Video is a popular modality: temporal information can provide a strong supervisory signal for propagating labels across frames or for learning to do object detection from unlabeled video (e.g., Walker et al., Long et al.). Key frames of an action or an event can serve as targets for the rest of the frames. For instance, Zhao et al. perform facial expression recognition using peak facial expressions as a supervisory signal, by matching the internal representations (i.e. network features) of peak and non-peak facial expressions in order to build more robustness and invariance into the recognition pipeline. Similarly, photo sequences or collections provide loose temporal relationships that can be harnessed as a self-supervisory cue for predicting relevant/future photos (e.g., Sigurdsson, Chen & Gupta). As a side note, there is a lot more work on multi-dimensional inputs (3D, video, image sequences/collections) than on single images. Even with single images, there is a lot more temporal processing (e.g., via attention modules, more about this below). In other words, tasks that can be summarized as "image in" -> "single-label prediction out" have pretty much been exhausted.
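As a rough illustration of the peak-piloted idea (not Zhao et al.'s exact formulation - the backbone, classifier, and the choice of a single matched feature layer are assumptions on my part), the matching term can be written as an extra loss between the internal features of peak and non-peak frames:

```python
# Rough sketch: pull the intermediate features of a non-peak frame toward
# those of the peak frame of the same sequence, on top of ordinary
# classification losses for both frames.
import torch
import torch.nn.functional as F

def peak_piloted_loss(backbone, classifier, non_peak, peak, labels, lam=0.5):
    feat_np = backbone(non_peak)              # features of an ordinary frame
    feat_pk = backbone(peak)                  # features of the peak frame
    cls_loss = F.cross_entropy(classifier(feat_np), labels) + \
               F.cross_entropy(classifier(feat_pk), labels)
    # Match internal representations: the peak frame "pilots" the non-peak one,
    # so the matching term does not send gradients through the peak branch.
    match_loss = F.mse_loss(feat_np, feat_pk.detach())
    return cls_loss + lam * match_loss
```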
Language is another powerful supervisory signal: images that share tags or words in their respective descriptions (think also: comments in the context of social media) can be used to train network representations to cluster such images closer together or further apart (e.g., Yang et al.). Some further examples of self-supervision by language include the works of Rohrbach and Lu. Other examples of cues/tasks used as self-supervision to learn useful internal representations for other tasks: co-occurrence, denoising, colorization, sound, egomotion, context, and video. Existing images can also be modified and the mapping back to the originals learned, providing free training data (e.g., colorization, discussed more below, or image scrambling: Noroozi & Favaro).
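A minimal sketch of the tag-sharing idea, assuming a generic embedding network and a standard triplet loss (placeholders for illustration, not the specific setup of Yang et al. or the other works cited):

```python
# Images that share a tag are pulled together in embedding space, images with
# no shared tags are pushed apart - the tags themselves are the only supervision.
import torch
import torch.nn.functional as F

triplet = torch.nn.TripletMarginLoss(margin=0.2)

def tag_supervised_loss(embed_net, anchor_img, same_tag_img, diff_tag_img):
    a = F.normalize(embed_net(anchor_img), dim=1)
    p = F.normalize(embed_net(same_tag_img), dim=1)   # shares a tag with anchor
    n = F.normalize(embed_net(diff_tag_img), dim=1)   # shares no tags with anchor
    return triplet(a, p, n)
```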
- An Uncertain Future: Forecasting from Static Images using Variational Autoencoders, Jacob Walker, Carnegie Mellon University; Carl Doersch, Carnegie Mellon University; Abhinav Gupta; Martial Hebert, Carnegie Mellon University
- Learning Image Matching by Simply Watching Video, Gucan Long, NUDT; Laurent Kneip, Australian National University; Jose M. Alvarez, Data61 / CSIRO; Hongdong Li; Xiaohu Zhang, NUDT; Qifeng Yu, NUDT
- Peak-Piloted Deep Network for Facial Expression Recognition, Xiangyun Zhao, University of California, San Diego; Xiaodan Liang, Sun Yat-sen University; Luoqi Liu, Qihoo/360; Teng Li, Anhui University; Yugang Han, 360 AI Institute; Nuno Vasconcelos; Shuicheng Yan
- Learning Visual Storylines with Skipping Recurrent Neural Networks, Gunnar Sigurdsson, Carnegie Mellon University; Xinlei Chen, CMU; Abhinav Gupta
- Improving Multi-label Learning with Missing Labels by Structured Semantic Correlations, Hao Yang, NTU; Joey Tianyi Zhou, IHPC; Jianfei Cai, NTU
- Grounding of Textual Phrases in Images by Reconstruction, Anna Rohrbach; Marcus Rohrbach, UC Berkeley; Ronghang Hu, UC Berkeley; Trevor Darrell, UC Berkeley; Bernt Schiele
- Visual Relationship Detection with Language Priors, Cewu Lu, Stanford University; Ranjay Krishna, Stanford University; Michael Bernstein, Stanford University; Fei-Fei Li, Stanford University
- Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles, Mehdi Noroozi, University of Bern; Paolo Favaro
http://www.eccv2016.org/files/posters/O-1B-04.pdf
- Ambient sound provides supervision for visual learning, Andrew Owens, MIT; Jiajun Wu, MIT; Josh Mcdermott, MIT; Antonio Torralba, MIT; William Freeman, MIT
- Colorful Image Colorization, Richard Zhang, UC Berkeley; Phillip Isola, MIT; Alexei Efros
- Learning Representations for Automatic Colorization, Gustav Larsson, University of Chicago; Michael Maire, Toyota Technological Institute at Chicago; Greg Shakhnarovich, TTI Chicago, USA
- The Curious Robot: Learning Visual Representations via Physical Interactions, Lerrel Pinto, Carnegie Mellon University; Dhiraj Gandhi; Yuanfeng Han; Yong-Lae Park; Abhinav Gupta
http://www.eccv2016.org/files/posters/O-4B-04.pdf
Deep is coming to a robotics lab near you.
- Look-ahead before you leap: end-to-end active recognition by forecasting the effect of motion, Dinesh Jayaraman, UT Austin; Kristen Grauman, University of Texas at Austin
http://www.eccv2016.org/files/posters/P-3B-17.pdf
In general, many more works are using RNNs - and this is because some portion of the input or required output can be interpreted as a sequence: e.g. a sequence of frames, a sequence of images in a collection, or a sequence of words (in the input question or output caption). RNNs have also been shown to provide effective iterative refinement (e.g. Liang et al.). An "attention module" can similarly be used to parse an image or image features as a sequence (e.g. Xiao et al., Peng et al., Ye et al.). What this accomplishes is some simulation of bottom-up combined with top-down reasoning. And by the way, we talked a bit about attention and how it can be used to leverage other vision tasks in our Saturday tutorial.
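For readers unfamiliar with these attention modules, here is a generic soft spatial-attention sketch (my own simplification, not any particular paper's module): a query vector - e.g. a question encoding or the previous RNN state - scores each location of a CNN feature map, and a softmax-weighted sum of the features is passed on to the rest of the network:

```python
# Generic soft attention over the spatial grid of CNN features,
# treating the H*W locations as a "sequence" to be weighted.
import torch
import torch.nn as nn

class SoftSpatialAttention(nn.Module):
    def __init__(self, feat_dim, query_dim):
        super().__init__()
        self.score = nn.Linear(feat_dim + query_dim, 1)

    def forward(self, feats, query):
        # feats: (N, D, H, W) CNN features; query: (N, Q)
        N, D, H, W = feats.shape
        feats = feats.view(N, D, H * W).transpose(1, 2)           # (N, HW, D)
        q = query.unsqueeze(1).expand(-1, H * W, -1)              # (N, HW, Q)
        scores = self.score(torch.cat([feats, q], dim=2))         # (N, HW, 1)
        weights = torch.softmax(scores, dim=1)
        attended = (weights * feats).sum(dim=1)                   # (N, D)
        return attended, weights.view(N, H, W)
```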
- Title Generation for User Generated Videos, Kuo-Hao Zeng, National Tsing Hua University; Tseng-Hung Chen, National Tsing Hua University; Juan Carlos Niebles, Stanford University; Min Sun, National Tsing Hua University
- Learning Models for Actions and Person-Object Interactions with Transfer to Question Answering, Arun Mallya, UIUC; Svetlana Lazebnik
- Leveraging Visual Question Answering for Image-Caption Ranking, Xiao Lin, Virginia Tech; Devi Parikh, Virginia Tech
- Segmentation from Natural Language Expressions, Ronghang Hu, UC Berkeley; Marcus Rohrbach, UC Berkeley; Trevor Darrell
- Modeling Context in Referring Expressions, Licheng Yu, University of North Carolina; Patrick Poirson; Shang Yang; Alex Berg; Tamara Berg, University of North Carolina
- Generating Visual Explanations, Lisa Anne Hendricks, UC Berkeley; Zeynep Akata; Marcus Rohrbach, UC Berkeley; Jeff Donahue, UC Berkeley; Bernt Schiele; Trevor Darrell
- Top-down Neural Attention by Excitation Backprop, Jianming Zhang; Zhe Lin, Adobe Systems, Inc.; Jonathan Brandt; Xiaohui Shen, Adobe; Stan Sclaroff, Boston University
- Grounding of Textual Phrases in Images by Reconstruction, Anna Rohrbach; Marcus Rohrbach, UC Berkeley; Ronghang Hu, UC Berkeley; Trevor Darrell, UC Berkeley; Bernt Schiele
- Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering, Huijuan Xu, UMass Lowell; Kate Saenko, University of Massachusetts Lowell
With regards to image understanding and language, beyond scene recognition and object detection, we are also seeing increasing interest in interaction and relationship detection (e.g. Mallya & Lazebnik, Lu et al., Nagaraja et al.). I also found the applications of language to non-natural images - specifically, diagrams - quite interesting (Kembhavi et al., Siegel et al.).
- Semantic Object Parsing with Graph LSTM, Xiaodan Liang, Sun Yat-sen University; Xiaohui Shen, Adobe; Jiashi Feng, NUS; Liang Lin, Sun Yat-sen University; Shuicheng Yan, NUS
- Robust Facial Landmark Detection via Recurrent Attentive-Refinement Networks, Shengtao Xiao, National University of Singapore; Jiashi Feng, NUS; Junliang Xing, Chinese Academy of Sciences; Hanjiang Lai, Sun Yat-sen University; Shuicheng Yan, National University of Singapore; Ashraf Kassim, National University of Singapore
- A Recurrent Encoder-Decoder Network for Sequential Face Alignment, Xi Peng, Rutgers University; Rogerio Feris, IBM Research Center, USA; Xiaoyu Wang, Snapchat Research; Dimitris Metaxas, Rutgers University
- Spatial Attention Deep Net with Partial PSO for Hierarchical Hybrid Hand Pose Estimation, Qi Ye; Shanxin Yuan, Imperial College London; Tae-Kyun Kim, Imperial College London
- Learning Models for Actions and Person-Object Interactions with Transfer to Question Answering, Arun Mallya, UIUC; Svetlana Lazebnik
- Visual Relationship Detection with Language Priors, Cewu Lu, Stanford University; Ranjay Krishna, Stanford University; Michael Bernstein, Stanford University; Fei-Fei Li, Stanford University
- Modeling Context Between Objects for Referring Expression Understanding, Varun Nagaraja, University of Maryland; Vlad Morariu, University of Maryland; Larry Davis, University of Maryland
- A Diagram Is Worth A Dozen Images, Aniruddha Kembhavi, AI2; Michael Salvato, Allen Institute for Artificial Intelligence; Eric Kolve, Allen Institute for AI; Minjoon Seo, University of Washington; Hannaneh Hajishirzi, University of Washington; Ali Farhadi, University of Washington
- FigureSeer: Parsing Result-Figures in Research Papers, Noah Siegel; Zachary Horvitz; Roie Levin; Santosh Kumar Divvala, Allen Institute for Artificial Intelligence; Ali Farhadi, University of Washington
We continue to see interesting innovations in neural network architectures - for instance, alternatives to convolution filters (Liu et al., Danelljan et al.), integration of CRFs with NNs (Arnab et al., Gadde et al., Chandra & Kokkinos), and nice tricks to facilitate training like stochastic depth (Huang et al.), to mention just a few.
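Stochastic depth in particular is simple enough to sketch: during training, each residual block is dropped (leaving only the identity shortcut) with some probability, and at test time its output is scaled by its survival probability. The wrapper below is a minimal illustration of that trick, with the residual branch itself left as a placeholder:

```python
# Minimal sketch of the stochastic-depth trick of Huang et al.
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    def __init__(self, block, survival_prob=0.8):
        super().__init__()
        self.block = block      # any residual branch, e.g. conv-bn-relu-conv-bn
        self.p = survival_prob

    def forward(self, x):
        if self.training:
            if torch.rand(1).item() < self.p:
                return x + self.block(x)    # block survives this forward pass
            return x                        # block dropped: identity shortcut only
        return x + self.p * self.block(x)   # expected value at test time
```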
- Learning Recursive Filters for Low-Level Vision via a Hybrid Neural Network, Sifei Liu, UC Merced; Jinshan Pan, UC Merced; Ming-Hsuan Yang, UC Merced
- Beyond Correlation Filters: Learning Continuous Convolution Operators for Visual Tracking, Martin Danelljan, Linköping University; Andreas Robinson, Linköping University; Fahad Khan, Linköping University; Michael Felsberg, Linköping University
- Higher Order Conditional Random Fields in Deep Neural Networks, Anurag Arnab, University of Oxford; Sadeep Jayasumana, University of Oxford; Shuai Zheng, University of Oxford; Philip Torr, Oxford University
- Superpixel Convolutional Networks using Bilateral Inceptions, Raghudeep Gadde, Ecole des Ponts Paris Tech; Varun Jampani, MPI-IS; Martin Kiefel, MPI for Intelligent Systems; Daniel Kappler, MPI Intelligent Systems; Peter Gehler
- Fast, Exact and Multi-Scale Inference for Semantic Image Segmentation with Deep Gaussian CRFs, Siddhartha Chandra, INRIA; Iasonas Kokkinos, INRIA
- Deep Networks with Stochastic Depth, Gao Huang, Cornell University; Yu Sun, Cornell University; Zhuang Liu, Tsinghua University; Daniel Sedra, Cornell University; Kilian Weinberger, Cornell University
http://www.eccv2016.org/files/posters/S-3A-08.pdf
- Colorful Image Colorization, Richard Zhang, UC Berkeley; Phillip Isola, MIT; Alexei Efros
- Learning Representations for Automatic Colorization, Gustav Larsson, University of Chicago; Michael Maire, Toyota Technological Institute at Chicago; Greg Shakhnarovich, TTI Chicago, USA
- Learning Recursive Filters for Low-Level Vision via a Hybrid Neural Network, Sifei Liu, UC Merced; Jinshan Pan, UC Merced; Ming-Hsuan Yang, UC Merced
- SSHMT: Semi-supervised Hierarchical Merge Tree for Electron Microscopy Image Segmentation, Ting Liu, University of Utah; Miaomiao Zhang, MIT; Mehran Javanmardi, University of Utah; Nisha Ramesh, University of Utah; Tolga Tasdizen, University of Utah
- Deep Automatic Portrait Matting, Xiaoyong Shen, CUHK; Xin Tao, CUHK; Hongyun Gao, CUHK; Chao Zhou, ; Jiaya Jia, Chinese University of Hong Kong
Interestingly, none of the award-winning papers were about neural networks.
The future of vision conferences?
It is interesting to observe how fast this field evolves, and the impact this has on researchers, research programs, the publishing pipeline, and the outcome of conferences. In particular, it is now common for papers to sit on arXiv for over half a year before they are presented at a conference. Occasionally this can lead to confusion, with researchers scratching their heads, surprised to stumble upon a particular paper at the conference (hasn't this paper already been published for a while? hasn't it already appeared in the mass media?). By the time the conference rolls around, other researchers may already be familiar with the paper, and may have even innovated on top of it.
With the speed of innovation, at the same conference you might find both papers that build upon previous architectures to address their pitfalls, and papers that completely replace the original architectures with large performance gains. Small improvements are likely to be overtaken by more significant leaps and quickly forgotten. Lasting work requires qualitatively new approaches.
It was interesting to see that a number of researchers presented their original published results (from the camera-ready version of the paper) alongside new results obtained since, in an attempt to stay current - after all, half a year of additional innovation can change many numbers. Some of these additional innovations are a result of building upon recently-arXived work. Some presenters even made explicit reference to an extension of the presented work that was already available on arXiv or published in another venue.
This might explain some of the proliferation of computer vision research into other conferences. To get innovations out fast enough for them to remain relevant, it might make more sense to publish them in the nearest upcoming venue than to wait for the next computer vision conference to roll around. We're seeing related papers pop up in satellite workshops, and in other conferences in machine learning, graphics, robotics, and language (take your favorite computer vision researcher and check which venues they've most recently published in).
It has become common to hear: "This was state of the art at the time of submission... But since then, we have been surpassed by multiple methods".
This leads to an interesting conundrum: arXived work is not peer-reviewed, but creeps into presentations of peer-reviewed work at conferences. This is one way that presented work is made more current and relevant. Is this a symptom of the progress in this field outrunning the current conference structure? In some other fields (physics, biology, neuroscience, etc.), conference presentations are based on submitted abstracts, and publications are disentangled from conferences. However, I don't believe there is precedent for a field moving this fast. This is a difficult question.
But on the topic of modernizing conferences, something needs to be done about the overcrowding around posters (especially with attendance growing considerably). It's quite hard to find a spot to stand in front of a poster presenter, within audible distance and without occlusion. Up in the balcony of the Theater Carré, filming the craziness below, I daydreamed of staying comfortably seated while flying a drone to a perfectly-selected location in full view of a desired poster and presenter. Perhaps that kind of swarm behavior could be much more efficiently optimized using some future conference logistics software ;) In the meantime, here's my bird's-eye view: