Sunday, 17 January 2016

Pulling together the efforts of neuroscientists and computer scientists


Had a great time last Friday at a symposium at the intersection of Neuroscience and Computer Science: http://cbmm.mit.edu/science-engineering-vassar
Heard from six MIT superstars: Bill Freeman, Joshua Tenenbaum, Ed Boyden, Nancy Kanwisher, Feng Zhang, and Tomaso Poggio. Learned how neuronal tissue can be physically expanded for magnification and imaging, how CRISPR may bring us neurological therapeutics, and about the specialized brain areas we may have for visual words and music. But those are exactly the topics I will not be covering here.
Intrigued to learn more anyway?
Here are some related articles:
http://news.mit.edu/2014/crispr-technique-determines-gene-function-1210
http://news.mit.edu/2015/faculty-profile-edward-boyden-0522
http://news.mit.edu/2015/neural-population-music-brain-1216

In this post, I will focus instead on the common threads running through the work of Bill Freeman, Joshua Tenenbaum, Josh McDermott, and Nancy Kanwisher - both to highlight the great interdisciplinary collaborations happening at MIT, and to give a broader sense of how neuroscience and computer science are informing each other and leading to cool new insights and innovations.

Bill Freeman presented the work spearheaded by his graduate student Andrew Owens: "Visually indicated sounds". Teaming up with Josh McDermott, who studies computational audition at the MIT Department of Brain and Cognitive Sciences, they linked sound to material properties and vice versa. Given a silent video as input (of a wooden stick hitting or scratching some surface), the team developed an algorithm that synthesizes realistic sound to go along with it. To do so, they needed to convert videos of different scenes (with a mixture of materials) into some perceptually-meaningful space, and link them to sounds that were also represented in some perceptually-meaningful way. What does "perceptually-meaningful" refer to? The goal is to transform the complex mess that is colored pixels and audio waveforms into stable representations that allow similar materials to be matched together and associated with the same material properties. For instance, pictures (and videos) of different foliage will look very different from each other (the shape and the color may have almost no pixel-overlap), and yet, somehow, the algorithm needs to discover the underlying material similarity.

This is one place where CNNs (convolutional neural nets) have been successful: transforming a set of pixels into a semantic representation (enough to perform scene recognition, object detection, and the other high-level computer vision tasks that the academic and industrial communities have recently flooded the media with). CNNs can learn almost human-like associations between images and semantics (like labels), or between images and other images. Owens and colleagues used CNNs to represent their silent video frames.
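
To make this concrete, here is a minimal sketch of using a pretrained CNN as a frame-feature extractor. The specific network (ResNet-18 via torchvision) and the preprocessing are my own illustrative choices, not the ones used in the "Visually indicated sounds" paper - the point is just that the penultimate layer of a trained CNN gives a stable, semantically-meaningful description of a frame.

```python
# Sketch: a pretrained CNN as a feature extractor for video frames.
# The network and preprocessing below are illustrative choices, not the
# exact ones used in the "Visually indicated sounds" paper.
import torch
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image

# Load a pretrained CNN and drop its classification layer, keeping the
# penultimate activations as a "perceptually meaningful" representation.
cnn = models.resnet18(pretrained=True)
cnn.fc = torch.nn.Identity()   # output: 512-dimensional feature vector
cnn.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def frame_features(frame: Image.Image) -> torch.Tensor:
    """Map one video frame (a PIL image) to a CNN feature vector."""
    with torch.no_grad():
        x = preprocess(frame).unsqueeze(0)   # shape: (1, 3, 224, 224)
        return cnn(x).squeeze(0)             # shape: (512,)
```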

On the sound side of things, waveforms were converted into "cochleagrams" - stable representations of sound that allow waveforms coming from similar sources (e.g. materials, objects) to be associated with each other even if individual timestamps of the waveforms have almost no overlap. Now to go from silent video frames to synthesized sounds, RNNs (recurrent neural nets) were used (RNNs are great for representing and learning sequences, by keeping around information from previous timesteps to make predictions for successive timesteps). The cochleagrams predicted by the RNNs could then be transformed back into sound, the final output of the algorithm. More details in their paper.
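
To sketch the frames-to-sound step, here is a hedged toy version of an RNN that maps a sequence of per-frame CNN features to a sequence of cochleagram frames. The dimensions, layer choices, and number of frequency bands are assumptions for illustration (the published model differs), and the final cochleagram-to-waveform inversion is omitted.

```python
# Sketch: an RNN mapping per-frame CNN features to cochleagram frames,
# in the spirit of the frames -> sound pipeline. Sizes and architecture
# are assumptions for illustration, not the published model.
import torch
import torch.nn as nn

class FramesToCochleagram(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=256, n_bands=42):
        super().__init__()
        # The LSTM carries information from earlier frames forward,
        # which is what lets it predict a coherent sound sequence.
        self.rnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.to_bands = nn.Linear(hidden_dim, n_bands)

    def forward(self, frame_feats):
        # frame_feats: (batch, n_frames, feat_dim)
        h, _ = self.rnn(frame_feats)
        # One cochleagram frame (n_bands subband envelopes) per video frame.
        return self.to_bands(h)              # (batch, n_frames, n_bands)

model = FramesToCochleagram()
video = torch.randn(1, 90, 512)              # ~3 s of frame features at 30 fps
cochleagram = model(video)                   # predicted subband envelopes
# A separate inversion step (not shown) would turn the predicted
# cochleagram back into an audible waveform.
```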

This work is a great example of the creative new problems that computer vision researchers are tackling. With the powerful representational architectures that deep neural networks provide, higher and higher-level tasks can be achieved - tasks that we would typically associate with human imagination and creativity (e.g. inferring what sound is emitted by some object, what lies beyond the video frame, what is likely to happen next, etc.). In turn, these powerful architectures are interesting from a cognitive science perspective as well: how do these artificial neural networks represent different features, images, and inputs? What kinds of associations, correlations, and relationships do they learn from unstructured visual data? Do they learn to meaningfully associate semantically-related concepts? Cognitive scientists can give computer scientists some ideas about which representations may be reasonable for different tasks, given what is known from decades of experiments on the human brain. The other side of the story is that computer scientists can prod these artificial networks to learn which representational choices the networks have converged on, and cognitive scientists can then design experiments to check whether the networks in the human brain do the same. This allows a wide space of hypotheses to be explored at low cost (no poking of human brains required), narrowing the question for cognitive scientists down to whether the human brain has converged on similar representations (or, if not, whether it could be more optimal).

Nancy Kanwisher mentioned how advances in deep neural networks are helping to understand functional representation in the brain. Kanwisher has done pioneering work on functional specialization in the brain (which brain areas are responsible for which of our capabilities) - including discovering the fusiform face area (FFA). In her talk, she discussed how the "Principle of Modular Design" (Marr, 1982) just makes sense - it is more efficient. She mentioned some examples of work from MIT showing there are specialized areas for faces, language, visual words, even theory of mind. By giving human participants different tasks to do and scanning their brains (using fMRI), neuroscientists can test hypotheses about the function of different brain regions (they check whether the signal in those regions changes significantly across tasks). Some experiments, for instance, have demonstrated that certain language-specific areas of the brain are not involved during logic tasks, arithmetic, or music (tasks that are sometimes hypothesized to depend on language). Experiments have shown that the brain's specialization is not all natural selection, and that specialized brain areas can develop as a child learns. Other experiments (with Josh McDermott) have shown that uniquely-human brain regions exist, like ones selective to music and human speech (but not other sounds). Still other experiments probe causality: what happens if specific brain regions are stimulated or dysfunctional? How are the respective functions affected or impaired? Interestingly, stimulating the FFA using electrodes can cause people's perception of faces to distort. Correspondingly, stimulating other areas of the brain using TMS can cause moral judgements to shift.
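
To give a flavor of the logic behind such a contrast, here is a toy sketch with made-up numbers - not a real fMRI pipeline, which would involve a GLM and hemodynamic modeling - of asking whether a candidate region responds more during one task than another.

```python
# Toy illustration of a functional-localizer contrast: does the average
# signal in a candidate region differ between two tasks? Real fMRI analyses
# fit a GLM with hemodynamic modeling; this only shows the hypothesis-testing
# idea on simulated numbers.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated mean region-of-interest signal on each trial of two conditions,
# e.g. "sentences" vs. "arithmetic" for a putative language-selective region.
roi_signal_language_task = rng.normal(loc=1.0, scale=0.5, size=30)
roi_signal_arithmetic_task = rng.normal(loc=0.2, scale=0.5, size=30)

t, p = stats.ttest_ind(roi_signal_language_task, roi_signal_arithmetic_task)
print(f"t = {t:.2f}, p = {p:.3g}")
# A strong response to language but not arithmetic is the kind of result
# that argues the region is language-specific rather than shared.
```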

Kanwisher is now working with Josh Tenenbaum to look for areas of the brain that might be responsible for intuitive physical inference. Initial findings are showing that the regions activated during intuitive physics reasoning are the same ones responsible for action planning and motor control. Knowing how various functional areas are laid out in the brain, how they communicate with each other, and which resources they pool together, can help provide insights for new artificial neural architectures. Conversely, artificial neural architectures can help us support or cast doubt on neuroscience hypotheses by replicating human performance on tasks using different architectures (not just the ones hypothesized).

Josh Tenenbaum is working on artificial architectures that can make the same inferences humans make, but also make the same mistakes (for instance, Facebook's AI that reasons about the stability of towers of blocks makes different incorrect predictions than humans do). The best CNNs today are great at the tasks for which they are trained, sometimes even outperforming humans, but often also making very different mistakes. Why is it not enough to just get right what humans get right, without also having to get wrong what they get wrong? The mistakes humans make are often indicative of the types of broad inferences they are capable of, and reveal the generalizing power of the human mind. This is why one-shot learning is possible: humans can learn whole new concepts from a single example (and Tenenbaum has many demos to prove it). This is why we can explain, imagine, problem-solve, and plan. Tenenbaum says: "intelligence is not just pattern recognition. It is about modeling the world", and by this he means "analysis by synthesis".

Tenenbaum wants to re-engineer "the game engine in your head". His group is working on probabilistic programs that support causal inference. For example, their algorithm can successfully infer the parameters of a face (shape and texture; the layout and type of facial features), as well as the lighting and viewing angle used for the picture. The algorithm does this by sampling from a generative model that iteratively creates and refines new faces, and then matching the result to the target face. Once a match is found, the parameters chosen for the synthesized face can be considered a good approximation of the parameters of the target face. Interestingly, this model, given the same tasks as humans (e.g. determining whether two faces are the same or different), takes similar amounts of time (scaling with task difficulty) and makes similar mistakes. This is a good hint that the human brain might be engaged in a similar simulation/synthesis process during recognition tasks.
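
Here is a toy stand-in for that analysis-by-synthesis loop - not Tenenbaum's actual probabilistic program. A tiny "renderer" generates a 1-D intensity profile from a few parameters (standing in for the face-rendering model), and a simple Metropolis-style sampler proposes and refines parameters until the synthesized output matches the observation; the accepted parameters are then the inferred explanation of the scene.

```python
# Toy analysis-by-synthesis: infer scene parameters by repeatedly
# synthesizing candidates from a generative model and comparing them
# to the observation. The "renderer" is a 1-D Gaussian blob; the
# inference loop is a simple Metropolis-style sampler.
import numpy as np

rng = np.random.default_rng(1)
xs = np.linspace(0, 1, 200)

def render(center, width, brightness):
    """Generative model: parameters -> 'image' (a 1-D intensity profile)."""
    return brightness * np.exp(-((xs - center) ** 2) / (2 * width ** 2))

# Observation generated with hidden parameters (plus noise).
true_params = (0.62, 0.08, 1.3)
observed = render(*true_params) + rng.normal(0, 0.05, size=xs.shape)

def mismatch(params):
    """How badly a synthesized candidate explains the observation."""
    return np.sum((render(*params) - observed) ** 2)

# Start from a guess and propose small random refinements, keeping
# proposals that explain the observation better (occasionally accepting
# worse ones to escape local minima).
params = np.array([0.5, 0.1, 1.0])
score = mismatch(params)
for _ in range(5000):
    proposal = params + rng.normal(0, 0.02, size=3)
    new_score = mismatch(proposal)
    if new_score < score or rng.random() < np.exp(score - new_score):
        params, score = proposal, new_score

print("inferred:", np.round(params, 3), " true:", true_params)
```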

Tenenbaum and colleagues have made great strides in showing how "analysis by synthesis" can achieve state-of-the-art performance on difficult tasks like face recognition, pose estimation, and character identification (even passing a visual Turing test). As with much of current neural network research, the original inspiration comes from the 80s and 90s. In particular, Hinton's Helmholtz Machine had a wake-sleep cycle in which recognition tasks were interspersed with a type of self-reinforcement (during "sleep") that helped the model learn on its own, even when not given new input. This approach helps the model gain representational power, and might also offer some clues about human intelligence (what do we do when we sleep?).
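
For the curious, here is a minimal wake-sleep sketch in the spirit of the Helmholtz Machine (Hinton, Dayan, Frey & Neal, 1995). The dataset, layer sizes, and learning rate are toy choices of mine; the point is just the alternation between a "wake" phase that fits the generative weights to recognized codes of real data and a "sleep" phase that fits the recognition weights to the model's own fantasies.

```python
# Minimal Helmholtz machine trained with the wake-sleep algorithm,
# with one layer of binary hidden units. Toy data and sizes; the point
# is the wake/sleep alternation, not a faithful reimplementation.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
sample = lambda p: (rng.random(p.shape) < p).astype(float)

n_visible, n_hidden, lr = 8, 3, 0.05

# Recognition weights (data -> code) and generative weights (code -> data).
R, b_r = np.zeros((n_hidden, n_visible)), np.zeros(n_hidden)
G, b_g = np.zeros((n_visible, n_hidden)), np.zeros(n_visible)
prior = np.zeros(n_hidden)                     # generative bias on the code

# Toy dataset: noisy copies of two binary prototypes.
protos = np.array([[1,1,1,1,0,0,0,0], [0,0,0,0,1,1,1,1]], dtype=float)
data = np.abs(protos[rng.integers(0, 2, 500)] - (rng.random((500, 8)) < 0.05))

for v in data:
    # --- Wake phase: recognize real data, then improve the generator. ---
    h = sample(sigmoid(R @ v + b_r))
    p_h, p_v = sigmoid(prior), sigmoid(G @ h + b_g)
    prior += lr * (h - p_h)
    G += lr * np.outer(v - p_v, h)
    b_g += lr * (v - p_v)

    # --- Sleep phase: dream up data, then improve the recognizer. ---
    h_dream = sample(sigmoid(prior))
    v_dream = sample(sigmoid(G @ h_dream + b_g))
    q_h = sigmoid(R @ v_dream + b_r)
    R += lr * np.outer(h_dream - q_h, v_dream)
    b_r += lr * (h_dream - q_h)

# With training, the recognizer's codes for the two prototypes tend to diverge.
print(sigmoid(R @ protos[0] + b_r).round(2), sigmoid(R @ protos[1] + b_r).round(2))
```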

How does the human mind make the inferences it does? How does it jump to its conclusions? How does it transfer knowledge gained on one task and apply it to a novel one? How does it learn abstract concepts? How does it learn from a single example? How does the human mind represent the world around it, and what physical structures are in place in order to accomplish this? How is the brain wired? These questions are driving all of the research described here and will continue to pull together the efforts of neuroscientists and computer scientists in the coming years more than ever before. Our new and ever-developing tools for constructing artificial systems and probing into natural ones are establishing more and more points of contact between fields. Symposia such as these can give one a small hint of what the tip of the iceberg might look like.
