
Monday, August 23, 2010

Beyond pixel-wise labeling: Blocks World Revisited

"Thoughts without content are empty, intuitions without concepts are blind." -- Immanuel Kant 

The Holy Grail problem of computer vision research is general-purpose image understanding.  Given as input a digital image (perhaps from Flickr or from Google Image search), we want to recognize the depicted objects (cars, dogs, sheep, Macbook Pros), their functional properties (which of the depicted objects are suitable for sitting), and recover the underlying geometry and spatial relations (which objects are lying on the desk). 

The early days of vision were dominated by the "Image Understanding as Inverse Optics" mentality.  In order to make the problem easier, and to cope with the meager computational resources of the 1960s, early computer vision researchers tried to recover the 3D geometry of simple scenes consisting of arrangements of blocks.  One of the earliest efforts in this direction is the PhD thesis Machine Perception of Three-Dimensional Solids by Larry Roberts at MIT, back in 1963.

But wait -- these block-worlds are unlike anything found in the real world!  The drastic divide between the imagery that vision researchers were studying in the 60s and what humans observe during their daily experiences ultimately led to the disappearance of block-worlds in computer vision research.

[Image: image parsing concept, from the Computer Blindness blog]

Over the past couple of decades we have seen the success of Machine Learning, and it is no surprise that we are currently living in the "Image Understanding as statistical inference" era.  While a single 256x256 grayscale image might have been okay to work with in the 1960s, today's computer vision researchers use powerful computer clusters and do serious heavy lifting on millions of real-world megapixel images.  The man-made blocks-world of the 1960s is a thing of the past, and the variety found in random images downloaded from Flickr is the complexity we must now cope with.

While the style of computer vision research has shifted since its early days in the 1960s and 1970s, many old ideas (perhaps prematurely deemed outdated) are making a comeback!

Assigning basic-level object category labels to pixels is a very popular theme in vision.  Unfortunately, to gain a deeper understanding of an image, robots will inevitably have to go beyond pixel-level class labels.  (This is one of the central themes in my thesis -- coming out soon!)  Given human-level understanding of a scene, it is trivial to represent it as a pixel-wise labeling map, but given a pixel-wise labeling map it is not trivial to convert it to human-level understanding. 

What sort of questions can be answered about a scene when the output of an "image understanding" system is represented as a pixel-wise label map?

1. Is there a car in the image?
2. Is there a person at this location in the image?

What questions cannot be answered given a pixel-wise label map?

1. How many cars are in this image? (While some approaches strive to delineate object instance boundaries, most image parsing approaches fail to recognize the boundary between two instances of the same category -- see the sketch after this list.)
2. Which surfaces can I sit on?
3. Where can I park my car?
4. How geometrically stable are the objects in the scene?
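
To make the counting failure concrete, here is a minimal Python sketch (the toy label map and category ids are invented for illustration): a pixel-wise label map answers presence queries trivially, but grouping same-category pixels into connected components merges two touching cars into one region.

```python
import numpy as np
from scipy import ndimage

# Toy 6x8 pixel-wise label map: 0 = background, 1 = "car".
# Two cars are parked bumper-to-bumper, so their pixel regions touch.
label_map = np.zeros((6, 8), dtype=int)
label_map[2:5, 1:4] = 1   # car instance A
label_map[2:5, 4:7] = 1   # car instance B, adjacent to A

# "Is there a car in the image?" -- trivially answerable from the map.
print("car present:", bool((label_map == 1).any()))   # True

# "How many cars are in the image?" -- the best we can do with category
# labels alone is count connected regions of car pixels...
_, num_regions = ndimage.label(label_map == 1)
print("connected car regions:", num_regions)          # 1, not 2
# ...and the two touching instances collapse into a single region,
# so the instance count is unrecoverable from the label map alone.
```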


While I have more criticisms than tentative solutions, I believe that vision students shouldn't be parochially preoccupied with only the most recent approaches to image understanding.  It is valuable to go back several decades in the literature and gain a broader perspective on the problem.  That said, some progress is being made!  A deeply insightful upcoming paper from ECCV 2010 is the following:

Abhinav Gupta, Alexei A. Efros and Martial Hebert, Blocks World Revisited: Image Understanding Using Qualitative Geometry and Mechanics, European Conference on Computer Vision, 2010. (PDF)

What Abhinav Gupta does very elegantly in this paper is connect the blocks-world research of the 1960s with the geometric-class estimation problem introduced by Derek Hoiem.  While the final system is evaluated on a Hoiem-like pixel-wise labeling task, the actual scene representation is 3D.  The blocks in this approach are more abstract than the Lego-like volumes of the 1960s -- Abhinav's blocks are actually cars, buildings, and trees. I included the famous Immanuel Kant quote because I feel it describes Abhinav's work very well.  Abhinav introduces the block as a theoretical construct that glues together a scene's elements and provides a much more solid interpretation -- his blocks add the content to geometric image understanding that is lacking in purely pixel-wise approaches.
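
As a rough illustration of the representational difference (these toy structures are my own, not the paper's), here is a sketch of what a block-based scene description might carry beyond a flat per-pixel category map: a coarse 3D geometric class per block plus explicit support relations.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical toy structures, not the representation from the ECCV paper:
# each "block" carries a coarse geometric class and which blocks support it.
@dataclass
class Block:
    name: str
    geometric_class: str                        # e.g. "frontal", "left-right"
    supported_by: List[str] = field(default_factory=list)

ground   = Block("ground",   "support surface")
building = Block("building", "left-right facing", supported_by=["ground"])
car      = Block("car",      "frontal",           supported_by=["ground"])
scene = [ground, building, car]

# With explicit blocks and support links, a question a label map cannot
# answer ("what rests on the ground?") becomes a one-line query:
print([b.name for b in scene if "ground" in b.supported_by])
# ['building', 'car']
```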

While integrating large-scale categorization into this type of geometric reasoning is still an open problem, Abhinav provides us visionaries with a glimpse of what image understanding should be.  The integration of robotics with image understanding technology will surely drive pixel-based "dumb" image understanding approaches to extinction.

Friday, March 05, 2010

Representation and Use of Knowledge in Vision: Barrow and Tenenbaum's Conclusion

To gain a better perspective on my research regarding the Visual Memex, I spent some time reading Object Categorization: Computer and Human Vision Perspectives, which contains many lovely essays on Computer Vision. The book collects recently written essays by titans of the field and a great deal of lessons learned from history. While such a 'looking back' on vision makes for a good read, it is also worthwhile to find old works 'looking forward,' anticipating the successes and failures of the upcoming generations.

In this 'looking forward' fashion, I want to share a passage regarding image understanding systems from "Representation and Use of Knowledge in Vision" by H. G. Barrow and J. M. Tenenbaum, July 1975. This is a short paper worth reading for both graduate students and professors interested in pushing Computer Vision research to its limits. I enjoyed the succinct and motivational ending so much that it is worth repeating verbatim:

--------

III Conclusion

We conclude by reiterating some of the major premises underlying this paper:

The more knowledge the better.
The more data, the better.

Vision is a gigantic optimization problem.

Segmentation is low-level interpretation using general knowledge.

Knowledge is incrementally acquired.

Research should pursue Truth, not Efficiency.


A further decade will determine our skill as visionaries.

-------------

Monday, January 18, 2010

Understanding versus Interpretation -- a philosophical distinction

Today I want to bring up an interesting discussion regarding the connotation of the word "understanding" versus "interpretation," particularly in the context of "scene understanding" versus "scene interpretation." While many vision researchers use these terms interchangeably, I think it is worthwhile to make the distinction, albeit a philosophical one.

On Understanding
While everybody knows that the goal of computer vision is to recognize all of the objects in an image, there is plenty of disagreement about how to represent objects and recognize them in the image. There is a physicalist account (from Wikipedia: Physicalism is a philosophical position holding that everything which exists is no more extensive than its physical properties), where the goal of vision is to reconstruct veridical properties of the world. This view is consistent with the realist stance in philosophy (think back to Philosophy 101) -- there exists a single observer-independent 'ground-truth' regarding the identities of all of the objects contained in the world. The notion of vision as measurement is very strong under this physicalist account. The stuff of the world is out there just waiting to be grasped! I think the term "understanding" fits very well into this truth-driven account of computer vision.

On Interpretation
The second view, a postmodern and anti-realist one, is of vision as a way of interpreting scenes. The shift is from veridical recovery of the world's properties from an image (measurement) to the observer-dependent interpretation of the input stimulus. Under this account, there is no need to believe in a god's-eye 'objective' view of the world. Image interpretation is the registration of an input image with a vast network of past experience, both visual and abstract. The same person can vary their own interpretation of an input as time passes and their internal knowledge base evolves. Under this view, two distinct robots could provide very useful yet distinct 'image interpretations' of the same input image. The main idea is that different robots could have different interpretation-spaces; that is, they could obtain incommensurable (yet very useful!) interpretations of the same image.

It has been argued by Donald Hoffman (Interface Theory of Perception) that there is no reason to expect evolution to have driven humans towards veridical perception. In fact, Hoffman argues that nature drives veridical perception towards extinction, and that it only makes sense to speak of perception as guiding agents towards pragmatic interpretations of their environment.

In the philosophy of science, there is a debate over whether physics is uncovering some ultimate truth about the world or merely painting a coherent and pragmatic picture of it. I've always viewed science as an art and I embrace my anti-realist stance -- one shaped by Thomas Kuhn, William James, and many others. While my scientific interests have currently coalesced around computer vision, it is no surprise that I'm finding conceptual agreement between my philosophy of science and my concrete research efforts in object recognition.

Friday, August 14, 2009

Bay Area Vision Meeting (BAVM 2009): Image and Video Understanding

Tomorrow (Friday) afternoon is BAVM 2009, a Bay Area workshop on Image and Video Understanding, which will be held at Stanford this year. It is being organized by Juan Carlos Niebles, one of Fei-Fei Li's students, and I will be there representing CMU. I have a poster about some new research, and getting feedback is always good, but I'm really excited about meeting some of the other graduate students who work on image understanding. The Berkeley group has been pushing segmentation-driven image understanding hard, so seeing what they're up to should be interesting. There will also be many fellow Googlers and researchers from companies in the Bay Area, so it will be a good place to network.


I look forward to hearing the invited speakers and seeing the bleeding-edge stuff during the poster sessions. I'll try to blog a little bit about some of the coolest stuff I encounter when I get back.

Thursday, March 26, 2009

Beyond Categorization: Getting Away From Object Categories in Computer Vision

Natural language evolved over thousands of years to become the powerful tool that it is today. When we use language to convey our experiences of the world, we can't help but refer to object categories. When we say something such as "this is a car," what we are actually saying is "this is an instance from the car category." Categories let us get away from referring to individual object instances -- in most cases, knowing that something belongs to a particular category is more than enough knowledge to deal with it. This is a type of "understanding by compression," or understanding by abstracting away the unnecessary details. In the words of Rosch, "the task of category systems is to provide maximum information with the least cognitive effort." Rosch would probably agree that it only makes sense to talk about the utility of a category system (as a tool for getting a grip on reality) as opposed to the truth value of a category system with respect to how well it aligns with observer-independent reality. The degree of pragmatism expressed by Rosch is something that William James would have been proud of.

From a very young age we are taught language, and soon it takes over our inner world. We 'think' in language. Language provides us with a list of nouns -- a way of cutting up the world into categories. Different cultures have different languages that cut up the world differently, and one might wonder how well the object categories contained in any single language correspond to reality -- if it even makes sense to talk about an observer-independent reality. Rosch would argue that human categorization is the result of "psychological principles of categorization" and is more related to how we interact with the world than to how the world is. If the only substances we ingested for nutrients were types of grass, then categorizing all of the different strains of grass with respect to flavor, vitamin content, color, etc. would be beneficial for us (as a species). Rosch points out in her work that her ideas refer to categorization at the species level, which she calls human categorization. She is not referring to a personal categorization; for example, the way a child might cluster concepts when he or she starts learning about the world.

It is not at all clear to me whether we should be using the categories from natural language as the to-be-recognized entities in our image understanding systems. Many animals do not have a language with which they can compress percepts into neat little tokens -- yet they have no problem interacting with the world. Of course, if we want to build machines that understand the world around them in a way that they can communicate with us (humans), then language and its inherent categorization will play a crucial role.

While we ultimately use language to convey our ideas to other humans, how early are the principles of categorization applied to perception? Is the grouping of percepts into categories even essential for perception? I doubt that anybody would argue that language and its inherent categorization is not useful for dealing with the world -- the only question is how it interacts with perception.

Most computer vision researchers are stuck in the world of categorization, and many systems rely on categorization at a very early stage. A problem with categorization is its inability to deal with novel categories -- something humans must cope with from a very young age. We (humans) can often deal with arbitrary input and, using analogies, still get a grip on the world around us (even when it is full of novel categories). One hypothesis is that at the level of visual perception, things do not get assigned to discrete object classes but are instead placed in a continuous recognition space. Thus, instead of asking the question "What is this?" we focus on similarity measurements and ask "What is this like?" Such a comparison-based view would help us cope with novel concepts.
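
Here is a minimal sketch of this comparison-based view (the features and exemplar annotations are invented placeholders): instead of forcing an input into one of a fixed list of categories, we retrieve its nearest stored exemplars and let their associations stand in as the interpretation.

```python
import numpy as np

# Toy exemplar memory: feature vectors of previously seen objects,
# each paired with whatever we happened to know about that exemplar.
exemplar_features = np.array([[0.9, 0.1],    # a car-like thing
                              [0.8, 0.2],    # another car-like thing
                              [0.1, 0.9]])   # a chair-like thing
exemplar_notes = ["drivable", "drivable", "sittable"]

def what_is_this_like(query, k=2):
    """Answer 'What is this like?' by returning the k nearest exemplars."""
    dists = np.linalg.norm(exemplar_features - query, axis=1)
    nearest = np.argsort(dists)[:k]
    return [(exemplar_notes[i], round(float(dists[i]), 3)) for i in nearest]

# A novel input gets interpreted by association with past experience,
# without ever committing to a discrete category label.
print(what_is_this_like(np.array([0.85, 0.15])))
# [('drivable', 0.071), ('drivable', 0.071)]
```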

Thursday, February 12, 2009

Context is the 'glue' that binds objects in coherent scenes.

This is not my own quote. It is one of my favorites from Moshe Bar. It comes from his paper "Visual Objects in Context."

I have been recently giving context (in the domain of scene understanding) some heavy thought.
While Bar's paper is good, the one I wanted to focus on goes back to the 1980s. According to Bar, the following paper (which I wish everybody would at least skim) is "a seminal study that characterizes the rules that govern a scene's structure and their influence on perception."

Biederman, I., Mezzanotte, R. J. & Rabinowitz, J. C. Scene perception: detecting and judging objects undergoing relational violations. Cogn. Psychol. 14, 143–177 (1982).

Biederman outlines five object relations. The three semantic relations (defined at the level of object categories) are probability, position, and familiar size. The two syntactic relations (which do not operate at the object-category level) are interposition and support. According to Biederman, "these relations might constitute a sufficient set with which to characterize the organizations of a real-world scene as distinct from a display of unrelated objects." The world has structure, and characterizing this structure in terms of such rules is quite a noble effort.
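
To keep the taxonomy straight, here is a tiny illustrative encoding of the five relations (the example violations are my own paraphrases, not taken verbatim from the paper), grouped by whether they require object identity (semantic) or not (syntactic).

```python
# Illustrative only: Biederman's five relations, each with a made-up example
# of the kind of violation it is meant to flag in a scene.
semantic_relations = {           # require knowing the object's category
    "probability":   "a fire hydrant in a kitchen (unlikely object for the scene)",
    "position":      "a fire hydrant on top of a mailbox (wrong location)",
    "familiar size": "a fire hydrant taller than a building (wrong scale)",
}
syntactic_relations = {          # hold regardless of the object's category
    "interposition": "a farther object drawn over the one occluding it",
    "support":       "a sofa floating in mid-air (nothing holds it up)",
}

for name, violation in {**semantic_relations, **syntactic_relations}.items():
    print(f"{name:>13}: violated by {violation}")
```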

A very interesting question that Biederman addresses is the following: do humans reason about syntactic relations before semantic relations or the other way around? A Gibsonian (direct perception) kind of way to think about the world is that processing of depth and space precedes the assignment of identity to the stuff that occupies the empty space around us. J.J. Gibson's view is in accordance with Marr's pipeline.

However, Biederman's study with human subjects (he is a psychologist) suggests that information about semantic relationships among objects is not accessed only after semantic-free physical relationships have been processed. Quoting him directly: "Instead of a 3D parse being the initial step, the pattern recognition of the contours and the access to semantic relations appear to be the primary stages," as well as "further evidence that an object's semantic relations to other objects are processed simultaneously with its own identification."

Now that I've whetted your appetite, let's bring out the glue.

P.S. Moshe Bar was a student of I. Biederman and S. Ullman (author of Against Direct Perception).

Sunday, August 10, 2008

What is segmentation? What is image segmentation?

According to Merriam-Webster, segmentation is "the process of dividing into segments" and a segment is "a separate piece of something; a bit, or a fragment." This is a rather broad definition which suggests that segmentation is nothing mystical -- it is just taking a whole and partitioning it into pieces. One can segment sentences, time periods, tasks, inhabitants of a country, and digital images.

Segmentation is a term that often pops up in technical fields such as Computer Vision. I have attempted to write a short article on Knol about Image Segmentation and how it pertains to Computer Vision. Deciding to abstain from discussing specific algorithms -- which might be of interest to graduate students but not the population as a whole -- I instead target the high-level question, "Why segment images?" The answer, according to me, is that image segmentation (and any other image processing task) should be performed solely to assist object recognition and image understanding.
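
As a minimal concrete example of the "partition into pieces" definition (toy data, not any particular algorithm from the literature): one of the crudest image segmentations is thresholding followed by grouping connected foreground pixels.

```python
import numpy as np
from scipy import ndimage

# A toy grayscale "image" with two bright regions on a dark background.
image = np.zeros((8, 8))
image[1:4, 1:4] = 0.9
image[5:7, 4:7] = 0.8

# Crude segmentation: threshold, then group connected foreground pixels.
foreground = image > 0.5
segments, num_segments = ndimage.label(foreground)

print("number of segments:", num_segments)   # 2
# 'segments' assigns every pixel the id of the piece it belongs to
# (0 = background) -- the whole image partitioned into fragments,
# ideally fragments that a recognition system can then make sense of.
```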