
Wednesday, June 26, 2013

[Awesome@CVPR2013] Scene-SIRFs, Sketch Tokens, Detecting 100,000 object classes, and more

I promised to blog about some more exciting papers at CVPR 2013, so here is a short list of a few papers which stood out.  This list also includes this year's award-winning paper: Fast, Accurate Detection of 100,000 Object Classes on a Single Machine.  Congrats Google Research on the excellent paper!



This paper builds on ideas from Abhinav Gupta's work on 3D scene understanding as well as Ali Farhadi's work on visual phrases, and it also uses RGB-D input data (like many other CVPR 2013 papers).

W. Choi, Y. -W. Chao, C. Pantofaru, S. Savarese. "Understanding Indoor Scenes Using 3D Geometric Phrases" in CVPR, 2013. [pdf]

This paper uses the crowd to learn which parts of birds are useful for fine-grained categorization.  If you work on fine-grained categorization or run experiments with MTurk, then you gotta check this out!
Fine-Grained Crowdsourcing for Fine-Grained Recognition. Jia Deng, Jonathan Krause, Li Fei-Fei. CVPR, 2013. [ pdf ]

This paper won the best paper award.  Congrats Google Research!

Fast, Accurate Detection of 100,000 Object Classes on a Single Machine. Thomas Dean, Mark Ruzon, Mark Segal, Jon Shlens, Sudheendra Vijayanarasimhan, Jay Yagnik. CVPR, 2013 [pdf]


The following is the Scene-SIRFs paper, which I thought was one of the best papers at this year's CVPR.  The idea is to decompose an input image into intrinsic images using Barron's algorithm, which was initially shown to work on objects but is now being applied to realistic scenes.

Intrinsic Scene Properties from a Single RGB-D Image. Jonathan T. Barron, Jitendra Malik. CVPR, 2013 [pdf]
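For readers who haven't seen the intrinsic images literature: the classic decomposition factors an image into reflectance and shading, I = R x S. Here is a tiny Python sketch of that idea in the log domain, where shading is assumed to vary slowly across the image. This is just my own toy baseline for intuition; it is not Barron and Malik's algorithm (which reasons jointly about shape, illumination, and reflectance), and the function name and smoothing assumption are mine.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def naive_intrinsic_decomposition(rgb, sigma=5.0):
    """Toy intrinsic-image split: log(I) = log(R) + log(S), smooth shading.

    rgb: float image of shape (H, W, 3) with positive values.
    Returns (shading, reflectance). NOT Barron & Malik's method.
    """
    log_im = np.log(np.clip(rgb.astype(np.float64), 1e-3, None))
    # Assume shading varies slowly: estimate it by blurring the log-luminance.
    log_shading = gaussian_filter(log_im.mean(axis=2), sigma)
    # Whatever remains after removing shading is treated as reflectance.
    log_reflectance = log_im - log_shading[..., None]
    return np.exp(log_shading), np.exp(log_reflectance)
```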


This is a graph-based localization paper which uses a sort of "Visual Memex" to solve the problem.
Graph-Based Discriminative Learning for Location Recognition. Song Cao, Noah Snavely. CVPR, 2013. [pdf]


This paper provides an exciting new way of localizing contours in images which is orders of magnitude faster than gPb.  There is code available, so the impact is likely to be high.

Sketch Tokens: A Learned Mid-level Representation for Contour and Object Detection. Joseph J. Lim, C. Lawrence Zitnick, and Piotr Dollar. CVPR 2013. [ pdf ] [code@github]

Tuesday, June 18, 2013

Must-see Workshops @ CVPR 2013

June is that wonderful month during which computer vision researchers, students, and entrepreneurs go to CVPR -- the premier yearly Computer Vision conference.  Whether you are presenting a paper, learning about computer vision, networking with academic colleagues, looking for rock-star vision experts to join your start-up, or looking for rock-star vision start-ups to join, CVPR is where all of the action happens!  If you're not planning on going, it is not too late! The Conference starts next week in Portland, Oregon.


There are lots of cool papers at CVPR, many of which I have already studied in great detail, and many others which I will learn about next week.  I will write about some of the cool papers/ideas I encounter while I'm at CVPR next week.  In addition to the main conference, CVPR has 3 action-packed workshop days.  I want to take this time to mention two super-cool workshops which are worth checking out during CVPR 2013.  Workshop talks are generally better than the main conference talks, since the invited speakers tend to be more senior and they get to present a broader view of their research (compared to the content of a single 8-page research paper as is typically discussed during the main conference).

SUNw: Scene Understanding Workshop
Sunday June 23, 2013


From the webpage: Scene understanding started with the goal of building machines that can see like humans to infer general principles and current situations from imagery, but it has become much broader than that. Applications such as image search engines, autonomous driving, computational photography, vision for graphics, human machine interaction, were unanticipated and other applications keep arising as scene understanding technology develops. As a core problem of high level computer vision, while it has enjoyed some great success in the past 50 years, a lot more is required to reach a complete understanding of visual scenes.

I attended some of the other SUN workshops, which were held at MIT during the winter months.  This time around, the workshop is co-located with CVPR, so by definition it will be accessible to more researchers.  Even though I have the pleasure of personally knowing the super-smart workshop organizers (Jianxiong Xiao, Aditya Khosla, James Hays, and Derek Hoiem), the most exciting tidbit about this workshop is the all-star invited speaker schedule.  The speakers include: Ali Farhadi, Yann LeCun, Fei-Fei Li, Aude Oliva, Deva Ramanan, Silvio Savarese, Song-Chun Zhu, and Larry Zitnick.  To hear some great talks and learn about truly bleeding-edge research by some of vision's most talented researchers, come to SUNw.

VIEW 2013: Vision Industry and Entrepreneur Workshop
Monday, June 24, 2013



From the webpage: Once largely an academic discipline, computer vision today is also a commercial force. Startups and global corporations are building businesses based on computer vision technology. These businesses provide computer vision based solutions for the needs of consumers, enterprises in many commercial sectors, non-profits, and governments. The demand for computer vision based solutions is also driving commercial and open-source development in associated areas, including hardware and software platforms for embedded and cloud deployments, new camera designs, new sensing technologies, and compelling applications. Last year, we introduced the IEEE Vision Industry and Entrepreneur Workshop (VIEW) at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) to bridge the academic and commercial worlds of computer vision. 

I include this workshop in the must-see list because the time is right for Computer Vision researchers to start innovating at start-ups.  First of all, the world wants your vision-based creations today.  With the availability of smart phones and widespread broadband access, the world does not want to wait a decade until the full academic research pipeline gets translated into products.  Seeing such workshops at CVPR is exciting, because this will help breed a new generation of researcher-entrepreneurs.  I, for one, welcome our new company-starting computer vision overlords.

Tuesday, April 17, 2012

Using Panoramas for Better Scene Understanding

There's a lot more to automated object interpretation than merely predicting the correct category label.  If we want machines to be able to one day interact with objects in the physical world, then predicting additional properties of objects such as their attributes, segmentations, and poses is of utmost importance.  This has been one of the key motivations in my own research behind exemplar-based models of object recognition.

The same argument holds for scenes.  If we want to build machines which understand the environments around them, then they will have to do much more than predict some sloppy "scene category."  Consider what happens when a machine automatically analyzes a picture and says that it is from the "theatre" category.  Well, the picture could be of the stage, the emergency exit, or just about anything else within a theatre -- in each of these cases, the "theatre" category would be deemed correct, but would fall short of explaining the content of the image.  Most scene understanding papers either focus on getting the scene category right or strive to obtain a pixel-wise semantic segmentation map.  However, there's more to scene categories than meets the eye.

Well, there is an interesting paper which will be presented this summer at the CVPR2012 Conference in Rhode Island which tries to bring the concept of "pose" into scene understanding.  Pose-estimation has already been well established in the object recognition literature, but this is one of the first serious attempts to bring this new way of thinking into scene understanding.

J. Xiao, K. A. Ehinger, A. Oliva and A. Torralba.
Recognizing Scene Viewpoint using Panoramic Place Representation.
Proceedings of 25th IEEE Conference on Computer Vision and Pattern Recognition, 2012.

The SUN360 panorama project page also has links to code, etc.


The basic representation unit of places in their paper is that of a panorama.  If you've ever taken a vision course, then you probably stitched some of your own.  Below are some examples of cool looking panoramas from their online gallery.  A panorama roughly covers the space of all images you could take while centered within a place.

Car interior panoramas from SUN360 page
 Building interior panoramas from SUN360 page

What the proposed algorithm accomplishes is twofold.  First it acts like an ordinary scene categorization system, but in addition to producing a meaningful semantic label, it also predicts the likely view within a place.  This is very much like predicting that there is a car in an image, and then providing an estimate of the car's orientation.  Below are some pictures of inputs (left column), a compass-like visualization which shows the orientation of the picture (with respect to a cylindrical panorama), as well as a depiction of the likely image content to fall outside of the image boundary.  The middle column shows per-place mean panoramas (in the style of TorralbaArt), as well as the input image aligned with the mean panorama.
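To give a rough flavor of how such a "compass" prediction could work against a mean panorama, here is a little Python sketch that slides the query image across a category's cylindrical panorama and keeps the best-correlating offset.  This is my own simplification for intuition only; it is not the actual SUN360 algorithm, and the feature maps and function name are placeholders I made up.

```python
import numpy as np

def estimate_viewpoint(image_feat, panorama_feat):
    """Estimate which horizontal offset of a cylindrical panorama best
    matches the query photo, as a stand-in for the 'compass' prediction.

    image_feat:    (H, W) feature map of the query photo (e.g., gradient energy)
    panorama_feat: (H, P) feature map of a category's mean panorama, P >= W
    Returns (best_offset, scores), treating the panorama as wrapping around.
    """
    H, W = image_feat.shape
    P = panorama_feat.shape[1]
    # Wrap the panorama so candidate views that cross the seam are handled.
    wrapped = np.concatenate([panorama_feat, panorama_feat[:, :W]], axis=1)
    scores = np.empty(P)
    for offset in range(P):
        window = wrapped[:, offset:offset + W]
        # Normalized correlation between the photo and this candidate view.
        scores[offset] = np.corrcoef(image_feat.ravel(), window.ravel())[0, 1]
    return int(np.argmax(scores)), scores
```

The wrap-around concatenation is the only subtlety: a cylindrical panorama has no left or right edge, so candidate views that straddle the seam must be scored too.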


I think panoramas are a very natural representation for places, perhaps not as rich as a full 3D reconstruction of places, but definitely much richer than static photos.  If we want to build better image understanding systems, then we should seriously start looking at richer sources of information than static images.  There is only so much you can do with static images and MTurk, so videos, 3D models, panoramas, etc. are likely to be big players in the upcoming years.

Tuesday, January 12, 2010

Image Interpretation Objectives

An example of a typical complex outdoor natural scene that a general knowledge-based image interpretation system might be expected to understand is shown in Figure 1. An objective of such systems is to identify semantically meaningful visual entities in a digitized and segmented image of some scene. That is, to correctly assign semantically meaningful labels (e.g., house, tree, grass, and so on) to regions in an image -- see [29,30]. A computer-based image interpretation system can be viewed as having two major components, a "low-level" component and a "high-level" component [19],[31]. In many respects, the low-level portion of the system is designed to mimic the early stages of visual image processing in human-like systems. In these early stages, it is believed that scenes are partitioned, to some extent, into regions that are homogeneous with respect to some set of perceivable features (i.e., feature vector) in the scene [6],[40],[39]. To this extent, most low-level general purpose computer vision systems are designed to perform the same task. An example of a partitioning (i.e., segmentation) of Figure 1 into homogeneous regions is shown in Figure 2. The knowledge-based computer vision system we shall describe in this paper is not currently concerned with resegmenting portions of an image. Rather, its task is to correctly label as many regions as possible in a given segmentation.

This is a direct quote from a 1984 paper on computer vision. A great example of segmentation-driven scene understanding. The content is similar enough to my own line of work that it could have been an excerpt from my own thesis.

It is actually in a section called Image Interpretation Objectives from "Evidential Knowledge-Based Computer Vision" by Leonard P. Wesley, 1984. I found this while reading lots of good tech reports from SRI International's AI Center in Menlo Park. Some good stuff there by Tenenbaum, Barrow, Duda, Hart, Nilsson, Fischler, Pereira, Pentland, Fua, Szeliski, to name a few. Lots of stuff there is relevant to scene understanding and grounds the problem in robotics (since there was no "internet" vision back in the 70s and 80s).
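To make the low-level/high-level split from the quote concrete, here is a minimal Python sketch of the "high-level" step only: assigning labels to already-segmented regions by nearest prototype in feature space.  This is my own toy illustration, not the evidential reasoning machinery of Wesley's system; the region features and label prototypes are assumed to be given.

```python
import numpy as np

def label_regions(features_by_region, prototypes):
    """Assign a semantic label to each segmented region by nearest prototype.

    features_by_region: dict region_id -> feature vector (e.g., mean color/texture)
    prototypes:         dict label     -> prototype feature vector
    Only the 'high-level' labeling step is covered here; the low-level
    segmentation and feature extraction are assumed to exist already.
    """
    labels = {}
    for region_id, feat in features_by_region.items():
        best_label, best_dist = None, np.inf
        for label, proto in prototypes.items():
            dist = np.linalg.norm(np.asarray(feat) - np.asarray(proto))
            if dist < best_dist:
                best_label, best_dist = label, dist
        labels[region_id] = best_label
    return labels
```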

On another note, I still haven't been able to find a copy of the classic paper, Experiments in Interpretation-Guided Segmentation by Tenenbaum and Barrow from 1978. If anybody knows where to find a pdf copy, send me an email. UPDATE: Thanks for the quick reply! I have the paper now.

Thursday, November 12, 2009

Learning and Inference in Vision: from Features to Scene Understanding


Tomorrow, Jonathan Huang and I are giving a Computer Vision tutorial at the First MLD (Machine Learning Department) Research Symposium at CMU. The title of our presentation is Learning and Inference in Vision: from Features to Scene Understanding.

The goal of the tutorial is to expose Machine Learning students to state-of-the-art object recognition, scene understanding and the inference problems associated with such high-level recognition problems. Our target audience is graduate students with little or no prior exposure to object recognition who would like to learn more about the use of probabilistic graphical models in Computer Vision. We outline the difficulties present in object recognition/detection and present several different models for jointly reasoning about multiple object hypotheses.

Saturday, November 07, 2009

A model of thought: The Associative Indexing of the Memex

The Memex "Memory Extender" is an organizational device, a conceptual device, and a framework for dealing with conceptual relationships in an associative way. Abandoning the Aristotelian tradition of rooting concepts in definitions, the Memex suggests an association-based, non-parametric, and data-driven representation of concepts.

Since the mind=software analogy is so deeply ingrained in my thoughts, it is hard for me to see intelligent reasoning as anything but a computer program (albeit one which we might never discover/develop). It is worthwhile to see sketches of the memex from an era before computers. (See Figure below). However, with the modern Internet, a magnificent example of Bush's ideology, with links denoting the associations between pages, we need no better analogy. Bush's critique of the artificiality of traditional schemes of indexing resonates in the world wide web.


A Mechanical Memex Sketch

By extrapolating Bush's anti-indexing argument to visual object recognition, I realize that the blunder is to assign concepts to rigid categories. The desire to break free from categorization was the chief motivation for my Visual Memex paper. If Bush's ideas were so successful in predicting the modern Internet, we should ask ourselves, "Why are categories so prevalent in computational models of perception?" Maybe it is machine learning, with its own tradition of classes in supervised learning approaches, that has scarred the way we computer scientists see reality.

“The human mind does not work that way. It operates by association. With one item in its grasp, it snaps instantly to the next that is suggested by the association of thoughts, in accordance with some intricate web of trails carried by the cells of the brain. It has other characteristics, of course; trails that are not frequently followed are prone to fade, items are not fully permanent, memory is transitory. Yet the speed of action, the intricacy of trails, the detail of mental pictures, is awe-inspiring beyond all else in nature.” -- Vannevar Bush

Is Google's grasp of the world of information anything more than a Memex? I'm not convinced that it is not. While the feat of searching billions of web pages in real time has already been demonstrated by Google (and reinforced every day), the best computer vision approaches as of today look nothing like Google's data-driven way of representing concepts. I'm quite interested in pushing this link-based, data-driven mentality to the next level in the field of Computer Vision. Breaking free from the categorization assumptions that plague computational perception might be the key ingredient in the recipe for success.

Instead of summarizing, here is another link to a well-written article on the Memex by Finn Brunton. Quoting Brunton, "The deepest implications of the Memex would begin to become apparent here: not the speed of retrieval, or even the association as such, but the fact that the association is arbitrary and can be shared, which begins to suggest that, at some level, the data itself is also arbitrary within the context of the Memex; that it may not be “the shape of thought,” emphasis on the the, but that it is the shape of a new thought, a mediated and mechanized thought, one that is described by queries and above all by links."

Thursday, February 12, 2009

Context is the 'glue' that binds objects in coherent scenes.

This is not my own quote. It is one of my favorites from Moshe Bar. It comes from his paper "Visual Objects in Context."

I have been recently giving context (in the domain of scene understanding) some heavy thought.
While Bar's paper is good, the one I wanted to focus on goes back to the 1980s. According to Bar, the following paper (which I wish everybody would at least skim) is "a seminal study that characterizes the rules that govern a scene's structure and their influence on perception."

Biederman, I., Mezzanotte, R. J. & Rabinowitz, J. C. Scene perception: detecting and judging objects undergoing relational violations. Cogn. Psychol. 14, 143–177 (1982).

Biederman outlines 5 object relations. The three semantic (related to object categories) relations are probability, position, and familiar size. The two syntactic (not operating at the object category level) relations are interposition and support. According to Biederman, "these relations might constitute a sufficient set with which to characterize the organizations of a real-world scene as distinct from a display of unrelated objects." The world has structure and characterizing this structure in terms of such rules is quite a noble effort.
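For my own bookkeeping, here is one way to encode Biederman's taxonomy in code. The grouping into semantic vs. syntactic relations is straight from the paragraph above, but the encoding itself is purely illustrative.

```python
from enum import Enum

class Relation(Enum):
    """Biederman's five object-scene relations."""
    PROBABILITY = "probability"      # how likely the object is in this scene
    POSITION = "position"            # where in the scene the object tends to appear
    FAMILIAR_SIZE = "familiar_size"  # expected size relative to other objects
    INTERPOSITION = "interposition"  # occlusion must be consistent
    SUPPORT = "support"              # objects rest on surfaces rather than float

# Semantic relations operate at the object-category level; syntactic ones do not.
SEMANTIC = {Relation.PROBABILITY, Relation.POSITION, Relation.FAMILIAR_SIZE}
SYNTACTIC = {Relation.INTERPOSITION, Relation.SUPPORT}
```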

A very interesting question that Biederman addresses is the following: do humans reason about syntactic relations before semantic relations or the other way around? A Gibsonian (direct perception) kind of way to think about the world is that processing of depth and space precedes the assignment of identity to the stuff that occupies the empty space around us. J.J. Gibson's view is in accordance with Marr's pipeline.

However, Biederman's study with human subjects (he is a psychologist) suggests that information about semantic relationships among objects is not accessed after semantic-free physical relationships. Quoting him directly, "Instead of a 3D parse being the initial step, the pattern recognition of the contours and the access to semantic relations appear to be the primary stages" as well as "further evidence that an object's semantic relations to other objects are processed simultaneously with its own identification."

Now that I've whetted your appetite, let's bring out the glue.

P.S. Moshe Bar was a student of I. Biederman and S. Ullman (Against Direct Perception author).

Thursday, February 05, 2009

SUNS 2009 cool talks

This past Friday I went to SUNS 2009 at MIT, and in my opinion the coolest talks were by Aude Oliva, David Forsyth, and Ce Liu.

While I will not summarize their talks, which covered unpublished work, I will provide a few high-level questions that capture (according to me) the ideas conveyed by these speakers.

Aude: What is the interplay between scene-level context, local category-specific features, as well as category-independent saliency that makes us explore images in a certain way when looking for objects?

David: Is naming all the objects depicted in an image the best way to understand an image? Don't we really want some type of understanding that will allow us to reason about never before seen objects?

Ce: Can we understand the space of all images by cleverly interpolating between what we are currently perceiving and what we have seen in the past?

Wednesday, January 28, 2009

SUNS 2009: Scene Understanding Symposium at MIT

This Friday (January 30, 2009) I will be attending SUNS 2009, otherwise known as the Scene Understanding Symposium, held at MIT and organized by Aude Oliva, Thomas Serre, and Antonio Torralba. It is free, so grad students in the area should definitely go!

Quoting the SUNS 2009 homepage, "SUnS 09 features 16 speakers and about 20 poster presenters from a variety of disciplines (neurophysiology, cognitive neuroscience, visual cognition, computational neuroscience and computer vision) who will address a range of topics related to scene and natural image understanding, attention, eye movements, visual search, and navigation."

I'm looking forward to the talks by researchers such as Aude Oliva, David Forsyth, Alan Yuille, and Ted Adelson. I will try to blog about some cool stuff while I'm there.