CS395T: Visual Recognition and Search
Spring 2008
Topics
Visual vocabularies
Mining image collections
Fast indexing methods
Faces
Datasets and dataset creation
Near-duplicate detection
Learning distance functions
Place recognition and kidnapped robots
Text/speech and images/video
Context and background knowledge in recognition
Learning about images from keyword-based Web search
Video summarization
Image and video retargeting
Exploring images in 3D
Canonical views and visualization
Shape matching
Detecting abnormal events
Visual vocabularies
Words are basic tokens in a document of text: they
allow us to index documents with a keyword search, or discover topics based on
common distributions of words. What is
the analogy for an image? Visual words are
prototypical local features that form a “vocabulary” to generate images. As with documents, they can be a useful
representation. Various recognition
approaches exploit a bag-of-visual-words feature space, identifying the
vocabulary words based on some quantization of a sample of local
descriptors. These papers address
questions surrounding vocabulary formation, including interest point selection,
quantization strategies, and maintaining efficient codebooks.
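As a rough illustration of the quantization step described above, the following is a minimal sketch, assuming plain k-means as the quantizer and toy 2-D points standing in for real local descriptors (such as 128-D SIFT vectors). The function names (`build_vocabulary`, `bag_of_words`) and the evenly spaced initialization are illustrative choices, not taken from any of the papers below.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_word(descriptor, vocabulary):
    """Index of the visual word closest to the descriptor."""
    return min(range(len(vocabulary)),
               key=lambda i: squared_distance(descriptor, vocabulary[i]))

def build_vocabulary(descriptors, k, iterations=10):
    """Plain k-means over a sample of descriptors (illustrative quantizer)."""
    # Assumption: initialize with evenly spaced samples so the run is deterministic.
    step = len(descriptors) // k
    vocabulary = [list(descriptors[i * step]) for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for d in descriptors:
            clusters[nearest_word(d, vocabulary)].append(d)
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster is empty
                vocabulary[i] = [sum(col) / len(members) for col in zip(*members)]
    return vocabulary

def bag_of_words(descriptors, vocabulary):
    """Histogram counting how often each visual word occurs in one image."""
    histogram = [0] * len(vocabulary)
    for d in descriptors:
        histogram[nearest_word(d, vocabulary)] += 1
    return histogram

# Toy descriptor sample: two well-separated groups.
sample = [(0.0, 0.1), (0.1, 0.0), (0.0, 0.0), (5.0, 5.1), (5.1, 5.0), (5.0, 5.0)]
vocabulary = build_vocabulary(sample, k=2)

# Descriptors extracted from one (hypothetical) image.
image_descriptors = [(0.05, 0.05), (4.9, 5.0), (5.2, 5.1)]
histogram = bag_of_words(image_descriptors, vocabulary)
```

The resulting histogram is the image's bag-of-visual-words representation. The papers below revisit each choice made here: where descriptors are sampled, how the quantizer is built, and how lookup stays efficient for large vocabularies.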
- *Sampling Strategies for Bag-of-Features Image
Classification. E. Nowak, F. Jurie,
and B. Triggs. In Proceedings of
the European Conference on Computer Vision (ECCV), 2006. [pdf]
- Visual Categorization with Bags of Keypoints, by
G. Csurka, C. Bray, C. Dance, and L. Fan.
In Workshop on Statistical Learning in Computer Vision, ECCV,
2004. [pdf]
- Adapted Vocabularies for Generic Visual Categorization,
by F. Perronnin, C. Dance, G. Csurka, M. Bressan, in Proceedings of the
European Conference on Computer Vision (ECCV), 2006. [pdf]
- *Fast Discriminative Visual Codebooks using
Randomized Clustering Forests, by A. Moosmann, B. Triggs and F.
Jurie. Neural Information
Processing Systems (NIPS), 2006. [pdf]
- Object Categorization by Learned Universal
Visual Dictionary. J. Winn, A.
Criminisi and T. Minka. In
Proceedings of the IEEE International Conference on Computer Vision
(ICCV), 2005. [pdf]
- Vector Quantizing Feature Space with a Regular Lattice, by T.
Tuytelaars and C. Schmid, in Proceedings of the IEEE International
Conference on Computer Vision (ICCV), 2007. [pdf]
- *Scalable Recognition
with a Vocabulary Tree, by D.
Nister and H. Stewenius, in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), 2006. [pdf]
- Adaptive Vocabulary Forests for Dynamic Indexing
and Category Learning, by T. Yeh, J. Lee, and T. Darrell. In Proceedings of the IEEE International
Conference on Computer Vision (ICCV), 2007. [pdf] [web]
Related links
Executables
for interest operators and descriptors, from Oxford VGG
Benchmark database from
University of Kentucky, used in the vocabulary tree paper, plus the semiprocessed data.
Libpmk, library from John Lee
that includes hierarchical clustering / vocab
Software from
LEAR team at INRIA, including interest point detectors, shape features,
randomized forest image classifier
Mining image collections
Mining large unstructured collections of images can
identify common visual patterns and allow the discovery of topics or even
categories. These papers include methods
for clustering according to latent topics and repeated configurations of
features, mining for association rules, and playing with large image
collections.
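To make the association-rule idea concrete, here is a minimal sketch, assuming each image has already been reduced to a set of visual-word labels (the `w1`, `w2`, ... names are made up). It counts itemsets by brute force rather than using the candidate pruning of the Apriori algorithm, so it only illustrates the definitions of support and confidence.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, size):
    """Itemsets of the given size whose support meets the threshold."""
    counts = Counter()
    for t in transactions:
        for itemset in combinations(sorted(t), size):
            counts[itemset] += 1
    n = len(transactions)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Each "transaction" is the set of visual words found in one image (toy data).
transactions = [
    {"w1", "w2", "w3"},
    {"w1", "w2"},
    {"w2", "w3"},
    {"w1", "w2", "w4"},
]

pairs = frequent_itemsets(transactions, min_support=0.5, size=2)
rule_confidence = confidence({"w1"}, {"w2"}, transactions)
```

In this toy data every image containing `w1` also contains `w2`, so the rule `w1 -> w2` has confidence 1.0, while the reverse rule is weaker.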
- Video Data
Mining Using Configurations of Viewpoint Invariant Regions, by Sivic, J.
and Zisserman, A. in Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
2004. [pdf]
- Efficient Mining of Frequent and Distinctive
Feature Configurations, by T.
Quack, V. Ferrari, B. Leibe, and L. Van Gool, In Proceedings of the IEEE
International Conference on Computer Vision (ICCV), 2007. [pdf]
- Mining Association Rules Between Sets of Items in Large Databases,
by R. Agrawal, T. Imielinski, and A. N. Swami. In Special Interest Group on Management
of Data (SIGMOD), 1993. [pdf]
- Discovering Objects and Their Location in
Images, by J. Sivic, B. Russell, A.
Efros, A. Zisserman, and W. Freeman, In Proceedings of the IEEE
International Conference on Computer Vision (ICCV), 2005. [pdf]
[web]
- Mining Image Datasets using Perceptual Association Rules, by J.
Tesic, S. Newsam, and B. S. Manjunath.
In SIAM’03 Workshop on Mining Scientific and
Engineering Datasets, 2003. [pdf]
Related links
pLSA
implementations
Matlab
code and data for affinity propagation, from Dueck & Frey
Weka: Java data mining software,
includes implementation of Apriori algorithm
Fast indexing methods
Content-based image and video retrieval, as well as example-based
recognition systems, require the ability to rapidly search very large image
collections. This area deals with
algorithms for fast search, specifically in the context of indexing images or
image features.
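One family of fast search methods referenced in the links for this topic is locality-sensitive hashing (LSH). The following is a minimal sketch, assuming random-hyperplane hashing with a single hash table; the function names, the number of bits, and the toy vectors are illustrative and do not reproduce any particular paper's scheme.

```python
import random
from collections import defaultdict

def make_hyperplanes(num_bits, dim, seed=0):
    """Random Gaussian hyperplanes; each one contributes one bit to the hash key."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def hash_key(vector, hyperplanes):
    """Bit pattern recording on which side of each hyperplane the vector falls."""
    return tuple(int(sum(h * v for h, v in zip(plane, vector)) >= 0)
                 for plane in hyperplanes)

def build_index(vectors, hyperplanes):
    """Hash table mapping each key to the indices of the vectors stored under it."""
    table = defaultdict(list)
    for idx, v in enumerate(vectors):
        table[hash_key(v, hyperplanes)].append(idx)
    return table

def query(q, vectors, table, hyperplanes):
    """Search only the query's bucket; returns None if the bucket is empty."""
    candidates = table.get(hash_key(q, hyperplanes), [])
    return min(candidates,
               key=lambda i: sum((a - b) ** 2 for a, b in zip(q, vectors[i])),
               default=None)

# Toy database of 3-D feature vectors.
database = [(1.0, 0.2, 0.1), (0.9, 0.3, 0.0), (-1.0, 0.5, 0.2), (0.0, -1.0, 0.8)]
planes = make_hyperplanes(num_bits=4, dim=3)
index = build_index(database, planes)
match = query(database[2], database, index, planes)
```

Only the vectors that share the query's bucket are compared exhaustively, which is where the speedup comes from. Practical systems typically use several tables to reduce the chance of missing a true neighbor.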
- Scalable Recognition
with a Vocabulary Tree, by D.
Nister and H. Stewenius, in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), 2006. [pdf]
- *A Binning Scheme for
Fast Hard Drive Based Image Search, F.
Fraundorfer, H.
Stewenius, and D. Nister, in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), 2007.
[pdf]
- *Fast Pose Estimation with Parameter Sensitive
Hashing, by G. Shakhnarovich, P. Viola, T. Darrell, In Proceedings of the
IEEE International Conference on Computer Vision (ICCV), 2003. [pdf]
- Video Google: A Text Retrieval Approach
to Object Matching in Videos, by J.
Sivic and A. Zisserman, In Proceedings of the IEEE International
Conference on Computer Vision (ICCV), 2003. [pdf] [web]
- Fast Similarity Search for Learned Metrics. P. Jain, B. Kulis, and K. Grauman. UTCS Technical Report #TR-07-48,
September, 2007.
- *Learning Embeddings
for Fast Approximate Nearest Neighbor Retrieval. V. Athitsos, J. Alon, S. Sclaroff,
and G. Kollios, Nearest-Neighbor
Methods in Learning and Vision: Theory and Practice, G. Shakhnarovich, T. Darrell and P. Indyk,
Editors. MIT Press, March
2006. [ps]
Related links
LSH homepage, email authors for code
package
LSH Matlab code by
Greg Shakhnarovich
Nearest
neighbor datasets from Vassilis Athitsos
Electronic
copy of the book Nearest Neighbor Methods
in Learning and Vision: Theory and Practice (UT EID required)
Faces
These papers consider the problems of detecting
faces, recognizing familiar faces, and looking for repeated faces in
videos. A variety of techniques are
represented below.
- Face Recognition: A Literature Survey, by W. Zhao, R. Chellappa, A.
Rosenfeld, and P. Phillips. In ACM
Computing Surveys, 2003. [pdf]
- *Rapid Object Detection Using a Boosted Cascade
of Simple Features, by P. Viola and M. Jones, In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), 2001. [pdf]
- Active Appearance Models, by T. F. Cootes, G. J. Edwards
and C. J. Taylor. IEEE Transactions on
Pattern Analysis and Machine Intelligence (PAMI), Vol. 23, No. 6,
pp. 681-685, 2001.
- *Automatic Cast Listing in Feature-Length
Films with Anisotropic Manifold Space, by O. Arandjelovic and R. Cipolla, In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), 2006. [pdf]
- Person Spotting: Video Shot Retrieval for Face Sets, J. Sivic, M.
Everingham, and A. Zisserman. In International Conference on Image and
Video Retrieval (CIVR), 2005. [pdf]
- Leveraging Archival Video for Building
Face Datasets, D. Ramanan, S.
Baker, S. Kakade. In Proceedings of the IEEE International
Conference on Computer Vision (ICCV), 2007. [pdf]
- Face Recognition by Humans: 19 Results All Computer Vision
Researchers Should Know About, by P. Sinha, B. Balas, Y. Ostrovsky, and R.
Russell, Proceedings of the IEEE,
Vol. 94, No. 11, November 2006, pp. 1948-1962. [pdf]
Related links
Intel’s OpenCV
library, includes Viola & Jones face detector
Active
Appearance Models code from Tim Cootes
Data collections of
detected faces, from Oxford VGG
Face data from Buffy
episode, from Oxford VGG
University of Cambridge face data
from films [go to Data link]
PolarRose.com
Pittsburgh Pattern Recognition face detector
demo
Datasets and dataset creation
These papers discuss issues in generating image
datasets for recognition research.
Benchmark image datasets allow direct comparisons between various
recognition algorithms, and having accessible prepared datasets can be critical
for the research itself. The process of
designing an image collection is also important, since the degree of
variability it contains can influence the assumptions made by new methods,
or may fail to show off their strengths.
Meanwhile, the process of collecting labeled data is expensive and can
be tedious. These papers include novel
ways to gather image collections with less pain, and highlight some of the
considerations to be made in database design.
*Coverage of this area should include highlights on recent commonly used
datasets.*
- Dataset Issues in Object Recognition. by J. Ponce, T.L. Berg, M.
Everingham, D.A. Forsyth, M. Hebert, S. Lazebnik, M. Marszalek, C. Schmid,
B.C. Russell, A. Torralba, C.K.I. Williams, J. Zhang, and A.
Zisserman. In J. Ponce et al. (Eds.):
Toward Category-Level Object Recognition, LNCS 4170, pp. 29–48, 2006. [pdf]
- Soylent Grid: it’s Made of People! by S.
Steinbach, V. Rabaud and S. Belongie,
ICCV workshop on Interactive Computer Vision, 2007. [pdf]
- Harvesting Image Databases from the Web, by F. Schroff, A.
Criminisi, and A. Zisserman, Proceedings of the IEEE International
Conference on Computer Vision (ICCV), 2007. [pdf]
[No demo on this topic.]
Related links
Dataset
list with links
Near-duplicate detection
This problem
involves detecting cases where multiple images (or videos) are the same except
for some slight alterations.
Near-duplicate detection can be useful for detecting copyright
violations or forged images. These
papers include several vision approaches, as well as some papers on the core algorithms
often used.
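Among the core algorithms in this list is min-hash, which estimates set resemblance (Jaccard similarity) from short signatures. Below is a minimal sketch, assuming each image is represented as a set of visual-word IDs and using salted MD5 digests as a stand-in for independent random hash functions. The names and the signature length are illustrative.

```python
import hashlib

def jaccard(a, b):
    """Exact resemblance: size of the intersection over size of the union."""
    return len(a & b) / len(a | b)

def _hash(item, salt):
    """Deterministic hash of an item under one salted hash function."""
    digest = hashlib.md5(f"{salt}:{item}".encode("utf-8")).hexdigest()
    return int(digest, 16)

def minhash_signature(items, num_hashes=64):
    """For each hash function, keep the minimum hash value over the set."""
    return [min(_hash(x, salt) for x in items) for salt in range(num_hashes)]

def estimated_similarity(sig_a, sig_b):
    """Fraction of positions where two signatures agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

# Toy sets of visual-word IDs for two images that share half their words.
image_a = {1, 2, 3, 4}
image_b = {3, 4, 5, 6}

exact = jaccard(image_a, image_b)
estimate = estimated_similarity(minhash_signature(image_a),
                                minhash_signature(image_b))
```

The probability that two sets agree at one signature position equals their Jaccard similarity, so the estimate approaches the exact value as the signature grows. Comparing short signatures instead of full sets is what makes the approach scale.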
- Efficient Near-Duplicate Detection and
Subimage Retrieval, by Yan Ke,
Rahul Sukthankar, and Larry Huston, ACM Multimedia 2004. [pdf]
- Enhancing DPF for Near-replica Image
Recognition, by Y. Meng, E. Chang, and B.
Li, Proceedings of the Conference on Computer Vision and Pattern
Recognition (CVPR), 2003. [pdf]
- Content-based Copy Detection using
Distortion-Based Probabilistic Similarity Search, by A. Joly, O. Buisson,
and C. Frélicot. In IEEE
Transactions on Multimedia, 2007. [pdf]
- Filtering Image Spam with Near-Duplicate
Detection, by Zhe Wang, W. Josephson, Q. Lv, M. Charikar, and K. Li. Proceedings
of the 4th Conference on Email and Anti-Spam (CEAS), 2007. [pdf]
- M. Henzinger. Finding Near-Duplicate Web Pages: a Large-Scale
Evaluation of Algorithms. In ACM Special Interest Group on Information
Retrieval (SIGIR), 2006. (text
application) [pdf]
- On the Resemblance and Containment of Documents, Andrei Z. Broder,
1997. [pdf]
- Similarity Estimation Techniques from Rounding Algorithms, M. S.
Charikar. In 34th Annual
ACM Symposium on Theory of Computing (May 2002). [ps]
- Scalable Near Identical Image and Shot
Detection, by O. Chum, J. Philbin,
M. Isard, and A. Zisserman, ACM International
Conference on Image and Video Retrieval, 2007. [pdf]
Related links:
Data
from Ke et al. paper
LSH
homepage, email authors for code package
LSH Matlab code by
Greg Shakhnarovich
TRECVID data
Learning distance functions
The success
of any distance-based indexing, clustering, or classification scheme depends
critically on the quality of the chosen distance metric, and the extent to which
it accurately reflects the true underlying relationships between the examples
in a particular data domain. An optimal distance metric should report small
distances for examples that are similar in the parameter space of interest (or
that share a class label), and large distances for examples that are
unrelated. These papers consider
distance learning specifically for image retrieval tasks.
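To illustrate what a learned metric changes, here is a minimal sketch of a Mahalanobis-style distance d(x, y) = sqrt((x - y)^T M (x - y)). The matrix `learned` is a hypothetical hand-picked example, not the output of any of the methods below; it simply down-weights the second feature dimension.

```python
import math

def mahalanobis(x, y, M):
    """Distance sqrt((x - y)^T M (x - y)) for a positive semidefinite matrix M."""
    diff = [a - b for a, b in zip(x, y)]
    m_diff = [sum(m * d for m, d in zip(row, diff)) for row in M]
    return math.sqrt(sum(d * md for d, md in zip(diff, m_diff)))

identity = [[1.0, 0.0], [0.0, 1.0]]   # reduces to Euclidean distance
learned = [[1.0, 0.0], [0.0, 0.01]]   # hypothetical: second dimension matters little

a = (0.0, 0.0)
b = (0.0, 10.0)   # differs from a only in the down-weighted dimension
c = (3.0, 0.0)    # differs from a only in the important dimension

euclid_ab = mahalanobis(a, b, identity)
euclid_ac = mahalanobis(a, c, identity)
learned_ab = mahalanobis(a, b, learned)
learned_ac = mahalanobis(a, c, learned)
```

Under the Euclidean metric `c` is the nearer neighbor of `a`, while under the hypothetical learned metric `b` is nearer. Metric learning methods aim to choose M (or a more general distance function) so that such rankings agree with class labels or similarity constraints.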
- Learning Distance
Functions for Image Retrieval, by T.
Hertz, A. Bar-Hillel and D. Weinshall, in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR) 2004.
[pdf]
- Learning a Mahalanobis Metric from Equivalence
Constraints, by A.
Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall, in Journal of Machine
Learning Research (JMLR), 2005. [pdf]
- *Learning Globally-Consistent Local Distance
Functions for Shape-Based Image Retrieval and Classification, by A. Frome,
Y. Singer, F. Sha, J. Malik, in Proceedings of the IEEE International
Conference on Computer Vision (ICCV), 2007. [pdf] [web]
- *Invariant Large
Margin Nearest Neighbor Classifier,
by P. Mudigonda, P. Torr, and A. Zisserman, in Proceedings of the IEEE International Conference on
Computer Vision (ICCV), 2007. [pdf]
- Fast Pose Estimation with Parameter Sensitive
Hashing, by G. Shakhnarovich, P. Viola, and T. Darrell, in Proceedings of
the IEEE International Conference on Computer Vision (ICCV), 2003. [pdf]
Related links:
DistBoost code,
Hertz et al.
Relevant Components Analysis code, Hertz et al.
DistLearn toolkit
Large
Margin Nearest Neighbors code by Weinberger et al.
Nearest
neighbor datasets from Vassilis Athitsos
Place recognition and kidnapped robots
How can an image of the current scene allow
localization or place recognition? Or, put
more dramatically, how can a kidnapped robot that is carried off to an
arbitrary location figure out where it is with no prior knowledge of its
position? These papers address this
problem, some specifically with a robotics slant, and some in terms of the
image-based scene matching problem.
- *Vision-Based Global Localization and Mapping for Mobile
Robots, Se, S., Lowe, D., & Little, J.
IEEE Transactions on Robotics, 2005. [pdf]
- Image-Based Localisation, by R. Cipolla, D. Robertson and B. Tordoff. Proceedings
of the 10th International Conference on Virtual Systems and Multimedia,
2004. [pdf]
- *Qualitative Image Based Localization in
Indoors Environments, by J.
Kosecka, L. Zhou, P. Barber, and Z. Duric, in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), 2003. [pdf]
- Location Recognition and Global Localization Based on
Scale-Invariant Keypoints, by J. Kosecka and X. Yang, CVPR workshop 2004. [pdf]
- Searching the Web with
Mobile Images for Location Recognition, T. Yeh, K. Tollmar, and T. Darrell, in
Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2004. [pdf]
- Total Recall:
Automatic Query Expansion with a Generative Feature Model for Object
Retrieval, by O. Chum, J. Philbin,
J. Sivic, M. Isard, A. Zisserman, in Proceedings of the IEEE International
Conference on Computer Vision (ICCV), 2007. [pdf]
Related links:
Oxford
buildings dataset
Text and speech + images and video
Often images or videos are accompanied by text or
speech, which may provide complementary cues when we are trying to index,
cluster, or recognize objects. These
papers seek to leverage this cue in a number of different ways.
- *“Hello! My name is... Buffy” – Automatic Naming of Characters in
TV Video, by M. Everingham, J. Sivic and A. Zisserman, British Machine
Vision Conference (BMVC), 2006. [pdf]
- *Object Recognition as Machine Translation: Learning a Lexicon for
a Fixed Image Vocabulary, P. Duygulu, K. Barnard, N. de Freitas, and D.
Forsyth, in Proceedings of the European Conference on Computer Vision
(ECCV), 2002. [pdf] [web]
- Names and Faces in the News, by T. Berg, A.
Berg, J. Edwards, M. Maire, R. White, Y. Teh, E. Learned-Miller and D. Forsyth,
In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2004. [pdf] [web]
- Learning Structured Appearance Models
from Captioned Images of Cluttered Scenes, by M. Jamieson, A. Fazly, S.
Dickinson, S. Stevenson, S.
Wachsmuth. In Proceedings of the IEEE International
Conference on Computer Vision (ICCV), 2007. [pdf]
- Clustering Web Images with Multi-modal
Features, by M. Rege, M. Dong, and
J. Hua, ACM Multimedia 2007. [pdf]
Related links:
Face data from Buffy
episode, from Oxford Visual Geometry Group
Data from Duygulu et al. paper
Subrip for subtitle extraction
Context and background knowledge in recognition
Many recognition systems consider snapshots of
objects in isolation, both when training and testing. But both our intuition and cognitive studies
indicate that the object’s greater context can also be crucial to the
recognition process. These papers
consider how prior external knowledge can aid in recognizing objects or
categories. The context cues may come
from reasoning explicitly about the 3D environment, knowing something about the
patterns of a user, learning about the typical patterns of occurrence, or
gleaning knowledge from an organized ontology.
- *Putting Objects in Perspective, by D. Hoiem,
A.A. Efros, and M. Hebert, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
2006. [pdf] [web]
- Objects in Context, by A. Rabinovich, A. Vedaldi, C. Galleguillos, E.
Wiewiora, S. Belongie, in Proceedings of
the IEEE International Conference on Computer Vision (ICCV), 2007.
[pdf]
- Visual Contextual Awareness in Wearable Computing, by T. Starner,
B. Schiele, and A. Pentland. In Proceedings of the International
Symposium on Wearable Computers (ISWC), 1998. [pdf] [web]
- *Contextual Priming for Object Detection, by A. Torralba. International
Journal of Computer Vision, 2003.
[pdf] [web] [web]
- The Role of Context in Object Recognition, by A. Oliva and A.
Torralba. TRENDS in Cognitive Sciences, Vol 11 No 12, 2007. [pdf]
- Unsupervised Learning of Hierarchical Semantics
of Objects, by D. Parikh and T. Chen, in Proceedings of the International
Conference on Computer Vision (ICCV), 2007. [pdf]
[web]
Related links:
WordNet
Scene global feature code from Antonio
Torralba
MIT CSAIL database of
objects and scenes
Learning about images from keyword-based Web search
Keyword-based search on the Web can be used to
retrieve images (or videos) that appear near the query word, are named with the
word, or are explicitly tagged with it.
Of course, this is not a completely reliable way to find images of a
given object or scene, and typically an image contains much more information
than can be conveyed in a few words anyhow.
Yet search engines’ rapid access to large amounts of image/video content
makes them an interesting resource for vision research. These papers all consider ways to learn from
the images that come back from a keyword-based search, taking into account the
large amount of noise in the returns.
- *Learning Color Names from Real-World Images, by J. van de Weijer,
C. Schmid, J. Verbeek, in Proceedings of the IEEE International Conference
on Computer Vision (ICCV), 2007. [pdf]
- Searching the Web with
Mobile Images for Location Recognition, T. Yeh, K. Tollmar, and T. Darrell, in
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), 2004. [pdf]
- *Learning Object Categories from Google’s Image Search, by R.
Fergus, L. Fei-Fei, P. Perona, and A. Zisserman, in Proceedings of the IEEE International
Conference on Computer Vision (ICCV), 2005. [pdf] [web]
- Animals on the Web, by T. Berg and D. Forsyth,
in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR),
2006. [pdf]
- Keywords to Visual Categories: Multiple-Instance
Learning for Weakly Supervised Object Categorization, by S.
Vijayanarasimhan and K. Grauman, in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), 2008. [pdf]
- Harvesting Image Databases from the Web, by F. Schroff, A.
Criminisi, and A. Zisserman, in Proceedings
of the IEEE International Conference on Computer Vision (ICCV),
2007. [pdf]
- Probabilistic Web Image Gathering, by K. Yanai and K. Barnard, in ACM Multimedia
2005. [pdf]
Related links:
Animals on the
Web data from Berg et al.
Annotated Google
image data from Schroff et al. paper
Color name
datasets from van de Weijer et al. and
feature code
Google image data from Fergus et al.
Flickr Commons,
Library of Congress pilot project
Semantic robot vision challenge