"Image Retrieval and Classification using Local Distance Functions" - A. Frome,
Y. Singer, and J. Malik

The paper introduces a framework for learning the distance or similarity
function between a training image and other images in the training set for
visual recognition.  The distance functions are built on top of elementary
distance measures between patch-based features, where the authors use two scales
of geometric blur features and a color feature.  The distance functions are used
for image browsing, retrieval, and classification.

The main contribution of the paper is incorporating the function for measuring
similarity or distance between two different images into the machine learning
process.  The authors aim to demonstrate that the relative importance of visual
features at a finer, per-image scale can be useful for visual categorization,
and do so through the learned distance functions.  Their objective is to learn
the weights of the features for each training image, which yields a quantitative measure
of the relative importance of parts in an image.  This allows the authors to
combine and select features of different types.

The primary strength of the paper is its breakdown of the overall approach into
detailed segments such that the reader understands the goal and purpose for
each subsection.  The authors also explain how their use of "triplets" of
images for learning builds on the distance metric learning proposed by Schultz
and Joachims, and show that the Schultz and Joachims algorithm is more widely
applicable than originally presented.
Another strength is the explanation of specific settings in which the algorithm
can be used: image retrieval, browsing for an image, and classifying a query
image.

The authors use the Caltech101 dataset, which contains images from 101 different
categories with different numbers of images for each category, which is one of
the standard benchmarks for multi-class image categorization and object
recognition.  However, the authors ignored the background class, which, if
included in the experiments, might lead to poorer classification results.
The authors should experiment with different choices of "K" for determining
closest in-class and out-of-class images to make "triplets" for training.
The authors have also explored how the features perform separately and in
different combinations, which can be used to determine the best features to use
for classification.  Different combinations may perform better for certain image
classes, which would be an area for further exploration.

The work can be extended to reduce computation time.  The authors' unoptimized
code takes about 5 minutes per test image.  Since the authors use an exact
nearest neighbor computation, an approximate nearest neighbor algorithm could
be used to speed up the process, as sketched below.
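
The following is a minimal sketch (not the authors' code) of swapping the exact
patch-to-patch search for an approximate one, assuming patch descriptors are
stored as rows of NumPy arrays.  scipy's cKDTree with eps > 0 returns a neighbor
whose distance is within (1 + eps) of the true nearest distance, trading a
little accuracy for speed.

    # Distance from each focal-image patch to its (approximately) nearest
    # patch in the query image.
    import numpy as np
    from scipy.spatial import cKDTree

    def patch_distances(focal_patches, query_patches, eps=0.5):
        tree = cKDTree(query_patches)
        dists, _ = tree.query(focal_patches, k=1, eps=eps)  # eps=0 gives exact search
        return dists

    # Toy usage with random 64-dimensional descriptors.
    rng = np.random.default_rng(0)
    focal = rng.normal(size=(300, 64))
    query = rng.normal(size=(500, 64))
    print(patch_distances(focal, query).shape)  # (300,)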

-------------------

Review of
  A. Frome, Y. Singer, and J. Malik.
 
Image Retrieval and Classification Using Local Distance Functions.  NIPS 2006.   
___________________________________________________________________________

A key step in many algorithms for object recognition is the computation
of distance between a new image and a set of training images to determine
the class of the new image. This paper presents an approach to automatically
learn such distance functions from the training images.

While there have been other approaches to automatically learn the distance
function, the main contribution of this paper is to learn a distance function
for _each_ training image. By doing so, the authors are able to produce
recognition rates that are at least equal to those of the current
state-of-the-art approaches on the Caltech101 dataset.

Each image consists of patches that are each identified by a feature vector. 
The distance from a patch in a focal image to a query image is found by
computing the L2 distance to the nearest neighbor in the query image. The
distance from focal image to query image is defined by a weighted
sum of these patch distances. The goal is to learn the weights for each
training image. This is done by maximizing the margin by which images labeled
dissimilar to the focal image are farther from it than images labeled similar
(that is, by a maximum-margin formulation), as sketched below.
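
Below is a minimal sketch of that triplet-based, maximum-margin learning step,
assuming the elementary patch distances from the focal image to similar and
dissimilar images have already been stacked into arrays (one row per triplet).
The regularizer, solver, and triplet construction in the paper differ; this is
projected subgradient descent on a hinge loss with non-negative weights, for
illustration only.

    import numpy as np

    def learn_focal_weights(D_sim, D_dis, lam=0.1, lr=0.01, iters=500):
        """D_sim[i], D_dis[i]: patch-distance vectors for triplet i."""
        w = np.zeros(D_sim.shape[1])
        for _ in range(iters):
            # Require dissimilar images to be farther than similar ones by a margin of 1.
            margins = D_dis @ w - D_sim @ w
            viol = margins < 1.0
            grad = lam * w - (D_dis[viol] - D_sim[viol]).sum(axis=0)
            w -= lr * grad
            np.maximum(w, 0.0, out=w)  # project back to non-negative weights
        return w

The learned distance from the focal image to another image is then the dot
product of w with that image's vector of patch distances.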

A query image is assigned rankings with respect to each training image based
on the distance functions. To identify the class of a query image, a binary
classifier is learnt for each training image and then voting is used to
generate the probability of the query image belonging to a particular class.
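
An illustrative sketch of this classification step follows.  Each training
(focal) image scores the query with its own learned distance function, a
per-exemplar binary classifier turns that score into a probability-like vote,
and the votes are summed per class.  The function and parameter names here
(classify_query, patch_dist, sigmoid_params) are mine, and the sigmoid is only
a stand-in for whatever per-exemplar classifier the authors actually fit;
patch_dist can be the nearest-patch distance function sketched earlier.

    import numpy as np
    from collections import defaultdict

    def classify_query(query_patches, focal_images, patch_dist, sigmoid_params):
        """focal_images: list of (weights, class_label, focal_patches)."""
        votes = defaultdict(float)
        for (w, label, patches), (a, b) in zip(focal_images, sigmoid_params):
            d = w @ patch_dist(patches, query_patches)        # learned distance
            votes[label] += 1.0 / (1.0 + np.exp(a * d + b))   # per-exemplar vote
        return max(votes, key=votes.get)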
 
The paper is clearly written and easy to understand. Some advantages and
limitations of the approach are:

(1) One significant advantage is that the approach does not require each
    image to be defined by a fixed length feature vector: this allows the use
    of local features such as those obtained by interest point detectors
    rather than a global feature description. This means that one can compare
    a "simple" image with very few interest points to a complex cluttered
    scene with hundreds of interest points.

(2) Another advantage is that many of the weights learnt are zero, reducing the
    number of feature comparisons between a query image and a training image.

(3) An important feature of this method is that a distance function is learnt
    for each training image:

    (a) This may provide greater discriminative power.

    (b) The distance functions are not directly comparable since they are not
        in a normalized form. Thus, a binary classifier is trained for each
        training image rather than for each class. This could be a big
        disadvantage as the number of classes, and hence the total number of
        training images, grows. It also seems to generate redundant
        information. A human being can judge the relative "distance" between
        all images, creating a ranking in a global reference frame; a machine
        should be able to do the same rather than producing a relative ranking
        in each training image's reference frame. This is worth exploring
        further.

-------------------

"Unsupervised Learning of Models for Recognition" - M. Weber, M. Welling, P.
Perona

The authors present a method to learn object class models from unlabeled and
unsegmented cluttered scenes, where an object class is defined as a collection
of objects sharing characteristic features that are visually similar and occur
in similar spatial configurations.  Their algorithm automatically selects
distinctive parts of the object class and learns the joint probability density
function encoding the object's appearance.  They show that the automatically
constructed object detector is robust to clutter and occlusion and demonstrate
the algorithm on frontal views of faces and rear views of cars.
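
As a toy illustration of the kind of scoring such a model performs, the snippet
below evaluates a joint Gaussian density over the stacked coordinates of one
candidate part configuration (the shape term).  The real model also handles
part detection, occlusion of parts, and background clutter probabilistically,
and its parameters come from learning on training images; the function name and
the numbers below are invented purely for illustration.

    import numpy as np
    from scipy.stats import multivariate_normal

    def shape_log_likelihood(part_locations, mean, cov):
        """part_locations: (num_parts, 2) array of (x, y) for one hypothesis."""
        x = np.asarray(part_locations, dtype=float).ravel()  # stack into a 2*P vector
        return multivariate_normal.logpdf(x, mean=mean, cov=cov)

    # Toy usage with a 3-part model (parameters invented for illustration).
    mean = np.array([10, 10, 40, 12, 25, 30], dtype=float)
    cov = np.eye(6) * 4.0
    print(shape_log_likelihood([[11, 9], [39, 13], [26, 31]], mean, cov))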

The main contribution of the paper is demonstrating that it is feasible to
learn object models directly from unsegmented, cluttered images.  By automating
the running of part detectors over the image and the formation of likely object
hypotheses, the authors have extended the algorithm presented by Burl et al.,
which only estimated the joint probability density function automatically.
Although there are many areas for improvement in their algorithm, the authors
have shown that unsupervised learning of models for recognition is possible and
can be done efficiently.

The paper's primary strengths are organization and detailed explanation of the
implementation.  The authors present the general problem at hand, related work
on the area, their approach to the problem, details of the model, and
experimental results.  Each section transitions to the next with purpose, so
that the reader can follow the intuition the authors had when building their
algorithm.


An area for improvement would be to show direct comparisons to other algorithms:
a table or graph of the different computation times and detection/recognition
accuracies.  In order to determine the validity of an unsupervised
learning algorithm as opposed to a human supervised algorithm, a comparison
with a detector of the same implementation but with human supervision could
have been presented.

The experiments are convincing.  The authors allowed their model to classify the
images without any intervention. However, further tests could have been made to
improve the overall assessment of their algorithm.  More training and test
images of both the face and car classes could have been used to observe changes
in the detector's performance.  More parts could have been learned by the model
(more than 5 parts for both the face and car data sets), again to observe changes
in the detector's performance.  Of course, both of the above tests would have
drastically increased the computation times, but since training can be treated
as an off-line process, the additional experimentation would still have been
worthwhile.

The work can be extended by incorporating different approaches into the
detection algorithm, such as multiscale image processing,
multiorientation-multiresolution filters, or neural networks.  The scale and
orientation of the image patch, along with parameters describing the patch's
appearance and likelihood, should be incorporated into the algorithm in
addition to the current use of the candidate part's location.  Optimizing the
interest operator and the unsupervised clustering of parts is another area of
extension.
Another extension would be to build a model that is invariant to translation,
rotation, and scale, which would enable learning and recognition of images with
much larger viewpoint variations.

-----------------------------

Generic Visual Categorization (GVC) is the problem of identifying objects of
multiple classes in images. This paper presents a method for GVC by extending
the bag-of-keypoints approach of Csurka et al. (2004).

The main contribution of the paper is to provide a fast method for GVC
(computationally the fastest so far) using two types of vocabularies to represent
objects: a universal vocabulary and a class-specific adaptation of the
universal vocabulary. Instead of building one single universal vocabulary
by aggregation of class vocabularies (size C x N where C is number of categories
and N is class vocabulary size), the authors use a universal vocabulary
and a class-adapted vocabulary (size 2 x N), thus reducing the computational
complexity.
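
As a back-of-the-envelope illustration (the vocabulary size N below is assumed,
not taken from the paper), the aggregated vocabulary grows linearly with the
number of categories while the universal-plus-adapted representation stays fixed:

    # Representation size: aggregated class vocabularies vs. universal + adapted.
    C, N = 7, 64                                            # e.g. LAVA7's 7 categories, 64 Gaussians each
    print("aggregated vocabulary: ", C * N, "Gaussians")    # 448
    print("universal + adapted:   ", 2 * N, "Gaussians")    # 128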

The paper uses a Gaussian Mixture Model to represent a visual vocabulary.
The approach consists of two main steps. In the first step, the parameters of
a universal vocabulary are learnt from a training set of images of all
categories using Maximum Likelihood Estimation. The vocabulary parameters
are then adapted to a specific class, using images from that class and the
MAP criterion, to obtain an adapted vocabulary. The two vocabularies are then
combined. In the second step, m linear SVMs are learnt using the above
vocabularies and training images, one per class.
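
The snippet below is a minimal sketch of this two-step pipeline under several
assumptions: local descriptors are rows of NumPy arrays, sklearn's
diagonal-covariance GaussianMixture stands in for the maximum-likelihood
universal vocabulary, and MAP adaptation is reduced to a relevance-weighted
update of the component means only (the paper's adaptation is more complete).
Function names are mine.  An image is represented by its soft-occupancy
histogram over the universal and class-adapted components, and one linear SVM
per class is trained on these signatures.

    import numpy as np
    from sklearn.mixture import GaussianMixture
    from sklearn.svm import LinearSVC

    def map_adapt_means(gmm, class_desc, relevance=16.0):
        """MAP-style adaptation of the Gaussian means to one class (illustrative)."""
        resp = gmm.predict_proba(class_desc)                  # (n, K) responsibilities
        n_k = resp.sum(axis=0)                                # soft counts per component
        x_k = resp.T @ class_desc / np.maximum(n_k, 1e-8)[:, None]
        alpha = (n_k / (n_k + relevance))[:, None]
        return alpha * x_k + (1.0 - alpha) * gmm.means_

    def soft_histogram(desc, means, covs, weights):
        """Average posterior over diagonal-Gaussian components."""
        diff2 = (desc[:, None, :] - means[None, :, :]) ** 2
        log_p = -0.5 * (diff2 / covs + np.log(2 * np.pi * covs)).sum(axis=2)
        log_p += np.log(weights)
        log_p -= log_p.max(axis=1, keepdims=True)
        p = np.exp(log_p)
        return (p / p.sum(axis=1, keepdims=True)).mean(axis=0)

    def signature(gmm, adapted_means, desc):
        h_univ = soft_histogram(desc, gmm.means_, gmm.covariances_, gmm.weights_)
        h_adap = soft_histogram(desc, adapted_means, gmm.covariances_, gmm.weights_)
        return np.concatenate([h_univ, h_adap])               # length 2 * N

    # Training outline (one SVM per class on signatures built with that class's
    # adapted vocabulary; labels are +1 for the class, -1 otherwise):
    # gmm = GaussianMixture(64, covariance_type="diag").fit(all_training_descriptors)
    # adapted_c = map_adapt_means(gmm, descriptors_of_class_c)
    # svm_c = LinearSVC().fit(signatures_for_class_c, binary_labels_c)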

The experiments have been carried out on three large datasets, in-house
(8 scenes with multiple objects), LAVA7 (7 categories) and Wang (10 categories).
The results show an accuracy of 95.8% in 150 ms on LAVA7, which is the highest
accuracy and the fastest method so far on this database.

Limitations and some comments:

(1) The results show that the approach correctly categorizes the image. Was there
    only one object of the category set in an image? Did any image have objects of
    more than one category? If so, what percentage of the image area was occupied by
    the "main" category? Did the method work for occluded objects?

(2) I am a little wary of the speed comparison results because they depend a lot on
    how well the code is written and optimized, the programming language used, the
    machine used, etc. The speed comparison does not make sense unless there is
    some way of standardizing these factors.

(3) The experimental results for accuracy are convincing. The experiments on the
   in-house database are great because the test images were collected independently
   by a third party.

(4) Good results were obtained even with color features, not just SIFT features.

(5) Perhaps reducing the size of the SIFT vectors from 128 to 50 dimensions using
   PCA also had some impact on the speed? A quick sketch of this reduction follows.
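
The sketch below uses sklearn's PCA on stand-in data (the paper's exact
procedure may differ); lower-dimensional descriptors make every downstream
Gaussian evaluation cheaper.

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    sift = rng.random((10000, 128))              # stand-in for real SIFT descriptors
    reduced = PCA(n_components=50).fit_transform(sift)
    print(reduced.shape)                         # (10000, 50)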