Abstract

Deep embeddings answer one simple question: How similar are two images? Learning these embeddings is the bedrock of verification, zero-shot learning, and visual search. The most prominent approaches optimize a deep convolutional network with a suitable loss function, such as contrastive loss or triplet loss. While a rich line of work focuses solely on the loss functions, we show in this paper that selecting training examples plays an equally important role. We propose distance weighted sampling, which selects more informative and stable examples than traditional approaches. In addition, we show that a simple margin based loss is sufficient to outperform all other loss functions. We evaluate our approach on the Stanford Online Products, CARS196, and the CUB200-2011 datasets for image retrieval and clustering, and on the LFW dataset for face verification. Our method achieves state-of-the-art performance on all of them.
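
To make the sampling idea concrete, below is a minimal NumPy sketch (not the released implementation): a negative example is drawn with probability proportional to min(lambda, 1/q(d)), where q(d) is proportional to d^(n-2) * (1 - d^2/4)^((n-3)/2), the density of pairwise distances between points uniformly distributed on the unit sphere. The function name, the distance cutoff of 0.5, and the value of the cap lambda are illustrative choices; only the weighting scheme itself comes from the paper.

import numpy as np

def distance_weighted_sample(anchor, negatives, dim, cutoff=0.5, lam=10.0, rng=None):
    """Draw the index of one negative for `anchor`.

    Embeddings are assumed L2-normalized, so pairwise distances lie in
    [0, 2]. Candidates are weighted by min(lam, 1/q(d)), which flattens
    the sharply peaked distance distribution that uniform sampling
    over negatives would otherwise follow.
    """
    rng = rng or np.random.default_rng()
    d = np.linalg.norm(negatives - anchor, axis=1)
    d = np.maximum(d, cutoff)  # clip small distances for stability
    # log q(d) up to an additive constant; log space is needed because
    # d^(dim-2) under/overflows for realistic embedding dimensions.
    log_q = (dim - 2.0) * np.log(d) \
        + (dim - 3.0) / 2.0 * np.log(np.clip(1.0 - 0.25 * d**2, 1e-8, None))
    # weight ∝ min(lam, 1/q(d)); apply the cap in log space to avoid overflow
    log_w = np.minimum(np.log(lam), -(log_q - log_q.max()))
    w = np.exp(log_w)
    return rng.choice(len(negatives), p=w / w.sum())

Since q is only known up to a constant, the weights are normalized against the largest density in the candidate set before capping; any concrete implementation must make a similar scale choice.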

Code

Code is available in this MXNet example. Note that in this implementation, sampling is performed within each GPU for speed and simplicity, whereas the original paper performs cross-GPU sampling. The two produce similar results.

Paper

@inproceedings{wu2017sampling,
  title={Sampling Matters in Deep Embedding Learning},
  author={Wu, Chao-Yuan and Manmatha, R and Smola, Alexander J and Kr{\"a}henb{\"u}hl, Philipp},
  booktitle={ICCV},
  year={2017},
}

Results

The proposed distance weighted sampling and margin based loss outperform the popular triplet loss with semi-hard sampling (the figure shows Recall@1 on the Stanford Online Products dataset).
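
For reference, the margin based loss on a pair (i, j) is (alpha + y_ij * (D_ij - beta))_+, where y_ij is 1 for a positive pair and -1 for a negative pair, alpha is a fixed margin, and beta is a boundary between positive and negative pairs that the paper learns. Below is a minimal NumPy sketch, with beta held fixed for simplicity and alpha = 0.2, beta = 1.2 as illustrative defaults:

import numpy as np

def margin_loss(d_pos, d_neg, beta=1.2, alpha=0.2):
    """Margin based loss on a batch of sampled pairs.

    d_pos: distances D_ij for positive pairs (y_ij = +1)
    d_neg: distances D_ij for negative pairs (y_ij = -1)
    Positives are penalized for lying outside beta - alpha, negatives
    for lying inside beta + alpha. Unlike the triplet loss, which only
    constrains relative distances within a triplet, the absolute
    boundary beta decouples the two terms.
    """
    pos = np.maximum(0.0, alpha + d_pos - beta)
    neg = np.maximum(0.0, alpha - (d_neg - beta))
    # averaging over pairs with non-zero loss is one common choice
    active = np.count_nonzero(pos) + np.count_nonzero(neg)
    return (pos.sum() + neg.sum()) / max(active, 1)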

We evaluate our approach on the Stanford Online Products, CARS196, and the CUB200-2011 datasets for image retrieval and clustering, and on the LFW dataset for face verification (see the paper for complete results). Our method achieves state-of-the-art performance on all of them.

Learned embeddings

The following figure visualizes the learned embeddings on the CUB200-2011 dataset with t-SNE.
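
A visualization of this kind can be reproduced with off-the-shelf t-SNE. The sketch below uses scikit-learn; the random vectors are placeholders standing in for the trained network's embeddings of the CUB200-2011 test images.

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholder embeddings: in practice, run the trained network over the
# CUB200-2011 test images and L2-normalize the outputs.
embeddings = np.random.randn(1000, 128)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# Project the 128-D embeddings to 2-D for plotting.
coords = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], s=4)
plt.show()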