
number of machine learning algorithms (e.g. [6, 7]). In some cases, domain knowledge can
be used to define a good embedding. For example, Torralba et al. [3] found that a
512-dimensional descriptor, known as the GIST descriptor, gives an embedding in which Euclidean
distance induces a reasonable similarity function on the items. But simply having a Euclidean
embedding does not give us a fast retrieval mechanism.
If we forget about the requirement of having a small number of bits in the codewords, then
it is easy to design a binary code so that items that are close in Euclidean space map
to similar binary codewords. This is the basis of the popular locality sensitive hashing
method E2LSH [8]. As shown in [8], if every bit in the code is calculated by a random linear
projection followed by a random threshold, then the Hamming distance between codewords
will asymptotically approach the Euclidean distance between the items. But in practice this
method can lead to very inefficient codes. Figure 1 illustrates the problem on a toy dataset
of points uniformly sampled in a two-dimensional rectangle. The figure plots the average
precision at Hamming distance 1 using an E2LSH encoding. As the number of bits increases
the precision improves (and approaches one with many bits), but the rate of convergence
can be very slow.
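To make the construction concrete, the following is a minimal sketch (in Python/NumPy) of such a random-projection encoder. The function name e2lsh_encode and the choice of drawing each threshold uniformly over the range of its projection are illustrative assumptions, not the exact construction of [8].

    import numpy as np

    def e2lsh_encode(X, n_bits, seed=0):
        # Each bit: a random linear projection followed by a random threshold,
        # in the spirit of the E2LSH-style construction described above.
        rng = np.random.default_rng(seed)
        n, d = X.shape
        W = rng.standard_normal((d, n_bits))                 # random projection directions
        proj = X @ W                                         # project all points
        t = rng.uniform(proj.min(axis=0), proj.max(axis=0))  # one random threshold per bit (an assumption)
        return (proj > t).astype(np.uint8)

    # Toy data: points sampled uniformly in a 2-D rectangle, as in Figure 1.
    X = np.random.default_rng(1).uniform(size=(1000, 2)) * [2.0, 1.0]
    codes = e2lsh_encode(X, n_bits=32)
    hamming_01 = int(np.count_nonzero(codes[0] != codes[1]))  # Hamming distance between items 0 and 1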
Rather than using random projections to define the bits in a code, several authors have
pursued machine learning approaches. In [5] the authors used an autoencoder with several
hidden layers. The architecture can be thought of as a restricted Boltzmann machine (RBM)
in which there are only connections between layers and not within layers. In order to learn 32
bits, the middle layer of the autoencoder has 32 hidden units, and noise was injected during
training to encourage these bits to be as binary as possible. This method indeed gives codes
that are much more compact than the E2LSH codes. In [9] multiple stacked RBMs were used
to learn a non-linear mapping between the input vector and the code bits. Backpropagation using
a Neighborhood Components Analysis (NCA) objective function was used to refine the
weights in the network so as to preserve the neighborhood structure of the input space. Figure 1
shows that the RBM gives much better performance than random bits. A simpler
machine learning algorithm (Boosting SSC) was pursued in [10], where AdaBoost was used to
classify a pair of input items as similar or nonsimilar. Each weak learner is a decision
stump, and the output of all the weak learners on a given input is a binary code. Figure 1
shows that this boosting procedure also works much better than E2LSH codes, although
slightly worse than the RBMs.^1
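As noted in footnote 1, all of these codes are compared in Figure 1 under the same semantic hashing retrieval, i.e. looking up every database item whose code lies within a small Hamming ball of the query code. A minimal sketch of such a radius-1 lookup is given below; the function names are ours and not taken from the cited papers.

    import numpy as np
    from collections import defaultdict

    def build_hash_table(codes):
        # Index database items by their binary code (semantic hashing).
        table = defaultdict(list)
        for i, c in enumerate(codes):
            table[c.tobytes()].append(i)
        return table

    def query_hamming_radius_1(table, code):
        # Return all database items whose code is within Hamming distance 1 of the query code.
        hits = list(table.get(code.tobytes(), []))
        for b in range(len(code)):
            flipped = code.copy()
            flipped[b] ^= 1                       # flip one bit and probe the table again
            hits.extend(table.get(flipped.tobytes(), []))
        return hits

Because the lookup touches only the query code and its single-bit perturbations, its cost depends on the code length rather than on the size of the database.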
The success of machine learning approaches over LSH is not limited to synthetic data. In [5],
RBMs gave several orders of magnitude improvement over LSH in document retrieval tasks.
In [3] both RBMs and Boosting were used to learn binary codes for a database of millions
of images and were found to outperform LSH. Also, retrieval using these short binary
codes was found to be significantly faster than retrieval with LSH (which in turn was faster
than other methods such as KD trees).
The success of machine learning methods leads us to ask: what is the best code for performing
semantic hashing for a given dataset? We formalize the requirements for a good code
and show that they are equivalent to a particular form of graph partitioning. This shows
that even for a single bit, the problem of finding optimal codes is NP hard. On the other
hand, the analogy to graph partitioning suggests a relaxed version of the problem that leads
to very efficient eigenvector solutions. These eigenvectors are exactly the eigenvectors used
in many spectral algorithms, including spectral clustering and Laplacian eigenmaps [6, 11].
This leads to a new algorithm, which we call “spectral hashing”, where the bits are calculated
by thresholding a subset of the eigenvectors of the Laplacian of the similarity graph. By utilizing
recent results on the convergence of graph Laplacian eigenvectors to the Laplace-Beltrami
eigenfunctions of manifolds, we show how to efficiently calculate the code of a novel datapoint.
Taken together, both learning the code and applying it to a novel point are extremely
simple. Our experiments show that our codes outperform the state of the art.
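As a preview, the following is a minimal sketch of the relaxed training step on a small dataset, assuming a Gaussian affinity graph, an unnormalized Laplacian, and thresholding at zero; the sigma parameter and these choices are illustrative, and the precise construction (including the out-of-sample extension via Laplace-Beltrami eigenfunctions) is described in the following sections.

    import numpy as np

    def spectral_hash_codes(X, n_bits, sigma=1.0):
        # Gaussian-affinity similarity graph over the training points (sigma is an assumed parameter).
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        W = np.exp(-sq / (2 * sigma ** 2))
        L = np.diag(W.sum(axis=1)) - W          # unnormalized graph Laplacian
        vals, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
        V = vecs[:, 1:n_bits + 1]               # skip the trivial constant eigenvector
        return (V > 0).astype(np.uint8)         # threshold each eigenvector to obtain one bit

    codes = spectral_hash_codes(np.random.rand(200, 2), n_bits=8)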
^1 All methods here use the same retrieval algorithm, i.e. semantic hashing. In many applications
of LSH and Boosting SSC, a different retrieval algorithm is used whereby the binary code
only creates a shortlist and exhaustive search is performed on the shortlist. Such an algorithm is
impractical for the scale of data we are considering.