
number of machine learning algorithms (e.g. [6, 7]). In some cases, domain knowledge can
be used to define a good embedding. For example, Torralba et al. [3] found that a
512-dimensional descriptor, known as the GIST descriptor, gives an embedding in which Euclidean
distance induces a reasonable similarity function on the items. But simply having a Euclidean
embedding does not give us a fast retrieval mechanism.
If we forget about the requirement of having a small number of bits in the codewords, then
it is easy to design a binary code so that items that are close in Euclidean space map
to similar binary codewords. This is the basis of the popular locality sensitive hashing
method E2LSH [8]. As shown in [8], if every bit in the code is calculated by a random linear
projection followed by a random threshold, then the Hamming distance between codewords
will asymptotically approach the Euclidean distance between the items. But in practice this
method can lead to very inefficient codes. Figure 1 illustrates the problem on a toy dataset
of points uniformly sampled in a two-dimensional rectangle. The figure plots the average
precision at Hamming distance 1 using an E2LSH encoding. As the number of bits increases
the precision improves (and approaches one with many bits), but the rate of convergence
can be very slow.
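To make the construction concrete, the following is a minimal sketch (in Python/NumPy) of such a random-projection encoder. The function name e2lsh_encode and the choice of drawing each threshold uniformly over the range of its projection are illustrative assumptions, not the exact construction of [8].

    import numpy as np

    def e2lsh_encode(X, n_bits, seed=0):
        # Each bit: a random linear projection followed by a random threshold,
        # in the spirit of the E2LSH-style construction described above.
        rng = np.random.default_rng(seed)
        n, d = X.shape
        W = rng.standard_normal((d, n_bits))                 # random projection directions
        proj = X @ W                                         # project all points
        t = rng.uniform(proj.min(axis=0), proj.max(axis=0))  # one random threshold per bit (an assumption)
        return (proj > t).astype(np.uint8)

    # Toy data: points sampled uniformly in a 2-D rectangle, as in Figure 1.
    X = np.random.default_rng(1).uniform(size=(1000, 2)) * [2.0, 1.0]
    codes = e2lsh_encode(X, n_bits=32)
    hamming_01 = int(np.count_nonzero(codes[0] != codes[1]))  # Hamming distance between items 0 and 1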
Rather than using random projections to define the bits in a code, several authors have
pursued machine learning approaches. In [5] the authors used an autoencoder with several
hidden layers. The architecture can be thought of as a restricted Boltzmann machine (RBM)
in which there are only connections between layers and not within layers. In order to learn 32
bits, the middle layer of the autoencoder has 32 hidden units, and noise was injected during
training to encourage these bits to be as binary as possible. This method indeed gives codes
that are much more compact than the E2LSH codes. In [9] multiple stacked RBMs were used
to learn a non-linear mapping between the input vector and the code bits. Backpropagation using
a Neighborhood Components Analysis (NCA) objective function was used to refine the
weights in the network so as to preserve the neighborhood structure of the input space. Figure 1
shows that the RBM gives much better performance than random bits. A simpler
machine learning algorithm (Boosting SSC) was pursued in [10], where AdaBoost was used to
classify a pair of input items as similar or nonsimilar. Each weak learner is a decision
stump, and the output of all the weak learners on a given input is a binary code. Figure 1
shows that this boosting procedure also works much better than E2LSH codes, although
slightly worse than the RBMs.^1
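As noted in footnote 1, all of these codes are compared in Figure 1 under the same semantic hashing retrieval, i.e. looking up every database item whose code lies within a small Hamming ball of the query code. A minimal sketch of such a radius-1 lookup is given below; the function names are ours and not taken from the cited papers.

    import numpy as np
    from collections import defaultdict

    def build_hash_table(codes):
        # Index database items by their binary code (semantic hashing).
        table = defaultdict(list)
        for i, c in enumerate(codes):
            table[c.tobytes()].append(i)
        return table

    def query_hamming_radius_1(table, code):
        # Return all database items whose code is within Hamming distance 1 of the query code.
        hits = list(table.get(code.tobytes(), []))
        for b in range(len(code)):
            flipped = code.copy()
            flipped[b] ^= 1                       # flip one bit and probe the table again
            hits.extend(table.get(flipped.tobytes(), []))
        return hits

Because the lookup touches only the query code and its single-bit perturbations, its cost depends on the code length rather than on the size of the database.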
The success of machine learning approaches over LSH is not limited to synthetic data. In [5],
RBMs gave several orders of magnitude improvement over LSH in document retrieval tasks.
In [3] both RBMs and Boosting were used to learn binary codes for a database of millions
of images and were found to outperform LSH. Also, retrieval using these short binary
codes was found to be significantly faster than retrieval with LSH (which in turn was faster
than other methods such as KD trees).
The success of machine learning methods leads us to ask: what is the best code for performing
semantic hashing for a given dataset? We formalize the requirements for a good code
and show that they are equivalent to a particular form of graph partitioning. This shows
that even for a single bit, the problem of finding optimal codes is NP hard. On the other
hand, the analogy to graph partitioning suggests a relaxed version of the problem that leads
to very efficient eigenvector solutions. These eigenvectors are exactly the eigenvectors used
in many spectral algorithms, including spectral clustering and Laplacian eigenmaps [6, 11].
This leads to a new algorithm, which we call “spectral hashing”, where the bits are calculated
by thresholding a subset of the eigenvectors of the Laplacian of the similarity graph. By utilizing
recent results on the convergence of graph Laplacian eigenvectors to the Laplace-Beltrami
eigenfunctions of manifolds, we show how to efficiently calculate the code of a novel datapoint.
Taken together, both learning the code and applying it to a novel point are extremely
simple. Our experiments show that our codes outperform the state of the art.
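As a preview, the following is a minimal sketch of the relaxed training step on a small dataset, assuming a Gaussian affinity graph, an unnormalized Laplacian, and thresholding at zero; the sigma parameter and these choices are illustrative, and the precise construction (including the out-of-sample extension via Laplace-Beltrami eigenfunctions) is described in the following sections.

    import numpy as np

    def spectral_hash_codes(X, n_bits, sigma=1.0):
        # Gaussian-affinity similarity graph over the training points (sigma is an assumed parameter).
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        W = np.exp(-sq / (2 * sigma ** 2))
        L = np.diag(W.sum(axis=1)) - W          # unnormalized graph Laplacian
        vals, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
        V = vecs[:, 1:n_bits + 1]               # skip the trivial constant eigenvector
        return (V > 0).astype(np.uint8)         # threshold each eigenvector to obtain one bit

    codes = spectral_hash_codes(np.random.rand(200, 2), n_bits=8)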
^1 All methods here use the same retrieval algorithm, i.e. semantic hashing. In many applications
of LSH and Boosting SSC, a different retrieval algorithm is used whereby the binary code
only creates a shortlist and exhaustive search is performed on the shortlist. Such an algorithm is
impractical for the scale of data we are considering.