Hierarchical Consensus Hashing for Cross-Modal Retrieval


Cross-modal hashing is a technique that has attracted much attention in recent years, mainly for enabling efficient retrieval across data of different modalities. It maps data of different modalities (such as text, images, and audio) into low-dimensional hash codes so that fast similarity search becomes possible. Most current methods do not sufficiently consider the hierarchical structure information of the data, and often map cross-modal data into common low-dimensional hash codes in one step through a single-layer hash function. Although simple and fast, this causes an abrupt drop in dimension and a huge semantic gap, which leads to the loss of discriminative information.
To address these problems, this paper proposes a new hashing technique called Hierarchical Consensus Cross-Modal Hashing (HCCH). HCCH adopts a coarse-to-fine progressive mechanism and proposes a hierarchical hashing scheme that utilizes a two-layer hash function: a coarser hash function first filters out redundant and corrupted features, and a finer hash function then projects the data pairs into the Hamming space in a progressive manner. To encode data into a consensus space more effectively, the paper also introduces consensus learning, which gradually reduces the semantic gap.
In the experimental section, the researchers extensively compare HCCH with other advanced cross-modal hashing methods on four benchmark datasets. The results demonstrate the effectiveness and efficiency of the proposed HCCH method on these datasets, suggesting strong potential in practical applications: in cross-modal retrieval tasks, it can retrieve instances that are semantically related to the query set. Cross-modal hashing is also attracting growing attention in a variety of real-world applications, with promising prospects in areas such as image search, video annotation, and multimedia retrieval.
The paper also lists keywords such as consensus learning and cross-modal retrieval, which reflect its core content and application areas. Through the comparison figure in the paper, we can intuitively see the differences between cross-modal hashing methods; HCCH shows clear advantages in reducing the semantic gap and preserving discriminative information.
In summary, hierarchical consensus hashing, through its hierarchical hashing scheme and consensus learning, effectively addresses the semantic gap and the loss of discriminative information encountered in cross-modal data retrieval, further advancing the development of cross-modal hashing in real-world applications.

824 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 26, 2024
Hierarchical Consensus Hashing for
Cross-Modal Retrieval
Yuan Sun, Zhenwen Ren, Peng Hu, Dezhong Peng, and Xu Wang
Abstract—Cross-modal hashing (CMH) has gained much
attention due to its effectiveness and efficiency in facilitating
efficient retrieval between different modalities. However, most
existing methods ignore the hierarchical structural
information of the data, and often learn a single-layer hash
function to directly transform cross-modal data into common
low-dimensional hash codes in one step. This sudden drop of
dimension and the huge semantic gap can cause discriminative
information loss. To this end, we adopt a coarse-to-fine progressive
mechanism and propose a novel Hierarchical Consensus Cross-
Modal Hashing (HCCH). Specifically, to mitigate the loss of
important discriminative information, we propose a coarse-to-fine
hierarchical hashing scheme that utilizes a two-layer hash function
to refine the beneficial discriminative information gradually.
Then, the ℓ2,1-norm is imposed on the layer-wise hash function to
alleviate the effects of redundant and corrupted features. Finally,
we present consensus learning to effectively encode data into
a consensus space in such a progressive way, thereby reducing
the semantic gap progressively. Through extensive comparison
experiments with several advanced CMH methods, the effectiveness
and efficiency of our HCCH method are demonstrated on four
benchmark datasets.
Index Terms—Consensus learning, cross-modal retrieval,
hierarchical hashing, learning to hash.
I. INTRODUCTION
IN RECENT years, with the ubiquity of multimedia equipment, multimedia data is becoming more accessible. Cross-modal retrieval [1], [2], [3] is in urgent demand and has
Manuscript received 16 December 2022; revised 26 March 2023; accepted
25 April 2023. Date of publication 5 May 2023; date of current version 18 Jan-
uary 2024. The work was supported in part by the National Natural Science
Foundation of China under Grants U19A2078, 62102274, and 62106209, in
part by the National Defense Science and Technology through Base Strength-
ening Program under Grant 2022-JCJQ-JJ-0292, in part by Sichuan Science
and Technology Planning Project under Grants 2022YFQ0014, 2021YFG0301,
2023ZHCG0016, and 2023YFG0033, in part by the Fundamental Research
Funds for the Central Universities under Grant 2022SCU12081, and in part by
Sichuan University Postdoctoral Interdisciplinary Innovation Fund under Grant
JCXK2234. The Associate Editor coordinating the review of this manuscript and
approving it for publication was Dr. Wengang Zhou. (Corresponding author: Xu
Wang.)
Yuan Sun, Peng Hu, and Xu Wang are with the College of Computer Science,
Zhenwen Ren is with the Department of National Defence Science and Tech-
nology, Southwest University of Science and Technology, Mianyang 621010,
Dezhong Peng is with the College of Computer Science, Sichuan Univer-
sity, Chengdu 610044, China, and also with the Sichuan Zhiqian Technology
We have released the source code at https://github.com/sunyuan-cs.
Digital Object Identifier 10.1109/TMM.2023.3272169
Fig. 1. Comparison of different cross-modal hashing methods. (a) Existing
cross-modal hashing directly projects a data pair into Hamming space by a
single-layer hash function W. (b) Our HCCH proposes a coarse-to-fine hierarchical hashing scheme that utilizes a two-layer hash function PR to project
a data pair into Hamming space gradually. Different shapes and colors
represent different classes.
attracted more and more attention in a variety of real-world
applications [4], [5], [6]. It aims at retrieving semantically related instances from different modalities for the query set, such as adopting text to search for a similar image or adopting an image to search for similar text. Cross-modal Hashing (CMH) [7],
[8] can well deal with such cross-modal retrieval tasks, when
facing large-scale data since binary hash codes can reduce stor-
age space cost and significantly speed up the retrieval process
through XOR operations. Therefore, CMH has attracted a lot of
attention for scalable cross-modal search [9], [10].
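The XOR-based speedup mentioned above can be illustrated with a short sketch (not from the paper; a minimal NumPy example with illustrative names, using bit-packed codes):

```python
import numpy as np

def hamming_distances(codes_db, query):
    """Hamming distance between one query code and each database code.
    Codes are bit-packed into uint8 arrays: one XOR plus a popcount per pair."""
    xor = np.bitwise_xor(codes_db, query)          # differing bits
    return np.unpackbits(xor, axis=1).sum(axis=1)  # popcount per row

# toy example: two 8-bit database codes and one query
db = np.array([[0b10110010], [0b10110011]], dtype=np.uint8)
q = np.array([0b10110010], dtype=np.uint8)
print(hamming_distances(db, q))  # [0 1]
```

Because the distance computation is a bitwise XOR followed by a bit count, retrieval over millions of codes costs only integer operations and a small fraction of the memory of real-valued features.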
Since hashing learning has low space cost, fast search speed,
and competitive accuracy, a large number of CMH methods [11],
[12] have been proposed in recent years. Hashing learning [13]
aims to project original data into Hamming space, while pre-
serving semantic similarity from original space. The pioneering
hashing methods can usually be divided into two types: unsupervised hashing and supervised hashing. The former mainly explores the semantic correlations from different modalities without label information, thereby learning binary codes. The latter mainly utilizes the label information to guide the generation of accurate hash codes. Compared with unsupervised meth-
ods, supervised CMH usually has better performance due to the
embrace of some prior semantic knowledge.
Although supervised CMH methods have obtained promis-
ing performance, there are still some tough challenges to be
further addressed. First, the previous CMH methods invariably
tend to learn binary codes using a single-layer hash function (as
shown in Fig. 1). Nevertheless, few methods learn hash codes
in multi-level space to capture hierarchical relation of features.
1520-9210 © 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Shandong Normal University. Downloaded on November 14,2024 at 03:21:07 UTC from IEEE Xplore. Restrictions apply.

Fig. 2. Overview of the hierarchical consensus cross-modal hashing framework (HCCH). Our HCCH projects the text-image pair into specific hash codes by a two-layer
hash function, in which hierarchical hashing mapping is performed in different layers to capture multi-level structural information. Then, a consensus learning
scheme is proposed that learns consistent hash codes to reduce the distribution difference between text and image hash codes.
Second, the existing CMH methods often directly project high-
dimensional instance pairs into low-dimensional hash codes by
a single-layer hash function. The sudden one-step dimension
reduction can result in the loss of certain important data proper-
ties or discriminative information. Third, to reduce the challeng-
ing heterogeneity gap of multiple modalities, one widely used
scheme is to map original multi-modal data into a shared Ham-
ming space. Clearly, this scheme ignores some latent beneficial
specific knowledge, resulting in poor performance.
To overcome these problems and improve the retrieval per-
formance, we propose a novel Hierarchical Consensus Cross-
modal Hashing (HCCH). As shown in Fig. 2, we propose a
hierarchical hashing strategy with a coarse-to-fine architecture
to refine discrimination feature information. That is, we use a
two-layer hash function with the ℓ2,1-norm constraint to learn
specific hash codes from different modalities gradually, thereby
obtaining a hierarchical low-level semantic structure. Further,
we learn consistent hash codes to relieve the heterogeneity gap.
Overall, the main contributions of this article are as follows:
• We propose an elegant HCCH for cross-modal retrieval. To the best of our knowledge, we are among the first to propose a coarse-to-fine hierarchical hashing scheme to sufficiently exploit feature hierarchy information from different modalities.
• To mitigate the discriminative information loss caused by a considerable drop in dimension, we propose a hierarchical learning paradigm that adopts two-layer hashing mapping to gradually gather the relatively important information and further learn specific hash codes.
• To relieve the heterogeneity gap, we present consensus learning to fully exploit the private properties of each modality and the shared semantics across different modalities.
• To efficiently solve this binary learning problem, we develop an iterative optimization algorithm. We conduct numerous experiments on four public datasets, and the experimental results show that our HCCH outperforms state-of-the-art comparison methods in retrieval performance.
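As a rough sketch of the two-layer idea (the actual hash functions are learned by HCCH's optimization with the ℓ2,1-norm and consensus constraints; here the projection matrices are random and all sizes are illustrative, shown only to contrast the coarse-to-fine mapping with a one-step projection):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, m, l = 6, 100, 32, 16          # samples, kernel dim, intermediate dim, code bits

X = rng.standard_normal((n, k))      # kernelized features of one modality
P = rng.standard_normal((k, m))      # first (coarse) layer: k -> m
R = rng.standard_normal((m, l))      # second (fine) layer: m -> l

B = np.sign(X @ P @ R)               # two-layer hashing instead of one k -> l jump
print(B.shape)                       # (6, 16), entries in {-1, +1}
```

The intermediate dimension m cushions the drop from k to l, which is the structural point the contribution list makes.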
The next sections of our article are organized as follows. In
Section II, we review some existing CMH methods. In Sec-
tion III, we first give our insight and motivation, and then de-
scribe our HCCH framework, optimization scheme, and some
detailed analysis. Moreover, we perform various comparison ex-
periments and some analysis in Section IV. Finally, the conclu-
sion is given in Section V.
II. RELATED WORK
In this section, we introduce some existing supervised and
unsupervised CMH methods briefly.
A. Unsupervised Cross-Modal Hashing
Unsupervised CMH [14], [15] usually does not utilize su-
pervised information, but explores intrinsic similarities directly
from heterogeneous data to learn hash codes. Fusion similar-
ity hashing (FSH) [16] explicitly embeds a fusion similarity
graph into the shared hash codes. To enhance robustness, ro-
bust and flexible discrete hashing (RFDH) [17] adopts the ℓ2,1-norm
constraint to learn robust binary codes. Afterwards, to capture
semantic similarities, adaptive structural similarity preserva-
tion hashing (ASSPH) [14] adaptively learns semantic corre-
lation through asymmetric semantic preserving and the correla-
tion constraints. And deep graph-neighbor coherence preserv-
ing network (DGCPN) [18] simultaneously utilizes the similar-
ity between samples and that between neighbors of the sam-
ples to represent semantic correlation. To mitigate the effects of
false-negative pairs, unsupervised contrastive cross-modal hashing (UCCH) [5] introduces contrastive learning to learn shared hash codes. However, these methods cannot further enhance retrieval performance due to the absence of supervised information.
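The ℓ2,1-norm used by RFDH above sums the ℓ2 norms of a matrix's rows; penalizing it drives whole rows toward zero, which is what suppresses redundant or corrupted feature dimensions. A minimal sketch:

```python
import numpy as np

def l21_norm(W):
    """||W||_{2,1}: sum of the l2 norms of the rows of W."""
    return np.linalg.norm(W, axis=1).sum()

W = np.array([[3.0, 4.0],    # row norm 5
              [0.0, 0.0]])   # zeroed-out row contributes nothing
print(l21_norm(W))  # 5.0
```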
B. Supervised Cross-Modal Hashing
Supervised CMH [19], [20] usually utilizes supervised
information to guide the generation of compact hash codes.
Some label-based hashing methods have been proposed. For

example, label consistent matrix factorization hashing
(LCMFH) [21] adopts label information to guide the gen-
eration of hash codes. To make hash codes from the same class
closer, subspace relation learning for cross-modal hashing
(SRLCH) [22] further proposes to exploit relation information
of labels. However, these label-based methods have fixed
semantic margins and large quantization errors. Thus,
fast discriminative discrete hashing (FDDH) [23] is proposed
that utilizes the ξ-dragging on labels to provide large label
margins. Considering the different dimensionality of image-text pairs, matrix tri-factorization hashing [24] jointly learns the specific hash codes with different length settings and then correlates the semantic consistency by learning two semantic correlation matrices. Scalable discrete matrix factorization
and semantic autoencoder (SDMSA) [25] adopts a two-step
strategy to learn hash codes and the hash function, respectively.
Enhanced discrete multi-modal hashing (EDMH) [26] learns
hash codes by hash balance and de-correlation constraints.
In addition, some asymmetric-based hashing methods have
been also proposed. For example, discrete latent factor hash-
ing (DLFH) [27] uses a maximum likelihood loss function
to measure similarity. To sufficiently explore the intrinsic
correlation between different modalities, asymmetric supervised consistent and specific hashing (ASCSH) [28] utilizes asymmetric semantic similarity to learn the consistent and
specific hash codes. Moreover, scalable asymmetric discrete
cross-modal hashing (BATCH) [29] puts forward collective
matrix factorization and distance-distance difference mini-
mization to learn hash codes. Then, fast cross-modal hashing (FCMH) [30] simultaneously utilizes global and local similarity information. To exploit the high-order semantic
label correlations, adaptive label correlation based asymmetric
cross-modal hashing (ALECH) [31] is proposed that learns
the semantic relationships among all labels without any prior
knowledge. To alleviate the label noises, WASH [32] adopts
low-rank factorization on the noise-reduced labels to learn hash
codes. The existing cross-modal hashing often ignores the label
correlations. Whereupon, SHDCH [33] considers that the multi-label contains much discriminative information and further addresses the problem of the hierarchical label structure. However, these methods largely ignore the hierarchical semantics of data pairs and the serious information loss caused by one-step dimension reduction.
III. PROPOSED METHOD
This section gives the proposed HCCH, including the moti-
vation, formulation, optimization algorithm, convergence anal-
ysis, computational complexity, and out-of-sample extension. Although our method can be easily extended to multiple modalities, in this article we only consider data with two modalities, i.e., text and image.
A. Problem Definition
In this paper, uppercase bold characters and lowercase bold characters represent matrices and vectors, respectively. Assume that $\mathbf{U}^t \in \mathbb{R}^{n \times d_t}$ is the matrix of training instances of the $t$-th modality, where $d_t$ is the feature dimension and $n$ is the number of samples. The corresponding ground-truth labels are denoted as $\mathbf{Y} \in \{0, 1\}^{n \times c}$, where $c$ is the number of shared classes; $Y_{ij} = 1$ if the $j$-th instance pair belongs to the $i$-th class and $Y_{ij} = 0$ otherwise. $\mathbf{B} \in \{-1, 1\}^{n \times l}$ denotes the $l$-bit hash codes.
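The label matrix Y can be built as a one-hot encoding per instance (a small illustrative sketch with made-up labels; multi-label data would simply set several entries per row):

```python
import numpy as np

labels = [0, 2, 1, 0]                 # class index of each instance, c = 3 classes
n, c = len(labels), 3
Y = np.zeros((n, c), dtype=int)       # Y in {0, 1}^{n x c}
Y[np.arange(n), labels] = 1
print(Y.tolist())  # [[1, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
```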
It is well known that the kernel trick can better express nonlinearly separable data correlations, such as via RBF kernel mapping. For each modality, the kernelized features of each data pair can be denoted as $\phi(\mathbf{u}^t)$,

$$\phi(\mathbf{u}^t) = \left[ \exp\!\left( -\frac{\|\mathbf{u}^t - \mathbf{a}^t_1\|_2^2}{2\sigma^2} \right), \ldots, \exp\!\left( -\frac{\|\mathbf{u}^t - \mathbf{a}^t_{k_t}\|_2^2}{2\sigma^2} \right) \right] \qquad (1)$$

where $\sigma$ is the kernel width and $\mathbf{a}^t_i$ denotes the $i$-th of the $k_t$ randomly chosen anchors. To simplify the presentation, we denote $\mathbf{X}^t$ to represent $\phi(\mathbf{U}^t)$.
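Equation (1) can be sketched as follows, with anchors drawn at random from the training data as the text describes (the data sizes and kernel width here are illustrative, not the paper's settings):

```python
import numpy as np

def rbf_features(U, anchors, sigma):
    """Kernelized features as in (1): one RBF similarity per anchor.
    U: (n, d) raw features; anchors: (k, d); returns (n, k) values in (0, 1]."""
    sq = ((U[:, None, :] - anchors[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
U = rng.standard_normal((5, 8))
anchors = U[rng.choice(5, size=3, replace=False)]  # randomly chosen anchors
X = rbf_features(U, anchors, sigma=1.0)
print(X.shape)  # (5, 3)
```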
B. Motivation
Since bit de-correlation [26] (i.e., the orthogonal constraint) has been proven to improve the discriminability of hash codes, it is widely used in learning to hash. Without loss of generality, singular value decomposition (SVD) needs to be performed to solve for the hash codes, and we conduct SVD to show the eigenvalue distribution. The larger the eigenvalue, the more discriminative information its eigenvector carries. We thus hope that the $l$-bit binary codes can preserve most of the discriminative information by choosing the eigenvectors corresponding to the top-$l$ eigenvalues. However, prior CMH methods usually learn discrete codes using a single-layer hash function. In other words, they directly transform high-dimensional image-text data of size $\mathbb{R}^{n \times k_t}$ into low-dimensional binary codes of size $\mathbb{R}^{n \times l}$. The multi-level semantics contained in the original kernel features are difficult to extract by single-layer hashing [34]. Moreover, this dimensional plunge easily leads to the loss of discriminant information.
To reveal the above insight, we show the eigenvalue distribution of the image modality of the MIRFlickr dataset, sorted in descending order, in Fig. 3. In Fig. 3(a), the one-layer hash function projects image data from 1000 dimensions to 64 dimensions. We can observe that the hash codes keep only 57.09% of the kernel information. We wondered whether there could be a way to preserve more discriminative information and capture multi-level semantics. Motivated by the hierarchical architecture [35], to effectively avoid the abrupt drop of dimension and alleviate the discriminative information loss, we consider a coarse-to-fine hierarchical hashing scheme that learns a multi-level hash function in a progressive way. More specifically, we employ multiple intermediary matrices to extract the discriminative information from kernel features. Note that here we only focus on the two-layer hash function. As shown in Fig. 3(b), the two-layer hash can preserve more discriminative information (i.e., 60.80% > 57.09%). Therefore, the above analysis shows that our two-layer hashing idea can preserve more of the kernel-feature information, thereby reducing the information loss.
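Numbers in the style of the 57.09% figure can be reproduced in spirit as the cumulative eigenvalue energy kept by the top-l eigen-directions (a sketch on random data, not the MIRFlickr kernel features; sizes are illustrative):

```python
import numpy as np

def preserved_ratio(X, l):
    """Fraction of the eigenvalue energy of X^T X kept by the top-l directions."""
    s = np.linalg.svd(X, compute_uv=False)  # singular values, descending
    eig = s ** 2                            # eigenvalues of X^T X
    return eig[:l].sum() / eig.sum()

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 100))         # stand-in for the kernelized features
print(round(preserved_ratio(X, 64), 4))     # energy fraction kept by 64 bits
```

Comparing this ratio before and after an intermediate projection is how one would check, on real features, whether a two-layer mapping retains more energy than a one-step drop.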