Hierarchical Consensus Hashing for Cross-Modal Retrieval


Cross-modal hashing is a technique that has attracted much attention in recent years, mainly for enabling efficient retrieval across data of different modalities. It maps data of different modalities (such as text, images, and audio) into low-dimensional hash codes so that fast similarity search becomes possible. Most current methods do not sufficiently consider the hierarchical structure information of the data, and often map cross-modal data into common low-dimensional hash codes in one step through a single-layer hash function. Although simple and fast, this causes an abrupt drop in dimension and a huge semantic gap, which leads to the loss of discriminative information.
To address these problems, this paper proposes a new hashing technique called Hierarchical Consensus Cross-Modal Hashing (HCCH). HCCH adopts a coarse-to-fine progressive mechanism and proposes a hierarchical hashing scheme that utilizes a two-layer hash function: a coarser hash function first filters out redundant and corrupted features, and a finer hash function then projects the data pairs into the Hamming space in a progressive manner. To encode data into a consensus space more effectively, the paper also introduces consensus learning, which gradually reduces the semantic gap.
In the experimental section, the researchers extensively compare HCCH with other advanced cross-modal hashing methods on four benchmark datasets. The results demonstrate the effectiveness and efficiency of the proposed HCCH method on these datasets, suggesting strong potential in practical applications: in cross-modal retrieval tasks, it can retrieve instances that are semantically related to the query set. Cross-modal hashing is also attracting growing attention in a variety of real-world applications, with promising prospects in areas such as image search, video annotation, and multimedia retrieval.
The paper also lists keywords such as consensus learning and cross-modal retrieval, which reflect its core content and application areas. Through the comparison figure in the paper, we can intuitively see the differences between cross-modal hashing methods; HCCH shows clear advantages in reducing the semantic gap and preserving discriminative information.
In summary, hierarchical consensus hashing, through its hierarchical hashing scheme and consensus learning, effectively addresses the semantic gap and the loss of discriminative information encountered in cross-modal data retrieval, further advancing the development of cross-modal hashing in real-world applications.

824 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 26, 2024
Hierarchical Consensus Hashing for
Cross-Modal Retrieval
Yuan Sun, Zhenwen Ren, Peng Hu, Dezhong Peng, and Xu Wang
Abstract—Cross-modal hashing (CMH) has gained much
attention due to its effectiveness and efficiency in facilitating
efficient retrieval between different modalities. However, most
existing methods ignore the hierarchical structural
information of the data, and often learn a single-layer hash
function to directly transform cross-modal data into common
low-dimensional hash codes in one step. This sudden drop of
dimension and the huge semantic gap can cause discriminative
information loss. To this end, we adopt a coarse-to-fine progressive
mechanism and propose a novel Hierarchical Consensus Cross-
Modal Hashing (HCCH). Specifically, to mitigate the loss of
important discriminative information, we propose a coarse-to-fine
hierarchical hashing scheme that utilizes a two-layer hash function
to refine the beneficial discriminative information gradually.
Then, the ℓ2,1-norm is imposed on the layer-wise hash function to
alleviate the effects of redundant and corrupted features. Finally,
we present consensus learning to effectively encode data into
a consensus space in such a progressive way, thereby reducing
the semantic gap progressively. Through extensive comparison
experiments with several advanced CMH methods, the effectiveness
and efficiency of our HCCH method are demonstrated on four
benchmark datasets.
Index Terms—Consensus learning, cross-modal retrieval,
hierarchical hashing, learning to hash.
I. INTRODUCTION
IN RECENT years, with the ubiquity of multimedia equipment, multimedia data is becoming more accessible. Cross-modal retrieval [1], [2], [3] is in urgent demand and has
Manuscript received 16 December 2022; revised 26 March 2023; accepted
25 April 2023. Date of publication 5 May 2023; date of current version 18 Jan-
uary 2024. The work was supported in part by the National Natural Science
Foundation of China under Grants U19A2078, 62102274, and 62106209, in
part by the National Defense Science and Technology through Base Strength-
ening Program under Grant 2022-JCJQ-JJ-0292, in part by Sichuan Science
and Technology Planning Project under Grants 2022YFQ0014, 2021YFG0301,
2023ZHCG0016, and 2023YFG0033, in part by the Fundamental Research
Funds for the Central Universities under Grant 2022SCU12081, and in part by
Sichuan University Postdoctoral Interdisciplinary Innovation Fund under Grant
JCXK2234. The Associate Editor coordinating the review of this manuscript and
approving it for publication was Dr. Wengang Zhou. (Corresponding author: Xu
Wang.)
Yuan Sun, Peng Hu, and Xu Wang are with the College of Computer Science,
Zhenwen Ren is with the Department of National Defence Science and Tech-
nology, Southwest University of Science and Technology, Mianyang 621010,
Dezhong Peng is with the College of Computer Science, Sichuan Univer-
sity, Chengdu 610044, China, and also with the Sichuan Zhiqian Technology
We have released the source code at https://github.com/sunyuan-cs.
Digital Object Identifier 10.1109/TMM.2023.3272169
Fig. 1. Comparison of different cross-modal hashing methods. (a) Existing
cross-modal hashing directly projects a data pair into Hamming space by a
single-layer hash function W. (b) Our HCCH proposes a coarse-to-fine hierarchical hashing scheme that utilizes a two-layer hash function PR to project
a data pair into Hamming space gradually. Different shapes and colors
represent different classes.
attracted more and more attention in a variety of real-world
applications [4], [5], [6]. It aims at retrieving semantically related instances from different modalities for the query set, such as adopting text to search for a similar image or adopting an image to search for similar text. Cross-modal Hashing (CMH) [7],
[8] can well deal with such cross-modal retrieval tasks, when
facing large-scale data since binary hash codes can reduce stor-
age space cost and significantly speed up the retrieval process
through XOR operations. Therefore, CMH has attracted a lot of
attention for scalable cross-modal search [9], [10].
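The XOR-based speedup mentioned above can be illustrated with a short sketch (not from the paper; a minimal NumPy example with illustrative names, using bit-packed codes):

```python
import numpy as np

def hamming_distances(codes_db, query):
    """Hamming distance between one query code and each database code.
    Codes are bit-packed into uint8 arrays: one XOR plus a popcount per pair."""
    xor = np.bitwise_xor(codes_db, query)          # differing bits
    return np.unpackbits(xor, axis=1).sum(axis=1)  # popcount per row

# toy example: two 8-bit database codes and one query
db = np.array([[0b10110010], [0b10110011]], dtype=np.uint8)
q = np.array([0b10110010], dtype=np.uint8)
print(hamming_distances(db, q))  # [0 1]
```

Because the distance computation is a bitwise XOR followed by a bit count, retrieval over millions of codes costs only integer operations and a small fraction of the memory of real-valued features.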
Since hashing learning has low space cost, fast search speed,
and competitive accuracy, a large number of CMH methods [11],
[12] have been proposed in recent years. Hashing learning [13]
aims to project original data into Hamming space, while pre-
serving semantic similarity from original space. The pioneering
hashing methods can usually be divided into two types: unsupervised hashing and supervised hashing. The former mainly explores the semantic correlations from different modalities without label information, thereby learning binary codes. The latter mainly utilizes the label information to guide the generation of accurate hash codes. Compared with unsupervised meth-
ods, supervised CMH usually has better performance due to the
embrace of some prior semantic knowledge.
Although supervised CMH methods have obtained promis-
ing performance, there are still some tough challenges to be
further addressed. First, the previous CMH methods invariably
tend to learn binary codes using a single-layer hash function (as
shown in Fig. 1). Nevertheless, few methods learn hash codes
in multi-level space to capture hierarchical relation of features.
1520-9210 © 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Shandong Normal University. Downloaded on November 14,2024 at 03:21:07 UTC from IEEE Xplore. Restrictions apply.

Fig. 2. Overview of the hierarchical consensus cross-modal hashing framework (HCCH). Our HCCH projects the text-image pair into specific hash codes by a two-layer
hash function, in which hierarchical hashing mapping is performed in different layers to capture multi-level structural information. Then, a consensus learning
scheme is proposed that learns consistent hash codes to reduce the distribution difference between text and image hash codes.
Second, the existing CMH methods often directly project high-
dimensional instance pairs into low-dimensional hash codes by
a single-layer hash function. The sudden one-step dimension
reduction can result in the loss of certain important data proper-
ties or discriminative information. Third, to reduce the challeng-
ing heterogeneity gap of multiple modalities, one widely used
scheme is to map original multi-modal data into a shared Ham-
ming space. Clearly, this scheme ignores some latent beneficial
specific knowledge, resulting in poor performance.
To overcome these problems and improve the retrieval per-
formance, we propose a novel Hierarchical Consensus Cross-
modal Hashing (HCCH). As shown in Fig. 2, we propose a
hierarchical hashing strategy with a coarse-to-fine architecture
to refine discrimination feature information. That is, we use a
two-layer hash function with the ℓ2,1-norm constraint to learn
specific hash codes from different modalities gradually, thereby
obtaining a hierarchical low-level semantic structure. Further,
we learn consistent hash codes to relieve the heterogeneity gap.
Overall, the main contributions of this article are as follows:
• We propose an elegant HCCH for cross-modal retrieval. To the best of our knowledge, we are among the first to propose a coarse-to-fine hierarchical hashing scheme to sufficiently exploit feature hierarchy information from different modalities.
• To mitigate the discriminative information loss caused by a considerable drop in dimension, we propose a hierarchical learning paradigm that adopts two-layer hashing mapping to gradually gather the relatively important information and further learn specific hash codes.
• To relieve the heterogeneity gap, we present consensus learning to fully exploit the private properties of each modality and the shared semantics across different modalities.
• To efficiently solve this binary learning problem, we develop an iterative optimization algorithm. We conduct numerous experiments on four public datasets, and the experimental results show that our HCCH outperforms state-of-the-art comparison methods in retrieval performance.
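As a rough sketch of the two-layer idea (the actual hash functions are learned by HCCH's optimization with the ℓ2,1-norm and consensus constraints; here the projection matrices are random and all sizes are illustrative, shown only to contrast the coarse-to-fine mapping with a one-step projection):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, m, l = 6, 100, 32, 16          # samples, kernel dim, intermediate dim, code bits

X = rng.standard_normal((n, k))      # kernelized features of one modality
P = rng.standard_normal((k, m))      # first (coarse) layer: k -> m
R = rng.standard_normal((m, l))      # second (fine) layer: m -> l

B = np.sign(X @ P @ R)               # two-layer hashing instead of one k -> l jump
print(B.shape)                       # (6, 16), entries in {-1, +1}
```

The intermediate dimension m cushions the drop from k to l, which is the structural point the contribution list makes.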
The next sections of our article are organized as follows. In
Section II, we review some existing CMH methods. In Sec-
tion III, we first give our insight and motivation, and then de-
scribe our HCCH framework, optimization scheme, and some
detailed analysis. Moreover, we perform various comparison ex-
periments and some analysis in Section IV. Finally, the conclu-
sion is given in Section V.
II. RELATED WORK
In this section, we introduce some existing supervised and
unsupervised CMH methods briefly.
A. Unsupervised Cross-Modal Hashing
Unsupervised CMH [14], [15] usually does not utilize su-
pervised information, but explores intrinsic similarities directly
from heterogeneous data to learn hash codes. Fusion similar-
ity hashing (FSH) [16] explicitly embeds a fusion similarity
graph into the shared hash codes. To enhance robustness, ro-
bust and flexible discrete hashing (RFDH) [17] adopts the ℓ2,1-norm
constraint to learn robust binary codes. Afterwards, to capture
semantic similarities, adaptive structural similarity preserva-
tion hashing (ASSPH) [14] adaptively learns semantic corre-
lation through asymmetric semantic preserving and the correla-
tion constraints. And deep graph-neighbor coherence preserv-
ing network (DGCPN) [18] simultaneously utilizes the similar-
ity between samples and that between neighbors of the sam-
ples to represent semantic correlation. To mitigate the effects of
false-negative pairs, unsupervised contrastive cross-modal hashing (UCCH) [5] introduces contrastive learning to learn shared hash codes. However, these methods cannot further enhance retrieval performance due to the absence of supervised information.
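The ℓ2,1-norm used by RFDH above sums the ℓ2 norms of a matrix's rows; penalizing it drives whole rows toward zero, which is what suppresses redundant or corrupted feature dimensions. A minimal sketch:

```python
import numpy as np

def l21_norm(W):
    """||W||_{2,1}: sum of the l2 norms of the rows of W."""
    return np.linalg.norm(W, axis=1).sum()

W = np.array([[3.0, 4.0],    # row norm 5
              [0.0, 0.0]])   # zeroed-out row contributes nothing
print(l21_norm(W))  # 5.0
```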
B. Supervised Cross-Modal Hashing
Supervised CMH [19], [20] usually utilizes supervised
information to guide the generation of compact hash codes.
Some label-based hashing methods have been proposed. For

example, label consistent matrix factorization hashing
(LCMFH) [21] adopts label information to guide the gen-
eration of hash codes. To make hash codes from the same class
closer, subspace relation learning for cross-modal hashing
(SRLCH) [22] further proposes to exploit relation information
of labels. However, these label-based methods have fixed
semantic margins and large quantization errors. Thus,
fast discriminative discrete hashing (FDDH) [23] is proposed
that utilizes the ξ-dragging on labels to provide large label
margins. Considering the different dimensionality of image-text pairs, matrix tri-factorization hashing [24] jointly learns the specific hash codes with different length settings and then correlates the semantic consistency by learning two semantic correlation matrices. Scalable discrete matrix factorization
and semantic autoencoder (SDMSA) [25] adopts a two-step
strategy to learn hash codes and the hash function, respectively.
Enhanced discrete multi-modal hashing (EDMH) [26] learns
hash codes by hash balance and de-correlation constraints.
In addition, some asymmetric-based hashing methods have
been also proposed. For example, discrete latent factor hash-
ing (DLFH) [27] uses a maximum likelihood loss function
to measure similarity. To sufficiently explore the intrinsic
correlation between different modalities, asymmetric supervised consistent and specific hashing (ASCSH) [28] utilizes asymmetric semantic similarity to learn the consistent and
specific hash codes. Moreover, scalable asymmetric discrete
cross-modal hashing (BATCH) [29] puts forward collective
matrix factorization and distance-distance difference mini-
mization to learn hash codes. Then, fast cross-modal hashing (FCMH) [30] simultaneously utilizes global and local similarity information. To exploit the high-order semantic
label correlations, adaptive label correlation based asymmetric
cross-modal hashing (ALECH) [31] is proposed that learns
the semantic relationships among all labels without any prior
knowledge. To alleviate the label noises, WASH [32] adopts
low-rank factorization on the noise-reduced labels to learn hash
codes. The existing cross-modal hashing often ignores the label
correlations. Whereupon, SHDCH [33] considers that the multi-label contains much discriminative information and further addresses the problem of the hierarchical label structure. However, these methods largely ignore the hierarchical semantics of data pairs and the serious information loss caused by one-step dimension reduction.
III. PROPOSED METHOD
This section gives the proposed HCCH, including the moti-
vation, formulation, optimization algorithm, convergence anal-
ysis, computational complexity, and out-of-sample extension. Although our method can be easily extended to multiple modalities, in this article we only consider data with two modalities, i.e., text and image.
A. Problem Definition
In this paper, uppercase bold characters and lowercase bold characters represent matrices and vectors, respectively. Assume that $\mathbf{U}^t \in \mathbb{R}^{n \times d_t}$ is the matrix of training instances of the $t$-th modality, where $d_t$ is the feature dimension and $n$ is the number of samples. The corresponding ground-truth labels are denoted as $\mathbf{Y} \in \{0, 1\}^{n \times c}$, where $c$ is the number of shared classes; $Y_{ij} = 1$ if the $j$-th instance pair belongs to the $i$-th class and $Y_{ij} = 0$ otherwise. $\mathbf{B} \in \{-1, 1\}^{n \times l}$ denotes the $l$-bit hash codes.
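The label matrix Y can be built as a one-hot encoding per instance (a small illustrative sketch with made-up labels; multi-label data would simply set several entries per row):

```python
import numpy as np

labels = [0, 2, 1, 0]                 # class index of each instance, c = 3 classes
n, c = len(labels), 3
Y = np.zeros((n, c), dtype=int)       # Y in {0, 1}^{n x c}
Y[np.arange(n), labels] = 1
print(Y.tolist())  # [[1, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
```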
It is well known that the kernel trick can better express nonlinearly separable data correlations, such as via RBF kernel mapping. For each modality, the kernelized features of each data pair can be denoted as $\phi(\mathbf{u}^t)$,

$$\phi(\mathbf{u}^t) = \left[ \exp\!\left( -\frac{\|\mathbf{u}^t - \mathbf{a}^t_1\|_2^2}{2\sigma^2} \right), \ldots, \exp\!\left( -\frac{\|\mathbf{u}^t - \mathbf{a}^t_{k_t}\|_2^2}{2\sigma^2} \right) \right] \qquad (1)$$

where $\sigma$ is the kernel width and $\mathbf{a}^t_i$ denotes the $i$-th of the $k_t$ randomly chosen anchors. To simplify the presentation, we denote $\mathbf{X}^t$ to represent $\phi(\mathbf{U}^t)$.
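Equation (1) can be sketched as follows, with anchors drawn at random from the training data as the text describes (the data sizes and kernel width here are illustrative, not the paper's settings):

```python
import numpy as np

def rbf_features(U, anchors, sigma):
    """Kernelized features as in (1): one RBF similarity per anchor.
    U: (n, d) raw features; anchors: (k, d); returns (n, k) values in (0, 1]."""
    sq = ((U[:, None, :] - anchors[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
U = rng.standard_normal((5, 8))
anchors = U[rng.choice(5, size=3, replace=False)]  # randomly chosen anchors
X = rbf_features(U, anchors, sigma=1.0)
print(X.shape)  # (5, 3)
```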
B. Motivation
Since bit de-correlation [26] (i.e., the orthogonal constraint) has been proven to improve the discriminability of hash codes, it is widely used in learning to hash. Without loss of generality, singular value decomposition (SVD) needs to be performed to solve for the hash codes, and we conduct SVD to show the eigenvalue distribution. The larger the eigenvalue, the more discriminative information its eigenvector carries. We thus hope that the $l$-bit binary codes can preserve most of the discriminative information by choosing the eigenvectors corresponding to the top-$l$ eigenvalues. However, prior CMH methods usually learn discrete codes using a single-layer hash function. In other words, they directly transform high-dimensional image-text data of size $\mathbb{R}^{n \times k_t}$ into low-dimensional binary codes of size $\mathbb{R}^{n \times l}$. The multi-level semantics contained in the original kernel features are difficult to extract by single-layer hashing [34]. Moreover, this dimensional plunge easily leads to the loss of discriminant information.
To reveal the above insight, we show the eigenvalue distribution of the image modality of the MIRFlickr dataset, sorted in descending order, in Fig. 3. In Fig. 3(a), the one-layer hash function projects image data from 1000 dimensions to 64 dimensions. We can observe that the hash codes keep only 57.09% of the kernel information. We wondered whether there could be a way to preserve more discriminative information and capture multi-level semantics. Motivated by the hierarchical architecture [35], to effectively avoid the abrupt drop of dimension and alleviate the discriminative information loss, we consider a coarse-to-fine hierarchical hashing scheme that learns a multi-level hash function in a progressive way. More specifically, we employ multiple intermediary matrices to extract the discriminative information from kernel features. Note that here we only focus on the two-layer hash function. As shown in Fig. 3(b), the two-layer hash can preserve more discriminative information (i.e., 60.80% > 57.09%). Therefore, the above analysis shows that our two-layer hashing idea can preserve more of the kernel-feature information, thereby reducing the information loss.
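Numbers in the style of the 57.09% figure can be reproduced in spirit as the cumulative eigenvalue energy kept by the top-l eigen-directions (a sketch on random data, not the MIRFlickr kernel features; sizes are illustrative):

```python
import numpy as np

def preserved_ratio(X, l):
    """Fraction of the eigenvalue energy of X^T X kept by the top-l directions."""
    s = np.linalg.svd(X, compute_uv=False)  # singular values, descending
    eig = s ** 2                            # eigenvalues of X^T X
    return eig[:l].sum() / eig.sum()

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 100))         # stand-in for the kernelized features
print(round(preserved_ratio(X, 64), 4))     # energy fraction kept by 64 bits
```

Comparing this ratio before and after an intermediate projection is how one would check, on real features, whether a two-layer mapping retains more energy than a one-step drop.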