Medical cross-modal retrieval aims to support semantic similarity search across the different modalities of a medical case, such as quickly locating relevant ultrasound images from an ultrasound report, or retrieving matching reports from ultrasound images. However, existing medical cross-modal hash retrieval methods face significant challenges, including semantic and visual gaps between modalities and the limited scalability of hash algorithms on large-scale data. To address these challenges, this paper proposes Medical image Semantic Alignment Cross-modal Hashing based on Transformer (MSACH). The algorithm adopts a segmented training strategy that combines modality feature extraction with hash function learning, effectively extracting low-dimensional features that retain important semantic information. A Transformer encoder is used for cross-modal semantic learning, and manifold similarity constraints, balance constraints, and a linear classification network constraint are introduced to enhance the discriminability of the hash codes. Experimental results show that MSACH improves the mean average precision (MAP) by 11.8% and 12.8% over traditional methods on two datasets. The algorithm performs well in improving retrieval accuracy and handling large-scale medical data, showing promising potential for practical applications.
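The evaluation protocol the abstract refers to — binary hash codes compared by Hamming distance, with retrieval quality scored by mean average precision (MAP) — can be sketched as follows. This is a minimal illustration, not the paper's code: the function names, the sign-thresholding binarization, and the single-label relevance criterion are assumptions for the sake of the example.

```python
import numpy as np

def binarize(h):
    # Map real-valued hash-layer outputs to {-1, +1} codes via sign thresholding
    # (a common relaxation-then-quantize step; details vary by method).
    return np.where(h >= 0, 1, -1)

def hamming_dist(Bq, Bd):
    # For codes in {-1, +1}^k, Hamming distance = (k - <b_q, b_d>) / 2,
    # so all pairwise distances reduce to one matrix product.
    k = Bq.shape[1]
    return 0.5 * (k - Bq @ Bd.T)

def mean_average_precision(Bq, Bd, Lq, Ld):
    # MAP over all queries; here an item is relevant if it shares the query's
    # label (an illustrative criterion, not necessarily the paper's).
    dist = hamming_dist(Bq, Bd)
    aps = []
    for i in range(Bq.shape[0]):
        order = np.argsort(dist[i], kind="stable")      # rank database by distance
        rel = (Ld[order] == Lq[i]).astype(float)        # relevance of each ranked item
        if rel.sum() == 0:
            continue
        hits = np.nonzero(rel)[0]                       # ranks of relevant items
        precision_at_hits = np.cumsum(rel)[hits] / (hits + 1)
        aps.append(precision_at_hits.mean())
    return float(np.mean(aps))
```

A query whose code exactly matches the codes of all same-label database items receives an average precision of 1.0; ties in Hamming distance are broken by database order here, whereas published results typically average over random tie-breaking.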
Citation:
WU Qianlin, TANG Lun, LIU Qinghai, XU Liming, CHEN Qianbin. Cross-modal hash retrieval of medical images based on Transformer semantic alignment. Journal of Biomedical Engineering, 2025, 42(1): 156-163. doi: 10.7507/1001-5515.202407034
Copyright © the editorial department of Journal of Biomedical Engineering of West China Medical Publisher. All rights reserved