Triple-residual enhanced convolution- and transformer-based hybrid encoder-decoder network for medical image segmentation

Minghan Lin, Ziyang Wang*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

In contemporary medical practice, computed tomography and ultrasound imaging play a critical role in disease diagnosis, treatment planning, and monitoring. Image segmentation is a foundational step in medical image analysis: it is essential for extracting key anatomical and pathological information, and it is now dominated by deep learning. The Convolutional Neural Network (CNN) and the Vision Transformer (ViT) are two foundational architectures, and both demonstrate promising performance in segmentation. This paper proposes the Triple-Residual Enhanced Convolution- and Transformer-Based Hybrid Encoder-Decoder Network, named TriResUNet. First, a redesigned residual multi-resolution CNN-based block is introduced to enhance multi-level feature extraction in both the encoders and the decoders. Second, a shifted-window-based ViT with a residual connection is explored in the bottleneck to capture long-range relationships. Third, a residual connection with a computationally efficient CNN-based block is integrated between the encoders and the decoders to mitigate the vanishing gradient problem, facilitating efficient gradient flow and enhancing feature reuse across layers. The proposed network is validated on several publicly available datasets against a broad set of baseline methods and several similar hybrid networks. The experimental findings suggest that the proposed network achieves superior performance. The code will be available upon acceptance at https://github.com/ziyangwang007/VIT4UNet.
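
The link between residual connections and gradient flow can be illustrated with a generic residual mapping. This is a sketch under stated assumptions, not the paper's own formulation: $F$ stands for any learned sub-network wrapped by a residual connection (for instance the multi-resolution CNN block, the shifted-window ViT block, or the CNN block on the path between encoders and decoders), and the symbols $x$, $y$, and $\mathcal{L}$ are introduced here for illustration only.

```latex
% Generic residual mapping (assumed form; the paper's exact blocks may differ)
y = x + F(x)

% Jacobian of the block output with respect to its input
\frac{\partial y}{\partial x} = I + \frac{\partial F(x)}{\partial x}

% Backpropagated gradient of a loss \mathcal{L} (column-vector convention)
\frac{\partial \mathcal{L}}{\partial x}
  = \left( I + \frac{\partial F(x)}{\partial x} \right)^{\top}
    \frac{\partial \mathcal{L}}{\partial y}
  = \frac{\partial \mathcal{L}}{\partial y}
  + \left( \frac{\partial F(x)}{\partial x} \right)^{\top}
    \frac{\partial \mathcal{L}}{\partial y}
```

The identity term passes the upstream gradient $\partial \mathcal{L} / \partial y$ to $x$ unchanged, so the signal reaching earlier layers does not vanish even when the Jacobian of $F$ is small.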

Original language: English
Article number: 131125
Journal: Neurocomputing
Volume: 652
Early online date: 28 Jul 2025
Publication status: Published - 1 Nov 2025

Keywords

  • Computed tomography
  • Convolution
  • Image segmentation
  • Ultrasound
  • Vision transformer
