TY - JOUR
T1 - Triple-residual enhanced convolution- and transformer-based hybrid encoder-decoder network for medical image segmentation
AU - Lin, Minghan
AU - Wang, Ziyang
PY - 2025
Y1 - 2025/11/1
AB - In contemporary medical practice, computed tomography and ultrasound imaging play a critical role in disease diagnosis, treatment planning, and monitoring. Image segmentation is a foundational step in medical image analysis: it is essential for extracting key anatomical and pathological information, and it is now dominated by deep learning. The Convolutional Neural Network (CNN) and the Vision Transformer (ViT) are two foundational architectures, and both demonstrate promising performance in segmentation. This paper proposes the Triple-Residual Enhanced Convolution- and Transformer-Based Hybrid Encoder-Decoder Network, named TriResUNet. First, a redesigned residual multi-resolution CNN-based block is introduced to enhance multi-level feature extraction in both the encoders and the decoders. Second, a Shift-Window-based ViT with a residual connection is explored in the bottleneck to capture long-range relationships. Third, a residual connection with a computationally efficient CNN-based block is integrated between the encoders and decoders to mitigate the vanishing gradient problem, facilitating efficient gradient flow and enhancing feature reuse across layers. The proposed network is validated on several publicly available datasets against a broad set of baseline methods and similar hybrid networks. The experimental findings suggest that the proposed network achieves superior performance. The code will be available upon acceptance at https://github.com/ziyangwang007/VIT4UNet.
KW - Computed tomography
KW - Convolution
KW - Image segmentation
KW - Ultrasound
KW - Vision transformer
UR - http://www.scopus.com/inward/record.url?scp=105012149269&partnerID=8YFLogxK
UR - https://www.sciencedirect.com/science/article/pii/S0925231225017977?via%3Dihub
U2 - 10.1016/j.neucom.2025.131125
DO - 10.1016/j.neucom.2025.131125
M3 - Article
AN - SCOPUS:105012149269
SN - 0925-2312
VL - 652
JO - Neurocomputing
JF - Neurocomputing
M1 - 131125
ER -