Training‐free few‐shot construction tool and material detection using pre‐trained vision‐language model

Zhaoxin Zhang, Yantao Yu*, Zaolin Pan, Maxwell Fordjour Antwi‐Afari

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review


Abstract

Direct visual understanding of construction entities, such as tools and materials (T&M), underpins construction management and resource scheduling. Traditional supervised learning methods suffer from high annotation costs, heavy computational demands, and limited datasets. In contrast, training‐free approaches offer an effective alternative well‐suited to construction scenarios constrained by data scarcity and limited resources. Moreover, vision‐language models (VLMs) can learn image semantics directly through natural language supervision and demonstrate strong zero‐shot detection capabilities without retraining. However, existing methods often exhibit limited image–text semantic alignment in construction scenarios, which restricts their effectiveness in construction tasks. There is therefore an urgent need for approaches that enhance cross‐modal understanding in such domain‐specific contexts. To address this challenge, this paper proposes a training‐free, knowledge‐enhanced VLM to recognize T&M in construction tasks. The proposed approach leverages image matching and image–text knowledge alignment strategies, thereby retaining the training‐free nature of existing VLMs while benefiting from the enhanced performance brought by knowledge integration. This method offers a novel solution for construction management and robotic collaboration tasks that are traditionally constrained by data and computational resource dependencies.
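The core idea behind training‐free VLM detection is to score each image region against natural‐language label prompts in a shared embedding space, with no task‐specific training. The paper's exact matching and knowledge‐alignment strategies are not detailed in the abstract; the sketch below only illustrates the generic zero‐shot step with cosine similarity, using placeholder vectors standing in for a pre‐trained VLM's image and text embeddings (all names and values here are illustrative assumptions, not the authors' method).

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def zero_shot_classify(image_emb, text_embs, labels):
    """Pick the label whose text embedding best matches the image embedding."""
    scores = [cosine(image_emb, t) for t in text_embs]
    return labels[scores.index(max(scores))]

# Placeholder prompt embeddings for three construction T&M classes
labels = ["a photo of a hammer", "a photo of a brick", "a photo of a drill"]
text_embs = [[1.0, 0.1, 0.0],
             [0.0, 1.0, 0.1],
             [0.1, 0.0, 1.0]]
image_emb = [0.9, 0.2, 0.1]  # mock embedding of a hammer image

print(zero_shot_classify(image_emb, text_embs, labels))  # → a photo of a hammer
```

In practice, the embeddings would come from a pre‐trained VLM such as CLIP, and a few‐shot variant would additionally compare the query embedding against embeddings of the few support images per class.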
Original language: English
Number of pages: 20
Journal: Computer-Aided Civil and Infrastructure Engineering
Early online date: 6 Nov 2025
DOIs
Publication status: E-pub ahead of print - 6 Nov 2025

Bibliographical note

Copyright © 2025 The Author(s). Computer-Aided Civil and Infrastructure Engineering published by Wiley Periodicals LLC on behalf of Editor.
This is an open access article under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non-commercial and no modifications or adaptations are made.

Funding

This work was supported by the National Natural Science Foundation of China [grant number 72201226]; the Research Grants Council (Hong Kong) [grant numbers 26208323, C6044‐23GF].

Funders and funder numbers:
National Natural Science Foundation of China: 72201226
Research Grants Council, University Grants Committee: 26208323, C6044‐23GF
