Abstract
Direct visual understanding of construction entities, such as tools and materials (T&M), underpins construction management and resource scheduling. Traditional supervised learning methods suffer from high annotation costs, heavy computational demands, and limited datasets. In contrast, training-free approaches offer an effective alternative well suited to construction scenarios constrained by data scarcity and limited resources. Moreover, vision-language models (VLMs) can learn image semantics directly through natural language supervision and demonstrate strong zero-shot detection capabilities without retraining. However, existing methods often exhibit limited image–text semantic alignment in construction scenarios, which restricts their effectiveness in such tasks. There is therefore an urgent need for approaches that enhance cross-modal understanding in these domain-specific contexts. To address this challenge, this paper proposes a training-free, knowledge-enhanced VLM for recognizing T&M in construction tasks. The proposed approach leverages image matching and image–text knowledge alignment strategies, retaining the training-free nature of existing VLMs while benefiting from the performance gains of knowledge integration. This method offers a novel solution for construction management and robotic collaboration tasks that are traditionally constrained by data and computational resource dependencies.
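As a point of reference for the zero-shot capability described above, the following is a minimal sketch of zero-shot T&M recognition with an off-the-shelf CLIP model via the Hugging Face transformers library. It is illustrative only, not the knowledge-enhanced pipeline proposed in the paper; the construction vocabulary, prompt template, and image path are hypothetical placeholders.

```python
# Minimal sketch: zero-shot recognition of construction tools and materials
# (T&M) with an off-the-shelf CLIP model. Illustrative only; this is not the
# knowledge-enhanced method proposed in the paper.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical construction-domain vocabulary; a knowledge-enhanced approach
# would enrich these text prompts with domain knowledge rather than retrain.
labels = ["hammer", "power drill", "rebar", "concrete mixer", "scaffolding"]
prompts = [f"a photo of a {label} on a construction site" for label in labels]

image = Image.open("site_image.jpg")  # placeholder image path
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)
    # logits_per_image holds image-text similarity scores; softmax turns
    # them into a distribution over the candidate labels.
    probs = outputs.logits_per_image.softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

In this setup, recognition quality hinges on how well the text prompts align with site imagery, which is precisely the image–text alignment gap the paper targets through knowledge integration.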
| Field | Value |
|---|---|
| Original language | English |
| Number of pages | 20 |
| Journal | Computer-Aided Civil and Infrastructure Engineering |
| Early online date | 6 Nov 2025 |
| DOIs | |
| Publication status | E-pub ahead of print - 6 Nov 2025 |
Bibliographical note
Copyright © 2025 The Author(s). Computer-Aided Civil and Infrastructure Engineering published by Wiley Periodicals LLC on behalf of the Editor. This is an open access article under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non-commercial and no modifications or adaptations are made.
Funding
This work was supported by the National Natural Science Foundation of China [grant number 72201226]; the Research Grants Council (Hong Kong) [grant numbers 26208323, C6044-23GF].
| Funders | Funder number |
|---|---|
| National Natural Science Foundation of China | 72201226 |
| Research Grants Council, University Grants Committee | 26208323, C6044‐23GF |