The construction sector increasingly adopts robotic automation and digital tools like digital twinning to enhance capital productivity and tackle labor shortages. Despite these advancements, a major safety concern remains when deploying robots alongside workers due to the risk of collisions.
The industry’s typical response to unwanted collisions is proximity monitoring, but identifying hazards based solely on proximity is inadequate. A thorough assessment must also consider the relationships between entities to determine whether they are allowed to be near each other.
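The distinction above can be illustrated with a minimal sketch. The entity names, coordinates, and the allow-list of permitted pairings below are hypothetical, not from the study; the point is only that a proximity check alone raises an alarm for every close pair, while a relationship-aware check suppresses alarms for pairs permitted to work near each other.

```python
import math
from dataclasses import dataclass

@dataclass
class Entity:
    name: str
    x: float
    y: float

# Hypothetical allow-list: pairs whose relationship permits close proximity.
ALLOWED_NEARBY = {("signaler", "crane"), ("operator", "excavator")}

def proximity_alert(a: Entity, b: Entity, threshold: float = 5.0) -> bool:
    """Proximity-only monitoring: flags ANY pair closer than the threshold."""
    return math.hypot(a.x - b.x, a.y - b.y) < threshold

def relationship_aware_alert(a: Entity, b: Entity, threshold: float = 5.0) -> bool:
    """Flags a close pair only if their relationship does NOT permit proximity."""
    close = proximity_alert(a, b, threshold)
    allowed = (a.name, b.name) in ALLOWED_NEARBY or (b.name, a.name) in ALLOWED_NEARBY
    return close and not allowed

signaler = Entity("signaler", 0.0, 0.0)
crane = Entity("crane", 3.0, 0.0)

print(proximity_alert(signaler, crane))           # True: proximity alone raises a false alarm
print(relationship_aware_alert(signaler, crane))  # False: this pairing is permitted
```

A pair not on the allow-list (say, a passing worker and the crane) would still trigger the relationship-aware alert, which is the behavior proximity-only monitoring cannot distinguish.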
Investigating DNN-Based Visual Relationship Detection
This study explored the potential of deep neural network (DNN)-based single-shot visual relationship detection to infer connections among construction objects directly from site images. A well-designed DNN can blend local and global features into a single feature map, detecting intuitive relationships similarly to human vision.
Research Methodology
The research employed three DNN models of varying complexity for visual relationship detection. The Pixel2Graph DNN architecture, recognized for its multi-scale feature abstraction and relationship detection, was utilized. Initial training used the Visual Genome dataset, followed by fine-tuning with construction-specific data. The models were developed and tested using Python 3, with Recall@X as the evaluation metric.
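Recall@X in visual relationship detection is conventionally the fraction of ground-truth (subject, predicate, object) triplets that appear among a model's top-X highest-scoring predictions. The sketch below, with illustrative triplets and scores not taken from the study, shows how such a metric can be computed per image:

```python
def recall_at_k(ground_truth: set, predictions: list, k: int = 5) -> float:
    """
    ground_truth: set of (subject, predicate, object) triplets for one image.
    predictions: list of (triplet, confidence_score) pairs, in any order.
    Returns the fraction of ground-truth triplets found in the top-k predictions.
    """
    ranked = sorted(predictions, key=lambda p: p[1], reverse=True)
    top_k = {triplet for triplet, _ in ranked[:k]}
    if not ground_truth:
        return 1.0
    return len(ground_truth & top_k) / len(ground_truth)

# Hypothetical example for a single image.
gt = {("worker", "operates", "excavator"), ("worker", "near", "barrier")}
preds = [
    (("worker", "operates", "excavator"), 0.91),
    (("worker", "near", "crane"), 0.60),
    (("worker", "near", "barrier"), 0.55),
]
print(recall_at_k(gt, preds, k=5))  # 1.0: both ground-truth triplets are in the top 5
```

Dataset-level Recall@X would then aggregate these per-image scores over the test set.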
Three models were tested:
- Basic Relationship Detection (Only-Rel): This model detected relationships using provided object bounding boxes and classes.
- Combined Classification and Relationship Detection (Cla-Rel): This model handled both object classification and relationship detection.
- Integrated Object Detection and Relationship Detection (Loc-Cla-Rel): The most complex model performed object localization, classification, and relationship detection.
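The three configurations differ only in how much of the pipeline the network must infer itself. A small sketch (the dictionary layout is illustrative, not the study's code) makes the input/output split explicit:

```python
# Sketch of the three task configurations: what each model is given
# as input versus what it must predict from the image itself.
MODEL_TASKS = {
    "Only-Rel":    {"given": ["boxes", "classes"], "predicts": ["relationships"]},
    "Cla-Rel":     {"given": ["boxes"],            "predicts": ["classes", "relationships"]},
    "Loc-Cla-Rel": {"given": [],                   "predicts": ["boxes", "classes", "relationships"]},
}

for name, tasks in MODEL_TASKS.items():
    given = ", ".join(tasks["given"]) or "raw image only"
    print(f"{name}: given {given} -> predicts {', '.join(tasks['predicts'])}")
```

Each step down the list removes one piece of ground-truth input, so any comparison across the three models isolates the cost of adding that inference task.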
Testing used previously unseen construction images drawn from site videos and YouTube footage, with image annotation facilitated by Amazon Mechanical Turk.
Performance Analysis and Discussion
The research aimed to enhance safety in the construction industry by enabling safe coexistence of robots and humans. The models showed varying levels of effectiveness, as reflected in their performance metrics.
- Basic Relationship Detection Model (Only-Rel): This model recognized relationships between predefined entities with given bounding boxes and classes. It showed high consistency with Recall@5 of 90.89% during fine-tuning and 90.63% during testing, indicating strong generalization and reliability for real-world use.
- Combined Classification and Relationship Detection Model (Cla-Rel): This model deduced both classification and relationships of entities based on provided bounding boxes. It performed well in fine-tuning with Recall@5 of 90.54%, but its performance dropped to 72.02% on the test dataset. This decline highlights challenges in classifying objects under variable conditions, affecting relationship detection accuracy.
- Integrated Object Detection and Relationship Detection Model (Loc-Cla-Rel): The most complex model aimed to perform object detection (localization and classification) and relationship detection. It achieved Recall@5 of 92.96% during fine-tuning, but its performance decreased to 66.28% during testing. This significant drop suggests difficulties in managing multiple tasks simultaneously, particularly integrating object detection with relationship inference.
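The reported Recall@5 figures make the generalization gap easy to quantify. The short calculation below uses only the numbers stated above and computes each model's absolute (percentage-point) and relative drop from fine-tuning to testing:

```python
# Recall@5 values reported in the study: (fine-tuning, test).
results = {
    "Only-Rel":    (90.89, 90.63),
    "Cla-Rel":     (90.54, 72.02),
    "Loc-Cla-Rel": (92.96, 66.28),
}

for model, (finetune, test) in results.items():
    absolute_drop = finetune - test                   # percentage points
    relative_drop = 100 * absolute_drop / finetune    # % of fine-tuning score lost
    print(f"{model}: -{absolute_drop:.2f} pts ({relative_drop:.1f}% relative)")
```

The gap grows from well under one point (Only-Rel) to roughly 18.5 points (Cla-Rel) and nearly 27 points (Loc-Cla-Rel), quantifying how each added inference task compounds the generalization challenge.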
The decline in accuracy with increased task complexity underscores the need for further optimization. This may involve refining network architecture or using advanced training techniques to simulate diverse construction site conditions. Increasing dataset variety and volume might also enhance the model’s generalization capabilities. These improvements are crucial for practical deployment to ensure worker safety alongside robotic systems.
Summary
While construction workers remain essential in automated environments, their safety around robots is critical. This study presents a novel approach using DNNs for single-shot visual relationship detection, which emulates human-like perception to identify hazards from a single image. This technology can complement existing proximity monitoring systems without extra hardware.
Although initial results are promising, test performances suggest a need for further refinement. Future enhancements could include more extensive training data and advancements in DNN architectures and methodologies to better prepare the models for real-world applications.