In the past few years, object detection has attracted considerable attention in the context of human–robot collaboration and Industry 5.0, driven by substantial quality improvements in deep learning technologies. In many applications, object detection models must be able to adapt quickly to a changing environment, i.e., to learn new objects. A crucial but challenging prerequisite for this is the automatic generation of new training data, which currently still limits the broad application of object detection methods in industrial manufacturing. In this work, we discuss how to adapt state-of-the-art object detection methods for the task of automatic bounding box annotation in a use case where the background is homogeneous and the object’s label is provided by a human. We compare an adapted version of Faster R-CNN and the Scaled-YOLOv4-p5 architecture and show that both can be trained to distinguish unknown objects from a complex but homogeneous background using only a small amount of training data. In contrast to most other state-of-the-art methods for bounding box labeling, our proposed method requires neither human verification, a predefined set of classes, nor a very large manually annotated dataset. Our method outperforms the state-of-the-art, transformer-based object discovery method LOST on our simple fruits dataset by large margins.
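The abstract's detectors learn to separate unknown objects from a homogeneous background. As a purely illustrative baseline (not the authors' method), the homogeneous-background assumption can be sketched with simple per-channel color thresholding; the function name, tolerance, and toy image below are hypothetical:

```python
import numpy as np

def annotate_bbox(image, bg_color, tol=30):
    """Return (x_min, y_min, x_max, y_max) enclosing all pixels whose
    per-channel difference from the homogeneous background color
    exceeds `tol`; return None if no foreground pixel is found."""
    diff = np.abs(image.astype(int) - np.asarray(bg_color, dtype=int))
    mask = diff.max(axis=-1) > tol          # foreground pixels
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

# toy image: uniform gray background with one bright "object" patch
img = np.full((100, 100, 3), 200, dtype=np.uint8)
img[40:60, 30:70] = (255, 0, 0)
print(annotate_bbox(img, bg_color=(200, 200, 200)))  # (30, 40, 69, 59)
```

Such a thresholding heuristic fails on the "complex but homogeneous" backgrounds the paper targets, which is precisely why learned detectors are used instead.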
An approach focused on resolving the identity of subjects in a photo using mobile device connectivity, Web services and social network ontologies is presented in this paper. A framework is described in which mobile device sensors, Web services and ontologies are combined to provide meaningful photo annotation metadata that can be used to recall photos from the Web. Useful metadata can be gleaned from the environment at the time of capture and further information inferred from available Web services.
This paper presents an approach to semi-automate photo annotation. Instead of using content-recognition techniques, this approach leverages context information available at the scene of the photo, such as time and location, in combination with existing photo annotations to provide suggestions to the user. An algorithm exploits a number of technologies including Global Positioning System (GPS), Semantic Web, Web services and Online Social Networks, considering all information and making a best-effort attempt to suggest both people and places depicted in the photo. The user then selects which of the suggestions are correct to annotate the photo. This accelerates annotation dramatically, which in turn aids photo search for the wide range of query tools that currently trawl the millions of photos on the Web.
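One simple way to realize such context-based suggestions is to rank people who were tagged in previously annotated photos taken near the new photo's GPS location. This sketch is an assumption-laden illustration, not the paper's algorithm; the data layout, radius, and function names are hypothetical:

```python
import math

def suggest_people(photo, past_photos, radius_km=1.0, top_k=3):
    """Suggest people for `photo` by counting how often each person was
    tagged in past photos captured within `radius_km` of its location.
    Each photo is a dict with 'lat' and 'lon'; annotated past photos
    also carry a 'people' list of names."""
    def dist_km(a, b):
        # equirectangular approximation, adequate for small distances
        x = math.radians(b['lon'] - a['lon']) * math.cos(
            math.radians((a['lat'] + b['lat']) / 2))
        y = math.radians(b['lat'] - a['lat'])
        return 6371 * math.hypot(x, y)

    scores = {}
    for p in past_photos:
        if dist_km(photo, p) <= radius_km:
            for name in p.get('people', []):
                scores[name] = scores.get(name, 0) + 1
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_k]]

past = [
    {'lat': 51.50, 'lon': -0.10, 'people': ['Alice', 'Bob']},
    {'lat': 51.50, 'lon': -0.10, 'people': ['Alice']},
    {'lat': 48.85, 'lon': 2.35, 'people': ['Carol']},  # far away
]
print(suggest_people({'lat': 51.50, 'lon': -0.10}, past))  # ['Alice', 'Bob']
```

In the full approach described above, these location-based candidates would be further filtered and re-ranked using time of capture, social network ontologies, and existing annotations before being shown to the user for confirmation.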