A Fast System for Person Description Search in Videos
Author
Abstract

Security CCTV cameras are important for public safety. These cameras record continuously, 24/7, and produce a large amount of video data. If the videos are not reviewed immediately after an incident, it can be difficult and time-consuming to find a specific person in many hours of recording. In this work we present a system that can search a video for people who fit a textual description. It uses an image-text multimodal deep learning model to calculate the similarity between an image of a person and a text description and to find the top matches. Normally this would require calculating the text-image similarity scores between one text description and every person in the video, which is O(n) in the number of people in the video and therefore impractical for real-time search. We propose a solution: pre-calculating embeddings of person images and applying approximate nearest neighbor vector search. At inference time, only one forward pass through the deep learning model is needed; the computational cost is therefore the time to embed a text description, O(1), plus the time to perform an approximate nearest neighbor search, O(log n). This makes real-time interactive search possible.
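
The pipeline described above can be illustrated with a minimal sketch: person-crop embeddings are computed once offline, and a query then costs one text forward pass plus an approximate nearest neighbor lookup. This sketch assumes a CLIP-style image-text model (the open-source clip package) and a FAISS HNSW index; the model choice, index parameters, and the person_crop_paths placeholder are illustrative assumptions, not the paper's exact implementation.

# Sketch: offline person-crop embedding + online text query with ANN search.
import clip
import faiss
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical person crops produced by a person detector over the video.
person_crop_paths = ["crops/person_0001.jpg", "crops/person_0002.jpg"]

def embed_images(paths):
    """Offline step: embed every detected person crop (O(n), done once)."""
    feats = []
    for p in paths:
        image = preprocess(Image.open(p)).unsqueeze(0).to(device)
        with torch.no_grad():
            f = model.encode_image(image)
        feats.append(f / f.norm(dim=-1, keepdim=True))  # unit-normalize
    return torch.cat(feats).cpu().numpy().astype("float32")

# Build an approximate nearest neighbor index (HNSW, L2 metric).
# With unit-normalized vectors, smallest L2 distance = highest cosine similarity.
person_vecs = embed_images(person_crop_paths)
index = faiss.IndexHNSWFlat(person_vecs.shape[1], 32)
index.add(person_vecs)

def search(description, k=5):
    """Online step: one text forward pass plus an approximate NN search."""
    tokens = clip.tokenize([description]).to(device)
    with torch.no_grad():
        q = model.encode_text(tokens)
    q = (q / q.norm(dim=-1, keepdim=True)).cpu().numpy().astype("float32")
    distances, ids = index.search(q, k)
    return list(zip(ids[0].tolist(), distances[0].tolist()))

# Example query: returns the indices of the top-k matching person crops.
print(search("a man wearing a red jacket and carrying a backpack", k=5))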

Year of Publication
2022
Date Published
November
Publisher
IEEE
Conference Location
Nonthaburi, Thailand
ISBN Number
978-1-66548-912-6
URL
https://ieeexplore.ieee.org/document/10067673/
DOI
10.1109/InCIT56086.2022.10067673