A Fast System for Person Description Search in Videos
Author
Abstract

Security CCTV cameras are important for public safety. These cameras record continuously, 24/7, and produce a large amount of video data. If the videos are not reviewed immediately after an incident, it can be difficult and time-consuming to find a specific person in many hours of recording. In this work we present a system that can search a video for people who fit a textual description. It uses an image-text multimodal deep learning model to calculate the similarity between an image of a person and a text description and to find the top matches. Normally this would require calculating the text-image similarity scores between one text description and every person in the video, which is O(n) in the number of people in the video and therefore impractical for real-time search. We propose a solution: pre-calculating embeddings of person images and applying approximate nearest neighbor vector search. At inference time, only one forward pass through the deep learning model is needed; the computational cost is therefore the time to embed a text description, O(1), plus the time to perform an approximate nearest neighbor search, O(log n). This makes real-time interactive search possible.
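
The pipeline described above can be illustrated with a minimal sketch: person-crop embeddings are computed once offline, and a query then costs one text forward pass plus an approximate nearest neighbor lookup. This sketch assumes a CLIP-style image-text model (the open-source clip package) and a FAISS HNSW index; the model choice, index parameters, and the person_crop_paths placeholder are illustrative assumptions, not the paper's exact implementation.

# Sketch: offline person-crop embedding + online text query with ANN search.
import clip
import faiss
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical person crops produced by a person detector over the video.
person_crop_paths = ["crops/person_0001.jpg", "crops/person_0002.jpg"]

def embed_images(paths):
    """Offline step: embed every detected person crop (O(n), done once)."""
    feats = []
    for p in paths:
        image = preprocess(Image.open(p)).unsqueeze(0).to(device)
        with torch.no_grad():
            f = model.encode_image(image)
        feats.append(f / f.norm(dim=-1, keepdim=True))  # unit-normalize
    return torch.cat(feats).cpu().numpy().astype("float32")

# Build an approximate nearest neighbor index (HNSW, L2 metric).
# With unit-normalized vectors, smallest L2 distance = highest cosine similarity.
person_vecs = embed_images(person_crop_paths)
index = faiss.IndexHNSWFlat(person_vecs.shape[1], 32)
index.add(person_vecs)

def search(description, k=5):
    """Online step: one text forward pass plus an approximate NN search."""
    tokens = clip.tokenize([description]).to(device)
    with torch.no_grad():
        q = model.encode_text(tokens)
    q = (q / q.norm(dim=-1, keepdim=True)).cpu().numpy().astype("float32")
    distances, ids = index.search(q, k)
    return list(zip(ids[0].tolist(), distances[0].tolist()))

# Example query: returns the indices of the top-k matching person crops.
print(search("a man wearing a red jacket and carrying a backpack", k=5))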

Year of Publication
2022
Date Published
November
Publisher
IEEE
Conference Location
Nonthaburi, Thailand
ISBN Number
978-1-66548-912-6
URL
https://ieeexplore.ieee.org/document/10067673/
DOI
10.1109/InCIT56086.2022.10067673