01
Jun
In-video search is the ability to search for specific content within a video. This can include searching for particular spoken words, objects shown, or a description of a scene. With recent advances in transformers, in-video search has become more accurate and fairly simple. Most transformer models do not have a joint embedding space for multiple modalities, but a few do: Meta’s ImageBind has a joint embedding space across text, image, audio, depth, thermal, and IMU data, and OpenAI’s CLIP has a joint embedding space between text and image. We can use these models to create…
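
To illustrate why a joint embedding space makes in-video search simple, here is a minimal sketch of the retrieval step. It assumes each sampled video frame and the text query have already been embedded into the same space by a model such as CLIP or ImageBind. The three-dimensional vectors, the timestamps, and the helper names `cosine_similarity` and `search_frames` are hypothetical and exist only for this example; real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def search_frames(query_embedding, frame_embeddings, top_k=3):
    """Rank (timestamp, embedding) pairs by similarity to the query.

    Returns the top_k (timestamp, score) pairs, best match first.
    """
    scored = [
        (timestamp, cosine_similarity(query_embedding, embedding))
        for timestamp, embedding in frame_embeddings
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy data standing in for model output (assumption: not real embeddings).
frames = [
    (0.0, [0.9, 0.1, 0.0]),   # frame sampled at 0 s
    (5.0, [0.1, 0.9, 0.1]),   # frame sampled at 5 s
    (10.0, [0.0, 0.2, 0.9]),  # frame sampled at 10 s
]
query = [0.1, 0.8, 0.2]  # stands in for the embedded text query

results = search_frames(query, frames, top_k=2)
for timestamp, score in results:
    print(f"{timestamp:>5.1f}s  similarity={score:.3f}")
```

Because text and frames live in the same space, search reduces to a nearest-neighbour lookup: the frame whose embedding points in the direction closest to the query is the best match. For longer videos, the linear scan here would typically be replaced by a vector index.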