Can current video-language models truly understand who did what in a scene? We introduce VELOCITI, a benchmark designed …