ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation
ProfVLM is a lightweight multimodal video-language model designed to analyze sports performance videos and generate textual proficiency feedback. The system pairs a frozen TimeSformer video encoder with a LoRA-injected SmolLMv2 language model to process video-text pairs efficiently. Trained on EgoExo4D with expert commentaries, ProfVLM surpasses state-of-the-art methods while using up to 20x fewer parameters and reducing training time by up to 60%.
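The parameter savings come largely from LoRA: instead of updating a full weight matrix, only two low-rank factors are trained while the original weights stay frozen. A minimal sketch of the arithmetic (dimensions and rank here are illustrative assumptions, not the model's actual configuration):

```python
def lora_trainable_params(d_in, d_out, rank):
    # LoRA freezes the original (d_out x d_in) weight W and trains
    # two low-rank factors A (rank x d_in) and B (d_out x rank);
    # the effective weight becomes W + B @ A.
    full = d_out * d_in                  # params if fine-tuning W directly
    lora = rank * d_in + d_out * rank    # params in the low-rank factors
    return full, lora

# Hypothetical 2048-dim projection layer at rank 16:
full, lora = lora_trainable_params(2048, 2048, 16)
print(full, lora, full // lora)  # 4194304 65536 64 -> 64x fewer trainable params
```

At rank 16 a square 2048-dim layer trains 64x fewer parameters than full fine-tuning, which is the kind of reduction that makes adapting a small language model to video-text supervision cheap.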