ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation

ProfVLM is a lightweight multimodal Vision-Language Model designed to analyze sports performance videos and generate textual proficiency feedback. The system pairs a frozen TimeSformer video encoder with a LoRA-injected SmolLM2 language model to process video-text pairs efficiently. Trained on EgoExo4D with expert commentary, ProfVLM surpasses state-of-the-art methods while using up to 20x fewer parameters and reducing training time by up to 60%.

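The description above implies a standard parameter-efficient design: the video encoder stays frozen, and only small LoRA adapters (plus, typically, a projector mapping video features into the language model's embedding space) are trained. The sketch below illustrates that wiring in plain PyTorch. It is a minimal illustration, not the actual ProfVLM implementation: the module names, feature dimensions, and the manual LoRA layer are all assumptions standing in for TimeSformer and SmolLM2.

```python
# Hypothetical sketch of the frozen-encoder + LoRA wiring described above.
# Dimensions and module names are illustrative, not taken from ProfVLM.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """A frozen linear layer with a trainable low-rank (LoRA) update."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the original weights
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


# Frozen stand-in for the video encoder (TimeSformer in the real model):
encoder = nn.Linear(768, 768)
for p in encoder.parameters():
    p.requires_grad = False

# Trainable projector into the LM embedding space, plus one
# LoRA-wrapped projection standing in for a SmolLM2 layer:
projector = nn.Linear(768, 576)
lm_layer = LoRALinear(nn.Linear(576, 576), r=8)

frames = torch.randn(2, 16, 768)             # (batch, frames, features)
tokens = lm_layer(projector(encoder(frames)))

trainable = sum(p.numel()
                for m in (encoder, projector, lm_layer)
                for p in m.parameters() if p.requires_grad)
total = sum(p.numel()
            for m in (encoder, projector, lm_layer)
            for p in m.parameters())
print(tokens.shape, f"trainable params: {trainable}/{total}")
```

Because the encoder and the base LM weights are frozen, only the projector and the rank-`r` adapter matrices contribute gradients, which is what makes this kind of model cheap to train relative to full fine-tuning.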