VL-JEPA: Vision-Language JEPA
VL-JEPA argues that token-by-token generation is not always necessary for strong multimodal understanding. The system learns continuous representations and decodes to text only when needed, offering more efficient training and inference without a generative-first design; a toy sketch of the embedding-prediction objective follows the benefits list below.
Key Benefits:
- More efficient training and inference
- Strong multimodal understanding without a fully generative approach
- Promise for real-time robot perception and reasoning
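To make the idea concrete, here is a minimal, hypothetical sketch of a JEPA-style vision-language objective in PyTorch: a predictor maps vision features into the text embedding space and is trained with a distance in representation space rather than a token-level generation loss. The `ToyVLJEPA` class, its encoders, and all shapes are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVLJEPA(nn.Module):
    """Toy JEPA-style vision-language model: predict the target text
    embedding from the vision embedding in continuous space."""
    def __init__(self, vocab=1000, dim=256):
        super().__init__()
        # Stand-ins for real encoders (e.g. a ViT and a text transformer).
        self.vision_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim))
        self.text_embed = nn.Embedding(vocab, dim)
        # Predictor maps vision features into the text embedding space.
        self.predictor = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def embedding_loss(self, images, token_ids):
        z_img = self.vision_encoder(images)
        # Mean-pooled text embedding as the target; detached to mimic the
        # stop-gradient on the target branch common in JEPA setups.
        z_txt = self.text_embed(token_ids).mean(dim=1).detach()
        pred = self.predictor(z_img)
        # Cosine distance in representation space: no token generation.
        return 1.0 - F.cosine_similarity(pred, z_txt, dim=-1).mean()

model = ToyVLJEPA()
loss = model.embedding_loss(torch.randn(4, 3, 32, 32),
                            torch.randint(0, 1000, (4, 8)))
loss.backward()
```

In a setup like this, a separate text decoder would be invoked only when an explicit text output is required, keeping token generation out of the main training loop.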
Paper: https://arxiv.org/abs/2512.10942
mini-VLA: Training Vision-Language-Action Models
Keivalya Pandya provides a practical guide to the complete pipeline: data collection, Vision-to-Language-to-Action model design, training, and inference. The post emphasizes that substantial learning is possible without massive computational resources, making it a valuable starting point for building a VLA portfolio; a minimal training-step sketch follows the topic list below.
Topics Covered:
- Data collection strategies
- Vision-to-Language-to-Action design
- Training pipelines
- Inference optimization
- Design lessons and best practices
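As a rough illustration of what such a pipeline boils down to, the hypothetical sketch below performs one behavior-cloning step with a tiny vision-language-to-action model. The `MiniVLA` class, its dimensions, and the 7-DoF action vector are assumptions for illustration, not the post's actual code.

```python
import torch
import torch.nn as nn

class MiniVLA(nn.Module):
    """Minimal vision-language-to-action model: fuse image and
    instruction features, then regress a continuous action vector."""
    def __init__(self, vocab=1000, dim=128, action_dim=7):
        super().__init__()
        self.vision = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, dim))
        self.language = nn.Embedding(vocab, dim)
        self.action_head = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, action_dim))

    def forward(self, image, tokens):
        # Concatenate pooled vision and language features, then predict
        # an action (e.g. end-effector deltas plus a gripper command).
        z = torch.cat([self.vision(image), self.language(tokens).mean(dim=1)], dim=-1)
        return self.action_head(z)

model = MiniVLA()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# One behavior-cloning step on a dummy (image, instruction, action) triple.
image = torch.randn(8, 3, 64, 64)
tokens = torch.randint(0, 1000, (8, 12))
expert_action = torch.randn(8, 7)
loss = nn.functional.mse_loss(model(image, tokens), expert_action)
opt.zero_grad(); loss.backward(); opt.step()
```

The same loop structure scales down to laptop-sized experiments, which is the point the post makes about learning without massive compute.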
G1Pilot: ROS 2 Package for Unitree G1 Humanoid
G1Pilot is a practical ROS 2 toolkit enabling teleoperation, detached control (keeping the native locomotion stack running while controlling the arms separately), and autonomous navigation through Nav2 goal integration.
Features:
- Teleoperation support
- Detached control (separate arm manipulation from locomotion)
- Autonomous navigation via Nav2
- Maintains Unitree's native locomotion stability
The split-control approach allows researchers to focus on manipulation while maintaining stable walking.
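On the Nav2 side, a goal can be sent from Python with the standard nav2_simple_commander API, as in the sketch below; the pose values are placeholders, and how G1Pilot routes such goals to the G1's locomotion is specific to the repository rather than shown here.

```python
import rclpy
from geometry_msgs.msg import PoseStamped
from nav2_simple_commander.robot_navigator import BasicNavigator

rclpy.init()
navigator = BasicNavigator()
navigator.waitUntilNav2Active()  # block until Nav2 lifecycle nodes are up

# Goal pose in the map frame; coordinates are placeholders.
goal = PoseStamped()
goal.header.frame_id = 'map'
goal.header.stamp = navigator.get_clock().now().to_msg()
goal.pose.position.x = 2.0
goal.pose.position.y = 1.0
goal.pose.orientation.w = 1.0

navigator.goToPose(goal)
while not navigator.isTaskComplete():
    feedback = navigator.getFeedback()  # e.g. distance remaining

print(navigator.getResult())
rclpy.shutdown()
```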
Repository: https://github.com/hucebot/g1pilot
NeurIPS 2025 BEHAVIOR-1K Challenge
The BEHAVIOR-1K Challenge featured long-horizon household tasks (cooking, cleaning, multi-step manipulation) with winning solutions from Robot Learning Collective (1st place) and Openpi Comet (2nd place).
Key Theme: System 2 Reasoning
A major theme was the push beyond fast pattern matching toward agents capable of deliberation and reasoning on complex tasks, with direct applications to embodied decision-making in robotics.
Key Takeaway
The converging stack combines:
- Improved world models
- Practical training frameworks
- Scalable synthetic data
- Long-horizon benchmarks
- Enhanced reasoning capabilities
This forms a comprehensive toolkit for generalist robot development.