VL-JEPA: Vision-Language JEPA
VL-JEPA argues that token-by-token generation is not always necessary for strong multimodal understanding. The system learns continuous representations and decodes to text only when needed, offering more efficient training and inference without a generative-first design; a toy sketch of the embedding-prediction objective follows the benefits list below.
Key Benefits:
- More efficient training and inference
- Strong multimodal understanding without a fully generative approach
- Promise for real-time robot perception and reasoning
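To make the idea concrete, here is a minimal, hypothetical sketch of a JEPA-style vision-language objective in PyTorch: a predictor maps vision features into the text embedding space and is trained with a distance in representation space rather than a token-level generation loss. The `ToyVLJEPA` class, its encoders, and all shapes are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVLJEPA(nn.Module):
    """Toy JEPA-style vision-language model: predict the target text
    embedding from the vision embedding in continuous space."""
    def __init__(self, vocab=1000, dim=256):
        super().__init__()
        # Stand-ins for real encoders (e.g. a ViT and a text transformer).
        self.vision_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim))
        self.text_embed = nn.Embedding(vocab, dim)
        # Predictor maps vision features into the text embedding space.
        self.predictor = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def embedding_loss(self, images, token_ids):
        z_img = self.vision_encoder(images)
        # Mean-pooled text embedding as the target; detached to mimic the
        # stop-gradient on the target branch common in JEPA setups.
        z_txt = self.text_embed(token_ids).mean(dim=1).detach()
        pred = self.predictor(z_img)
        # Cosine distance in representation space: no token generation.
        return 1.0 - F.cosine_similarity(pred, z_txt, dim=-1).mean()

model = ToyVLJEPA()
loss = model.embedding_loss(torch.randn(4, 3, 32, 32),
                            torch.randint(0, 1000, (4, 8)))
loss.backward()
```

In a setup like this, a separate text decoder would be invoked only when an explicit text output is required, keeping token generation out of the main training loop.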
Paper: https://arxiv.org/abs/2512.10942
mini-VLA: Training Vision-Language-Action Models
Keivalya Pandya provides a practical guide to the complete pipeline: data collection, Vision-to-Language-to-Action model design, training, and inference. The post emphasizes that substantial learning is possible without massive computational resources, making it a valuable starting point for building a VLA portfolio; a minimal training-step sketch follows the topic list below.
Topics Covered:
- Data collection strategies
- Vision-to-Language-to-Action design
- Training pipelines
- Inference optimization
- Design lessons and best practices
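As a rough illustration of what such a pipeline boils down to, the hypothetical sketch below performs one behavior-cloning step with a tiny vision-language-to-action model. The `MiniVLA` class, its dimensions, and the 7-DoF action vector are assumptions for illustration, not the post's actual code.

```python
import torch
import torch.nn as nn

class MiniVLA(nn.Module):
    """Minimal vision-language-to-action model: fuse image and
    instruction features, then regress a continuous action vector."""
    def __init__(self, vocab=1000, dim=128, action_dim=7):
        super().__init__()
        self.vision = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, dim))
        self.language = nn.Embedding(vocab, dim)
        self.action_head = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, action_dim))

    def forward(self, image, tokens):
        # Concatenate pooled vision and language features, then predict
        # an action (e.g. end-effector deltas plus a gripper command).
        z = torch.cat([self.vision(image), self.language(tokens).mean(dim=1)], dim=-1)
        return self.action_head(z)

model = MiniVLA()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# One behavior-cloning step on a dummy (image, instruction, action) triple.
image = torch.randn(8, 3, 64, 64)
tokens = torch.randint(0, 1000, (8, 12))
expert_action = torch.randn(8, 7)
loss = nn.functional.mse_loss(model(image, tokens), expert_action)
opt.zero_grad(); loss.backward(); opt.step()
```

The same loop structure scales down to laptop-sized experiments, which is the point the post makes about learning without massive compute.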
G1Pilot: ROS 2 Package for Unitree G1 Humanoid
G1Pilot is a practical ROS 2 toolkit enabling teleoperation, detached control (keeping the native locomotion stack running while controlling the arms separately), and autonomous navigation through Nav2 goal integration.
Features:
- Teleoperation support
- Detached control (separate arm manipulation from locomotion)
- Autonomous navigation via Nav2
- Maintains Unitree's native locomotion stability
The split-control approach allows researchers to focus on manipulation while maintaining stable walking.
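On the Nav2 side, a goal can be sent from Python with the standard nav2_simple_commander API, as in the sketch below; the pose values are placeholders, and how G1Pilot routes such goals to the G1's locomotion is specific to the repository rather than shown here.

```python
import rclpy
from geometry_msgs.msg import PoseStamped
from nav2_simple_commander.robot_navigator import BasicNavigator

rclpy.init()
navigator = BasicNavigator()
navigator.waitUntilNav2Active()  # block until Nav2 lifecycle nodes are up

# Goal pose in the map frame; coordinates are placeholders.
goal = PoseStamped()
goal.header.frame_id = 'map'
goal.header.stamp = navigator.get_clock().now().to_msg()
goal.pose.position.x = 2.0
goal.pose.position.y = 1.0
goal.pose.orientation.w = 1.0

navigator.goToPose(goal)
while not navigator.isTaskComplete():
    feedback = navigator.getFeedback()  # e.g. distance remaining

print(navigator.getResult())
rclpy.shutdown()
```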
Repository: https://github.com/hucebot/g1pilot
NeurIPS 2025 BEHAVIOR-1K Challenge
The BEHAVIOR-1K Challenge featured long-horizon household tasks (cooking, cleaning, multi-step manipulation) with winning solutions from Robot Learning Collective (1st place) and Openpi Comet (2nd place).
Key Theme: System 2 Reasoning
A major theme was the push beyond fast pattern matching toward agents capable of deliberation and reasoning on complex tasks, with direct applications to embodied decision-making in robotics.
Key Takeaway
The converging stack combines:
- Improved world models
- Practical training frameworks
- Scalable synthetic data
- Long-horizon benchmarks
- Enhanced reasoning capabilities
This forms a comprehensive toolkit for generalist robot development.