We address language-conditioned robotic manipulation via flow-based trajectory generation, which enables training on human and web videos of object manipulation and requires only minimal embodiment-specific data. This task is challenging because generating object trajectories from pre-manipulation images and natural language instructions requires appropriate instruction-flow alignment.
To tackle this challenge, we propose the flow-based Language Instruction-guided open-Loop ACtion generator (LILAC). This flow-based Vision-Language-Action model (VLA) generates object-centric 2D optical flow from an RGB image and a natural language instruction, and converts the flow into a 6-DoF manipulator trajectory. LILAC incorporates two key components: the Semantic Reconstruction Loss, which strengthens language conditioning to generate instruction-aligned optical flow, and the Prompt-Conditioned Cross-Modal Adapter, which aligns learned visual prompts with image and text features to provide rich cues for flow generation.
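The conversion from object-centric 2D flow to a manipulator trajectory can be illustrated with a minimal sketch: keypoints on the object are advected by the per-step flow fields, and the resulting pixel centroid is back-projected to camera coordinates with a pinhole model. All names, the scalar-depth assumption, and the back-projection scheme are illustrative assumptions, not the paper's actual action de-tokenizer.

```python
import numpy as np

def flow_to_waypoints(keypoints, flows, depth, fx, fy, cx, cy):
    """Advect 2D object keypoints through per-step flow fields and
    back-project the centroid to 3D camera coordinates.
    Illustrative sketch only; assumes a known, constant object depth."""
    pts = keypoints.astype(float)              # (K, 2) pixel positions (x, y)
    waypoints = []
    for flow in flows:                         # flow: (H, W, 2) displacement field
        # sample the flow at (rounded, clipped) keypoint locations
        idx = np.clip(pts.round().astype(int), 0,
                      np.array(flow.shape[:2][::-1]) - 1)
        pts = pts + flow[idx[:, 1], idx[:, 0]]  # advect keypoints by the flow
        u, v = pts.mean(axis=0)                 # object centroid in pixels
        # pinhole back-projection at the assumed object depth
        waypoints.append([(u - cx) * depth / fx,
                          (v - cy) * depth / fy,
                          depth])
    return np.array(waypoints)                  # (T, 3) camera-frame positions
```

A full 6-DoF trajectory would additionally require orientation, which such a sketch could recover, e.g., from the rotation of the keypoint set between steps.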
Experimentally, our method outperformed existing approaches in generated-flow quality across multiple benchmarks. Furthermore, in physical object manipulation experiments with free-form instructions, LILAC achieved a higher task success rate than existing methods.
We introduce LILAC, a novel approach for generating task-relevant motion trajectories from visual observations and natural-language instructions in open-loop settings. LILAC consists of two main modules: a flow generation module and an action de-tokenizer. Our key contributions are (1) the Prompt-Conditioned Cross-Modal Adapter, which dynamically fuses information based on contextual prompts, and (2) the Semantic Reconstruction Loss, which encourages the model to learn meaningful representations of the language instructions.
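The Semantic Reconstruction Loss can be pictured as projecting the fused features back toward the instruction embedding and penalizing cosine dissimilarity. The linear reconstruction head `W` and the cosine formulation below are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def semantic_reconstruction_loss(fused, text_emb, W):
    """Reconstruct the instruction embedding from fused features via a
    linear head W and penalize cosine dissimilarity.
    Illustrative stand-in for the Semantic Reconstruction Loss."""
    recon = fused @ W                                   # linear reconstruction head
    num = float(recon @ text_emb)                       # unnormalized alignment
    den = np.linalg.norm(recon) * np.linalg.norm(text_emb) + 1e-8
    return 1.0 - num / den                              # in [0, 2]; 0 = perfectly aligned
```

Minimizing this term pushes the fused representation to remain predictive of the instruction, which is one plausible way to strengthen language conditioning.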
Fig. 2: Overview of the flow generation module of LILAC. The model receives a single RGB frame and a natural-language instruction, encodes each modality, and fuses them through the Prompt-Conditioned Cross-Modal Adapter. The fused representation is decoded by a transformer to produce a temporally coherent sequence of 3-D end-effector waypoints suitable for direct execution on the robot.
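A generic form of prompt-conditioned cross-modal fusion can be sketched as learned prompt tokens (queries) cross-attending over concatenated image and text tokens (keys/values). The single-head, bias-free attention below is a simplified assumption, not the adapter's exact architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prompt_cross_attention(prompts, img_tokens, txt_tokens):
    """Learned prompt tokens attend over image and text tokens.
    Single-head scaled dot-product attention; a generic sketch of
    prompt-conditioned cross-modal fusion, not the paper's adapter."""
    kv = np.concatenate([img_tokens, txt_tokens], axis=0)  # (Ni+Nt, D) keys/values
    d = prompts.shape[-1]
    attn = softmax(prompts @ kv.T / np.sqrt(d), axis=-1)   # (P, Ni+Nt) weights
    return attn @ kv                                       # (P, D) fused prompt features
```

Each fused prompt is a convex combination of the visual and textual tokens, so the prompts act as a small, fixed-size bottleneck that mixes both modalities before flow decoding.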
Table I: Quantitative comparison on the Fractal and BridgeData V2 subsets of the Robot Flow benchmark. The rows “LILAC w/o srl” and “LILAC w/o vp” show the results of the ablation studies on the Semantic Reconstruction Loss and the visual prompt, respectively. Lower ADE is better; higher AUC and P@K are better. Best values per column are shown in bold.
Fig. 8: Success rates of different methods in the physical experiments. We compared our proposed method against three baseline approaches (Im2Flow2Act, FLIP, and π0) across five manipulation tasks and report the average success rate. Each task was evaluated over 20 trials. Bold numbers indicate the highest success rate for each task.
To appear.