We address the task of generating the 2D flow of a target object and the corresponding end-effector trajectory for robotic manipulation, given an RGB image before manipulation and a language instruction.
Flow-based methods, which predict both 2D flow and trajectories from language and initial images, offer significant advantages: they can adapt to various daily tasks using minimal real-robot data and can be trained on readily available web videos of object manipulation. However, multimodal flow-based methods trained on large-scale datasets remain scarce. Furthermore, many existing approaches employ closed-loop trajectory generation, which requires more demonstration data and suffers from error accumulation.
To overcome these limitations, we propose the flow-based Language Instruction-guided open-Loop ACtion generator (LILAC), which performs offline trajectory generation. LILAC introduces a Semantic Consistency Loss, which strengthens language conditioning to generate instruction-aligned optical flow, and a Prompt-Conditioned Cross-Modal Adapter, which aligns a learned visual prompt with image and text features to provide richer cues for appropriate flow generation.
Experimental results demonstrate that our method generates higher-quality flow than existing approaches and achieves superior task success rates in real-robot object manipulation tasks.
We introduce LILAC, a novel approach for generating task-relevant motion trajectories from visual observations and natural-language instructions in open-loop settings. LILAC consists of two main modules: a flow generation module and an action de-tokenizer. Our key contributions are (1) the Prompt-Conditioned Cross-Modal Adapter, which dynamically fuses information based on contextual prompts, and (2) the Semantic Consistency Loss, which encourages the model to learn meaningful representations of the language instructions.
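As a rough illustration of how such a language-consistency term could be realized (the exact formulation is not given here; the projection head, the pooled frozen text embedding used as the target, and the cosine-similarity form below are all assumptions), consider the following sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticConsistencyLoss(nn.Module):
    """Illustrative sketch only: pull a projection of the fused visual tokens
    toward a pooled embedding of the instruction, so the generated flow stays
    tied to the language input. The head and cosine form are assumptions."""

    def __init__(self, feat_dim: int = 512, text_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(feat_dim, text_dim)

    def forward(self, fused_tokens: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # fused_tokens: (B, N, feat_dim) tokens from the fusion module
        # text_emb:     (B, text_dim) pooled instruction embedding (e.g., from a frozen encoder)
        pooled = fused_tokens.mean(dim=1)                  # (B, feat_dim)
        pred = F.normalize(self.proj(pooled), dim=-1)      # (B, text_dim)
        target = F.normalize(text_emb, dim=-1)             # (B, text_dim)
        return (1.0 - (pred * target).sum(dim=-1)).mean()  # 1 - cosine similarity
```

In this form the loss adds only a single linear head and can be combined with the flow-generation objective through a weighting coefficient.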
Fig. 2: Overview of the flow generation module of LILAC. The model receives a single RGB frame and a natural-language instruction, encodes each modality, and fuses them through the Prompt-Conditioned Cross-Modal Adapter. The fused representation is decoded by a transformer to produce a temporally coherent sequence of 3D end-effector waypoints suitable for direct execution on the robot.
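The pipeline in the caption can be read as the following rough sketch, assuming a learned prompt that cross-attends to image and text tokens, followed by a standard transformer decoder over fixed waypoint queries; module names, dimensions, and the number of waypoints are placeholders rather than the paper's implementation:

```python
import torch
import torch.nn as nn

class PromptConditionedAdapter(nn.Module):
    """Sketch: learned prompt tokens cross-attend to image and text features."""
    def __init__(self, dim: int = 512, num_prompts: int = 8, heads: int = 8):
        super().__init__()
        self.prompt = nn.Parameter(0.02 * torch.randn(num_prompts, dim))
        self.attn_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        # img_tokens: (B, N_img, dim), txt_tokens: (B, N_txt, dim)
        p = self.prompt.unsqueeze(0).expand(img_tokens.size(0), -1, -1)
        p = p + self.attn_img(p, img_tokens, img_tokens)[0]   # prompt attends to image
        p = p + self.attn_txt(p, txt_tokens, txt_tokens)[0]   # prompt attends to text
        return self.norm(p)                                   # (B, P, dim) fused tokens

class OpenLoopWaypointDecoder(nn.Module):
    """Sketch: decode a fixed-length waypoint sequence in a single pass (open loop)."""
    def __init__(self, dim: int = 512, num_waypoints: int = 16, heads: int = 8, layers: int = 4):
        super().__init__()
        self.queries = nn.Parameter(0.02 * torch.randn(num_waypoints, dim))
        layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=layers)
        self.head = nn.Linear(dim, 3)  # (x, y, z) per waypoint

    def forward(self, fused_tokens: torch.Tensor) -> torch.Tensor:
        q = self.queries.unsqueeze(0).expand(fused_tokens.size(0), -1, -1)
        return self.head(self.decoder(q, fused_tokens))  # (B, T, 3) waypoints
```

Because the full waypoint sequence is produced in one forward pass, no intermediate observations are fed back during execution, which is what distinguishes this open-loop setup from closed-loop trajectory generation.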
Table 1: Quantitative comparison of flow-based VLAs on the Fractal and Bridge_v2 subsets of the Robot Flow benchmark. Lower is better for Average Distance Error (ADE \(\downarrow\)); higher is better for Area Under the Curve (AUC) and precision@\(k\) (\(\uparrow\)). Best values per column are highlighted in bold.
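For reference, ADE here denotes the mean Euclidean distance between predicted and ground-truth flow points, averaged over time steps and tracked points; a minimal sketch, where the array shapes are an assumption:

```python
import numpy as np

def average_distance_error(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean Euclidean distance between predicted and ground-truth flow
    trajectories. Both arrays are assumed to have shape (T, N, 2):
    T time steps, N tracked points, 2D image coordinates."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())
```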
Table 6: Success rates (%) on the real-robot evaluation for three manipulation primitives. The highest score in each column is shown in bold.
To appear.