LILAC: Language‑Conditioned Object‑Centric Optical Flow for Open‑Loop Trajectory Generation

Motonari Kambara1, Koki Seno1, Tomoya Kaichi2, Yanan Wang2, Komei Sugiura1

1Keio University, 2KDDI Research Inc.

IEEE RA-L 2026






Fig. 1: Overview of LILAC, a 2D object-centric optical flow-based Vision-and-Language trajectory generation framework. In this figure, ‘Act. DeTokenizer’ denotes the Action De-Tokenizer. Given a natural language instruction, LILAC generates 2D flow from an RGB image and the instruction, and converts the flow into a 6-DoF robot trajectory.

Abstract

We address language-conditioned robotic manipulation using flow-based trajectory generation, which enables training on human and web videos of object manipulation and requires only minimal embodiment-specific data. This task is challenging, as generating object trajectories from pre-manipulation images and natural language instructions requires appropriate instruction-flow alignment.

To tackle this challenge, we propose the flow-based Language Instruction-guided open-Loop ACtion generator (LILAC). This flow-based Vision-Language-Action (VLA) model generates object-centric 2D optical flow from an RGB image and a natural language instruction, and converts the flow into a 6-DoF manipulator trajectory. LILAC incorporates two key components: the Semantic Reconstruction Loss, which strengthens language conditioning to generate instruction-aligned optical flow, and the Prompt-Conditioned Cross-Modal Adapter, which aligns learned visual prompts with image and text features to provide rich cues for flow generation.

Experimentally, our method outperformed existing approaches in generated flow quality across multiple benchmarks. Furthermore, in physical object manipulation experiments using free-form instructions, LILAC demonstrated a superior task success rate compared to existing methods.

Real-World Experiments (8×)

Overview

We introduce LILAC, a novel approach for generating task-relevant motion trajectories from visual observations and natural-language instructions in open-loop settings. LILAC consists of two main modules: a flow generation module and an action de-tokenizer. Our key contributions are (1) the Prompt-Conditioned Cross-Modal Adapter that dynamically fuses information based on contextual prompts, and (2) the Semantic Reconstruction Loss that encourages the model to learn meaningful representations of the language instructions.
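As a rough illustration of how a 2D flow track could be lifted into a camera-frame 3-D path before de-tokenization into a full trajectory, the sketch below back-projects each flow waypoint through a pinhole camera model. This is our own minimal example, not LILAC's released code: the function name `flow_to_waypoints` and the assumptions of a dense depth map and known intrinsics `K` are ours.

```python
import numpy as np

def flow_to_waypoints(flow_uv, depth, K):
    """Back-project 2D flow waypoints (T, 2) in pixel coordinates into
    3-D points in the camera frame, using a depth map and intrinsics K."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    pts = []
    for u, v in flow_uv:
        z = depth[int(v), int(u)]            # depth indexed as (row, col)
        pts.append([(u - cx) * z / fx,       # standard pinhole back-projection
                    (v - cy) * z / fy,
                    z])
    return np.asarray(pts)
```

In a real system the depth would come from a sensor or monocular estimator, and end-effector orientation would be recovered separately to obtain the full 6-DoF trajectory.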



Fig. 2: Overview of the flow generation module of LILAC. The model receives a single RGB frame and a natural-language instruction, encodes each modality, and fuses them through the Prompt-Conditioned Cross-Modal Adapter. The fused representation is decoded by a transformer to produce a temporally coherent sequence of 3-D end-effector waypoints suitable for direct execution on the robot.
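One plausible reading of the prompt-conditioned fusion step is cross-attention in which learned prompt tokens query the concatenated image and text tokens. The sketch below is a simplified single-head version under that assumption; the function name `prompt_conditioned_adapter` and the plain linear projections `Wq`, `Wk`, `Wv` are illustrative, not the paper's exact architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prompt_conditioned_adapter(prompts, img_tokens, txt_tokens, Wq, Wk, Wv):
    """Single-head cross-attention: learned prompt tokens (P, d) attend to
    the concatenated image (Ni, d) and text (Nt, d) tokens."""
    context = np.concatenate([img_tokens, txt_tokens], axis=0)  # (Ni+Nt, d)
    q = prompts @ Wq                                            # (P, d)
    k = context @ Wk                                            # (Ni+Nt, d)
    v = context @ Wv                                            # (Ni+Nt, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)     # (P, Ni+Nt)
    return prompts + attn @ v                                   # residual update
```

The residual update keeps the prompt tokens close to their learned initialization while injecting image- and instruction-specific context.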

  • CORE NOVELTIES:
     
  • 1. We propose LILAC, a Vision-and-Language open-loop trajectory generation pipeline that generates object-centric 2D optical flow from an RGB image and a natural language instruction, and converts it into a 6-DoF end-effector trajectory.
  • 2. We introduce the Prompt-Conditioned Cross-Modal Adapter, which dynamically integrates visual observations, the language instruction, and visual prompts, enabling task-adaptive behavior generation.
  • 3. We introduce the Semantic Reconstruction Loss during training, which explicitly encourages the model to learn semantically meaningful representations of language instructions, leading to improved generalization and better alignment between the command and the resulting trajectory.
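A Semantic Reconstruction Loss of the kind described above could, for instance, penalize the distance between the original instruction embedding and an embedding reconstructed from the fused visuolinguistic features. The sketch below uses a linear decoder `W_dec` and a cosine-distance objective; both choices are our assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def semantic_reconstruction_loss(fused_feat, text_emb, W_dec):
    """Cosine-distance loss between instruction embeddings reconstructed from
    fused features and the original embeddings.
    Shapes: fused_feat (B, d_f), text_emb (B, d_t), W_dec (d_f, d_t)."""
    recon = fused_feat @ W_dec                                   # (B, d_t)
    cos = (recon * text_emb).sum(-1) / (
        np.linalg.norm(recon, axis=-1)
        * np.linalg.norm(text_emb, axis=-1) + 1e-8)
    return float((1.0 - cos).mean())  # -> 0 when reconstruction is perfect
```

Minimizing this term forces the fused representation to retain enough of the instruction's semantics to reconstruct its embedding, which is one way to strengthen language conditioning.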



Results

Qualitative Results


Quantitative Results

Table I: Quantitative comparison on the Fractal and BridgeData V2 subsets of the Robot Flow benchmark. The rows “LILAC w/o srl” and “LILAC w/o vp” show the results of the ablation studies on the Semantic Reconstruction Loss and the visual prompt, respectively. Lower ADE is better; higher AUC and P@K are better. Best values per column are shown in bold.


Fig. 8: Success rates of different methods in physical experiments. We compared our proposed method against three baseline approaches (Im2Flow2Act, FLIP, and π0) across five manipulation tasks and report the average success rate. Each task was evaluated over 20 trials. Bold numbers indicate the highest success rate for each task.



BibTeX


    To appear.