Language-Guided Cognitive Planning with Video Prediction
Vector Institute | May 2021 - Apr 2022 | Research, team of three
We proposed a novel architecture that tackles computational cognitive planning as a video prediction problem, given an initial visual observation and a natural language task description. The architecture consists of two submodules: high-level action planning by a transformer model, and cognitive grounding of the planned actions by an extended video generation network.
My contributions:
- Experimented with the high-level planning submodule and resolved issues such as unstable training by scheduling meetings with the author and reporting progress to teammates in a timely manner
- Improved the generalizability of the grounding submodule to unseen objects by adding a language model as the text encoder, experimenting with two approaches: a BERT model and the CLIP text encoder (see the first sketch after this list)
- Proposed and discussed ideas for real-world datasets, recorded several candidates including 'stack plastic dinnerware', 'assembly kits', and 'spelling words', and examined their feasibility
- Implemented a stochastic CLEVR dataset in Isaac Sim, connected it to the OMPL motion planner, and tested it with different backends such as ROS and PyBullet and motion planners including MoveIt and Riemannian Motion Policies
- Researched and experimented with additional datasets including BEHAVIOR and EPIC-KITCHENS, and implemented a simulation dataset of a robot arm spelling various words on a board
- Implemented the grounding submodule with different encoder-decoder architectures (DCGAN, U-Net, and CrevNet), and resolved training problems such as blurred generated frames and slow convergence through hyperparameter tuning (see the second sketch after this list)
- Implemented evaluation with multiple performance metrics including SSIM, LPIPS, PVQA, and an OCR-based metric, demonstrating significant improvements over video generation baselines and better grounding generalization to unseen objects (see the third sketch after this list)
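
The first sketch below illustrates, at a high level, how a task description could be embedded with either a BERT model or the CLIP text encoder. It is a minimal sketch assuming the Hugging Face `transformers` package and the `bert-base-uncased` and `openai/clip-vit-base-patch32` checkpoints; it is not the project's actual code.

```python
# Minimal sketch: swapping the grounding submodule's text encoder between
# BERT and the CLIP text tower (assumes Hugging Face `transformers`).
import torch
from transformers import BertTokenizer, BertModel, CLIPTokenizer, CLIPTextModel

def encode_task_description(text: str, backend: str = "clip") -> torch.Tensor:
    """Return a fixed-size embedding for a natural language task description."""
    if backend == "bert":
        tok = BertTokenizer.from_pretrained("bert-base-uncased")
        enc = BertModel.from_pretrained("bert-base-uncased")
    else:  # CLIP text encoder
        tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
        enc = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
    inputs = tok(text, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = enc(**inputs)
    # Pooled sentence-level embedding: (768,) for BERT, (512,) for CLIP.
    return out.pooler_output.squeeze(0)

emb = encode_task_description("stack the red block on the blue block")
```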
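The second sketch shows one way a text-conditioned encoder-decoder for the grounding submodule could be wired up, loosely U-Net shaped with a single skip connection. All layer sizes, the 64x64 frame resolution, and the 512-dimensional text embedding are illustrative assumptions, not the architecture used in the project.

```python
# Minimal sketch: a text-conditioned encoder-decoder with one skip connection.
import torch
import torch.nn as nn

class TextConditionedUNet(nn.Module):
    def __init__(self, text_dim: int = 512):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU())   # 64 -> 32
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU())  # 32 -> 16
        self.fuse = nn.Linear(text_dim, 64)  # project text embedding to feature channels
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(128, 32, 4, 2, 1), nn.ReLU())  # 16 -> 32
        self.dec1 = nn.ConvTranspose2d(64, 3, 4, 2, 1)                              # 32 -> 64

    def forward(self, frame: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        h1 = self.enc1(frame)                       # (B, 32, 32, 32)
        h2 = self.enc2(h1)                          # (B, 64, 16, 16)
        t = self.fuse(text_emb)[:, :, None, None]   # (B, 64, 1, 1)
        t = t.expand(-1, -1, h2.shape[2], h2.shape[3])
        d2 = self.dec2(torch.cat([h2, t], dim=1))   # fuse text at the bottleneck
        out = self.dec1(torch.cat([d2, h1], dim=1)) # skip connection from enc1
        return torch.sigmoid(out)                   # predicted frame in [0, 1]

model = TextConditionedUNet()
pred = model(torch.rand(2, 3, 64, 64), torch.rand(2, 512))
```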
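The third sketch computes per-frame SSIM and LPIPS scores, assuming the `scikit-image` and `lpips` packages; PVQA and the OCR-based metric mentioned above are omitted here, and the random test frames are placeholders.

```python
# Minimal sketch: per-frame SSIM and LPIPS evaluation of predicted frames.
import numpy as np
import torch
import lpips
from skimage.metrics import structural_similarity

lpips_fn = lpips.LPIPS(net="alex")  # AlexNet-based perceptual distance

def frame_scores(pred: np.ndarray, target: np.ndarray) -> tuple[float, float]:
    """pred/target: HxWx3 float arrays in [0, 1]. Returns (SSIM, LPIPS)."""
    ssim = structural_similarity(pred, target, channel_axis=-1, data_range=1.0)
    # LPIPS expects (N, 3, H, W) tensors scaled to [-1, 1].
    to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() * 2 - 1
    with torch.no_grad():
        dist = lpips_fn(to_t(pred), to_t(target)).item()
    return ssim, dist

pred = np.random.rand(64, 64, 3).astype(np.float32)
target = np.random.rand(64, 64, 3).astype(np.float32)
print(frame_scores(pred, target))
```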