2024 | FICTION: 4D Future Interaction Prediction from Video

Task

Given an observation of a person performing an activity, the model anticipates all subsequent interactions up to a future time horizon (3 minutes), including their locations and the associated body poses.

Input:
• the observation video up to the current time
• 3D locations
• body pose

Output:
• a set of (3D point, timestamp) pairs such that the person interacts with an object at that 3D point at that timestamp
  • the distribution of likely body poses at each predicted interaction
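To make the I/O concrete, here is a minimal sketch of the task interface; the container names (`ObservationWindow`, `FutureInteraction`) and array shapes are my own guesses, not the paper's.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class ObservationWindow:
    """Everything observed up to the current time (hypothetical container)."""
    video_frames: np.ndarray      # (T, H, W, 3) frames up to the current time
    object_locations: np.ndarray  # (N, 3) 3D locations of observed objects
    body_pose: np.ndarray         # body pose parameters of the actor (e.g. SMPL)

@dataclass
class FutureInteraction:
    """One predicted future interaction within the 3-minute horizon."""
    point_3d: np.ndarray          # (3,) location where the interaction happens
    timestamp: float              # seconds into the future
    pose_samples: np.ndarray      # (K, P) samples from the predicted body-pose distribution

def predict(obs: ObservationWindow) -> List[FutureInteraction]:
    """Placeholder for the model's forward pass."""
    raise NotImplementedError
```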

Method


Video representation

• features: extracted with EgoVLPv2
• embedding: a linear layer maps the video features into the transformer dimension

Pose Representation

• embedding: a linear layer maps the body pose parameters into the transformer dimension

Object Bounding Boxes

Assign to each voxel the index of the object it contains; the actor is also treated as an object.
• embedding: a linear layer maps the per-voxel object indices into the transformer dimension (see the combined sketch below)
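A rough sketch, in PyTorch, of how the three token streams above (video features, body pose, per-voxel object indices) might each be embedded with a linear layer into a shared transformer dimension; all dimensions and module names are assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class TokenEmbedders(nn.Module):
    """Project video, pose, and per-voxel object-index tokens into one shared dimension."""
    def __init__(self, video_feat_dim=768, pose_dim=72, num_objects=100, d_model=256):
        super().__init__()
        self.video_proj = nn.Linear(video_feat_dim, d_model)   # EgoVLPv2 features -> d_model
        self.pose_proj = nn.Linear(pose_dim, d_model)          # pose parameters -> d_model
        # one-hot object index per voxel (the actor counted as an object) -> d_model
        self.object_proj = nn.Linear(num_objects, d_model)

    def forward(self, video_feats, pose, voxel_object_onehot):
        # video_feats: (B, T, video_feat_dim), pose: (B, pose_dim),
        # voxel_object_onehot: (B, V, num_objects)
        v = self.video_proj(video_feats)             # (B, T, d_model)
        p = self.pose_proj(pose).unsqueeze(1)        # (B, 1, d_model)
        o = self.object_proj(voxel_object_onehot)    # (B, V, d_model)
        return torch.cat([v, p, o], dim=1)           # token sequence for the encoder
```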

Multimodal Transformer Encoder

The first output of the transformer is taken as the output representation.
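A minimal encoder sketch. The note does not say whether the "first output" corresponds to a learned summary token or simply the first input token; prepending a learned token here is my assumption, as are the layer counts.

```python
import torch
import torch.nn as nn

class MultimodalEncoder(nn.Module):
    """Standard Transformer encoder over the concatenated multimodal tokens."""
    def __init__(self, d_model=256, nhead=8, num_layers=6):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))   # learned first token (assumption)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, tokens):                                # tokens: (B, L, d_model)
        x = torch.cat([self.cls.expand(tokens.size(0), -1, -1), tokens], dim=1)
        out = self.encoder(x)                                 # (B, L + 1, d_model)
        return out[:, 0]                                      # first output = summary representation
```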

Decoding future interaction location

A simple location decoder (a linear layer) maps the output representation to a vector with one dimension per voxel. A voxel is labeled 1 when the corresponding location has a future interaction and 0 otherwise.
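A sketch of the location decoder and its binary per-voxel target; only the linear layer and the 0/1 labeling are stated in the note, so the BCE objective and the voxel count below are my assumptions.

```python
import torch
import torch.nn as nn

class LocationDecoder(nn.Module):
    """Linear map from the summary representation to one logit per voxel."""
    def __init__(self, d_model=256, num_voxels=4096):
        super().__init__()
        self.fc = nn.Linear(d_model, num_voxels)

    def forward(self, summary):                 # summary: (B, d_model)
        return self.fc(summary)                 # (B, num_voxels) logits

# target[b, v] = 1 if voxel v has a future interaction, else 0
decoder = LocationDecoder()
logits = decoder(torch.randn(2, 256))
target = torch.zeros(2, 4096)
loss = nn.functional.binary_cross_entropy_with_logits(logits, target)  # assumed objective
```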

CVAE for pose distribution

The predicted interaction location serves as a location query, which is mapped into the conditioning input of the CVAE.
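A compact conditional-VAE sketch for the pose head, conditioned on the location query; the latent size, the MLP shapes, and the exact conditioning scheme are assumptions.

```python
import torch
import torch.nn as nn

class PoseCVAE(nn.Module):
    """Encoder q(z | pose, cond) and decoder p(pose | z, cond); cond comes from the location query."""
    def __init__(self, pose_dim=72, cond_dim=256, latent_dim=32):
        super().__init__()
        self.cond_proj = nn.Linear(3, cond_dim)                     # 3D location query -> condition
        self.enc = nn.Linear(pose_dim + cond_dim, 2 * latent_dim)   # -> (mu, logvar)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, 256), nn.ReLU(), nn.Linear(256, pose_dim)
        )
        self.latent_dim = latent_dim

    def forward(self, gt_pose, location):
        cond = self.cond_proj(location)
        mu, logvar = self.enc(torch.cat([gt_pose, cond], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()        # reparameterization trick
        return self.dec(torch.cat([z, cond], dim=-1)), mu, logvar

    def sample(self, location, num_samples=10):
        cond = self.cond_proj(location)                             # (B, cond_dim)
        z = torch.randn(num_samples, cond.size(0), self.latent_dim)
        cond = cond.unsqueeze(0).expand(num_samples, -1, -1)
        return self.dec(torch.cat([z, cond], dim=-1))               # (num_samples, B, pose_dim)
```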

Training

During training, the CVAE encoder additionally takes the desired ground-truth body pose as input.

Loss Function

1. MSE loss between the predicted SMPL parameters and the ground-truth parameters
2. Convert the SMPL parameters to 3D body joints and compute the joint error
3. KL divergence between the predicted latent distribution and the prior
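Combining the three terms as I read them; `smpl_joints` stands in for a SMPL forward-kinematics call (e.g. from the smplx package) and the loss weights are placeholders, not values from the paper.

```python
import torch
import torch.nn.functional as F

def pose_loss(pred_smpl, gt_smpl, mu, logvar, smpl_joints,
              w_param=1.0, w_joint=1.0, w_kl=1e-3):
    """Three terms: SMPL-parameter MSE, 3D joint error, KL to a standard normal prior."""
    l_param = F.mse_loss(pred_smpl, gt_smpl)                            # (1) parameter MSE
    l_joint = F.mse_loss(smpl_joints(pred_smpl), smpl_joints(gt_smpl))  # (2) 3D joint error
    # (3) closed-form KL( N(mu, sigma^2) || N(0, I) )
    l_kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return w_param * l_param + w_joint * l_joint + w_kl * l_kl
```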

Inference

Sample multiple latent codes and decode each one into a body pose.
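Inference then amounts to drawing several latent codes and decoding each, e.g. with the hypothetical PoseCVAE sketched above:

```python
import torch

# `PoseCVAE` is the hypothetical module sketched above; `location` is one predicted
# interaction point of shape (B, 3).
model = PoseCVAE()
location = torch.randn(1, 3)
with torch.no_grad():
    pose_hypotheses = model.sample(location, num_samples=20)  # (20, 1, pose_dim)
```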

Dataset Construction

3D object bounding boxes

1. Run object segmentation on the 2D video
2. Use the mapping from video pixels to 3D locations
3. Obtain 3D bounding boxes
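A sketch of steps 2-3, assuming a per-pixel 3D point map (e.g. from depth and camera pose) is available; the axis-aligned box is a simplification on my part.

```python
import numpy as np

def mask_to_3d_bbox(mask: np.ndarray, points_3d: np.ndarray):
    """mask: (H, W) boolean object mask; points_3d: (H, W, 3) pixel-to-3D mapping.
    Returns an axis-aligned 3D bounding box (min_xyz, max_xyz) for the object."""
    obj_points = points_3d[mask]                 # (M, 3) 3D points belonging to the object
    if obj_points.size == 0:
        return None
    return obj_points.min(axis=0), obj_points.max(axis=0)
```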

body pose

Extract body poses from the exocentric (exo) video.

interaction

Use the annotated narrations.
1. Use Llama-3.1-8B to classify each narration as either a touch-based interaction or a non-touch interaction
2. Match the object mentioned in the narration to the object-detection vocabulary
3. Use the narration timestamps to place each interaction in time
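A rough sketch of the narration-labeling step; the prompt wording, the instruct model variant, and the string-matching heuristic are my own illustrations, not the paper's.

```python
from transformers import pipeline

# Hypothetical prompt; the paper's exact prompt to Llama-3.1-8B is not given in my notes.
PROMPT = (
    "Narration: '{narration}'\n"
    "Does this describe a touch-based interaction with an object? Answer 'touch' or 'non-touch'."
)

generator = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")

def classify_narration(narration: str) -> str:
    """Label a narration as 'touch' or 'non-touch' using the LLM's short answer."""
    prompt = PROMPT.format(narration=narration)
    out = generator(prompt, max_new_tokens=5)[0]["generated_text"]
    answer = out[len(prompt):].strip().lower()
    return "non-touch" if answer.startswith("non") else "touch"

def match_object(narration: str, vocabulary):
    """Naive matching of the object mentioned in the narration against the detection vocabulary."""
    for name in vocabulary:
        if name in narration.lower():
            return name
    return None
```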
       