September 2021
tl;dr: Having a tracker in the loop improves perception and prediction. Track-level features are important for long-term trajectory prediction.
MOT has two challenges: the discrete problem of data association and the continuous problem of trajectory estimation.
Previous joint perception and prediction methods use tracking only as postprocessing, so the full temporal history contained in tracks is not used by detection or prediction. They usually limit the input to 3 time steps instead of a long-term trajectory, and their performance usually saturates with less than 1 second of sensor data.
PnPNet includes a tracker in the loop and thus can be trained end to end. The Hungarian matching cost function is learnable.
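
To make the matching step concrete, here is a minimal sketch of assigning detections to tracks by minimizing a total cost. Everything in it is an assumption for illustration: `affinity_cost`, its weights `w_dist` and `w_head`, and the dict layout of tracks and detections are hypothetical and are not the paper's learned cost. The brute-force search over permutations stands in for the Hungarian algorithm; it returns the same optimal assignment but is only practical for a handful of objects.

```python
import math
from itertools import permutations


def affinity_cost(track, det, w_dist=1.0, w_head=0.5):
    """Hypothetical matching cost: weighted BEV distance plus heading difference.

    In a learnable setup, the weights (or a small network replacing this
    function) would be trained end to end instead of being hand-set.
    """
    dist = math.hypot(track["x"] - det["x"], track["y"] - det["y"])
    dyaw = track["yaw"] - det["yaw"]
    # Wrap the heading difference into [-pi, pi] before taking its magnitude.
    dyaw = abs(math.atan2(math.sin(dyaw), math.cos(dyaw)))
    return w_dist * dist + w_head * dyaw


def match_tracks_to_detections(tracks, dets):
    """Return ([(track_idx, det_idx), ...], total_cost) with minimal total cost.

    Brute-force stand-in for the Hungarian algorithm; needs len(tracks) <= len(dets).
    """
    n, m = len(tracks), len(dets)
    if n > m:
        raise ValueError("this sketch assumes at most as many tracks as detections")
    cost = [[affinity_cost(t, d) for d in dets] for t in tracks]
    best_perm, best_total = None, math.inf
    for perm in permutations(range(m), n):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return [(i, best_perm[i]) for i in range(n)], best_total


tracks = [{"x": 0.0, "y": 0.0, "yaw": 0.0}, {"x": 10.0, "y": 0.0, "yaw": 0.0}]
dets = [
    {"x": 10.5, "y": 0.2, "yaw": 0.0},
    {"x": 0.3, "y": -0.1, "yaw": 0.05},
    {"x": 30.0, "y": 5.0, "yaw": 1.0},
]
pairs, total = match_tracks_to_detections(tracks, dets)
print(pairs)
```

Here track 0 is paired with detection 1 and track 1 with detection 0; the unmatched detection 2 would be a candidate for a new track.
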
- Tracker in the loop to leverage long-term track level information.
- Input: multiple sweeps concatenated along the height dimension, with the previous sweeps compensated for ego motion, similar to IntentNet.
- Trajectory-level object representation
- Start from a memory of BEV feature maps
- Based on the object location, feature maps are rotated-RoI-pooled from the BEV feature map
- Velocity is obtained by finite difference.
- An LSTM to mine features in the track, $h(P^t_j)$.
- Motion forecasting is based on the track feature $h(P^t_j)$. This is different from FaF and IntentNet. Exploiting motion from explicit object trajectories is more accurate than inferring motion from the features computed from the raw sensor data.
- Prediction = Motion Forecasting.
- Track history is 16 frames @ 10 Hz (1.6 seconds), longer than 1 second.
- The RRoIAlign is also mentioned in Spatially-Aware Graph Neural Networks for Relational Behavior Forecasting from Sensor Data ICRA 2020. See youtube video for a good illustration.
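
The finite-difference velocity step in the list above can be sketched as follows. This is only an illustration under stated assumptions: the function name `finite_difference_velocity` is hypothetical, positions are assumed to be BEV centers already expressed in one common frame (ego motion compensated), and the first frame, which has no predecessor, is given zero velocity. The 0.1 s step matches the 10 Hz rate mentioned above.

```python
def finite_difference_velocity(positions, dt=0.1):
    """Backward-difference velocities for a track.

    positions: list of (x, y) BEV centers, oldest first, in a common frame.
    dt: time between frames in seconds (0.1 s for 10 Hz).
    Returns a list of (vx, vy) with the same length as positions.
    The first entry is (0.0, 0.0) by assumption, since it has no previous frame.
    """
    if dt <= 0:
        raise ValueError("dt must be positive")
    velocities = [(0.0, 0.0)] if positions else []
    for (x_prev, y_prev), (x_cur, y_cur) in zip(positions, positions[1:]):
        velocities.append(((x_cur - x_prev) / dt, (y_cur - y_prev) / dt))
    return velocities


# A 16-frame history at 10 Hz of an object moving 1 m per frame along x.
history = [(float(i), 0.0) for i in range(16)]
track_velocities = finite_difference_velocity(history, dt=0.1)
print(len(track_velocities), track_velocities[-1])
```

The per-frame velocities, together with the pooled BEV features, would then form the sequence fed to the track-level LSTM.
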