3D Crowd Counting via Geometric Attention-guided Multi-View Fusion

[Submitted on 18 Mar 2020 (v1), last revised 30 Oct 2024 (this version, v2)]

Authors: Qi Zhang and Antoni B. Chan

Abstract: Recently, multi-view crowd counting using deep neural networks has been proposed to enable counting in large and wide scenes using multiple cameras. Current methods project the camera-view features to the average-height plane of the 3D world, and then fuse the projected multi-view features to predict a 2D scene-level density map on the ground (i.e., a bird's-eye view). Unlike previous work, we consider the variable height of people in the 3D world and propose to solve the multi-view crowd counting task through 3D feature fusion with 3D scene-level density maps, instead of a 2D density map on the ground plane. Compared to 2D fusion, 3D fusion extracts more information about the people along the z-dimension (height), which helps to address the scale variations across multiple views. The 3D density maps preserve the property of 2D density maps that the sum is the count, while also providing 3D information about the crowd density. Furthermore, instead of using the standard method of copying the features along the view ray in the 2D-to-3D projection, we propose an attention module based on a height estimation network, which forces each 2D pixel to be projected to one 3D voxel along the view ray. We also explore the projection consistency between the 3D prediction and the ground truth in the 2D views to further enhance the counting performance. The proposed method is tested on synthetic and real-world multi-view counting datasets and achieves counting performance better than or comparable to the state of the art.
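
The sum-to-count property mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes NumPy, a toy voxel grid, and one unit-mass Gaussian blob per person, and the names `gaussian_blob_3d`, `density_map_3d`, `GRID_SHAPE`, and `HEADS` are hypothetical. It shows only that a 3D density map built this way sums to the number of people, and that summing it over the height axis gives a ground-plane map with the same total.

```python
import numpy as np


def gaussian_blob_3d(shape, center, sigma=1.0):
    """Unit-mass Gaussian blob on a (z, y, x) voxel grid (hypothetical helper)."""
    zz, yy, xx = np.meshgrid(
        np.arange(shape[0]),
        np.arange(shape[1]),
        np.arange(shape[2]),
        indexing="ij",
    )
    cz, cy, cx = center
    dist_sq = (zz - cz) ** 2 + (yy - cy) ** 2 + (xx - cx) ** 2
    blob = np.exp(-dist_sq / (2.0 * sigma ** 2))
    # Normalise on the grid so that each person contributes a mass of 1.
    return blob / blob.sum()


def density_map_3d(shape, heads, sigma=1.0):
    """Sum one unit-mass blob per head position (assumed ground-truth format)."""
    density = np.zeros(shape, dtype=np.float64)
    for head in heads:
        density += gaussian_blob_3d(shape, head, sigma)
    return density


# Toy scene (assumption): three heads at different heights, as (z, y, x) indices.
GRID_SHAPE = (4, 16, 16)
HEADS = [(1, 3, 4), (2, 8, 8), (3, 12, 5)]

density_3d = density_map_3d(GRID_SHAPE, HEADS)
# Collapsing the height axis yields a 2D ground-plane (bird's-eye) density map.
density_2d = density_3d.sum(axis=0)

count_3d = float(density_3d.sum())
count_2d = float(density_2d.sum())
print(f"3D count: {count_3d:.3f}, 2D count: {count_2d:.3f}")
```

Both totals equal the number of head positions because every blob is normalised on the grid before being added, so the 3D map keeps the counting property while also recording where along the height axis the density lies.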