UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Socioeconomic Indicator Prediction
by Xixuan Hao and 6 other authors
Abstract: Urban socioeconomic indicator prediction aims to infer metrics related to sustainable development across diverse urban landscapes using data-driven methods. However, prevalent pretrained models, particularly those reliant on satellite imagery, face two challenges. First, focusing solely on macro-level patterns from satellite data can introduce bias, as it misses micro-level details such as the architecture of individual places. Second, the text generated by the precursor work UrbanCLIP, which draws on the extensive knowledge of LLMs, frequently suffers from hallucination and homogenization, so its quality is unreliable. To address these issues, we devise a novel framework, UrbanVLP, based on vision-language pretraining. UrbanVLP seamlessly integrates multi-granularity information from both macro (satellite) and micro (street-view) levels, overcoming the limitations of prior pretrained models. It also introduces automatic text generation and calibration, providing a robust guarantee of high-quality text descriptions of urban imagery. Rigorous experiments across six socioeconomic indicator prediction tasks demonstrate its superior performance.
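To make the multi-granularity pretraining idea concrete, below is a minimal PyTorch sketch of a CLIP-style setup that fuses a macro-level satellite embedding with pooled micro-level street-view embeddings and aligns the result with a text embedding via a symmetric contrastive loss. Everything here is an illustrative assumption: the class and function names (`MultiGranularityEncoder`, `clip_loss`), the toy convolutional backbones, the mean-pooling over street views, and the embedding dimension are all hypothetical stand-ins, not the paper's actual architecture, which also includes the text generation and calibration pipeline not shown here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiGranularityEncoder(nn.Module):
    """Hypothetical encoder: fuses a macro (satellite) embedding with
    pooled micro (street-view) embeddings into one visual representation."""
    def __init__(self, dim=256):
        super().__init__()
        # Toy CNN backbones; a real system would use pretrained encoders.
        self.sat_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim))
        self.sv_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim))
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, sat, street_views):
        # sat: (B, 3, H, W); street_views: (B, K, 3, H, W)
        b, k = street_views.shape[:2]
        sat_emb = self.sat_encoder(sat)
        # Encode each street view, then mean-pool over the K views.
        sv_emb = self.sv_encoder(street_views.flatten(0, 1)).view(b, k, -1).mean(1)
        return self.fuse(torch.cat([sat_emb, sv_emb], dim=-1))

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning image and text embeddings."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature
    labels = torch.arange(len(logits))
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

# Toy forward pass: a batch of 4 regions, 5 street views each.
enc = MultiGranularityEncoder()
sat = torch.randn(4, 3, 64, 64)
views = torch.randn(4, 5, 3, 64, 64)
txt = torch.randn(4, 256)  # stand-in for a text encoder's output
loss = clip_loss(enc(sat, views), txt)
```

The design choice the sketch illustrates is that macro and micro views enter through separate encoders and are fused before the contrastive alignment, so the text supervision acts on a representation that already carries both granularities.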
Submission history
From: Yuxuan Liang
[v1] Mon, 25 Mar 2024 14:57:18 UTC (5,761 KB)
[v2] Wed, 29 May 2024 06:11:30 UTC (14,120 KB)
[v3] Wed, 22 Jan 2025 08:45:56 UTC (12,782 KB)