While land use classification and mapping based on visual interpretation of aerial images have been studied extensively for decades, overhead imagery often fails to determine land use accurately in complex urban areas (e.g., a building that serves multiple functions). By contrast, images taken at ground level (e.g., street view images) are more fine-grained and informative for mixed land use detection. Because land use categories are often used to describe urban images, mixed land use detection can be regarded as a Natural Language for Visual Reasoning (NLVR) problem. Accordingly, this study develops a vision-language multimodal learning model for mixed land use detection from street view images. The model builds on the contrastive language-image pre-training (CLIP) model and is further tailored through two procedures: 1) prompt tuning on CLIP, which not only learns visual features from street view images but also encodes land use labels as textual features and fuses them with the visual ones; and 2) computing a Diversity Index (DI) from the fused visual and textual features, which is used to estimate the degree of land use mixture for each image. Our experiments demonstrate that leveraging street view images alone, together with tailored prompt engineering, is effective for mixed land use detection, achieving 71% to 84% agreement between the predicted labels and those from OpenStreetMap. Moreover, a land-use map in which mixture is represented as probabilities over land-use types is produced, paving the way for fine-grained land-use mapping in urban areas with heterogeneous functionalities.
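To make the pipeline concrete, the sketch below illustrates the zero-shot core of such an approach using the open-source OpenAI CLIP implementation: a street view image and a set of land use label prompts are encoded, their similarities are softmax-normalized into per-label probabilities, and a Shannon-style diversity index is computed over those probabilities. The label list, the prompt template, and the use of Shannon entropy as the DI are illustrative assumptions, and the hand-crafted prompts stand in for the learnable prompt tuning described above rather than reproducing the paper's method.

```python
import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical land use label set; the paper's actual labels may differ.
labels = ["residential", "commercial", "industrial", "recreational", "institutional"]
# Hand-crafted prompt template used here in place of learned prompt vectors.
prompts = clip.tokenize([f"a street view photo of a {l} area" for l in labels]).to(device)

image = preprocess(Image.open("street_view.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(prompts)
    # Cosine similarity via L2-normalized features, then softmax over labels.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1).squeeze(0)

# Shannon-style diversity index over per-label probabilities:
# higher values indicate a more mixed land use scene.
di = -(probs * torch.log(probs + 1e-12)).sum().item()
print({l: round(p.item(), 3) for l, p in zip(labels, probs)}, "DI =", round(di, 3))
```

In this sketch the per-label probabilities double as the mixture representation used for mapping, and the DI summarizes how evenly they are spread across land-use types.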