GeoAdvances 2025 – 10th International Conference on GeoInformation Advances, Marrakech, Morocco, 29-30 May 2025, vol. 17, pp. 279-283 (Full-Text Paper)
Reliable semantics in 3D building models support practical urban tasks such as planning, asset inventory, and maintenance. This paper presents an approach that pairs graph-based geometric reasoning with image-based appearance analysis to improve the segmentation of building components. A Graph Neural Network (GNN) is first applied to the building mesh to capture structural cues and produce initial labels. Multi-view 2D projections (orthographic and perspective) of the model are then rendered and processed with a Vision Transformer (ViT) to recover the visual patterns that distinguish windows, doors, roofs, and walls. The two streams are reconciled through a simple consensus fusion step that projects the ViT predictions back onto the 3D geometry and refines the initial GNN labels. In experiments, the proposed pipeline improves accuracy and class-wise consistency over a GNN-only baseline, with the clearest gains on small or visually ambiguous elements.
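The abstract does not spell out the fusion rule, so the following is only a minimal sketch of what a consensus fusion of this kind could look like, not the authors' implementation. It assumes that each rendered view comes with a per-pixel face-index map (with -1 marking background), that both streams output class probabilities, and that the streams are combined by a weighted average. The function name `consensus_fusion` and the parameter `gnn_weight` are hypothetical.

```python
import numpy as np

def consensus_fusion(gnn_probs, view_probs, view_face_ids, gnn_weight=0.5):
    """Fuse per-face GNN probabilities with back-projected ViT predictions.

    gnn_probs:     (F, C) class probabilities per mesh face (geometry stream).
    view_probs:    list of (P, C) arrays, per-pixel class probabilities per view.
    view_face_ids: list of (P,) int arrays, the face visible at each pixel,
                   or -1 for background (assumed to come from the renderer).
    gnn_weight:    hypothetical mixing weight between the two streams.
    Returns an (F,) array of fused class labels.
    """
    gnn_probs = np.asarray(gnn_probs, dtype=float)
    num_faces, num_classes = gnn_probs.shape
    vit_sum = np.zeros((num_faces, num_classes))
    vit_count = np.zeros(num_faces)

    # Back-project: accumulate every visible pixel's prediction on its face.
    for probs, face_ids in zip(view_probs, view_face_ids):
        probs = np.asarray(probs, dtype=float)
        face_ids = np.asarray(face_ids)
        visible = face_ids >= 0
        np.add.at(vit_sum, face_ids[visible], probs[visible])
        np.add.at(vit_count, face_ids[visible], 1)

    # Faces never seen in any view keep their GNN prediction unchanged.
    seen = vit_count > 0
    fused = gnn_probs.copy()
    vit_mean = vit_sum[seen] / vit_count[seen, None]
    fused[seen] = gnn_weight * gnn_probs[seen] + (1.0 - gnn_weight) * vit_mean
    return fused.argmax(axis=1)
```

In this sketch a face occluded in every view falls back to the geometry stream, which is one simple way to keep the labeling complete. A learned or confidence-weighted fusion would be an equally plausible reading of the abstract.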