V-GameGym: Visual Game Generation for Code Large Language Models

Wei Zhang1, Jack Yang, Renshuai Tao3, Lingzheng Chai, Shawn Guo, Jiajun Wu,
Xiaoming Chen4, Ganqu Cui1*, Ning Ding1, Xander Xu2, Hu Wei2*, Bowen Zhou1*

1Shanghai AI Lab; 2Alibaba Group; 3Beijing Jiaotong University; 4AIStrong


Abstract

Code large language models have demonstrated remarkable capabilities in programming tasks, yet current benchmarks focus primarily on a single modality and neglect visual game development. Most existing code benchmarks evaluate syntax correctness and execution accuracy, overlooking game-specific metrics such as playability, visual aesthetics, and user engagement that are essential for real-world deployment. To bridge the gap between what current LLMs achieve in algorithmic problem-solving and competitive programming and what practical game development demands, we present V-GameGym, a comprehensive benchmark of 2219 high-quality samples spanning 100 thematic clusters derived from real-world repositories, curated with a novel clustering-based methodology that ensures both diversity and structural completeness. We further introduce a multimodal evaluation framework with an automated, LLM-driven pipeline for visual code synthesis in complete UI sandbox environments. Our extensive analysis shows that V-GameGym effectively bridges the gap between code generation accuracy and practical game development workflows, providing quantifiable quality metrics for visual programming and interactive element generation.
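The clustering-based curation described above can be sketched in miniature as follows. This is an illustrative assumption, not the authors' actual pipeline: the toy k-means, the feature vectors, and the `curate` helper are all hypothetical, standing in for whatever embedding and clustering machinery the real curation uses to group repository samples into thematic clusters and keep representative ones.

```python
import random

def kmeans(vectors, k, iters=20, seed=0):
    """Toy k-means: group samples (as numeric feature vectors) into k clusters."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)
    assign = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: each sample goes to its nearest center
        # (squared Euclidean distance).
        for i, v in enumerate(vectors):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centers[c])),
            )
        # Update step: recompute each center as the mean of its members.
        for c in range(k):
            members = [vectors[i] for i in range(len(vectors)) if assign[i] == c]
            if members:
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assign, centers

def curate(vectors, k, per_cluster=1):
    """Keep up to `per_cluster` samples nearest each cluster center,
    so the curated set stays diverse across clusters."""
    assign, centers = kmeans(vectors, k)
    kept = []
    for c in range(k):
        members = [i for i, a in enumerate(assign) if a == c]
        members.sort(
            key=lambda i: sum((a - b) ** 2 for a, b in zip(vectors[i], centers[c]))
        )
        kept.extend(members[:per_cluster])
    return sorted(kept)
```

With 100 clusters and a few representatives per cluster, a scheme of this shape yields a sample set that covers many game themes rather than over-representing the most common ones.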

1. Introduction

Recent advances in code large language models (code LLMs) have demonstrated remarkable capabilities in programming tasks. Foundational models such as Qwen-Coder, StarCoder, and DeepSeek-Coder have established strong baselines for code generation, completion, and understanding. These LLMs adopt specialized training strategies: pre-training on large code corpora from repositories such as GitHub, followed by post-training to align outputs with programming best practices.

Figure 1: A visual programming example: a Flappy Bird-style arcade game.
Figure 2: Overview of the V-GameGym framework from data collection to evaluation.

Unlike benchmarks built around algorithmic problems, visual programming offers a more intuitive demonstration of model performance. However, existing approaches primarily evaluate code generation accuracy and syntax correctness, overlooking critical game-specific metrics such as playability, visual aesthetics, user engagement, and performance optimization. The absence of comprehensive evaluation frameworks and targeted improvement methodologies limits the practical deployment of code LLMs in professional game development workflows.

Figure 3: Correlation between model size and games solved.

2. Leaderboard

Our comprehensive evaluation covers 70 state-of-the-art code large language models across different sizes and architectures. The evaluation assesses models on multiple dimensions including code generation quality, visual output, and dynamic gameplay elements.


The leaderboard includes detailed performance metrics, interactive filtering, and comprehensive analysis of model capabilities across the V-GameGym benchmark. Models are evaluated on:

  • Code Quality: Syntax correctness, structure, and game logic implementation
  • Visual Output: Screenshot analysis of generated games
  • Dynamic Behavior: Video analysis of gameplay mechanics
  • Overall Performance: Composite scores across all evaluation dimensions
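The overall score described above can be illustrated as a weighted average over the per-dimension scores. This is a minimal sketch under stated assumptions: the dimension names, weights, and the equal-weighting default are hypothetical, not the paper's actual scoring formula.

```python
def composite_score(scores, weights=None):
    """Combine per-dimension scores (each in [0, 1]) into one composite number.

    `scores` maps dimension name -> score; `weights` maps dimension name ->
    non-negative weight. If no weights are given, all dimensions count equally.
    """
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total

# Hypothetical per-dimension scores for one generated game:
example = {"code_quality": 0.9, "visual_output": 0.6, "dynamic_behavior": 0.4}
overall = composite_score(example)  # equal weighting across the three dimensions
```

A weighted scheme like this makes the trade-off explicit: a model that produces syntactically flawless code but a blank screen is penalized on the visual and dynamic dimensions rather than scoring near-perfect overall.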

3. Conclusion

We introduce V-GameGym, a multimodal benchmark for evaluating code LLMs in visual game generation. Built by curating 2219 high-quality Pygame samples, our framework assesses both code generation and visual capabilities. Our evaluation of 70 models reveals a significant performance gap between proprietary and open-source models, with the best models solving only about 45% of tasks. The benchmark highlights critical limitations in visual understanding and dynamic gameplay generation, providing a foundation for advancing AI-assisted game development.

This work contributes to the broader understanding of code LLMs' capabilities and limitations, particularly in creative and visual programming domains. V-GameGym serves as both an evaluation tool and a dataset for future research in multimodal code generation, game development automation, and AI-assisted creative programming.

@misc{zhang2025vgamegymvisualgamegeneration,
    title={V-GameGym: Visual Game Generation for Code Large Language Models}, 
    author={Wei Zhang and Jack Yang and Renshuai Tao and Lingzheng Chai and Shawn Guo 
        and Jiajun Wu and Xiaoming Chen and Ganqu Cui and Ning Ding and Xander Xu and Hu Wei and Bowen Zhou},
    year={2025},
    eprint={2509.20136},
    archivePrefix={arXiv},
    primaryClass={cs.SE},
    url={https://arxiv.org/abs/2509.20136}, 
}