HackerRank’s ASTRA benchmark is composed of multi-file, project-based problems designed to closely mimic real-world coding tasks. The objective is to evaluate the capabilities of advanced AI models across the entire SDLC. The initial release (v1) is primarily composed of frontend development problems and includes frameworks such as Node.js, React.js, Angular.js, Django, Java Spring Boot, Ruby on Rails, and .NET. In v1, the evaluation focuses exclusively on the model's ability to perform new feature development, assessed purely through code generation tasks. Both the input and output in the evaluation framework are text-based. The primary emphasis is on the correctness and consistency of the models, as these are fundamental to real-world applications. Evaluation metrics include average score and average pass@1, with the consistency (median standard deviation) considered as an additional reference.
Features of the HackerRank ASTRA Benchmark:
Based on the analysis of average scores, the models o1, o1-preview, and Claude-3.5-Sonnet-1022 demonstrate superior performance on multi-file, real-world front-end coding tasks. However, due to the high variance in average scores across the 65 questions, a paired t-test reveals that, with the exception of GPT-4o-0513, the differences between model performances are not statistically significant. Despite this, the differences in average score at k=32 still suggest a meaningful practical impact in real-world production settings. Similar trends were observed when evaluating the models using the average pass@1 metric.
In our benchmark evaluation, we assessed the consistency of LLMs using the standard deviation (SD) of their scores across 32 independent runs per question, and then took the median SD across the 65 questions. The models demonstrated varying levels of performance stability, with Claude-3.5-Sonnet-1022 exhibiting the lowest variability (SD = 0.0497), indicating the highest consistency across problems. The differences between Claude-3.5-Sonnet-1022 and the remaining models are statistically significant based on a paired t-test.
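To make the comparison concrete, the sketch below shows how a paired t statistic can be computed over per-question average scores for two models. It is a minimal sketch, assuming each model's results are available as an array of 65 per-question averages; it is not the benchmark's actual analysis code, and the final p-value step (which requires a t-distribution CDF) is omitted.

```javascript
// Minimal sketch of a paired t-test over per-question average scores.
// Assumes scoresA and scoresB are same-length arrays (one average score per
// question for each model); not the benchmark's actual analysis code.
function pairedTStatistic(scoresA, scoresB) {
  const n = scoresA.length;
  const diffs = scoresA.map((a, i) => a - scoresB[i]);
  const meanDiff = diffs.reduce((sum, d) => sum + d, 0) / n;
  const variance = diffs.reduce((sum, d) => sum + (d - meanDiff) ** 2, 0) / (n - 1);
  const stdError = Math.sqrt(variance / n);
  // Compare the returned t value against a t-distribution with n - 1 degrees
  // of freedom (64 for 65 questions) to obtain a p-value.
  return meanDiff / stdError;
}
```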
The v1 ASTRA Benchmark Dataset comprises 65 project-based coding questions, systematically categorized into 10 primary coding skill domains and 34 subcategories.
The key statistics are summarized in the following table:
Here is an example of a RESTful API project from the ASTRA Benchmark Dataset. The task involves developing a RESTful API for managing product records using Node.js and Express, reflecting a common real-world e-commerce development scenario. The project structure is depicted in the following screenshot:
The API includes the following endpoints:
Modifications or deletions via PUT and DELETE methods are explicitly disallowed, aligning with specific business requirements. The implementation mandates a modular code structure, with separate files for routes, controllers, and database interactions. Candidates are required to implement robust error handling and adhere to business logic, such as ensuring products are only published if they satisfy predefined criteria.
This task reflects challenges encountered in building production-grade APIs, such as:
Additionally, the problem emphasizes practical concerns like returning appropriate HTTP status codes, handling error responses, and following an organized project structure. These are critical components for building scalable and maintainable APIs.
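To make these requirements concrete, here is a minimal sketch of what a controller in such a project might look like. The field names, the publish criteria, and the db helper are hypothetical illustrations, not the actual question specification.

```javascript
// controllers/products.js (illustrative sketch only; the field names, the
// publish criteria, and the db helper are hypothetical, not the question spec)
const db = require('../db');

exports.createProduct = async (req, res) => {
  try {
    const { name, price, quantity } = req.body;
    if (!name || typeof price !== 'number' || typeof quantity !== 'number') {
      return res.status(400).json({ error: 'Invalid product payload' });
    }
    // Hypothetical business rule: publish only if the product meets the criteria.
    const isPublished = price > 0 && quantity > 0;
    const product = await db.insertProduct({ name, price, quantity, isPublished });
    return res.status(201).json(product);
  } catch (err) {
    return res.status(500).json({ error: 'Internal server error' });
  }
};

exports.getProducts = async (req, res) => {
  try {
    const products = await db.listProducts();
    return res.status(200).json(products);
  } catch (err) {
    return res.status(500).json({ error: 'Internal server error' });
  }
};
```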
Taking one of the GPT-4o-0513 solutions as an example: the model successfully implemented the core logic for the API. The controllers in controllers/products.js effectively handled operations such as adding, retrieving, and updating products, and the routes in routes/products.js were correctly defined to map API endpoints to their respective controllers.
The routes defined by GPT-4o-0513 for handling products were as follows:
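That code block is not reproduced in this document; illustratively, and reusing the hypothetical controller names from the sketch above, routes matching this description would take roughly the following shape (not the model's verbatim output).

```javascript
// routes/products.js (illustrative sketch of the routing layer, not the
// model's verbatim output). Only GET and POST are exposed; PUT and DELETE
// are intentionally left undefined, per the problem requirements.
const express = require('express');
const router = express.Router();
const products = require('../controllers/products');

router.get('/products', products.getProducts);
router.post('/products', products.createProduct);

module.exports = router;
```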
Despite correctly implementing the core logic, GPT-4o-0513 missed a critical step: integrating the product routes into the main application file (app.js). Instead of importing productsRouter from routes/products.js and linking it to the / path, GPT-4o-0513 incorrectly used a placeholder indexRouter. This oversight caused all requests to /products to fail with a "404 Not Found" error, as the routes were not properly connected. Consequently, every test case expecting responses from /products failed.
Here is the app.js provided by GPT-4o-0513:
To fix this issue, the productsRouter from routes/products.js should be directly linked to the root (/) endpoint in app.js. This ensures that all product-related routes are accessible as expected since the absolute paths are already defined within routes/products.js.
Fixes required in app.js:
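A minimal sketch of the corrected wiring, consistent with the description above (an illustration, not the benchmark's reference solution):

```javascript
// app.js: sketch of the fix. Import the products router and mount it at the
// root path so the /products routes defined in routes/products.js are reachable.
const express = require('express');
const app = express();

app.use(express.json());

const productsRouter = require('./routes/products');
app.use('/', productsRouter);

module.exports = app;
```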
We have provided a detailed project walkthrough, along with three additional examples, in the following document. This resource is intended for those interested in exploring the question and solution structure in greater depth.
The evaluation primarily targets code generation correctness and consistency, focusing exclusively on the model’s ability to generate accurate and functional solutions in response to a text-based API call.
1. Input Data Preparation
2. Solution Generation
3. Post-Processing
4. Solution Integration
5. Test Case Validation
6. Store Partial Results
7. Overall Aggregation
Once all the questions have been evaluated, an aggregation script is executed to compute key performance metrics for each question.
Evaluation Metrics
These metrics are chosen for their alignment with real-world coding standards, where both complete and partially correct solutions carry significance. The Average Score accounts for the model’s incremental problem-solving ability, offering a granular view of how much of a solution’s functionality is achieved even when it is not fully correct. Pass@1 indicates how reliably a model can produce correct code immediately, which is crucial in real-world scenarios where developers aim to get solutions right with minimal revisions. The Median Standard Deviation reflects the consistency of a model’s solutions for each problem, highlighting whether the model performs steadily across its multiple attempts or exhibits significant variability.
Using k=32 provides a meaningful measure of a model’s capability to explore diverse solutions, as this number of attempts allows it to overcome minor variances while maintaining focus on a feasible solution space. We use the mean for metrics like average score and pass@1 because these aggregate metrics aim to capture the overall performance of the model across problems. For standard deviation, however, we use the median because the variability of scores across problems often contains outliers, and the median provides a more robust measure of the typical consistency of the model's performance.
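As a rough illustration of the aggregation step described above, the sketch below computes the per-question average score, pass@1, and standard deviation, and then the overall averages and the median SD. The input shape (one array of k = 32 scores in [0, 1] per question, where a score of 1 means all test cases passed) is an assumption; this is not the benchmark's actual aggregation script.

```javascript
// Minimal sketch of the aggregation; the input shape (one array of k = 32
// scores in [0, 1] per question) is an assumption, not the actual pipeline.
const mean = (xs) => xs.reduce((sum, x) => sum + x, 0) / xs.length;
const stdDev = (xs) => {
  const m = mean(xs);
  return Math.sqrt(xs.reduce((sum, x) => sum + (x - m) ** 2, 0) / (xs.length - 1));
};
const median = (xs) => {
  const sorted = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
};

// runsPerQuestion is an array with one entry per question, each entry holding
// that question's 32 per-run scores.
function aggregate(runsPerQuestion) {
  const perQuestion = runsPerQuestion.map((scores) => ({
    averageScore: mean(scores),                           // partial credit counts
    passAt1: mean(scores.map((s) => (s === 1 ? 1 : 0))),  // run passes only if all tests pass
    stdDev: stdDev(scores),                               // consistency on this question
  }));
  return {
    averageScore: mean(perQuestion.map((q) => q.averageScore)),
    averagePassAt1: mean(perQuestion.map((q) => q.passAt1)),
    medianStdDev: median(perQuestion.map((q) => q.stdDev)),
  };
}
```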
Finding 1: ASTRA benchmark challenges LLMs with multi-file front-end projects
Finding 2: o1, o1-preview, and Claude 3.5 Sonnet are the leading models for front-end development (as of January 2025)
Finding 3: o1 leads in average score and average pass@1, while Claude 3.5 Sonnet leads in consistency.
Finding 4: Model performance varies across subskills, suggesting that a “best” AI front-end development tool is dependent on specific use cases.
Finding 5: XML prompts perform better than JSON prompts across all models
XML prompt
JSON prompt
Finding 6: ASTRA Benchmark reveals JSON escaping challenges and rare refusals in o1-preview and o1, emphasizing the need for refined guardrails.
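The escaping issue can be illustrated with a small sketch: when a model must return file contents inside a JSON string, every quote and newline in the code has to be escaped correctly, whereas an XML-style wrapper can carry the code verbatim between tags. The snippet below is illustrative only and does not reproduce the benchmark's actual prompt or response formats.

```javascript
// Illustrative only: why JSON-wrapped code output is more error-prone than an
// XML-style wrapper. Not the benchmark's actual response format.
const fileContent = 'const msg = "hello";\nconsole.log(msg);';

// JSON requires every quote and newline inside the code to be escaped:
console.log(JSON.stringify({ path: 'app.js', content: fileContent }));
// -> {"path":"app.js","content":"const msg = \"hello\";\nconsole.log(msg);"}

// An XML-style wrapper can carry the code verbatim between tags:
console.log(`<file path="app.js">\n${fileContent}\n</file>`);
```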
Finding 7: Common error categories across models
User Interface and Presentation Issues: Errors that impact the visual or interactive aspects of the application, degrading the user experience by displaying incorrect or suboptimal layouts and requiring user intervention to correct.
Data Handling and Misuse Errors: Errors caused by improper or unnecessary manipulation of data files or structures, disrupting the application's expected functionality and potentially leading to runtime or compilation failures.
Typos, Syntax, and Misinterpretation Errors: Errors resulting from minor formatting issues, typographical mistakes, or misinterpretation of the problem statement. These errors typically involve incorrect output formatting or failure to adhere to the specified requirements.
Logical and Implementation Errors: Errors in the implementation that fail to account for specific conditions, edge cases, or problem constraints, despite having correct syntax.
Finding 8: Correlation Between Model Performance and Input/Output Length
The correlation between average output length and average score is approximately -0.560, indicating a moderate negative relationship: longer outputs are generally associated with lower scores. In contrast, the correlation between input length and average score is approximately -0.164, a weak negative relationship, implying that longer inputs are only slightly associated with lower average scores.
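For reference, a Pearson correlation of this kind can be computed with a small helper like the one below. This is a sketch only; the benchmark's actual analysis code is not shown here.

```javascript
// Sketch of a Pearson correlation between, e.g., per-question output length
// and per-question average score; not the benchmark's actual analysis code.
function pearson(xs, ys) {
  const n = xs.length;
  const meanX = xs.reduce((sum, x) => sum + x, 0) / n;
  const meanY = ys.reduce((sum, y) => sum + y, 0) / n;
  let cov = 0, varX = 0, varY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  return cov / Math.sqrt(varX * varY);
}
```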
While our study provides valuable insights into AI model performance on multi-file real-world coding tasks, several limitations should be noted:
Limited Skill Coverage: The current version of the benchmark primarily focuses on front-end projects, such as React and Angular.js, which narrows the scope of skill evaluation. While these areas are critical, the lack of representation for back-end skills and other domains limits the comprehensiveness of the evaluation. In the next iteration, we aim to address this limitation by expanding the benchmark to include a broader range of back-end skills and technologies.
Absence of Agentic Approaches: Our evaluation does not yet leverage agentic methods to maximize model performance, where models are given the autonomy to iteratively explore, adapt, and refine their solutions within the benchmark constraints. Incorporating such approaches in future versions will enable a more realistic and nuanced understanding of the model’s potential in dynamic and complex problem-solving scenarios.
Lack of an Iterative Feedback Mechanism: The current study evaluates a handful of models by requesting outputs directly for each attempt, without providing feedback based on test case results. This approach limits our ability to assess how models perform when given iterative guidance, which is an essential aspect of real-world coding.
Limited model selection: The current model selection is limited to a subset of top-tier models. However, we are actively working to expand testing by including additional models, such as DeepSeek and Llama, in future evaluations. Furthermore, we are developing a community-driven approach to benchmark testing, enabling broader model comparisons to enhance our leaderboard. As an initial step, with this release, we are open-sourcing all 65 project questions on GitHub and Hugging Face.