Meet HackerRank ASTRA

Evaluating LLM performance across the SDLC

HackerRank ASTRA* challenges AI models with real projects, testing how well they solve complex software tasks.
*ASTRA (Assessment of Software Tasks in Real-world Applications)

Methodology

The leaderboard shows results from private datasets with project questions created by the same experts who design HackerRank’s developer assessments.

Primary difference

Real-world datasets

The evaluation focuses on the model’s ability to perform complex tasks across the software development life cycle.

Primary focus

Correctness and consistency

To assess how reliable a model would be in production, we prioritize the median standard deviation of scores over k=32 runs, rather than relying on the industry-standard pass@k.
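
To see why pass@k on its own can overstate reliability, consider the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n sampled solutions of which c are fully correct. The sketch below is illustrative only: the counts (8 correct out of 32) are hypothetical, and the helper name `pass_at_k` is an assumption rather than part of ASTRA's scoring pipeline.

```python
from math import comb
from statistics import pstdev


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k
    samples drawn from n attempts (c of them correct) is correct."""
    if n - c < k:
        # Fewer than k incorrect attempts exist, so any draw of k
        # samples must contain at least one correct attempt.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical model: 8 fully correct solutions out of 32 runs.
n, c = 32, 8
outcomes = [1.0] * c + [0.0] * (n - c)

first_try = pass_at_k(n, c, 1)    # chance a single attempt succeeds (0.25)
best_of_ten = pass_at_k(n, c, 10) # chance at least one of ten succeeds
spread = pstdev(outcomes)         # run-to-run variability of the outcome

print(f"pass@1={first_try:.2f} pass@10={best_of_ten:.2f} std={spread:.2f}")
```

With these hypothetical counts, pass@10 is close to 1 even though three out of four individual attempts fail. The standard deviation across runs exposes that variability, which is the kind of signal a consistency metric is meant to capture.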

Dive deeper

Skill Leaderboards

To assist in model selection for various tasks, performance is reported at both the skill level and on a global leaderboard.

ASTRA Leaderboard

Rank  Model              Avg. Score  Avg. Pass@1  Consistency
1     o1                 75.80%      63.92%       0.11
2     o1-preview         75.55%      60.89%       0.17
3     Claude-3.5-sonnet  75.07%      62.74%       0.05
4     Gemini-1.5-pro     71.17%      58.15%       0.13
5     GPT-4o             69.52%      50.91%       0.20

Evaluation metrics

ASTRA assesses AI models with metrics that matter in actual development.

Average Score

Measures how much of a task a model correctly solves on its first try. This shows how well a model handles complex, layered problems.

Average Pass@1

Reflects how often a model delivers a fully correct solution on the first attempt.

Consistency

Measures the mean standard deviation of scores to track how reliably a model performs. Lower numbers mean more consistent results; higher numbers signal inconsistent performance. 
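
As a rough illustration of how the three metrics above could be computed from raw run data, here is a minimal Python sketch. It assumes each problem is attempted several times and each run yields a score between 0 and 1 (for example, the fraction of test cases passed). The function name `summarize`, the input layout, and the use of the population standard deviation are illustrative assumptions, not HackerRank's published implementation.

```python
from statistics import mean, pstdev


def summarize(runs_by_problem: dict[str, list[float]]) -> dict[str, float]:
    """Aggregate per-run scores into leaderboard-style metrics.

    runs_by_problem maps a problem id to the scores (0.0 to 1.0) from
    repeated independent runs of the same model on that problem.
    """
    scores, pass_rates, spreads = [], [], []
    for run_scores in runs_by_problem.values():
        # Average score: partial credit, averaged over the runs.
        scores.append(mean(run_scores))
        # Pass@1: share of runs that were fully correct (score of 1.0).
        pass_rates.append(mean(1.0 if s == 1.0 else 0.0 for s in run_scores))
        # Spread of scores across runs for this problem (assumed: population std).
        spreads.append(pstdev(run_scores))
    return {
        "average_score": mean(scores),
        "average_pass_at_1": mean(pass_rates),
        # Mean of the per-problem standard deviations; lower is more consistent.
        "consistency": mean(spreads),
    }


# Hypothetical data: two problems, four runs each.
example = {
    "problem_a": [1.0, 1.0, 0.5, 0.5],
    "problem_b": [0.0, 1.0, 0.0, 1.0],
}
result = summarize(example)
print(result)
```

In this made-up example both problems have the same Pass@1, but problem_b swings between total failure and full success, so it contributes a larger standard deviation and pushes the consistency number up.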

Key findings

See how today’s frontier models perform on real software development challenges.
Read the report

For developers, data scientists, and the hopelessly curious

Want to roll up your sleeves and get into the weeds? Check out our full report.