HackerRank ASTRA | An AI Benchmark for the SDLC

Methodology

The leaderboard shows results from private datasets with project questions created by the same experts who design HackerRank’s developer assessments.

Primary difference

Real-world datasets

The evaluation focuses on the model’s ability to perform complex tasks across the software development life cycle.

Primary focus

Correctness and consistency

To assess how reliable a model would be in production, we prioritize the median standard deviation of scores over k=32 runs, rather than relying on the industry-standard pass@k.
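
To illustrate why a spread-based measure can reveal what pass@k hides, here is a minimal sketch; it is not ASTRA's evaluation code. It assumes each of 32 runs on a problem yields a score between 0 and 1 (the fraction of test cases passed), that a run counts as correct only at 1.0, and that population standard deviation is used. The function name, the sample data, and the choice of k=10 for pass@k are hypothetical. The pass@k formula is the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), where c of n runs are fully correct.

```python
from math import comb
from statistics import pstdev


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k samples,
    drawn without replacement from n runs of which c are correct, is correct."""
    if n - c < k:
        return 1.0  # too few failing runs to fill k samples
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical per-run scores for one problem (1.0 means every test passed).
steady = [1.0] * 16 + [0.9] * 16   # always at or near a full solution
erratic = [1.0] * 16 + [0.0] * 16  # a full solution half the time, nothing otherwise

for name, scores in [("steady", steady), ("erratic", erratic)]:
    n = len(scores)
    c = sum(s == 1.0 for s in scores)  # fully correct runs
    print(name, "pass@10:", pass_at_k(n, c, 10), "std dev:", pstdev(scores))
```

Both hypothetical models have 16 fully correct runs, so their pass@10 values are identical, yet the standard deviation of their scores is about 0.05 for the first and 0.5 for the second.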

Dive deeper

Skill Leaderboards

To assist in model selection for various tasks, performance is reported at both the skill level and on a global leaderboard.

ASTRA Leaderboard

| Model | Avg. Score | Avg. Pass@1 | Consistency |
| --- | --- | --- | --- |
| GPT 4.1 | 81.96% | 71.72% | 0.14 |
| DeepSeek-R1 | 81.49% | 69.09% | 0.11 |
| o3-mini | 80.75% | 71.28% | 0.12 |
| DeepSeek-V3 | 77.89% | 64.11% | 0.16 |
| Claude-3.7-sonnet | 77.82% | 69.54% | 0.1 |
| GPT-4.5-preview | 77.46% | 64.91% | 0.13 |
| | 75.80% | 63.92% | 0.11 |
| o1-preview | 75.55% | 60.89% | 0.17 |
| Llama-4-Maverick | 75.44% | 63% | 0.12 |
| Claude-3.5-sonnet | 75.07% | 62.74% | 0.05 |
| Gemini-1.5-pro | 71.17% | 58.15% | 0.13 |
| GPT-4o | 69.52% | 50.91% | 0.2 |
| Gemini-2.5-pro-exp-03-25 | 67.43% | 58.02% | 0.23 |
| Llama-3.3-70B | 61.65% | 46.54% | 0.09 |

Evaluation metrics

ASTRA assesses AI models with metrics that matter in actual development.

Average Score

Measures how much of a task a model correctly solves on its first try. This shows how well a model handles complex, layered problems.

Average Pass@1

Reflects how often a model delivers a fully correct solution on the first attempt.

Consistency

Measures the mean standard deviation of scores to track how reliably a model performs. Lower numbers mean more consistent results; higher numbers signal inconsistent performance.
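
The three metrics above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not HackerRank's implementation: it assumes every problem is run several times, that each run yields a score between 0 and 1 (the fraction of test cases passed), that a run counts toward Pass@1 only when its score is exactly 1.0, and that population standard deviation is used. The `runs` data and all function names are hypothetical.

```python
from statistics import mean, pstdev


def average_score(runs: dict[str, list[float]]) -> float:
    """Mean over problems of the mean per-run score (partial credit counts)."""
    return mean(mean(scores) for scores in runs.values())


def average_pass_at_1(runs: dict[str, list[float]]) -> float:
    """Mean over problems of the fraction of runs that are fully correct."""
    return mean(
        mean(1.0 if s == 1.0 else 0.0 for s in scores)
        for scores in runs.values()
    )


def consistency(runs: dict[str, list[float]]) -> float:
    """Mean over problems of the standard deviation of per-run scores.
    Lower values mean the model scores similarly from run to run."""
    return mean(pstdev(scores) for scores in runs.values())


# Hypothetical scores: two problems, four runs each.
runs = {
    "problem_a": [1.0, 1.0, 0.5, 1.0],  # usually solved, one partial run
    "problem_b": [0.5, 0.5, 0.5, 0.5],  # always half solved, never fully
}

print("Average score:", average_score(runs))
print("Average Pass@1:", average_pass_at_1(runs))
print("Consistency:", consistency(runs))
```

In this toy data, `problem_b` contributes to the average score but not to Pass@1, which is why the two columns on the leaderboard can differ. Replacing the outer `mean` in `consistency` with `statistics.median` would give the median variant mentioned under Methodology.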

