Repository-level QA benchmark for software engineering LLMs

Beyond Code Snippets: Benchmarking LLMs on Repository-Level Question Answering

StackRepoQA is a multi-project benchmark built from real developer questions and accepted answers, designed to evaluate whether LLMs can answer questions that require repository-scale program comprehension.

Read paper · Get dataset · DOI

A project from the Code World, No Blanket research group at Virginia Tech Computer Science.

Authors

Yoseph Berhanu Alebachew, Hunter Leary, Swanand Vaishampayan, Chris Brown

Links

PDF · Research group

Abstract

Why repository-level QA matters

Large Language Models have shown strong capabilities across software engineering tasks, but many benchmarks focus on isolated functions or single-file snippets. Real-world program comprehension often requires reasoning across multiple files, structural dependencies, project conventions, and historical developer discussions.

This work introduces StackRepoQA, a dataset of real Stack Overflow questions and accepted answers mapped to open-source Java repositories. The benchmark compares direct prompting with retrieval-augmented methods that use file-level retrieval and graph-based representations of structural dependencies.
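To make the retrieval-augmented setup concrete, the following is a minimal sketch of file-level retrieval followed by a one-hop expansion over a dependency graph. It is an illustration only, not the pipeline used in the paper: the token-overlap scoring, the function names (`rank_files`, `expand_with_dependencies`), and the toy files and graph are all assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_files(question, files, top_k=2):
    """Rank files by token overlap with the question (a simple lexical stand-in
    for whatever retriever a real system would use)."""
    q = Counter(tokenize(question))
    scores = {}
    for path, source in files.items():
        f = Counter(tokenize(source))
        scores[path] = sum(min(q[t], f[t]) for t in q)
    ranked = sorted(files, key=lambda p: (-scores[p], p))
    return [p for p in ranked if scores[p] > 0][:top_k]


def expand_with_dependencies(seed_paths, dependency_graph, hops=1):
    """Add files reachable from the seeds within `hops` dependency edges."""
    selected = list(seed_paths)
    frontier = list(seed_paths)
    for _ in range(hops):
        next_frontier = []
        for path in frontier:
            for neighbor in dependency_graph.get(path, []):
                if neighbor not in selected:
                    selected.append(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return selected


# Hypothetical toy repository: file path -> source text.
files = {
    "src/HttpClient.java": "public class HttpClient { RetryPolicy retry; void send(Request request) {} }",
    "src/RetryPolicy.java": "public class RetryPolicy { int maxAttempts; }",
    "src/Logger.java": "public class Logger { void log(String message) {} }",
}
# Hypothetical dependency edges: file -> files it depends on.
dependency_graph = {"src/HttpClient.java": ["src/RetryPolicy.java"]}

seeds = rank_files("How does HttpClient send a request?", files, top_k=1)
context = expand_with_dependencies(seeds, dependency_graph)
print(context)
```

In this toy case the lexical step finds only `HttpClient.java`, and the graph step pulls in `RetryPolicy.java`, a file the question never names but the answer may depend on.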

Dataset release

StackRepoQA

Download options

Dataset README · Full dataset (CSV)

About the dataset

Real developer questions

Questions are based on Stack Overflow posts with accepted answers, supporting realistic repository-level QA evaluation.

Repository mapping

Each QA item is associated with an open-source Java project, enabling project-aware retrieval and analysis.
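As a rough sketch of how the repository mapping could be used, the snippet below groups QA items by project so that retrieval can be scoped to one repository at a time. The column names (`question_id`, `repository`, `question`) and the sample rows are hypothetical; consult the dataset README for the actual schema.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows standing in for the released CSV; real column names may differ.
SAMPLE_CSV = """question_id,repository,question
1,example-org/project-a,How is the cache invalidated?
2,example-org/project-b,Where is the retry policy configured?
3,example-org/project-a,Which class parses the config file?
"""


def group_by_repository(csv_text, repo_column="repository"):
    """Return a dict mapping each repository name to its list of QA rows."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row[repo_column]].append(row)
    return dict(groups)


groups = group_by_repository(SAMPLE_CSV)
for repo, items in sorted(groups.items()):
    print(repo, len(items))
```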

Reasoning vs. memorization

The benchmark supports studying when high scores reflect genuine repository reasoning versus answer memorization.

Findings

What the benchmark reveals

Baseline LLM performance is moderate, suggesting that repository-scale comprehension remains challenging.

Retrieval augmentation improves results when the retrieved context contains useful structural and file-level signals.

Graph-based structural representations help expose dependencies that are difficult to capture with snippet-only context.

Some high-scoring outputs appear to reproduce accepted Stack Overflow answers, raising memorization concerns.
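One simple way to flag possible memorization is to measure how much of a model's output is copied verbatim from the accepted answer. The sketch below uses word-level n-gram overlap as an illustrative heuristic; it is not necessarily the analysis performed in the paper, and the function names and example strings are assumptions.

```python
def word_ngrams(text, n=3):
    """Return the set of word-level n-grams in the text (lowercased)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def ngram_overlap(model_output, accepted_answer, n=3):
    """Fraction of the output's n-grams that also occur in the accepted answer.

    Values near 1.0 suggest near-verbatim reproduction; values near 0.0 suggest
    the output was phrased independently. Returns 0.0 for outputs shorter than n words.
    """
    out = word_ngrams(model_output, n)
    ref = word_ngrams(accepted_answer, n)
    if not out:
        return 0.0
    return len(out & ref) / len(out)


# Hypothetical strings for illustration.
accepted = "call the close method on the session before reopening it"
copied = "call the close method on the session before reopening it"
paraphrase = "you should shut the session down first and then open a new one"

print(ngram_overlap(copied, accepted))
print(ngram_overlap(paraphrase, accepted))
```

A high overlap score alone does not prove memorization, since short or formulaic answers can coincide by chance, so a check like this is best treated as a filter for manual review.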

Citation

Cite this work

Use the BibTeX below when citing the paper or dataset website.

@inproceedings{alebachew2026beyond,
  title         = {Beyond Code Snippets: Benchmarking LLMs on Repository-Level Question Answering},
  author        = {Alebachew, Yoseph Berhanu and Leary, Hunter and Vaishampayan, Swanand and Brown, Chris},
  booktitle     = {Proceedings of the 22nd International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE 2026)},
  month         = jul,
  year          = {2026},
  note          = {Available as arXiv preprint},
  archivePrefix = {arXiv},
  eprint        = {2603.26567},
  primaryClass  = {cs.SE},
  url           = {https://arxiv.org/abs/2603.26567}
}