May 20
By Sayash Kapoor, Benedikt Stroebl, Arvind Narayanan

Which is the most accurate AI system for generating code? Surprisingly, there isn’t currently a good way to answer questions like these. Based on HumanEval, a widely used benchmark for code generation, the most accurate publicly available system is LDB (short for LLM debugger). But there’s a catch. The most accurate generative AI systems, including LDB, tend to be agents, which repeatedly invoke language models like GPT-4. That means they can be orders of magnitude more costly to run than the models themselves (which are already pretty costly). If we eke out a 2% accuracy…