Move over, Devin: Cosine’s Genie takes the AI coding crown


It wasn’t long ago that the startup Cognition was blowing minds with Devin, an AI-based software engineer that could autonomously write and edit code when given instructions in natural language text, powered on the backend by OpenAI’s GPT-4 large language model (LLM).

But Devin emerged in March 2024 — five months ago — an eternity in the fast-moving generative AI space.

Now, another “C”-named startup, Cosine, which was founded through the esteemed Y Combinator startup accelerator in San Francisco, has announced its own autonomous AI-powered engineer, Genie. Cosine says Genie handily outperforms Devin, scoring 30% on the third-party benchmark SWE-Bench compared to Devin’s 13.8%, and even surpasses the 19% scored by Amazon’s Q and Factory’s Code Droid.

Screenshot from Cosine’s website showing Genie’s performance on SWE-Bench compared to other AI coding engineer models. Credit: Cosine
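To put those scores in proportion, here is a minimal Python sketch that computes Genie’s lead in percentage points and as a ratio. It uses only the figures reported in this article (30.08% from Pullen’s post, and the rounded 19% and 13.8% figures); the dictionary labels and the compare helper are illustrative names, not anything published by Cosine or SWE-Bench.

```python
# SWE-Bench scores (percent) as reported in this article.
# The 19% and 13.8% figures are the rounded values quoted above.
scores = {
    "Genie (Cosine)": 30.08,
    "Amazon Q": 19.0,
    "Factory Code Droid": 19.0,
    "Devin (Cognition)": 13.8,
}


def compare(scores, leader):
    """Return {name: (point_gap, ratio)} for the leader versus every other entry."""
    top = scores[leader]
    return {
        name: (round(top - value, 2), round(top / value, 2))
        for name, value in scores.items()
        if name != leader
    }


gaps = compare(scores, "Genie (Cosine)")
for name, (points, ratio) in gaps.items():
    print(f"Genie vs {name}: +{points} points, {ratio}x")
```

On these numbers, Genie’s score is roughly 2.18 times Devin’s and about 1.58 times the 19% mark.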

“This model is so much more than a benchmark score: it was trained from the start to think and behave like a human SWE [software engineer],” wrote Cosine’s co-founder and CEO Alistair Pullen in a post on his account on the social network X.

I’m excited to share that we’ve built the world’s most capable AI software engineer, achieving 30.08% on SWE-Bench – ahead of Amazon and Cognition. This model is so much more than a benchmark score: it was trained from the start to think and behave like a human SWE. pic.twitter.com/OyvqKLxcGV
— Alistair (@AlistairPullen) August 12, 2024

What is Genie and what can it do?

Genie is an advanced AI software engineering model designed to autonomously tackle a wide range of coding tasks, from bug fixing to feature building, code refactoring and validation through comprehensive testing, as instructed by human engineers or managers.

It operates either fully autonomously or in collaboration with users and aims to provide the experience of working alongside a skilled colleague.

“We’ve been chasing the dream of building something that can genuinely automatically perform end-to-end programming tasks with no intervention and a high degree of reliability – an artificial colleague. Genie is the first step in doing exactly that,” wrote Pullen in the Cosine blog post announcing Genie’s performance and limited, invitation-only availability.

The AI can write software in a multitude of languages. Its technical report lists 15 as sources of data:

JavaScript
Python
TypeScript
TSX
Java
C#
C++
C
Rust
Scala
Kotlin
Swift
Golang
PHP
Ruby

Cosine claims Genie can emulate the cognitive processes of human engineers.

“My thesis on this is simple: make it watch how a human engineer does their job, and mimic that process,” Pullen explained in the blog post.

The code Genie generates is stored in a user’s GitHub repo, meaning Cosine does not retain a copy, nor does it take on any of the attendant security risks.

Furthermore, Cosine’s software platform is already integrated with Slack and system notifications, which it can use to alert users of its state, ask questions, or flag issues as a good human colleague would.

“Genie also can ask users clarifying questions as well as respond to reviews/comments on the PRs [pull requests] it generates,” Pullen wrote to VentureBeat. “We’re trying to get Genie to behave like a colleague, so getting the model to use the channels a colleague would makes the most sense.”

Powered by a long context OpenAI model

Unlike many AI models that rely on foundational models supplemented with a few tools, Genie was developed through a proprietary process that involves training and fine-tuning a long-output AI model from OpenAI.

“In terms of the model we’re using, it’s a (currently) non-general availability GPT-4o variant that OpenAI have allowed us to train as part of the experimental access program,” Pullen wrote to VentureBeat via email. “The model has performed well and we’ve shared our learnings with the OpenAI finetuning team and engineering leadership as a result. This was a real turning point for us as it convinced them to invest resources and attention in our novel techniques.”

While Cosine doesn’t specify the particular model, OpenAI just recently announced the limited availability of a new GPT-4o Long Output Context model which can spit out up to 64,000 tokens of output instead of GPT-4o’s initial 4,000 — a 16-fold increase.

The training data was key

“For its most recent training run Genie was trained on billions of tokens of data, the mix of which was chosen to make the model as competent as possible on the languages our users care about the most at the current time,” wrote Pullen in Cosine’s technical report on the agent.

With its extensive context window and a continuous loop of improvement, Genie iterates and refines its solutions until they meet the desired outcome.
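Cosine has not published how this loop is implemented. As a rough illustration of the general “propose, check, refine” pattern only, the Python sketch below retries a task until a check passes or a round limit is reached. Every name in it (refine_until_accepted, toy_propose, toy_check) is hypothetical, and the toy “model” is a stand-in, not Genie.

```python
from typing import Callable, Optional, Tuple


def refine_until_accepted(
    propose: Callable[[str, Optional[str]], str],
    check: Callable[[str], Tuple[bool, str]],
    task: str,
    max_rounds: int = 5,
) -> Tuple[Optional[str], int]:
    """Propose a candidate, check it, and feed failures back until accepted.

    Returns (accepted_candidate_or_None, rounds_used).
    """
    feedback: Optional[str] = None
    for round_number in range(1, max_rounds + 1):
        candidate = propose(task, feedback)
        ok, feedback = check(candidate)
        if ok:
            return candidate, round_number
    return None, max_rounds


# Hypothetical stand-in for a model: it fixes an off-by-one once it gets feedback.
def toy_propose(task, feedback):
    return "return a + b" if feedback else "return a + b + 1"


# Hypothetical stand-in for a test suite: one assertion on a generated function.
def toy_check(candidate):
    namespace = {}
    exec(f"def add(a, b):\n    {candidate}", namespace)
    passed = namespace["add"](2, 3) == 5
    return passed, "" if passed else "add(2, 3) should equal 5"


result, rounds = refine_until_accepted(toy_propose, toy_check, "implement add")
print(result, rounds)
```

The round limit matters in a loop like this: without it, a task the checker can never accept would run forever.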

Cosine says in its blog post that it spent nearly a year curating a dataset with a wide range of software development activities from real engineers.

“In practice, however, getting such [data] and then effectively utilising that data is extremely difficult, because essentially it doesn’t exist,” Pullen elaborated in his blog post, adding: “Our data pipeline uses a combination of artefacts, static analysis, self-play, step-by-step verification, and fine-tuned AI models trained on a large amount of labelled data to forensically derive the detailed process that must have happened to have arrived at the final output. The impact of the data labelling can’t be understated, getting hold of very high-quality data from competent software engineers is difficult, but the results were worth it as it gave so much insight as to how developers implicitly think about approaching problems.”

In an email to VentureBeat, Pullen clarified that: “We started with artefacts of SWEs doing their jobs like PRs, commits, issues from OSS repos (MIT licensed) and then ran that data through our pipeline to forensically derive the reasoning, to reconstruct how the humans came to the conclusions they did. This proprietary dataset is what we trained the v1 on, and then we used self-play and self-improvement to get us the rest of the way.”

This dataset not only represents perfect information lineage and incremental knowledge discovery but also captures the step-by-step decision-making process of human engineers.

“By actually training our models with this dataset rather than simply prompting base models which is what everyone else is doing, we have seen that we’re no longer just generating random code until some works, it’s tackling problems like a human,” Pullen asserted.

Pricing

In a follow-up email, Pullen described how Genie’s pricing structure will work.

He said it will initially be broken into two tiers:

“1. An accessible option priced competitively with existing AI tools, around the $20 mark. This tier will have some feature and usage limitations but will showcase Genie’s capabilities for individuals and small teams.

2. An enterprise-level offering with expanded features, virtually unlimited usage and the ability to create a perfect AI colleague who’s an expert in every line [of] code ever written internally. This tier will be priced more substantially, reflecting its value as a full AI engineering colleague.”

Implications and Future Developments

Genie’s launch has far-reaching implications for software development teams, particularly those looking to enhance productivity and reduce the time spent on routine tasks. With its ability to autonomously handle complex programming challenges, Genie could potentially transform the way engineering resources are allocated, allowing teams to focus on more strategic initiatives.

“The idea of engineering resource no longer being a constraint is a huge driver for me, particularly since starting a company,” wrote Pullen. “The value of an AI colleague that can jump into an unknown codebase and solve unseen problems in timeframes orders of magnitude quicker than a human is self-evident and has huge implications for the world.”

Cosine has ambitious plans for Genie’s future development. The company intends to expand its model portfolio to include smaller models for simpler tasks and larger models capable of handling more complex challenges. Additionally, Cosine plans to extend its work into open-source communities by context-extending one of the leading open-source models and pre-training on a vast dataset.

Availability and Next Steps

While Genie is already being rolled out to select users, broader access is still being managed.

Interested parties can apply for early access to try Genie on their projects by filling out a web form on the Cosine website.

Cosine remains committed to continuous improvement, with plans to ship regular updates to Genie’s capabilities based on customer feedback.

“SWE-Bench recently changed their submission requirements to include the full working process of AI models, which poses a challenge for us as it would require revealing proprietary methodologies,” noted Pullen. “For now, we’ve decided to keep these internal processes confidential, but we’ve made Genie’s final outputs publicly available for independent verification on GitHub.”

More on Cosine

Cosine is a human reasoning lab focused on researching and codifying how humans perform tasks, intending to teach AI to mimic, excel at, and expand on these tasks.

Founded in 2022 by Pullen, Sam Stenner, and Yang Li, the company’s mission is to push the boundaries of AI by applying human reasoning to solve complex problems, starting with software engineering.

Cosine has already raised $2.5 million in seed funding from Uphonest and SOMA Capital, with participation from Lakestar, Focal and others.

With a small but highly skilled team, Cosine has already made significant strides in the AI field, and Genie is just the beginning.

“We truly believe that we’re able to codify human reasoning for any job and industry,” Pullen stated in the announcement blog post. “Software engineering is just the most intuitive starting point, and we can’t wait to show you everything else we’re working on.”
