Exploring GraphCodeBERT for Code Search: Insights and Limitations

As a professional developer working daily in a massive codebase (millions of lines of code across more than 1,000 C# projects), I often find that locating the right piece of code to modify is time-consuming. Recently I have been focused on the problem of code search, and I was particularly intrigued by the potential of GraphCodeBERT, as described in the research paper GraphCodeBERT: Pre-training Code Representations with Data Flow.

Encouraged by the promising results described in the paper, I decided to evaluate its capabilities. The pretrained model is available here, with a corresponding demo project hosted in the GitHub repository: GraphCodeBERT Demo.



Diving Into Code Search

Initially, I went all in and vectorized the SeaGOAT repository, resulting in 193 Python function records stored in my Elasticsearch database. Using natural language queries, I attempted to find relevant functions by comparing their embeddings via cosine similarity. Unfortunately, I noticed that similar results were returned across multiple, distinct queries.
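The retrieval step itself is simple once the embeddings are stored. Here is a minimal sketch of cosine-similarity ranking in plain Python; the vectors are toy placeholders standing in for the GraphCodeBERT embeddings, and the record layout is only an assumption about how such results might come back from Elasticsearch:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def rank_functions(query_vec, records):
    """Rank stored function records by similarity to the query embedding.

    `records` is a list of (name, embedding) pairs; the vectors here
    are illustrative placeholders, not real model output.
    """
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in records]
    return sorted(scored, key=lambda t: t[1], reverse=True)

records = [
    ("download_and_save_image", [0.9, 0.1, 0.3]),
    ("save_image_to_file", [0.2, 0.8, 0.1]),
]
print(rank_functions([0.85, 0.15, 0.25], records))
```

With real embeddings, the same loop runs over all 193 vectorized functions; Elasticsearch can also perform this ranking natively with a dense-vector field.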

This led me to believe that the model likely requires fine-tuning for better performance. To test this hypothesis, I decided to take a simpler approach and use the demo project provided with the pretrained model.



Testing with a Controlled Dataset

The demo focuses on three Python functions:

1) download_and_save_image

def f(image_url, output_dir):
    import requests
    r = requests.get(image_url)
    with open(output_dir, 'wb') as f:
        f.write(r.content)

2) save_image_to_file

def f(image, output_dir):
    with open(output_dir, 'wb') as f:
        f.write(image)

3) fetch_image

def f(image_url, output_dir):
    import requests
    r = requests.get(image_url)
    return



Modified Query Results

Below is the table reflecting my findings when testing slightly modified queries against the three functions.

| User Query | 1) download_and_save_image | 2) save_image_to_file | 3) fetch_image |
| --- | --- | --- | --- |
| Download an image and save the content in output_dir | 0.97 | 9.7e-05 | 0.03 |
| Download and save an image | 0.56 | 0.0002 | 0.44 |
| Retrieve and store an image | 0.004 | 7e-06 | 0.996 |
| Get a photo and save it | 0.0001 | 4e-08 | 0.999 |
| Save a file from URL | 0.975 | 6e-07 | 0.025 |
| Process downloaded data and reshape it | 0.025 | 0.0002 | 0.975 |
| Go to the moon and back as soon as possible | 0.642 | 0.006 | 0.353 |
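One detail worth noting: each row of the table sums to roughly 1, which suggests the demo normalizes the raw query-code similarities with a softmax across the candidate set rather than reporting absolute scores (an assumption on my part, not confirmed from the demo code). That would explain why even an unrelated query like the "moon" one still produces a confident-looking winner. A minimal sketch of that normalization:

```python
import math

def softmax(scores):
    """Normalize raw similarity scores into a distribution over candidates."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits: even when every candidate is a poor match in absolute
# terms, softmax still forces the probability mass to sum to 1,
# so one candidate always looks like a strong hit.
print(softmax([2.0, -5.0, 1.5]))
```

If this is what the demo does, the scores only rank candidates relative to each other and say nothing about whether any candidate is actually a good match.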



Observations

From the table, it’s evident that the model correctly identifies the target function only when the query is very specific and closely matches the original wording. When a query is slightly reworded or uses synonyms, the results look almost random, and the same happens with abstract queries or queries unrelated to any function in the database.
It’s also striking that function 2 (save_image_to_file) receives a vanishingly small score for every query, which seems suspicious. This raises the question of whether the model is capturing meaningful distinctions in these cases, or whether something is off in the embeddings or the similarity calculation.



Concluding Thoughts

After experimenting with the demo version, I concluded that further exploration of this model for code search in larger repositories may not be worthwhile—at least not in its current form. It appears that code search based on natural language queries cannot yet be solved by a single AI model. Instead, a hybrid solution might be more effective, grouping classes or functions based on logical and business-related criteria and then searching these groups for code that addresses the specified problem.
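To make the hybrid idea concrete: a cheap lexical prefilter could first narrow the candidate set to a logically related group, with the embedding model only reranking the survivors. Below is a toy sketch of the first stage; the keyword filter and the function list are illustrative placeholders, not a real implementation:

```python
def keyword_prefilter(query, functions):
    """Keep only functions whose name shares at least one token with the query.

    `functions` is a list of (name, body) pairs; a real system would
    also index docstrings, identifiers, and grouping metadata.
    """
    query_tokens = set(query.lower().split())
    return [
        (name, body) for name, body in functions
        if query_tokens & set(name.lower().replace("_", " ").split())
    ]

functions = [
    ("download_and_save_image", "def f(image_url, output_dir): ..."),
    ("fetch_image", "def f(image_url, output_dir): ..."),
    ("parse_config", "def f(path): ..."),
]

# First stage: lexical filter; a second stage (not shown) would rerank
# the survivors with embedding similarity instead of scoring the
# entire repository at once.
candidates = keyword_prefilter("save an image", functions)
print([name for name, _ in candidates])
```

The point is not this particular filter but the division of labor: the model never has to discriminate among hundreds of unrelated functions, only among a small, already-plausible group.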

I plan to continue exploring this area further. If you have any insights, suggestions, or experiences with code search models or techniques, please don’t hesitate to share them in the comments. Let’s discuss and learn together!



By stp2y
