CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code

Authors: Batu Guan and six co-authors

Abstract: Large Language Models (LLMs) have achieved remarkable progress in code generation. It is now crucial to identify whether code is AI-generated and to determine which model produced it, particularly for protecting Intellectual Property (IP) in industry and preventing cheating in programming exercises. To this end, several attempts have been made to insert watermarks into machine-generated code. However, existing approaches either insert only a single bit of information or depend heavily on particular code patterns. In this paper, we introduce CodeIP, a novel multi-bit watermarking technique that embeds additional information to preserve crucial provenance details, such as the vendor ID of an LLM, thereby safeguarding the IP of LLMs in code generation. Furthermore, to ensure the syntactic correctness of the generated code, we constrain the sampling process for predicting the next token by training a type predictor. Experiments on a real-world dataset across five programming languages demonstrate that CodeIP effectively watermarks LLMs for code generation while maintaining the syntactic correctness of the code.
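
The core idea can be illustrated with a minimal sketch (not the authors' actual implementation): at each decoding step, one bit of the watermark message pseudo-randomly partitions the vocabulary and the favoured partition receives a logit bonus, while a grammar-derived mask (assumed here to come from the paper's type predictor) removes syntactically invalid tokens before sampling. All names and parameters below (`watermarked_sample`, `valid_token_mask`, `gamma`, `delta`) are illustrative assumptions.

```python
import hashlib

import torch


def watermarked_sample(logits, prev_token_id, message_bits, step,
                       valid_token_mask, gamma=0.5, delta=2.0):
    """Sketch of one grammar-constrained, multi-bit watermarked decoding step.

    logits:           next-token logits from the LLM (1-D tensor of size V)
    prev_token_id:    previously generated token id (seeds the partition)
    message_bits:     the watermark message, e.g. a vendor ID, as a bit list
    step:             current decoding step (selects which bit to embed)
    valid_token_mask: boolean tensor of size V marking tokens the grammar /
                      type predictor allows at this position (assumed given)
    gamma:            fraction of the vocabulary in the favoured partition
    delta:            logit bonus added to the favoured partition
    """
    vocab_size = logits.shape[-1]

    # Pick the message bit to embed at this step (cycling over the message).
    bit = message_bits[step % len(message_bits)]

    # Derive a keyed, reproducible vocabulary partition from the previous
    # token and the current bit, so a detector holding the key can rebuild it.
    seed = int(hashlib.sha256(f"{prev_token_id}:{bit}".encode()).hexdigest(), 16) % (2 ** 31)
    gen = torch.Generator().manual_seed(seed)
    perm = torch.randperm(vocab_size, generator=gen)
    favoured = torch.zeros(vocab_size, dtype=torch.bool)
    favoured[perm[: int(gamma * vocab_size)]] = True

    # Bias the favoured tokens, then mask out grammatically invalid tokens.
    biased = logits + delta * favoured.float()
    biased = biased.masked_fill(~valid_token_mask, float("-inf"))

    # Sample the next token from the constrained, biased distribution.
    probs = torch.softmax(biased, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```

Detection, under these same assumptions, would rebuild the keyed partition at each position and test which message bits best explain the observed tokens, generalising the statistical tests used in single-bit watermarking schemes to a multi-bit message.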

Submission history

From: Batu Guan
[v1] Wed, 24 Apr 2024 04:25:04 UTC (3,766 KB)
[v2] Sun, 8 Sep 2024 08:50:38 UTC (3,427 KB)


