FGBERT: Function-Driven Pre-trained Gene Language Model for Metagenomics, by ChenRui Duan and 8 other authors
Abstract: Metagenomic data, comprising mixed multi-species genomes, are prevalent in diverse environments such as oceans and soils and significantly impact human health and ecological functions. However, current research relies on K-mer representations, which limit the capture of structurally and functionally relevant gene contexts. Moreover, these approaches struggle to encode biologically meaningful genes and fail to address the One-to-Many and Many-to-One relationships inherent in metagenomic data. To overcome these challenges, we introduce FGBERT, a novel metagenomic pre-trained model that employs a protein-based gene representation as a context-aware, structure-relevant tokenizer. FGBERT incorporates Masked Gene Modeling (MGM) to enhance the understanding of inter-gene contextual relationships and Triplet Enhanced Metagenomic Contrastive Learning (TMC) to elucidate gene sequence-function relationships. Pre-trained on over 100 million metagenomic sequences, FGBERT demonstrates superior performance on metagenomic datasets at four levels, spanning gene, functional, bacterial, and environmental analyses, with input sizes ranging from 1k to 213k sequences. Case studies of ATP Synthase and Gene Operons highlight FGBERT's capability for functional recognition and its biological relevance to metagenomic research.
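The abstract names two pre-training objectives, MGM and TMC, without detailing their implementation. As a rough illustration only, the sketch below shows generic PyTorch forms of a masked-token objective and a triplet contrastive loss; all names, dimensions, masking rates, and the toy encoder are assumptions for exposition, not the paper's actual architecture or tokenizer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical dimensions; the paper does not specify these values here.
VOCAB_SIZE = 5000    # number of protein-based gene tokens (assumption)
EMB_DIM = 256        # embedding width (assumption)
MASK_ID = 0          # reserved id for the [MASK] token (assumption)
MASK_PROB = 0.15     # BERT-style masking rate (assumption)

def masked_gene_modeling_loss(encoder, lm_head, gene_ids):
    """MGM-style objective: mask a fraction of gene tokens in a genomic
    context and predict the masked tokens from the surrounding genes."""
    mask = torch.rand_like(gene_ids, dtype=torch.float) < MASK_PROB
    corrupted = gene_ids.clone()
    corrupted[mask] = MASK_ID
    hidden = encoder(corrupted)              # (batch, length, EMB_DIM)
    logits = lm_head(hidden)                 # (batch, length, VOCAB_SIZE)
    return F.cross_entropy(logits[mask], gene_ids[mask])

def triplet_contrastive_loss(anchor, positive, negative, margin=1.0):
    """TMC-style objective: pull embeddings of functionally related gene
    sequences together and push unrelated ones apart."""
    return nn.TripletMarginLoss(margin=margin)(anchor, positive, negative)

# Minimal usage with toy components (stand-ins for the pre-trained encoder):
embed = nn.Embedding(VOCAB_SIZE, EMB_DIM)
encoder = lambda ids: embed(ids)
lm_head = nn.Linear(EMB_DIM, VOCAB_SIZE)
gene_ids = torch.randint(1, VOCAB_SIZE, (4, 32))   # 4 contexts of 32 genes
loss_mgm = masked_gene_modeling_loss(encoder, lm_head, gene_ids)

anchor, pos, neg = (torch.randn(4, EMB_DIM) for _ in range(3))
loss_tmc = triplet_contrastive_loss(anchor, pos, neg)
total_loss = loss_mgm + loss_tmc
```

In a real setup the two losses would typically be combined with weighting and optimized jointly over the pre-training corpus; the equal-weight sum above is only a placeholder.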
Submission history
From: Chenrui Duan
[v1] Sat, 24 Feb 2024 13:13:17 UTC (45,582 KB)
[v2] Fri, 27 Dec 2024 06:40:39 UTC (9,580 KB)