
NetStart 2.0: prediction of eukaryotic translation initiation sites using a protein language model

Abstract

Background

Accurate identification of translation initiation sites is essential for the proper translation of mRNA into functional proteins. In eukaryotes, the choice of the translation initiation site is influenced by multiple factors, including its proximity to the 5\(^\prime \) end and the local start codon context. Translation initiation sites mark the transition from non-coding to coding regions. This motivates the expectation that the upstream sequence, if translated, would yield a nonsensical sequence of amino acids, while the downstream sequence would correspond to the structured beginning of a protein. This distinction suggests potential for predicting translation initiation sites using a protein language model.

Results

We present NetStart 2.0, a deep learning-based model that integrates the ESM-2 protein language model with the local sequence context to predict translation initiation sites across a broad range of eukaryotic species. NetStart 2.0 was trained as a single model across multiple species, and despite the broad phylogenetic diversity represented in the training data, it consistently relied on features marking the transition from non-coding to coding regions.

Conclusion

By leveraging “protein-ness”, NetStart 2.0 achieves state-of-the-art performance in predicting translation initiation sites across a diverse range of eukaryotic species. This success underscores the potential of protein language models to bridge transcript- and peptide-level information in complex biological prediction tasks. The NetStart 2.0 webserver is available at: https://services.healthtech.dtu.dk/services/NetStart-2.0/.


Background

Eukaryotic translation initiation is a highly regulated process that marks the beginning of protein synthesis. Thousands of proteins that possess important structural, catalytic, and regulatory roles are encoded in eukaryotic genomes, and the identification of the right translation initiation site (TIS), defined as the codon from which translation is initiated, is an important task for ensuring proper translation of mRNAs. For most eukaryotic mRNAs, this process is accomplished by the widely accepted “scanning mechanism”, which was first proposed by Marilyn Kozak in 1978 [1]. This mechanism describes how the 40S ribosomal subunit scans along the 5\(^\prime \) leader of the mRNA, base by base, until it encounters a start codon in a favorable context for initiating translation [2,3,4,5,6]. In vertebrates, the preferred context flanking the TIS is commonly known as the Kozak sequence, denoted as GCCRCCAUGG (where R represents a purine and AUG is the initiating codon) [4,5,6,7,8]. In particular, the presence of a purine three nucleotides upstream, and a guanine immediately downstream of the start codon has been shown to strongly influence the TIS selection in vertebrates, while the importance of the remaining positions in the Kozak sequence appears to be more variable [6, 7, 9,10,11]. Expanding beyond vertebrate context, studies of phylogenetically diverse eukaryotic transcripts have shown substantial variation in initiation signals among different eukaryotic groups, suggesting that the preferred initiation context roughly reflects the evolutionary relationships among species [7, 8, 12, 13].

The importance of the local context for TIS selection is evident in the event of leaky scanning, where an AUG codon in a weak context is bypassed by the 40S ribosomal subunit, leading to translation initiation at a downstream start codon [14, 15]. Notably, approximately 40% of eukaryotic mRNAs in GenBank [16] contain at least one AUG upstream of the annotated main open reading frame (mORF) [11, 17]. With the advent of ribosome profiling techniques, recent studies have shown that short ORFs (sORFs) with start codons located in the 5\(^\prime \) untranslated region (UTR) are very prevalent [8], appearing in approximately 64% of human mRNAs and 54% of Arabidopsis mRNAs [18]. These upstream ORFs (uORFs) generally play regulatory roles by influencing the translation of downstream mORFs, either through ribosome sequestering or competition, rather than by encoding functional proteins [2, 8, 11, 18, 19]. Zhang et al. [8], for instance, found that the start codon contexts of uORFs tend to deviate more from the Kozak consensus than those of mORFs, based on data from 478 phylogenetically diverse eukaryotic species. All of the above findings highlight that the identification of mORF TISs is complex and non-trivial.

Integrating detailed knowledge of translation initiation with advanced machine learning enables a wide range of biologically important tasks, including the discovery of novel proteins and alternative TISs, genome and transcriptome annotation, and deeper insight into protein synthesis, RNA coding potential, and the impact of nucleotide mutations on protein products [20,21,22,23,24,25]. Over time, various computational methods for TIS prediction in eukaryotes have been suggested, evolving from simple neural networks such as NetStart 1.0, which was developed in 1997 [12], to a range of more complex frameworks [17, 20, 26,27,28,29]. In particular, deep learning models have become popular because of their automated feature learning when trained on large, credible datasets [30]. An example is the model TIS Transformer, developed by Clauwaert et al. [20], which is trained from scratch on the human transcriptome and uses the transformer architecture with self-attention to predict multiple TIS locations in transcripts, including those of sORFs and within long non-coding RNAs. In addition to transcript-level TIS predictors, several gene prediction tools incorporate TIS prediction as part of their pipelines. A well-established example is AUGUSTUS, which employs a fourth-order interpolated generalized hidden Markov model to classify sequence features, including exons, introns, splice sites and TISs. AUGUSTUS is trained to predict alternative splice sites and has a broad range of species-specific models available [31,32,33,34,35]. Recently, deep learning models such as Tiberius [36] have further refined the accuracy of eukaryotic gene prediction. Tiberius integrates convolutional and long short-term memory layers with a differentiable HMM layer, predicting probabilities for 15 gene structure classes, including the initial CDS (Coding Sequence) where the TIS is located [36]. Tiberius is trained on data from 34 mammalian genomes, and does not predict alternative splice forms [36].

The advent of nucleotide and protein language models has significantly enhanced capabilities in biological sequence modeling. These models learn grammatical and semantic relationships between tokens from patterns in the training data, enabling them to assign probabilities to previously unseen tokens [37]. This ability is particularly effective for understanding the contextual dependencies inherent in DNA, RNA, and protein sequences [20, 38, 39]. A profound advancement in sequence analysis has been the introduction of the transformer architecture, which employs a self-attention mechanism to efficiently capture long-range dependencies across entire sequences [40]. Given the extensive amounts of unlabeled biological sequence data available, pretraining language models in a self-supervised setting allows them to learn the ‘language’ of biological sequences [38]. In this setup, models often predict the identities of randomly masked tokens in a sequence on the basis of their surrounding context, as seen in protein language models such as ProtT5 [38] and ESM-2 [39]. Following pretraining, these models can be fine-tuned on smaller labeled datasets for specific downstream tasks, leveraging their understanding of general sequence patterns to enhance both task-specific performance and computational efficiency [41].

In this paper, we introduce NetStart 2.0, a novel deep learning-based model designed to predict TISs of protein-coding ORFs in eukaryotic transcripts. NetStart 2.0 takes as input a transcript sequence and the corresponding species name. Its main objective is to accurately identify the correct mORF TIS within transcripts containing several ATG codons. As part of the modeling framework, NetStart 2.0 leverages peptide-level information for its nucleotide-level predictions, using the pretrained protein language model ESM-2 [39] to encode translated transcript sequences. With this approach, NetStart 2.0 integrates protein context with nucleotide-level features that capture the local start codon context, across sequences from 60 phylogenetically diverse eukaryotic species. After optimizing and defining the model architecture, we benchmarked NetStart 2.0 against state-of-the-art methods in TIS prediction. Our results highlight the potential of incorporating peptide-level context into transcript-level prediction tasks, indicating a promising direction for future research.

Implementation

Dataset creation

The raw datasets consisted of RefSeq-assembled genomes and corresponding annotation data from NCBI’s Eukaryotic Genome Annotation Pipeline Database, which was collected for 60 diverse eukaryotic species (Supplementary Table A1) [42, 43].

For the positive-labeled part of the dataset (hereafter referred to as the TIS-labeled dataset), we extracted all mRNA transcripts from nuclear genes with an annotated TIS ATG, labeling the position of the A in the translation-initiating ATG. The sequences from the assembled genomes were processed by splicing out introns as defined by the annotated exons, and locating the TIS as defined by the beginning of the first CDS annotation (Fig. 1A). We removed all poorly annotated mRNA sequences that did not fulfill the following criteria: (1) the CDS had a stop codon (TAG, TAA or TGA) as the last codon, (2) the CDS did not have an in-frame stop codon, (3) the CDS had a complete number of codon triplets, and (4) the CDS contained only known nucleotides (A, T, G, C).
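
The four quality criteria translate directly into a simple filter. Below is a minimal sketch of such a check, assuming the spliced CDS is available as a plain string; the function name and structure are illustrative, not taken from the NetStart 2.0 codebase:

```python
STOP_CODONS = {"TAG", "TAA", "TGA"}

def cds_passes_quality_checks(cds: str) -> bool:
    """Apply the four CDS filters described above (illustrative sketch)."""
    cds = cds.upper()
    if not cds or len(cds) % 3 != 0:                 # (3) complete codon triplets
        return False
    if set(cds) - set("ATGC"):                       # (4) only known nucleotides
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    if codons[-1] not in STOP_CODONS:                # (1) ends with a stop codon
        return False
    if any(c in STOP_CODONS for c in codons[:-1]):   # (2) no in-frame stop codon
        return False
    return True
```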

The negative-labeled part of the dataset (hereafter referred to as the non-TIS labeled dataset) consisted of intergenic sequences, intron sequences, and sequences from mRNA transcripts in which a non-TIS ATG was labeled (Fig. 1B–D). For each non-TIS labeled sequence, we randomly selected an ATG, labeled it, and extracted a subsequence of 500 nucleotides upstream and downstream of it. For each species, we extracted approximately as many intron and intergenic samples as TIS-labeled sequences, sampling randomly to obtain sequences spread across the genome. Since distinct proteoforms of the same gene can have different TISs, we kept track of all annotated TISs per gene and located non-TIS ATGs relative to them. We extracted all non-TIS ATGs located upstream of the first annotated TIS where the 5\(^\prime \) UTR was known. Pilot studies showed that the model had the most difficulty classifying downstream ATGs in the same reading frame as the TIS ATG. To better represent these challenging cases, we extracted three non-TIS ATGs downstream of the last annotated TIS: two in the same reading frame as the TIS ATG and one in an alternative reading frame.

When available, we extracted annotations sourced from RefSeq. In cases where RefSeq annotations were not available, we collected annotations from Gnomon, which are based on a combination of homology searching and ab initio modeling [44]. The reason for including Gnomon annotations was to increase the range of species covered in our training data.

Fig. 1. High-level visualization of the dataset creation approach. The red marks illustrate non-TIS ATGs, and the green marks illustrate TIS ATGs.

Due to the intrinsic similarities found in biological sequences, our dataset contains several highly similar entries, including genes belonging to the same family, mRNA splice variants of the same gene, and homologous genes present in different organisms, resulting in redundancy in our data [12, 45]. To address this issue, we employed the homology partitioning algorithm GraphPart [45] to partition the data prior to training NetStart 2.0. We applied MMseqs2 [46] for alignment and chose a pairwise identity threshold of 50% at the nucleotide level to ensure that no pair of sequences with a higher sequence identity would end up in different partitions [45]. We extracted subsequences of 603 nucleotides from each sample, with the labeled ATG positioned at the center of each subsequence; this was done both to execute GraphPart faster and because NetStart 2.0 takes this sequence window as input. GraphPart was run separately on the different sequence types, and we specified the organism origin for each sequence to ensure an approximately even distribution in each partition with respect to both organism origin and sequence type. Using this approach, the data was divided into five equally sized partitions (k = 5). The final dataset, distributed across the five partitions, contains 9,912,708 sequences, of which 1,162,194 (\(11.724 \%\)) are TIS-labeled and 8,750,514 (\(88.276 \%\)) are non-TIS labeled. The complete composition of the dataset can be seen in Supplementary Table A2.
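
To illustrate the window extraction step, the sketch below cuts a 603-nucleotide subsequence (300 nt on each side of the candidate ATG) from a transcript; padding truncated flanks with N is our assumption for cases where the window runs past the sequence ends:

```python
def extract_window(sequence: str, atg_pos: int, flank: int = 300) -> str:
    """Extract a 603-nt window centered on the labeled ATG (atg_pos = index
    of the A). Flanks shorter than `flank` are padded with 'N' (assumption)."""
    upstream = sequence[max(0, atg_pos - flank):atg_pos]
    downstream = sequence[atg_pos + 3:atg_pos + 3 + flank]
    upstream = "N" * (flank - len(upstream)) + upstream
    downstream = downstream + "N" * (flank - len(downstream))
    return upstream + sequence[atg_pos:atg_pos + 3] + downstream
```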

NetStart 2.0 architecture and training

Prediction task and objective

The aim of NetStart 2.0 is to predict TISs of protein-coding mORFs in mRNA sequences from diverse species across the eukaryotic domain. Although the translation of protein-coding mORFs can occasionally initiate at non-ATG codons, such instances are, to current knowledge, relatively rare and not often annotated [4] (Supplementary Table A3). For this reason, we defined the modeling objective of NetStart 2.0 to be a binary classification task aimed at predicting whether each occurrence of an ATG in a sequence is a TIS or not. Upon prediction of the TIS, the translation termination site can be identified directly from the sequence as the first occurring downstream in-frame stop codon.
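
Given a predicted TIS, locating the termination site is a straightforward in-frame scan, as in the following sketch (an illustrative helper, not part of the NetStart 2.0 code):

```python
from typing import Optional

def find_termination_site(transcript: str, tis_pos: int) -> Optional[int]:
    """Return the index of the first in-frame stop codon downstream of the
    predicted TIS, or None if the ORF runs off the end of the transcript."""
    stop_codons = {"TAA", "TAG", "TGA"}
    for i in range(tis_pos + 3, len(transcript) - 2, 3):
        if transcript[i:i + 3] in stop_codons:
            return i
    return None
```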

Model architecture

The NetStart 2.0 architecture integrates three windows that each process input independently to extract different kinds of information relevant for predicting TIS (Fig. 2).

Fig. 2. High-level schematic of the full NetStart 2.0 architecture. As input, NetStart 2.0 takes a transcript sequence centered on a candidate ATG, and a species name, which are processed through three separate windows. The purple window encodes the species input using learned taxonomic embeddings, summed across taxonomic levels. The blue (‘local’) window extracts a short nucleotide subsequence surrounding the candidate ATG and passes it through multiple feed-forward layers to capture patterns in the start codon context. The green (‘global’) window extracts a longer nucleotide subsequence and translates it into the corresponding amino acid sequence. The amino acid sequence is processed by a fine-tuned version of the pretrained protein language model ESM-2 to identify shifts from non-coding to coding regions by evaluating the likelihood that a given subsequence is protein-coding. The resulting embeddings from the three windows are concatenated and passed through a shared feed-forward layer connected to a binary classification head. The model runs independently on each candidate ATG within the input transcript.

The first window takes a species name as input and represents the distinct sequence patterns of the individual species (Fig. 2, purple window). Given the evolutionary relationships among organisms, they can be classified at various taxonomical levels. Adopting the approach by Teufel et al. [47], we represented individual species based on seven taxonomical levels denoted as L: Kingdom, Phylum, Class, Order, Family, Genus, and Species, as defined by the NCBI Taxonomy classification system [48]. For each level, embeddings of the same dimension were learned independently from scratch across all species present in the dataset. The embedding dimension was defined as a hyperparameter (Supplementary Tables A4 and A5), and the full representation of a species was expressed as the sum of its taxonomic rank embeddings:

$$\begin{aligned} \text {embedding(organism)} = \sum _{l\in L} \text {embedding}_l. \end{aligned}$$
(1)
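
In a framework such as PyTorch, Eq. 1 amounts to summing one learned embedding table per taxonomic rank. The sketch below shows the idea; vocabulary sizes and the embedding dimension are placeholders, not the tuned hyperparameter values:

```python
import torch
import torch.nn as nn

class OrganismEmbedding(nn.Module):
    """Summed taxonomic embedding (Eq. 1); dimensions are illustrative."""
    LEVELS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]

    def __init__(self, vocab_sizes: dict, dim: int = 64):
        super().__init__()
        self.embeddings = nn.ModuleDict(
            {level: nn.Embedding(vocab_sizes[level], dim) for level in self.LEVELS}
        )

    def forward(self, rank_ids: dict) -> torch.Tensor:
        # Sum the learned embeddings across the seven taxonomic ranks.
        return sum(self.embeddings[level](rank_ids[level]) for level in self.LEVELS)
```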

The remaining two windows take the nucleotide sequence as input and process it independently. The ‘local’ sequence window (Fig. 2, blue window) takes a short nucleotide region surrounding the candidate ATG as input. The window width is defined as a hyperparameter in the range from 10 to 30 nucleotides upstream and downstream of the ATG, respectively, to identify patterns in the immediate start codon context (see Supplementary Tables A4 and A5). This input is one-hot encoded as \(\text {A} = [1,0,0,0], \text {C} = [0,1,0,0], \text {G} = [0,0,1,0], \text {T} = [0,0,0,1]\), and \(\text {N} = [0,0,0,0]\) and processed by a number of separate feed-forward layers, defined as a hyperparameter that ranged in depth from 2 to 5 for the distinct data splits (Supplementary Table A5). Initially, we experimented with letting a nucleotide language model [49] encode the local sequence window, but it did not improve performance.
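
The one-hot encoding of the local window can be written compactly as below; flattening the matrix into a single feature vector for the feed-forward layers is our assumption:

```python
import numpy as np

ONE_HOT = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
           "G": [0, 0, 1, 0], "T": [0, 0, 0, 1], "N": [0, 0, 0, 0]}

def one_hot_encode(local_window: str) -> np.ndarray:
    """One-hot encode the local sequence window; unknown letters map to N."""
    rows = [ONE_HOT.get(nt, ONE_HOT["N"]) for nt in local_window.upper()]
    return np.array(rows, dtype=np.float32).ravel()
```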

The ‘global’ sequence window (Fig. 2, green window) aims to identify shifts from non-coding to coding regions, examining whether subsequences upstream and downstream of an ATG can be translated into protein-like sequences. This window takes as input a translated nucleotide sequence of 100 amino acids upstream and downstream of the labeled ATG, respectively (this input size was determined based on initial experiments and the results shown in Fig. 3). The amino acid sequence was tokenized and encoded with the smallest version of the pretrained protein language model ESM-2 (8 M parameters) [39]. The input embedding is denoted as \(\textbf{H}^{\text {in}} \in \mathbb {R}^{n\times d_{\text {model}}}\), with n being the number of input tokens (i.e., amino acids), \(d_{\text {model}}\) being the embedding dimension of each token (320), and \(\textbf{H}^{\text {in}}_i\) denoting the vector embedding at token \(i \in \{1,\ldots ,n\}\) [20, 40]. Stop codons (TAA, TAG, and TGA) were encoded as unknown tokens (\(\mathtt {<}\)unk\(\mathtt {>}\)). Sequences labeled with a TIS that had a 5\(^\prime \) UTR shorter than the pre-defined input length (25% of the dataset) were padded with \(\mathtt {<}\)pad\(\mathtt {>}\) tokens. To ensure consistency between TIS- and non-TIS labeled sequences, we masked the upstream nucleotides in 25% of non-TIS sequences, matching the padding-length distribution observed in the TIS-labeled sequences. We obtained a contextualized representation of the amino acid sequence as the last hidden state of the encoder, \(\textbf{H}^{\text {out}} \in \mathbb {R}^{n\times d_{\text {model}}}\), which was fed into a separate, down-scaling feed-forward layer. The embeddings from the three separate windows (i.e., the organism embedding window, the ‘local’ nucleotide sequence window, and the ‘global’ amino acid sequence window) were then concatenated and fed through a shared feed-forward layer, directed to a binary classification layer outputting the probability of an ATG being a TIS.
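
A minimal sketch of preparing the ‘global’ window input is given below, using the public 8M-parameter ESM-2 checkpoint from the Hugging Face Hub; the translation step uses Biopython for brevity, and the padding and masking of short 5\(^\prime \) UTRs described above is omitted:

```python
import torch
from Bio.Seq import Seq
from transformers import AutoTokenizer, EsmModel

MODEL_ID = "facebook/esm2_t6_8M_UR50D"  # public 8M-parameter ESM-2 checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
encoder = EsmModel.from_pretrained(MODEL_ID)

def encode_global_window(nt_window: str) -> torch.Tensor:
    """Translate the in-frame nucleotide window and embed it with ESM-2.
    Stop codons (translated to '*') are mapped to the unknown token."""
    trimmed = nt_window[: len(nt_window) // 3 * 3]      # whole codons only
    aa_seq = str(Seq(trimmed).translate())              # '*' marks stop codons
    aa_seq = aa_seq.replace("*", tokenizer.unk_token)   # stop -> <unk>
    inputs = tokenizer(aa_seq, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**inputs)
    return out.last_hidden_state                        # shape (1, n_tokens, 320)
```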

Training procedure

The full training procedure was conducted using unnested 4-fold cross-validation, with four data partitions used as training and validation sets in rotation, and the fifth data partition serving as the independent test set. To reduce computational time, the ESM-2 encoder was fine-tuned separately prior to end-to-end training. The fine-tuning was performed using the EsmForSequenceClassification class from the Hugging Face transformers library, which attaches a classification head to the embedding of the first token from the last hidden layer (used as a sequence summary) [50]. The fine-tuning objective was defined as a binary classification task, predicting whether the labeled ATG within a given sequence represents a TIS or not. All ESM-2 weights were updated during fine-tuning. The fine-tuning allowed ESM-2 to adapt to the task of detecting shifts from non-coding to coding sequence, as this discrimination was not part of the pretraining objective. We conducted preliminary experiments involving fine-tuning the model on different sequence input lengths (cf. Fig. 3), treating the amino acid sequence input length as a hyperparameter. These results showed that including a longer upstream context consistently improved performance. Based on these findings, we used a context window of 201 amino acids centered on the labeled ATG in the final model.
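
In outline, the fine-tuning setup can be reproduced with the Hugging Face transformers library as sketched here; a single optimization step is shown with a dummy sequence and label, and the head’s default cross-entropy loss stands in for the weighted BCE described below:

```python
import torch
from transformers import AutoTokenizer, EsmForSequenceClassification

MODEL_ID = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = EsmForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

# One illustrative step; the real run iterates over the training partitions
# with early stopping, and all ESM-2 weights receive gradient updates.
batch = tokenizer("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", return_tensors="pt")
labels = torch.tensor([1])  # 1 = TIS, 0 = non-TIS
outputs = model(**batch, labels=labels)
outputs.loss.backward()
```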

Subsequently, a range of hyperparameters were tuned using Optuna (Supplementary Tables A4 and A5). We then trained the full model end-to-end with the fine-tuned ESM-2 and optimized hyperparameters defined for each data split. The final model was constructed as an ensemble, averaging the probabilities predicted by the four models trained on distinct data splits. The fine-tuning and training took approximately 22 h and 20 h, respectively, on an NVIDIA L40S GPU.

For both the fine-tuning of ESM-2 and the training of the full NetStart 2.0 model, we used weighted Binary Cross-Entropy (BCE) as the loss function, employed the Adam algorithm as the optimizer [51], and implemented early stopping by monitoring the BCE loss on the validation set [52]. For the weighted BCE loss, the weight for TIS-labeled sequences was set to three times that of non-TIS labeled sequences, increasing the penalty for incorrect predictions of TIS samples to address the class imbalance. These weights were chosen based on initial experiments. For NetStart 2.0, we applied dropout to each feed-forward layer [53].
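
In PyTorch, this weighting corresponds to the pos_weight argument of the BCE loss, as in the following minimal sketch:

```python
import torch
import torch.nn as nn

# Weighted BCE: errors on TIS-labeled (positive) samples are penalized 3x.
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([3.0]))

logits = torch.randn(8)                                   # raw model outputs
targets = torch.tensor([1., 0., 0., 0., 1., 0., 0., 0.])  # 1 = TIS
loss = criterion(logits, targets)
```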

Ablation studies

To assess the relative performance contributions provided by the different input windows to NetStart 2.0, we conducted two ablation studies following the same procedure of training on the 4 data splits and constructing the models as ensembles. The first ablation model uses only the fine-tuned ESM-2 encoder with an attached classification head and is referred to as “NetStart 2.0A”. With the second ablation model, we aimed to mimic the architecture of NetStart 1.0 [12], using the local nucleotide input window of NetStart 2.0 but with a larger subsequence surrounding the labeled ATG, optimized as a hyperparameter (Supplementary Table A6). We additionally included the organism input due to the high diversity of species in the training data, and refer to this ablation model as “NetStart 1.0A”.

Model evaluation and benchmarking

We compared the performance of NetStart 2.0 to that of TIS Transformer [20], as well as that of the ab initio gene finders AUGUSTUS [32] and the recently developed Tiberius [36].

TIS Transformer is trained on the human transcriptome, Tiberius on a range of vertebrate genomes, and AUGUSTUS on diverse eukaryotic species. Since AUGUSTUS requires the user to specify a species, we selected the most closely related available species based on the NCBI Taxonomy classification [48] for those not directly supported in our dataset (Supplementary Table A7). Tiberius was run in ab initio mode and had high memory and time demands (it was run on an NVIDIA L40S GPU node), likely because it is a recently released model [36] with experimental code. For these reasons, we included Tiberius predictions for only one species per defined organism group, namely Homo sapiens, Drosophila melanogaster, Cryptococcus neoformans, Toxoplasma gondii, and Arabidopsis thaliana, selected on the criterion of having good RefSeq coverage (Supplementary Table A8). The raw datasets were softmasked by default, and Tiberius and AUGUSTUS were run with softmasking enabled [54, 55].

Construction of benchmark test sets

NetStart 2.0 is trained on nucleotide sequences of up to 603 nucleotides, TIS Transformer on full human transcript sequences (up to 30,000 nucleotides), and AUGUSTUS and Tiberius on genomic sequences. Given these differences in training data, we aimed to create test sets that would provide a fair evaluation of all models. The test sets were based on the NetStart 2.0 test partition, with a few modifications: (1) We extracted the full transcripts for the TIS-labeled sequences. (2) For transcripts without an annotated transcription start site (where mRNA and CDS annotations begin at the same position), we added 180 nucleotides upstream of the TIS to approximate a 5\(^\prime \) UTR (Supplementary Fig. A1). (3) We excluded transcript sequences longer than 30,000 nucleotides. (4) We extracted 500 nucleotides upstream and downstream of the labeled ATG for non-TIS sequences. (5) We excluded sequences with unknown nucleotides (denoted by any letter other than A, T, G, or C). This test set was used as the foundation for the benchmark and is referred to as the non-homologous test set (Supplementary Table A9).

We used the non-homologous test set, encompassing sequences each with a single labeled TIS or non-TIS ATG, as the foundation for two additional test sets. To assess transcript-level accuracy, we extracted all transcripts from the non-homologous test set that had both a labeled TIS and an annotated transcription start site. We refer to this test set as the transcript-level test set. While NetStart 2.0 is trained for transcript-level predictions, we also wanted to assess its applicability at the genomic level, and therefore extracted all genes corresponding to the transcripts with a labeled TIS from the non-homologous test set, which we merged with the non-TIS labeled sequences from the non-homologous test set. We refer to this test set as the genomic test set. As the promoter region is very rarely annotated, we added 1000 nucleotides upstream of each gene to account for this (eukaryotic promoter regions typically span 100 to 1000 nucleotides) [56]. For AUGUSTUS and Tiberius, we included an additional DNA region of 1000 nucleotides upstream and downstream of each gene to represent a more realistic use case for these models with surrounding context. We removed all duplicates arising from distinct mRNA variants with the same TIS, as well as all genes longer than 30,000 nucleotides (Supplementary Table A10).

Performance metrics

For evaluation, we calculated several performance metrics to provide a comprehensive assessment. We calculated the area under the Receiver Operating Characteristic curve (AUC) and the Average Precision Score (APS) as threshold-independent measures (i.e., they summarize model performance across all possible classification thresholds). AUC reflects the model’s ability to distinguish between classes [57]. The APS is calculated as the weighted mean of the precisions obtained along the precision-recall curve, and is approximately equivalent to the area under the precision-recall curve, but unlike interpolated AUPR estimates, it is less sensitive to local fluctuations and data sparsity [58, 59].
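
Both threshold-independent metrics are available in scikit-learn; a toy example with dummy labels and probabilities:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1])                   # 1 = TIS ATG
y_prob = np.array([0.9, 0.1, 0.4, 0.8, 0.2, 0.05, 0.3, 0.7])  # predicted P(TIS)

print(roc_auc_score(y_true, y_prob))            # AUC
print(average_precision_score(y_true, y_prob))  # APS
```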

We also used Matthews correlation coefficient (MCC), which incorporates counts of True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN) to provide a balanced evaluation of model performance. MCC was calculated across thresholds, with the optimal threshold defined as the one maximizing MCC:

$$\begin{aligned} \text {MCC} = \frac{\text {TP}\times \text {TN}-\text {FP}\times \text {FN}}{\sqrt{(\text {TP}+\text {FP})(\text {TP}+\text {FN})(\text {TN}+\text {FP})(\text {TN}+\text {FN})}}. \end{aligned}$$
(2)
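
Selecting the MCC-maximizing threshold amounts to a grid scan over candidate cutoffs, sketched below; the 0.025 grid step is our assumption, consistent with the thresholds reported later:

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

def best_mcc_threshold(y_true, y_prob, grid=np.arange(0.025, 1.0, 0.025)):
    """Return the threshold maximizing MCC (Eq. 2) and the MCC it achieves."""
    mccs = [matthews_corrcoef(y_true, (y_prob >= t).astype(int)) for t in grid]
    best = int(np.argmax(mccs))
    return grid[best], mccs[best]
```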

At the optimal threshold, error rates were calculated for the different sequence types in the dataset. For example, the error rate on non-TIS ATGs placed upstream of and in the same reading frame as the TIS was calculated as:

$$\begin{aligned} \text {Error rate}_{\text {Upstream, In Frame}} = \frac{\text {Upstream, in frame ATG predicted as TIS}}{\text {All upstream, in frame ATGs}}. \end{aligned}$$
(3)

Results

Impact of fine-tuning of ESM-2

We analyzed the sequence representations learned by the ESM-2 encoder for varying sequence input lengths, both before and after fine-tuning. Amino acid embeddings were extracted from the last hidden state of the encoder, where each amino acid in a sequence is represented by a vector of length 320 (the embedding dimension). For each input sequence, the per-residue embeddings were flattened into a single vector by concatenation, resulting in a representation with dimensionality equal to the number of amino acids multiplied by 320. Each vector was standardized, and a principal component analysis (PCA) was performed on the standardized embeddings. The objective of the PCA was to visualize the learned representations in two-dimensional space and assess the extent to which fine-tuning ESM-2 altered the underlying amino acid representations. We considered three input sequence windows, each with a sequence length of 100 amino acids downstream of the TIS but with a varying sequence length upstream of the TIS, specifically: (1) 0 amino acids, (2) 50 amino acids, and (3) 100 amino acids (Fig. 3).
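
The embedding analysis reduces to flattening, standardizing, and projecting, as sketched below with random placeholder arrays in place of the real ESM-2 hidden states:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder per-residue embeddings: (sequences, residues, embedding dim).
n_seq, n_res, dim = 200, 201, 320
embeddings = np.random.randn(n_seq, n_res, dim)

flat = embeddings.reshape(n_seq, n_res * dim)   # concatenate per-residue vectors
flat = StandardScaler().fit_transform(flat)     # standardize each feature
pcs = PCA(n_components=2).fit_transform(flat)   # project onto first two PCs
```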

The aim was to qualitatively assess the influence that fine-tuning and including translated nucleotide context would have on distinguishing TIS ATGs from non-TIS ATGs. Note that there are two feed-forward layers with non-linear (ReLU) activation between this representation and the model output (cf. Fig. 2) and that the first two principal components explain only a minor part of the variance. These circumstances mean that there is not a 1:1 correspondence between the PCA and the model performance. Still, visual separation of clusters in the PCA serves as a strong indication that the downstream layers will be able to computationally separate the groups.

Without fine-tuning, there was substantial overlap between the ESM-2 embeddings for TIS- and non-TIS labeled ATGs (Fig. 3A–C). However, including 100 amino acids upstream and downstream of the TIS resulted in the embeddings for a subset of the TIS-labeled sequences being clearly separated from those of non-TIS labeled sequences, indicating that ESM-2 can capture relevant information without being fine-tuned, if provided enough context (Fig. 3C). After fine-tuning, the model version that was provided only downstream context still struggled to distinguish TIS ATGs from certain non-TIS ATGs located downstream and in-frame relative to the true TIS (Fig. 3D). This outcome was expected, as the input windows for both sequence types are fully protein-coding. In contrast, when using a sequence window that included 100 amino acids both upstream and downstream of the TIS, the embeddings of the fine-tuned model showed clear separation of TIS and non-TIS ATGs, highlighting the importance of allowing the model to learn the transition from the non-coding to the coding region (Fig. 3F). Three well-defined clusters were observed: one comprising TIS-labeled sequences, one comprising downstream, in-frame, non-TIS mRNA sequences, and a broader cluster encompassing the remaining non-TIS sequence types. The observed patterns suggest that some downstream, in-frame, non-TIS ATGs are represented differently from other non-TIS sequence types, despite the fine-tuning objective solely being to distinguish TIS from non-TIS ATGs. This could be expected, as this non-TIS sequence type is protein-coding while the remaining non-TIS sequence types are not. Overall, these results highlight the importance of including context both upstream and downstream of the labeled ATG, in order for the encoder to utilize transitions from non-coding to protein-coding regions in the overall assessment.

Fig. 3. Learned sequence representations obtained from the last hidden state of the ESM-2 encoder, projected onto the first two principal components for a random subset of \(n = 20,000\) sequences from the non-homologous test set. The objective is to distinguish TIS ATGs from non-TIS ATGs. Sequences are colored by origin: Green = TIS ATG from mRNA sequence. Dark blue = non-TIS ATG from intergenic sequence. Brown = non-TIS ATG from intron sequence. Red = non-TIS ATG from mRNA sequence, upstream of and in an alternative reading frame to the TIS ATG. Orange = non-TIS ATG from mRNA sequence, upstream of and in the same reading frame as the TIS ATG. Cyan = non-TIS ATG from mRNA sequence, downstream of and in an alternative reading frame to the TIS ATG. Pink = non-TIS ATG from mRNA sequence, downstream of and in the same reading frame as the TIS ATG. A–C Results for the ESM-2 encoder without fine-tuning for different sequence lengths. D–F Results after fine-tuning, trained on one of the four data splits for different sequence lengths. Supplementary Figs. A2, A3, and A4 show results for fine-tuning on the other data splits.

Benchmarking of TIS prediction models

Transcript-level accuracy

To evaluate NetStart 2.0’s performance in identifying the correct mORF TIS in transcripts containing several ATG codons, we calculated the transcript-level accuracy (based on the transcript-level test set; see “Model evaluation and benchmarking” section, Construction of Benchmark Test Sets). We define the transcript-level accuracy as the fraction of transcripts for which the annotated TIS ATG received the highest predicted probability among all ATGs present in that transcript. It should be noted that TIS Transformer is trained to predict various kinds of TISs, including those of sORFs within mRNA transcripts. Quantifying this potential bias is challenging since TIS Transformer does not differentiate between specific types of TIS in its predictions, meaning that all predictions are given as the binary output of “TIS” or “non-TIS”. In this setup, we assume that TIS Transformer predicts the TIS of the mORF with a higher probability than that of any other potential ORF in the transcript.
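
Computed directly, the metric looks like the following sketch, where each transcript is represented by its per-ATG predicted probabilities and the index of the annotated TIS among those ATGs:

```python
import numpy as np

def transcript_level_accuracy(transcripts) -> float:
    """transcripts: iterable of (probs, tis_index) pairs; probs[i] is the
    predicted TIS probability of the i-th ATG in the transcript."""
    transcripts = list(transcripts)
    hits = sum(int(np.argmax(probs) == tis_idx) for probs, tis_idx in transcripts)
    return hits / len(transcripts)

# Example: the annotated TIS tops the ranking in the first transcript only.
print(transcript_level_accuracy([([0.1, 0.9, 0.3], 1), ([0.6, 0.2, 0.7], 0)]))  # 0.5
```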

Table 1 Transcript-level accuracies (%), defined as the percentage of transcripts in which the TIS ATG is predicted with higher probability than all other ATGs in that transcript

Overall, NetStart 2.0 achieves the highest transcript-level accuracies among the evaluated models across nearly all organism groups (Table 1). Notably, NetStart 2.0 outperforms other methods on human transcripts (H. sapiens), particularly when considering only RefSeq-annotated sequences. TIS Transformer, despite being trained exclusively on human transcripts, achieves the highest accuracy within the fungal group, surpassing NetStart 2.0 in this specific group.

The ablation study reveals that the simplified model utilizing the ESM-2 protein language model to encode translated transcript sequences, NetStart 2.0A, still performs exceptionally well, suggesting that peptide-level context alone captures substantial predictive power. In contrast, NetStart 1.0A, which mimics the original NetStart 1.0 architecture, achieves considerably lower accuracy compared to all other benchmarked models.

AUGUSTUS consistently performs substantially worse than NetStart 2.0, especially for plant and protozoan transcripts, while Tiberius generally outperforms AUGUSTUS, but still attains markedly lower accuracies than NetStart 2.0. A notable exception is the fungus C. neoformans, where Tiberius performed worst among all evaluated methods (see Supplementary Tables A11 and A12).

Performance on the non-homologous test set

We evaluated the performance of each model on the non-homologous test set, excluding introns and intergenic sequences (see “Model evaluation and benchmarking” section, Construction of Benchmark Test Sets). Additionally, we calculated sequence identity to TIS Transformer’s training set and removed transcripts with more than \(50 \%\) sequence identity (Supplementary Table A13). The threshold-independent metrics AUC and APS were calculated for the TIS prediction models (AUGUSTUS and Tiberius report a TIS only if predicted, without a probability), and MCC was calculated for all models. For the models outputting a TIS probability, we defined the optimal threshold as the one maximizing MCC, found at 0.05 for TIS Transformer and 0.625 for NetStart 2.0 and the ablation models (Table 2 and Supplementary Fig. A5). Our results indicate that NetStart 2.0 consistently achieves slightly higher performance than the other evaluated models across organism groups, although the performances of TIS Transformer and NetStart 2.0A are generally comparable. The results further indicate that using the full NetStart 2.0 architecture rather than NetStart 2.0A has the most substantial impact on underrepresented groups, especially protozoan and fungal species. In contrast, the differences between NetStart 2.0 and NetStart 2.0A were minimal for the remaining organism groups.

Fig. 4. Distributions of predicted probabilities with NetStart 2.0 and TIS Transformer, based on the non-homologous test set.

Table 2 Predictive performance of models measured on the non-homologous test set of each organism group (binary classification of TIS versus non-TIS ATGs): Matthews Correlation Coefficient (MCC), Area Under the ROC Curve (AUC), and Average Precision Score (APS)

The optimal thresholds found for the distinct models align well with the observed probability distributions (Fig. 4 and Supplementary Fig. A6). NetStart 2.0’s predicted probabilities are concentrated towards 0 and 1 for non-TIS and TIS ATGs, respectively, whereas TIS Transformer’s non-TIS predictions are highly concentrated near 0 while its TIS predictions show a clearly two-peaked distribution.

We also calculated error rates for each specific sequence type based on the MCC-optimized thresholds for each model (Fig. 5 and Supplementary Fig. A7). Both AUGUSTUS and Tiberius exhibit high specificity but low sensitivity, resulting in low error rates across all non-TIS ATGs but high error rates on the TIS ATGs. Among the different types of non-TIS sequences, TIS Transformer, NetStart 2.0A, and NetStart 2.0 show the highest error rates for ATGs located downstream of and in the same reading frame as the TIS, indicating that distinguishing these ATGs from true TIS ATGs remains the greatest challenge for these models (see Supplementary Figs. A8 and A9 for further results regarding sequence type- and species-specific error rates calculated for NetStart 2.0).

Fig. 5. Error rates calculated for the various non-TIS sequence types and the TIS sequences separately on the non-homologous test set, based on the best universal threshold for each model (0.05 for TIS Transformer, and 0.625 for NetStart 2.0 and the ablation models). “Upstream” refers to the labeled ATG being placed in the 5\(^\prime \) UTR of the transcript, whereas “Downstream” refers to the labeled ATG being placed in the coding region of the transcript. “In frame” refers to the labeled ATG being placed in the same reading frame as the TIS ATG, and “out of frame” refers to the labeled ATG being placed in a reading frame alternative to the TIS ATG.

Performance at the genomic level

To assess the applicability of NetStart 2.0 on the genomic level, we benchmarked it using the genomic test set (see “Model evaluation and benchmarking” section, Construction of Benchmark Test Sets). Although NetStart 2.0 outperforms TIS Transformer across most organism groups, both models exhibit a substantial drop in performance at the genomic level, limiting their applicability for this problem. The predicted TIS probability for both TIS Transformer and NetStart 2.0 is strongly influenced by the position of the first downstream intron relative to the TIS (Fig. 6 and Supplementary Fig. A10). In contrast, the gene finders generally outperform the transcript-level TIS predictors across most organism groups (Table 3 and Supplementary Table A14). An exception occurs in protozoan species, where NetStart 2.0 achieves the best performance, likely due to the low prevalence of introns in certain protozoan groups, such as Trypanosomes and Leishmania species [60, 61]. However, the overall shift in relative performance arises primarily from a larger decline in the accuracy of transcript-level predictors, as the gene finders also show slight performance decreases at the genomic level compared to transcript level.

Fig. 6. Predicted TIS probability for TIS ATGs at gene level, shown as a function of the position of the first downstream intron relative to the TIS. The dashed lines indicate the MCC-optimized threshold for each model on the genomic dataset.

Table 3 MCC Scores with best universal threshold on the genomic test set for each model (0.025 for TIS Transformer, 0.75 for NetStart 1.0A, 0.7 for NetStart 2.0A, and 0.625 for NetStart 2.0)

Additional evaluations of NetStart 2.0

Impact of taxonomic input information

To evaluate the robustness of NetStart 2.0 when full species-level information is unavailable, we tested the model under three conditions (see Fig. 2, purple window): (a) the organism embedding included species-level detail (using learned embeddings from all 7 taxonomic ranks), (b) organism embeddings were limited to phylum-level detail (using learned embeddings from only the kingdom and phylum ranks), and (c) no organism information was provided (organism embeddings were set to 0). For each condition, we calculated group-specific MCCs on the non-homologous test set at the optimized threshold (Fig. 7). For vertebrates, reducing taxonomic detail from species level to phylum level does not affect MCC performance, which decreases only slightly when no taxonomic information is included. For plants, a similar trend is observed, with a drop in MCC of only 0.003 when providing phylum-level information and an additional drop of 0.004 when no taxonomic information is provided. This indicates that phylum-level embeddings retain most of the useful taxonomic information for predictions made on organisms of vertebrate and plant origin. For the invertebrate, fungal, and protozoan species, the declines in performance are more pronounced. Using phylum-level information, the drop in MCC within these groups ranges from 0.009 for the invertebrates to 0.02 for the protozoans. Performances decline further when no taxonomic information is provided, with the largest decreases observed in the fungal and protozoan groups. These findings highlight that correct taxonomic information is more important for these groups, which generally exhibit more diverse start codon contexts (Nielsen, L.S. et al., manuscript in preparation). It should be emphasized that these results were obtained on a test set comprising sequences from the same organisms that are represented in the training set. Thus, performance on sequences from entirely unseen species could be significantly lower and remains to be investigated.

Fig. 7. MCCs measured on the non-homologous test set with NetStart 2.0 under three species input settings at the universally optimized threshold (0.625). “Species input” uses the learned embeddings from all 7 ranks for the organism embedding, “Phylum input” truncates the organism embedding to use information up to the phylum level, and “No organism input” omits taxonomic information entirely.

Effect of group-specific fine-tuning

To assess the extent to which NetStart 2.0 learned sequence patterns specific to each organism group, we fine-tuned it independently on training data exclusively from each group (vertebrates, invertebrates, plants, fungi, and protozoa), saving individual checkpoints (i.e., snapshots of the optimized model parameters for each group). This yielded only a marginal increase in performance, most notably for protozoan species (Supplementary Table A15). Given the limited improvement and reduced flexibility, we opted to implement NetStart 2.0 with a shared checkpoint for all species.

Gnomon annotations: impact on performance

It may seem surprising that the transcript-level accuracies (Table 1) for NetStart 2.0, NetStart 2.0A, and TIS Transformer are slightly lower for the RefSeq-annotated transcripts only (numbers in parentheses). However, it is not immediately obvious that one should expect higher performance for data with stronger experimental support (RefSeq). On the one hand, the Gnomon data could be expected to contain more noise, since the annotations are of lower quality. On the other hand, Gnomon data could be expected to be more regular, since they may be biased towards genes that are easy to predict. To test whether including Gnomon-annotated data had an adverse effect on training, we trained a version of the model using only the vertebrate sequences annotated with RefSeq for each of the 4 data partitions used for model development (473,481 sequences across the 4 partitions, of which \(14.38\%\) were TIS-labeled), following the training procedure described in the “NetStart 2.0 architecture and training” section (Training Procedure). We selected the vertebrate group for this experiment, as it is the systematic group with the most RefSeq-annotated species and sequences in our dataset. The trained model was run on the vertebrate sequences from both the transcript-level test set and the non-homologous test set. Across all metrics (transcript-level accuracy, MCC, AUC, and APS), the performance was slightly lower than that of the original NetStart 2.0 model, even when evaluating on RefSeq-annotated test sequences only (see Supplementary Tables A16 and A17). We interpret these results as demonstrating that the bias introduced by including Gnomon-annotated sequences is negligible in this context, while also showcasing the consequence of reducing the amount of training data.

Discussion

The emergence of pretrained protein language models has significantly advanced protein sequence modeling, and the performance of NetStart 2.0 demonstrates the potential of integrating peptide-level information to improve nucleotide-level sequence predictions. The comparison of NetStart 2.0A to NetStart 2.0 showed that its predictive ability is largely derived from the “global” sequence window (Fig. 2, green window), which was implemented to capture the “protein-ness” of the sequence by assessing shifts from non-coding to coding regions. Notably, this remained true despite the broad phylogenetic diversity represented in the NetStart 2.0 training set. Our results specifically show that species-level taxonomic information had only a minor impact for vertebrates and plants, indicating that NetStart 2.0 relies minimally on detailed species-specific representations for these groups. In contrast, including species-level taxonomic information has a larger impact for the remaining organisms. Considering the diversity in local start codon contexts among organisms, these findings could be expected: organisms of plant or vertebrate origin, respectively, have very similar start codon contexts, whereas these patterns vary more across the remaining organism groups (Nielsen, L.S. et al., manuscript in preparation). Our observations suggest that while most of NetStart 2.0’s predictive power arises from the “global” sequence window, assessment of the start codon context also contributes to the overall performance. However, across all groups there was no substantial loss in performance when the representation of a species was reduced to the phylum level (Fig. 7). These findings illustrate the remarkable potential of protein language models for detecting biologically relevant signals, even beyond their primary training scope.

The biggest challenge for NetStart 2.0 was distinguishing true TIS ATGs from downstream non-TIS ATGs located within the same reading frame. This challenge likely originates from the architectural reliance on the global sequence window, as these two sequence types share similar protein-like contexts downstream of the labeled ATG. However, the overall ability of NetStart 2.0A to identify transitions between coding and non-coding regions raises compelling questions about the extent to which such models can comprehend coding potential or “protein-ness” within sequences. Although ESM-2 is pretrained on full-length proteins, our findings raise the intriguing possibility of using protein language models to detect functional subsequences within proteins.

The substantial performance gap between NetStart 1.0A and TIS Transformer underscores the importance of architectural complexity in TIS prediction, given the similarity in their underlying data inputs. Despite being trained exclusively on human transcript sequences, TIS Transformer excelled in learning sequence patterns beyond human context, achieving high performance across diverse eukaryotic groups and pointing to a universal signal in TIS prediction transcending species boundaries despite high diversity in start codon context patterns (Nielsen, L.S. et al., manuscript in preparation).

Although fine-tuning ESM-2 substantially altered its amino acid representations (Fig. 3), we cannot exclude the possibility that downstream feed-forward layers alone could adequately adapt to TIS prediction, potentially making fine-tuning unnecessary. However, early modeling experiments showed weaker results without fine-tuning, leading us to focus on the fine-tuned version of ESM-2. While some tasks using full protein sequences might not require it, fine-tuning has proven beneficial for a broad range of downstream tasks [62]. Furthermore, our approach specifically leveraged ESM-2 to detect transitions from non-coding to coding regions, slightly shifting its scope from the original pretraining objective.

While NetStart 2.0 slightly outperformed TIS Transformer at the transcript level, both models exhibited limited performance on the genomic test set. This performance drop was strongly influenced by the position of the first downstream intron, highlighting a broader challenge in applying transcript-level TIS predictors to genomic contexts. In comparison, both Tiberius and AUGUSTUS performed notably better on the genomic test set, although their performance also declined relative to the transcript-level tests. This suggests considerable potential for improving models specifically designed for genomic-level TIS prediction.

Finally, NetStart 2.0’s training was limited to predicting canonical start codons (ATG) of main protein-coding ORFs. Recent evidence from ribosome profiling has revealed a substantial presence of sORFs with both canonical and non-canonical start codons in mRNA transcripts [8, 18, 20]. However, annotations of such features are currently very limited in genomic annotation databases such as NCBI’s Eukaryotic Genome Annotation Pipeline [43]. Future research incorporating predictions of both sORFs and non-canonical start codons could significantly expand the scope and practical utility of TIS prediction models such as NetStart 2.0.

Conclusions

The integration of peptide-level information through a pretrained protein language model significantly improves the accuracy of predicting translation initiation sites in eukaryotic mRNAs, achieving state-of-the-art performance. The success of NetStart 2.0 in leveraging protein-level context highlights the broader potential of protein language models in bridging nucleotide- and peptide information for diverse biological prediction tasks.

Availability of data and materials

The datasets supporting the conclusions of this article are available at the NetStart 2.0 webserver site, accessible at: https://services.healthtech.dtu.dk/services/NetStart-2.0/ (see the “Data” section at the webserver site). The raw datasets, comprising the RefSeq genomes and corresponding annotations, were collected from the FTP server at NCBI: https://ftp.ncbi.nlm.nih.gov/genomes/refseq/. Specifically, for each species the subpath ./<organism_group>/<genus_species> was followed. The NetStart 2.0 online server is available at https://services.healthtech.dtu.dk/services/NetStart-2.0/. For large datasets, the program can be downloaded locally from the GitHub repository https://github.com/lsandvad/netstart2.

Abbreviations

APS:

Average precision score

AUC:

Area under the receiver operating characteristic curve

BCE:

Binary cross entropy

CDS:

Coding sequence

ESM:

Evolutionary scale modeling

MCC:

Matthews correlation coefficient

mORF:

Main open reading frame

ORF:

Open reading frame

sORF:

Short open reading frame

TIS:

Translation initiation site

uORF:

Upstream open reading frame

UTR:

Untranslated region

References

  1. Kozak M. How do eucaryotic ribosomes select initiation regions in messenger RNA? Cell. 1978;15(4):1109–23. https://doi.org/10.1016/0092-8674(78)90039-9.


  2. Jackson RJ, Hellen CU, Pestova TV. The mechanism of eukaryotic translation initiation and principles of its regulation. Nat Rev Mol Cell Biol. 2010;11(2):113–27. https://doi.org/10.1038/nrm2838.


  3. Kozak M. The scanning model for translation: an update. J Cell Biol. 1989;108(2):229–41. https://doi.org/10.1083/jcb.108.2.229.


  4. Andreev DE, Loughran G, Fedorova AD, Mikhaylova MS, Shatsky IN, Baranov PV. Non-AUG translation initiation in mammals. Genome Biol. 2022;23(1):111. https://doi.org/10.1186/s13059-022-02674-2.


  5. Kozak M. Point mutations define a sequence flanking the AUG initiator codon that modulates translation by eukaryotic ribosomes. Cell. 1986;44(2):283–92. https://doi.org/10.1016/0092-8674(86)90762-2.


  6. Kozak M. Pushing the limits of the scanning mechanism for initiation of translation. Gene. 2002;299(1–2):1–34. https://doi.org/10.1016/S0378-1119(02)01056-9.


  7. Hernández G, Osnaya VG, Pérez-Martínez X. Conservation and variability of the AUG initiation codon context in eukaryotes. Trends Biochem Sci. 2019;44(12):1009–21. https://doi.org/10.1016/j.tibs.2019.07.001.


  8. Zhang H, Wang Y, Wu X, Tang X, Wu C, Lu J. Determinants of genome-wide distribution and evolution of uORFs in eukaryotes. Nat Commun. 2021;12(1):1076. https://doi.org/10.1038/s41467-021-21394-y.


  9. Miesfeld RL, McEvoy MM. Biochemistry of mRNA translation. In: Biochemistry, 1st ed. New York: W. W. Norton & Company; 2017. pp. 1118–1119.

  10. Pisarev AV, Kolupaeva VG, Pisareva VP, Merrick WC, Hellen CU, Pestova TV. Specific functional interactions of nucleotides at key \(-3\) and \(+4\) positions flanking the initiation codon with components of the mammalian 48S translation initiation complex. Genes Dev. 2006;20(5):624–36. https://doi.org/10.1101/gad.1397906.


  11. Dever TE, Ivanov IP, Hinnebusch AG. Translational regulation by uORFs and start codon selection stringency. Genes Dev. 2023;37(11–12):474–89. https://doi.org/10.1101/gad.350752.123.


  12. Pedersen AG, Nielsen H. Neural network prediction of translation initiation sites in eukaryotes: perspectives for EST and genome analysis. In: Gaasterland T, Karp PD, Karplus K, Ouzounis C, Sander C, Valencia A, editors. Proceedings of the 5th international conference on intelligent systems for molecular biology. Washington, DC: AAAI Press; 1997. pp. 226–233.

  13. Nakagawa S, Niimura Y, Gojobori T, Tanaka H, Miura K. Diversity of preferred nucleotide sequences around the translation initiation codon in eukaryote genomes. Nucleic Acids Res. 2008;36(3):861–71. https://doi.org/10.1093/nar/gkm1102.


  14. Krebs JE, Goldstein ES, Kilpatrick ST. Translation. In: Lewin’s Genes XII, 12th ed. Burlington: Jones & Bartlett Learning; 2018. pp. 592–596.

  15. Xu H, Wang P, Fu Y, Zheng Y, Tang Q, Si L, et al. Length of the ORF, position of the first AUG and the Kozak motif are important factors in potential dual-coding transcripts. Cell Res. 2010;20(4):445–57. https://doi.org/10.1038/cr.2010.25.


  16. Sayers EW, Cavanaugh M, Clark K, Pruitt KD, Sherry ST, Yankie L, et al. GenBank 2024 update. Nucleic Acids Res. 2023;52(D1):D134–7. https://doi.org/10.1093/nar/gkad903.


  17. Goel N, Singh S, Aseri TC. Global sequence features based translation initiation site prediction in human genomic sequences. Heliyon. 2020;6(9):e04825. https://doi.org/10.1016/j.heliyon.2020.e04825.

  18. Xiang Y, Huang W, Tan L, Chen T, He Y, Irving PS, et al. Pervasive downstream RNA hairpins dynamically dictate start-codon selection. Nature. 2023;621(7978):423–30. https://doi.org/10.1038/s41586-023-06500-y.

  19. Cao X, Slavoff SA. Non-AUG start codons: expanding and regulating the small and alternative ORFeome. Exp Cell Res. 2020;391(1):111973. https://doi.org/10.1016/j.yexcr.2020.111973.

  20. Clauwaert J, McVey Z, Gupta R, Menschaert G. TIS transformer: remapping the human proteome using deep learning. NAR Genomics Bioinform. 2023;5(1):lqad021. https://doi.org/10.1093/nargab/lqad021.

  21. Jin Y, Ivanov M, Dittrich AN, Nelson AD, Marquardt S. LncRNA FLAIL affects alternative splicing and represses flowering in Arabidopsis. EMBO J. 2023;42(11):e110921. https://doi.org/10.15252/embj.2022110921.

  22. Frikstad K, Molinari E, Thoresen M, Ramsbottom SA, Hughes F, Letteboer SJ, et al. A CEP104-CSPP1 complex is required for formation of primary cilia competent in hedgehog signaling. Cell Rep. 2019;28(7):1907–22. https://doi.org/10.1016/j.celrep.2019.07.025.

  23. Fuster-García C, García-García G, Jaijo T, Fornés N, Ayuso C, Fernández-Burriel M, et al. High-throughput sequencing for the molecular diagnosis of Usher syndrome reveals 42 novel mutations and consolidates CEP250 as Usher-like disease causative. Sci Rep. 2018;8(1):17113. https://doi.org/10.1038/s41598-018-35085-0.

  24. Williams JL, Paudyal A, Awad S, Nicholson J, Grzesik D, Botta J, et al. Mylk3 null C57BL/6N mice develop cardiomyopathy, whereas Nnt null C57BL/6J mice do not. Life Sci Alliance. 2020. https://doi.org/10.26508/lsa.201900593.

  25. Lenglez S, Sablon A, Fénelon G, Boland A, Deleuze JF, Boutoleau-Bretonnière C, et al. Distinct functional classes of PDGFRB pathogenic variants in primary familial brain calcification. Hum Mol Genet. 2022;31(3):399–409. https://doi.org/10.1093/hmg/ddab258.

  26. Jankovic B, Gojobori T. From shallow to deep: some lessons learned from application of machine learning for recognition of functional genomic elements in human genome. Hum Genomics. 2022;16(1):7. https://doi.org/10.1186/s40246-022-00376-1.

  27. Kalkatawi M, Magana-Mora A, Jankovic B, Bajic VB. DeepGSR: an optimized deep-learning structure for the recognition of genomic signals and regions. Bioinformatics. 2019;35(7):1125–32. https://doi.org/10.1093/bioinformatics/bty752.

  28. Zhang S, Hu H, Jiang T, Zhang L, Zeng J. TITER: predicting translation initiation sites by deep learning. Bioinformatics. 2017;33(14):i234–42. https://doi.org/10.1093/bioinformatics/btx247.

  29. Liu Q, Fang H, Wang X, Wang M, Li S, Coin LJ, et al. DeepGenGrep: a general deep learning-based predictor for multiple genomic signals and regions. Bioinformatics. 2022;38(17):4053–61. https://doi.org/10.1093/bioinformatics/btac454.

  30. Eraslan G, Avsec Ž, Gagneur J, Theis FJ. Deep learning: new computational modelling techniques for genomics. Nat Rev Genet. 2019;20(7):389–403. https://doi.org/10.1038/s41576-019-0122-6.

  31. Stanke M, Steinkamp R, Waack S, Morgenstern B. AUGUSTUS: a web server for gene finding in eukaryotes. Nucleic Acids Res. 2004;32(suppl-2):W309–12. https://doi.org/10.1093/nar/gkh379.

  32. Stanke M, Morgenstern B. AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints. Nucleic Acids Res. 2005;33(suppl-2):W465–7. https://doi.org/10.1093/nar/gki458.

  33. Scalzitti N, Jeannin-Girardon A, Collet P, Poch O, Thompson JD. A benchmark study of ab initio gene prediction methods in diverse eukaryotic organisms. BMC Genomics. 2020;21:1–20. https://doi.org/10.1186/s12864-020-6707-9.

  34. Hoff KJ, Stanke M. WebAUGUSTUS—a web service for training AUGUSTUS and predicting genes in eukaryotes. Nucleic Acids Res. 2013;41(W1):W123–8. https://doi.org/10.1093/nar/gkt418.

  35. Stanke M, Keller O, Gunduz I, Hayes A, Waack S, Morgenstern B. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res. 2006;34(suppl-2):W435–9. https://doi.org/10.1093/nar/gkl200.

  36. Gabriel L, Becker F, Hoff KJ, Stanke M. Tiberius: end-to-end deep learning with an HMM for gene prediction. Bioinformatics. 2024;40(12):btae685. https://doi.org/10.1093/bioinformatics/btae685.

  37. Khurana D, Koli A, Khatter K, Singh S. Natural language processing: state of the art, current trends and challenges. Multimed Tools Appl. 2023;82(3):3713–44. https://doi.org/10.1007/s11042-022-13428-4.

  38. Elnaggar A, Heinzinger M, Dallago C, Rehawi G, Wang Y, Jones L, et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans Pattern Anal Mach Intell. 2021;44(10):7112–27. https://doi.org/10.1109/TPAMI.2021.3095381.

  39. Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379(6637):1123–30. https://doi.org/10.1126/science.ade2574.

  40. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. In: Guyon I, von Luxburg U, Bengio S, Wallach H, Fergus R, Vishwanathan S, et al., editors. Advances in neural information processing systems, vol. 30. Red Hook: Curran Associates, Inc.; 2017. Available from: https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.

  41. Brandes N, Ofer D, Peleg Y, Rappoport N, Linial M. ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics. 2022;38(8):2102–10. https://doi.org/10.1093/bioinformatics/btac020.

  42. O’Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2015;44(D1):D733–45. https://doi.org/10.1093/nar/gkv1189.

  43. Thibaud-Nissen F, Souvorov A, Murphy T, DiCuccio M, Kitts P. Eukaryotic genome annotation pipeline. In: The NCBI handbook [Internet], 2nd ed. Bethesda: National Center for Biotechnology Information (US); 2013. Available from: https://www.ncbi.nlm.nih.gov/books/NBK169439/.

  44. Souvorov A, Kapustin Y, Kiryutin B, Chetvernin V, Tatusova T, Lipman D. Gnomon—NCBI eukaryotic gene prediction tool. National Center for Biotechnology Information; 2010. Available from: https://www.ncbi.nlm.nih.gov/core/assets/genome/files/Gnomon-description.pdf.

  45. Teufel F, Gíslason MH, Almagro Armenteros JJ, Johansen AR, Winther O, Nielsen H. GraphPart: homology partitioning for biological sequence analysis. NAR Genomics Bioinform. 2023;5(4):lqad088. https://doi.org/10.1093/nargab/lqad088.

  46. Steinegger M, Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol. 2017;35(11):1026–8. https://doi.org/10.1038/nbt.3988.

  47. Teufel F, Stahlhut C, Refsgaard J, Nielsen H, Winther O, Madsen D. SecretoGen: towards prediction of signal peptides for efficient protein secretion. In: NeurIPS 2023 generative AI and biology (GenBio) workshop; 2023. Available from: https://openreview.net/forum?id=vXXEfmYsvS.

  48. Schoch CL, Ciufo S, Domrachev M, Hotton CL, Kannan S, Khovanskaya R, et al. NCBI taxonomy: a comprehensive update on curation, resources and tools. Database. 2020;2020:baaa062. https://doi.org/10.1093/database/baaa062.

  49. Dalla-Torre H, Gonzalez L, Mendoza-Revilla J, Lopez Carranza N, Grzywaczewski AH, Oteri F, et al. Nucleotide transformer: building and evaluating robust foundation models for human genomics. Nat Methods. 2024. https://doi.org/10.1038/s41592-024-02523-z.

  50. ESM. Documentation of ESM-2 from HuggingFace. Available from: https://huggingface.co/docs/transformers/v4.52.2/en/model_doc/esm#transformers.

  51. Kingma DP, Ba J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. 2014. https://doi.org/10.48550/arXiv.1412.6980.

  52. Prechelt L. Early stopping—But when? In: Orr GB, Müller KR, editors. Neural networks: tricks of the trade. Berlin: Springer; 2002. pp. 55–69.

  53. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res. 2014;15(1):1929–58.

  54. Morgulis A, Gertz EM, Schäffer AA, Agarwala R. WindowMasker: window-based masker for sequenced genomes. Bioinformatics. 2005;22(2):134–41. https://doi.org/10.1093/bioinformatics/bti774.

  55. Zhi D, Raphael BJ, Price AL, Tang H, Pevzner PA. Identifying repeat domains in large genomes. Genome Biol. 2006;7:1–14. https://doi.org/10.1186/gb-2006-7-1-r7.

  56. Le N, Yapp E, Nagasundaram N, Yeh HY. Classifying promoters by interpreting the hidden information of DNA sequences via deep learning and combination of continuous fasttext N-grams. Front Bioeng Biotechnol. 2019;7:305. https://doi.org/10.3389/fbioe.2019.00305.

  57. Richardson E, Trevizani R, Greenbaum JA, Carter H, Nielsen M, Peters B. The receiver operating characteristic curve accurately assesses imbalanced datasets. Patterns. 2024. https://doi.org/10.1016/j.patter.2024.100994.

  58. Conte AD, Mehdiabadi M, Bouhraoua A, Miguel Monzon A, Tosatto SC, Piovesan D. Critical assessment of protein intrinsic disorder prediction (CAID) - Results of round 2. Proteins Struct Funct Bioinform. 2023;91(12):1925–34.

  59. Keilwagen J, Grosse I, Grau J. Area under precision-recall curves for weighted and unweighted data. PLoS ONE. 2014;9(3):e92209. https://doi.org/10.1371/journal.pone.0092209.

  60. Günzl A. The pre-mRNA splicing machinery of trypanosomes: Complex or simplified? Eukaryot Cell. 2010;9(8):1159–70. https://doi.org/10.1128/ec.00113-10.

  61. Grünebast J, Clos J. Leishmania: responding to environmental signals and challenges without regulated transcription. Comput Struct Biotechnol J. 2020;18:4016–23. https://doi.org/10.1016/j.csbj.2020.11.058.

  62. Schmirler R, Heinzinger M, Rost B. Fine-tuning protein language models boosts predictions across diverse tasks. Nat Commun. 2024;15(1):7407. https://doi.org/10.1038/s41467-024-51844-2.

Acknowledgements

We would like to acknowledge Peter Wad Sackett for his help with setting up the NetStart 2.0 webserver.

Funding

Open access funding provided by Copenhagen University. L.S.N. and O.W. were in part funded by the Novo Nordisk Foundation through the Center for Basic Machine Learning Research in Life Science (NNF20OC0062606) and CAZAI (NNF22OC0077058). O.W. and L.S.N. acknowledge support from the Pioneer Center for AI, DNRF Grant Number P1.

Author information

Contributions

H.N. proposed the conceptual framework. L.S.N. developed the source code and models. L.S.N. drafted the manuscript with support from H.N. and A.G.P. A.G.P., H.N., and O.W. supervised the project. All authors provided substantial inputs to and reviewed the manuscript.

Corresponding authors

Correspondence to Line Sandvad Nielsen or Henrik Nielsen.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Nielsen, L.S., Pedersen, A.G., Winther, O. et al. NetStart 2.0: prediction of eukaryotic translation initiation sites using a protein language model. BMC Bioinformatics 26, 216 (2025). https://doi.org/10.1186/s12859-025-06220-2
