Abstract:
Long non-coding RNAs (lncRNAs) play important roles in many biological processes and are associated with several diseases. Next-generation sequencing technologies have led to the discovery of numerous unannotated transcripts. However, classifying these transcripts through biological experiments is time-consuming and expensive, so computational approaches offer a faster and cheaper alternative. Although many lncRNA identification tools are available, they do not explain which features contributed to their predictions. Here, we present Xlnc1DCNN, a tool that distinguishes lncRNAs from protein-coding transcripts (PCTs) and provides an explanation of each prediction. We developed the model using a one-dimensional convolutional neural network integrated with DeepSHAP. On the human test dataset, Xlnc1DCNN outperformed other lncRNA identification tools in terms of accuracy and F1-score, and it generalized to other species. We also examined the prediction explanations to better understand how the model makes its predictions. The explanation results revealed that most lncRNA transcripts were characterized by the absence of conserved regions, by short patterns of unknown function, or by regions containing only transmembrane helices, whereas protein-coding transcripts were mostly identified by protein domains or families. Examining the model's incorrect predictions also revealed inconsistent annotations among public databases: some transcripts annotated as lncRNAs contain protein domains, protein families, or intrinsically disordered regions (IDRs). Xlnc1DCNN is freely available at https://github.com/cucpbioinfo/Xlnc1DCNN.
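
To make the idea of a one-dimensional convolution over a transcript concrete, the following is a minimal sketch, not the Xlnc1DCNN implementation. It assumes a one-hot encoding of the nucleotide sequence (the abstract does not state the encoding), and the function names and the hand-set ATG filter are hypothetical. A trained network would learn many such filters from data, and the DeepSHAP attribution step is not shown.

```python
# Illustrative sketch only; names and the encoding are assumptions,
# not taken from the Xlnc1DCNN code base.
NUCLEOTIDES = "ACGT"


def one_hot(seq):
    """Encode a nucleotide string as rows of 4 floats in A, C, G, T order.

    Symbols outside ACGT (for example N) become an all-zero row;
    this handling is an assumption made for the sketch.
    """
    return [[1.0 if base == n else 0.0 for n in NUCLEOTIDES]
            for base in seq.upper()]


def conv1d(encoded, kernel):
    """Slide one filter along the sequence axis (no padding, stride 1).

    `encoded` is a list of 4-element rows and `kernel` is a list of
    4-element rows of filter weights. Returns one activation per window.
    """
    width = len(kernel)
    activations = []
    for start in range(len(encoded) - width + 1):
        total = 0.0
        for k in range(width):
            for c in range(len(NUCLEOTIDES)):
                total += encoded[start + k][c] * kernel[k][c]
        activations.append(total)
    return activations


def relu_global_max(activations):
    """Apply ReLU, then global max pooling; 0.0 if there are no windows."""
    return max((max(a, 0.0) for a in activations), default=0.0)


# Hand-set filter that responds most strongly to the start codon ATG.
# A real model would learn its filter weights during training.
ATG_FILTER = one_hot("ATG")


def atg_score(seq):
    """Strongest response of the ATG filter anywhere in the sequence."""
    return relu_global_max(conv1d(one_hot(seq), ATG_FILTER))
```

For example, `atg_score("CCATGCC")` is highest where the window matches ATG exactly. Because each activation is tied to a position in the sequence, convolutional features of this kind lend themselves to per-position attribution.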