Converting the genomic knowledge base to build protein specific machine learning prediction models; a classification study on thermophilic serine protease

Jithin S. Sunny, Atul Kumar, Khairun Nisha, Lilly M. Saleena

Research output: Contribution to journal, Article, peer-review

Abstract

Several machine learning models have been formulated to classify proteins by thermostability, an important prerequisite for industrial use. Here we describe a classification model for a specific enzyme, serine protease. To build the classifier, 283 thermophilic and 200 mesophilic bacterial genomes were mined for their respective serine protease sequences. Features were extracted from 760 sequences, followed by feature selection. We deployed a random forest-based classifier that identified thermophilic and non-thermophilic serine proteases with an accuracy of 97.11%, higher than the other benchmark machine learning methods. Knowledge of thermostability and amino acid positional shifts can be vital for downstream protein engineering techniques. Thus, a web platform has been proposed to emphasize the real-time application of this enzyme-specific classification model. We designed a framework that lets protein engineers combine their sequence data with the classification model, align query sequences against custom databases, and identify similar novel enzymes along with their thermophilic nature.
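
The abstract does not say which sequence features were extracted, so the following is only a minimal sketch of one common choice, amino acid composition, and is not the authors' pipeline. The function names `amino_acid_composition` and `feature_vector` are hypothetical. Each sequence is turned into a fixed-length numeric vector, the kind of input a random forest classifier would need.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so every vector lines up.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in a protein sequence.

    Assumption for illustration: non-standard symbols (e.g. X, B, Z, gaps)
    are ignored rather than counted.
    """
    counts = Counter(res for res in sequence.upper() if res in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def feature_vector(sequence):
    """Flatten the composition into a 20-element list in AMINO_ACIDS order."""
    composition = amino_acid_composition(sequence)
    return [composition[aa] for aa in AMINO_ACIDS]
```

Vectors built this way, one per sequence with a thermophilic or mesophilic label, could then be passed to a random forest implementation such as scikit-learn's `RandomForestClassifier`; whether the study used that library or these features is not stated in the abstract.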

Original language: English
Pages (from-to): 3615-3622
Number of pages: 8
Journal: Biologia
Volume: 77
Issue number: 12
Publication status: Published - 2022 Dec

Subject classification (UKÄ)

  • Medical Genetics and Genomics (including Gene Therapy)

Free keywords

  • Machine learning
  • Protein engineering
  • Random forest
  • Serine protease
  • Thermophilic
