Abstract
Several machine learning models have been formulated to classify proteins by thermostability, an important prerequisite for industrial usage. Described herein is a classification model for a specific enzyme, serine protease. To build the classifier, 283 thermophilic and 200 mesophilic bacterial genomes were mined for their respective serine protease sequences. Features were extracted from 760 sequences, followed by feature selection. We deployed a random forest-based classifier that identified thermophilic and non-thermophilic serine proteases with an accuracy of 97.11%, higher than that of the other benchmark machine learning methods. Knowledge of thermostability and amino acid positional shifts can be vital for downstream protein engineering techniques. Thus, a web platform has been proposed to emphasize the real-time application of this enzyme-specific classification model. We designed a framework that helps protein engineers combine their sequence data with the classification model, align query sequences against custom databases, and identify similar novel enzymes along with their thermophilic nature.
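The abstract does not say which sequence features were extracted. As a minimal illustration only, the sketch below assumes amino acid composition, a common descriptor in thermostability studies; the function name `amino_acid_composition` and the toy sequence are hypothetical and are not taken from the paper.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so every feature vector is aligned.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> list[float]:
    """Return the fraction of each standard residue in `sequence`.

    Assumption: composition is used here only as an example descriptor;
    the paper's actual feature set is not stated in the abstract.
    """
    seq = sequence.upper()
    # Ignore gaps and ambiguity codes such as '-', 'X', or 'B'.
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Toy peptide, not a real serine protease sequence.
toy = "MKKLLA"
features = amino_acid_composition(toy)
```

Vectors of this kind, one per sequence, could then be passed through feature selection and on to a random forest classifier such as the one the abstract describes.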
Original language | English |
---|---|
Pages (from-to) | 3615-3622 |
Number of pages | 8 |
Journal | Biologia |
Volume | 77 |
Issue number | 12 |
DOIs | |
Publication status | Published - 2022 Dec |
Subject classification (UKÄ)
- Medical Genetics and Genomics (including Gene Therapy)
Free keywords
- Machine learning
- Protein engineering
- Random forest
- Serine protease
- Thermophilic