Abstract
Identifying genes that lead other genes towards disease in neurological disorders such as Parkinson’s disease (PD) is an important task in biomedical research. Machine learning techniques have been used extensively in recent years to identify disease-associated genes. However, these methods have relied on protein–protein interaction, gene expression, and gene ontology data, which may encode incomplete prior knowledge in the features constructed for each gene. In this study, the physicochemical properties of amino acids are therefore used as universal knowledge to extract features from protein sequences, and several machine learning models are used to classify genes associated with PD. An ensemble method built on the four highest-performing classifiers is designed to improve diagnostic accuracy. Comparative analysis shows that gradient boosting performs best among the individual classifiers, with an accuracy of 77.50% and an area under the curve of 0.774, ahead of the other six methods. The ensemble method, however, achieves an accuracy of 83.75%. The ensemble method is also evaluated against existing disease gene identification methods; the results suggest that it is more accurate and effective at identifying PD genes. Re-sampling techniques for resolving class imbalance have been shown to increase classification accuracy by reducing the bias introduced by differences in class size. The proposed model can also be used as a prediction tool for Alzheimer’s disease protein sequences.
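As a rough illustration of the two ideas in the abstract (sequence-derived physicochemical features and an ensemble over several classifiers), the following minimal sketch may help. It is not the authors' implementation. The abstract does not say which physicochemical properties or which combination rule were used, so the Kyte-Doolittle hydropathy scale, the charged-residue fraction, hard majority voting, and the tie-break rule are all assumptions made here for illustration. The function names `sequence_features` and `majority_vote` are hypothetical.

```python
from collections import Counter

# Kyte-Doolittle hydropathy index (assumed property; the paper's actual
# feature set is not given in the abstract).
KD_HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def sequence_features(seq):
    """Return [mean hydropathy, charged-residue fraction, length]."""
    # Ignore characters that are not one of the 20 standard residues.
    residues = [r for r in seq.upper() if r in KD_HYDROPATHY]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    mean_hydropathy = sum(KD_HYDROPATHY[r] for r in residues) / len(residues)
    charged_fraction = sum(r in "DEKR" for r in residues) / len(residues)
    return [mean_hydropathy, charged_fraction, len(residues)]


def majority_vote(predictions, tie_break=1):
    """Hard vote over per-classifier labels.

    With an even number of voters (for example four) ties are possible;
    here a tie goes to `tie_break`, which is an assumed rule.
    """
    ranked = Counter(predictions).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_break
    return ranked[0][0]
```

In such a setup, each trained classifier would receive the feature vector of a protein sequence, and the labels from the four best classifiers would be passed to `majority_vote`. Soft voting over predicted probabilities is an equally plausible reading of "ensemble" and would avoid the tie problem.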
Original language | English |
---|---|
Article number | 483 |
Pages (from-to) | 1-11 |
Number of pages | 11 |
Journal | SN Computer Science |
Volume | 5 |
Issue number | 5 |
Publication status | Published - Jun 2024 |
MoE publication type | A1 Journal article-refereed |
Keywords
- Machine learning
- Parkinson’s disease
- Physicochemical properties of amino acids
- Protein sequences
- Re-sampling