PREDICTING HUMAN PATHOGENICITY OF BACTERIA USING RANDOM FOREST AND TEXT-BASED FEATURE ENGINEERING

Authors

  • Loan Thi Dang, Thang Long University
  • Tran Tuan Tu, Thai Nguyen University
  • Herry Susanto, Universitas Islam Sultan Agung
  • Zulhenry, National Taipei University
  • Andri Triyono, Universitas An Nuur

Keywords:

Bacterial Classification, Random Forest, TF-IDF Vectorization, Feature Engineering, Microbiological Risk Assessment

Abstract

Classifying bacteria by their potential harm to humans is a core task in microbiology, particularly for the early detection and prevention of pathogenic threats. This study develops a classification model that predicts whether a bacterial species is harmful to humans, using habitat descriptions and taxonomic information as input features.

The dataset consists of 200 bacterial species, each described by a “Where Found” (habitat) attribute and a “Family” (taxonomy) attribute. Preprocessing comprises label normalization, TF-IDF transformation of the text data, and one-hot encoding of the categorical features. The resulting feature set is used to train a Random Forest classifier. Model performance is evaluated on an 80/20 stratified train-test split using accuracy and a classification report, and with 5-fold cross-validation. Hyperparameters are then tuned with GridSearchCV.
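The workflow described above can be sketched with scikit-learn and pandas. This is a minimal illustration under stated assumptions, not the authors' code: the toy rows, the "Label" column name, the step names, the parameter grid, and the 3-fold setting inside GridSearchCV are all hypothetical. Only the “Where Found” and “Family” attributes, the 80/20 stratified split, and the 5-fold cross-validation come from the abstract.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows standing in for the 200-species dataset.
base_rows = [
    ("Human gut, infected wounds", "Enterobacteriaceae", "Harmful"),
    ("Infected human respiratory tract", "Streptococcaceae", " harmful"),
    ("Soil and plant roots", "Rhizobiaceae", "Harmless"),
    ("Fermented dairy products", "Lactobacillaceae", "harmless "),
]
df = pd.DataFrame(base_rows * 5, columns=["Where Found", "Family", "Label"])

# Label normalization: trim whitespace and unify letter case.
df["Label"] = df["Label"].str.strip().str.lower()

X = df[["Where Found", "Family"]]
y = df["Label"]

# TF-IDF for the habitat text, one-hot encoding for the family category.
preprocess = ColumnTransformer([
    ("habitat_tfidf", TfidfVectorizer(), "Where Found"),
    ("family_onehot", OneHotEncoder(handle_unknown="ignore"), ["Family"]),
])
model = Pipeline([
    ("features", preprocess),
    ("forest", RandomForestClassifier(random_state=42)),
])

# 80/20 stratified train-test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
test_accuracy = accuracy_score(y_test, predictions)
print(classification_report(y_test, predictions))

# 5-fold cross-validation of the whole pipeline.
cv_scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("Mean CV accuracy:", cv_scores.mean())

# Hyperparameter search over an illustrative (assumed) grid.
param_grid = {
    "forest__n_estimators": [50, 100],
    "forest__max_depth": [None, 5],
}
search = GridSearchCV(model, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)
best_params = search.best_params_
print("Best parameters:", best_params)
```

Wrapping the preprocessing and the classifier in one Pipeline means the TF-IDF vocabulary and the one-hot categories are refit on each training fold, so no information from held-out rows leaks into the features during cross-validation or grid search.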

The model achieved 80% accuracy on the test set and a mean 5-fold cross-validation accuracy of 71.38%. Feature importance analysis indicates that habitat-related keywords such as “soil,” “human,” and “infected” have the strongest influence on the classification. These findings suggest that combining text-based feature engineering with ensemble classification algorithms can effectively distinguish harmful bacteria from non-harmful ones.

This research provides an interpretable and efficient machine learning pipeline for microbiological risk assessment, with potential applications in clinical diagnostics, public health surveillance, and environmental microbiology.

Published

2025-06-20

How to Cite

PREDICTING HUMAN PATHOGENICITY OF BACTERIA USING RANDOM FOREST AND TEXT-BASED FEATURE ENGINEERING. (2025). International Conference Universitas An Nuur, 1(01), 51-66. https://proceedings.unan.ac.id/index.php/unan/article/view/32