PREDICTING HUMAN PATHOGENICITY OF BACTERIA USING RANDOM FOREST AND TEXT-BASED FEATURE ENGINEERING
Keywords:
Bacterial Classification, Random Forest, TF-IDF Vectorization, Feature Engineering, Microbiological Risk Assessment
Abstract
Classifying bacteria by their potential harm to humans is central to microbiology, particularly for the early detection and prevention of pathogenic threats. This study develops a classification model that predicts whether a bacterial species is harmful to humans, using habitat descriptions and taxonomic information as input features.
The dataset consists of 200 bacterial species, each described by a “Where Found” (habitat) attribute and a “Family” (taxonomy) attribute. Preprocessing comprises label normalization, TF-IDF transformation of the habitat text, and one-hot encoding of the categorical family feature. The resulting feature set is used to train a Random Forest classifier. Model performance is evaluated on an 80/20 stratified train/test split using accuracy and a classification report, complemented by 5-fold cross-validation. Hyperparameters are then tuned with GridSearchCV to identify the best settings.
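The workflow described above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the study's actual code: the toy rows, the placeholder family names, the label column name "Harmful", the hyperparameter grid, and the random seed are all hypothetical, and running cross-validation on the full dataset is one possible reading of the protocol.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def make_toy_data():
    """Hypothetical stand-in for the 200-species dataset (not real data)."""
    rows = [
        ("soil and decaying plants", "Family A", " Harmless"),
        ("fresh water and soil", "Family B", "harmless"),
        ("human gut", "Family C", "Harmful "),
        ("infected human wounds", "Family D", "harmful"),
    ] * 10
    # "Harmful" is an assumed name for the label column.
    return pd.DataFrame(rows, columns=["Where Found", "Family", "Harmful"])


def build_pipeline(seed=42):
    """TF-IDF on habitat text + one-hot on family, feeding a Random Forest."""
    features = ColumnTransformer([
        # A plain string selects a 1-D column, which TfidfVectorizer expects.
        ("habitat_tfidf", TfidfVectorizer(), "Where Found"),
        ("family_onehot", OneHotEncoder(handle_unknown="ignore"), ["Family"]),
    ])
    return Pipeline([
        ("features", features),
        ("forest", RandomForestClassifier(random_state=seed)),
    ])


def run_demo(seed=42):
    df = make_toy_data()
    # Label normalization: trim whitespace and unify case.
    y = df["Harmful"].str.strip().str.lower()
    X = df[["Where Found", "Family"]]

    # 80/20 stratified split.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )

    # 5-fold cross-validation (assumption: run on the full dataset).
    cv_scores = cross_val_score(build_pipeline(seed), X, y, cv=5)

    # Hyperparameter search; this grid is illustrative only.
    search = GridSearchCV(
        build_pipeline(seed),
        param_grid={
            "forest__n_estimators": [25, 50],
            "forest__max_depth": [None, 5],
        },
        cv=5,
    )
    search.fit(X_train, y_train)

    return {
        "n_test": len(X_test),
        "test_accuracy": search.score(X_test, y_test),
        "cv_mean_accuracy": cv_scores.mean(),
        "best_params": search.best_params_,
    }


if __name__ == "__main__":
    print(run_demo())
```

Wrapping the vectorizer and encoder in a single Pipeline means both are refit inside every cross-validation fold, so vocabulary and category statistics from held-out rows do not leak into training.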
The model achieved 80% accuracy on the held-out test set and a mean 5-fold cross-validation accuracy of 71.38%. Feature importance analysis indicates that habitat-related keywords such as “soil,” “human,” and “infected” have the strongest influence on the classification. These findings suggest that combining text-based feature engineering with an ensemble classifier can distinguish harmful from harmless bacteria.
This research provides an interpretable and efficient machine learning pipeline for microbiological risk assessment, with potential applications in clinical diagnostics, public health surveillance, and environmental microbiology.