PREDICTING HUMAN PATHOGENICITY OF BACTERIA USING RANDOM FOREST AND TEXT-BASED FEATURE ENGINEERING

Loan Thi Dang; Tran Tuan Tu; Herry Susanto; Zulhenry; Andri Triyono

Authors

Loan Thi Dang, Thang Long University
Tran Tuan Tu, Thai Nguyen University
Herry Susanto, Universitas Islam Sultan Agung
Zulhenry, National Taipei University
Andri Triyono, Universitas An Nuur

Keywords:

Bacterial Classification, Random Forest, TF-IDF Vectorization, Feature Engineering, Microbiological Risk Assessment

Abstract

Classifying bacteria by their potential harm to humans is an important task in microbiology, particularly for the early detection and prevention of pathogenic threats. This study develops a classification model that predicts whether a bacterial species is harmful to humans, using habitat descriptions and taxonomic information as input features.

The dataset consists of 200 bacterial species, each with a “Where Found” (habitat) attribute and a “Family” (taxonomy) attribute. Preprocessing includes label normalization, TF-IDF transformation of the habitat text, and one-hot encoding of the categorical family feature. The resulting feature set is used to train a Random Forest classifier. Model performance is evaluated on an 80/20 stratified train-test split using accuracy and a classification report, and additionally with 5-fold cross-validation. Hyperparameters are then tuned with GridSearchCV.
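The steps described above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the authors' code: the toy rows, the label column name `Harmful`, the hyperparameter grid, and the choice to run cross-validation on the training portion are all hypothetical, since the abstract does not specify them.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows standing in for the 200-species dataset.
rows = [
    ("soil and fresh water", "Bacillaceae", "No"),
    ("human gut", "Enterobacteriaceae", "Yes"),
    ("infected wounds", "Staphylococcaceae", "YES"),
    ("plant roots in soil", "Rhizobiaceae", "no "),
    ("human skin", "Staphylococcaceae", "yes"),
    ("infected lungs", "Mycobacteriaceae", "Yes"),
    ("fresh water", "Bacillaceae", " no"),
    ("human respiratory tract", "Enterobacteriaceae", "Yes"),
]
df = pd.DataFrame(rows * 5, columns=["Where Found", "Family", "Harmful"])

# Label normalization: trim whitespace, lowercase, map to 0/1.
y = df["Harmful"].str.strip().str.lower().map({"yes": 1, "no": 0})
X = df[["Where Found", "Family"]]

# TF-IDF for the habitat text, one-hot encoding for the family.
features = ColumnTransformer([
    ("habitat", TfidfVectorizer(), "Where Found"),
    ("family", OneHotEncoder(handle_unknown="ignore"), ["Family"]),
])
model = Pipeline([
    ("features", features),
    ("clf", RandomForestClassifier(random_state=42)),
])

# 80/20 stratified split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 5-fold cross-validation (here on the training portion, an assumption).
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")

# Grid search over an illustrative, hypothetical parameter grid.
search = GridSearchCV(
    model,
    param_grid={"clf__n_estimators": [50, 100], "clf__max_depth": [None, 5]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
report = classification_report(y_test, search.predict(X_test))
print("Mean CV accuracy:", cv_scores.mean())
print("Best parameters:", search.best_params_)
print("Test accuracy:", test_accuracy)
print(report)
```

Because the toy rows are repeated, the scores printed here say nothing about real performance; the sketch only shows how the preprocessing, split, cross-validation, and grid search fit together in one pipeline.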

The model achieved 80% accuracy on the test set and a mean 5-fold cross-validation accuracy of 71.38%. Feature importance analysis indicates that habitat-related keywords such as “soil,” “human,” and “infected” have the strongest influence on the classification. These findings suggest that combining text-based feature engineering with an ensemble classifier can distinguish harmful from non-harmful bacteria.

This research provides an interpretable and efficient machine learning pipeline for microbiological risk assessment, with potential applications in clinical diagnostics, public health surveillance, and environmental microbiology.
