Diabetes Risk Prediction Model
Introduction to the Model
The diabetes risk prediction model is built on the Pima Indians Diabetes Database from Kaggle, which contains 768 records with eight health indicators (such as glucose level, blood pressure, BMI, and age) and a target outcome indicating whether an individual is diabetic (1) or non-diabetic (0).
Dataset Preprocessing
1. Handling Missing Data: Zero values in the Glucose, BloodPressure, SkinThickness, Insulin, and BMI columns are physiologically implausible, so they were interpreted as missing data and replaced with the column mean. This keeps every record in the dataset.
2. Feature Standardization: A StandardScaler was used to standardize the features so that each has a mean of 0 and a standard deviation of 1. This puts all variables on a comparable scale, so that features measured in large units (such as Insulin) do not dominate features measured in small units (such as DiabetesPedigreeFunction).
3. Feature and Target Separation: The dataset was divided into features (X) and the target (y), where X consists of health indicators and y represents the diabetes outcome.
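The three preprocessing steps above can be sketched as follows. This is a minimal illustration, not the original code: the small DataFrame holds made-up values and only a subset of the dataset's columns, and the sketch assumes the column mean is computed over the non-zero entries only, which the description above does not specify.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny illustrative frame with made-up values; the real data would be
# loaded from the Kaggle CSV instead.
df = pd.DataFrame({
    "Glucose":       [148, 85, 0, 89],
    "BloodPressure": [72, 66, 64, 0],
    "SkinThickness": [35, 29, 0, 23],
    "Insulin":       [0, 0, 168, 94],
    "BMI":           [33.6, 26.6, 23.3, 0.0],
    "Age":           [50, 31, 32, 21],
    "Outcome":       [1, 0, 1, 0],
})

zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Handling missing data: mark zeros as missing, then fill with the column mean.
# Assumption: the mean ignores the zero (missing) entries.
df[zero_as_missing] = df[zero_as_missing].replace(0, np.nan)
df[zero_as_missing] = df[zero_as_missing].fillna(df[zero_as_missing].mean())

# Feature and target separation.
X = df.drop(columns="Outcome")
y = df["Outcome"]

# Feature standardization: each column gets mean 0 and standard deviation 1.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```

Keeping the fitted scaler object around matters later, because new inputs have to be transformed with the same means and standard deviations.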
Model Training
The model is a Logistic Regression classifier, trained in the following steps:
- Data Splitting: The dataset was divided into training (80%) and testing (20%) subsets to evaluate performance on unseen data.
- Training: The logistic regression model was trained on the standardized training data using the fit method.
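The splitting and training steps can be sketched as below. The data here is a synthetic stand-in generated with make_classification, shaped like the real dataset (768 rows, 8 features). The random_state values are arbitrary, and fitting the scaler on the training split only is an assumption of this sketch rather than something stated above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed features and target
# (same shape as the Pima dataset, but NOT the real data).
X, y = make_classification(n_samples=768, n_features=8, random_state=42)

# Data splitting: 80% for training, 20% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Assumption: the scaler is fitted on the training split only and then
# reused for the test split, so no test-set statistics leak into training.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Training: fit logistic regression on the standardized training data.
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

print(X_train.shape, X_test.shape)
```

With 768 rows, a 20% test split yields 154 test rows and 614 training rows.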
Model Evaluation
After training, the model was evaluated on the test set using the following metrics:
- Accuracy: Achieved an overall accuracy of 76.62% on the test set.
- Precision: The model's precision for identifying diabetic individuals (Outcome 1) is 68.63%.
- Recall: The recall for diabetic predictions is 63.64%, meaning the model correctly identified about 64% of the actual diabetic cases in the test set and missed the remaining 36%.
- F1-Score: The F1-score, the harmonic mean of precision and recall, is 66.04%.
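As a worked check of how these four metrics relate, the sketch below recomputes them from confusion-matrix counts. The counts are inferred, not reported: they assume a 154-row test set (20% of 768 rows) and are the values that reproduce the percentages above.

```python
# Inferred confusion-matrix counts for the diabetic class (Outcome 1).
# These are an assumption consistent with the reported metrics,
# not numbers taken from the original evaluation output.
tp = 35  # diabetic, predicted diabetic
fp = 16  # non-diabetic, predicted diabetic
fn = 20  # diabetic, predicted non-diabetic
tn = 83  # non-diabetic, predicted non-diabetic

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy  {accuracy:.2%}")   # 76.62%
print(f"precision {precision:.2%}")  # 68.63%
print(f"recall    {recall:.2%}")     # 63.64%
print(f"f1        {f1:.2%}")         # 66.04%
```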
The classification report provides further insights:
- For non-diabetic individuals (Outcome 0), the precision and recall are 81% and 84%, respectively, reflecting strong performance in identifying non-diabetic cases.
- For diabetic individuals (Outcome 1), precision and recall are lower, at about 69% and 64%, so the model misses roughly one in three diabetic cases while still providing useful predictions.
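A per-class report of this kind can be produced with scikit-learn's classification_report. In the sketch below, the label lists are reconstructed from inferred counts (35 true positives, 16 false positives, 20 false negatives, 83 true negatives on an assumed 154-row test set) rather than taken from the model's real predictions, so it only illustrates how the per-class figures arise.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Reconstructed labels (an assumption consistent with the reported
# percentages, not the model's actual test-set predictions).
y_true = [1] * 35 + [0] * 16 + [1] * 20 + [0] * 83
y_pred = [1] * 35 + [1] * 16 + [0] * 20 + [0] * 83

# Rows are true classes, columns are predicted classes, in label order 0, 1.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Text report with precision, recall, F1-score, and support per class.
print(classification_report(y_true, y_pred, digits=2))

# The same numbers as a nested dict, convenient for programmatic checks.
report = classification_report(y_true, y_pred, output_dict=True)
```

The support column of the report shows how many test rows belong to each class, which here is 99 non-diabetic and 55 diabetic under the assumed counts.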
How the Model Predicts
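Once trained, the model can predict the diabetes outcome for new input data (e.g., the health metrics of an individual). The input must first be scaled with the same StandardScaler that was fitted during preprocessing, so that it is on the scale the model was trained on. For instance, when provided with data representing a healthy individual, the model predicted a non-diabetic outcome (0).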
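The prediction step can be sketched as below. Everything numeric here is made up for illustration: the training data is randomly generated with rough, hypothetical column means and spreads, the outcome rule is artificial, and new_person is a hypothetical individual, so the printed result says nothing about the real model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Column order of the dataset's eight features.
columns = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
           "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

# Made-up stand-in training data (hypothetical means and spreads).
rng = np.random.default_rng(0)
means = np.array([3.8, 121.0, 72.0, 29.0, 155.0, 32.0, 0.47, 33.0])
stds = np.array([3.4, 30.0, 12.0, 9.0, 85.0, 7.0, 0.33, 12.0])
X_train = rng.normal(means, stds, size=(614, len(columns)))
# Artificial outcome rule: higher glucose makes class 1 more likely.
y_train = (X_train[:, 1] + rng.normal(0.0, 20.0, size=614) > 130.0).astype(int)

# Fit the scaler and the model on the training data.
scaler = StandardScaler().fit(X_train)
model = LogisticRegression().fit(scaler.transform(X_train), y_train)

# A hypothetical new individual, given in the same column order.
new_person = np.array([[1, 95, 70, 20, 80, 22.5, 0.2, 25]])

# Scale with the already fitted scaler (transform, not fit_transform).
new_scaled = scaler.transform(new_person)

predicted_class = int(model.predict(new_scaled)[0])   # 0 or 1
risk = float(model.predict_proba(new_scaled)[0, 1])   # P(Outcome = 1)
print(predicted_class, round(risk, 3))
```

Using predict_proba alongside predict gives an estimated probability of diabetes rather than only a hard 0/1 label, which may be more informative for risk screening.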
Conclusion
This logistic regression model predicts diabetes risk from health metrics with about 77% accuracy on the test set, making it a useful tool for early screening and risk assessment. Its recall of about 64% for diabetic cases leaves room for improvement, and further work (e.g., hyperparameter tuning or exploring more complex algorithms such as Random Forest or XGBoost) may improve its predictive performance.