To kick things off, the Hawaii Natural Energy Institute (HNEI) dataset was chosen as the starting point for model development. Compared to the NASA Battery Aging Dataset, HNEI offers a simpler, more structured entry point. It provides features for each cycle of 14 LG Chem 18650 lithium-ion batteries.
These per-cycle features range from a simple cycle counter (Cycle_Index) to electrochemical signals derived from the charge and discharge curves, all recorded in tabular form.
The structured nature of this dataset makes it exceptionally well-suited for traditional machine learning models (see Traditional ML). These models excel with tabular, cycle-level data, enabling rapid experimentation with various feature combinations and modeling strategies.
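As a minimal sketch of what that looks like in practice (the file name is a placeholder, and the column names follow the fields referenced later in this post), the cycle-level table loads directly into a dataframe:

```python
import pandas as pd

# Placeholder path; the dataset is one row per cycle, per battery.
df = pd.read_csv("hnei_cycles.csv")

# Tabular, cycle-level data: engineered per-cycle features plus the RUL target.
print(df.shape)
print(df.columns.tolist())
```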
Working with the HNEI dataset at this initial stage serves two critical purposes: it establishes a baseline modeling workflow on clean, tabular data, and it surfaces data-handling pitfalls, such as leakage, before they can contaminate work on more complex datasets.
I've structured the project into several distinct experimental versions. This approach has been instrumental in surfacing meaningful insights about the data, the models, and the requirements for making robust Remaining Useful Life (RUL) predictions.
The first version served primarily as a diagnostic tool—a rapid, exploratory sweep to confirm whether machine learning could effectively capture degradation patterns within the HNEI dataset. A comprehensive data inspection was conducted, confirming no missing values or duplicates. Early correlation analysis identified relationships between input features and the RUL target.
Crucially, this phase highlighted a significant insight: certain columns, particularly the cycle index, were highly correlated with the target variable and introduce severe data leakage if included, as the results below show.
| Model | MAE | MSE | RMSE | R² | MAPE |
|---|---|---|---|---|---|
| Extra Trees Regressor | 5.39 | 255.53 | 15.41 | 0.9976 | 7.22% |
| Ensemble of Top Models | 6.69 | 261.10 | 15.61 | 0.9975 | 8.28% |
These results are unrealistically strong—likely due to data leakage. Without a proper train-test split, the model was evaluated on data it had already seen during training, artificially inflating performance. While the features may contain useful signals, true predictive power can only be validated with rigorous holdout testing.
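A minimal sketch of both problems, reusing the loading snippet above and substituting scikit-learn for PyCaret (the path and column names remain placeholders):

```python
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import r2_score

df = pd.read_csv("hnei_cycles.csv")  # placeholder path, as above

# Leak 1: Cycle_Index counts up while RUL counts down, so their correlation
# is close to -1; a model given Cycle_Index can nearly read the answer off it.
print(df["Cycle_Index"].corr(df["RUL"]))

# Leak 2: scoring on data the model has already seen inflates every metric.
X = df.select_dtypes("number").drop(columns=["RUL"])
model = ExtraTreesRegressor(n_estimators=100, random_state=42).fit(X, df["RUL"])
print("R² on training data:", r2_score(df["RUL"], model.predict(X)))  # ~1.0
```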
Version 2 marked a significant shift in how the data was handled. Instead of randomly sampling data points, entire batteries were withheld from the training set. This battery-wise split more accurately simulates a real-world scenario: deploying a model on a battery it has never seen before.
Consequently, performance dropped—precisely as expected. This revealed that earlier results were likely inflated by leakage between train and test data originating from the same battery, allowing the model to memorize rather than generalize.
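In code, a battery-wise split is only a few lines; here is a sketch with scikit-learn's GroupShuffleSplit, assuming the same placeholder columns as above:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.read_csv("hnei_cycles.csv")  # placeholder path, as above

# Split by battery, not by row: every cycle of a given unit lands entirely
# in train or entirely in test, so nothing from a test battery is ever seen.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df["Battery_ID"]))
train, test = df.iloc[train_idx], df.iloc[test_idx]
print("train batteries:", sorted(train["Battery_ID"].unique()))
print("test batteries: ", sorted(test["Battery_ID"].unique()))
```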
| Condition | MAE | MSE | RMSE | R² | MAPE |
|---|---|---|---|---|---|
| With Outliers | 8.8735 | 387.4732 | 19.4128 | 0.9962 | 12.22% |
| Without Battery_ID | 26.9881 | 1343.7324 | 36.6570 | 0.9872 | 44.37% |
| With Battery_ID | 6.6877 | 261.0975 | 15.6079 | 0.9975 | 8.28% |
While the Extra Trees Regressor (selected via PyCaret's compare_models()) achieved strong battery-wise generalization (RMSE = 35.97, R² = 0.9876), further validation is critical: time-based splits should test temporal robustness, and leave-multiple-battery-out evaluation could reveal dependency on specific units. Though performance aligns with the literature on physics-informed models (Zhang et al., 2020), PyCaret's automated feature selection warrants inspection, since a few key degradation features may dominate. Future work should compare against sequential models (e.g., LSTMs) and validate on heterogeneous battery datasets to assess industrial applicability.
The consistent inverse relationship between prediction error and RUL, with higher errors near end-of-life (EOL), reflects fundamental degradation physics: early-stage capacity fade often follows quasi-linear trends that PyCaret's top-performing tree models capture well, while EOL behavior becomes nonlinear due to cascading failures such as lithium plating and SEI layer breakdown (J. Electrochem. Soc., 2019). This pattern aligns with observations from the NASA PCoE dataset, where prediction uncertainty spikes below 50 cycles of RUL. While PyCaret's compare_models() efficiently surfaces this trend via aggregated metrics, targeted mitigation, such as EOL-focused ensemble weights or survival analysis techniques (Energy Storage, 2022), could further improve actionable warning times.
Version 3 represents a methodological maturation—moving from ad-hoc experimentation to rigorous ML engineering practices. The key insight: proper validation requires proper structure, both in data splitting and model evaluation.
Building on Version 2's battery-wise splitting revelation, Version 3 introduces leave-multiple-batteries-out validation with systematic preprocessing pipelines. Instead of arbitrary single-battery holdouts, this version implements a more robust approach by leaving out batteries 13 and 14 for testing.
More critically, Version 3 eliminates the battery_id leakage trap that plagued earlier versions. While Version 2 showed that including battery_id as a feature boosted performance (RMSE: 15.61), Version 3 properly excludes it from model training while preserving it for GroupKFold validation—critical for real-world deployment scenarios.
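A sketch of that arrangement, under the same assumed column names: Battery_ID steers the split and the CV groups but never enters the feature matrix, and Cycle_Index is dropped outright.

```python
import pandas as pd

df = pd.read_csv("hnei_cycles.csv")  # placeholder path, as above

# Final holdout: batteries 13 and 14 never appear during training or tuning.
test_mask = df["Battery_ID"].isin([13, 14])
train_df, test_df = df[~test_mask], df[test_mask]

# Battery_ID survives only as a grouping key; Cycle_Index is removed to
# close the leakage path identified back in Version 1.
drop_cols = ["RUL", "Battery_ID", "Cycle_Index"]
X_train, y_train = train_df.drop(columns=drop_cols), train_df["RUL"]
groups = train_df["Battery_ID"]  # handed to GroupKFold, not to the model
```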
| Model | MAE | MSE | RMSE | R² | MAPE |
|---|---|---|---|---|---|
| Extra Trees Regressor | 58.82 | 5831.39 | 76.36 | 0.9439 | 41.81% |
With a target variance of roughly 103,300, these results reflect genuine generalization capability. The roughly fivefold increase in RMSE over Version 1 isn't failure; it's methodological honesty. Real-world battery RUL prediction means deploying on completely unseen units, and Version 3's approach finally simulates this correctly.
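As a quick sanity check on that claim: R² = 1 - MSE / Var(y), so 1 - 5831.39 / 103,300 ≈ 0.944, which lines up with the reported 0.9439 (the small gap reflects the exact variance of the held-out targets).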
The MAPE of ~42% might seem high, but context is critical. Battery degradation exhibits high variability even within identical units due to manufacturing tolerances, thermal history, and usage patterns. Academic literature often reports overly optimistic results due to similar leakage issues that Version 3 explicitly addresses. The consistent R² of 0.94+ across proper validation suggests the model captures fundamental degradation physics, not just memorized patterns.
The GroupKFold validation strategy ensures that during cross-validation, entire batteries are held out—preventing the model from learning battery-specific artifacts. This mirrors the final train-test split and provides more realistic performance estimates during hyperparameter tuning. The preprocessing pipeline's systematic approach—removing temporal features like Cycle_Index while preserving electrochemical signals—creates a foundation for genuine degradation modeling rather than time-series memorization.
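Continuing the sketch above, the whole validation strategy fits in a few lines, with ExtraTreesRegressor standing in for PyCaret's selected model:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GroupKFold, cross_val_score

df = pd.read_csv("hnei_cycles.csv")  # placeholder path, as above
train_df = df[~df["Battery_ID"].isin([13, 14])]
X = train_df.drop(columns=["RUL", "Battery_ID", "Cycle_Index"])
y, groups = train_df["RUL"], train_df["Battery_ID"]

# Every CV fold holds out whole batteries, mirroring the final test split.
scores = cross_val_score(
    ExtraTreesRegressor(n_estimators=300, random_state=42),
    X, y, groups=groups, cv=GroupKFold(n_splits=5),
    scoring="neg_root_mean_squared_error",
)
print("per-fold RMSE:", -scores, "| mean:", -np.mean(scores))
```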
Version 3 establishes the methodological baseline for trustworthy battery RUL prediction. Future work should explore temporal validation with time-based splits, heterogeneous dataset validation across manufacturers, uncertainty quantification beyond point estimates, and sequential modeling with LSTM/Transformer architectures. The journey from Version 1's impressive-but-hollow metrics to Version 3's honest-but-actionable results exemplifies the difference between research theater and production readiness—sometimes, the best progress looks like taking a step backward.
This week, I'll be tackling the End-of-Life (EOL) split challenge—a critical decision that could fundamentally reshape our modeling approach. The question: should we train separate models for different battery life stages, or maintain the unified approach that Version 3 established?
The physics argues for splitting. Early-stage degradation follows quasi-linear capacity fade patterns that tree-based models handle well, while end-of-life behavior becomes chaotic—dominated by cascading failures like lithium plating and SEI breakdown. Our Version 2 observation that "prediction error spikes when RUL is low" suggests these aren't just different data points, but fundamentally different phenomena.
Yet the pragmatics argue against it. EOL data is already scarce and high-variance. Further splitting risks creating unreliable test sets, and real-world deployment will encounter the full RUL spectrum. Where would we even draw the line—50 cycles? 100? The boundary becomes arbitrary.
My current plan: implement a hybrid stratification approach. Keep the battery-wise split that Version 3 established, but analyze performance separately across RUL ranges. This preserves methodological rigor while revealing whether the model truly understands degradation physics or just averages across life stages.
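Here's a sketch of that stratified report, under the same placeholder assumptions as earlier; the 50- and 100-cycle bin edges deliberately echo the candidate EOL cutoffs:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

df = pd.read_csv("hnei_cycles.csv")  # placeholder path, as above
test_mask = df["Battery_ID"].isin([13, 14])
drop = ["RUL", "Battery_ID", "Cycle_Index"]
model = ExtraTreesRegressor(n_estimators=300, random_state=42).fit(
    df[~test_mask].drop(columns=drop), df.loc[~test_mask, "RUL"])
y_test = df.loc[test_mask, "RUL"]
y_pred = model.predict(df[test_mask].drop(columns=drop))

# One battery-wise model, errors broken out by RUL range.
report = pd.DataFrame({"rul": y_test, "abs_err": np.abs(y_test - y_pred)})
bins, labels = [0, 50, 100, 250, np.inf], ["0-50", "50-100", "100-250", "250+"]
report["stage"] = pd.cut(report["rul"], bins=bins, labels=labels)
print(report.groupby("stage", observed=True)["abs_err"].agg(["mean", "count"]))
```

If the per-stage errors diverge sharply, that's the empirical case for stage-specific models; if they don't, the unified approach stands.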
Beyond EOL splits, the broader modeling architecture question looms: traditional ML vs. deep learning? Interpretable tree ensembles vs. sequence-aware LSTMs? The answer will likely involve ensemble learning—because when predicting something as complex as battery failure, perhaps five models can collectively converge on truth where one cannot.
Stay tuned for more updates. And if you've ever felt overwhelmed by a pile of half-working notebooks scattered across multiple "final" versions—rest assured, you're not alone.