To kick things off, the Hawaii Natural Energy Institute (HNEI) dataset was chosen as the starting point for model development. Compared to the NASA Battery Aging Dataset, HNEI offers a simpler, more structured entry point. It provides features for each cycle of 14 LG Chem 18650 lithium-ion batteries.
These per-cycle features range from a simple cycle counter (Cycle_Index) to electrochemical signals derived from the charge and discharge curves, all recorded in tabular form.
The structured nature of this dataset makes it exceptionally well-suited for traditional machine learning models (see Traditional ML). These models excel with tabular, cycle-level data, enabling rapid experimentation with various feature combinations and modeling strategies.
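As a minimal sketch of what that looks like in practice (the file name is a placeholder, and the column names follow the fields referenced later in this post), the cycle-level table loads directly into a dataframe:

```python
import pandas as pd

# Placeholder path; the dataset is one row per cycle, per battery.
df = pd.read_csv("hnei_cycles.csv")

# Tabular, cycle-level data: engineered per-cycle features plus the RUL target.
print(df.shape)
print(df.columns.tolist())
```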
Working with the HNEI dataset at this initial stage serves two critical purposes: it establishes a baseline modeling workflow on clean, tabular data, and it surfaces data-handling pitfalls, such as leakage, before they can contaminate work on more complex datasets.
I've structured the project into several distinct experimental versions. This approach has been instrumental in surfacing meaningful insights about the data, the models, and the requirements for making robust Remaining Useful Life (RUL) predictions.
The first version served primarily as a diagnostic tool—a rapid, exploratory sweep to confirm whether machine learning could effectively capture degradation patterns within the HNEI dataset. A comprehensive data inspection was conducted, confirming no missing values or duplicates. Early correlation analysis identified relationships between input features and the RUL target.
Crucially, this phase highlighted a significant insight: certain columns, particularly the cycle index, were highly correlated with the target variable and introduce severe data leakage if included, as the results below show.
| Model | MAE | MSE | RMSE | R² | MAPE |
|---|---|---|---|---|---|
| Extra Trees Regressor | 5.39 | 255.53 | 15.41 | 0.9976 | 7.22% |
| Ensemble of Top Models | 6.69 | 261.10 | 15.61 | 0.9975 | 8.28% |
These results are unrealistically strong—likely due to data leakage. Without a proper train-test split, the model was evaluated on data it had already seen during training, artificially inflating performance. While the features may contain useful signals, true predictive power can only be validated with rigorous holdout testing.
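A minimal sketch of both problems, reusing the loading snippet above and substituting scikit-learn for PyCaret (the path and column names remain placeholders):

```python
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import r2_score

df = pd.read_csv("hnei_cycles.csv")  # placeholder path, as above

# Leak 1: Cycle_Index counts up while RUL counts down, so their correlation
# is close to -1; a model given Cycle_Index can nearly read the answer off it.
print(df["Cycle_Index"].corr(df["RUL"]))

# Leak 2: scoring on data the model has already seen inflates every metric.
X = df.select_dtypes("number").drop(columns=["RUL"])
model = ExtraTreesRegressor(n_estimators=100, random_state=42).fit(X, df["RUL"])
print("R² on training data:", r2_score(df["RUL"], model.predict(X)))  # ~1.0
```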
Version 2 marked a significant shift in how the data was handled. Instead of randomly sampling data points, entire batteries were withheld from the training set. This battery-wise split more accurately simulates a real-world scenario: deploying a model on a battery it has never seen before.
Consequently, performance dropped—precisely as expected. This revealed that earlier results were likely inflated by leakage between train and test data originating from the same battery, allowing the model to memorize rather than generalize.
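In code, a battery-wise split is only a few lines; here is a sketch with scikit-learn's GroupShuffleSplit, assuming the same placeholder columns as above:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.read_csv("hnei_cycles.csv")  # placeholder path, as above

# Split by battery, not by row: every cycle of a given unit lands entirely
# in train or entirely in test, so nothing from a test battery is ever seen.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df["Battery_ID"]))
train, test = df.iloc[train_idx], df.iloc[test_idx]
print("train batteries:", sorted(train["Battery_ID"].unique()))
print("test batteries: ", sorted(test["Battery_ID"].unique()))
```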
| Condition | MAE | MSE | RMSE | R² | MAPE |
|---|---|---|---|---|---|
| With Outliers | 8.8735 | 387.4732 | 19.4128 | 0.9962 | 12.22% |
| Without Battery_ID | 26.9881 | 1343.7324 | 36.6570 | 0.9872 | 44.37% |
| With Battery_ID | 6.6877 | 261.0975 | 15.6079 | 0.9975 | 8.28% |
While the Extra Trees Regressor (selected via PyCaret's compare_models()) achieved strong battery-wise generalization (RMSE = 35.97, R² = 0.9876), further validation is critical: time-based splits should test temporal robustness, and leave-multiple-battery-out evaluation could reveal dependency on specific units. Though performance aligns with the literature on physics-informed models (Zhang et al., 2020), PyCaret's automated feature selection warrants inspection, since a few key degradation features may dominate. Future work should compare against sequential models (e.g., LSTMs) and validate on heterogeneous battery datasets to assess industrial applicability.
The consistent inverse relationship between prediction error and RUL, with higher errors near end-of-life (EOL), reflects fundamental degradation physics: early-stage capacity fade often follows quasi-linear trends that PyCaret's top-performing tree models capture well, while EOL behavior becomes nonlinear due to cascading failures such as lithium plating and SEI layer breakdown (J. Electrochem. Soc., 2019). This pattern aligns with observations from the NASA PCoE dataset, where prediction uncertainty spikes below 50 cycles of RUL. While PyCaret's compare_models() efficiently surfaces this trend via aggregated metrics, targeted mitigation, such as EOL-focused ensemble weights or survival analysis techniques (Energy Storage, 2022), could further improve actionable warning times.
Version 3 represents a methodological maturation—moving from ad-hoc experimentation to rigorous ML engineering practices. The key insight: proper validation requires proper structure, both in data splitting and model evaluation.
Building on Version 2's battery-wise splitting revelation, Version 3 introduces leave-multiple-batteries-out validation with systematic preprocessing pipelines. Instead of arbitrary single-battery holdouts, this version implements a more robust approach by leaving out batteries 13 and 14 for testing.
More critically, Version 3 eliminates the battery_id leakage trap that plagued earlier versions. While Version 2 showed that including battery_id as a feature boosted performance (RMSE: 15.61), Version 3 properly excludes it from model training while preserving it for GroupKFold validation—critical for real-world deployment scenarios.
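A sketch of that arrangement, under the same assumed column names: Battery_ID steers the split and the CV groups but never enters the feature matrix, and Cycle_Index is dropped outright.

```python
import pandas as pd

df = pd.read_csv("hnei_cycles.csv")  # placeholder path, as above

# Final holdout: batteries 13 and 14 never appear during training or tuning.
test_mask = df["Battery_ID"].isin([13, 14])
train_df, test_df = df[~test_mask], df[test_mask]

# Battery_ID survives only as a grouping key; Cycle_Index is removed to
# close the leakage path identified back in Version 1.
drop_cols = ["RUL", "Battery_ID", "Cycle_Index"]
X_train, y_train = train_df.drop(columns=drop_cols), train_df["RUL"]
groups = train_df["Battery_ID"]  # handed to GroupKFold, not to the model
```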
| Model | MAE | MSE | RMSE | R² | MAPE |
|---|---|---|---|---|---|
| Extra Trees Regressor | 58.82 | 5831.39 | 76.36 | 0.9439 | 41.81% |
With a target variance of roughly 103,300, these results reflect genuine generalization capability. The roughly fivefold increase in RMSE over Version 1 isn't failure; it's methodological honesty. Real-world battery RUL prediction means deploying on completely unseen units, and Version 3's approach finally simulates this correctly.
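As a quick sanity check on that claim: R² = 1 - MSE / Var(y), so 1 - 5831.39 / 103,300 ≈ 0.944, which lines up with the reported 0.9439 (the small gap reflects the exact variance of the held-out targets).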
The MAPE of ~42% might seem high, but context is critical. Battery degradation exhibits high variability even within identical units due to manufacturing tolerances, thermal history, and usage patterns. Academic literature often reports overly optimistic results due to similar leakage issues that Version 3 explicitly addresses. The consistent R² of 0.94+ across proper validation suggests the model captures fundamental degradation physics, not just memorized patterns.
The GroupKFold validation strategy ensures that during cross-validation, entire batteries are held out—preventing the model from learning battery-specific artifacts. This mirrors the final train-test split and provides more realistic performance estimates during hyperparameter tuning. The preprocessing pipeline's systematic approach—removing temporal features like Cycle_Index while preserving electrochemical signals—creates a foundation for genuine degradation modeling rather than time-series memorization.
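Continuing the sketch above, the whole validation strategy fits in a few lines, with ExtraTreesRegressor standing in for PyCaret's selected model:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GroupKFold, cross_val_score

df = pd.read_csv("hnei_cycles.csv")  # placeholder path, as above
train_df = df[~df["Battery_ID"].isin([13, 14])]
X = train_df.drop(columns=["RUL", "Battery_ID", "Cycle_Index"])
y, groups = train_df["RUL"], train_df["Battery_ID"]

# Every CV fold holds out whole batteries, mirroring the final test split.
scores = cross_val_score(
    ExtraTreesRegressor(n_estimators=300, random_state=42),
    X, y, groups=groups, cv=GroupKFold(n_splits=5),
    scoring="neg_root_mean_squared_error",
)
print("per-fold RMSE:", -scores, "| mean:", -np.mean(scores))
```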
Version 3 establishes the methodological baseline for trustworthy battery RUL prediction. Future work should explore temporal validation with time-based splits, heterogeneous dataset validation across manufacturers, uncertainty quantification beyond point estimates, and sequential modeling with LSTM/Transformer architectures. The journey from Version 1's impressive-but-hollow metrics to Version 3's honest-but-actionable results exemplifies the difference between research theater and production readiness—sometimes, the best progress looks like taking a step backward.
This week, I'll be tackling the End-of-Life (EOL) split challenge—a critical decision that could fundamentally reshape our modeling approach. The question: should we train separate models for different battery life stages, or maintain the unified approach that Version 3 established?
The physics argues for splitting. Early-stage degradation follows quasi-linear capacity fade patterns that tree-based models handle well, while end-of-life behavior becomes chaotic—dominated by cascading failures like lithium plating and SEI breakdown. Our Version 2 observation that "prediction error spikes when RUL is low" suggests these aren't just different data points, but fundamentally different phenomena.
Yet the pragmatics argue against it. EOL data is already scarce and high-variance. Further splitting risks creating unreliable test sets, and real-world deployment will encounter the full RUL spectrum. Where would we even draw the line—50 cycles? 100? The boundary becomes arbitrary.
My current plan: implement a hybrid stratification approach. Keep the battery-wise split that Version 3 established, but analyze performance separately across RUL ranges. This preserves methodological rigor while revealing whether the model truly understands degradation physics or just averages across life stages.
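Here's a sketch of that stratified report, under the same placeholder assumptions as earlier; the 50- and 100-cycle bin edges deliberately echo the candidate EOL cutoffs:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

df = pd.read_csv("hnei_cycles.csv")  # placeholder path, as above
test_mask = df["Battery_ID"].isin([13, 14])
drop = ["RUL", "Battery_ID", "Cycle_Index"]
model = ExtraTreesRegressor(n_estimators=300, random_state=42).fit(
    df[~test_mask].drop(columns=drop), df.loc[~test_mask, "RUL"])
y_test = df.loc[test_mask, "RUL"]
y_pred = model.predict(df[test_mask].drop(columns=drop))

# One battery-wise model, errors broken out by RUL range.
report = pd.DataFrame({"rul": y_test, "abs_err": np.abs(y_test - y_pred)})
bins, labels = [0, 50, 100, 250, np.inf], ["0-50", "50-100", "100-250", "250+"]
report["stage"] = pd.cut(report["rul"], bins=bins, labels=labels)
print(report.groupby("stage", observed=True)["abs_err"].agg(["mean", "count"]))
```

If the per-stage errors diverge sharply, that's the empirical case for stage-specific models; if they don't, the unified approach stands.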
Beyond EOL splits, the broader modeling architecture question looms: traditional ML vs. deep learning? Interpretable tree ensembles vs. sequence-aware LSTMs? The answer will likely involve ensemble learning—because when predicting something as complex as battery failure, perhaps five models can collectively converge on truth where one cannot.
Stay tuned for more updates. And if you've ever felt overwhelmed by a pile of half-working notebooks scattered across multiple "final" versions—rest assured, you're not alone.