Navigating the Non-Linear Path of Battery Life Prediction

When I started this project, the core goal was straightforward: quantify how much a battery's lifespan can be extended by reducing thermal stress through a supercapacitor-hybrid system. That fundamental objective still guides my work, but as any data scientist or engineer can attest, real progress rarely follows a straight line. It often emerges from a landscape of messy, poorly documented notebooks, abandoned scripts, and perplexing outliers that challenge every assumption.

Establishing a Foundation: The HNEI Dataset

To kick things off, I chose the Hawaii Natural Energy Institute (HNEI) dataset as the starting point for model development. Compared to the NASA Battery Aging Dataset, HNEI offers a simpler, more structured entry point, providing per-cycle features for 14 LG Chem 18650 lithium-ion batteries.

These features include the cycle index, a battery identifier, and per-cycle electrochemical measurements captured during charge and discharge.

The structured nature of this dataset makes it exceptionally well-suited for traditional machine learning models (see Traditional ML). These models excel with tabular, cycle-level data, enabling rapid experimentation with various feature combinations and modeling strategies.

Insight: Machine Learning and Deep Learning

Working with the HNEI dataset at this initial stage serves two critical purposes:

  1. Baseline Performance Benchmarking: Building models on a clean, standardized dataset allows for precise quantification of model quality using interpretable metrics such as Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and R² (a minimal computation sketch follows this list). These benchmarks will be crucial comparison points when progressing to more complex datasets, like NASA’s, which will demand significant additional preprocessing and sequence modeling.
  2. Controlled Environment for Model Exploration: The HNEI dataset provides a controlled setting to investigate model behavior and sensitivity. For instance, it allows for analysis of how performance changes with different train/test splitting strategies (e.g., by battery, cycle, or end-of-life). It also supports feature importance analysis, outlier detection, and error analysis with minimal preprocessing overhead. These insights are vital for building an intuitive understanding of what works (and what breaks) before transitioning to more sophisticated approaches like LSTMs or transformer-based networks.
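
To make these metrics concrete, here is a minimal sketch of how they can be computed with scikit-learn. The arrays are illustrative stand-ins for actual and predicted RUL values, not results from this project.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative placeholder values; in practice these come from a trained model.
y_true = np.array([400.0, 250.0, 120.0, 60.0, 10.0])  # actual RUL (cycles)
y_pred = np.array([385.0, 270.0, 100.0, 75.0, 30.0])  # predicted RUL (cycles)

mae = mean_absolute_error(y_true, y_pred)           # average absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large misses
r2 = r2_score(y_true, y_pred)                       # share of variance explained

print(f"MAE={mae:.2f} cycles, RMSE={rmse:.2f} cycles, R²={r2:.4f}")
```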

Insights: Key Metrics for Linear Regression

The Versions So Far—and What They've Taught Me

I've structured the project into several distinct experimental versions. This approach has been instrumental in surfacing meaningful insights about the data, the models, and the requirements for making robust Remaining Useful Life (RUL) predictions.

Version 1: Data Exploration and First Model Run [Notebook]

The first version served primarily as a diagnostic tool—a rapid, exploratory sweep to confirm whether machine learning could effectively capture degradation patterns within the HNEI dataset. A comprehensive data inspection was conducted, confirming no missing values or duplicates. Early correlation analysis identified relationships between input features and the RUL target.
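
As a rough illustration of that diagnostic sweep, the sketch below loads the cycle-level data and checks integrity and correlations. The file name hnei_cycles.csv and the RUL column name are assumptions for illustration, not the notebook's exact schema.

```python
import pandas as pd

# Hypothetical file name; the HNEI cycle-level data is assumed to be a flat CSV.
df = pd.read_csv("hnei_cycles.csv")

# Integrity checks reported in Version 1: no missing values or duplicates.
print(df.isna().sum())        # per-column missing-value counts
print(df.duplicated().sum())  # count of fully duplicated rows

# Correlation of each numeric feature with the RUL target; near-perfect
# correlations (e.g., the cycle index) are a red flag for leakage.
print(df.corr(numeric_only=True)["RUL"].sort_values(ascending=False))
```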

Crucially, this phase highlighted a significant insight: certain columns, particularly the cycle index, were so highly correlated with the target variable that including them introduces severe data leakage, as the results below show.

Key Findings:

| Model | MAE | MSE | RMSE | R² | MAPE |
| --- | --- | --- | --- | --- | --- |
| Extra Trees Regressor | 5.39 | 255.53 | 15.41 | 0.9976 | 7.22% |
| Ensemble of Top Models | 6.69 | 261.10 | 15.61 | 0.9975 | 8.28% |

These results are unrealistically strong—likely due to data leakage. Without a proper train-test split, the model was evaluated on data it had already seen during training, artificially inflating performance. While the features may contain useful signals, true predictive power can only be validated with rigorous holdout testing.

Version 2: Battery Split Validation [Notebook]

Version 2 marked a significant shift in data strategy. Instead of randomly sampling data points, entire batteries were withheld from the training set. This battery-wise split more accurately simulates a real-world scenario: deploying a model on a battery it has never seen before.
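
One way to implement such a battery-wise split is scikit-learn's GroupShuffleSplit. This is a sketch under the same assumed df frame and Battery_ID column as above, not necessarily the notebook's exact code.

```python
from sklearn.model_selection import GroupShuffleSplit

# Withhold entire batteries so the test set simulates never-before-seen units.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df["Battery_ID"]))
train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]

# Sanity check: no battery appears on both sides of the split.
assert set(train_df["Battery_ID"]).isdisjoint(set(test_df["Battery_ID"]))
```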

Consequently, performance dropped—precisely as expected. This revealed that earlier results were likely inflated by leakage between train and test data originating from the same battery, allowing the model to memorize rather than generalize.

| Condition | MAE | MSE | RMSE | R² | MAPE |
| --- | --- | --- | --- | --- | --- |
| With Outliers | 8.8735 | 387.4732 | 19.4128 | 0.9962 | 0.1222 |
| Without Battery_ID | 26.9881 | 1343.7324 | 36.6570 | 0.9872 | 0.4437 |
| With Battery_ID Feature | 6.6877 | 261.0975 | 15.6079 | 0.9975 | 0.0828 |

Key Observations:

While the Extra Trees Regressor (selected via PyCaret's compare_models()) achieved strong battery-wise generalization (RMSE = 35.97, R² = 0.9876), further validation is critical: time-based splits should test temporal robustness, and leave-multiple-battery-out evaluation could reveal dependency on specific units. Though performance aligns with literature on physics-informed models (Zhang et al., 2020), PyCaret's automated feature selection warrants inspection—key degradation features may dominate. Future work should compare against sequential models (e.g., LSTMs) and validate on heterogeneous battery datasets to assess industrial applicability.

The consistent inverse relationship between prediction error and RUL—higher errors near end-of-life (EOL)—reflects fundamental degradation physics: early-stage capacity fade often follows quasi-linear trends (well-captured by PyCaret’s top-performing tree models), while EOL behavior becomes nonlinear due to cascading failures such as lithium plating and SEI layer breakdown (J. Electrochem. Soc., 2019). This pattern aligns with observations from the NASA PCoE Dataset, where prediction uncertainty spikes below 50 cycles RUL. While PyCaret’s compare_models() efficiently surfaces this trend via aggregated metrics, targeted mitigation—such as EOL-focused ensemble weights or survival analysis techniques (Energy Storage, 2022)—could further improve actionable warning times.

Version 3: Finally Some Structure [Notebook]

Version 3 represents a methodological maturation—moving from ad-hoc experimentation to rigorous ML engineering practices. The key insight: proper validation requires proper structure, both in data splitting and model evaluation.

The Structural Revolution

Building on Version 2's battery-wise splitting revelation, Version 3 introduces leave-multiple-batteries-out validation with systematic preprocessing pipelines. Instead of arbitrary single-battery holdouts, this version implements a more robust approach by leaving out batteries 13 and 14 for testing.
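
In pandas terms this split is a simple filter on the battery identifier; a sketch, again assuming the df frame and Battery_ID column from earlier:

```python
# Fixed leave-multiple-batteries-out split: batteries 13 and 14 form the test set.
test_ids = [13, 14]
test_df = df[df["Battery_ID"].isin(test_ids)]
train_df = df[~df["Battery_ID"].isin(test_ids)]
```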

More critically, Version 3 eliminates the battery_id leakage trap that plagued earlier versions. While Version 2 showed that including battery_id as a feature boosted performance (RMSE: 15.61), Version 3 properly excludes it from model training while preserving it for GroupKFold validation—critical for real-world deployment scenarios.
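
A minimal sketch of what this configuration could look like with PyCaret's regression API, assuming the train_df split above; the exact parameters used in the notebook may differ, and option names can vary across PyCaret versions.

```python
from pycaret.regression import setup, compare_models

exp = setup(
    data=train_df,
    target="RUL",
    fold_strategy="groupkfold",      # group folds by battery to prevent leakage
    fold_groups="Battery_ID",        # grouping key for cross-validation
    ignore_features=["Battery_ID"],  # excluded from training, kept for grouping
    session_id=42,
)

best = compare_models(sort="RMSE")  # Extra Trees surfaced as the top model here
```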

| Model | MAE | MSE | RMSE | R² | MAPE |
| --- | --- | --- | --- | --- | --- |
| Extra Trees Regressor | 58.82 | 5831.39 | 76.36 | 0.9439 | 0.4181 |

The Leakage Lesson:

With target variance of roughly 103,300, these results reflect genuine generalization capability. The roughly fivefold increase in RMSE relative to Version 1 isn't failure; it's methodological honesty. Real-world battery RUL prediction means deploying on completely unseen units, and Version 3's approach finally simulates this correctly.

The MAPE of ~42% might seem high, but context is critical. Battery degradation exhibits high variability even within identical units due to manufacturing tolerances, thermal history, and usage patterns. Academic literature often reports overly optimistic results due to similar leakage issues that Version 3 explicitly addresses. The consistent R² of 0.94+ across proper validation suggests the model captures fundamental degradation physics, not just memorized patterns.

The GroupKFold validation strategy ensures that during cross-validation, entire batteries are held out—preventing the model from learning battery-specific artifacts. This mirrors the final train-test split and provides more realistic performance estimates during hyperparameter tuning. The preprocessing pipeline's systematic approach—removing temporal features like Cycle_Index while preserving electrochemical signals—creates a foundation for genuine degradation modeling rather than time-series memorization.
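
Expressed directly in scikit-learn, the same idea looks roughly like this: drop the identifier and temporal columns from the feature matrix, but keep Battery_ID as the grouping key for GroupKFold. Column names follow the assumptions above.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GroupKFold, cross_val_score

# Assumed column names: RUL (target), Battery_ID (unit), Cycle_Index (temporal).
X = train_df.drop(columns=["RUL", "Battery_ID", "Cycle_Index"])
y = train_df["RUL"]
groups = train_df["Battery_ID"]  # used only for grouping, never as a feature

cv = GroupKFold(n_splits=5)  # every validation fold holds out whole batteries
scores = cross_val_score(
    ExtraTreesRegressor(n_estimators=300, random_state=42),
    X, y, groups=groups, cv=cv, scoring="neg_root_mean_squared_error",
)
print(f"CV RMSE: {-scores.mean():.2f} ± {scores.std():.2f}")
```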

Version 3 establishes the methodological baseline for trustworthy battery RUL prediction. Future work should explore temporal validation with time-based splits, heterogeneous dataset validation across manufacturers, uncertainty quantification beyond point estimates, and sequential modeling with LSTM/Transformer architectures. The journey from Version 1's impressive-but-hollow metrics to Version 3's honest-but-actionable results exemplifies the difference between research theater and production readiness—sometimes, the best progress looks like taking a step backward.

Up Next

This week, I'll be tackling the End-of-Life (EOL) split challenge—a critical decision that could fundamentally reshape our modeling approach. The question: should we train separate models for different battery life stages, or maintain the unified approach that Version 3 established?

The physics argues for splitting. Early-stage degradation follows quasi-linear capacity fade patterns that tree-based models handle well, while end-of-life behavior becomes chaotic—dominated by cascading failures like lithium plating and SEI breakdown. Our Version 2 observation that "prediction error spikes when RUL is low" suggests these aren't just different data points, but fundamentally different phenomena.

Yet the pragmatics argue against it. EOL data is already scarce and high-variance. Further splitting risks creating unreliable test sets, and real-world deployment will encounter the full RUL spectrum. Where would we even draw the line—50 cycles? 100? The boundary becomes arbitrary.

My current plan: implement a hybrid stratification approach. Keep the battery-wise split that Version 3 established, but analyze performance separately across RUL ranges. This preserves methodological rigor while revealing whether the model truly understands degradation physics or just averages across life stages.
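
A sketch of what that stratified error analysis could look like: fit one model on the battery-wise training split, then report MAE separately per RUL band. The band edges, model settings, and column names are illustrative assumptions, not a finalized design.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_absolute_error

# Same assumed columns and battery-wise train/test split as above.
X = train_df.drop(columns=["RUL", "Battery_ID", "Cycle_Index"])
y = train_df["RUL"]
model = ExtraTreesRegressor(n_estimators=300, random_state=42).fit(X, y)

report = test_df.assign(pred=model.predict(test_df[X.columns]))
bands = pd.cut(
    report["RUL"],
    bins=[0, 50, 100, 300, np.inf],  # illustrative band edges
    labels=["EOL (≤50)", "50-100", "100-300", "300+"],
)

# One error figure per life stage instead of a single aggregate metric.
for band, grp in report.groupby(bands, observed=True):
    mae = mean_absolute_error(grp["RUL"], grp["pred"])
    print(f"{band}: MAE = {mae:.1f} cycles (n = {len(grp)})")
```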

Beyond EOL splits, the broader modeling architecture question looms: traditional ML vs. deep learning? Interpretable tree ensembles vs. sequence-aware LSTMs? The answer will likely involve ensemble learning—because when predicting something as complex as battery failure, perhaps five models can collectively converge on truth where one cannot.

Stay tuned for more updates. And if you've ever felt overwhelmed by a pile of half-working notebooks scattered across multiple "final" versions—rest assured, you're not alone.