Predictive Maintenance Even Without Complete Damage Documentation

Written by Michael Damatov | Jun 29, 2026 2:15:46 PM

A truck breaks down in the middle of the night. The cargo spoils, the customer backs out, and the repair shop charges double for the express estimate. And then, days later, someone realizes: The engine temperature had already been fluctuating unusually for three weeks. The data was there. No one had looked at it.

That is precisely the promise of predictive maintenance: to warn of breakdowns before they happen. The promise is real. The disillusionment sets in as soon as you look at your own database and realize that nowhere is it clearly documented which component failed, when, and how.

Many projects fail right here, before they even begin. This article shows that it doesn’t have to be that way. Using the approach of weakly supervised machine learning, it’s possible to build a robust prediction system, even without a perfect database. It’s not an easy path. But it exists.

What is predictive maintenance, and why does it require “learning”?

Today, maintenance essentially takes two forms: you wait until something breaks (reactive), or you replace parts according to a fixed schedule—for example, every 10,000 kilometers (preventive). The former is expensive when things go wrong. The latter is expensive because it’s always done too early.

Predictive maintenance breaks free from this dilemma. The idea: Sensor data collected during operation—temperatures, pressures, operating times, fuel consumption—is continuously analyzed to predict when a component is likely to fail. The part is replaced exactly when it’s needed.

For this to work, a model must learn from historical data which patterns precede a failure. Experts call this approach “supervised machine learning”: The model learns from examples where the outcome is known—just as an experienced workshop manager develops a sense, based on hundreds of similar cases, of which noises are harmless and which are not.

The problem: Supervised machine learning requires labeled data—datasets where, for each measurement period, it is known whether damage occurred afterward or not. And it is precisely these labels that are missing. Not because the data wasn’t collected, but because no one has ever compiled it in this form.

Which measurements are even relevant

Before a model can be trained, there is essentially a detective-like question: Which measurements—known as “features” in technical terms—actually reveal something about the condition of a component? Not every sensor that provides data also provides useful data.

Typical features in the field of vehicle or machine fleets:

  • Engine temperature: If it remains consistently above a certain threshold, this indicates overheating or cooling problems.
  • Fuel consumption: A sudden increase may indicate mechanical problems or leaks.
  • Brake pressure trends: Irregularities indicate wear or defects in the brake system.
  • Idling duration: Excessive idling puts a strain on the engine and accelerates wear and tear.
  • Shifting patterns: Abnormalities in shifting behavior can be early signs of transmission problems.

Often more important than the individual readings is how they change over time: Is the temperature rising? Does fuel consumption fluctuate more than in comparable vehicles? Such derived metrics provide a more precise diagnostic picture than any single snapshot.
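To make "derived metrics" concrete, here is a minimal sketch of two of them: the trend of a reading over time and the fluctuation of a vehicle relative to comparable vehicles. The function names and the sample values are hypothetical and serve only as an illustration, not as a prescribed feature set.

```python
from statistics import mean, pstdev


def trend_slope(values):
    """Least-squares slope of equally spaced readings (units per time step).

    Needs at least two readings.
    """
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = mean(values)
    numerator = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


def relative_fluctuation(own_values, peer_series):
    """Own standard deviation divided by the average standard deviation of peers."""
    peer_std = mean([pstdev(series) for series in peer_series])
    return pstdev(own_values) / peer_std


# Hypothetical daily averages, invented for this example.
engine_temp = [88.0, 89.0, 90.0, 91.0, 92.0]          # degrees Celsius
own_fuel = [30.0, 34.0, 30.0, 34.0]                   # litres per 100 km
peer_fuel = [[88.0, 89.0, 88.0, 89.0], [90.0, 91.0, 90.0, 91.0]]

slope = trend_slope(engine_temp)                      # rising by 1 degree per day
ratio = relative_fluctuation(own_fuel, peer_fuel)     # 4 times the peer fluctuation
```

A positive slope answers "Is the temperature rising?", and a ratio well above 1 answers "Does consumption fluctuate more than in comparable vehicles?".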

This selection process is not a mere technical exercise. It requires people who understand what is actually happening inside a machine and who have learned to distinguish between signal and noise.

The real challenge: Lack of failure data

A model designed to predict failures must learn from past failures. That sounds obvious, yet in practice it is the biggest obstacle. This is because very few companies actually have the answer to the question, “When did what fail?”:

  • Failure reports are recorded incompletely or not at all.
  • The cause of a failure is not documented, only the replacement.
  • Minor signs of wear and tear don’t show up anywhere.
  • Different repair shops document things differently—or not at all. Often, this documentation is simply not accessible.

Why label quality trumps model quality

A model uncritically and completely adopts the truth of its training data. If the dataset says “no damage” even though damage had long since occurred—only unnoticed or undocumented—the model learns exactly that: look the other way. In the worst-case scenario, this results in a system that is wrong with a high degree of confidence—and that is more dangerous than having no system at all.

Unreliable predictions create false confidence. And false confidence in a maintenance system can end up costing more than the failure it was meant to prevent.

Weak labels: Not ideal, but a viable approach

When direct failure labels are missing, proxy indicators can help—events that, while not failures themselves, often correlate with one:

  • Unplanned stops: A vehicle that stops outside its usual routes or at unusual times may have experienced a breakdown, even if it was never logged.
  • Temperature outliers: Extreme values far outside the normal operating range indicate that something is wrong.
  • Fuel anomalies: Unusual fuel consumption compared to similar trips is rarely without cause.

Additional strategies for generating so-called “weak labels”:

Rule-based

Known technical limits are used as thresholds. Example: “If the oil temperature remains above 120 °C for more than 15 minutes, this is considered a critical event.” These rules are not precise failure labels—but they are more honest than an empty data set.
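The example rule can be written down almost literally. The following sketch assumes time-ordered readings as (minute, temperature) pairs. The function name and the data layout are hypothetical, while the 120 °C and 15 minute values come from the rule above.

```python
def exceeds_for_duration(samples, limit=120.0, min_minutes=15.0):
    """Return True if readings stay above `limit` for more than `min_minutes`.

    `samples` is a time-ordered list of (minute, oil_temp_celsius) pairs.
    """
    run_start = None
    for minute, temp in samples:
        if temp > limit:
            if run_start is None:
                run_start = minute          # a new run above the limit begins
            if minute - run_start > min_minutes:
                return True
        else:
            run_start = None                # the run is interrupted
    return False


# Hypothetical traces with one reading per minute.
long_run = [(m, 125.0) for m in range(20)]
short_run = [(m, 125.0 if m < 10 else 110.0) for m in range(20)]

critical_long = exceeds_for_duration(long_run)     # True
critical_short = exceeds_for_duration(short_run)   # False
```

The result is a weak label, not a failure label: it marks a time window as "critical" according to a rule that still has to prove itself against real failures.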

Usage intensity and estimated remaining service life

Some components wear out based on usage, not calendar time. An engine with 100,000 kilometers under full load is subjected to different stresses than one with the same mileage on the highway. By combining usage intensity and typical service life, we can estimate how long a component is likely to last—experts refer to this as the estimated remaining useful life (eRUL). Important: Labels generated in this way encode assumptions about typical wear curves—they do not replace observed failures but serve as a rough starting point that is refined with each iteration using real data. In Phase 3, the eRUL is then no longer used as a label but as the actual prediction target of the models.
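As a rough illustration of how usage intensity and typical service life can be combined, the sketch below weights each kilometer by a load factor and subtracts the result from an assumed typical life. The load factors, the typical life of 300,000 km, and all names are assumptions made up for this example. Real wear curves are rarely this linear.

```python
# Assumed wear multipliers per driving profile (illustrative values only).
LOAD_FACTORS = {"highway": 1.0, "full_load": 1.5}


def estimated_remaining_km(trips, typical_life_km, load_factors):
    """Subtract load-weighted kilometres from an assumed typical service life."""
    consumed = sum(km * load_factors[profile] for km, profile in trips)
    return max(typical_life_km - consumed, 0.0)


# Two engines with the same mileage but different usage.
highway_engine = [(100_000, "highway")]
loaded_engine = [(100_000, "full_load")]

erul_highway = estimated_remaining_km(highway_engine, 300_000, LOAD_FACTORS)
erul_loaded = estimated_remaining_km(loaded_engine, 300_000, LOAD_FACTORS)
```

With these assumed factors, the engine driven under full load ends up with a noticeably lower estimate than the highway engine despite identical mileage, which is exactly the distinction the label is supposed to encode.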

Anomaly detection before failure

After a known failure, it is often possible to identify systematic sensor patterns in the past that preceded it. This correlation can be used to label similar patterns in other vehicles. Caution is advised here: A pattern that preceded a failure in one vehicle is not necessarily generalizable to other vehicles. The labels generated in this way must therefore be validated—for example, by verifying whether the identified pattern actually occurred in other vehicles prior to a failure, rather than merely correlating by chance with a single incident.

None of these approaches is perfect. But together, they provide a foundation on which to build.

One structural limitation must be openly addressed: Historical data only captures failures that were actually noticed and documented. Catastrophic early failures, silent wear-and-tear processes, or vehicles that were taken out of service after sustaining severe damage leave no trace in the database—they are simply not there. The model learns from what was visible, not from what actually happened. This so-called survivorship bias problem cannot be completely solved, but it can be mitigated through the consistent involvement of domain experts: They are aware of the blind spots in the documentation and can identify failure patterns that do not appear in any workshop records.

The Three-Phase Learning Process

Since complete failure data is not available, the model is not trained all at once. Instead, a three-phase approach is followed that systematically builds upon the available knowledge.

Phase 1: Unsupervised Machine Learning – Finding patterns without knowing what to look for

In the first step, the sensor data is analyzed without any damage information—experts refer to this as unsupervised machine learning. The initial goal is modest: to understand which vehicles or components behave similarly—and which deviate noticeably.
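A minimal sketch of the "which deviate noticeably" part, assuming one summary value per vehicle: each vehicle is scored by its distance from the fleet median, measured in units of the median absolute deviation (MAD). This is one simple technique among many and not necessarily the one used in a real project. The vehicle IDs, readings, and the cutoff of 3.5 are hypothetical.

```python
from statistics import median


def robust_scores(value_by_vehicle):
    """Distance of each vehicle from the fleet median, in scaled MAD units."""
    values = list(value_by_vehicle.values())
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        # Simplification: with no spread at all, nothing is scored as deviating.
        return {vehicle: 0.0 for vehicle in value_by_vehicle}
    scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation
    return {vehicle: abs(v - center) / scale
            for vehicle, v in value_by_vehicle.items()}


# Hypothetical average engine temperatures per vehicle (degrees Celsius).
fleet = {"T1": 88.0, "T2": 90.0, "T3": 89.0, "T4": 91.0, "T5": 104.0}

scores = robust_scores(fleet)
deviating = sorted(v for v, s in scores.items() if s > 3.5)  # assumed cutoff
```

Median and MAD are used instead of mean and standard deviation because a single extreme vehicle would otherwise distort the very yardstick it is measured against.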

At the same time, the system looks for points of change: When did a sensor start behaving differently than before? A cooling system that delivered stable readings for weeks and then begins to fluctuate may indicate the onset of wear—even before anyone notices anything.
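The search for such a point of change can be sketched as follows: try every possible split of a series and keep the one where "two separate segments" describes the data best, using a Gaussian cost that reacts to changes in both level and fluctuation. This is a deliberately simple single-change-point search under stated assumptions. The function names and readings are hypothetical, and a production system would add a penalty so that noise alone does not count as a change.

```python
from math import log
from statistics import pvariance


def segment_cost(values):
    """Gaussian negative log-likelihood of one segment, up to constants."""
    return len(values) * log(pvariance(values) + 1e-9)


def find_change_point(values, min_segment=3):
    """Index at which a split lowers the total cost the most, or None."""
    best_index, best_cost = None, segment_cost(values)
    for i in range(min_segment, len(values) - min_segment + 1):
        cost = segment_cost(values[:i]) + segment_cost(values[i:])
        if cost < best_cost:
            best_index, best_cost = i, cost
    return best_index


# Hypothetical coolant temperatures: stable at first, then fluctuating.
readings = [90.0, 90.1, 89.9, 90.0, 90.1, 89.9,
            92.0, 88.0, 93.0, 87.0, 92.5, 87.5]

change_index = find_change_point(readings)  # 6: the first fluctuating reading
```

Note that the average temperature is the same before and after the change. Only the fluctuation differs, which is why a cost based on variance is used here rather than a comparison of means.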

This phase does not provide predictions. But it does provide something that is at least as valuable: an honest picture of what the data actually reveals.

Phase 2: Learning with weak labels and calibrating with experts

In the second phase, the anomalies identified in Phase 1 are combined with the weak labels. This results in an initial prediction model—still rough, still prone to errors, but capable of learning.
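One simple way to combine several weak signals into a single training label is to let each signal vote and to record the share of votes that flag a time window. The sketch below assumes three hypothetical signals (a Phase 1 anomaly flag, the oil temperature rule, and an unplanned stop). The field names are invented, and more elaborate schemes weight the signals by their estimated reliability.

```python
def soft_label(window, signals):
    """Share of available weak signals that flag the window, or None if none apply."""
    votes = [signal(window) for signal in signals]
    votes = [v for v in votes if v is not None]   # drop signals without an opinion
    if not votes:
        return None
    return sum(votes) / len(votes)


# Hypothetical weak signals: each returns True, False, or None (no opinion).
def phase1_anomaly(window):
    return window.get("anomaly_flag")


def oil_temp_rule(window):
    minutes = window.get("minutes_above_120c")
    return None if minutes is None else minutes > 15


def unplanned_stop(window):
    return window.get("unplanned_stop")


SIGNALS = [phase1_anomaly, oil_temp_rule, unplanned_stop]

window_a = {"anomaly_flag": True, "minutes_above_120c": 22, "unplanned_stop": False}
window_b = {"anomaly_flag": False}
window_c = {}

label_a = soft_label(window_a, SIGNALS)   # two of three signals fire
label_b = soft_label(window_b, SIGNALS)   # one signal, and it does not fire
label_c = soft_label(window_c, SIGNALS)   # no signal applies
```

A value between 0 and 1 keeps the uncertainty visible: a window flagged by one signal out of three is treated differently from one flagged by all of them.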

The decisive factor in this phase is not algorithms, but people. Domain experts are systematically involved—and their role goes far beyond occasionally approving results. Experienced technicians and fleet managers help ensure that the data is understood correctly in the first place: What does a specific sensor value mean in the actual operational context? Which thresholds make technical sense—and which ones sound plausible but don’t hold up in reality? Which model assumptions are sound—and which ones are hidden errors that will come back to haunt us later?

The model specifically selects the data points about which it is most uncertain and presents them to the experts: “Take a look at these vehicle trajectories—do you think there was actually a problem there?” The answers rarely turn out the way the model expects. Experts disagree, correct, and express doubts—and that’s productive. However, even experts have systematic blind spots: Experience shows that electrical defects tend to be underestimated compared to mechanical ones because they are less frequently observed directly. Certain failure patterns that are statistically significant may be so normalized in day-to-day operations that no one identifies them as a problem. Expert feedback is therefore a valuable corrective signal, but not an infallible one. It is treated as one input signal among many, not as a final verdict.
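Selecting the cases the model is most uncertain about is commonly called uncertainty sampling. A minimal sketch, assuming the model outputs a failure probability per vehicle: the cases closest to 0.5 go to the experts first. The vehicle IDs and probabilities are hypothetical.

```python
def most_uncertain(predictions, k=3):
    """Return the k ids whose predicted failure probability is closest to 0.5."""
    ranked = sorted(predictions, key=lambda item: abs(predictions[item] - 0.5))
    return ranked[:k]


# Hypothetical model outputs: probability that a failure followed.
predicted = {"T1": 0.02, "T2": 0.48, "T3": 0.91, "T4": 0.55, "T5": 0.30}

review_queue = most_uncertain(predicted, k=2)  # ['T2', 'T4']
```

Cases the model is already sure about (T1, T3) cost expert time without teaching the model much. The borderline cases are where a human answer changes the most.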

Phase 3: Prediction of the estimated remaining useful life (eRUL)

In the third phase, the question shifts: it is no longer just whether a failure is imminent, but when—that is, how long a component is expected to continue functioning, the estimated remaining useful life (eRUL).

Survival analysis models are used for this purpose. These models can handle incompletely observed failures, such as components that were still working when observation ended, and do not require every failure to be fully documented. They learn under which circumstances components last longer or shorter. Factors such as driving profile, road conditions, load, and load history are taken into account. Simpler versions of these models rely on statistical assumptions about typical lifespans. More flexible variants recognize more complex relationships but require more data. The most powerful variants analyze the progression of sensor data over time and determine the current stage of a component’s life cycle, much like an experienced physician who identifies a trend from vital signs even before a symptom becomes apparent.
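How a model can learn from components whose failure was never observed is easiest to see in the Kaplan-Meier estimator, one of the most basic tools of survival analysis. It is shown here purely as an illustration of the principle, not as the model the article's approach relies on. The observations are hypothetical.

```python
def kaplan_meier(observations):
    """Return [(time, survival_probability)] at each observed failure time.

    `observations` is a list of (duration, failed) pairs. failed=False means
    the component was still working when observation ended (censored).
    """
    at_risk = len(observations)
    survival = 1.0
    curve = []
    for t in sorted({duration for duration, _ in observations}):
        failures = sum(1 for d, failed in observations if d == t and failed)
        leaving = sum(1 for d, _ in observations if d == t)
        if failures:
            survival *= 1 - failures / at_risk
            curve.append((t, survival))
        at_risk -= leaving   # failed and censored units both leave the risk set
    return curve


# Hypothetical component histories in thousands of kilometres.
history = [(100, True), (150, False), (200, True), (250, False), (300, True)]

curve = kaplan_meier(history)
```

The two components that were still running at 150 and 250 are not thrown away and not counted as failures. They count as "survived at least this long", which is the information a plain classifier would lose.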

What exactly is being analyzed? The importance of population selection

A long-haul truck and a delivery vehicle operating in urban areas provide sensor data in the same format—but they tell completely different stories. Running both through the same model would be like evaluating a marathon runner and a sprinter using the same performance standards.

That is why vehicles or machines are divided into homogeneous groups—populations—before a model is trained. Within a population, the units are sufficiently comparable that their patterns can be interpreted collectively.

Choosing the right populations is one of the most critical decisions in the entire process. Categories that are too broad dilute the patterns to the point of insignificance. Categories that are too fine leave too few data points to generate stable models. There is no objectively correct answer, only better and worse compromises that are reevaluated with each iteration.

The Process at a Glance

  1. Generating features
    Meaningful metrics are calculated from raw data—averages, trends, outliers, and comparisons with similar vehicles.

  2. Forming populations and training a model for each
    A separate model is trained for each vehicle group. In this process, the features within each population are reevaluated: A feature that is missing in the majority of vehicles in this group or has the same value across the board does not provide useful information and is removed. What is relevant across populations may not necessarily be relevant within a specific group.

  3. Generating labels
    The weak labels generated by the strategies described above are used for the training dataset—with the understanding that they are incomplete and with the goal of improving them iteratively.

  4. Evaluating the model
    After training, the model is evaluated to determine whether it is actually better than its predecessor. Two perspectives are essential:

    • Case-by-case analysis: Why does the model trigger an alert for this specific vehicle? Which features were the deciding factors?
    • Overall analysis: How does the model perform across the entire population? Where are the systematic weaknesses?
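The feature check from step 2 of the list above can be sketched in a few lines: within one population, drop every feature that is missing in the majority of vehicles or has the same value everywhere. The feature names, the sample rows, and the 50 percent limit as a reading of "the majority" are assumptions for this example.

```python
def usable_features(rows, max_missing_share=0.5):
    """Keep features that are mostly present and not constant within one population."""
    names = sorted({name for row in rows for name in row})
    keep = []
    for name in names:
        present = [row[name] for row in rows if row.get(name) is not None]
        missing_share = 1 - len(present) / len(rows)
        if missing_share > max_missing_share:
            continue                      # missing in the majority of vehicles
        if len(set(present)) <= 1:
            continue                      # same value across the board
        keep.append(name)
    return keep


# Hypothetical population of four vehicles, one row of features per vehicle.
population = [
    {"engine_temp": 88.0, "axle_count": 2, "retarder_temp": None},
    {"engine_temp": 91.5, "axle_count": 2},
    {"engine_temp": 90.2, "axle_count": 2, "retarder_temp": 64.0},
    {"engine_temp": 93.1, "axle_count": 2},
]

features = usable_features(population)  # ['engine_temp']
```

In a different population, for example one with mixed axle configurations, the same check could keep `axle_count`. That is the point of running it per group rather than once for the whole fleet.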

Simple hit rates are unsuitable for evaluation: when failures are rare, a model that never triggers an alarm is statistically almost always “correct” and is nevertheless worthless. Instead, two metrics are crucial: precision (how many of the triggered alerts were actually justified?) and recall (how many of the actual failures did the model detect?). A missed failure can be catastrophic, but an unjustified alert, meaning an unnecessary trip to the repair shop, is expensive as well, and the approach described here deliberately favors alerts that can be trusted. The Fβ score with β < 1 is therefore used as the combined quality metric: a generalization of the F1 score that weights precision more heavily than recall. When the model issues a warning, that warning should be reliable, even if the model does not detect every failure in the process.
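The difference between hit rate and Fβ is easy to verify with a small calculation. The sketch below uses the standard definition Fβ = (1 + β²) · precision · recall / (β² · precision + recall). The data set of 100 time windows with 5 failures and the choice β = 0.5 are assumptions for illustration.

```python
def precision_recall_fbeta(actual, alerts, beta=0.5):
    """Precision, recall, and F-beta for binary labels (1 = failure or alert)."""
    tp = sum(1 for a, p in zip(actual, alerts) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, alerts) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, alerts) if a == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta ** 2
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta


def hit_rate(actual, alerts):
    """Share of time windows in which alert and reality agree."""
    return sum(1 for a, p in zip(actual, alerts) if a == p) / len(actual)


# Hypothetical evaluation set: 5 failures in 100 time windows.
actual = [1] * 5 + [0] * 95
silent = [0] * 100                       # a model that never alerts
cautious = [1, 1, 0, 0, 0] + [0] * 95    # two alerts, both justified

silent_hit_rate = hit_rate(actual, silent)                  # 95 of 100 "correct"
silent_scores = precision_recall_fbeta(actual, silent)      # F-beta is 0
cautious_scores = precision_recall_fbeta(actual, cautious)  # precision 1.0, recall 0.4
```

The silent model reaches a hit rate of 95 percent and an Fβ of zero. The cautious model finds only two of five failures, but because both alerts were justified, β = 0.5 rewards it with a higher score than the F1 score would.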

A model that cannot explain its decisions will not survive in production—nor should it. Visualizations and explanations are therefore passed directly to domain experts. Their reaction is often more revealing than any metric: if they confirm a finding, confidence in the model grows. If they disagree—and this happens regularly—it is not a setback. It is the most valuable information a model can receive.

Not a One-Time Process: Why Iteration Is Not a Weakness

A predictive maintenance model isn’t a product that you build once, deliver, and then forget about. It’s a system that’s only as good as the feedback loops that keep it alive.

Feedback from domain experts

Experts from engineering and operations review the results using visualizations and clear explanations—and they regularly raise objections. Sometimes a threshold value doesn’t match operational reality. Sometimes a feature describes a statistical correlation that doesn’t make technical sense. Such objections are not ignored but fed back into the system as correction signals. With each iteration, the model moves closer to reality.

Refining features

Combining two metrics can be more meaningful than either one alone. A sensor that was previously ignored may turn out to be crucial.

Improving labels

As operational experience grows, rules and substitute indicators become more precise. What was initially a rough approximation becomes a reliable foundation.

Adjusting populations

A group that turns out to be too heterogeneous is subdivided. Two groups that are more similar than expected are merged.

Optimizing model parameters

The model’s internal settings are adjusted—a process that does not require deep insight, but without which no model can reach its full potential.

Model retraining during development

As long as the model is still being improved, all steps are repeated regularly—using a broader and better-labeled dataset than the last time. Each iteration is considered progress only if the metrics (precision, Fβ) improve or, at the very least, do not deteriorate.
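This acceptance rule is simple enough to automate. A minimal sketch, assuming each model version is summarized by a dictionary of metrics: a candidate counts as progress only if none of the tracked metrics is worse than in its predecessor. The metric names and values are hypothetical.

```python
def is_progress(candidate, predecessor, metrics=("precision", "f_beta")):
    """True if no tracked metric is worse than in the predecessor model."""
    return all(candidate[m] >= predecessor[m] for m in metrics)


# Hypothetical evaluation results of three model versions.
previous = {"precision": 0.71, "f_beta": 0.64}
better = {"precision": 0.74, "f_beta": 0.64}
mixed = {"precision": 0.78, "f_beta": 0.61}

accept_better = is_progress(better, previous)   # True
accept_mixed = is_progress(mixed, previous)     # False: f_beta deteriorated
```

Both versions must of course be evaluated on the same data for the comparison to mean anything.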

Model retraining in production

Even after development, a model that has been trained once is not set in stone. Vehicle fleets change, usage patterns shift, and repair practices evolve. A model that was calibrated on real-world data two years ago may now be systematically incorrect—without anyone noticing. That is why the same evaluation metrics must also be monitored during routine operation. As soon as precision or Fβ falls below defined thresholds, retraining is required.
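Monitoring in routine operation can be sketched as a rolling check: every alert that has been verified in the workshop is recorded as justified or not, and precision is computed over the most recent alerts. The class name, the window size, and the threshold are hypothetical. Monitoring Fβ would additionally require knowing which failures were missed.

```python
from collections import deque


class PrecisionMonitor:
    """Tracks precision over the most recent verified alerts."""

    def __init__(self, window=50, min_precision=0.7):
        self.outcomes = deque(maxlen=window)   # oldest outcomes drop out
        self.min_precision = min_precision

    def record(self, alert_was_justified):
        self.outcomes.append(bool(alert_was_justified))

    def precision(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def retraining_required(self):
        current = self.precision()
        return current is not None and current < self.min_precision


# Hypothetical sequence of verified alerts with a very small window.
monitor = PrecisionMonitor(window=4, min_precision=0.75)
for outcome in [True, True, True, True]:
    monitor.record(outcome)
healthy = monitor.retraining_required()     # False: all recent alerts justified

for outcome in [False, False]:
    monitor.record(outcome)
degraded = monitor.retraining_required()    # True: only 2 of the last 4 justified
```

The fixed-size window matters: a model that was excellent two years ago should not be able to hide its current errors behind a long history of correct alerts.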

Anyone who views this process as a sign of immaturity has misunderstood model development. Iteration is not proof that something doesn’t work—it is the mechanism through which it learns to work. And continuous monitoring during operation is not a sign of distrust in the model, but rather respect for the fact that reality is constantly evolving.

Conclusion: No verdict—just another challenge

Many companies are waiting for the right moment: only when the failure data is complete, only when the documentation is in order, only when the conditions are ideal. That moment will never come.

Weakly supervised machine learning isn’t a trick that turns bad data into good data. It’s a structured approach to a reality that cannot be dismissed: the data is incomplete. The labels are fuzzy. Expert knowledge is indispensable, but difficult to formalize.

That is precisely why the danger of false confidence—perhaps the greatest danger of this approach—is not an argument against the system, but rather a requirement for its design. A model that cannot explain its decisions has no place in this process. Domain experts who confirm or reject findings are not an optional addition—they are an indispensable part of this safeguard, along with the requirement for explainability and the metric-based approval criteria between iterations. And an iteration that only advances a model if it is better than its predecessor is not a weakness—it is the only form of quality assurance that works when the dataset is incomplete.

Getting started is feasible. The results improve, but only if the process remains consistent. And the cost of waiting for perfect conditions is evident in every overnight breakdown on a long-haul route.