It is Monday morning, and after a long weekend of system trouble the cloud operations team is discussing what happened. It appears that many systems that interacted with an innovative new machine-learning-enabled inventory management system had issues over the weekend. The postmortem concluded the following:
The batch process that moved raw data from the operational database into the training database failed, as did the automated recovery process. An ops team member who was working over the weekend tried to resubmit the job but triggered not one but four partial updates, leaving the training database in an unstable state.
This in turn caused the knowledge engines in the machine learning systems to train on bad data, which meant the new data in the knowledge base had to be removed and the models rebuilt.
In addition, several external data feeds, such as pricing and tax information, were updated to the training database at the same time. Although those updates worked fine, they too had to be backed out of the knowledge database because the operational data was not in a good state.
The system was down for two days, and the company lost $4 million when you count lost productivity, customer fallout, and PR problems.
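The root of the cascade was a resubmission path that was not idempotent: each retry applied part of the batch again, compounding the damage. A minimal sketch of the missing safeguard, using hypothetical names (`run_id`, `training_store`, `completed_runs`) purely for illustration, might look like this:

```python
# Hypothetical sketch: an idempotent batch transfer that either promotes a
# complete, validated batch or leaves the training store untouched. All
# names here are illustrative assumptions, not the actual system's API.

def transfer_batch(run_id, source_rows, training_store, completed_runs):
    """Copy source_rows into training_store exactly once per run_id."""
    # Resubmitting the same run is a no-op, so a retry can never
    # produce a second partial update.
    if run_id in completed_runs:
        return False

    # Stage the whole batch first and validate it before anything
    # becomes visible to training.
    staging = list(source_rows)
    if not staging or any(row is None for row in staging):
        raise ValueError(f"run {run_id}: incomplete batch, nothing promoted")

    # Promote as one step: all rows land, then the run is marked done.
    training_store.extend(staging)
    completed_runs.add(run_id)
    return True
```

With a guard like this, the weekend resubmits would have either succeeded whole or been rejected, rather than leaving the training database in an in-between state.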
This isn't 2025; this is today. As businesses find more uses for "good and cheap" cloud-based machine learning systems, we're finding that systems which leverage machine learning are complex to operate. Ops teams don't anticipate this degree of difficulty and complexity, and are finding themselves undertrained, understaffed, and underfunded.
The assumption is that cloud operations teams can manage cloud-based databases, cloud-based storage, and cloud-based compute with a fairly easy transition. For the most part that has been the case, because cloud-based systems are similar to traditional systems.
However, systems based on machine learning are, for the most part, new to operations teams. These systems have specialized purposes, as well as specialized subsystems, such as databases and knowledge engines, that need to be monitored and managed in specific ways. This is where current operations teams are falling short.
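Part of what makes these systems different is that monitoring them means watching the data, not just the infrastructure. As a sketch of the kind of ML-specific health check a traditional dashboard lacks, consider a simple null-ratio check on a training feed (the field names and threshold here are illustrative assumptions):

```python
# Hypothetical sketch: traditional ops monitoring watches CPU, disk, and
# uptime, but an ML pipeline also needs checks on the data itself. The
# 5% null threshold is an assumption chosen for illustration.

def check_training_feed(rows, expected_fields, max_null_ratio=0.05):
    """Flag a training feed whose records are missing too many fields."""
    problems = []
    if not rows:
        problems.append("feed is empty")
        return problems
    for field in expected_fields:
        nulls = sum(1 for r in rows if r.get(field) is None)
        ratio = nulls / len(rows)
        if ratio > max_null_ratio:
            problems.append(f"{field}: {ratio:.0%} null values")
    return problems
```

A check like this, run before each training cycle, is the sort of thing that could have caught the bad operational data before the models trained on it.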
The fix is easy to understand, but most enterprises won't like it, because it means spending more money on ML cloudops, or abandoning ML cloudops altogether. Machine learning systems are technological chainsaws: used carefully, they are highly effective; mishandled, they can be harmful. Failures can go unnoticed, and if the system automatically uses the resulting bad knowledge, you can end up with huge problems that aren't discovered until much damage has been done. More risk than reward, it seems.