How will AIOps change IT Operations?

How will the application of AI change IT operations? Let’s examine one example and make one prediction for the future.

If “IT Operations” generally means this:

“Managing and maintaining the infrastructure, networks, applications, and services that support the organization’s business functions.”

Then “AIOps” implies this:

“Using AI to assist with managing and maintaining the infrastructure, networks, applications, and services that support the organization’s business functions.”

Easy enough, but what is the range of what is possible? Let’s define a value pyramid of the potential layers for AIOps:

For this post, let’s look at just the layer called “Incident Detection and Alerting.”

All large organizations have server monitoring and alerting tools. You define pre-established “rules and profiles” that trigger when an undesirable threshold is reached, such as being low on memory or storage or having a high CPU utilization rate. Alarms and notifications alert the Operations Team for possible action.

Typically, these systems have agents on each server that collect the data, or there is a mechanism to pull the health metrics of servers into a central database. Then, the rules engine can look at thresholds and time series data to trigger potential alarms or notifications.
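
To make the traditional approach concrete, here is a minimal sketch of a threshold rule evaluated against collected samples. The `Rule` class, the `evaluate` function, the metric name, and the sample values are all hypothetical and invented for illustration; real monitoring products have their own rule formats.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rule:
    """A hypothetical threshold rule, not any real product's format."""
    name: str
    metric: str
    threshold: float
    sustained_samples: int  # consecutive readings above threshold before alarming


def evaluate(rule: Rule, samples: List[float]) -> bool:
    """True if the most recent `sustained_samples` readings all exceed the threshold."""
    recent = samples[-rule.sustained_samples:]
    if len(recent) < rule.sustained_samples:
        return False  # not enough history yet
    return all(value > rule.threshold for value in recent)


# Illustrative data only: CPU percentage readings pulled into a central store.
cpu_rule = Rule(name="high-cpu", metric="cpu_pct", threshold=90.0, sustained_samples=3)
history = {"cpu_pct": [42.0, 95.0, 97.0, 99.0]}

alarms = [r.name for r in [cpu_rule] if evaluate(r, history[r.metric])]
print(alarms)
```

Requiring several consecutive readings above the threshold is one common way to avoid alarming on a single short spike; the count of three here is an arbitrary choice.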

Let’s look at a very simplified alternate approach that uses AI.

Here is a graph of the CPU consumption of a large server running AIX (Unix):

Read the graph from left to right. You see this server’s weekly activity: It is busier during the week and quieter for a few days over the weekend. The pattern repeats for several weeks in a row.

Then, the pattern changes on a specific date. Do you see it around July 13th – 14th? Instead of using a traditional monitoring system, let’s see what AI concludes:

INPUT PROMPT:

“Look at the attached graphic image of server performance and determine the date the historical pattern of behavior changes.”

RESPONSE:

“The historical pattern of behavior in the server performance chart appears to change around July 14th. Until this point, the CPU utilization shows a regular pattern of peaks and troughs. However, starting from July 14th, there is a noticeable increase in overall utilization and more frequent high peaks, indicating a change in the server’s workload or behavior.”

The AI correctly identified when the pattern had changed. It did so by examining only the image of the graph, without access to the underlying data points or time series data.
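
For contrast, here is a rough sketch of how the same question could be asked of the numbers instead of the picture. This is not what the AI did, and the data is synthetic: the daily values, the three-week baseline, and the `first_changed_week` helper are all assumptions made up for illustration.

```python
from statistics import mean, stdev
from typing import List, Optional


def first_changed_week(daily_cpu: List[float],
                       baseline_weeks: int = 3,
                       z_limit: float = 3.0) -> Optional[int]:
    """Return the index of the first week whose mean CPU departs from the baseline.

    The first `baseline_weeks` weeks define "normal". Returns None if no later
    week deviates by more than `z_limit` standard deviations.
    """
    # Split into complete 7-day weeks, dropping any trailing partial week.
    weeks = [daily_cpu[i:i + 7] for i in range(0, len(daily_cpu) - 6, 7)]
    weekly_means = [mean(week) for week in weeks]

    baseline = weekly_means[:baseline_weeks]
    mu = mean(baseline)
    sigma = max(stdev(baseline), 1.0)  # floor avoids a zero spread on very regular data

    for idx in range(baseline_weeks, len(weekly_means)):
        if abs(weekly_means[idx] - mu) > z_limit * sigma:
            return idx
    return None


# Synthetic data: busy weekdays, quiet weekends, then a heavier workload.
normal_week = [60.0, 62.0, 61.0, 63.0, 60.0, 30.0, 28.0]
busy_week = [85.0, 88.0, 90.0, 87.0, 86.0, 70.0, 65.0]

print(first_changed_week(normal_week * 3 + busy_week * 2))
```

Even this simple version needs a baseline window and a sensitivity limit to be chosen and tuned, which is the kind of configuration the image prompt above did not require.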

It’s a clue to what will be possible in the near future. Go back to the value pyramid we defined at the beginning. The top layer, called “Agentic Operations,” implies this:

Agentic AI refers to a class of artificial intelligence systems designed to act as autonomous agents capable of performing tasks, making decisions, and interacting with their environments without requiring direct human intervention.

Agentic means that the AI will be able to perform all the layers of our AIOps pyramid:

Collect the data.
Determine potential problems and solutions.
Implement the changes.
Finally, verify that the changes helped.
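
The four steps above can be sketched as a simple control loop. Everything here is hypothetical: the `run_cycle` function, the simulated `server` dictionary, the 512 MB limit, and the "add_memory" action are stand-ins, and in a real agentic system the diagnose step would be where the AI model sits.

```python
from typing import Callable, Dict, Optional

Metrics = Dict[str, float]


def run_cycle(collect: Callable[[], Metrics],
              diagnose: Callable[[Metrics], Optional[str]],
              apply: Callable[[str], None]) -> str:
    """One pass of collect -> diagnose -> implement -> verify."""
    before = collect()               # 1. Collect the data.
    action = diagnose(before)        # 2. Determine the problem and a fix.
    if action is None:
        return "healthy"
    apply(action)                    # 3. Implement the change.
    after = collect()                # 4. Verify that the change helped.
    return "resolved" if diagnose(after) is None else "escalate"


# A simulated server stands in for real infrastructure.
server = {"free_mem_mb": 200.0}


def collect() -> Metrics:
    return dict(server)


def diagnose(metrics: Metrics) -> Optional[str]:
    # Assumed rule for illustration: under 512 MB free means "add memory".
    return "add_memory" if metrics["free_mem_mb"] < 512 else None


def apply(action: str) -> None:
    if action == "add_memory":
        server["free_mem_mb"] += 1024


print(run_cycle(collect, diagnose, apply))
```

Returning "escalate" when the verification step still sees a problem is a design choice in this sketch: it hands the issue back to a human instead of retrying indefinitely.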

That could mean allocating memory, CPU, or storage on the fly, changing OS tuning parameters related to networking, detecting an intrusion, applying patches, or recommending architectural changes in the apps. The possibilities are endless.

Much of this can be done today with existing traditional tooling, but it requires multiple products, extensive configuration, ongoing oversight, and significant cost.

Here is the prediction:

Soon, operating systems will have a micro LLM built into the operating system code itself. This micro LLM will be fine-tuned to understand the unique nuances of that particular operating system. It will be self-updating and able to operate in an agentic manner to perform a comprehensive variety of administrative tasks.

This is where AIOps is going.