By Manisha Singh, Advanced Analytics & Tech Strategist



Can you relate to this image? This is a typical log file of the kind that support and dev teams have struggled with, reading it manually line by line to resolve an outage or anomaly. Such was the era of traditional IT operations, where the process was time-consuming; correlation between different layers of the platform and multiple log files was difficult; results could vary and were valid only for a particular time window; and results could be lost because history wasn’t saved. This approach therefore did not scale.

This consumed multiple resources from the support, dev and infra teams, who spent emergency calls running several hours staring at several dashboards and tools, wondering where to start. We followed a reactive approach: for example, we would wait for a CPU metric to breach the 80% threshold before we could act.

What if we could use the power of the logs that we don’t really read today and let the machine analyse the trends? What if the machine could predict a fault, for example that tomorrow at 8 PM this platform will suffer high CPU usage? Such is the power of AIOps.


AIOps can help predict and prevent anomalies well before the user is impacted. It creates real-time baselines of normal behaviour and alerts on deviations. It also provides automated root cause analysis, so we know which problem areas to narrow down on. It can automatically correlate events across different layers (infra, OS, application and database) to give you a single meaningful incident while filtering out noise. This greatly reduces MTTR (mean time to resolution) for outages.
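To make the contrast between a fixed threshold and a learned baseline concrete, here is a minimal sketch in Python. It is not how any particular AIOps product works: the function names, the window size, the z-score limit and the sample CPU readings are all assumptions chosen for illustration, and a simple rolling mean and standard deviation stand in for whatever model a real platform would use.

```python
from collections import deque
from statistics import mean, stdev


def static_threshold_alerts(samples, threshold=80.0):
    """Reactive check: flag indices where the metric exceeds a fixed threshold."""
    return [i for i, value in enumerate(samples) if value > threshold]


def baseline_alerts(samples, window=5, z_limit=3.0):
    """Flag indices whose value deviates from a rolling baseline.

    The baseline is the mean of the previous `window` samples; a sample is
    flagged when it lies more than `z_limit` standard deviations away.
    Both parameters are illustrative assumptions, not recommended values.
    """
    history = deque(maxlen=window)
    alerts = []
    for i, value in enumerate(samples):
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # Skip perfectly flat windows, where sigma is zero.
            if sigma > 0 and abs(value - mu) > z_limit * sigma:
                alerts.append(i)
        history.append(value)
    return alerts


# Hypothetical CPU utilisation readings (percent), one per interval.
cpu = [30, 31, 29, 30, 32, 31, 30, 55, 31, 30]

print(static_threshold_alerts(cpu))  # no sample crosses 80%
print(baseline_alerts(cpu))          # the jump to 55% stands out from the baseline
```

In this made-up series the jump to 55% never reaches the 80% threshold, so the reactive check stays silent, while the baseline check flags it as a departure from normal behaviour.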

AIOps has taken human-machine collaboration to the next level, where humans and machines are not just coexisting but working together like team members.


So AIOps has great potential and is being adopted for a lot of use cases at a fast pace.


This raises the need for a real, honest dialogue about how we build responsible AI systems. How do we enforce human checks and balances on these machines? This responsibility lies with all engaged citizens.

What does responsible mean? It means being able to justify automated decisions.

Designing Responsible AI systems

We must start by answering the important questions around policy, technology and process.


The design pipeline showcases the important elements of the overall AI model design process. We always start with purpose, i.e. what we want out of the model. Then comes the sampling process where…
