Incident response, the process of responding to system disruptions and slowdowns, is a critical aspect of IT operations. It’s also an activity that traditionally involves a lot of manual, time-consuming processes.
That’s a challenge Harness is taking aim at with a new incident response service. The technology enters early access today as a module on the company’s eponymous platform. Harness got its start in 2017 with an initial focus on continuous integration/continuous delivery (CI/CD) automation for DevOps. In the years since, the company has expanded into a software delivery platform with multiple modules. In fall 2024, Harness broke into agentic AI, initially to help support software development.
Now the company is extending that same core agentic AI foundation for incident response. The new solution also benefits from licensed capabilities originally developed by development workflow vendor Transposit. Tina Huang, co-founder of Transposit, along with many members of her team, joined Harness in September 2024.
The goal with Harness Incident Response is to accelerate the mean time to resolution (MTTR) for an incident.
“When you think about what DevOps platforms have been up until now, it’s largely been about helping you structure those deployments,” Huang told VentureBeat. “I think the very natural place to go after that is, ‘How do I hand-hold your deployments after they’ve hit production?’”
How Harness enables autonomous incident response with agentic AI
At the core of Harness’ Incident Response module is the company’s AI agent architecture, first introduced in September 2024.
Jyoti Bansal, Harness CEO and co-founder, explained to VentureBeat that its AI agents are designed to provide autonomous assistance, going beyond just alerting engineers to incidents. Traditional incident response technology uses an approach known as a playbook. IT teams, often working with site reliability engineers (SREs), define playbooks that lay out step-by-step processes for recovering from different types of service disruptions.
Rather than relying solely on pre-defined playbooks, the AI agents can suggest actions, identify potential root causes and even create new playbooks on the fly.
“The agentic workflow is suggesting the actions that should be taken,” Bansal said.
Huang explained that AI agents execute multiple steps that are critical to help organizations respond faster to incidents. Even before a playbook can run, there is a certain amount of triage that needs to occur, Bansal explained. General triage can, for instance, identify what services are impacted or determine both upstream and downstream dependencies that will also be impacted by the incident.
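To make the triage step concrete, here is a minimal sketch of how impacted services might be worked out from a dependency map. It is an illustration only, not Harness' implementation: the service names, the `DEPENDS_ON` map and the `triage` helper are all hypothetical. Because "upstream" and "downstream" are used differently from team to team, the sketch reports "dependencies" (what the failing service calls) and "dependents" (what calls it).

```python
from collections import deque

# Hypothetical dependency map: each service lists the services it calls.
DEPENDS_ON = {
    "web": ["checkout", "search"],
    "checkout": ["payments", "inventory"],
    "search": ["inventory"],
    "payments": [],
    "inventory": [],
}


def reachable(graph, start):
    """Return every node reachable from start (excluding start), using BFS."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt != start and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def invert(graph):
    """Flip every edge so each service lists the services that call it."""
    inverted = {name: [] for name in graph}
    for caller, callees in graph.items():
        for callee in callees:
            inverted.setdefault(callee, []).append(caller)
    return inverted


def triage(graph, failing):
    """Summarize which services a failing service relies on and which rely on it."""
    return {
        "service": failing,
        "dependencies": sorted(reachable(graph, failing)),
        "dependents": sorted(reachable(invert(graph), failing)),
    }


if __name__ == "__main__":
    print(triage(DEPENDS_ON, "checkout"))
```

In this toy map, an incident on `checkout` flags `web` as a dependent that may also degrade, and `inventory` and `payments` as dependencies worth checking as possible causes.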
Harness’ system has agents that are aware of and plugged into multiple systems, and that can collect information automatically, including information and discussion from Slack channels. That information can then help other agents to alert humans and provide autonomous assistance.
While the system has a high degree of automation, Huang emphasized that humans are still in the loop. But instead of a human being alerted to a problem and then having to figure out whether there is a playbook, and if so how to run it, the system recommends the remediation and the human only needs to approve it.
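The approve-then-run pattern described here can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Harness' API: the `Recommendation` type, the `run_with_approval` function and the incident ID are hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Recommendation:
    """A remediation proposed by an agent (hypothetical structure)."""
    incident_id: str
    summary: str
    action: Callable[[], str]  # the remediation itself, run only after approval


def run_with_approval(rec: Recommendation,
                      approve: Callable[[Recommendation], bool]) -> str:
    """Run the recommended action only if the human reviewer approves it."""
    if not approve(rec):
        return "skipped: not approved"
    return rec.action()


if __name__ == "__main__":
    rec = Recommendation(
        incident_id="INC-1",
        summary="Roll back the latest deployment",
        action=lambda: "rolled back",
    )
    # In practice the approval would come from a person, for example through
    # a chat prompt; here a function stands in for that decision.
    print(run_with_approval(rec, approve=lambda r: True))
```

The point of the design is that the remediation is packaged as a deferred action: nothing runs until the approval callback returns true.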
Incident response requires more than just technology
The Harness Incident Response module can run on its own, meaning organizations don’t already need to be running any other Harness modules.
Bansal expects, however, that the combined offering — which could enable integration with multiple other workflows including DevOps or chaos engineering — could be beneficial. Chaos engineering is the process of injecting unexpected variables and events in an application to see how it responds. Harness has had a chaos engineering module as part of its platform since 2022.
Huang explained that as part of the incident response platform, an organization can run ‘fire drills’ alongside the chaos engineering module to test different scenarios.
“Incidents happen infrequently, and they are often the unfortunate result of something that you didn’t catch earlier on,” said Huang. “We want to enable a very proactive approach to incident response.”
How enterprises will benefit from agentic AI-driven incident response
One Harness customer using the incident response module is Tyler Technologies, which develops software for the public sector.
The company has been using the Harness platform for continuous deployment, cloud cost management and feature flag development. The addition of incident response could help solve a key challenge the company faces, explained Jeff Green, Tyler Technologies’ CTO.
“Our primary challenge is really integrating all the operational data, metrics and processes, then correlating them into a single unified approach to managing incidents and automating our response to them,” he told VentureBeat. “Our portfolio includes over 100 products built on different technologies using a wide variety of devops tools and platforms.”
The incident response capability will complement the work Tyler Technologies already does with Harness, for example by making it possible to correlate deployments or feature flags with incidents.
“We think the AI capabilities being infused into the product will save a lot of time by helping us with root cause analysis, identifying ways to mitigate or resolve incidents, and with incident prevention,” said Green. “Much of this work today is done by humans pulling data from multiple sources, scouring logs and application performance monitoring (APM) data and looking for patterns, all tasks that AI is better suited to.”
The ROI of agentic AI for incident response
Another Harness customer evaluating the incident response module is InStride, where Omar Alwattar is a senior DevOps engineer.
Alwattar told VentureBeat that his firm has been using the Harness Continuous Delivery module. He noted that when it comes to incident response, his organization has two key challenges: preventative monitoring and root cause identification. The new Harness incident response tool is interesting to his company, he said, as it will help with faster issue identification and automated fix suggestions.
“In terms of ROI, the most significant impact would be on downtime reduction, as it directly influences SLA adherence and customer satisfaction,” Alwattar said. “Additionally, by automating aspects of incident response, our 11-person DevOps team could focus more on strategic projects and innovation rather than constant troubleshooting.”