Industrial Holding Increased Service Availability to 99.7%
☹️ Manual processing of thousands of notifications from IT monitoring systems:
Like many industrial companies, our client ran a large situational center that received information from every part of the business, from IT systems to temperature and air pollution sensors. Each engineer was responsible for one layer of information: one for the network, another for the power supply, a third for applications, a fourth for the data center, and so on. All of these processes were manual: dozens of people watched screens 24/7.
☹️ Low SLAs:
During the holding's digital transformation, the load on IT grew sharply and the existing staff could not keep up. Important alerts were lost among a large number of less important ones. SLA compliance for critical business services, such as accounting and logistics software, security tools, and video analytics, began to decline.
☹️ Low service availability:
The number of helpdesk complaints about slow and incorrect behavior of business applications rose sharply.
☹️ Contractors inflated their results:
IT support relied on outsourcing companies and, in the absence of objective controls, not all contractors reported their real KPIs.
Deployment: Enterprise on-premise version with priority support.
Period: 2.5 months.
💪 Connected the contractors’ monitoring tools to the customer’s Acure AIOps (15 Zabbix, 4 Splunk, and 5 Nagios servers).
💪 Set up monitoring of synthetic transactions (autotests): every 10 minutes, 520 tests covering all key business processes were launched, so the health of each service was visible at any point in time.
💪 Built a Resource Service Model that links services to the resources they depend on, and correlated events on the infrastructure layer with the business layer.
💪 Launched auto-assignment of incidents. Previously, contractors registered incidents themselves. Now all events are recorded, stored, and analyzed, and registration is handled by the system: if a problem occurs at the business level, the incident is assigned to the workgroup whose resources show the problem.
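
The routing logic described above can be sketched roughly as follows. This is a minimal illustration and not Acure's actual implementation: the service names, resource names, workgroups, severity scale, and the `assign_incident` function are all hypothetical. It assumes a failed synthetic test marks a business-level problem, and that the incident goes to the owner of the most severely affected resource the service depends on.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    resource: str  # resource named in an infrastructure alert (hypothetical)
    severity: int  # higher means worse (hypothetical scale)


# Hypothetical Resource Service Model: business service -> resources it uses.
SERVICE_RESOURCES = {
    "logistics": ["db-01", "app-03", "core-switch"],
    "accounting": ["db-02", "app-01"],
}

# Hypothetical ownership map: resource -> responsible workgroup.
RESOURCE_OWNER = {
    "db-01": "dba-team",
    "db-02": "dba-team",
    "app-01": "app-team",
    "app-03": "app-team",
    "core-switch": "network-team",
}

FALLBACK_GROUP = "service-desk"  # used when no resource-level cause is found


def assign_incident(service: str, test_passed: bool,
                    events: list[Event]) -> Optional[str]:
    """Return the workgroup for a new incident, or None if the test passed."""
    if test_passed:
        return None  # business layer is healthy, nothing to register
    dependencies = set(SERVICE_RESOURCES.get(service, []))
    # Keep only infrastructure events on resources this service depends on.
    related = [e for e in events if e.resource in dependencies]
    if not related:
        return FALLBACK_GROUP
    worst = max(related, key=lambda e: e.severity)
    return RESOURCE_OWNER.get(worst.resource, FALLBACK_GROUP)


events = [Event("db-01", 3), Event("app-01", 5), Event("core-switch", 4)]
group = assign_incident("logistics", test_passed=False, events=events)
```

In this example the `app-01` event is ignored because the logistics service does not depend on that resource, so the incident goes to the owner of `core-switch`.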
😊 Reduced the risk of the human factor:
100% of the data is processed by robots.
😊 Increased service availability:
From 98.5% to 99.7%.
😊 Reduced crash response time:
From 30 to 15 minutes.
😊 Made IT more transparent from a business point of view:
Bottlenecks and unscrupulous contractors were identified. Outsourced specialists are now under closer control.
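
To put the availability figures in perspective, the short calculation below converts an availability percentage into hours of downtime. It assumes availability is measured around the clock over a 365-day year, which the case study does not state. Under that assumption, 98.5% corresponds to roughly 131 hours of downtime per year and 99.7% to roughly 26 hours.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: 24/7 service, non-leap year


def downtime_hours(availability_pct: float,
                   period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime implied by an availability percentage."""
    return (1 - availability_pct / 100) * period_hours


before = downtime_hours(98.5)  # availability before the project
after = downtime_hours(99.7)   # availability after the project
print(f"before: {before:.1f} h/year, after: {after:.1f} h/year")
```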