CloudOps and CI/CD
Reduced Downtime by 10x in One Quarter by Increasing Uptime to 99.95%
99.95%
Improved service availability and reliability from 99.5% to 99.95%.
63%
reduction in outages over 2 years.
55%
reduction in ticket volume over 2 years.
5 min
First response time for support tickets was reduced to 5 minutes.
Key Challenges:
- Deployment issues.
- Service outages.
- Lack of automated tests for voice, video, and screen sharing across geographies.
- Slow time to market and limited ability to roll out new features.
Goals:
- Automate configuration management.
- Implement log monitoring and anomaly detection to avoid outages.
- Automate monitoring and alerting by simulating voice calls, video calls, and screen sharing to proactively identify and fix issues across different geographies.
- Improve service availability.
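The source does not say which anomaly-detection method was used. As one illustration of "log monitoring and anomaly detection," here is a minimal sketch of a swapped-in technique, a rolling z-score over per-minute error counts. It assumes those counts have already been parsed from the application logs. The function name `detect_anomalies` and its `window` and `threshold` defaults are hypothetical choices, not the team's actual configuration.

```python
from collections import deque
from statistics import mean, pstdev


def detect_anomalies(counts, window=10, threshold=3.0):
    """Return indices whose value deviates from the trailing window by more
    than `threshold` population standard deviations (rolling z-score).

    Assumption: `counts` is a list of per-minute error counts extracted from
    application logs; `window` and `threshold` are illustrative defaults.
    """
    history = deque(maxlen=window)  # trailing baseline of recent values
    anomalies = []
    for i, value in enumerate(counts):
        # Only score a point once a full baseline window is available.
        if len(history) == window:
            mu = mean(history)
            sigma = pstdev(history)
            if sigma == 0:
                # Flat baseline: any change at all is treated as anomalous.
                if value != mu:
                    anomalies.append(i)
            elif abs(value - mu) / sigma > threshold:
                anomalies.append(i)
        # Every value, anomalous or not, joins the baseline afterwards.
        history.append(value)
    return anomalies


# A steady error rate followed by a sudden spike at index 10.
print(detect_anomalies([2, 3] * 5 + [40, 3, 2]))
```

In a setup like the one described, each flagged index could trigger the email alert and Jira ticket. Letting spikes into the baseline is a simplifying choice. It briefly widens the tolerance after an incident, which reduces repeat alerts for a single event.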
Solution:
- Built RPA bots for end-to-end application testing to verify application availability and measure the quality of service delivered in different geographies.
- Automated log monitoring for anomaly detection and issue reporting.
- Implemented email alerts and automated Jira ticket creation, triggered by detected anomalies and including logs and screenshots.
- Developed a keyword-driven framework for RPA workflow automation.
- Set up a 24/7 support team with Cloud and DevOps skills.
- Built high availability (HA) and disaster recovery (DR) into the infrastructure.
- Automated configuration management with Ansible.
- Automated CI/CD for static content and Kubernetes clusters.
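The case study does not describe how its keyword-driven framework is built. The sketch below shows the general idea behind such frameworks: test cases are written as rows of human-readable keywords and arguments, and a runner maps each keyword to a handler function. Everything here is hypothetical, including the keyword names, the handlers, and the `session` dict. A real RPA bot would drive the voice, video, or screen-sharing UI inside each handler instead of updating a dictionary.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical shared state passed to every keyword handler.
Session = Dict[str, object]
Keyword = Callable[..., None]


def open_app(session: Session, region: str) -> None:
    # Stand-in for launching the app from a bot in the given region.
    session["region"] = region
    session["state"] = "idle"


def join_call(session: Session, call_type: str) -> None:
    if session.get("state") != "idle":
        raise RuntimeError("app not open")
    session["state"] = f"in_{call_type}_call"


def start_screen_share(session: Session) -> None:
    if not str(session.get("state", "")).startswith("in_"):
        raise RuntimeError("screen share requires an active call")
    session["sharing"] = True


def expect_state(session: Session, expected: str) -> None:
    actual = session.get("state")
    if actual != expected:
        raise AssertionError(f"expected state {expected!r}, got {actual!r}")


# Keyword table: the vocabulary that non-programmers use to write test cases.
KEYWORDS: Dict[str, Keyword] = {
    "Open App": open_app,
    "Join Call": join_call,
    "Start Screen Share": start_screen_share,
    "Expect State": expect_state,
}


def run_test_case(steps: List[Tuple[str, ...]]) -> Tuple[bool, List[str]]:
    """Execute keyword rows in order and stop at the first failing step."""
    session: Session = {}
    log: List[str] = []
    for number, (keyword, *args) in enumerate(steps, start=1):
        try:
            KEYWORDS[keyword](session, *args)
            log.append(f"step {number} PASS: {keyword} {args}")
        except Exception as exc:  # unknown keyword (KeyError) or failed check
            log.append(f"step {number} FAIL: {keyword} {args}: {exc!r}")
            return False, log
    return True, log


video_check = [
    ("Open App", "eu-west"),
    ("Join Call", "video"),
    ("Start Screen Share",),
    ("Expect State", "in_video_call"),
]
passed, log = run_test_case(video_check)
print(passed)
print("\n".join(log))
```

Keeping test cases as data rows lets the same scenario run unchanged from bots in different regions. Only the region argument in the first row differs. The failure log could feed the alerting and ticketing steps listed above.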
Results:
99.95%
Working with our client’s team, we improved service availability and reliability from 99.5% (the baseline provided by Google Cloud) to 99.95% by moving to regional cluster deployments.
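The "10x" in the headline follows directly from these two availability figures: the allowed downtime drops from 0.5% to 0.05% of the time. The sketch below computes the downtime budget for each figure. It assumes a 30-day month, but the 10:1 ratio holds for any period. The helper name `downtime_budget_minutes` is illustrative.

```python
from fractions import Fraction

MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes (assumed period)


def downtime_budget_minutes(availability_percent: str) -> Fraction:
    """Maximum downtime per 30-day month allowed by an availability figure.

    Fraction keeps the arithmetic exact and avoids floating-point rounding.
    """
    availability = Fraction(availability_percent) / 100
    return (1 - availability) * MINUTES_PER_30_DAY_MONTH


before = downtime_budget_minutes("99.5")   # 216 minutes (3.6 hours)
after = downtime_budget_minutes("99.95")   # 21.6 minutes
print(float(before), float(after), before / after)
```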
5 min
First response time for support tickets was reduced to 5 minutes.
10 min
Average resolution time was cut to 10 minutes.
During the migration, we also re-engineered processes across the cloud environment and made several architecture changes, which led to:
63%
63% reduction in outages over 2 years.
55%
55% reduction in ticket volume over 2 years.
Eliminated approximately 2 system-wide outages per month, with zero impact on users.