CloudOps and CICD

Reduced Downtime by 10x in One Quarter by Increasing Uptime to 99.95%

99.95%

Improved service availability and reliability from 99.5% to 99.95%.

63%

reduction in outages over 2 years.

55%

reduction in ticket volume over 2 years.

5 min

First response time for support tickets was reduced to 5 minutes.

The client’s development teams were spending a significant amount of time managing the operational and CI/CD aspects of their cloud-hosted applications. The client was looking for a partner to take over Cloud Ops so that its development teams could focus on their product roadmaps.

Key Challenges:

  • Issues in deployments.
  • Service outages.
  • Lack of automated tests for voice, video, and screen sharing across geographies.
  • The need to improve time to market and the ability to roll out new features.

Goals:

  • Automate configuration management.
  • Implement log monitoring and anomaly detection to avoid outages.
  • Automate monitoring and alerting by simulating voice calls, video calls, and screen sharing to proactively identify and fix issues across different geographies.
  • Improve service availability.
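As a rough illustration of the log monitoring and anomaly detection goal, the sketch below counts ERROR lines per minute and flags minutes whose count rises far above the recent baseline. It is a minimal sketch, not the method used in this engagement: the log line format, the function names `errors_per_minute` and `flag_anomalies`, and the window and threshold values are all assumptions made for the example.

```python
import re
from collections import Counter
from statistics import mean, pstdev

# Assumed (hypothetical) log format: "<ISO timestamp> <LEVEL> <message>"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}):\d{2}\S*\s+(\w+)\b")


def errors_per_minute(lines):
    """Count ERROR lines per minute bucket.

    Minutes with no errors are simply absent from the result; a fuller
    version would fill those gaps with zeros.
    """
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group(2) == "ERROR":
            counts[match.group(1)] += 1
    return counts


def flag_anomalies(counts, window=5, k=3.0, min_std=1.0):
    """Return indices whose count exceeds trailing mean + k * std.

    `counts` is a list of per-minute error counts in time order.
    `min_std` keeps a very quiet baseline from flagging tiny changes.
    """
    flagged = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        baseline = mean(history)
        spread = max(pstdev(history), min_std)
        if counts[i] > baseline + k * spread:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    sample = [2, 3, 2, 4, 3, 2, 40, 3]
    print(flag_anomalies(sample))
```

In a setup like the one described here, a flagged minute would be the trigger for an alert or a ticket; the trailing window is one simple baseline among many possible choices.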

Solution:

  • Built RPA bots for end-to-end application testing to ensure application availability and measure the quality of service being delivered in different geographies.
  • Automated the log monitoring for anomaly detection and issue reporting.
  • Implemented email alerts and automated Jira ticket creation with logs and screenshots based on the anomaly.
  • Developed a keyword-driven framework for RPA workflow automation.
  • Set up a 24/7 support team with Cloud and DevOps skills.
  • Built HA and DR for the infrastructure.
  • Automated configuration management with Ansible.
  • Automated CI/CD for static content and Kubernetes clusters.
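To illustrate what a keyword-driven framework for workflow automation can look like, here is a minimal sketch in Python. It is not the framework built for this client: the keyword names (`open_app`, `start_call`, `end_call`), the `ctx` dictionary, and the example URL are hypothetical, and the actions only record what they would do rather than drive a real application.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# Registry mapping keyword names to the functions that implement them.
KEYWORDS: Dict[str, Callable[..., None]] = {}


def keyword(name: str):
    """Decorator that registers a function under a keyword name."""
    def register(func: Callable[..., None]) -> Callable[..., None]:
        KEYWORDS[name] = func
        return func
    return register


# Hypothetical keywords: each one appends to a log instead of
# controlling a real browser or call client.
@keyword("open_app")
def open_app(ctx: Dict[str, Any], url: str) -> None:
    ctx["log"].append(f"open {url}")


@keyword("start_call")
def start_call(ctx: Dict[str, Any], kind: str) -> None:
    ctx["log"].append(f"start {kind} call")


@keyword("end_call")
def end_call(ctx: Dict[str, Any]) -> None:
    ctx["log"].append("end call")


def run_workflow(steps: List[Tuple[Any, ...]],
                 ctx: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Run each (keyword, *args) step in order against a shared context."""
    if ctx is None:
        ctx = {"log": []}
    for name, *args in steps:
        if name not in KEYWORDS:
            raise KeyError(f"unknown keyword: {name}")
        KEYWORDS[name](ctx, *args)
    return ctx


if __name__ == "__main__":
    workflow = [
        ("open_app", "https://example.test"),
        ("start_call", "video"),
        ("end_call",),
    ]
    print(run_workflow(workflow)["log"])
```

The point of the pattern is that a workflow is plain data (a list of keywords and arguments), so new test scenarios can be added without writing new code for each one.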

Results:

99.95%

Coordinating with our client’s team, we were able to improve service availability and reliability from 99.5% (provided by Google Cloud) to 99.95% through regional cluster deployments.
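The move from 99.5% to 99.95% corresponds to a tenfold cut in allowed downtime: about 216 minutes versus about 21.6 minutes over a 30-day month. The short calculation below shows the arithmetic; the 30-day period and the function name `downtime_minutes` are assumptions chosen for the example.

```python
def downtime_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Downtime allowed in a period at a given availability percentage."""
    total_minutes = period_days * 24 * 60
    return (1 - availability_pct / 100) * total_minutes


if __name__ == "__main__":
    before = downtime_minutes(99.5)
    after = downtime_minutes(99.95)
    print(f"99.5%:  {before:.1f} min per 30 days")
    print(f"99.95%: {after:.1f} min per 30 days")
    print(f"ratio:  {before / after:.1f}x")
```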

5 mins

First response time for support tickets was reduced to 5 mins.

10 mins

Average resolution time was cut to 10 mins.

During the migration process, we also carried out process reengineering across the cloud and made a few architecture changes, leading to:

63%

63% reduction in outages over 2 years.

55%

55% reduction in ticket volume over 2 years.

Successfully eliminated ~2 system-wide outages per month with zero impact on users.
