TI2 - 2022/23 - Critical - Moodle SRE

Size

Medium 

Budget Epic Name

CTP Maintenance Budget

Feature Lead

Alistair Spark

Team

Nikola Bozhkov

Catalyst EU/AU

This feature covers the need for Moodle to be proactively monitored and for performance issues to be dealt with before they cause any CIs.

This ties in with the idea of an LA Data Availability team and, more generally, an Application SRE (Site Reliability Engineering) function: https://cloud.google.com/blog/products/devops-sre/how-sre-teams-are-organized-and-how-to-get-started

Key areas of focus for TI2:

  • Continue monitoring auto-scaling and refine its scaling parameters for peaks
  • Refine the coursemodinfo caching infrastructure to reduce cost while fulfilling the bandwidth requirement
  • Drive load testing of Moodle 4.1 and pipeline-based load testing

Some of the key recurring activities encapsulated here:

  • Active monitoring of Redis, the frontends, etc. during peaks of load
  • Drill into any blips in response times and document their causes
  • Push for resolution of any identified flaws
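
As a rough illustration of the "drill into blips" activity above, the sketch below flags response-time samples that jump well above a rolling baseline. It is only a minimal example under stated assumptions: the function name `find_blips`, the window size, the threshold factor and the sample data are all hypothetical and do not describe our actual monitoring stack.

```python
from statistics import median


def find_blips(samples, window=5, factor=2.0):
    """Return indices of samples that exceed `factor` times the median
    of the preceding `window` samples (hypothetical thresholds)."""
    blips = []
    for i in range(window, len(samples)):
        # Baseline is the median of the previous `window` samples, so a
        # single earlier spike does not distort it much.
        baseline = median(samples[i - window:i])
        if baseline > 0 and samples[i] > factor * baseline:
            blips.append(i)
    return blips


# Made-up response times in milliseconds, with one obvious spike.
response_ms = [120, 118, 125, 122, 119, 121, 480, 123, 120]
print(find_blips(response_ms))  # → [6]
```

A median baseline is used here rather than a mean so that one spike does not mask the samples immediately after it; any real thresholds would need tuning against observed peak-load behaviour.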


Moodle uptime is critical and this feature will always come before anything else. We currently rely on Catalyst to develop fixes for us. This will change over time, but we are well resourced, so this should not be seen as a barrier.