Agreed TI2 - 2021/22 - Critical - Moodle Service Reliability

Size

Medium 

Budget Epic Name

CTP Maintenance Budget

Jira Epic


Feature Lead

David Kwaw & Nikola Bohzkov

Team

Alistair Spark

Ehsan Anwar

David Kwaw

Nikola Bohzkov

This feature covers the need for Moodle to be proactively monitored, with performance issues dealt with before they cause any CIs.

This ties in with the idea of an LA Data Availability team and, more generally, an Application SRE function (Site Reliability Engineering: https://cloud.google.com/blog/products/devops-sre/how-sre-teams-are-organized-and-how-to-get-started).


Some of the key activities that will need to be progressed:

  • Set up team access to AWS CloudWatch monitoring via an IAM role
  • Post CI strands of work (Catalyst development but exchange and test)
  • CloudFront / S3 signed URLs
  • Active monitoring of Redis, the frontends, etc. during peaks of load
  • Drill through any blips in response times and document causes
  • Push for resolution of any identified flaws
  • Explore options for automating load testing (the effort on this will need to be time-boxed)
  • Improve the CI comms channel: ISD News editing by SO, and reach out to Mike Haward about the Status page to get it reset to be generic - https://www.ucl.ac.uk/isd/moodle-under-maintenance
  • Create a Moodle maintenance/outage page that can be used for traffic redirection in the event of a Moodle outage. This page needs to be editable by the Moodle team. Consider setting up a Moodle_Status Twitter feed as a short-term measure if we are unable to obtain an editable page.
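
As an illustration of the "drill through any blips in response times" activity above, here is a minimal sketch of how blips could be flagged from a series of response-time samples. It is not an agreed approach: the function name find_blips, the window size, and the threshold factor are all hypothetical, and it assumes the samples (in seconds) have already been exported from whatever monitoring source is in use.

```python
from statistics import median


def find_blips(samples, window=5, factor=3.0):
    """Return the indices of samples that look like response-time blips.

    A sample counts as a blip when it exceeds `factor` times the median
    of the `window` samples immediately before it. Both parameters are
    illustrative defaults (assumptions), not tuned values.
    """
    blips = []
    for i in range(window, len(samples)):
        # Baseline is the median of the preceding window, which is
        # robust to a single earlier spike.
        baseline = median(samples[i - window:i])
        if baseline > 0 and samples[i] > factor * baseline:
            blips.append(i)
    return blips


if __name__ == "__main__":
    # Hypothetical response times in seconds, with one spike at index 5.
    response_times = [0.20, 0.21, 0.19, 0.20, 0.22, 1.50, 0.20, 0.21]
    print(find_blips(response_times))
```

Using the median of the preceding window rather than the mean keeps one spike from inflating the baseline for the samples that follow it, so each flagged index can be documented against its cause individually.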


Moodle uptime is critical and this feature will always come before anything else. We currently rely on Catalyst to develop fixes for us; this will change over time, but we are well resourced, so this should not be seen as a barrier.