TI3 - 2021/22 - Critical - A Moodle Service Reliability

Size

Medium 

Budget Epic Name

CTP Maintenance Budget

Feature Lead

David Kwaw & Nikola Bohzkov

Team

Alistair Spark

Ehsan Anwar

David Kwaw

Nikola Bohzkov

This feature covers the need to monitor Moodle proactively and to resolve performance issues before they cause any CIs.

This ties in with the idea of an LA Data Availability team and, more generally, an Application SRE (Site Reliability Engineering) function: https://cloud.google.com/blog/products/devops-sre/how-sre-teams-are-organized-and-how-to-get-started

Key deliverables for TI3:

  • Complete S3 offloading
  • Formalise the Load Event Investigation process and agree a standard write-up format so that reports are consistent
  • SRE/Ops Team training
  • Expand SRE to Assessment@UCL, UCL eXtend (TI4? tbc)
  • Catalyst - SRE performance-related trackers integrated for inclusion in Moodle 4.1 (WRMS tickets prefixed with "SRE -  MDL-XXXXX")


Some of the key activities that still need to be progressed:

  • Post-CI strands of work (Catalyst development but exchange and test)
  • Regrading issue - https://wrms.catalyst.net.nz/wr.php?request_id=378838
  • CloudFront / S3 signed URLs - if not completed in TI2
  • Active monitoring of Redis, the frontends, etc. during peaks of load
  • Drill through any blips in response times and document causes
  • Push for resolution of any identified flaws
  • Explore options for automating load testing (the effort on this will need to be time-bound)
  • Improve the CI comms channel: arrange ISD News editing by SO, and reach out to Mike Haward about the Status page to get it reset to be generic - https://www.ucl.ac.uk/isd/moodle-under-maintenance
  • Create a Moodle maintenance/outage page that can be used for traffic redirection in the event of a Moodle outage. This page needs to be editable by the Moodle team. Consider setting up a Moodle_Status Twitter feed as a short-term measure if we are unable to obtain an editable page.
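
As an illustration of the "drill through any blips in response times" activity above, the following is a minimal sketch of how blips might be flagged from a series of response-time samples. It is not an existing tool or agreed method: the function name find_blips, the sample data, the window size and the threshold factor are all hypothetical, and real samples would come from whichever monitoring source the team uses.

```python
from statistics import median


def find_blips(samples, window=5, factor=3.0):
    """Flag samples whose response time exceeds `factor` times the
    median of the preceding `window` samples.

    `samples` is a list of (timestamp, response_ms) tuples in time order.
    The window and factor defaults are illustrative assumptions only.
    """
    blips = []
    for i in range(window, len(samples)):
        # Baseline is the median of the previous `window` response times.
        baseline = median(ms for _, ms in samples[i - window:i])
        ts, ms = samples[i]
        if ms > factor * baseline:
            blips.append((ts, ms, baseline))
    return blips


# Hypothetical response-time samples (time of day, milliseconds).
samples = [
    ("09:00", 210), ("09:01", 190), ("09:02", 205), ("09:03", 220),
    ("09:04", 200), ("09:05", 1450), ("09:06", 215), ("09:07", 198),
]

for ts, ms, baseline in find_blips(samples):
    print(f"{ts}: {ms} ms (baseline {baseline} ms)")
```

A median baseline is used here rather than a mean so that a single spike does not inflate the baseline for the samples that follow it. Each flagged sample would then be a candidate for investigation and a documented cause.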


Moodle uptime is critical and this feature will always come before anything else. We currently rely on Catalyst to develop fixes for us; this will change over time, but we are well resourced, so it should not be seen as a barrier.