Support and Incident Processes
Purpose
This documents the process of how Monitor Space Hazards (MSH) users raise a support query or incident with the service and how it will be resolved.
Context and Scope
Monitor Space Hazards is expected to run reliably at all times and available 24/7. However, like all web services, bugs can be introduced and unexpected downtime can occur, at which point support will be necessary.
UKSA requires technical support for Monitor Space Hazards 9-5, Monday to Friday. This is on the understanding that all events so far have had around 7 days notice and therefore full support 24/7/365 is not necessary.
Note: this means that UKSA and operators will carry risks against unexpected occurrences. This is something which can be monitored throughout public beta and see if more coverage is necessary.
Support Process
User raises support email
We will use the email address ukspaceagency.support@thepsc.co.uk as the primary support tool. As such, we will communicate to users to use this email address when they are onboarded.
Any support requests that come into monitorspacehazards@ukspaceagency.gov.uk will be triaged by the UK Space Agency team and forwarded on if necessary to ukspaceagency.support@thepsc.co.uk.
The proposed support process is that:
- The email is sent to the ukspaceagency.support@thepsc.co.uk mailbox accessible to all support team members (or forwarded from monitorspacehazards@ukspaceagency.gov.uk).
- This provokes a workflow which includes updating the #uksa-sst-beta-alerts-prod Slack channel and acknowledging the user’s report via email.
- The on-call team member is tasked to always be alert to Slack channel inputs especially on: #uksa-sst-beta-alerts-prod on The PSC Slack workspace
- They analyse the request and follow up as appropriate.
- Response time and action determined by classification of the issue identified.
- If required, a Super User can create an incident banner and deploy it to the website, alerting users that there is a situation impacting the service. If the site is down, then users can find out from the Pingdom Status page
- If appropriate, actions taken by on-call team members are determined by runbooks.
- If necessary, a Jira issue is created with the issue and actions taken and/or required.
- The user(s) is notified of actions taken to address the issue.
Support and response expectations and clarifications
Classification | Example | Out of hours Support Available |
---|---|---|
P1: Critical Issue | Service (API and/or web portal) is unavailable to end users due to a problem with the site Serious security breach on platform |
Acknowledgment within 30 minutes Response/update from an on-call member of the team within 1 hour Issue resolved within 4 hours |
P2: Major Issue | Serious degradation of service (API and/or web portal) | Acknowledgment within 30 minutes Response/update from on-call team member within 1 hour Resolved within 1 working day |
P3: Minor Issue | Component failure that is not immediately impacting on service (API and/or web portal) Problem uploading Ephemeris or analysis data User creation and management is unavailable |
Acknowledgment within 30 minutes Response from on-call team member within 1 hour Update within 1 working day |
Incident Process
In the event of an incident or problem with the service, we will support and respond to issues according to the same classification system to the support process.
Incidents or problems with the service may be detected from a number of different sources, for example, a problem may be raised by a user or we may be alerted to an issue via Sentry or Logit.io.
We will use ukspaceagency.support@thepsc.co.uk as the primary tool for reporting problems. As such, this must be made clear to all users when they register. Please note: monitorspacehazards@ukspaceagency.gov.uk will also be monitored for any support requests, which will be forwarded on.
Upon detection of a problem the following process should be followed:
- Email is sent (or forwarded) to ukspaceagency.support@thepsc.co.uk by on support team member or user (whichever is first).
- This provokes a workflow which includes updating the #uksa-sst-beta-alerts-prod Slack channel and acknowledging the user’s report via email.
- The on-call team member (as determined by a rota) is tasked to always be alert to Slack channel inputs, especially on: #uksa-sst-beta-alerts-prod on The PSC Slack workspace
- The issue is investigated and classified and, if necessary, all stakeholders are notified of the issue.
- Response time and action determined by classification of the issue identified.
- If appropriate, actions taken by on-call team members are determined by runbooks.
- If user facing components are affected by the incident, the Pingdom service status page will indicate this.
- If possible, then a Super User can login and create an incident banner to display on Monitor Space Hazards from the /Account page. The Super User can choose whether to schedule this message or activate it straight away and display until deactivated. There are four types of incident banner:
- Major Incident - for example, a collision that has impacted a number of objects and may be causing innacurate information
- Technical issue ingesting conjunction data messages - For instance Space-Track information is down
- Technical issue ingesting object data - For instance Space-Track or ESA information is down
- Technical issue ingesting orbital analyses - For instance there is an issue with the UK Sapce Agency Orbital Analysis systems
- If necessary, a Jira issue is created with the issue and actions taken and/or required.
- Incident review is conducted by the team to establish the route cause and possible preventive actions explored.
Limits to support
- In all instances we will attempt to resolve P1 problems as soon as possible; but in the worst case scenario other solutions will be considered, including rolling back to the previous version; and in the worst case making the service inaccessible until appropriate development team members are available.
- We will not support outside of the dedicated hours. In the event of an issue occuring outside of Monday to Friday 9-5 we will respond the next working day.
- Under exceptional circumstances (e.g. a known, very high probability conjunction event that will make its closest approach on a weekend) we will make best efforts to ensure service availability outside of office hours.
Alternative support systems considered
- A service desk/ticketing system such as Zendesk or Freshdesk. These might be the right solutions in time but were rejected as follows:
- Monitoring
- Both options would introduce another new system which would need linking into current processes on JIRA and Slack to ensure that tasks were monitored. This was probably overkill for the amount of support calls expected (none in the private beta period)
- A telephone number may be added at some stage; in its absence the acknowledgment will need to tell the user about the support level they can expect.
Monitoring while on support
We use Sentry to alert us to any anomalies in our 3 environments. Sentry alert manager is linked to Slack and provides alerts to following channels:
- #uksa-sst-beta-alerts-dev
- #uksa-sst-beta-alerts-staging
- #uksa-sst-beta-alerts-prod
We monitor these channels during support hours to respond to any issues that come to our attention.
Goals and non-Goals
The goal of this design doc is to articulate a design for a predictable and efficient support process.
Goals
- Users need a way of raising queries and knowing that it has been raised successfully
- Support teams need a way of monitoring issues raised so that they can assist as necessary
- Support teams need a way of tracking progress of each ticket, resolving what is possible and necessary on weekends and otherwise communicating effectively with operators
- There should always be an on-call member of the support team during the specified hours, who is prepared and able to deal with the variety of issues that may arise
Non-goals
- A full case management system for Orbital Analysts’ conjunction analysis may be overkill for the scale of system