Run Scanner was built to be monitored by Prometheus. Add Run Scanner as a target of your Prometheus server.
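As a rough illustration, a scrape job for Run Scanner might look like the sketch below. The job name `runscanner` and the `environment` label are taken from the alert expressions in this document. The host, port, metrics path, and label value are placeholders, not documented defaults, and should be adjusted to match your deployment.

```yaml
scrape_configs:
  - job_name: runscanner            # matches job="runscanner" used by the AutoInhibit alert
    metrics_path: /metrics          # assumption: change this if your deployment serves metrics elsewhere
    static_configs:
      - targets:
          - runscanner.example.org:8080   # hypothetical host and port
        labels:
          environment: production   # hypothetical value; the alert expressions group by this label
```

Attaching `environment` as a target label keeps the `sum by(instance, environment, job)` groupings in the suggested alerts meaningful.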
Most variables exported by Run Scanner are prefixed with `miso_runscanner_`.
The default JVM metrics are exported with the prefix `jvm_`.
As a default, we suggest the following alerts:

    groups:
    - name: runscanner.rules
      rules:
      - alert: StuckRuns
        expr: sum_over_time(miso_runscanner_new_runs_scanned[1h]) - sum by(instance, environment, job) (rate(miso_runscanner_waiting_runs[1h])) < 3 and time() - process_start_time_seconds > 8 * 3600
        annotations:
          description: The runs being processed by {{$labels.instance}} are not being processed and cleared.
          summary: Runscanner {{$labels.instance}} seems to be stuck
      - alert: BadRuns
        expr: miso_runscanner_directories_attempted - miso_runscanner_directories_accepted > 0
        annotations:
          description: The run scanner {{$labels.instance}} has found candidate sequencer output directories that it does not have permission to read.
          summary: Unreadable run directories on {{$labels.instance}}
      - alert: ScanningStopped
        expr: time() - process_start_time_seconds > 30 * 60 and time() - miso_runscanner_last_scan_start_time_seconds > 30 * 60
        annotations:
          description: Run Scanner has not started a new scan in over 30 minutes
          summary: Run Scanner seems to have stopped scanning
      - alert: AutoInhibit
        expr: time() - process_start_time_seconds{job="runscanner"} < 15 * 60 or sum(miso_runscanner_waiting_runs) by (environment, instance, job) - miso_runscanner_bad_runs > 5
        labels:
          scope: runscanner
        annotations:
          description: Run Scanner was restarted recently and probably needs time to finish scanning old runs
          summary: Runscanner cache is cold

`StuckRuns` will fire when Run Scanner no longer seems to be making progress
extracting data from run directories. This usually occurs if I/O latency has
increased or in the case of CPU starvation.
`BadRuns` will fire when there are unreadable runs due to permission errors.
This requires human intervention to correct the directory permissions. Run
Scanner will reattempt scanning the affected runs.
The `AutoInhibit` alert will fire after Run Scanner restarts until the cache is
warm. Since it can take a long time for Run Scanner to start up and scan all
available data, it may be useful to stop applications from attempting access
until the run cache is warm. To do this, have the application check for a
firing `AutoInhibit` alert and wait until later. This alert should not be sent
to humans. No action is required when it fires.
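One way to keep `AutoInhibit` away from humans, assuming alerts are delivered through Alertmanager 0.22 or newer (which supports the `matchers` syntax), is to route it to a receiver with no notification integrations. This is only a sketch: the receiver names `operators` and `blackhole` are hypothetical and not part of Run Scanner.

```yaml
route:
  receiver: operators              # hypothetical default receiver for human-facing alerts
  routes:
    - matchers:
        - alertname = "AutoInhibit"
      receiver: blackhole          # matched alerts stop here and notify nobody
receivers:
  - name: operators
    # notification integrations (email, chat, paging) would be configured here
  - name: blackhole
    # intentionally empty: no integrations, so nothing is sent
```

The alert is still listed as firing in Prometheus and Alertmanager, so applications can continue to check for it even though no notification goes out.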
If using MISO to collect runs from Run Scanner, we suggest the following alert:

    groups:
    - name: miso.rules
      rules:
      - alert: DroppedRuns
        expr: miso_runscanner_client_bad_runs > 0
        annotations:
          description: The runs being received by MISO {{$labels.instance}} are not being saved to the database.
          summary: Runs are failing to save on {{$labels.instance}}

`DroppedRuns` will fire when runs are failing to save in MISO. This can happen
for several reasons:
- the run is for an instrument that is not registered in MISO. Add the instrument to MISO or remove that path from Run Scanner's configuration.
- there is a version mismatch between Run Scanner and MISO. Upgrade the lagging software package. Check the release notes when upgrading MISO, as Run Scanner version changes should be noted there.
- there is a conflict with the MISO configuration: this can include mismatched or missing sequencing parameters or container models. Check the `miso_debug.log` on the MISO instance to determine the mismatch and correct it in MISO.
- the run data is irreconcilably mismatched. For example, MISO and Run Scanner disagree on the sequencing platform for a run. If MISO is incorrect, delete the run and container in MISO and allow Run Scanner to recreate it.
Once a run is marked as dropped, it will not be sent to MISO again unless:
- MISO is restarted
- Run Scanner is restarted
- the run is updated (and it was not previously completed/failed)
After taking corrective action, restart either MISO or Run Scanner to clear this alert.
If using Grafana, we have included a dashboard that shows basic metrics for Run Scanner.