Monitoring Operations

Use the monitoring pages to watch host capacity, deployment resource use, and game health while servers are running.

Swarm Host Actions

Swarm host owners and admins can manage monitoring from the swarm host detail page:

enable the monitoring stack
disable the monitoring stack
restart the monitoring stack
repair the monitoring stack
run monitoring diagnostics
change the local metric retention window

Each action is queued for the swarm host agent. If monitoring is unhealthy, the agent reports that state, but game deployment actions are managed independently of monitoring.

Retention And Storage

The default retention window is 30d. Source installs store monitoring data under /var/lib/swarmhost-monitor unless the agent environment overrides it.

Shorter retention reduces disk use. Longer retention gives more history but needs more local storage on the swarm host. Retention changes affect local metric history only; they do not change deployment backups.
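
When comparing retention windows such as 30d and 100y, it can help to convert them to a common unit. The following is a minimal sketch, not the agent's own parser: the function name retention_to_days is hypothetical, and it assumes the suffixes h, d, w, and y with a year counted as 365 days.

```python
import re

# Approximate day counts per suffix. These are assumptions for illustration;
# the agent and the metrics store may interpret suffixes differently.
_DAYS_PER_UNIT = {"h": 1 / 24, "d": 1, "w": 7, "y": 365}


def retention_to_days(value: str) -> float:
    """Convert a retention string such as '30d' or '100y' to days."""
    match = re.fullmatch(r"(\d+)([hdwy])", value.strip())
    if match is None:
        raise ValueError(f"unsupported retention value: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _DAYS_PER_UNIT[unit]


if __name__ == "__main__":
    for window in ("30d", "100y"):
        print(window, "is about", retention_to_days(window), "days")
```

Disk use grows roughly with the number of days retained, so a conversion like this gives a quick sense of how much more local storage a longer window may need.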

Diagnostics

Use the Run monitoring diagnostics action on the swarm host page when charts stop updating or the stack reports unhealthy collectors.

For source-checkout installs, the same diagnostic can be run locally:

.venv/bin/python agents/swarmhost-agent/agent/agent.py diagnostics monitoring

Diagnostics include collector health, configured retention, data directory size, target count, disk guard state, and the last known successful scrape time when the local metrics store can answer that query.
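
Data directory size, one of the items listed above, can also be checked by hand. The sketch below is an independent illustration rather than the diagnostic's actual implementation; directory_size_bytes is a hypothetical helper, and the path is the default source-install location described on this page.

```python
import os
from pathlib import Path


def directory_size_bytes(root: str) -> int:
    """Sum the sizes of regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                if not path.is_symlink():
                    total += path.stat().st_size
            except OSError:
                # A file can disappear between listing and stat; skip it.
                continue
    return total


if __name__ == "__main__":
    # Default source-install path; adjust it if the agent environment
    # overrides the data directory.
    data_dir = "/var/lib/swarmhost-monitor"
    print(f"{data_dir}: {directory_size_bytes(data_dir) / 1e6:.1f} MB")
```

A missing directory yields a total of zero here, so a zero result may mean either an empty store or an overridden path.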

Game Health Metrics

Game health charts are shown only when a deployment has an explicit game metrics target, port, or sidecar configuration. If a game does not expose game metrics, the deployment monitor still shows container resource charts when host-local monitoring is available.
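
The visibility rule above can be sketched as a small decision function. Everything named here is hypothetical: the Deployment fields, the chart labels, and the choice to treat an explicit game metrics configuration as sufficient for showing game health charts. The real deployment model may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Deployment:
    # Hypothetical fields standing in for the explicit game metrics
    # target, port, or sidecar configuration.
    game_metrics_target: Optional[str] = None
    game_metrics_port: Optional[int] = None
    game_metrics_sidecar: Optional[dict] = None


def visible_charts(deployment: Deployment, host_monitoring_available: bool) -> List[str]:
    """Return the chart groups a deployment monitor would show."""
    charts: List[str] = []
    # Container resource charts depend only on host-local monitoring.
    if host_monitoring_available:
        charts.append("container_resources")
    # Game health charts need at least one explicit game metrics setting.
    explicit_settings = (
        deployment.game_metrics_target,
        deployment.game_metrics_port,
        deployment.game_metrics_sidecar,
    )
    if any(setting is not None for setting in explicit_settings):
        charts.append("game_health")
    return charts
```

For example, a deployment with no game metrics settings on a host with working monitoring yields only the container resource group.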

Troubleshooting

If the monitoring page has no chart data:

confirm the swarm host is online
confirm monitoring is enabled on that swarm host
run monitoring diagnostics
check that the browser can establish a monitor session
wait for the first scrape interval after enabling or repairing the stack

If only game health charts are missing, check whether that game or deployment actually provides a game metrics target.
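
The checklist above is ordered, so it can be worked through until the first step that fails. The sketch below only illustrates that ordering: first_failing_check is a hypothetical helper, and the lambdas are placeholders for real lookups that this page does not specify.

```python
from typing import Callable, List, Optional, Tuple

# A check is a label plus a zero-argument callable returning True on success.
Check = Tuple[str, Callable[[], bool]]


def first_failing_check(checks: List[Check]) -> Optional[str]:
    """Run checks in order and return the label of the first failure."""
    for label, check in checks:
        if not check():
            return label
    return None


if __name__ == "__main__":
    # Placeholder results; replace each lambda with a real lookup.
    checks: List[Check] = [
        ("swarm host is online", lambda: True),
        ("monitoring is enabled", lambda: True),
        ("diagnostics report healthy collectors", lambda: False),
        ("browser can establish a monitor session", lambda: True),
        ("first scrape interval has elapsed", lambda: True),
    ]
    print("first failing step:", first_failing_check(checks))
```

Stopping at the first failure keeps later results from being misread, since a host that is offline will fail every later step as well.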

Product Metrics Stack

The central metrics stack runs on VPS_1 and scrapes the control-plane /metrics endpoint. It is separate from the host-local monitoring that runs on each swarm host.

Deploy or repair it with:

scripts/release/deploy_product_metrics.sh

The stack includes:

VictoriaMetrics for product time-series storage
vmagent scraping https://swarmhosts.com/metrics with PRODUCT_METRICS_TOKEN from default/swarmhosts-web-secret
a separate 30-day VictoriaMetrics store for VPS host metrics
a VPS vmagent scraping node-exporter for VPS_1 and VPS_2
node-exporter for VPS_1 as a Kubernetes DaemonSet
vmalert rules for scrape health, login failures, deployment failures, current deployment errors, offline swarmhosts, and VPS CPU, memory, and disk pressure
Alertmanager for alert state
Grafana at https://metrics.swarmhosts.com, including the Swarm Hosts VPS Host Overview dashboard

Product metrics retention is currently set to 100y, which is effectively indefinite. It stays at that value until the service has enough history to choose a real threshold.

VPS host metrics retention is configured separately as 30d.

VPS_2 is outside the k3s cluster, so it must run node-exporter on TCP 9100 and allow that port only from VPS_1 (15.204.10.9) or an equivalent private path. On an Ubuntu/Debian VPS with UFW active, install or repair it with:

scripts/ops/install_vps_node_exporter.sh
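
After installing, one way to confirm the firewall rule is to test whether TCP 9100 is reachable from different machines. The sketch below is a generic connectivity probe and is not part of the install script; tcp_port_open is a hypothetical helper, and the address to probe is left to the operator.

```python
import socket


def tcp_port_open(host: str, port: int = 9100, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False
```

Calling tcp_port_open with the VPS_2 address should return True when run on VPS_1 and False when run from a host that is not allowed through the firewall. A False result from VPS_1 suggests that node-exporter is not running or that the rule is missing.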

Grafana credentials are stored in the Kubernetes secret default/swarmhosts-product-grafana-secret. Retrieve them through Kubernetes access when needed; do not copy them into docs, issues, or chat.
