Monitoring Operations
Use the monitoring pages to watch host capacity, deployment resource use, and game health while servers are running.
Swarm Host Actions
Swarm host owners and admins can manage monitoring from the swarm host detail page:
- enable the monitoring stack
- disable the monitoring stack
- restart the monitoring stack
- repair the monitoring stack
- run monitoring diagnostics
- change the local metric retention window
These actions are queued for the swarm host agent. If monitoring is unhealthy, the agent reports that state, but game deployment actions continue to be managed independently.
Retention And Storage
The default retention window is 30d. Source installs store monitoring data
under /var/lib/swarmhost-monitor unless the agent environment overrides it.
Shorter retention reduces disk use. Longer retention gives more history but needs more local storage on the swarm host. Retention changes affect local metric history only; they do not change deployment backups.
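As a rough sizing aid, the trade-off between retention and disk use can be sketched in Python. The helper names, the accepted suffixes (d, w, y), and the daily growth figure are all illustrative assumptions and are not part of the agent; measure the real growth rate on your own swarm host before relying on the numbers.

```python
import re

# Assumed suffix table: days, weeks, years. The units the agent accepts
# are not documented on this page, so treat this mapping as illustrative.
_DAYS_PER_UNIT = {"d": 1, "w": 7, "y": 365}


def retention_days(window: str) -> int:
    """Convert a retention string such as '30d' into a number of days."""
    match = re.fullmatch(r"(\d+)([dwy])", window.strip())
    if match is None:
        raise ValueError(f"unsupported retention window: {window!r}")
    count, unit = match.groups()
    return int(count) * _DAYS_PER_UNIT[unit]


def estimated_bytes(window: str, bytes_per_day: int) -> int:
    """Estimate steady-state disk use as days retained times daily growth."""
    return retention_days(window) * bytes_per_day


if __name__ == "__main__":
    # 50 MiB/day is a made-up growth rate used only for illustration.
    daily = 50 * 1024 * 1024
    for window in ("7d", "30d", "90d"):
        gib = estimated_bytes(window, daily) / 1024**3
        print(f"{window}: about {gib:.1f} GiB")
```

The estimate is linear on purpose: it ignores compression and compaction, so it is best read as an upper-bound planning figure rather than a prediction.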
Diagnostics
Use the Run monitoring diagnostics action on the swarm host page when charts stop updating or the stack reports unhealthy collectors.
For source-checkout installs, the same diagnostic can be run locally:
.venv/bin/python agents/swarmhost-agent/agent/agent.py diagnostics monitoring
Diagnostics include collector health, configured retention, data directory size, target count, disk guard state, and the last known successful scrape time when the local metrics store can answer that query.
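To cross-check the reported data directory size by hand, a minimal Python sketch such as the following can help. It is not the agent's own implementation, and the MONITOR_DATA_DIR environment variable is a hypothetical name used only here; the default path is the source-install location given earlier on this page.

```python
import os


def directory_size_bytes(root: str) -> int:
    """Sum the sizes of regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            try:
                total += os.path.getsize(path)
            except OSError:
                # Files can disappear while the metrics store rotates data.
                continue
    return total


if __name__ == "__main__":
    # MONITOR_DATA_DIR is a hypothetical override used only by this sketch.
    data_dir = os.environ.get("MONITOR_DATA_DIR", "/var/lib/swarmhost-monitor")
    print(f"{data_dir}: {directory_size_bytes(data_dir)} bytes")
```

A missing or unreadable directory yields 0 rather than an error, so a result of 0 on a host with monitoring enabled usually means the path or permissions are wrong.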
Game Health Metrics
Game health charts are shown only when a deployment has an explicit game metrics target, port, or sidecar configuration. If a game does not expose game metrics, the deployment monitor still shows container resource charts when host-local monitoring is available.
Troubleshooting
If the monitoring page has no chart data:
- confirm the swarm host is online
- confirm monitoring is enabled on that swarm host
- run monitoring diagnostics
- check that the browser can establish a monitor session
- wait for the first scrape interval after enabling or repairing the stack
If only game health charts are missing, check whether that game or deployment actually provides a game metrics target.
Product Metrics Stack
The central metrics stack runs on VPS_1 and scrapes the control-plane /metrics
endpoint. It is separate from swarmhost-local monitoring.
Deploy or repair it with:
scripts/release/deploy_product_metrics.sh
The stack includes:
- VictoriaMetrics for product time-series storage
- vmagent scraping https://swarmhosts.com/metrics with PRODUCT_METRICS_TOKEN from default/swarmhosts-web-secret
- a separate 30-day VictoriaMetrics store for VPS host metrics
- a VPS vmagent scraping node-exporter for VPS_1 and VPS_2
- node-exporter for VPS_1 as a Kubernetes DaemonSet
- vmalert rules for scrape health, login failures, deployment failures, current deployment errors, offline swarmhosts, and VPS CPU, memory, and disk pressure
- Alertmanager for alert state
- Grafana at https://metrics.swarmhosts.com, including the Swarm Hosts VPS Host Overview dashboard
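When checking a scrape target by hand, it can help to parse the response yourself. vmagent scrapes Prometheus-compatible targets, so the sketch below assumes the control-plane /metrics endpoint serves the Prometheus text exposition format. It parses text only and makes no request, because how the token is presented to the endpoint is not described here. The metric names in the sample are hypothetical.

```python
def parse_samples(text: str) -> dict[str, float]:
    """Map each 'name{labels}' series to its sample value."""
    samples: dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Label values may contain spaces, so split after the last brace.
        brace = line.rfind("}")
        if brace != -1:
            series, rest = line[: brace + 1], line[brace + 1 :]
        else:
            parts = line.split(None, 1)
            if len(parts) < 2:
                continue
            series, rest = parts
        fields = rest.split()
        if not fields:
            continue
        # fields[0] is the value; an optional timestamp may follow it.
        samples[series] = float(fields[0])
    return samples


if __name__ == "__main__":
    # Hypothetical payload; real metric names will differ.
    sample = (
        "# TYPE example_logins_failed_total counter\n"
        'example_logins_failed_total{reason="bad password"} 3\n'
        "example_up 1\n"
    )
    for series, value in parse_samples(sample).items():
        print(series, value)
```

This is a deliberately small parser for spot checks. It does not handle escaped quotes inside label values or exemplars, so use a maintained client library for anything beyond manual inspection.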
Product metrics retention is currently configured as 100y, which is treated as
effectively indefinite retention until the service has enough history to choose a
real threshold.
VPS host metrics retention is configured separately as 30d.
VPS_2 is outside the k3s cluster, so it must run node-exporter on TCP 9100 and
allow that port only from VPS_1 (15.204.10.9) or an equivalent private path.
On an Ubuntu/Debian VPS with UFW active, install or repair it with:
scripts/ops/install_vps_node_exporter.sh
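After installing, the firewall rule can be verified with a plain TCP connection test. The following is a minimal sketch using only the Python standard library; the function name is illustrative and the VPS_2 address is left for you to supply.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Calling port_is_open with the VPS_2 address and port 9100 should return True when run from VPS_1 and False when run from any other host. A True result from elsewhere suggests the port is exposed more widely than intended.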
Grafana credentials are stored in the Kubernetes secret
default/swarmhosts-product-grafana-secret. Retrieve them through Kubernetes
access when needed; do not copy them into docs, issues, or chat.
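One way to read the secret without pasting values anywhere is sketched below. It assumes kubectl is installed and already configured for the cluster; the function names are illustrative, and the key names inside the secret are not documented here, so the sketch decodes whatever keys are present.

```python
import base64
import json
import subprocess


def decode_secret_data(secret: dict) -> dict[str, str]:
    """Base64-decode every entry in a Kubernetes Secret's data map."""
    return {
        key: base64.b64decode(value).decode("utf-8")
        for key, value in secret.get("data", {}).items()
    }


def fetch_secret(name: str, namespace: str) -> dict:
    """Read a Secret as JSON through kubectl (requires cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "secret", name, "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


# Example (needs kubectl access to the cluster):
#   secret = fetch_secret("swarmhosts-product-grafana-secret", "default")
#   print(sorted(decode_secret_data(secret)))  # key names only, no values
```

Printing only the key names keeps the credential values out of terminal scrollback and logs; read an individual value only at the moment it is needed.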