Day-2 operations
Once Care is running on the cluster (see Deploying Care onto the cluster), day-2 operations cover keeping it observable and recoverable: monitoring, log aggregation, and backups for every stateful component.
This deployment is a work in progress. Some capabilities are running today; others are design goals that are partially built or not yet automated. This page marks which is which: treat anything labeled "evolving" or "planned" as untested, and verify against the cluster before relying on it.
Monitoring and logs
Observability is provided by a Prometheus + Grafana + Loki stack, deployed through the monitoring module:
python deployer.py apply-tofu monitoring
The pieces fit together like this:
| Component | Role |
|---|---|
| Prometheus | Scrapes and stores metrics from cluster components. |
| Grafana | Dashboards and visualization over metrics and logs. |
| Loki | Log aggregation and storage, queried from Grafana. |
| Alloy | Collects logs from pods/nodes and ships them into Loki. |
After the stack is stable, finish the setup in Grafana:
- Log in to Grafana using the Kubernetes-autogenerated admin password (read it from the Grafana admin secret in the monitoring namespace — do not hardcode or commit it).
- Add Loki as a data source so logs become queryable alongside metrics.
- Import the dashboards you need.
Initial Grafana setup (data source and dashboards) is a manual step today. The admin password is generated by Kubernetes, so always read it from the cluster secret rather than setting a known value.
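The manual Grafana steps above could be scripted along these lines. This is a minimal sketch, not part of the deployer: it decodes one field from a Kubernetes Secret (Secret values are stored base64-encoded) and builds the JSON body for Grafana's `POST /api/datasources` endpoint. The key name `admin-password` and the Loki service URL are assumptions; check what the monitoring module actually creates in your cluster.

```python
import base64
import json


def decode_secret_field(secret_json: str, key: str) -> str:
    """Return one decoded field from `kubectl get secret ... -o json` output."""
    secret = json.loads(secret_json)
    # Kubernetes stores Secret values base64-encoded under .data
    return base64.b64decode(secret["data"][key]).decode("utf-8")


def loki_datasource_payload(loki_url: str, name: str = "Loki") -> dict:
    """Build the request body for Grafana's POST /api/datasources call."""
    return {"name": name, "type": "loki", "url": loki_url, "access": "proxy"}


if __name__ == "__main__":
    # Stand-in for the real secret. In practice, pipe in the output of:
    #   kubectl get secret <grafana-secret> -n monitoring -o json
    sample = json.dumps(
        {"data": {"admin-password": base64.b64encode(b"example-only").decode()}}
    )
    password = decode_secret_field(sample, "admin-password")
    print(len(password))  # print the length, never the password itself
    # Assumed in-cluster Loki URL; adjust to your release and namespace.
    print(json.dumps(loki_datasource_payload("http://loki.monitoring.svc:3100")))
```

Keeping the decode step separate from any HTTP call means the password only ever lives in memory and never lands in a committed file.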
Alloy emits a noisy fsnotifier-related warning in this environment. This is a known cosmetic issue and does not affect log shipping.
Backups
Each stateful component is backed up on a daily/configurable schedule to S3-compatible external storage. The mechanism differs by component.
PostgreSQL (Care, Odoo, Metabase databases)
PostgreSQL is backed up using the CloudNativePG operator together with the barman-cloud plugin, which performs continuous WAL archiving and base backups to S3-compatible storage. Install the plugin manifest from the plugin-barman-cloud release:
kubectl apply -f \
https://github.com/cloudnative-pg/plugin-barman-cloud/releases/download/v0.10.0/manifest.yaml
The plugin needs credentials and a target bucket. Provide an S3 credentials secret and point the archive at your bucket — keep all values in cluster secrets, never in committed YAML:
kubectl create secret generic s3-secret \
--from-literal=access_key=<s3-access-key> \
--from-literal=secret_key=<s3-secret-key>
The same Postgres + barman-cloud backup pattern applies to every Postgres instance in the deployment: Care's database, Odoo's database, and Metabase's database. Configure each with its own bucket/prefix.
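To illustrate the per-database pattern, the sketch below generates one `ObjectStore` and one `ScheduledBackup` manifest per Postgres instance, each archiving to its own prefix. It is a hedged example based on the barman-cloud plugin's documented resources, not this deployment's actual configuration: the cluster names (`care-db`, and so on), store names, bucket, and endpoint are hypothetical placeholders. The secret key names match the `s3-secret` created above.

```python
import json

PLUGIN_NAME = "barman-cloud.cloudnative-pg.io"


def object_store(name: str, bucket: str, prefix: str, endpoint: str,
                 secret: str = "s3-secret") -> dict:
    """ObjectStore manifest pointing one database's archive at its own prefix."""
    return {
        "apiVersion": "barmancloud.cnpg.io/v1",
        "kind": "ObjectStore",
        "metadata": {"name": name},
        "spec": {
            "configuration": {
                "destinationPath": f"s3://{bucket}/{prefix}",
                "endpointURL": endpoint,
                "s3Credentials": {
                    # Key names as used in the s3-secret command above.
                    "accessKeyId": {"name": secret, "key": "access_key"},
                    "secretAccessKey": {"name": secret, "key": "secret_key"},
                },
            }
        },
    }


def scheduled_backup(cluster: str, schedule: str = "0 0 0 * * *") -> dict:
    """ScheduledBackup via the plugin; schedule is six-field cron (seconds first)."""
    return {
        "apiVersion": "postgresql.cnpg.io/v1",
        "kind": "ScheduledBackup",
        "metadata": {"name": f"{cluster}-daily"},
        "spec": {
            "schedule": schedule,
            "cluster": {"name": cluster},
            "method": "plugin",
            "pluginConfiguration": {"name": PLUGIN_NAME},
        },
    }


if __name__ == "__main__":
    # Hypothetical names, bucket, and endpoint: replace with your own.
    items = []
    for db in ("care", "odoo", "metabase"):
        items.append(object_store(f"{db}-store", "<bucket>", db, "<s3-endpoint>"))
        items.append(scheduled_backup(f"{db}-db"))
    # kubectl accepts JSON; a v1 List bundles all manifests in one document.
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2))
```

Per the plugin's documentation, each CloudNativePG `Cluster` also needs a `spec.plugins` entry that names the plugin and passes the store as the `barmanObjectName` parameter; verify the exact fields against the plugin version you installed.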
Volumes (Longhorn snapshots and backups)
Persistent volumes are managed by Longhorn, which takes volume snapshots and ships volume backups to an S3 target. The backup target is configured with an S3 credentials secret in the longhorn-system namespace. Substitute your own provider's endpoint and keys:
kubectl create secret generic <backup-secret-name> -n longhorn-system \
--from-literal=AWS_ACCESS_KEY_ID=<s3-access-key> \
--from-literal=AWS_SECRET_ACCESS_KEY=<s3-secret-key> \
--from-literal=AWS_ENDPOINTS=<s3-endpoint> \
--from-literal=AWS_REGION=<s3-region>
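The secret only carries credentials; Longhorn also needs the backup target itself, which for S3 is a URL of the form `s3://<bucket>@<region>/<path>`. The helper below is a small sketch for building that string consistently. The bucket and path names in the example are hypothetical, and how the value gets applied (Longhorn UI, settings, or Helm values) depends on your Longhorn version.

```python
def longhorn_s3_target(bucket: str, region: str, path: str = "") -> str:
    """Format a Longhorn S3 backup target: s3://<bucket>@<region>/<path>."""
    if not bucket or not region:
        raise ValueError("bucket and region are both required")
    # Strip stray slashes so the path is joined exactly once.
    return f"s3://{bucket}@{region}/{path.strip('/')}"


if __name__ == "__main__":
    # Hypothetical bucket and path; use the values for your own provider.
    print(longhorn_s3_target("care-volumes", "us-east-1", "longhorn"))
```

The credentials secret created above is then referenced by name as the backup target's credential secret.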
What gets backed up
Daily/configurable backups are intended for each stateful component:
| Component | What is protected | Mechanism |
|---|---|---|
| PostgreSQL (Care) | Application database | CloudNativePG + barman-cloud → S3 |
| PostgreSQL (Odoo) | Odoo database | CloudNativePG + barman-cloud → S3 |
| PostgreSQL (Metabase) | Metabase database | CloudNativePG + barman-cloud → S3 |
| OpenSearch / Elasticsearch | Search index data | Volume backup to external media |
| Object store (RustFS) | Uploads, audit logs, public data | Incremental sync to external store |
| Persistent volumes (Longhorn) | All attached volumes | Longhorn snapshots/backups → S3 |
Redis/Valkey runs in non-persistent mode by design — it holds no durable state and is not backed up.
Restore
A core design goal of this deployment is that a fresh cluster can be restored from backups — rebuild the Kubernetes cluster, then replay Postgres WAL archives, Longhorn volume backups, and the object-store sync to bring Care back up.
Restore is a design goal, not a finished, tested procedure. End-to-end restore is not yet fully automated or documented in this deployment, and several supporting pieces (such as extending and repopulating Longhorn volumes) are still being worked out. Do not treat restore as production-proven until it has been validated against your own backups. Practice and verify a full restore on a throwaway cluster before depending on it.
Maturity: available today vs evolving
Use this split to set expectations. The "evolving / planned" column is drawn from the deployment's roadmap.
| Available today | Evolving / planned |
|---|---|
| Prometheus + Grafana + Loki monitoring | Move Postgres to a cluster operator–managed Helm chart |
| Alloy log shipping into Loki | Generate secrets/passwords via Terraform and auto-configure |
| Postgres backups via CloudNativePG + barman-cloud → S3 | Version-lock all Helm charts |
| Longhorn volume snapshots/backups → S3 | TLS for database and other inter-pod communication |
| Per-component daily/configurable backups | Documented, automated restore-from-backup procedure |
| | Caching layer for the nginx-based frontend |
Manual Grafana setup, manual secret retrieval, and unautomated restore are the rough edges to plan around. As the roadmap items land (operator-managed Postgres, Terraform-generated secrets, version-locked charts), these workflows will tighten up.