Much of the joy of this hobby, at least for me, resides in finding a new service I would like to try out and deploying it to my homelab.
I would even go as far as to say that setting things up is most of the fun, and one might not even use the software all that much afterwards.
However, it's easy to forget that once you have started using a service, you rely on it to some extent. Let's say, for instance, that you're self-hosting your own calendar service. I do self-host my calendar, and I use it daily as a to-do list of sorts. If I were to lose my calendar data, I might miss a doctor's appointment or another similarly important event.
The stakes can get high pretty fast when you move critical services from cloud providers to your own self-hosted instances.
But issues can also arise from faulty hardware, or when expanding your setup with new machines. Would you remember the list of commands you had to run to set up a specific service if you needed to do it again today on a brand new machine?
Those reasons alone are enough to prepare a contingency plan for when unexpected events occur. And they will eventually occur. In order to sleep well at night, four topics must be addressed: backups, documentation, monitoring, and alerting.
There are several ways of tackling those topics, and I am not going to cover all of them. Instead, I will explain how I solved each one in my case, and hopefully give you a sense of why I chose those approaches and why they could work for your setup too.
I have to admit that for a long time, I used to access each of my services once a month and manually export data from each of them to an external hard drive. And that was okay enough for me at the time. I wasn't running many services and I felt the monthly cadence was sufficient.
Eventually, I lost consistency and the frequency of my backups degraded to a less than ideal state. I only had one node in my Proxmox cluster up until that point; if my only server had died then, a lot would have been lost.
Since I am using Proxmox as my hypervisor, Proxmox Backup Server (PBS) was the ideal addition to address backups.
I decided then to add another, smaller node to my cluster, a mini PC with an N100 SoC, to improve resilience. That node runs a VM with PBS, which backs up all my VMs and containers to an external SSD every evening at 21:00.
Setting up PBS is pretty straightforward, and the full installation guide can be found on the official website. If you prefer a more visual guide, I can again recommend Learn Linux TV's Proxmox course, and in particular their video about setting up Proxmox Backup Server for your cluster.
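For reference, once PBS is running, attaching it to a Proxmox VE node is a single command. Here is a minimal sketch, assuming a hypothetical PBS host at 192.168.1.50 with a datastore called `external-ssd` (all names are illustrative; the real values, including the TLS fingerprint, come from your own PBS dashboard):

```bash
# Register the PBS instance as a storage backend on the Proxmox VE node
pvesm add pbs pbs-backups \
  --server 192.168.1.50 \
  --datastore external-ssd \
  --username backup@pbs \
  --password 'your-pbs-password' \
  --fingerprint 'aa:bb:cc:dd:...'

# One-off backup of all VMs and containers to that storage; the recurring
# 21:00 job itself is configured in the GUI under Datacenter -> Backup
vzdump --all --storage pbs-backups --mode snapshot
```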
If one of my containers or virtual machines gets corrupted now, I can safely restore one of my backups from storage in a matter of seconds and get it back on its feet in no time!
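For completeness, the command-line equivalent of such a restore looks roughly like this (the VM/container IDs and timestamps are made up, and the web UI achieves the same in a few clicks):

```bash
# List the snapshots available on the backup storage
pvesm list pbs-backups

# Restore a VM backup onto VM ID 100, overwriting the corrupted instance
qmrestore pbs-backups:backup/vm/100/2024-05-01T19:00:00Z 100 --force

# Containers are restored with pct instead
pct restore 101 pbs-backups:backup/ct/101/2024-05-01T19:00:00Z
```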
Over time, the number of things I run in my homelab has grown to a point where I can't keep everything in my head. Which URL and port map to that one service I installed a year ago and rarely use? Remembering what software runs where, or even the full list of things I am running at this point, is not easy.
Here is where I chose to, yet again, self-host a service to solve this very problem. To address the documentation of my infrastructure, I am currently using Bookstack.
The organization of content in Bookstack resembles a library. You have shelves, books, and within books you can have chapters and pages. Within those constraints, you can do as you wish and organize your content however suits you best.
I chose to create a table with an overview of my hardware, and another one with an overview of the software running on that hardware.
Each of the rows then links to a dedicated page for that piece of software, where I specify the IP of the machine or container where it runs, the URL and port the service is exposed on, and other useful information like the setup instructions I used or links to guides.
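For illustration, a couple of rows of that software overview could look like this (all values are invented):

| Service | Host | URL / Port | Notes |
| --- | --- | --- | --- |
| Uptime Kuma | mini-pc-01 (192.168.1.60) | http://192.168.1.60:3001 | Dockerized; links to its own page |
| Bookstack | ct-105 (192.168.1.65) | http://192.168.1.65:6875 | Setup instructions on its page |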
Another important thing I document is the ID of the entry in my vault where I store the credentials for the machine and/or the service accounts. This way, I delegate the enhanced security requirements to the vault, and I can focus on documenting the setup in Bookstack without having to worry too much about its content being highly sensitive.
The software I use for my vault is Bitwarden.
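A nice side effect of documenting those IDs is that they work directly with the Bitwarden CLI. As a sketch, fetching a credential from a documented entry ID could look like this (the ID below is a placeholder):

```bash
# Unlock the vault and export the session key for subsequent commands
export BW_SESSION="$(bw unlock --raw)"

# Fetch the password of the entry whose ID is documented in Bookstack
bw get password 9f6b2c1e-0000-0000-0000-000000000000
```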
Now it would seem like we are well set. We have our backups in place in case something were to go wrong, and we have resources to migrate services to new machines, or have an overview of the software we run.
Unfortunately, that is not the case yet. I am not using all of my services every single day, which means some of them could crash and it could take me days to realise. Worse, the moment I really needed one of them, the service would be down and it would take me time to get back on track.
We want to be proactive, rather than reactive, in those scenarios.
Uptime Kuma is a service we can self-host that allows us to monitor different services and machines.
It has a very simple UI but is very effective at what it does. I created a container on my cluster and run Uptime Kuma dockerized inside it, and I am not exaggerating when I say that I set the whole thing up in less than 10 minutes altogether. You can look at the docs, but even the official documentation points at Techno Tim's guide for setting it up with docker-compose.
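If you want to skip the guide entirely, the plain `docker run` equivalent from the project's README is a one-liner (the volume and container names are the defaults suggested there):

```bash
# Run Uptime Kuma with persistent data and the web UI exposed on port 3001
docker run -d \
  --name uptime-kuma \
  --restart=always \
  -p 3001:3001 \
  -v uptime-kuma:/app/data \
  louislam/uptime-kuma:1
```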
Ultimately, once the container is running, setting up the monitoring entries requires no guide at all; the UI is simple enough to just follow along.
It provides different kinds of monitors, but so far I personally only use HTTP(s), HTTP(s) with keyword, Ping, and TCP Port. DNS monitoring is the next type I plan to add.
In essence, each monitor periodically runs a simple check against its target and records whether it passed.
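Roughly, the hand-run equivalents of the four types I use would be (hostnames, IPs, and the keyword are made up):

```bash
# HTTP(s): the endpoint answers with a success status code
curl -fsS -o /dev/null https://service.home.local

# HTTP(s) with keyword: the response body also contains a given string
curl -fsS https://service.home.local | grep -q "healthy"

# Ping: the host answers ICMP echo requests
ping -c 1 192.168.1.60

# TCP Port: something is listening and accepting connections on that port
nc -z 192.168.1.60 3001
```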
So now comes the final piece of the puzzle: alerting. Having monitoring is neat, but a bit pointless if it is not connected to alerts getting triggered. After all, you are not going to keep your monitoring web UI open at all times to check that everything is running smoothly.
Since I am using Uptime Kuma, I had access to a huge list of possible notification channels. I went for simplicity once again: as I already have Discord installed on my phone, I chose that as my channel.
Discord allows you to create a private server for free, and inside that server, I created a dedicated alerting channel that I would use for my homelab alerts. Once you have the private server and channel, all that is left is to create a webhook that Uptime Kuma will use to report alerts.
The setup for the webhook was a matter of a few clicks, and it is covered both in the alerting section of Techno Tim's Uptime Kuma guide and in the official Discord docs about webhooks.
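If you want to verify the webhook before wiring it into Uptime Kuma, a single curl against Discord's webhook endpoint does the trick (the ID and token below are placeholders for your own webhook URL):

```bash
# Post a test message to the Discord channel behind the webhook
curl -H "Content-Type: application/json" \
     -d '{"content": "Test alert from the homelab"}' \
     "https://discord.com/api/webhooks/<webhook-id>/<webhook-token>"
```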
Once I had it set up, I briefly changed the monitoring configuration to force a failure and verify that the webhook alerting wiring worked. As soon as I saw everything behaving as expected, I reverted those changes, and I was done!
Now I know that if anything goes wrong, I will find out fast and be able to recover quickly from the incident. Considering I only keep adding services to the homelab, this safety net will only become more and more valuable as time goes on.
I hope this gives you a starting point to plan your own solution for a resilient homelab!