I took some days off from my day job to perform maintenance on the stake pool that required a bit more time than the normal daily operations.
This is the TODO list:
- Look into sending slot info to pooltool.io
- Performance tuning
- Add the UPS to monitoring and alerting
- Build a riscv64 version of GHC 8.10.7
- Rebuild cardano-node 1.30.1 and dependencies using GHC 8.10.7
- Website update
- Upgrade cntools
Sending slots to pooltool.io turned out to be a bigger challenge. The problem is not that it’s hard to configure or set up.
The problem lies with the cardano-cli, which is used to retrieve the activeStakeMark values that cncli needs to calculate the leader log.
cardano-cli query stake-snapshot ... will put some load on the cardano-node, as it will temporarily need about 1GB of extra memory, and the cardano-cli itself needs about 5.6GB of memory to run to completion.
When I ran the command it resulted in the cardano-node process being OOM-killed by the kernel.
It might be wise to limit the amount of heap the cardano-cli can allocate by adding the -M RTS option; 5GB should be enough.
cardano-cli +RTS -M5G -RTS query stake-snapshot ...
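To check how close the process actually gets to the -M limit, GHC's RTS can print heap and GC statistics on exit via the -s flag. This is a sketch; the actual query arguments are elided as above:

```shell
# -M5G caps the heap at 5GB; -s writes GC and memory statistics to
# stderr on exit ("total memory in use" shows the peak reached).
cardano-cli +RTS -M5G -s -RTS query stake-snapshot ...
```

If the heap would exceed the -M limit, the RTS aborts the cardano-cli with a heap-overflow error instead of letting the kernel's OOM killer pick a victim, which may well be the cardano-node.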
You can tell the kernel's OOM killer to spare the cardano-node and try to kill another process first in case the machine runs out of memory.
echo -15 > /proc/$(pidof cardano-node)/oom_adj
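On newer kernels oom_adj is deprecated in favour of oom_score_adj, which takes values from -1000 (never kill) to 1000. A sketch of the equivalent adjustment, assuming the node runs as a single process:

```shell
# -15 on the old oom_adj scale (-17..15) maps to roughly -900 on the
# oom_score_adj scale (-1000..1000); writing a negative value
# requires root.
echo -900 > /proc/$(pidof cardano-node)/oom_score_adj
```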
If the cardano-node is not shut down cleanly, the complete immutable and volatile databases will be validated on the next start of the cardano-node, which can take considerable time.
Tuning GHC garbage collection
In the last couple of weeks I have been adding more metrics to Grafana, specifically to get a better understanding of garbage collection. The number of minor and major garbage collections can vary greatly depending on the GHC settings, and major GCs can lead to missed leader slots on the producer node. It turns out that saving ledger snapshots generates a considerable amount of garbage and can result in multiple major GCs. The number of peers the node is connected to also affects the number of major GCs.
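To illustrate the kind of knobs involved: cardano-node passes anything between +RTS and -RTS to the GHC runtime. The flags below are standard GHC RTS options, but the values are assumptions to be tuned per machine, not the settings I ended up with:

```shell
# -N2   : use two capabilities (OS threads running Haskell code)
# -A64m : larger nursery, so fewer minor GCs (the default is 1MB)
# -I0   : disable the idle-time major GC, a common source of pauses
cardano-node run ... +RTS -N2 -A64m -I0 -RTS
```

Increasing the nursery trades memory for fewer minor GCs; whether -I0 helps depends on how often the node is genuinely idle.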
UPS monitoring and alerting
The complete stake pool, including the build server that doubles as the testnet stake pool, is now on the UPS.
The UPS is now also partly observable via Prometheus and is available to Grafana and Alertmanager. Unfortunately LOADPCT is not reported by apcupsd for my UPS, so I can't graph the load over time.
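For reference, the fields a given model actually reports can be listed with apcupsd's apcaccess tool (a sketch; which fields appear varies per UPS model):

```shell
# Dump everything the UPS reports; on models without load metering
# the LOADPCT line simply never appears in the output.
apcaccess status | grep -E 'LOADPCT|NOMPOWER'
```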
On the display of the UPS it varies between 97W and 100W.
I plan on writing another blog post after the cardano-node has been rebuilt with the new GHC and tested.