Just got home from KubeCon.
One of my big goals for the trip was to make some progress in a few areas where Postgres and Kubernetes intersect – primarily around allowing more flexible use of the Linux page cache and avoiding OOM kills without so much hardware overprovisioning. When I look at Postgres on Kubernetes, I think the current deployment models – which generally use the Guaranteed QoS class – leave idle resources (both memory and CPU) on the table.
Ultimately this is about cost savings. I think we can run more databases on less hardware without compromising the availability or reliability of our database services.
The trip was a success, because I came home with lots of reading material and homework!
Putting a few bookmarks here, mostly for myself to come back to later:
- the key place for discussion is SIG Node
- documentation on node-pressure eviction https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/
  - eviction signal thresholds can be customized
- it looks like priority classes give a lot of control over the order in which pods are evicted
  - documentation on priority classes https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/
- cgroups v2 memory controller documentation https://docs.kernel.org/admin-guide/cgroup-v2.html#memory
- long-running GitHub issue about pod evictions caused by Kubernetes (incorrectly?) counting active page cache as working-set memory that won’t be reclaimed https://github.com/kubernetes/kubernetes/issues/43916
- new feature Memory QoS – still alpha (the MemoryQoS feature gate is off by default)
  - KEP https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2570-memory-qos/
  - currently stalled – related message from the Linux Kernel Mailing List https://lkml.org/lkml/2023/6/1/1300
  - “Future: memory.high can be used to implement kill policies in for userspace OOMs, together with Pressure Stall Information (PSI). When the workloads are in stuck after their memory usage levels reach memory.high, high PSI can be used by userspace OOM policy to kill such workload(s).” (there’s a rough sketch of this idea after the list)
  - Nov 2021 blog https://kubernetes.io/blog/2021/11/26/qos-memory-resources/
  - May 2023 blog https://kubernetes.io/blog/2023/05/05/qos-memory-resources/
  - Brief mention in docs https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/#memory-qos-with-cgroup-v2
- metrics added to cAdvisor for both active and inactive page cache https://github.com/google/cadvisor/pull/3445
- homework – take a closer look at anonymous memory and page cache metrics (both active and inactive) for real Postgres databases on Kubernetes (there’s a rough sketch of what I mean after this list)
- homework – set up tests that emulate the diagram below and demonstrate the eviction behavior that I think will happen
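
For the first homework item, here’s a minimal sketch of the kind of check I have in mind, assuming direct access to a pod’s cgroup v2 directory (the path below is just an illustration and depends on the cgroup driver and QoS class). It reads memory.current and memory.stat and computes the working set the same way cAdvisor/kubelet do – usage minus inactive file cache – which is exactly why active page cache still counts against a pod:

```python
# Sketch only: inspect anonymous memory vs. active/inactive page cache for one
# pod cgroup (cgroup v2) and compute the kubelet/cAdvisor-style working set.
# Usage: python3 memstats.py /sys/fs/cgroup/kubepods.slice/kubepods-pod<uid>.slice
import sys

def read_int(path):
    with open(path) as f:
        return int(f.read().strip())

def read_stat(path):
    # memory.stat is one "key value" pair per line
    with open(path) as f:
        return {key: int(value) for key, value in (line.split() for line in f)}

cgroup = sys.argv[1]
usage = read_int(f"{cgroup}/memory.current")
stat = read_stat(f"{cgroup}/memory.stat")

anon = stat["anon"]                    # anonymous memory (backends, work_mem, ...)
active_file = stat["active_file"]      # active page cache
inactive_file = stat["inactive_file"]  # inactive page cache
working_set = usage - inactive_file    # active page cache still counts here

print(f"usage={usage} anon={anon} active_file={active_file} "
      f"inactive_file={inactive_file} working_set={working_set}")
```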
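
The quoted “Future” note about memory.high plus PSI also caught my attention. This is only a sketch of what such a userspace check might look like – the cgroup path and the 10% stall threshold are assumptions I made up, not anything from the KEP:

```python
# Sketch of the "userspace OOM policy" idea: if a cgroup is pinned at
# memory.high and its memory PSI shows sustained full stalls, a userspace
# agent could decide to kill it rather than let it thrash.
# Usage: python3 psi_check.py /sys/fs/cgroup/kubepods.slice/kubepods-pod<uid>.slice
import sys

FULL_AVG10_THRESHOLD = 10.0  # assumed: percent of time fully stalled over the last 10s

def read_value(path):
    with open(path) as f:
        return f.read().strip()

def psi_full_avg10(path):
    # memory.pressure lines look like:
    #   some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    #   full avg10=0.00 avg60=0.00 avg300=0.00 total=0
    for line in read_value(path).splitlines():
        fields = line.split()
        if fields[0] == "full":
            return float(fields[1].split("=")[1])
    return 0.0

cgroup = sys.argv[1]
high_raw = read_value(f"{cgroup}/memory.high")   # "max" when no limit is set
usage = int(read_value(f"{cgroup}/memory.current"))
stalled = psi_full_avg10(f"{cgroup}/memory.pressure")

if high_raw != "max" and usage >= int(high_raw) and stalled > FULL_AVG10_THRESHOLD:
    print("stuck at memory.high with high PSI: candidate for a userspace kill policy")
else:
    print(f"ok: usage={usage} memory.high={high_raw} full_avg10={stalled}")
```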
I still have a lot of catching up to do. I sketched out the diagram below, but please take it with a large grain of salt – both this corner of Kubernetes and Linux memory management are complex:

I tried to summarize some thoughts in a comment on the long-running GitHub issue, but this might be wrong – it’s just what I’ve managed to piece together so far.
My “user story” is that (1) I’d like a higher limit and more memory over-commit for page cache specifically – letting Linux use available/unused memory as needed for page cache – and (2) I’d like a lower request so that scheduling tracks actual anonymous memory needs. I’m running Postgres. In the current state, I have to simultaneously set an artificially low limit on per-pod page cache (to avoid eviction) and an artificially high request for per-pod anonymous memory (to avoid OOM kills, since the request drives oom_score_adj). I’d like individual pods to be able to burst anonymous memory usage (e.g. an unexpected SQL query that hogs memory) by stealing from the page cache of other pods that are above their requests – avoiding OOM. The Linux kernel can do this; I think it should be possible with the right cgroup settings?
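
To make that last sentence a bit more concrete: this is the arrangement of cgroup v2 knobs I’m picturing, written directly against the cgroup filesystem. It is not something the kubelet does for you today, the sizes are made up, and I’m genuinely not sure memory.low is the right tool here – it’s just a sketch to test the idea against:

```python
# Sketch: protect roughly the pod's anonymous working set with memory.low and
# keep memory.max high, so reclaim under node pressure prefers the excess
# (mostly clean page cache) of cgroups running above their protection instead
# of OOM-killing a bursting pod. Needs privileges; sizes are assumptions.
# Usage: python3 set_protection.py /sys/fs/cgroup/kubepods.slice/kubepods-pod<uid>.slice
import sys

GIB = 1024 ** 3
EXPECTED_ANON = 4 * GIB   # roughly this pod's anonymous working set ("request"-ish)
HARD_LIMIT = 16 * GIB     # generous ceiling; page cache can soak up idle memory

def write(path, value):
    with open(path, "w") as f:
        f.write(str(value))

cgroup = sys.argv[1]
write(f"{cgroup}/memory.low", EXPECTED_ANON)  # best-effort protection from reclaim
write(f"{cgroup}/memory.max", HARD_LIMIT)     # hard limit stays well above the request
```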
It seems like the new Memory QoS feature might be assigning a statically calculated value to memory.high – but for page cache usage, I wonder whether we actually want Kubernetes to dynamically adjust memory.high downward, eventually as low as the request, in an attempt to reclaim node-level memory – before evicting end-user pods – once the memory.available eviction signal crosses its threshold?
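
As a thought experiment only – this is not how the kubelet behaves today, and the threshold, step size and data structures below are all invented for illustration – a reconcile step for that idea might look roughly like this:

```python
# Hypothetical reconcile step: when the node's memory.available signal is below
# the eviction threshold, lower memory.high for burstable pods toward their
# memory request, so the kernel reclaims page cache before pods get evicted.

EVICTION_THRESHOLD = 500 * 1024**2   # assumed "memory.available < 500Mi" style threshold
STEP = 256 * 1024**2                 # assumed amount to lower memory.high per pass

def reconcile_memory_high(node_memory_available, pods):
    """pods: mapping of pod name -> {'request': bytes, 'memory_high': bytes}."""
    new_settings = {}
    if node_memory_available >= EVICTION_THRESHOLD:
        return new_settings  # no node pressure; leave memory.high alone
    for name, pod in pods.items():
        # Step memory.high down, but never below the pod's memory request.
        lowered = max(pod["request"], pod["memory_high"] - STEP)
        if lowered != pod["memory_high"]:
            new_settings[name] = lowered
    return new_settings

# Example: two pods whose memory.high currently sits well above their requests.
pods = {
    "pg-a": {"request": 4 * 1024**3, "memory_high": 12 * 1024**3},
    "pg-b": {"request": 2 * 1024**3, "memory_high": 8 * 1024**3},
}
print(reconcile_memory_high(node_memory_available=300 * 1024**2, pods=pods))
```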
Anyway, it’s also worth pointing out that these Postgres problems are likely accentuated by higher concentrations of Postgres pods per node; spreading databases across large multi-tenant clusters likely mitigates things a bit.