Jeremy Schneider: KubeCon 2025: Bookmarks on Memory and Postgres

Just got home from KubeCon.

One of my big goals for the trip was to make some progress in a few areas of postgres and kubernetes – primarily around allowing more flexible use of the linux page cache and avoiding OOM kills with less hardware overprovisioning. When I look at Postgres on Kubernetes, I think the current deployment models – which generally use the Guaranteed QoS class – leave idle resources (both memory and CPU) on the table.
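For context, the common pattern looks something like this minimal sketch (names, image tag, and sizes are all hypothetical): a Guaranteed QoS pod sets requests equal to limits for every container, so the scheduler reserves the full amount even while the database sits mostly idle.

```yaml
# Hypothetical Postgres pod with Guaranteed QoS: requests == limits for
# every resource of every container, so the node reserves the full
# 4 CPU / 8Gi even when the database is mostly idle.
apiVersion: v1
kind: Pod
metadata:
  name: pg-guaranteed        # hypothetical name
spec:
  containers:
    - name: postgres
      image: postgres:17     # illustrative image tag
      resources:
        requests:
          cpu: "4"
          memory: 8Gi
        limits:
          cpu: "4"
          memory: 8Gi
```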

Ultimately this is about cost savings. I think we can still run more databases on less hardware without compromising the availability and reliability of our database services.

The trip was a success: I came home with lots of reading material and homework!

Putting a few bookmarks here, mostly for myself to come back to later:

I still have a lot of catching up to do. I sketched out the diagram below, but please take it with a large grain of salt – both this corner of kubernetes and linux memory management are complex:

I tried to summarize some thoughts in a comment on the long-running github issue, but this might be wrong – it’s just what I’ve managed to piece together so far.


My “user story” is that (1) I’d like a higher limit and more memory over-commit for page cache specifically – letting linux use available/unused memory as needed for page cache – and (2) I’d like a lower request, so that scheduling tracks actual anonymous memory needs. I’m running Postgres. In the current state, I have to simultaneously set an artificially low limit on per-pod page cache (to avoid eviction) and an artificially high request on per-pod anonymous memory (to avoid OOM kills, since the request drives oom_score_adj). I’d like individual pods to be able to burst anonymous memory usage (e.g. an unexpected SQL query that hogs memory) by stealing from the page cache of other pods that have grown beyond their request – avoiding the OOM kill. The linux kernel can do this; I think it should be possible with the right cgroup settings?
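To make that concrete, here’s a minimal sketch of the shape I’m after (all names and numbers hypothetical, and I’m not claiming kubernetes exposes the underlying cgroup knobs this way today): a Burstable pod whose request covers the anonymous working set while the limit leaves headroom for page cache, with the relevant cgroup v2 semantics noted in comments.

```yaml
# Hypothetical Burstable Postgres pod: the request is sized for anonymous
# memory (shared buffers, work_mem, backend processes), while the limit is
# sized to let the page cache grow into otherwise-idle node memory.
apiVersion: v1
kind: Pod
metadata:
  name: pg-burstable         # hypothetical name
spec:
  containers:
    - name: postgres
      image: postgres:17
      resources:
        requests:
          memory: 4Gi        # roughly the actual anonymous memory needs
        limits:
          memory: 16Gi       # headroom that is mostly reclaimable page cache

# The cgroup v2 knobs I'd want underneath (not directly settable from the
# pod spec today, as far as I can tell):
#   memory.min  - hard-protect the anonymous working set from reclaim
#   memory.low  - best-effort protection up to the request
#   memory.high - throttle/reclaim page cache before anything is OOM-killed
#   memory.max  - the hard limit that actually triggers the OOM killer
```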

It seems like the new Memory QoS feature might be assigning a static, calculated value to memory.high – but for page cache usage, I wonder whether we actually want kubernetes to dynamically adjust memory.high (eventually as low as the request) to reclaim node-level resources – before evicting end-user pods – once the memory.available eviction signal crosses its threshold?
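For reference, here’s my understanding of that static calculation, per KEP-2570 – treat the default factor and the rounding as assumptions on my part, and check the KEP for the authoritative formula:

```yaml
# Sketch of the Memory QoS (KEP-2570) memory.high calculation, using a
# hypothetical Burstable container with requests.memory=2Gi and
# limits.memory=8Gi, assuming the default memoryThrottlingFactor of 0.9:
#
#   memory.high = requests + memoryThrottlingFactor * (limits - requests)
#               = 2Gi + 0.9 * (8Gi - 2Gi)
#               = 7.4Gi (approximately; the kubelet rounds to page size)
#
# The point: this is computed once from the pod spec, not adjusted at
# runtime in response to node-level memory pressure.
```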

Anyway, it’s also worth pointing out that these postgres problems are likely accentuated by higher concentrations of postgres per node; if databases are spread across large multi-tenant clusters, that likely mitigates things a bit.
