10:30 am on Dec 1, 2025 | tags: buggy
recently, i realized i have a special talent: whenever i rely on someone else’s something, the universe conspires to remind me why i usually build* things myself. so, yes, i’ve started writing my own LLM Gateway.
*here build = start and never finish (mean people say)
why? because i wanted to work on a personal project: an AI companion powered mostly by gemini nano banana (still the cutest model name ever), while also playing with some image-to-video stuff to generate animations between keyframes. nothing complicated, just the usual «relaxing weekend» kind of project that ends up consuming two months and part of your soul.
how it started
somewhere around february this year i added a tiny PoC gateway in one of our kubernetes clusters at work. just to see what’s possible, what breaks, what costs look like. i picked berriai’s litellm because:
or so i thought…
the PoC got traction fast, people started using it, and now i’m actually running two production LiteLLM instances. so this wasn’t just a toy experiment. it grew into a fairly important internal service.
and then the problems started.
the «incident»
prisma’s python client (yes, the Python one) thought it was a brilliant idea to install the latest stable Node.js at runtime.
i was happily watching anime on my flight to Tallinn for one of our team’s meetings when node 25 dropped. karpenter shuffled some pods. prisma wasn’t ready. our deployment exploded in the most beautiful, kubernetes-log-filling way, sending chills down my colleagues’ spines. sure, they patched it quickly, and yes, i eventually found a more permanent solution.
but while digging around, i realized the prisma python client (used under the hood by litellm) isn’t exactly actively maintained anymore, which made my personal «production red flag detector» start screaming. LiteLLM’s creators ignoring the issue definitely didn’t help.
latency, my beloved
red flag number two: overhead. we’re running LiteLLM on k8s with hpa, rds postgres, valkey, replication, HA. the whole cloud-enterprise-lego-set. and despite all that, the gateway added seconds of latency on top of upstream calls. with p95 occasionally touching 20 seconds.
i tweaked malloc. i tweaked omp. i tweaked environment variables i’m pretty sure i shouldn’t have touched without adult supervision. nothing changed.
cost tracking? it’s… there. existing in a philosophical sense. about as reliable as calorie counts on protein bars.
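to be concrete about what i mean by «reliable»: cost tracking should boil down to taking the token usage the provider reports and multiplying it by a pricing table you own, and refusing to guess when a model isn’t in it. a minimal sketch of that idea (not ThinkPixelLLMGW’s actual code; the model name and prices below are placeholders):

```go
// cost.go: a hedged sketch of per-request cost computation.
// model name and prices are placeholders; real prices change,
// so in practice this table would live in the database.
package main

import "fmt"

// Pricing holds per-million-token prices in USD for one model.
type Pricing struct {
	InputPerMTok  float64
	OutputPerMTok float64
}

var prices = map[string]Pricing{
	"example-model": {InputPerMTok: 0.15, OutputPerMTok: 0.60}, // placeholder numbers
}

// Cost returns the USD cost of a single request, or false for unknown models.
func Cost(model string, inputTokens, outputTokens int) (float64, bool) {
	p, ok := prices[model]
	if !ok {
		return 0, false // unknown model: surface it instead of silently billing zero
	}
	return float64(inputTokens)/1e6*p.InputPerMTok +
		float64(outputTokens)/1e6*p.OutputPerMTok, true
}

func main() {
	c, _ := Cost("example-model", 1200, 350)
	fmt.Printf("$%.6f\n", c)
}
```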
i tried maximhq’s bifrost: it only proxies requests in its open-source version. same for traceloop’s hub. so, nothing that ticked all the boxes.
and, as usual, the moment annoyance crosses a certain threshold (in this case, one involving generating anime waifus), i start hacking.
the bigger picture: ThinkPixel
for about a year, i’ve been trying to ship ThinkPixel: a semantic search engine you can embed seamlessly into WooCommerce shops. it uses custom embedding models, qdrant as the vector store and BM42 hybrid search. and a good dose of stubbornness on my part.
it works, but not «public release» level yet. i’ll get there eventually.
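to make the «BM42 hybrid search» bit a little less hand-wavy: you run a dense semantic ranking and a sparse BM42-style ranking over the same catalog, then fuse the two result lists. the snippet below is only an illustration of that fusion idea, not ThinkPixel’s actual code (qdrant can do this server-side); it uses plain reciprocal rank fusion:

```go
// a rough illustration of the «hybrid» part: merge a dense (semantic)
// ranking with a sparse (BM42-style) ranking via reciprocal rank fusion.
package main

import (
	"fmt"
	"sort"
)

// rrf merges ranked lists of document ids; k dampens the influence
// of lower-ranked hits (60 is the common default).
func rrf(k float64, rankings ...[]string) []string {
	scores := map[string]float64{}
	for _, ranking := range rankings {
		for rank, id := range ranking {
			scores[id] += 1.0 / (k + float64(rank+1))
		}
	}
	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool { return scores[ids[i]] > scores[ids[j]] })
	return ids
}

func main() {
	dense := []string{"sku-12", "sku-7", "sku-3"}   // from embedding similarity
	sparse := []string{"sku-7", "sku-12", "sku-99"} // from BM42-style sparse match
	fmt.Println(rrf(60, dense, sparse))             // items ranked well by both rise to the top
}
```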
in my mind, ThinkPixel is the larger project: search, retrieval, intelligence that plugs into boring real-world small business ecommerce setups. for that, somewhere in the future i’ll need a reliable LLM layer. so ThinkPixelLLMGW naturally became a core component of that future. (until then, i just need it to animate anime elves, but that’s the side-story)
so:
introducing: ThinkPixelLLMGW
https://github.com/bdobrica/ThinkPixelLLMGW (a piece of the bigger ThinkPixel puzzle)
what i wanted here was something:
so i wrote it in Go (not a rust hater, just allergic to hype), backed it with postgres + redis/valkey, and started adding the features i actually need:
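to give a feel for the overall shape i’m going for, here is a boiled-down sketch. all names are illustrative, not the actual types in the repo: the gateway talks to one small provider interface, and a router picks the upstream per model, with routing rules living in postgres and cached in redis/valkey.

```go
// illustrative shape of the gateway core; not the repo’s real API.
package gateway

import (
	"context"
	"fmt"
)

type Message struct {
	Role    string
	Content string
}

type ChatRequest struct {
	Model    string
	Messages []Message
	Stream   bool
}

// ChatChunk is one streamed delta; token counts arrive on the final chunk.
type ChatChunk struct {
	Delta        string
	InputTokens  int
	OutputTokens int
}

// Provider is what every upstream (openai, gemini, ...) has to implement.
// streaming is just a channel of chunks, so the HTTP layer can flush
// them to the client as they arrive.
type Provider interface {
	Chat(ctx context.Context, req ChatRequest) (<-chan ChatChunk, error)
}

// Router maps model names to providers; in the real thing the mapping
// would come from postgres and be cached in redis/valkey.
type Router struct {
	providers map[string]Provider
}

func (r *Router) Chat(ctx context.Context, req ChatRequest) (<-chan ChatChunk, error) {
	p, ok := r.providers[req.Model]
	if !ok {
		return nil, fmt.Errorf("no provider configured for model %q", req.Model)
	}
	return p.Chat(ctx, req)
}
```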
current status
the project is actually in a pretty good place. according to myself, the MVP is complete: admin features are implemented, the openai provider works with streaming, the async billing and usage queue system is done, and the whole thing is surprisingly solid. i even wrote tests. dozens of them. i know, i’m shocked too (kudos to copilot for help).
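«async billing and usage queue» sounds fancier than it is: the request path just drops a usage event onto an in-memory queue and returns, and a background worker batches events and persists them. a hedged sketch of that pattern (illustrative only, not the repo’s actual code):

```go
// async usage queue sketch: the handler never waits on the database.
package main

import (
	"fmt"
	"time"
)

type UsageEvent struct {
	APIKey       string
	Model        string
	InputTokens  int
	OutputTokens int
	At           time.Time
}

func main() {
	events := make(chan UsageEvent, 1024) // request path only pushes here

	// background worker: flush every 2s or every 100 events, whichever comes first.
	go func() {
		batch := make([]UsageEvent, 0, 100)
		ticker := time.NewTicker(2 * time.Second)
		flush := func() {
			if len(batch) == 0 {
				return
			}
			// in the real thing this would be a single batched INSERT into postgres
			fmt.Printf("flushing %d usage rows\n", len(batch))
			batch = batch[:0]
		}
		for {
			select {
			case ev := <-events:
				batch = append(batch, ev)
				if len(batch) >= 100 {
					flush()
				}
			case <-ticker.C:
				flush()
			}
		}
	}()

	// what the HTTP handler would do once a completion finishes:
	events <- UsageEvent{APIKey: "demo", Model: "example-model", InputTokens: 1200, OutputTokens: 350, At: time.Now()}
	time.Sleep(3 * time.Second) // give the worker time to flush before exiting
}
```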
the full TODO / progress list is here. kept updated with AI. so bear with me. it’s long. like, romanian-bureaucracy long.
why am i posting this?
because i enjoy building things that solve my own frustrations. because gateways are boring… until they break. because vendor-neutral LLM infrastructure will matter more and more, especially with pricing randomness, model churn, and the growing zoo of providers.
and because maybe someone else has been annoyed by the same problems and wants something open-source, fast, predictable, and designed by someone who doesn’t think «production-ready» means «works in docker, on my mac».
ThinkPixelLLMGW is just one component in a larger thing i’ve been slowly carving out. if/when the original ThinkPixel semantic search finally ships, this gateway will already be there, quietly doing the unglamorous work of routing, tracking and keeping costs under control.
until then, i’ll keep adding features, and i’ll keep the repo public. feel free to star it, fork it, bash it, open issues, or just lurk.
sometimes the best things you build are the ones you started out of mild irritation.
disclaimer
as with all open-source projects, it works flawlessly on my cluster. your machine, cloud, cluster, or philosophical worldview may vary.
