Do you do anything to deal with the model performance degradation caused by having too many MCP tools? Would be cool to see a smart MCP router added here
We did! Basically, we have a meta layer between the agent and the MCP servers you give to the client.
We call this the server manager: instead of exposing all the tools from all the servers to the agent at once, we only expose four meta tools:
- list_servers()
- connect_to_server(server_name)
- search_tool(query)
- disconnect_from_server(server_name)
This way the agent can dynamically connect to specific servers without flooding its context with every tool.
The search tool performs semantic search over all the tools from all the servers and returns the top N tools along with the servers they belong to, so the agent can connect to the right one.
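To make the idea concrete, here is a small self-contained sketch of what that meta layer could look like. It is an illustration, not the actual mcp-use API: names like ServerManager and ToolInfo are made up, and the keyword-overlap scoring is only a stand-in for the embedding-based semantic search described above.

```python
# Hypothetical sketch of the "server manager" layer: the agent sees only the
# four meta tools instead of every tool from every MCP server.
from dataclasses import dataclass, field


@dataclass
class ToolInfo:
    name: str
    description: str
    server: str


@dataclass
class ServerManager:
    # server name -> tools that server exposes
    servers: dict[str, list[ToolInfo]]
    connected: set[str] = field(default_factory=set)

    def list_servers(self) -> list[str]:
        return sorted(self.servers)

    def connect_to_server(self, server_name: str) -> list[ToolInfo]:
        # Only now do this server's tools enter the agent's context.
        self.connected.add(server_name)
        return self.servers[server_name]

    def disconnect_from_server(self, server_name: str) -> None:
        self.connected.discard(server_name)

    def search_tool(self, query: str, top_n: int = 5) -> list[ToolInfo]:
        # Stand-in for semantic search: rank tools by keyword overlap with the
        # query. A real implementation would embed the descriptions and rank
        # by cosine similarity instead.
        q = set(query.lower().split())

        def score(tool: ToolInfo) -> int:
            words = set(f"{tool.name} {tool.description}".lower().split())
            return len(q & words)

        all_tools = [t for tools in self.servers.values() for t in tools]
        return sorted(all_tools, key=score, reverse=True)[:top_n]


manager = ServerManager(servers={
    "github": [ToolInfo("create_issue", "Create a GitHub issue", "github")],
    "slack": [ToolInfo("send_message", "Send a Slack message", "slack")],
})
hits = manager.search_tool("open an issue on the repo")
print(hits[0].server)  # -> "github", so the agent connects to that server next
```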
We use AI code review and it is genuinely helpful, but I agree it mostly just makes it easier to review my own PRs by pointing out salient issues I would otherwise not think about.
It's obviously not a replacement for another human looking at your code, and I would not rely on it in safety-critical environments, but it really helps, especially in small teams where time is precious and you ship fast.
My only issue is that I would love a dedicated UI where I get this review BEFORE another human looks at the code, so their feedback isn't drowned out by the AI noise.
I think the part I find most interesting about this is the potential power implications. Ternary models may perform better in terms of RAM, and that's great, but if you manage to build a multiplication-free accelerator in silicon, you can start thinking about running things like vision models in under 0.1 W of power.
This could have insane implications for edge capabilities: robots with massively better swarm dynamics, smart glasses with super-low-latency speech-to-text, etc.
I think the biggest technical hurdle would be implementing the non-linear layers efficiently, but that is also solvable: since you already retrain your models, you could use custom activation functions that better approximate a HW-efficient non-linear layer.
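As a rough sketch of that retraining trick (the table size, range, and names below are my own assumptions for illustration): a lookup-table activation is cheap in a multiplication-free datapath, and fine-tuning the model with it lets the weights absorb the approximation error.

```python
# Illustrative sketch: replace the exact GELU with a small lookup-table
# version that hardware can implement as a ROM read, then fine-tune with it.
import numpy as np


def gelu(x: np.ndarray) -> np.ndarray:
    # Reference GELU (tanh approximation); expensive in hardware.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


# Precompute a 256-entry table over the range where GELU is non-trivial;
# far outside it, GELU(x) ~= 0 for negative x and ~= x for positive x.
LO, HI, ENTRIES = -6.0, 6.0, 256
GRID = np.linspace(LO, HI, ENTRIES)
TABLE = gelu(GRID)
STEP = (HI - LO) / (ENTRIES - 1)


def lut_gelu(x: np.ndarray) -> np.ndarray:
    # Nearest-entry lookup: quantize x to the grid and read the table.
    idx = np.clip(np.round((x - LO) / STEP).astype(int), 0, ENTRIES - 1)
    out = TABLE[idx]
    # Handle the saturated regions explicitly so the table can stay small.
    return np.where(x < LO, 0.0, np.where(x > HI, x, out))


x = np.linspace(-8, 8, 1001)
print(float(np.max(np.abs(gelu(x) - lut_gelu(x)))))  # worst-case approximation error
```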
The non-linear layers, particularly the softmax(QK^T) in attention, will be crucial to getting ultra-low latency and high throughput. We're considering some custom silicon just for that portion of every transformer block.
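For context on why that part is awkward: QK^T and the weighting of V are pure multiply-accumulates, while the row-wise exponentiate-and-normalize in between is not. Below is a small sketch (my own illustration, not the poster's design) of a base-2 softmax, a common hardware-oriented reformulation since 2^x splits into a shift plus a small fractional correction.

```python
import numpy as np


def softmax_base2(scores: np.ndarray) -> np.ndarray:
    # Row-wise and numerically stable: rescale by 1/ln(2) so exp2 matches exp,
    # subtract the row max, exponentiate with exp2, then normalize. The 1/ln(2)
    # factor can be folded into the existing 1/sqrt(d) scale, so the result is
    # mathematically identical to the standard softmax.
    s = scores / np.log(2.0)
    s = s - s.max(axis=-1, keepdims=True)
    p = np.exp2(s)
    return p / p.sum(axis=-1, keepdims=True)


def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)     # QK^T: maps onto a multiply-accumulate array
    return softmax_base2(scores) @ V  # the non-linear step discussed above


rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
print(attention(Q, K, V).shape)  # (4, 8)
```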