Good catch! We're looking to add function calling support very soon, and we have an open issue for it on our GitHub. If you want to raise a PR and add it, we'll help you land it and get it merged.
This is really cool! When we were trying to launch the GSPMD feature for PyTorch/XLA at Google, one of our biggest bottlenecks was network overhead, but we didn't really have any robust tools to dig into it and perform root cause analysis. I'm loving the tools I see coming out of Trainy.
We've actually been in contact with the Qdrant team about adding it to our roadmap! Andre (CEO) was asking for an integration. If you want to work on the PR, we'd be happy to work with you and get that merged in.
We offer auto-evals as one tool in the toolbox. We also consider structured output validations, semantic similarity to an expected result, and manual feedback gathering. If anything, I've seen that people are more skeptical of LLM auto-eval because of the inherent circularity, rather than over-trusting it.
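To make the semantic-similarity idea concrete, here's a toy sketch of scoring an output against an expected result. It uses bag-of-words cosine similarity so it runs standalone; a real pipeline would use an embedding model instead, and the 0.8 threshold is an arbitrary choice for illustration:

```python
from collections import Counter
from math import sqrt

def cosine_similarity(a: str, b: str) -> float:
    """Toy similarity score from bag-of-words cosine similarity.
    Real setups would compare sentence embeddings, not raw token counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def passes(output: str, expected: str, threshold: float = 0.8) -> bool:
    # Flag outputs that drift too far from the expected answer.
    return cosine_similarity(output, expected) >= threshold

print(passes("the capital of France is Paris", "Paris is the capital of France"))  # → True
```

The appeal of this family of checks is that there's no LLM in the loop, so it sidesteps the circularity concern, at the cost of being fuzzier than exact-match validation.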
Do you have any suggestions for other evaluation methods we should add? We just got started in July and we're eager to incorporate feedback and keep building.
Thanks for the clarification! Yes, I see now that auto-evals here is more AI-agent-ish than a one-shot approach. It still has the trust issue, though.
For suggestions, one thing I'm curious about is how we can offer out-of-the-box benchmark datasets and do so responsibly. ChainForge supports most OpenAI evals, but in adding this we realized the quality of OpenAI Evals is really _sketchy_... duplicate data, questionable metrics, etc. OpenAI has shown that trusting the community to make benchmarks is perhaps not a good idea; we should instead make it easier for scientists/engineers to upload their own benchmarks, and easier for others to run them. That's one thought, anyway.
For now, we just aggregate those across the models / prompts / templates you're evaluating so that you get an overall score. You can export to CSV, JSON, MongoDB, or Markdown files, and we're working on more persistence features so that you can get a history of which models / prompts / templates you gave the best scores to, and keep track of your manual evaluations over time.