Question about LightEval 🤗:
I've been searching for an LLM evaluation suite that can, out of the box, compare the outputs of a model without any enhancements against the same model with better prompt engineering, with RAG, and with fine-tuning.
Unfortunately, I have not found a tool that fits this description exactly, though my search did lead me to LightEval.
A major pain point in building large-scale LLM projects is that, before building an MVP, it is difficult to evaluate whether better prompt engineering, RAG, fine-tuning, or some combination of the three is needed to get satisfactory output for the project's use case.
Time and resources are then wasted on R&D to work out exactly which LLM enhancements are needed.
I believe an out-of-the-box solution for comparing models with and without these enhancements could help teams of any size decide which ones they need before building.
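To make the idea concrete, here is a rough sketch of the kind of comparison I have in mind. It is not LightEval's API or any existing tool: `compare_variants`, `exact_match`, and the stand-in "models" are all hypothetical names, and the toy functions only imitate a baseline, a prompt-engineered, and a RAG-backed setup so the example runs without an actual LLM.

```python
from typing import Callable, Dict, List

# Hypothetical convention: a "variant" is any callable mapping a prompt to an output string.
Variant = Callable[[str], str]


def exact_match(prediction: str, reference: str) -> float:
    """Toy metric: 1.0 if the normalized strings are equal, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def compare_variants(
    variants: Dict[str, Variant],
    dataset: List[Dict[str, str]],
    metric: Callable[[str, str], float] = exact_match,
) -> Dict[str, float]:
    """Run every variant on the same examples and return the mean score per variant."""
    if not dataset:
        raise ValueError("dataset must contain at least one example")
    scores = {}
    for name, generate in variants.items():
        total = sum(metric(generate(ex["prompt"]), ex["reference"]) for ex in dataset)
        scores[name] = total / len(dataset)
    return scores


# Stand-ins for real model calls (assumptions for illustration only).
def baseline(prompt: str) -> str:
    return "I don't know"


def prompt_engineered(prompt: str) -> str:
    # Pretend a better prompt unlocks one extra correct answer.
    return "Paris" if "France" in prompt else "I don't know"


def with_rag(prompt: str) -> str:
    # Pretend retrieval: look the answer up in a tiny in-memory knowledge base.
    knowledge = {"France": "Paris", "Japan": "Tokyo"}
    for key, answer in knowledge.items():
        if key in prompt:
            return answer
    return "I don't know"


dataset = [
    {"prompt": "What is the capital of France?", "reference": "Paris"},
    {"prompt": "What is the capital of Japan?", "reference": "Tokyo"},
]

results = compare_variants(
    {"baseline": baseline, "prompt_engineered": prompt_engineered, "rag": with_rag},
    dataset,
)
print(results)
```

In a real setup, each variant would wrap an actual model call (plain, with an improved prompt, with retrieval, or a fine-tuned checkpoint), and the metric would be whatever fits the use case.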
I wanted to know whether the LightEval team, or Hugging Face in general, is thinking about such a tool.