Evaluation dashboard
Every answer is evaluated against measurable criteria. The golden set runs through grounded QA and is scored on citation coverage, faithfulness, and keyword recall. It includes a negative case that must refuse with NOT_FOUND.
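As a rough illustration of how one golden-set case might be scored, here is a minimal sketch. The metric definitions are assumptions made for this example, not the dashboard's actual implementation: citation coverage is taken as the share of answer sentences that carry a [n] marker, keyword recall as the share of expected keywords found in the answer, and faithfulness as a crude word-overlap proxy against the retrieved sources. The names GoldenCase and score_case are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List

NOT_FOUND = "NOT_FOUND"  # refusal token named in the dashboard text


@dataclass
class GoldenCase:
    """Hypothetical shape of one golden-set entry."""
    question: str
    keywords: List[str] = field(default_factory=list)
    expect_refusal: bool = False  # True for the negative case


def split_sentences(text: str) -> List[str]:
    # Naive split on sentence-ending punctuation followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def citation_coverage(answer: str) -> float:
    # Assumed definition: fraction of sentences containing a [n] marker.
    sentences = split_sentences(answer)
    if not sentences:
        return 0.0
    cited = sum(1 for s in sentences if re.search(r"\[\d+\]", s))
    return cited / len(sentences)


def keyword_recall(answer: str, keywords: List[str]) -> float:
    # Assumed definition: fraction of expected keywords present in the answer.
    if not keywords:
        return 1.0
    lowered = answer.lower()
    return sum(1 for k in keywords if k.lower() in lowered) / len(keywords)


def faithfulness(answer: str, sources: List[str]) -> float:
    # Crude proxy (an assumption): fraction of distinct answer words
    # that also occur somewhere in the sources.
    stripped = re.sub(r"\[\d+\]", "", answer.lower())
    answer_words = set(re.findall(r"[a-z0-9]+", stripped))
    source_words = set(re.findall(r"[a-z0-9]+", " ".join(sources).lower()))
    if not answer_words:
        return 0.0
    return len(answer_words & source_words) / len(answer_words)


def score_case(case: GoldenCase, answer: str, sources: List[str]) -> Dict[str, object]:
    if case.expect_refusal:
        # The negative case passes only if the answer is exactly the refusal token.
        return {"refused": answer.strip() == NOT_FOUND}
    return {
        "citation_coverage": citation_coverage(answer),
        "faithfulness": faithfulness(answer, sources),
        "keyword_recall": keyword_recall(answer, case.keywords),
    }
```

A real evaluator would likely replace the word-overlap proxy with an entailment check or an LLM judge. The sketch is only meant to show that each metric reduces to a ratio per case, and that the negative case is a pass/fail check rather than a score.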