Perhaps:
1. Your end goal is a candidate who consistently delivers "over benchmark".
2. The benchmark is relatively concrete and repetitive, so instead of testing against a "qualitative proxy for the actual benchmark", you should test against the "actual benchmark".
3. The only way to get data on the error "against the actual benchmark" is to spot check or to check continuously. If you are not on-site, you have to spot check. In statistics this is called "sampling": the "sample of opportunities for failure" is a proxy for the "set of all possible opportunities for failure".
4. Next you must consider the division of labour. Is the benchmark defined qualitatively or quantitatively?
(In hospitality there are hundreds of points to quantify per service per customer. When you have a corporate-level quality assurance team, quantifying them is the way to go. With fewer staff, you may have to define success qualitatively, "by eyeballing/tasting". In that case, testing can only be carried out by someone able to make that qualitative assessment. Whose body is doing the testing?)
5. If you require a leader to be multi-faceted, it becomes harder to identify leaders. If you design an organisation where leadership revolves around cooperation between smaller, focused roles, it becomes easier to identify them. Testing is just the cost of identification. In the latter sort of design, the only prerequisites for leadership are readiness, ability, and willingness to cooperate.
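
To make the sampling idea in point 3 concrete, here is a minimal sketch in Python. It assumes each "opportunity for failure" can be recorded as a simple pass/fail, and it uses the Wilson score interval (my choice, not something the points above prescribe) to show how far the true failure rate could plausibly sit from what a spot check observed. The names `spot_check` and `wilson_interval` and the event data are hypothetical.

```python
import math
import random


def spot_check(opportunities, sample_size, seed=None):
    """Pick `sample_size` opportunities uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(opportunities, sample_size)


def wilson_interval(failures, n, z=1.96):
    """Approximate 95% Wilson score interval for the true failure rate."""
    if n == 0:
        raise ValueError("need at least one observation")
    p = failures / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical data: 200 service events; True means the benchmark was missed.
events = [i % 10 == 0 for i in range(200)]

checked = spot_check(events, 40, seed=1)
failures = sum(checked)
low, high = wilson_interval(failures, len(checked))
print(f"{failures}/{len(checked)} missed; plausible failure rate {low:.1%} to {high:.1%}")
```

The width of the interval is the practical point: a small spot check leaves a wide range of plausible failure rates, so the fewer visits you can make, the less you can conclude about whether the candidate is really "over benchmark".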