Evaluation Name (optional)
Select Models
Choose one or more models to evaluate
Or add a custom model ID:
Select Tracks
Choose which evaluation tracks to run
Episodes per Track
Number of evaluation episodes to run for each track
Recommended: 50+ episodes for statistically significant results
Summary
Models: 0 selected
Tracks: A, B, C
Episodes per track: 50
Total episodes: 0
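
The total in the summary appears to be the product of the number of selected models, the number of selected tracks, and the episodes per track. That would explain why it reads 0 while no model is selected. The form does not state this formula, so the sketch below is an assumption, and the function name `total_episodes` is hypothetical:

```python
def total_episodes(num_models: int, tracks: list[str], episodes_per_track: int) -> int:
    """Assumed formula: every selected model runs every selected track."""
    if num_models < 0 or episodes_per_track < 0:
        raise ValueError("counts must be non-negative")
    return num_models * len(tracks) * episodes_per_track


# Values from the summary above: no models selected, tracks A, B, C, 50 episodes each.
print(total_episodes(0, ["A", "B", "C"], 50))  # 0

# Hypothetical selection of two models with the same tracks and episode count.
print(total_episodes(2, ["A", "B", "C"], 50))  # 300
```

Under this assumption, each additional model adds 150 episodes at the settings shown (3 tracks at 50 episodes each).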