The final stage in the IBM DeepQA pipeline involves ranking all candidate answers according to their evidence scores and judging the likelihood that each candidate answer is correct. In DeepQA, this is done using a machine learning framework that is phase-based, providing capabilities for manipulating the data and applying machine learning in successive applications. We show how this design can be used to implement solutions to particular challenges that arise in applying machine learning for evidence-based hypothesis evaluation. Our approach facilitates an agile development environment for DeepQA; evidence scoring strategies can be easily introduced, revised, and reconfigured without the need for error-prone manual effort to determine how to combine the various evidence scores. We describe the framework, explain the challenges, and evaluate the gain over a baseline machine learning approach.