Predicting phenology is essential for adapting varieties to different environmental conditions and for crop management. It is therefore important to evaluate how well different crop modeling groups can predict phenology. Multiple evaluation studies have been published, but their findings remain difficult to generalize, since they often test only a specific aspect of extrapolation to new conditions or do not test on data that are truly independent of the data used for calibration. In this study, we analyzed the prediction of wheat phenology in Northern France under observed weather and current management, a problem of practical importance for wheat management. The results of 27 modeling groups are evaluated, where a modeling group encompasses the model structure (i.e., the model equations), the calibration method, and the values of those parameters not affected by calibration. The data for calibration and evaluation are sampled from the same target population, so extrapolation is limited. The calibration and evaluation data have neither year nor site in common, which guarantees a rigorous evaluation of prediction for new weather and sites. The best modeling groups, as well as the mean and median of the simulations, have a mean absolute error (MAE) of about 3 days, which is comparable to the measurement error. Almost all models do better than using the average number of days or the average sum of degree days to predict phenology. On the other hand, there are important differences between modeling groups, arising both from differences in model structure and from differences between groups using the same model structure, which emphasizes that model structure alone does not completely determine prediction accuracy. In addition to providing information for our specific environments and varieties, these results contribute to a knowledge base of how well modeling groups can predict phenology when provided with calibration data from the target population.
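As an illustration of the evaluation logic described above, the following minimal Python sketch splits records so that calibration and evaluation data share neither site nor year, then scores a naive "average number of days" predictor by MAE. It is a sketch under stated assumptions, not the procedure used in the study: the record layout, the field name `days_to_heading`, the function names, and all numbers are hypothetical and were invented for the example.

```python
def mean_absolute_error(observed, predicted):
    """Mean absolute difference (in days) between paired observations and predictions."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def split_without_shared_site_or_year(records, eval_sites, eval_years):
    """Split records so that the two subsets have neither site nor year in common.

    Records that match the evaluation set on only one of site or year
    are discarded, since keeping them would break the independence.
    """
    calibration, evaluation = [], []
    for record in records:
        site_is_eval = record["site"] in eval_sites
        year_is_eval = record["year"] in eval_years
        if site_is_eval and year_is_eval:
            evaluation.append(record)
        elif not site_is_eval and not year_is_eval:
            calibration.append(record)
    return calibration, evaluation


# Hypothetical records: days from sowing to heading (made-up values).
records = [
    {"site": "A", "year": 2010, "days_to_heading": 200},
    {"site": "A", "year": 2011, "days_to_heading": 206},
    {"site": "B", "year": 2010, "days_to_heading": 197},
    {"site": "C", "year": 2012, "days_to_heading": 204},
    {"site": "C", "year": 2013, "days_to_heading": 209},
    {"site": "D", "year": 2012, "days_to_heading": 201},
]

calibration, evaluation = split_without_shared_site_or_year(
    records, eval_sites={"C", "D"}, eval_years={2012, 2013}
)

# Naive baseline: predict the calibration-set average for every evaluation record.
baseline_days = sum(r["days_to_heading"] for r in calibration) / len(calibration)
observed = [r["days_to_heading"] for r in evaluation]
baseline_mae = mean_absolute_error(observed, [baseline_days] * len(observed))
print(f"Baseline MAE: {baseline_mae:.2f} days")
```

A modeling group's simulated dates could be passed to `mean_absolute_error` in the same way, so that its MAE is directly comparable with the baseline value.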