Abstract:
Accurate crop yield prediction is essential for ensuring global food security under changing climatic, environmental and agro-industrial conditions. This study presents a comprehensive comparative analysis of machine learning algorithms for large-scale yield prediction using agroecological parameters and pesticide usage as key explanatory variables. A crop yield prediction dataset downloaded from Kaggle, comprising 28 242 records across multiple countries and crops, was preprocessed and modelled with 17 algorithms, including tree-based, regression-based, support vector, neural network, boosting, and ensemble approaches. Model performance was assessed using root mean square error (RMSE) and the coefficient of determination (R²). Modelling and visualization were performed in Python 3 using external libraries. Among global models, Extra Trees achieved the highest accuracy (R² = 0.991, RMSE = 8282.92 hg/ha), outperforming both gradient boosting and neural network approaches. Ensemble techniques, particularly stacking ensembles, provided comparable accuracy (R² = 0.990), confirming the robustness of tree-based methods. Feature importance analysis highlighted crop type (0.609) as the dominant predictor, while pesticides (0.110), average temperature (0.108), and rainfall (0.087) emerged as the most influential agro-environmental factors. Country-specific models achieved near-perfect predictive power (R² ≈ 1.0) for India, Brazil, Pakistan, Mexico, and Turkey, while for Ukraine, where yields averaged 43.4% below the global mean, the best-performing model was XGBoost (R² = 0.980). Crop-level analysis identified potatoes, cassava, and sweet potatoes as the highest-yielding crops globally. These results demonstrate the superiority of tree-based and ensemble models for yield forecasting and emphasize the value of localized modelling strategies. The findings provide actionable insights for optimizing agricultural practices and guiding sustainable intensification policies.
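
For readers unfamiliar with the two evaluation metrics named in the abstract, the following is a minimal sketch of how RMSE and the coefficient of determination are commonly computed. It is not the study's code: the function names `rmse` and `r2_score` and the yield values are hypothetical, chosen only for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error, in the same units as the target (e.g. hg/ha)."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical yields in hg/ha, invented purely for illustration.
observed = [10000.0, 20000.0, 30000.0, 40000.0]
predicted = [11000.0, 19000.0, 31000.0, 39000.0]

print(f"RMSE = {rmse(observed, predicted):.2f} hg/ha")
print(f"R2   = {r2_score(observed, predicted):.3f}")
```

RMSE is expressed in the units of the target variable, so it depends on the scale of the yields, whereas R² is dimensionless; reporting both, as the abstract does, gives an absolute and a relative view of model error.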