An optimized machine learning model for identifying socio-economic, demographic and health-related variables associated with low vaccination levels that vary across ZIP codes in California

As COVID-19 cases surge across the United States, there is an urgent need for an in-depth, systematic assessment of a wide range of predictive factors related to the populations most at risk of delaying or refusing vaccination. Many studies have assessed only a limited number of general sociodemographic and health-related factors associated with low vaccination rates. Here, machine learning methods were used to assess the association of 151 social and health-related risk factors, derived from the American Community Survey 2019 and the Centers for Disease Control and Prevention (CDC) Behavioral Risk Factor Surveillance System (BRFSS), with two response variables, vaccination rates and unvaccinated counts, in 1,555 ZIP Codes in California. Various analytical models were evaluated on their ability to regress vaccination levels on the predictive variables. The Gradient Boosting Regressor (GBR) explained a higher percentage of the variance than linear and generalized regression models. A set of 20 variables explained 72.90% of the variability in unvaccinated counts among ZIP Codes in California. ZIP Codes were shown to be a more meaningful geo-local unit of analysis than counties. Modeling vaccination rates was not as effective as modeling unvaccinated counts. The public health utility of this model lies in supporting the analysis of state and local conditions related to COVID-19 vaccination use and to future public health problems and pandemics.
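
The model comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes scikit-learn, uses a synthetic dataset from `make_friedman1` in place of the ACS and BRFSS variables (only the shape, 1,555 rows by 151 predictors, echoes the study), and relies on default hyperparameters. The names `scores` and `top20` are hypothetical. The sketch fits a linear baseline and a Gradient Boosting Regressor, compares the share of variance each explains on held-out data, and ranks predictors by GBR importance to keep 20.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption): 1,555 "ZIP Codes" x 151 candidate predictors.
X, y = make_friedman1(n_samples=1555, n_features=151, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Candidate models: a linear baseline and a gradient-boosted ensemble.
models = {
    "linear": LinearRegression(),
    "gbr": GradientBoostingRegressor(random_state=0),
}

# Fit each model and record the held-out R^2 (fraction of variance explained).
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

# Rank predictors by impurity-based GBR importance and keep the 20 strongest.
importances = models["gbr"].feature_importances_
top20 = sorted(range(X.shape[1]), key=lambda i: importances[i], reverse=True)[:20]

for name, r2 in scores.items():
    print(f"{name}: R^2 = {r2:.3f}")
print("top 20 predictor indices:", top20)
```

With real data, the response would be the unvaccinated count per ZIP Code, and the selected subset would typically be refit and re-scored on its own before reporting its explained variance.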

Publication Title

Preventive Medicine Reports
