A model of disparities: risk factors associated with COVID-19 infection

Background By mid-May 2020, there were over 1.5 million cases of SARS-CoV-2 infection (COVID-19) across the U.S., with new confirmed cases continuing to rise following the re-opening of most states. Prior studies have focused mainly on clinical risk factors associated with serious illness and mortality of COVID-19. Less analysis has been conducted on the clinical, sociodemographic, and environmental variables associated with initial infection of COVID-19. Methods A multivariable statistical model was used to characterize risk factors in 34,503 cases of laboratory-confirmed positive or negative COVID-19 infection in the Providence Health System (U.S.) between February 28 and April 27, 2020. Publicly available data were utilized as approximations for social determinants of health, and patient-level clinical and sociodemographic factors were extracted from the electronic medical record. Results Higher risk of COVID-19 infection was associated with older age (OR 1.69; 95% CI 1.41–2.02, p < 0.0001), male gender (OR 1.32; 95% CI 1.21–1.44, p < 0.0001), Asian race (OR 1.43; 95% CI 1.18–1.72, p = 0.0002), Black/African American race (OR 1.51; 95% CI 1.25–1.83, p < 0.0001), Latino ethnicity (OR 2.07; 95% CI 1.77–2.41, p < 0.0001), non-English language (OR 2.09; 95% CI 1.7–2.57, p < 0.0001), residing in a neighborhood with financial insecurity (OR 1.10; 95% CI 1.01–1.25, p = 0.04), low air quality (OR 1.01; 95% CI 1.0–1.04, p = 0.05), housing insecurity (OR 1.32; 95% CI 1.16–1.5, p < 0.0001) or transportation insecurity (OR 1.11; 95% CI 1.02–1.23, p = 0.03), and living in senior living communities (OR 1.69; 95% CI 1.23–2.32, p = 0.001). Conclusion Risk of COVID-19 infection is higher among groups already affected by health disparities across age, race, ethnicity, language, income, and living conditions.
Health promotion and disease prevention strategies should prioritize groups most vulnerable to infection and address structural inequities that contribute to risk through social and economic policy.

severe illness, such as older adults living in long-term care facilities, those with a BMI of forty or higher, and immunosuppressed individuals, including people with HIV/AIDS [8]. However, most risk models have not incorporated clinical, sociodemographic, and environmental variables, which may be predictive of community spread within the U.S.
As with other infectious diseases, predictors of COVID-19 infection may include employment status, education level, income, and housing conditions [9], which could influence the ability to seek care, adhere to treatment, and practice physical distancing measures. Thus, effective strategies for predicting risk factors for community transmission should include both clinical and social factors [10]. The latter factors in particular remain understudied, especially among communities of lower socioeconomic status [10].
Emerging data already show that communities of color and/or low socioeconomic status are experiencing disproportionate rates of serious illness if infected, due to preexisting economic and health inequities [11,12].
By performing large-scale analyses, healthcare systems can play a role in investigating patient and population differences in disease susceptibility, distinct from mortality risk. The purpose of this study was to use collated data from an entire health system to identify the apparent sociodemographic, environmental, and clinical predictors of the risk of COVID-19 infection and their relevance to persistent health disparities across race, ethnicity, socioeconomic status, language, and age [13].

Study design and setting
This study was conducted at Providence Health System, the third largest not-for-profit health system in the U.S., serving more than five million people across seven states in the Western and Southwestern portions of the U.S.

Data source
Data were collected from the Providence enterprise data warehouse. The data elements that were collected were informed by a comprehensive review of prior scientific studies that documented mortality risk factors and the CDC list of groups at higher risk for severe illness [8]. Variables included patient demographic, social, and behavioral history information; chronic conditions documented in clinical history; current conditions; prescribed medications; laboratory testing results; and acute and ambulatory healthcare utilization.
To study sociodemographic and environmental variables, electronic medical record (EMR) data were used to link patients' locations to the U.S. Census Bureau's 2018 American Community Survey and CDC air quality data. To join these datasets to EMR data, patient addresses were geocoded and matched at the census block group or tract level.
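The linkage described above can be pictured as a keyed lookup. The following is a minimal sketch, not the study's pipeline: it assumes each geocoded patient record carries a 12-digit census block group GEOID (whose first 11 digits identify the tract), and the function name, field names (`geoid`, `pct_below_poverty`), and values are all hypothetical.

```python
def link_to_census(patient, block_group_data, tract_data):
    """Attach neighborhood attributes to one patient record.

    Tries the 12-digit block group GEOID first, then falls back to the
    11-digit tract GEOID (the block group GEOID without its last digit).
    Returns a new dict; the input record is not modified.
    """
    geoid = patient.get("geoid")
    linked = dict(patient)
    if geoid and geoid in block_group_data:
        linked.update(block_group_data[geoid])
        linked["match_level"] = "block_group"
    elif geoid and geoid[:11] in tract_data:
        linked.update(tract_data[geoid[:11]])
        linked["match_level"] = "tract"
    else:
        linked["match_level"] = "unmatched"
    return linked


# Toy lookup tables (invented values, for illustration only).
block_groups = {"530330001001": {"pct_below_poverty": 12.5}}
tracts = {"53033000200": {"pct_below_poverty": 20.0}}

exact = link_to_census({"id": 1, "geoid": "530330001001"}, block_groups, tracts)
fallback = link_to_census({"id": 2, "geoid": "530330002003"}, block_groups, tracts)
missing = link_to_census({"id": 3, "geoid": None}, block_groups, tracts)
```

Recording the match level alongside the linked attributes makes it possible to check later how many patients were matched at the coarser tract level or not at all.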
Glottolog, a repository for the world's languages, was used to assign language groups. Geographic regions and clinical symptoms were also included as variables. Census data on educational attainment and financial insecurity were used to assess socioeconomic status.

Participants and procedures
Patients residing in Alaska, Washington, Oregon, Montana, and California (Los Angeles and parts of Orange County) who were tested for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection between February 28, 2020 and April 27, 2020 were included in the data set. Testing used swabs of respiratory specimens appropriate for viral RNA testing across eight testing platforms.

Outcomes and predictors
The principal dependent variable for our model was COVID-19 infection, as indicated by a positive lab test.
Distributions of all continuous variables, including age, BMI, number of medications, and neighborhood financial insecurity, were examined for normality and transformed into categorical attributes. Comorbidities were determined by problem list documentation or clinical encounter diagnoses using standard International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) nomenclature and further summarized into a measure of disease severity using total number of chronic conditions. Substance, tobacco, and alcohol use was captured from social history assessments and clinician documentation.
The following variables were used as indicators of physical proximity to other people (i.e., structural barriers to social distancing): transportation insecurity, relationship status, employment, housing insecurity, and age-stratified communal living.

Statistical methods and modeling
Descriptive statistics were used to summarize study participants. Continuous variables were described by means and standard deviations, while categorical variables were described using frequencies and percentages. We conducted bivariate analysis to assess whether each factor had a significant effect on the outcome. All covariates with p < 0.25 in the bivariate analysis were considered for model inclusion, since use of a more traditional level of 0.05 often fails to identify variables whose association with the outcome could become stronger in the presence of other variables [14]. In addition, all variables of known clinical importance found in previous studies that could make an important contribution were included to improve upon previous models [1]. Beginning with all variables of interest, a stepwise selection with backward elimination was used to create a multivariable logistic regression model for predicting risk of infection.
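As an illustration of the p < 0.25 screening step only, the sketch below applies a Pearson chi-square test to 2x2 tables of a binary factor against test result. The choice of test is an assumption (the text does not name the bivariate test used), the factor names and counts are invented, and the backward elimination step is not shown.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a zero margin carries no information about association
    return n * (a * d - b * c) ** 2 / denom


def p_value_1df(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def screen(tables, threshold=0.25):
    """Return the names of factors whose bivariate p-value is below the threshold."""
    return [name for name, cells in tables.items()
            if p_value_1df(chi2_2x2(*cells)) < threshold]


# Hypothetical counts: (exposed+positive, exposed+negative,
#                       unexposed+positive, unexposed+negative)
candidate_tables = {
    "factor_a": (30, 10, 10, 30),  # strongly associated in this toy table
    "factor_b": (10, 10, 10, 10),  # no association at all
}
retained = screen(candidate_tables)
```

Factors retained by a screen like this would then enter the multivariable model together with any variables kept for known clinical importance.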
Initial parameters for the model were identified in the training set and then tested at the subsequent step, with data randomly partitioned into two independent data subsets: 80% for training and building the model and another 20% for testing. Missing data were recoded as unknown and included in the analysis. Detailed covariate definitions and data sources are shown in the supplement.
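The partitioning and missing-data handling can be sketched as follows. This is a minimal illustration under stated assumptions rather than the SAS procedure used in the study: the helper names, the fixed seed, and the record fields are hypothetical.

```python
import random


def recode_missing(record, fields):
    """Return a copy of the record in which missing (None) fields become 'Unknown'."""
    return {f: ("Unknown" if record.get(f) is None else record[f]) for f in fields}


def split_train_test(records, test_fraction=0.2, seed=0):
    """Randomly partition records into (train, test) without overlap."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(records)   # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy records with an invented 'language' field, missing for every third patient.
raw = [{"id": i, "language": None if i % 3 == 0 else "English"} for i in range(100)]
clean = [recode_missing(r, ["id", "language"]) for r in raw]
train, test = split_train_test(clean)
```

Treating "Unknown" as its own category keeps every tested patient in the analysis instead of dropping records with incomplete fields.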
The model's ability to discriminate COVID-19 infection in the validation data set was evaluated using the area under the receiver operating characteristic curve and the Hosmer-Lemeshow goodness-of-fit statistic. The observed and expected frequencies within each decile of risk were compared [14]. All data manipulation and modeling were completed in SAS EG (SAS Institute, Cary, NC).
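For readers unfamiliar with these two measures, the sketch below computes both from observed outcomes and predicted probabilities. It is an illustrative pure-Python version, not the SAS output used in the study; the function names and toy data are assumptions, and the Hosmer-Lemeshow statistic is returned without a p-value (it is conventionally referred to a chi-square distribution with the number of groups minus two degrees of freedom).

```python
def auc(y_true, y_prob):
    """Area under the ROC curve by pairwise comparison (ties count as 0.5)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for pp in pos:
        for pn in neg:
            if pp > pn:
                wins += 1.0
            elif pp == pn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def hosmer_lemeshow(y_true, y_prob, groups=10):
    """Hosmer-Lemeshow statistic over equal-sized groups ordered by predicted risk."""
    pairs = sorted(zip(y_prob, y_true), key=lambda t: t[0])  # sort by risk only
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        size = len(chunk)
        observed = sum(y for _, y in chunk)   # observed number of positives
        expected = sum(p for p, _ in chunk)   # expected number of positives
        variance = expected * (1.0 - expected / size)
        if variance > 0:
            stat += (observed - expected) ** 2 / variance
    return stat


# Toy data, for illustration only.
toy_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 3 of 4 pairs ranked correctly
toy_hl = hosmer_lemeshow([0, 1, 1, 1], [0.2, 0.2, 0.8, 0.8], groups=2)
```

The pairwise AUC loop is quadratic in the number of patients, which is acceptable for a sketch but would normally be replaced by a rank-based computation on a data set of this size.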
For all independent predictor subgroups, the risk of COVID-19 infection was quantified with odds ratios (OR) and 95% confidence intervals. These risks were calculated using the entire data set.
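In a logistic regression, an odds ratio and its confidence interval are typically obtained by exponentiating a coefficient and its Wald limits. The sketch below assumes Wald intervals (the text does not state the interval method). The coefficient and standard error are back-calculated from the odds ratio and interval reported for male gender in the abstract; they are not values reported by the study.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def odds_ratio_ci(beta, se, z=Z_95):
    """Return (OR, lower, upper) for a logistic regression coefficient."""
    return (math.exp(beta),
            math.exp(beta - z * se),
            math.exp(beta + z * se))


# Back-calculated inputs (assumption): OR 1.32, 95% CI 1.21 to 1.44.
beta_male = math.log(1.32)
se_male = (math.log(1.44) - math.log(1.21)) / (2 * Z_95)

or_male, lower_male, upper_male = odds_ratio_ci(beta_male, se_male)
```

Because the interval is symmetric on the log-odds scale, the reported odds ratio sits at the geometric mean of its confidence limits, not at their arithmetic midpoint.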

Sociodemographic risk factors
Comparatively, individuals between 50 and 59 years of age (

Prediction of infection risk
The model performed consistently across training and testing data sets, with a receiver operating characteristic area under the curve of 0.78 and a Hosmer-Lemeshow chi-square of 4.4 (p = 0.81). Partitioning the predicted probabilities of infection into "deciles of risk" (i.e., equal-sized groups ordered from smallest to largest) did not highlight any "underperforming" areas.

Clinical risk factors
This retrospective study of the risk of COVID-19 infection identified several clinical risk factors also associated with serious illness in prior studies, including older age [3], male gender [15], diabetes [7], chronic kidney disease [16], high BMI [17], and immunosuppression [18]. However, some factors previously found to increase mortality risk, such as hypertension [3], and cardiovascular disease, liver disease, lung disease, or asthma [8], were not significant factors associated with initial COVID-19 infection.
Surprisingly, being prescribed more than ten medications or having a greater number of chronic conditions was associated with lower infection risk, suggesting possible risk reduction behavior based on perceived risk. Further research is needed to understand the differences between factors associated with initial infection risk and those associated with serious illness and mortality once the infection occurs.
Healthcare access through a relationship with an internal primary care provider was associated with a lower infection risk; however, this may be a result of higher rates of COVID-19 testing among these patients than among individuals with no primary care provider. Patients without a primary care provider may have only been tested for COVID-19 after respiratory and other possible COVID-19 symptoms became conspicuous, thus increasing the probability of a positive test.
Receiving secure electronic communication through the EMR was associated with lower risk of infection, suggesting that access to health advice and education may reduce risk.
Serious mental illness and drug and tobacco use were associated with lower risk; however, further study is necessary to understand the mechanisms behind such associations.

Sociodemographic risk factors
Race and ethnicity appeared to be important predictors of risk. Higher risk of infection among Black, indigenous, and/or people of color may be associated with other sociodemographic and environmental characteristics found to also be significant in this study. African Americans and Latinos are more likely to live in communities with poor air quality [19], work in jobs that cannot be performed remotely [20], and lack access to healthcare [21], which may increase the risk of infection and contribute to racial disparities in mortality. Additionally, chronic conditions such as obesity, stroke, and diabetes, and premature death also affect African Americans and Latinos disproportionately compared to whites [13]. Communities of color are also more likely to experience lower socioeconomic status [22] and be employed as essential workers [10]. Additionally, for these and other vulnerable groups, lack of personal transportation is a barrier to both healthcare access [23] and social distancing, further exacerbating infection risk. For these reasons, communities of color experience more structural barriers to social distancing measures and are more vulnerable to severe illness.
Having limited English proficiency (LEP) can be a barrier to accessing health services and understanding health information, especially when written translations and/or trained translators are not available [24]. Over the course of the pandemic, health information has changed rapidly (e.g., mandates for masking), which can create barriers to accessing information and could leave indigenous and immigrant communities uninformed. During the Ebola epidemic in West Africa, language barriers were an obstacle to slowing the spread of the disease [25]. People with LEP are also more likely to have low health literacy compared to English speakers and are at a higher risk of poor health [26].
Culturally and linguistically appropriate interventions are essential, including communication materials of different formats and reading levels developed through the collaboration of native language speakers and English speakers, as well as the use of community health workers who can engage with underserved groups [27].

Environmental risk factors
Older age may be considered both a clinical and an environmental risk factor, as it moderates both comorbidities (e.g., dementia) requiring caregiving and housing situations (e.g., living in senior communities). Our results showed that some sociodemographic patient characteristics that influence environmental exposure to social contact were also associated with increased rates of COVID-19 infection, such as being married or having a significant other, being employed, lacking access to a personal vehicle, and living in overcrowded housing, each of which significantly increased infection risk. Religious affiliation was also associated with increased risk, which may be attributed to attendance at large religious services or other behaviors associated with religious identity.
People experiencing housing insecurity may experience challenges with physical distancing, especially when housing is crowded. These individuals may also lack hand washing facilities and/or running water [28]. Both factors could facilitate community spread of infectious diseases.
Regional differences in infection risk were evident, with Southern California and Western Washington having the highest infection rates (15.7 and 11.3% of tested patients), while Oregon and Alaska (4.3 and 4.7%) had the lowest rates. These regional differences may reflect some combination of population density, proximity to the initial points of COVID-19 entry into the U.S., and state-specific COVID-19 precautions.

Study limitations
This study was limited to patient data from the Providence Health System and publicly available data sets. Although the organization serves a diverse patient population across seven Western U.S. states, the generalizability of this study to the entire U.S. is unclear. With limited testing available and evolving screening guidelines, clinical discernment and personal bias may have impacted which individuals received testing and thus may have influenced the rates of testing in certain populations. Additionally, it is impossible to correlate patient data to measures of individual patient behaviors, such as mask use or adherence to social distancing recommendations. Finally, this study focused on factors associated with initial infection risk; however, other factors may further influence outcomes such as disease severity, time in hospital, and mortality.

Conclusions
Our construction of a multi-faceted prediction model of COVID-19 infection risk in our large, multi-state population has important implications for healthcare systems, public health departments, and city and state governments to further reduce the risk of infection and prevent the spread of COVID-19 in communities that may be disproportionately impacted. Knowledge of the complex mixture of clinical, ethnic, linguistic, and environmental factors that contribute to infection risk should enable more targeted public health approaches to decrease COVID-19 infection.
Linguistically and culturally appropriate prevention education, healthcare access including routine care and