Pertussis, commonly known as whooping cough, is a serious respiratory infection caused by the
bacterium Bordetella pertussis. It can cause violent coughing fits. Whooping cough is most dangerous for young babies and
can be deadly.
Information on whooping cough:
Pertussis outbreaks occur every 3-5 years, as can clearly be seen above. There is a huge spike in cases in both 2010 and 2014, with the hardest-hit counties seeing upwards of 14 cases per 10,000 residents in each of those years.
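The per-10,000 figures quoted above are a simple normalization of case counts by county population. A minimal sketch of that calculation follows; the function name and the example numbers are hypothetical and are not taken from the CHHS data.

```python
def cases_per_10k(cases: int, population: int) -> float:
    """Return the incidence rate expressed as cases per 10,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cases / population * 10_000


# Hypothetical example: 70 reported cases in a county of 50,000 residents.
rate = cases_per_10k(70, 50_000)
print(f"{rate:.1f} cases per 10,000 residents")
```

Normalizing this way lets small and large counties be compared on the same scale.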
Kindergarten vaccination rates by county vary immensely, as can be seen on the above box chart. Some counties' kindergarten vaccination rates are nearly 100%, while others fall below 80%.
Immunization rate by county for the nine most populated counties, with the color gradient centered at the 94% herd immunity rate for Pertussis and at-risk counties shown in red. Before diving into our machine learning model, we noticed an inverse relationship between vaccination rate and reported Pertussis cases.
Map displaying the most common preventable diseases in California from 2001 to 2017. The size of each circle indicates the count of each type of infection. We observe a minor Pertussis outbreak in 2005, followed by major outbreaks in 2010 and 2014. Also, in 2011, Hepatitis A surpasses Invasive Meningococcal Disease (Meningitis) as the second most common preventable disease.
The California Health & Human Services (CHHS) Open Data Portal provided California pertussis incidents by county over time.
The Shots for Schools website was a huge help in this process: it took raw CHHS data and created CSV files containing kindergarten vaccination rates by school, sorted by county. County-level data then became easy to extract and merge with the other data sources.
Initially, only county population data over time was extracted from the US Census Bureau API. With the second iteration, however, an additional 75 columns of data were pulled for each California county for every year between 2010 and 2016. These were processed down to the following items:
Once the US Census data had been cleaned up and saved as a CSV, it was combined with the California Kindergarten Vaccination Rate and pertussis incidents data from the CHHS Open Data Portal. The files all had county name as a common column, but the names needed to be cleaned up so that wording and case matched exactly.
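The county-name cleanup and join described above can be sketched as follows. This is a simplified, standard-library illustration rather than the project's actual code: the helper names, the rule of dropping a trailing "County", and the sample rows are all assumptions, and the real data is also keyed by year, which the sketch omits.

```python
def normalize_county(name: str) -> str:
    """Lower-case, trim, collapse whitespace, and drop a trailing ' county'."""
    cleaned = " ".join(name.strip().lower().split())
    if cleaned.endswith(" county"):
        cleaned = cleaned[: -len(" county")]
    return cleaned


def join_on_county(*tables: dict) -> dict:
    """Inner-join {county_name: {column: value}} tables on the normalized name."""
    normalized = [
        {normalize_county(name): row for name, row in table.items()}
        for table in tables
    ]
    # Keep only counties that appear in every table.
    shared = set(normalized[0]).intersection(*normalized[1:])
    merged = {}
    for county in sorted(shared):
        merged[county] = {}
        for table in normalized:
            merged[county].update(table[county])
    return merged


# Hypothetical rows, for illustration only.
census = {"Alameda County": {"population": 1_600_000}}
vaccination = {"ALAMEDA": {"vaccination_rate": 0.95}}
pertussis = {" alameda ": {"cases": 120}}

merged = join_on_county(census, vaccination, pertussis)
print(merged)
```

An inner join silently drops counties whose names still fail to match, so comparing the merged row count against each source is a cheap sanity check.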
Additional data cleaning performed to allow all data to be joined:
Vaccination Data
Figure 5 - Correlation matrix of the 9 features considered for analysis
View the full machine learning Jupyter notebook here
The correlation matrix was used to select which features to use for regression analysis. We selected the top 5 features based on their
correlation to the outbreak percentage. A linear regression was run on each of these features to glean additional information.
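As a rough illustration of this selection step, the sketch below computes a Pearson correlation for each feature against the outbreak percentage and keeps the strongest ones. Ranking by absolute correlation is an assumption on our part, and the feature names and values are hypothetical toy data rather than the project's dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def top_features(features, target, k=5):
    """Rank features by absolute correlation with the target and keep the top k."""
    scored = {name: pearson(values, target) for name, values in features.items()}
    ranked = sorted(scored.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:k]


# Hypothetical toy data: one list of values per feature, one entry per county.
features = {
    "vaccination_rate": [0.98, 0.95, 0.90, 0.85, 0.80],
    "pct_foreign_born": [0.10, 0.30, 0.20, 0.25, 0.15],
}
outbreak_pct = [0.01, 0.02, 0.04, 0.05, 0.07]

for name, r in top_features(features, outbreak_pct, k=5):
    print(f"{name}: r = {r:+.2f}")
```

Sorting on the absolute value keeps strongly negative features, such as vaccination rate, from being ranked below weakly positive ones.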
Figure 6 - Linear Regression plot of training data vs predicted
The predicted line has a shallow negative slope, showing a decreasing pertussis outbreak rate as the vaccination rate increases.
Looking at the distribution of the plotted training data, we can infer that the correlation between vaccination rate and pertussis cases is not very strong.
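To make the "shallow slope, weak correlation" reading concrete, here is a minimal least-squares sketch that fits a line and reports R squared. It uses only the standard library and hypothetical numbers, so it illustrates the technique rather than reproducing the notebook's model.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical values: vaccination rate (%) and pertussis cases per 10,000.
vaccination_rate = [80, 85, 90, 92, 95, 98]
case_rate = [6.0, 1.0, 5.0, 0.5, 3.0, 0.2]

slope, intercept = fit_line(vaccination_rate, case_rate)
r2 = r_squared(vaccination_rate, case_rate, slope, intercept)
print(f"slope = {slope:.3f}, intercept = {intercept:.2f}, R^2 = {r2:.2f}")
```

A negative slope combined with a low R squared matches the pattern described above: the trend points the expected way, but vaccination rate alone explains little of the scatter.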
Figure 7 - Plot of incident threshold vs model accuracy
The plot shows that the higher the threshold used to define an outbreak, the greater the accuracy of the
model. Based
on the plot, we selected 10 as the threshold for our logistic regression.
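The sketch below shows how such a threshold turns case counts into outbreak and no-outbreak labels. It also prints the share of the more common label at each threshold, which is the accuracy a model would reach by always predicting that label and can serve as a baseline when reading a plot like Figure 7. The case counts are hypothetical, and treating the threshold as inclusive (>=) is an assumption.

```python
def label_outbreaks(case_counts, threshold):
    """Label an observation as an outbreak (1) when its count reaches the threshold."""
    return [1 if count >= threshold else 0 for count in case_counts]


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common label."""
    positives = sum(labels)
    return max(positives, len(labels) - positives) / len(labels)


# Hypothetical case counts, one per county-year.
case_counts = [0, 1, 2, 3, 4, 6, 8, 12, 15, 40]

for threshold in (2, 5, 10, 20):
    labels = label_outbreaks(case_counts, threshold)
    print(
        f"threshold={threshold:>2}  outbreaks={sum(labels)}  "
        f"baseline accuracy={majority_baseline(labels):.2f}"
    )
```

Comparing the model's accuracy against this baseline helps separate real predictive skill from the effect of having few outbreak labels.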
Figure 8 - Logistic Regression plot of test data vs probability
Because the scatter plot of the linear regression seems to show clusters of low-incident and high-incident points,
we decided to also run a logistic regression to predict outbreak vs. no outbreak.
The plot of the regression data shows a higher probability of no outbreak as the vaccination rate
increases,
which is what we expected.
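For readers who want to see the mechanics, the following is a minimal single-feature logistic regression fitted by batch gradient descent. It is a standard-library sketch, not the notebook's implementation: the data is hypothetical, and centering the vaccination rate at 94 percentage points is a choice made here to keep the optimization stable.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(xs, labels, learning_rate=0.05, epochs=5000):
    """Fit P(outbreak) = sigmoid(w * x + b) by batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, labels):
            error = sigmoid(w * x + b) - y
            grad_w += error * x
            grad_b += error
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b


# Hypothetical data: vaccination rate in percentage points relative to 94%,
# and whether that county-year was labeled an outbreak (1) or not (0).
centered_rate = [-14, -10, -6, -4, -2, 0, 1, 2, 3, 4]
outbreak = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]

w, b = fit_logistic(centered_rate, outbreak)
for rate in (-14, 0, 4):
    print(f"rate {94 + rate}%: P(outbreak) = {sigmoid(w * rate + b):.2f}")
```

A negative fitted weight corresponds to the pattern described above: the predicted outbreak probability falls as the vaccination rate rises.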
Figure 9 - Logistic regression plot of number of years since last outbreak to
outbreak probability
Looking at the Tableau visualization (Figures 3 & 4 above) we could visually see that there is a cyclical nature to the outbreaks. We opted to
further explore this pattern and found a correlation between the time in years since the last outbreak and future outbreaks.
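One way to derive a "years since last outbreak" feature from a yearly outbreak flag is sketched below. The function name and the flag series are hypothetical; the flags loosely mirror the 2005, 2010 and 2014 pattern noted earlier and are for illustration only.

```python
def years_since_last_outbreak(outbreak_flags):
    """For each year, count the years elapsed since the most recent earlier outbreak.

    Returns None for years with no earlier outbreak on record.
    """
    result = []
    last_index = None
    for index, flag in enumerate(outbreak_flags):
        result.append(None if last_index is None else index - last_index)
        if flag:
            last_index = index
    return result


# Hypothetical flags for one county, 2005 through 2014 (1 = outbreak year).
years = list(range(2005, 2015))
flags = [1, 0, 0, 0, 0, 1, 0, 0, 0, 1]

for year, gap in zip(years, years_since_last_outbreak(flags)):
    print(year, gap)
```

The feature for a given year uses only earlier years, so the current year's outcome does not leak into its own predictor.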
Figure 10 - Linear regression plot of population that is foreign born versus probability of accuracy
When we visualized the linear regression for foreign born population in each county against accuracy, we found minimal influence on the
accuracy of the model, as can be seen by the near-zero slope of the regression line.
Figure 11 - Logistic regression plot of percent of population that is foreign born to
outbreak probability
The probability does not vary widely over the entire range of the population percentages, which is consistent with the weak correlation in Figure 10.
Figure 12 - Linear regression plot of population that moved from overseas in the past
year versus probability of accuracy
There appears to be a slight correlation here.
Figure 13 - Logistic regression plot of percent of population that moved from
overseas in the past year to outbreak probability
We can see in the above graph that the percentage of the population in all counties that moved from overseas is less than 2%, which is likely why this
feature has a weak correlation.
Figure 14 - Linear regression plot of percent of Kindergarten population whose family income is below the poverty line versus probability of accuracy
This feature is one of the stronger indicators that we found in our analysis, as can be seen in the chart above. The negative correlation is interesting
and provides some evidence that our speculations may be correct.
Figure 15 - Logistic regression plot of percent of Kindergarten population whose family income is below the poverty line to outbreak probability
The wider range of probabilities over the percentage of kids from households below the poverty line is consistent with this stronger correlation.
Factors that prevented a stronger model for predicting Pertussis outbreaks: