Filling missing ages in the VAERS dataset
A crucial piece of information which can be easily inferred using Applied Machine Learning but isn't
When I say Applied Machine Learning, I am referring to using spaCy to identify missing information within the VAERS dataset.
I recently noticed that quite a few reports state the age of the vaccinee in the SYMPTOM_TEXT field, yet that age was never extracted and added to the AGE_YRS field. spaCy can be used to extract these values.
Preprocessing
I started analyzing the 2022 VAERS dataset by merging the three individual CSV files into a single dataframe.
This makes it much easier to run your queries and see all the data in one place. However, it also makes the dataframe much larger.
So I created individual merged CSV files for each month in 2022. Each file is more manageable for text analysis, and as a bonus, I was also able to check whether the patterns I discovered held for every month (and how much they varied from month to month).
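The merge-then-split step can be sketched roughly as follows. This is only a sketch under assumptions, not the script used for this article: it assumes pandas, that the three tables share a VAERS_ID column, and that RECVDATE is formatted as MM/DD/YYYY. The function names and the output file names are made up for illustration.

```python
import pandas as pd


def merge_vaers(data, vax, symptoms):
    """Join the three VAERS tables on VAERS_ID (assumed to be the shared key).

    A report can have several vaccine or symptom rows, so the merged frame
    may contain more than one row per report.
    """
    merged = data.merge(vax, on="VAERS_ID", how="left")
    return merged.merge(symptoms, on="VAERS_ID", how="left")


def split_by_month(merged):
    """Return {month number: rows received in that month}."""
    # Assumption: RECVDATE looks like 01/31/2022.
    received = pd.to_datetime(merged["RECVDATE"], format="%m/%d/%Y")
    return {int(month): part for month, part in merged.groupby(received.dt.month)}


# Hypothetical usage (the latin1 encoding is an assumption about the files):
#   data = pd.read_csv("2022VAERSDATA.csv", encoding="latin1", low_memory=False)
#   vax = pd.read_csv("2022VAERSVAX.csv", encoding="latin1")
#   symptoms = pd.read_csv("2022VAERSSYMPTOMS.csv", encoding="latin1")
#   for month, part in split_by_month(merge_vaers(data, vax, symptoms)).items():
#       part.to_csv(f"2022_{month:02d}_merged.csv", index=False)
```

A left merge keeps every report from the main data file even when it has no matching vaccine or symptom row, which seems the safer default when the goal is to count missing ages.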
Algorithm
So the first thing we will do is look at the SYMPTOM_TEXT field for references to age. In spaCy you can do this by creating a DependencyMatcher with an age pattern that looks for structures like these in the dependency parse tree:
[Images: three example dependency parse trees]
So the basic pattern is NUM → year → old
We match on the lemmatized versions of “year” and “old”, so inflected forms such as “years” are covered too.
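To make the NUM → year → old pattern concrete, here is a minimal sketch of how it could be expressed with spaCy's DependencyMatcher. It is not the script from this article. It assumes spaCy 3.x, and it assumes the arrows mean that “old” is the head of “year” and “year” is the head of the number, which is how phrases like “45 year old male” are commonly parsed. The names AGE_PATTERN, build_age_matcher and find_ages are made up for illustration. So that the sketch runs without downloading a model, the example sentence is annotated by hand instead of being parsed.

```python
import spacy
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc

# Assumed reading of NUM → year → old: "old" is the head of "year",
# and "year" is the head of the number token.
AGE_PATTERN = [
    {"RIGHT_ID": "old", "RIGHT_ATTRS": {"LEMMA": "old"}},
    {"LEFT_ID": "old", "REL_OP": ">", "RIGHT_ID": "year",
     "RIGHT_ATTRS": {"LEMMA": "year"}},
    {"LEFT_ID": "year", "REL_OP": ">", "RIGHT_ID": "number",
     "RIGHT_ATTRS": {"POS": "NUM"}},
]


def build_age_matcher(vocab):
    """Register the age pattern under the (hypothetical) key "AGE"."""
    matcher = DependencyMatcher(vocab)
    matcher.add("AGE", [AGE_PATTERN])
    return matcher


def find_ages(doc, matcher):
    """Return the numeric value of every NUM token matched by the pattern."""
    ages = []
    for _match_id, token_ids in matcher(doc):
        number = doc[token_ids[2]]  # token_ids follow the order of AGE_PATTERN
        try:
            ages.append(float(number.text))
        except ValueError:
            continue  # spelled-out numbers such as "forty" are skipped here
    return ages


# Hand-annotated parse of one sentence, so the sketch runs without a model.
# With a trained pipeline you would instead call: doc = nlp(symptom_text)
nlp = spacy.blank("en")
doc = Doc(
    nlp.vocab,
    words=["A", "45", "year", "old", "male", "reported", "fever"],
    heads=[4, 2, 3, 4, 5, 5, 5],  # absolute index of each token's head
    deps=["det", "nummod", "npadvmod", "amod", "nsubj", "ROOT", "dobj"],
    pos=["DET", "NUM", "NOUN", "ADJ", "NOUN", "VERB", "NOUN"],
    lemmas=["a", "45", "year", "old", "male", "report", "fever"],
)
ages = find_ages(doc, build_age_matcher(nlp.vocab))
```

Matching on the parse tree rather than on the raw string means the same pattern applies wherever the parser links the number to “year” and “year” to “old”, regardless of hyphens or spacing, though the result then depends on the quality of the parse.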
Here is the Python script (sorry, I haven’t had the time to clean up the code)
I will explain the basic algorithm I used:
1. We add a new column called DERIVED_AGE to the dataframe.
2. First we check whether SYMPTOM_TEXT is empty. If it is, we have no way to infer the DERIVED_AGE from the text, and we mark it as -1.
3. Next we look for the following phrases in the SYMPTOM_TEXT: “no qualifiers provided”, “unknown age” and “unspecified age”. These are indications that the report did not include the age.
4. We then use the DependencyMatcher to see if there is a match on the age pattern shown in the images.
Remember that all of this is done only for rows where the existing AGE_YRS field is null (I don’t cross-check or verify ages that are already filled in).
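The four steps above can be sketched as a small pandas routine. Again, this is an illustration under assumptions rather than the actual script. To keep it self-contained, step 4 uses a simple regular expression as a stand-in for the DependencyMatcher. The sketch also assumes that reports containing one of the "no age" phrases get the same -1 marker as empty reports, and that rows where nothing could be decided are left blank. The helper names are hypothetical.

```python
import re

import pandas as pd

NO_AGE = -1  # marker for "the text gives no age"
NO_AGE_PHRASES = ("no qualifiers provided", "unknown age", "unspecified age")

# Stand-in for step 4: a plain regex, NOT the DependencyMatcher pattern.
_AGE_RE = re.compile(r"\b(\d{1,3})[\s-]*years?[\s-]*old\b", re.IGNORECASE)


def regex_age(text):
    """Return the first "<number> year(s) old" age in the text, or None."""
    found = _AGE_RE.search(text)
    return float(found.group(1)) if found else None


def derive_age(text, extract_age=regex_age):
    # Step 2: an empty or missing SYMPTOM_TEXT gives us nothing to work with.
    if not isinstance(text, str) or not text.strip():
        return NO_AGE
    # Step 3: phrases saying that the report did not include an age
    # (assumption: these rows get the same -1 marker).
    lowered = text.lower()
    if any(phrase in lowered for phrase in NO_AGE_PHRASES):
        return NO_AGE
    # Step 4: try to pull an age out of the text; None means "still unknown".
    return extract_age(text)


def add_derived_age(df, extract_age=regex_age):
    """Step 1: add DERIVED_AGE, filled only where AGE_YRS is null."""
    out = df.copy()
    derived = [
        derive_age(text, extract_age) if pd.isna(age) else None
        for age, text in zip(out["AGE_YRS"], out["SYMPTOM_TEXT"])
    ]
    out["DERIVED_AGE"] = pd.Series(derived, index=out.index, dtype="float64")
    return out
```

Passing the extractor in as a parameter means the regex stand-in can be swapped for a matcher-based function without touching the rest of the steps.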
Of the rows with a null AGE_YRS, I was able to infer the age in at least 20% in every month. In addition, usually about 30% of the blank AGE_YRS rows can be confirmed to genuinely lack age information.
This still leaves a fair percentage of reports where I don’t yet know if an age is provided. I hope to come back to this question in a future article.
But this still tells us that people are not carefully reading these reports to see if the age can be added into the AGE_YRS column.
You can download the output of the script in CSV format here and check for yourself.
How to verify
Download the CSV file and filter by rows where AGE_YRS is empty but DERIVED_AGE is non-empty. See how many of them are NOT equal to -1. Those are the values calculated by the script.
(Yes, you SHOULD expect to see some false positives. But it is a pretty small percentage)
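The check described above can be sketched in a few lines of pandas. This assumes the downloaded CSV has the AGE_YRS and DERIVED_AGE columns described in this article. The function name and the file name in the usage comment are made up.

```python
import pandas as pd

NO_AGE = -1  # marker used in this article for "no age in the text"


def summarize_derived_ages(df):
    """Count what the script did for the rows where AGE_YRS is empty."""
    blank = df[df["AGE_YRS"].isna()]
    checked = blank[blank["DERIVED_AGE"].notna()]
    inferred = checked[checked["DERIVED_AGE"] != NO_AGE]
    confirmed_missing = checked[checked["DERIVED_AGE"] == NO_AGE]
    total = len(blank)
    return {
        "blank_age_rows": total,
        "inferred": len(inferred),
        "confirmed_missing": len(confirmed_missing),
        "inferred_share": len(inferred) / total if total else 0.0,
    }


# Hypothetical usage:
#   df = pd.read_csv("derived_ages.csv")
#   print(summarize_derived_ages(df))
```

Spot-checking a random sample of the inferred rows against their SYMPTOM_TEXT is a simple way to get a feel for the false-positive rate.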
Over 330,000 ages are filled in in the ages file here: http://hawkvaers.com/download/presets/
(The word 'presets' means a previous run of my Python code had filled them in.)
One version of the code is in collect_records.py:
http://hawkvaers.com/download/2022_09_22_vaers_lot_cleaning_v.997gh/input/
It uses many regular expressions, including some screening of false positives.