How NLU-based information extraction makes VAERS more searchable

For example, we can derive the age from the clinical narrative in a significant percentage of foreign reports, even if it is missing from the AGE_YRS field

Dec 16, 2022

Recently, the CDC removed the clinical narrative field (SYMPTOM_TEXT) from the foreign VAERS dataset.

The pro-vaccine folks might say: given that someone at the CDC reads each report and adds all the extracted information (age, vaccination date, and so on) into the remaining columns, what is the big deal if the clinical narrative text is not made public?

Unacceptable Jessica

The foreign data set was gutted this week in VAERS and the cancer signal was halved, the myocarditis dose 3 response signal was lost and 994 spontaneous abortions/still births were dropped

As most of you know, me and a bunch of other people are monitoring VAERS data very closely week-by-week. This week (11.18.22), the first thing I noticed was that the Foreign data set was less than a fraction of the size it was last week (11.11.22): down from 283.51 MB to 96.81 MB. There is a disclaimer under the VAERS data that states the following, so …

Jessica Rose

But I noticed a pretty concerning trend once I started looking at the foreign dataset.

This is the problem: the patient's age is often stated in the SYMPTOM_TEXT, but the AGE_YRS field is not filled out in the actual CSV file [1].

And unless you use NLU-based information extraction, it is too laborious and time-consuming to fill out this information by hand, especially when the SYMPTOM_TEXT field is verbose and hard to read.

Note: I don’t actually know if it is OK to post the actual pre-deletion dataset online. So I will not be posting any of the reports for UK and EU at the moment unless I understand the rules around posting it online. Until then I will use other regions for my article datasets.

For example, the three commonly used tools for searching VAERS (CDC Wonder, OpenVAERS and MedAlerts) will not return the full list of reports if you search by age (or age range), because they filter on the AGE_YRS field and so miss every report where that field is blank.

New Zealand

Let us consider the example of New Zealand, which I discussed in a previous article.

Vaccine Data Science

Can you improve VAERS analysis by using Natural Language Processing?

You will see many people produce dashboards based on VAERS data by using all the information from the three CSV files. Since the narrative text (SYMPTOM_TEXT) field is a string, it usually does not feature in these dashboards and visualizations. The main reason I created this Substack was because I noticed that almost everyone who was producing these da…

Aravind Mohanoor

You can see something very interesting if you look at the AGE_YRS and the SYMPTOM_TEXT field in conjunction.

Filter for rows where AGE_YRS is null (meaning it was not filled out in the report) but where the SYMPTOM_TEXT field contains the word “old”. More often than not, this word appears as part of a phrase which provides the patient’s age.
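The filter described above can be sketched in a few lines of Python. This is only an illustration: the sample rows are made up (not real reports), and the helper name `rows_missing_age_but_mentioning_old` is my own. Only the column names AGE_YRS and SYMPTOM_TEXT come from the VAERS CSV layout.

```python
import csv
import io

# Tiny made-up sample in the VAERS column layout (NOT real reports).
SAMPLE_CSV = """VAERS_ID,AGE_YRS,SYMPTOM_TEXT
1,34,Patient reported chest pain two days after vaccination.
2,,A 27 year old male developed myocarditis after the second dose.
3,,Headache and fatigue; no further details provided.
4,,The patient is a 63-year-old woman with a history of hypertension.
"""


def rows_missing_age_but_mentioning_old(rows):
    """Return rows where AGE_YRS is blank but SYMPTOM_TEXT contains 'old'."""
    return [
        row
        for row in rows
        if not row["AGE_YRS"].strip() and "old" in row["SYMPTOM_TEXT"].lower()
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
candidates = rows_missing_age_but_mentioning_old(rows)

for row in candidates:
    print(row["VAERS_ID"], row["SYMPTOM_TEXT"])
```

A plain substring match will also pick up words such as "cold" or "told", so this step only produces candidates to look at, not confirmed ages.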

Myocarditis

Suppose you want to know how many people in New Zealand under the age of 40 got Myocarditis after taking the vaccine.

CDC Wonder does not allow you to search for this.

On OpenVAERS, while you can search through the foreign reports, you cannot restrict by country.

On MedAlerts, you can put NZ into the SPLTTYPE field, and it will return only the reports whose SPLTTYPE contains the string NZ. These are all New Zealand reports, which is what we want. So I added that filter and also searched for the word Myocarditis in the SYMPTOM_TEXT.

Using this search query on MedAlerts, you see that only 25 results are returned.

Link to Search Result

To compare, I took the New Zealand reports from the previous article and filtered for the word Myocarditis in the SYMPTOM_TEXT (to match with the above search results) and created a new, smaller dataset.
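As a rough sketch of that filtering step (not the exact code I ran), the same two conditions can be written in Python. The SPLTTYPE values and report texts below are invented placeholders, and `is_nz_myocarditis` is a hypothetical helper name.

```python
def is_nz_myocarditis(report):
    """True if SPLTTYPE contains 'NZ' and SYMPTOM_TEXT mentions myocarditis."""
    return (
        "NZ" in report["SPLTTYPE"]
        and "myocarditis" in report["SYMPTOM_TEXT"].lower()
    )


# Made-up rows for illustration only; the SPLTTYPE strings are placeholders.
reports = [
    {"VAERS_ID": "101", "SPLTTYPE": "NZ-EXAMPLE-0001", "SYMPTOM_TEXT": "Myocarditis confirmed by MRI."},
    {"VAERS_ID": "102", "SPLTTYPE": "NZ-EXAMPLE-0002", "SYMPTOM_TEXT": "Injection site pain."},
    {"VAERS_ID": "103", "SPLTTYPE": "AU-EXAMPLE-0003", "SYMPTOM_TEXT": "Suspected myocarditis."},
    {"VAERS_ID": "104", "SPLTTYPE": "", "SYMPTOM_TEXT": "myocarditis"},
]

nz_myocarditis = [r for r in reports if is_nz_myocarditis(r)]
print([r["VAERS_ID"] for r in nz_myocarditis])
```

The text match is case-insensitive here so that "Myocarditis" and "myocarditis" are both counted.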

Click here to see the results with DERIVED_AGE added

If you now filter only by AGE_YRS < 40, you get 26 results, which nearly matches the 25 results from the MedAlerts website. The difference is probably due to the fact that the MedAlerts data is up to date, while mine is about a month old.

But if you use the calculated DERIVED_AGE instead (see the article I linked above for the Python script I used to calculate it), you get 120 results. That is about 4.6 times as many (120 / 26), nearly a 5-fold increase!
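To make the comparison concrete, here is a minimal sketch of counting under-40 reports both ways. It is not the script from the linked article: it swaps in a simple regular expression as a stand-in for NLU-based extraction, the sample rows are made up, and the names `AGE_PATTERN`, `derive_age` and `count_under` are hypothetical.

```python
import re

# Simple stand-in for NLU extraction: matches phrases such as
# "27 year old", "63-year-old" or "19 years old".
AGE_PATTERN = re.compile(r"\b(\d{1,3})[\s-]*(?:years?|yrs?)[\s-]*old\b", re.IGNORECASE)


def derive_age(report):
    """Use AGE_YRS when present, otherwise try to read the age from the narrative."""
    if report["AGE_YRS"].strip():
        return float(report["AGE_YRS"])
    match = AGE_PATTERN.search(report["SYMPTOM_TEXT"])
    return float(match.group(1)) if match else None


def count_under(reports, limit, age_of):
    """Count reports whose age (as returned by age_of) is known and below limit."""
    ages = (age_of(r) for r in reports)
    return sum(1 for age in ages if age is not None and age < limit)


# Made-up rows for illustration only.
reports = [
    {"AGE_YRS": "34", "SYMPTOM_TEXT": "Myocarditis diagnosed after dose 2."},
    {"AGE_YRS": "", "SYMPTOM_TEXT": "A 27 year old male developed myocarditis."},
    {"AGE_YRS": "", "SYMPTOM_TEXT": "63-year-old woman with myocarditis."},
    {"AGE_YRS": "", "SYMPTOM_TEXT": "Myocarditis; age not stated."},
    {"AGE_YRS": "", "SYMPTOM_TEXT": "Patient, 19 years old, had myocarditis and chest pain."},
]

by_field = count_under(
    reports, 40, lambda r: float(r["AGE_YRS"]) if r["AGE_YRS"].strip() else None
)
by_derived = count_under(reports, 40, derive_age)
print(by_field, by_derived)
```

A regular expression like this misses ages written in other ways (for example "aged 27" or "in his thirties"), which is one reason to prefer proper NLU-based extraction on the real data.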

(You can verify this for yourself using the link above)

Summary

Using NLU-powered information extraction gives us a much more accurate picture of vaccine safety, because it lets us analyze the SYMPTOM_TEXT field in VAERS.

The fact that none of the health authorities seem to be concerned about this speaks very poorly of them.

[1] And this happens at a significantly higher rate in the foreign dataset than in the US dataset. This is probably because there are many more eyes looking at the US dataset compared to the foreign dataset.