Why are VAERS reports deleted?
About 80% of deleted reports are replaced by retained reports that are less complete, less serious, or less conclusive
Jan 2023 Update:
Looks like the CDC just keeps the first report, as I discuss in a follow-up article.
So I don’t think there is any specific “reason” for deleting duplicate reports.
Is that a good idea though? Here is what someone wrote in the comments:
Thank you, Aravind, for the follow-up investigation!
The "keep the first report" is definitely a bad policy. No good scientific or investigative study ends with just the initial information. I think there are likely at least two reasons for the policy. First, the CDC/FDA never wanted the VAERS system. Congress forced it on them. I suspect they despise it, especially because it's open -- anyone can submit reports and anyone can view reports. This also means that reporters can check whether their reports have been included and verify their accuracy (unlike almost any other human-reported scientific data set). The openness of the system allows an independent assessment of adverse effects that CDC/FDA can't control, and it allows skilled analysts like you to investigate matters they would like kept in the dark. The "keep the first report" policy allows VAERS to minimize the negative outcomes recorded in the public version of the database. Indeed, any reporter who makes more than one report is truly committed and likely not only believes the symptoms are genuine adverse effects but may also have good evidence to support their claims. A probable second reason for the policy is that it creates less work for the contractor that administers the VAERS (contracting the work out shows how little CDC/FDA care about VAERS). The contractor doesn't want to do anything extra, and CDC/FDA don't want to pay more for a feature that they don't want.
Original article:
A couple of weeks back, I wrote an article about VAERS reports being deleted.
Where do the deleted reports come from?
I want to clarify that the deleted reports are actually from this page, and the list of reports gets automatically updated every week. When I downloaded the reports on 30 Sep 2022, the file had a total of 24507 rows.
Then I wrote another article about how duplicate VAERS reports sometimes become less serious in the retained report.
The founder of OpenVAERS asked me if there is any pattern we can identify in the deleted reports.
I think I now have a hypothesis on this.
Whenever there is a duplicate report, the report which is retained is often less complete, or less serious, or less conclusive when compared to the deleted report. How often? About 80% of the time.
Let us suppose that most of the deleted reports have corresponding duplicate reports.
There may be some exceptions. For example, this report was deleted, and there would have been no duplicate report in its place if this person had not been notified of the deletion.
But for the rest of the article I will assume that this rarely happens, and that the CDC does create duplicates for all deleted records.
We also know - since I was already able to identify a lot of duplicate reports in a previous article - that many reports are probably deleted because they have duplicates in the system.
That is also what the CDC website itself says.
In short: for the rest of this article, I will assume that all deleted VAERS reports have duplicate reports, even if I am sometimes unable to identify them.
Complete, serious and conclusive reports
You can loosely classify all VAERS reports across three dimensions -
a) is it complete? - do we have all the important information, such as age, sex, state, vaccination date and onset date (the last two are combined to give us the number of days to onset, i.e. NUMDAYS)?
b) is it serious? - how serious is the report? Does it include information about life-threatening adverse events, hospitalization etc.?
c) is it conclusive? - does the report establish some kind of causation? Usually the way to do this is to use the Bradford Hill criteria. For example, if the NUMDAYS in the deleted report is 0 (which means the symptom started on the same day as the vaccination) and the ONSET_DATE is missing from the retained report, we cannot compute the number of days to onset, and the report has become less conclusive. This doesn’t mean the original report proved causation, but it does mean that causation has become even harder to establish using the Bradford Hill criteria (hence less conclusive). A sketch of the NUMDAYS computation follows below.
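As a minimal sketch of that NUMDAYS computation (assuming pandas, with the reports loaded into a dataframe named df; the column names match the standard VAERS data files):
import pandas as pd

# NUMDAYS is the number of days between vaccination and symptom onset.
# If either date is missing, the subtraction yields NaT and NUMDAYS ends up
# NaN - i.e. the time-to-onset can no longer be computed for that report.
df['NUMDAYS'] = (
    pd.to_datetime(df['ONSET_DATE'], errors='coerce')
    - pd.to_datetime(df['VAX_DATE'], errors='coerce')
).dt.days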
So I took the deleted records (all the files were downloaded on 30th Sep 2022) and tried to categorize them.
There were 24507 deleted reports.
Duplicate deleted reports
For some reason, this original list of 24507 deleted reports already has some duplicates.
Note: here I am referring to duplicates with the same VAERS_ID contained within the Excel file which I downloaded from the VAERSAnalysis.info website. I understand the terminology could be a bit confusing, but these should not be confused with the other kind of duplicates - between deleted and retained reports, where the VAERS_ID is not the same. To see an example of the latter, check out the list of duplicate reports linked from the “Less serious reports” section, which shows plenty of clear, juxtaposed examples.
While there are some minor differences between multiple rows, the VAERS_ID is the same and the pertinent information we care about is identical. So when I encounter a duplicate row in my analysis, I just skip it.
There are a total of 173 duplicates, so the total number of processed reports is 24334.
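A minimal sketch of this de-duplication step, assuming the downloaded list is loaded into pandas (the file name here is hypothetical):
import pandas as pd

deleted = pd.read_excel('deleted_reports_2022-09-30.xlsx')
print(deleted.duplicated(subset='VAERS_ID').sum())    # 173 duplicate rows
deleted = deleted.drop_duplicates(subset='VAERS_ID', keep='first')
print(len(deleted))                                   # 24334 reports to process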
How I get the initial list of matches
The Python code below is used to get the initial list of matches.
Here, df is the dataframe passed to the method; it contains all the data for either 2020, 2021 or 2022. The other method parameters are the values from a single deleted report that I need to compare, and these parameters can sometimes be empty.
If a deleted report’s value is not empty, I check for a match on that field in the dataframe which is passed to the method.
I keep filtering these results until the final list of results fully matches whatever information is contained in the deleted row.
import pandas as pd

def get_matching_rows(df, age_yrs, vax_date, onset_date, state, sex):
    # Start from the full year's data and narrow it down one field at a
    # time, filtering only on the fields the deleted report has values for.
    df_matches: pd.DataFrame = df
    if not pd.isnull(state):
        df_matches = df_matches.loc[df_matches['STATE'] == state]
    if not pd.isnull(age_yrs):
        df_matches = df_matches.loc[df_matches['AGE_YRS'] == age_yrs]
    if not pd.isnull(sex):
        df_matches = df_matches.loc[df_matches['SEX'] == sex]
    if not pd.isnull(vax_date):
        df_matches = df_matches.loc[df_matches['VAX_DATE'] == vax_date]
    if not pd.isnull(onset_date):
        df_matches = df_matches.loc[df_matches['ONSET_DATE'] == onset_date]
    return df_matches
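And here is a hypothetical usage example (the CSV name follows the standard VAERS download naming, and deleted is the de-duplicated dataframe from earlier; both are assumptions, not the exact pipeline):
# Search the 2021 data file for rows matching each deleted report.
data_2021 = pd.read_csv('2021VAERSDATA.csv', encoding='latin-1', low_memory=False)

for _, row in deleted.iterrows():
    matches = get_matching_rows(
        data_2021,
        row['AGE_YRS'], row['VAX_DATE'], row['ONSET_DATE'],
        row['STATE'], row['SEX'],
    )
    # ...then narrow the matches down further, e.g. by symptoms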
Unclassifiable reports
If the deleted VAERS report is missing some important information like AGE and VAX_DATE, it is very hard to use the remaining information to get a match within the existing VAERS reports.
I label these as “unclassifiable”. There are 1251 of these.
List of unclassifiable reports
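To make the idea concrete, here is one possible way to express it in code (the criterion is an illustration, not the exact one used; pd is pandas, as before):
def is_unclassifiable(row):
    # Hypothetical criterion: with both AGE_YRS and VAX_DATE missing, the
    # remaining fields are too generic to produce a meaningful match.
    return pd.isnull(row['AGE_YRS']) and pd.isnull(row['VAX_DATE'])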
Less complete reports
This is the list of reports where I am unable to find any match in the VAERS database using the available information. I tried to use all the existing information in each deleted report to match duplicates. If I cannot find even a single match which has all the information - AGE, SEX, STATE, VAX_DATE and at least one Symptom in common with the original report - the retained duplicate is considered to be less complete. A sketch of the symptom-overlap check follows below.
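A minimal sketch of the symptom-overlap part of this test (the VAERS symptoms files store up to five symptoms per row in columns SYMPTOM1 through SYMPTOM5; the helper name is mine):
def shares_a_symptom(deleted_row, candidate_row):
    # Collect the non-null SYMPTOM1..SYMPTOM5 values from each row and
    # check whether the two reports have at least one symptom in common.
    cols = [f'SYMPTOM{i}' for i in range(1, 6)]
    deleted_syms = {deleted_row[c] for c in cols if not pd.isnull(deleted_row[c])}
    candidate_syms = {candidate_row[c] for c in cols if not pd.isnull(candidate_row[c])}
    return bool(deleted_syms & candidate_syms)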
A good example is the deleted report I mentioned above. It included AGE_YRS information, but the duplicate (retained) report does not have a value for AGE_YRS, so my search will not be able to match these two.
But this also establishes my overall point.
This report has actually become less complete since it is missing the AGE_YRS field. As a consequence, if someone is computing aggregate statistics based on age, the retained report will not even be part of the analysis.
There are a total of 11428 unmatched reports.
List of unmatched (less complete) reports
Identifying less serious reports
The algorithm I use to identify less serious reports is the same one I used in my previous article about how duplicate reports omit information about the seriousness of the adverse event. I use the flag ⚠️ to represent changes which make the report less serious.
While the above view shows all the differences between deleted and retained reports, there are 4234 reports where the retained report became less serious than the deleted one.
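As a hedged sketch of what that comparison looks like, using the standard seriousness flags from the VAERS data files (a simplified illustration; the actual comparison in my previous article checks more fields):
SERIOUSNESS_FLAGS = ['DIED', 'L_THREAT', 'HOSPITAL', 'DISABLE']

def is_less_serious(deleted_row, retained_row):
    # Flag ⚠️: a seriousness marker set in the deleted report ('Y') is
    # absent or cleared in the retained report.
    return any(
        deleted_row[f] == 'Y' and retained_row[f] != 'Y'
        for f in SERIOUSNESS_FLAGS
    )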
Identifying less conclusive reports
If the SYMPTOM_TEXT becomes less verbose, that is one way the retained report becomes less conclusive. I use the flag ❎ to represent changes which make the report less conclusive.
For example, there may be some information contained in the narrative text which would have been useful to establish correlation.
In this case, I check whether the number of characters in the deleted report’s SYMPTOM_TEXT is at least 1.5 times the number of characters in the retained report (so the deleted report was noticeably more verbose), and if it is, I add the report to this list.
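That check is essentially a one-liner; a sketch (pd is pandas, as before):
def is_less_conclusive(deleted_text, retained_text):
    # Flag ❎: the deleted narrative is at least 1.5x the length of the
    # retained narrative, i.e. the retained report lost detail.
    if pd.isnull(deleted_text) or pd.isnull(retained_text):
        return False
    return len(deleted_text) >= 1.5 * len(retained_text)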
List of less conclusive reports
While the above view shows all the differences between deleted and retained reports, there are 4824 reports where the retained report became less conclusive.
Combining less serious and less conclusive reports
The two categories above are not disjoint. In other words, there are many reports which are both less serious and less conclusive.
When I combined them (union of the two sets), I got 7362 reports. Said another way, there are 7362 reports which are less serious or less conclusive or both.
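In code terms, assuming each category is held as a set of VAERS_IDs (the variable names are mine):
less_signal_ids = less_serious_ids | less_conclusive_ids
print(len(less_serious_ids), len(less_conclusive_ids), len(less_signal_ids))
# 4234 4824 7362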
List of reports which are less serious or less conclusive (or both)
Unchanged reports
There was also a set of reports where no changes made them less serious or less conclusive.
Note: in the context of our analysis, an unchanged report could still have some changes, but those changes did not make the report less serious or less conclusive.
Analysis
Out of the total 24334 reports:
1251 were unclassifiable
11428 were unmatched
11655 had matches I could identify
Of these 11655, 7362 became either less serious or less conclusive
In total, 11428 + 7362 = 18790 reports became at least one of:
a) less complete
b) less serious
c) less conclusive
That is, 18790/24334 = 77% of the deleted VAERS reports were changed in one of these three ways.
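The arithmetic, for anyone who wants to verify it:
total = 24334                                     # deleted reports after de-duplication
unclassifiable = 1251
unmatched = 11428                                 # less complete
matched = total - unclassifiable - unmatched      # 11655
degraded = 7362                                   # less serious and/or less conclusive
changed = unmatched + degraded                    # 18790
print(round(100 * changed / total))               # 77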
Summary
This does leave us with a few more questions.
Why select these specific 24000+ records to delete, and not the others?
What is unique about the 1251 unclassifiable reports? Why delete them and not other similar reports which are also missing a lot of information?
But despite these questions, it is pretty clear that whenever the CDC gets a chance to choose between two VAERS reports, quite often it chooses the one with less “signal”.
From the comments:
Thank you for your very helpful and informative Substack.
I wonder whether the patterns you observe are due to VAERS deleting more recent reports of a case, given this explanation in the VAERS Data Use Guide: "when multiple reports of a single case or event are received, only the first report received is included in the publicly accessible dataset. Subsequent reports may contain additional or conflicting data, and there is no assurance that the data provided in the public dataset is the most accurate or current available" (p. 3). The negative interpretation of this policy is that it allows VAERS to eliminate some of the worst outcomes, such as when a case's condition deteriorates. Are the dates received of the retained and deleted reports consistent with the "keep the first report" policy?
Source of the deleted-reports lists: https://vaersanalysis.info/