Some thoughts on the GoodSciencing list

It needs vetting, but you can speed up the vetting process by writing some code

Jan 11, 2023

You might have heard about this recent analysis by Dr Peter McCullough (PM) comparing on-field athlete deaths since the mRNA vaccine rollout to those in prior years.

From my quick searches it looks like Dr Peter McCullough’s article is paywalled, but the article cited by him (which is more relevant for my article) is not.

That article is also known as the Lausanne Recommendations (LR). Calculated on a per-year basis, it rounds up to about 30 on-field athlete deaths per year between 1966 and 2004.

Dr PM compared it to the GoodSciencing (GS) list found here.

Criticism of the comparison

As expected, there was some criticism of the GS list. And in my opinion, the criticism is fair.

Eric Burnett, MD @Doctor_Eric_B

I’ve been seeing a lot of people quoting stats that show more athletes died in the last year than have died in the last 38 years. Is this true? No, let’s chat about it. A 🧵👇🏻

As it stands, the GS list is a mixture of everything: people who died and people who did not, athletes and non-athletes, people of all age groups, and also some stories which later turned out to be either ambiguous or false.

By citing this list, Dr PM probably made it a very problematic comparison with the LR paper, which has more rigorous inclusion/exclusion criteria and also passed peer review (plus this paper was published in 2006, not after the COVID-19 vaccine rollout, so I am inclined to take the whole peer review process a lot more seriously).

Which leads to another question - can we do a fairer comparison with prior years based on the existing GS list?

Preliminary Analysis

User HeKS44364823 on Twitter did some preliminary analysis for a single month (Dec 2022) and talked about their approach to separating out on-field and off-field deaths which sort of inspired this article.

HEKS @HeKS44364823

refers to athletes < 35 who die on the field, but it's not completely clear. If so, it indicates an average of approx 29 deaths p/yr for 38 years in athletes aged 35 & under across all sports (w/ most occurring in soccer, basketball & running). In the GS dataset for December... 8

Creating an “auto-vetted” dataset

The HTML on the GS page is highly structured.

So I parsed the HTML on the GS page and turned it into a CSV file. This part is straightforward.

Then I added some code to automate the vetting.

To get the sport, I manually created my own list of keywords, starting from this list and then adding more after reading the descriptions of each entry in the GS list. I then ran the code, checked which rows had SPORT = ‘UNK’, and kept updating the list iteratively.

List of sports keywords I used
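As a rough illustration, the keyword approach could look something like the sketch below. The keyword table, the function names and the whole-word matching rule are all hypothetical and only show the idea; the real keyword list is much longer.

```python
import re

# Hypothetical keyword table for illustration; not the actual list used.
SPORT_KEYWORDS = {
    "soccer": ["soccer", "footballer", "goalkeeper"],
    "basketball": ["basketball"],
    "boxing": ["boxer", "boxing"],
    "running": ["runner", "marathon"],
}


def guess_sport(description: str) -> str:
    """Return the first sport whose keyword appears as a whole word, else 'UNK'."""
    text = description.lower()
    for sport, keywords in SPORT_KEYWORDS.items():
        for kw in keywords:
            # Assumption: match whole words only, to limit accidental hits.
            if re.search(r"\b" + re.escape(kw) + r"\b", text):
                return sport
    return "UNK"


def unmatched(descriptions):
    """Descriptions still marked 'UNK'; reading these suggests keywords to add."""
    return [d for d in descriptions if guess_sport(d) == "UNK"]


print(guess_sport("amateur boxer and student, died suddenly"))
print(unmatched(["A goalkeeper collapsed", "A curler collapsed"]))
```

Each pass over the `unmatched` output is one round of the iterative loop: read the leftover descriptions, add the missing keywords, and run it again.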

To check for on-field vs off-field, I checked whether the description contained one of the following phrases: “on field”, “on the field”, “game”, “match”, “during”. If it did, I added the sentence containing the phrase to the ONFIELD cell; otherwise the cell stays as ‘UNK’.

I noticed that the word “during” is actually surprisingly effective at helping us discover on-field events.
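The phrase check can be sketched roughly as follows. The function name, the naive sentence splitting and the case-insensitive substring match are assumptions made for illustration, not necessarily how the real code works.

```python
import re

# The phrases listed above.
ONFIELD_PHRASES = ["on field", "on the field", "game", "match", "during"]


def onfield_sentence(description: str) -> str:
    """Return the first sentence containing an on-field phrase, else 'UNK'."""
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", description.strip())
    for sentence in sentences:
        lowered = sentence.lower()
        # Plain substring match, so "games" and "matches" are caught too.
        if any(phrase in lowered for phrase in ONFIELD_PHRASES):
            return sentence
    return "UNK"


print(onfield_sentence("He was a soccer player. He collapsed during training. He was 24."))
print(onfield_sentence("He died at home."))
```

Substring matching is deliberately loose. It will also fire on phrases like "during his sleep", which is one way false positives get in.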

I refer to this as an auto-vetted dataset since I did not do it manually, but instead used some algorithmic heuristics to guess the values for the SPORT and ONFIELD columns. This means there will definitely be both false positives and false negatives. For example, there will be some events marked as on-field even though they were not (false positives) and some others marked as UNK even though you can see that they did happen on-field (false negatives).

The goal of creating this is not to create perfectly vetted data, but to help people analyze the original GS list more quickly.

Click here to see the auto-vetted dataset

Some basic analysis on the auto-vetted dataset

So what happens if we do an apples-to-apples comparison?

I will filter for athlete deaths between the ages of 12 and 35 which happened on-field. Note that this is how the LR paper also does it, so it makes this a much more accurate comparison.

Filter for ONFIELD != ‘UNK’ and AGE between (12, 35) and STATUS = ‘Dead’ and you get 129 results over a two-year period.
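For anyone who prefers code to spreadsheet filters, here is a minimal sketch of the same filter using only the standard library. The sample rows are made up for illustration and are not from the real dataset; the NAME column and the inclusive age bounds are also assumptions.

```python
import csv
import io

# Made-up rows for illustration only; they are not taken from the real dataset.
SAMPLE_CSV = """NAME,AGE,STATUS,SPORT,ONFIELD
Player A,24,Dead,soccer,He collapsed during a match.
Player B,52,Dead,running,He collapsed during a race.
Player C,19,Alive,basketball,He collapsed during a game.
Player D,30,Dead,boxing,UNK
Player E,-1,Dead,soccer,He collapsed on the field.
"""


def is_onfield_death_12_to_35(row: dict) -> bool:
    """Apply the three conditions: on-field, aged 12 to 35, and dead."""
    age = int(row["AGE"])  # -1 means the age is unknown
    return (
        row["ONFIELD"] != "UNK"
        and 12 <= age <= 35  # assumption: both bounds are inclusive
        and row["STATUS"] == "Dead"
    )


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
matches = [r for r in rows if is_onfield_death_12_to_35(r)]
print(len(matches), [r["NAME"] for r in matches])
```

To run it against the downloaded file, replace `io.StringIO(SAMPLE_CSV)` with an opened file handle and adjust the column names if they differ.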

Technically, this must be compared with the 30/year from the LR dataset, which comes to 60 over two years. When you compare the two numbers (60 vs 129), that is still roughly a doubling after the mRNA rollout, provided you can manually vet this filtered dataset and confirm the entries are mostly accurate.
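The arithmetic behind that comparison, spelled out as a small sketch (the variable names are just for illustration):

```python
BASELINE_PER_YEAR = 30  # approximate LR figure of on-field deaths per year
YEARS = 2               # period covered by the filtered GS results
OBSERVED = 129          # filtered count from the auto-vetted dataset

expected = BASELINE_PER_YEAR * YEARS  # baseline scaled to the same period
ratio = OBSERVED / expected           # how many times the baseline

print(expected, round(ratio, 2))
```

Note that the 129 is an unvetted count, so the ratio only holds to the extent that the filtered entries survive manual vetting.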

There is a second way to get the same result.

Filter for STATUS=’Dead’ and 12TO35=’True’ and ONFIELD!=’UNK’

However, the number of results for this filter is probably going to go up over the next few days, and I will explain why in the next section.

The 12TO35 column

Sometimes the exact age is not known. Where it is known, I mark the column as True if the age is between 12 and 35 and False if it is outside that range.

But even for some events where the age is not available, you can make a reasonable guess that the person must be in that age range.

Here is a good example:

Barron Mann (Age), amateur boxer and student, died suddenly 5 days before his college graduation

I will be manually updating the 12TO35 column from UNK to True for these events when I get a chance. You will know which ones got updated if you filter for AGE = -1 and 12TO35 != ‘UNK’
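A minimal sketch of how the 12TO35 value can be derived from AGE, and how the manually updated rows can be picked out afterwards. The function names and sample rows are hypothetical, and the inclusive bounds are an assumption.

```python
def age_flag(age: int) -> str:
    """Derive the 12TO35 value from AGE, where -1 means the age is unknown."""
    if age == -1:
        return "UNK"
    return "True" if 12 <= age <= 35 else "False"


def manually_updated(rows):
    """Rows whose age is unknown but whose 12TO35 flag has been set by hand."""
    return [r for r in rows if int(r["AGE"]) == -1 and r["12TO35"] != "UNK"]


# Made-up rows for illustration only.
rows = [
    {"NAME": "Person A", "AGE": "24", "12TO35": age_flag(24)},
    {"NAME": "Person B", "AGE": "-1", "12TO35": age_flag(-1)},
    {"NAME": "Person C", "AGE": "-1", "12TO35": "True"},  # flag set by hand
]

updated = manually_updated(rows)
print([r["NAME"] for r in updated])
```

Keeping AGE at -1 while changing only the flag is what makes the manual edits easy to audit later.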

So you should expect the 129 to go up slightly over the next few days as I do the manual checks.

Download the CSV file

You can download the CSV file here

Vaccine Data Science