6 Comments
Mar 10, 2023 · edited Mar 10, 2023 · Liked by Aravind Mohanoor

I want people to understand ... There are two types of follow-ups.

1. People and medical staff make additional reports in VAERS on individuals. Those are being referred to in the article above.

2. CDC says it follows up on serious VAERS reports, and no, you can't have them; those follow-ups are kept secret from the public.

There are two possible ways to resolve that:

A. We slip into an alternate universe where the CDC is honest and interested in health. (That only became possible in that alternate universe because it was not growing at 385,000 new babies per day. That growth is the real emergency in our current universe here, which they are forced to try to address to save planet Earth; they do so while arguably obeying the law, but meanwhile avoid any increase in public health by publishing poor data.)

B. FOIA request

We humans can't have the discussion about our population growth, a topic that is too emotional, and that's why we get number 2 (pun acknowledged).


Example of poor data: over 335,000 empty age fields, even though the age is clearly stated in the write-ups.
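One way to quantify that claim, as a sketch: AGE_YRS and SYMPTOM_TEXT are the real VAERS field names, but the toy data and the age-pattern regex here are my own assumptions, not the method used for the 335,000 figure.

import pandas as pd

# Toy stand-in for the real VAERS DATA file
df = pd.DataFrame({
    "AGE_YRS": [None, 34.0, None],
    "SYMPTOM_TEXT": [
        "A 52 year old female reported fever.",
        "Patient reported chills.",
        "No age mentioned here.",
    ],
})

# Rows where the structured age field is empty but the narrative states an age
age_in_text = df["SYMPTOM_TEXT"].str.contains(r"\b\d{1,3}[- ]year[- ]old\b",
                                              case=False, regex=True)
missing_but_stated = df["AGE_YRS"].isna() & age_in_text
print(int(missing_but_stated.sum()))  # 1 in this toy frame

A real pass would need a richer set of age patterns ("aged 52", "52 yo", etc.), so this undercounts if anything.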


Does anyone have a theory on why the CDC injects (or allows) multiple repetitions of sentences in reports? Random copy and paste is not how people write. One doesn't have to look through very many reports to notice them; they are surprisingly common. A cynical mind might figure it is an effort to flip off readers' brains (pun intended). To illustrate the point, random copy and paste is not how people write and random copy and paste is not how people write.

Is there some code that can remove them?

Some candidates:

https://www.google.com/search?q=python+remove+repeat+sentences+site%3Astackoverflow.com

One possibility among the results:

https://stackoverflow.com/questions/53181784/how-to-remove-duplicate-phrases-in-python

I'd change the replacement there to r'\1 [repeated]' to document the duplicates, but it didn't work right anyway.
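For the record, here is a minimal regex in that spirit (my own sketch, not the accepted answer): it collapses a phrase immediately followed by an exact copy of itself and documents the removal with a [repeated] marker. Non-adjacent repeats won't be caught, which is part of why regex falls short here.

import re

def collapse_repeats(text):
    # A phrase of 5+ characters, whitespace, then an exact copy of the phrase;
    # the pair is replaced by one copy plus a '[repeated]' marker.
    return re.sub(r'\b(.{5,}?)\s+\1\b', r'\1 [repeated]', text)

print(collapse_repeats("the cat sat the cat sat on the mat"))
# the cat sat [repeated] on the mat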

How would the shorter-vs.-longer-text signal on the veracity of reports then tend to differ from what it is now?

It might not be possible with regex alone. Some suggestions venture into AI.

Another candidate is at https://stackoverflow.com/a/64201821/962391; not tested, and frankly it doesn't make sense to me. :(

No answer at this point.

author

Don't do this using string processing; you will have to implement too many custom rules. I think spending a few days learning spaCy can help.

a) Split the text into multiple sentences (just use the built-in sentence splitter):

https://botflo.com/courses/intro-to-spacy/lessons/how-to-split-text-into-sentences-using-spacy/

b) convert the tokens in each sentence into a Python list

c) do a pairwise comparison of the token list (for all sentences within a given report) and ignore the duplicates when you do the analysis

Can you try this for a few sample reports and let me know whether or not it works? I haven't actually implemented this in code, but I would be surprised if it doesn't work. And if it doesn't, just reply to this comment with the VAERS_ID and I will try to figure out what is going on.
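The steps above can be sketched as follows, assuming spaCy 3.x. This is untested against real VAERS data; it uses the rule-based sentencizer (so no model download is needed) and a set of token tuples in place of explicit pairwise comparison, which gives the same effect in a single pass.

import spacy

# Rule-based sentence splitting; swap in a full model such as en_core_web_sm
# for better boundaries if one is available.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

def dedupe_sentences(text):
    seen = set()
    kept = []
    for sent in nlp(text).sents:
        # a) split into sentences, b) turn each into a token sequence,
        # c) skip any sentence whose token sequence was already seen
        key = tuple(tok.lower_ for tok in sent if not tok.is_space)
        if key not in seen:
            seen.add(key)
            kept.append(sent.text.strip())
    return " ".join(kept)

For example, dedupe_sentences("Patient reported fever. Patient reported fever. Headache followed.") keeps the repeated sentence only once.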


After a cup of coffee, here's an answer:

A long example of what this solution does is VAERS report 2320023, which it reduces from 20,264 characters to 6,770 (to a third). Two thirds of that content was repeats, in 15 blocks of sentences.

https://medalerts.org/vaersdb/findfield.php?IDNUMBER=2320023&WAYBACKHISTORY=ON

This line, for example, presented 10 times, becomes just once:

"Product Description (CR): Compound BNT 162 covid-19 vaccine suspension for intramuscular 2ml multiple dose vial X 1."

from collections import OrderedDict

max_num = 0

def dedupe_repeat_sentences(df_syms_cell):
    '''Remove repeat sentences in VAERS SYMPTOM_TEXT fields.

    This version prints before and after each time a larger change occurs.
    '''
    global max_num
    string_in = str(df_syms_cell)  # sometimes not a string
    # Keep only the first occurrence of each '. '-delimited sentence, in order.
    string_out = '. '.join(OrderedDict.fromkeys(string_in.split('. ')))
    if string_in != string_out:
        diff_num = len(string_in) - len(string_out)
        if diff_num > max_num:
            print()
            print(string_in)
            print(f' {diff_num} fewer bytes, changed to ')
            print(string_out)
            print()
        else:
            print(f' {diff_num} fewer bytes ')
        max_num = max(max_num, diff_num)
    return string_out

# df is a pandas DataFrame of VAERS data
df['SYMPTOM_TEXT'] = df.apply(lambda row: dedupe_repeat_sentences(row['SYMPTOM_TEXT']), axis=1)

(originated at https://stackoverflow.com/a/40353780/962391)


It removes repeated clutter from 30,152 reports, a total of 9,999,722 characters, just shy of 10 megabytes.
