7 Comments

Similarly, as anyone who has read through VAERS reports knows as well as we do, there are an overwhelming number of repeated sentences. One might theorize that the repetition is intentional, designed to switch the reader's brain off; an agency would surely want quality to be its reputation, so this much sloppiness would not otherwise be an option.

I would be interested in a regular expression, or some other method, to replace repeated strings of 50 or more characters with something like [echo87].
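A backreference regex can do roughly this in Python. A minimal sketch, assuming `[echo]` as the placeholder marker (the lazy `.{50,}?` plus backreference can backtrack badly on pathological input, so test it before running over the whole dataset):

```python
import re

def collapse_repeats(text, min_len=50, marker="[echo]"):
    """Collapse immediately repeated substrings of at least `min_len`
    characters, keeping one copy followed by a marker."""
    # group 1 lazily grows from min_len chars; \1+ matches any
    # back-to-back repeats of that exact substring
    pattern = re.compile(r'(.{%d,}?)\1+' % min_len, re.DOTALL)
    return pattern.sub(lambda m: m.group(1) + marker, text)
```

For example, a report sentence pasted three times in a row collapses to one copy plus `[echo]`; text with no 50+ character repeats passes through unchanged.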


That's an interesting point. I have also wondered why they do that.

Removing repeated sentence blocks is quite easy with a tool like spaCy, but only if you are sure the repetition falls on sentence boundaries.
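The deduplication itself is simple once you have sentences; only the splitting needs spaCy. A sketch where `dedupe_sentences` (a hypothetical helper) takes any iterable of sentence strings:

```python
def dedupe_sentences(sentences):
    """Keep only the first occurrence of each sentence, comparing
    case-insensitively with whitespace normalized."""
    seen = set()
    out = []
    for s in sentences:
        key = " ".join(s.split()).lower()
        if key not in seen:
            seen.add(key)
            out.append(s.strip())
    return out
```

With spaCy supplying the sentence boundaries, this would be called as `dedupe_sentences(s.text for s in nlp(report_text).sents)` after loading a model such as `en_core_web_sm`.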


Thank you, Aravind, for the follow-up investigation!

The "keep the first report" policy is definitely a bad one. No good scientific or investigative study ends with just the initial information. I think there are at least two likely reasons for it.

First, the CDC/FDA never wanted the VAERS system; Congress forced it on them. I suspect they despise it, especially because it's open: anyone can submit reports and anyone can view reports. This also means that reporters can check whether their reports have been included and verify their accuracy (unlike almost any other human-reported scientific data set). The openness of the system allows an independent assessment of adverse effects that the CDC/FDA can't control, and it allows skilled analysts like you to investigate matters they would like kept in the dark. The "keep the first report" policy lets VAERS minimize the negative outcomes recorded in the public version of the database. Indeed, any reporter who makes more than one report is truly committed and likely not only believes the symptoms are genuine adverse effects but may also have good evidence to support the claims.

A probable second reason is that the policy creates less work for the contractor that administers VAERS (contracting the work out shows how little the CDC/FDA care about it). The contractor doesn't want to do anything extra, and the CDC/FDA don't want to pay more for a feature they don't want.


https://www.vaersaware.com/deleted-reports-2007-2022

I'd be interested to get your downloadable files of the results, and I could give you my "master excel" file of all 35K deleted reports for your further research. https://i.imgur.com/IZ4K9iU.jpg

There is much philosophy, and even debate, on this subject. Spot-checking some of your results, I noticed a couple of things that might be considered or added. The "appears" date, which only MedAlerts and my dashboard have, is critical to this analysis. I strongly agree that the first report received is the one kept the grand majority of the time. However, my medical-fraud-auditing senses tell me they change the "received" date/time stamp to give the appearance of a first-received, first-kept protocol when they want to publish in a less "unfavorable" light, just like your very first example, ID# 1002148. This is much easier to see when you have the luxury of viewing when the reports were published, not just when they were received. There is much more philosophy not worth explaining here, but I'd like to collab with you if you are interested in making history. There are already enough casual observers, and this "deduping" process is already under way.

If you notice, I'm almost done populating the "symptoms" of the deleted reports, which will add another layer of sophistication in identifying true duplicates. Your match of ID# 1001213 with ID# 1002148 has a moderate-to-low probability of being the same 80-year-old female. Outside of the basic biometrics, some discrepancies exist, like the administered-by field: senior facility versus pharmacy. Given the limited data, this could easily be two different 80-year-old females, so it would be ideal to incorporate a probability level like strong, moderate, maybe, or lowest. These are all just suggestions; it's what mine will have eventually. Some topics co-mingled with this one are Temp ID# and Finalized ID# associations, and unpublished ID#s (not deleted ID#s).

It's becoming apparent that VAERS does not publish all legitimate reports received. They throttle the ones they do publish, and they delete legitimate reports once published. There is a very telling VAERS audit here: https://react19.org/vaers-audit/


Here is the link to the source CSV file:

https://workdrive.zohoexternal.com/external/35a5488ee7d20842c615ddc229a5fd109b3602aa8cf11a765835ce8bd32bbe78/download

I was hoping that Zoho Analytics would provide a feature for public viewers to download the source document, but that is not available out of the box, so I had to upload the file separately. If there is a mismatch between the file and the online view, do let me know and I will take another look.

>>However, my medical fraud auditing senses tell me they change the "received date time stamp" to give the appearance of a 1st received 1st kept protocol

Interesting. I wonder if there is any way to actually triangulate this information.

I have taken note of your other suggestions and will try to address them when I get some time. But that will only be after a few more text-based analyses I am already planning to do.


Hi Aravind, I'm sure you'll find this interesting; I just published it: https://deepdots.substack.com/p/13585-pulmonary-embolism-reports

Secondly, I'm using a couple of spell-check libraries for lot-code corrections. Do you think spaCy could be applied as well? My code can be found via https://deepdots.substack.com/p/technical-ai-fixed-150000-lot-numbers

Lastly, I'd be interested to hear if anyone ever runs across a record that started with age or gender populated and was later blanked out. Thanks, all.


1 I will read it when I get time. I am now looking at a few more things for my text analysis and will come back to this when I get a chance.

BTW, I know for sure you are touching on a topic that interests other people, so you should also leave a comment over there to get some feedback.

https://lawhealthandtech.substack.com/p/vaers-report-20/comment/11813290

Also, if you can find some retired doctors/nurses/medical professionals who have both the domain knowledge and the time, you should ask them to review this material, and not just for pulmonary embolism but systematically, across all diseases. I don't know how easy that is where you reside, but this is clearly the kind of work where teaming up with someone who has both domain knowledge and time can speed things up a lot.

2 If you are using existing Python libraries for spell correction, I think that's pretty good already.

spaCy does not offer anything for this out of the box.

The computer science term for this is "Levenshtein distance": the minimum number of single-character edits (insertions, deletions, substitutions) needed to turn one string into another.

If you are very curious, you can try rolling your own implementation in Python if you have the time. The advantage is that with a very specific use case and a lot of domain knowledge (both true in the case of lot numbers), you can _probably_ improve on the generic algorithm. I don't know for sure, so if you spend a few days and see no improvement, don't blame me :-)
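For reference, the textbook dynamic-programming version is short. This is the generic algorithm, not tuned for lot numbers in any way:

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions,
    and substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))  # distance from empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1]
```

So two lot numbers that differ by a single transcribed digit would be at distance 1, which is the kind of near-miss a spell-correction pass looks for.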

https://codereview.stackexchange.com/questions/217065/calculate-levenshtein-distance-between-two-strings-in-python

3 The last one is the easiest to answer: we usually use age + gender + vax_date to identify duplicates, so if one of those fields is missing, it becomes too hard to find out without very elaborate and computationally expensive methods.

This does not mean it never happened, just that it is very hard to find those.
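The matching logic described above can be sketched like this (field names here are illustrative, not the actual VAERS CSV column names):

```python
from collections import defaultdict

def candidate_duplicates(reports):
    """Group report dicts by (age, sex, vax_date) and return groups
    with more than one ID. Rows with any of those fields missing are
    skipped -- which is exactly why a blanked-out age or gender makes
    duplicates so hard to find."""
    groups = defaultdict(list)
    for r in reports:
        key = (r.get("age"), r.get("sex"), r.get("vax_date"))
        if None in key:
            continue  # incomplete key: cannot match reliably
        groups[key].append(r["id"])
    return {k: ids for k, ids in groups.items() if len(ids) > 1}
```

A record whose age was blanked out simply never lands in any group, so it silently drops out of the candidate list rather than showing up as a mismatch.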
