How LLMs can accelerate vaccine safety research
Large Language Models are about to dramatically improve the quality, quantity and presentation of vaccine safety research (and probably of every other field)
Summary:
- Large Language Models (LLMs) can help researchers extract structured data from clinical narratives like VAERS reports, making vaccine safety information more searchable and analyzable.
- They can transform complex data into visual presentations and charts, making research articles more accessible and easier to understand.
- LLMs can effectively summarize long Substack articles and convert video transcripts into structured, searchable text documents called OutScripts.
- LLMs can parse and format difficult-to-read PDF documents (like clinical trial data) into clear, structured tables and searchable content.
- Discourse forum software now integrates with LLMs (like GPT-4, Claude, and Gemini) to enable AI-powered searches through forum content.
Note: the Discourse forum I have linked in this article is only a demo and I will delete it in mid-Dec 2024
I think people who are doing vaccine safety research will benefit a lot from learning about and using Large Language Models.
Here are some things the LLMs can already do.
Extract structured output from clinical narratives to create searchable datasets
In a recent article, I mentioned how I prompted GPT4 to read a list of 16K VAERS death reports and tell me:
a) whether the report mentioned that the person had ever tested positive for COVID19
b) whether the report mentioned that COVID19 was responsible for the patient’s death
This is an example of extracting Structured Output from the input prompt.
You can now get all kinds of Structured Output using the GPT4 and Gemini APIs. While they are not 100% accurate, they are very close.
Converting free form text1 into structured data makes all this information easy to search2 within analytics software like Zoho Analytics and search tools like Algolia.
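A minimal sketch of the extraction step described above. The two boolean fields mirror the questions I asked about each VAERS report; the schema and field names here are illustrative examples, not the exact ones I used, and the actual API call to GPT4 or Gemini is omitted — this only shows how you would validate the structured reply before loading it into an analytics tool.

```python
import json

# Hypothetical schema for the two yes/no questions described above.
SCHEMA = {
    "type": "object",
    "properties": {
        "tested_positive_for_covid19": {"type": "boolean"},
        "covid19_caused_death": {"type": "boolean"},
    },
    "required": ["tested_positive_for_covid19", "covid19_caused_death"],
}

def parse_structured_reply(reply_text: str) -> dict:
    """Validate a model's JSON reply against the two expected fields."""
    data = json.loads(reply_text)
    for field in SCHEMA["required"]:
        if not isinstance(data.get(field), bool):
            raise ValueError(f"missing or non-boolean field: {field}")
    return data

# A reply like this is what you would get back when you request
# structured (JSON-mode) output with a schema like the one above:
reply = '{"tested_positive_for_covid19": true, "covid19_caused_death": false}'
print(parse_structured_reply(reply))
```

Validating every reply this way also catches the occasional malformed response before it silently corrupts your dataset.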
Visualize data in articles
Reading long Substack articles is time consuming, and sometimes just a bit boring. This is especially true when they include a lot of data analysis.
While summaries are helpful, charts and diagrams are usually a better way to explain the data. Even if they do not provide full details, they make it easy for people to get the gist of the article, and interested readers can always go on to read the full article.
You can get Claude to generate an entire HTML presentation by simply pasting a data-heavy article into its prompt and asking it to render charts based on the information provided in the article!
Here is an example prompt3 for my v-safe text analysis article.
Here is the output4 Claude generated
I would say this is pretty good.
Sometimes the LLM might make some errors. Please remember to double check the results!
Summarize Substack articles
This is a well-known and pretty common use case.
LLMs do a very good job of summarizing information, so you can ask them to summarize Substack articles.
You can then paste the summary into your website or a Discourse5 forum.
Convert video transcripts into articles
You can paste the transcript6 of a video, generate video chapters from it, and then combine the chapters (the Outline) with the transcript to create something called an OutScript.
An OutScript provides multiple benefits:
- it reads like an article, so people who don’t like to sit through videos might find it easier to grasp the main points
- it makes it easy to skim the contents quickly before watching a video, so you can decide whether you want to watch it and where you want to begin
- it makes it easy to recall the information quickly after7 watching a video
- it makes it suitable for GPT-powered search, which I will discuss later in this article
You can paste the Markdown into Discourse
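The outline-plus-transcript assembly can be sketched in a few lines. The chapter titles, timestamps (seconds from the start), and transcript sentences below are made-up placeholder data; the idea is simply to group each timestamped sentence under the chapter it falls in and emit Markdown.

```python
# Made-up example data: (start_time_seconds, text)
chapters = [(0, "Introduction"), (95, "Main argument")]
transcript = [
    (0, "Hello and welcome."),
    (40, "Today we discuss the data."),
    (95, "The first point is this."),
]

def build_outscript(chapters, transcript):
    """Interleave chapter headings with the transcript sentences
    whose timestamps fall inside each chapter, as Markdown."""
    lines = []
    for i, (start, title) in enumerate(chapters):
        # A chapter ends where the next one begins.
        end = chapters[i + 1][0] if i + 1 < len(chapters) else float("inf")
        lines.append(f"## {title}")
        for ts, sentence in transcript:
            if start <= ts < end:
                lines.append(sentence)
        lines.append("")  # blank line between chapters
    return "\n".join(lines).rstrip() + "\n"

print(build_outscript(chapters, transcript))
```

As noted in the footnotes, this only works well when the transcript timestamps are reasonably dense; sparse timestamps put sentences in the wrong chapter.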
Format tabular information inside FOIA’ed PDF files
Recently, someone asked me to parse some PDF files and compile some summary information to help out with his analysis of Pfizer FOIA PDF files.
You can write some Python code to extract text from PDF files, and people have been doing that for many years.
But tabular information inside PDF files has remained quite a challenge to date.
For example, a PDF page which looks like this (notice that the second Value column is itself in a tabular format)
This is the output you get when you parse the page using a PDF parser library like PDF Plumber:
"table": [
[
"Header Text: c4591001\nVisit: SCR Form: GENERAL MEDICAL HISTORY - eCRF Audit\nTrail History\nForm Version: 22-Apr-2020 21:03 Form Status: Data Complete, Locked, Frozen, Verified\nSite No: 1007 Site Name: (1007) Cincinnati Children's Hospital Medical\nCenter\nSubject No: 10071050 Subject Initials: ---\nGenerated By: pfe.levissc Generated Time (GMT): 29-Mar-2021 04:12",
null,
null,
null,
null
],
[
"1.b Ongoing:",
null,
null,
null,
null
],
[
"Date",
"Location",
"User",
"Value",
"Reason"
],
[
"Jun-22-2020\n11:44:51\n(UTC-05:00)\nEastern Time (US\n& Canada)",
"ACV0PFEINFP6000",
"Laura\nPace\n(pfe.lpace)",
"Data Entry:\nYES",
"Initial Entry"
],
[
"1.c",
null,
null,
null,
null
],
[
"Date",
"Location",
"User",
"Value",
"Reason"
],
[
"Jun-22-2020\n11:45:14\n(UTC-05:00)\nEastern Time (US\n& Canada)",
"ACV0PFEINFP6000",
"auto calc\n(autocalc)",
"Data Entry:\nLine/MH N 3\number:\nMedical His Anemia\ntory Term:\nStart Date: Sep/UN\nK/2019\nOngoing: YES",
"Initial Entry"
]
]
}
Notice that the “Data Entry” field JSON is unformatted. The line breaks usually end up distorting the entire meaning of the data if you parse it directly.
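The simplest of these line-break problems can be fixed in plain Python by joining the hard wraps inside each cell. This naive sketch handles cells like `"Data Entry:\nYES"`; the badly interleaved cells (where two sub-columns are shuffled together, like the "Line/MH N 3\number" example above) are the kind I hand to the LLM instead.

```python
def flatten_cell(cell):
    """Join the hard line breaks a PDF parser leaves inside a table cell.
    This is a naive heuristic: it cannot repair cells where two
    sub-columns were interleaved during extraction."""
    if cell is None:
        return ""
    return " ".join(part.strip() for part in cell.splitlines())

# One row from the parsed output shown above:
row = ["1.b Ongoing:", None, None, "Data Entry:\nYES", "Initial Entry"]
print([flatten_cell(c) for c in row])
```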
But you can ask Gemini LLM to fix this, and you will get this Markdown:
## Summary of PDF Content
**Non-Tabular Text:**
* Header Text: c4591001
* Visit: SCR Form: GENERAL MEDICAL HISTORY - eCRF Audit
* Trail History
* Form Version: 22-Apr-2020 21:03
* Form Status: Data Complete, Locked, Frozen, Verified
* Site No: 1007
* Site Name: (1007) Cincinnati Children's Hospital Medical Center
* Subject No: 10071050
* Subject Initials: ---
* Generated By: pfe.levissc
* Generated Time (GMT): 29-Mar-2021 04:12
* Page 290 of 614
* ***Confidential***
* )TMG(
* 13:50
* 1202-rpA-10
* :nO
* laniF\laniF\a204ea691e771090
* FDA-CBER-2021-5683-0129391
**Tabular Information:**
**Table 1: 1.b Ongoing**
| Date | Location | User | Value | Reason |
| ------------------------ | --------------- | ---------------- | --------------- | ------------- |
| Jun-22-2020 11:44:51 (UTC-05:00) Eastern Time (US & Canada) | ACV0PFEINFP6000 | Laura Pace (pfe.lpace) | YES | Initial Entry |
**Table 2: 1.c**
| Date | Location | User | Value | Reason |
| ------------------------ | --------------- | --------------- | -------------------------------------------- | ------------- |
| Jun-22-2020 11:45:14 (UTC-05:00) Eastern Time (US & Canada) | ACV0PFEINFP6000 | auto calc (autocalc) | Line/MH Number: 3<br>Medical History Term: Anemia<br>Start Date: Sep/UNK/2019<br>Ongoing: YES | Initial Entry |
Note: Some line breaks within table cells have been preserved for clarity. The "UNK" in the date is likely meant to represent "Unknown".
The Markdown gets rendered like this inside Discourse:
In other words, Gemini LLM actually fixed the parsed PDF data in a way which renders like a proper table, even better than it renders inside the PDF8!
Bonus: Use AI Bots inside Discourse
This is probably the most interesting development for people who are doing vaccine safety research, although it is still something of an experimental feature inside Discourse and probably has plenty of scope for improvement.
You can actually ask an AI bot to answer questions based only on the contents of your forum using an off-the-shelf plugin which is already available9 inside Discourse.
Since Discourse has a robust API for publishing topics/posts etc, you can even automate a lot of steps while publishing all this content.
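As a sketch of that automation: the Discourse API accepts a POST to `/posts.json` with `Api-Key` and `Api-Username` headers to create a topic. The hostname, key, and content below are placeholders; this only builds the request, and you would send it with an HTTP library of your choice.

```python
def build_discourse_request(title, raw, category_id, api_key, api_username):
    """Assemble the URL, headers, and payload for creating a new topic
    via the Discourse API (POST /posts.json)."""
    url = "https://your-forum.example.com/posts.json"  # placeholder host
    headers = {
        "Api-Key": api_key,
        "Api-Username": api_username,
        "Content-Type": "application/json",
    }
    payload = {"title": title, "raw": raw, "category": category_id}
    return url, headers, payload

url, headers, payload = build_discourse_request(
    "v-safe summary", "LLM-generated summary goes here...", 5, "KEY", "system"
)
# You would then POST this, e.g. requests.post(url, json=payload, headers=headers)
```

Chaining this with the summarization and OutScript steps above lets you publish LLM-generated content to the forum without any manual copy/pasting.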
Discourse already supports most of the popular LLMs.
I asked a question based on the parsed PDF page (from earlier in this article) to Gemini Pro, GPT4 Turbo and Claude Sonnet10 to see which one gives me the best answer.
First choose the type of AI assistant you need. You will choose Forum Helper if you want the AI bot to answer questions based on the forum’s topics.
Then choose one of the LLMs you have already configured in the Admin settings.
Discourse preconfigures this to be available only via Direct Messages to the appropriate LLM (it is as if you are having a DM chat on Twitter, except on the other side you have a bot instead of a human). So you cannot use this to get the LLM bot to answer public topics11. For the same reason, I cannot share the chat with the public either and can only provide screenshots.
As you can see, the response from Gemini Pro is wrong.
GPT4 is a little better.
Claude Sonnet provides the best12 response.
You can also chat back and forth within that particular conversation if you want to ask followup questions, and the AI automatically names the conversation like most LLMs do these days.
Since all these companies are constantly updating their models, you might find that one of the companies (Google, OpenAI, Anthropic, Twitter) might have a better response when you test it.
You might also choose the best model depending on the subject matter of the forum. Or you might choose to ask the question to multiple bots13 and choose the best one.
Now imagine if you could add OutScripts of all the interviews of vaccine skeptics on the DarkHorse Podcast to your Discourse forum. In theory, you should be able to easily respond to all the useless Twitter trolls with very specific citations :-)
Questions?
Do you have any questions about the topic of LLMs, or broadly on the topic of Natural Language Understanding and text analytics?
Let me know in the comments below, and I will do my best to answer them.
Free form text is notoriously hard to “grok” when you are doing data analysis, because it is inherently non-relational
I would also add that creating data subsets of VAERS for use within Zoho Analytics and Algolia makes it much easier for people to see exactly how organizations like CDC are suppressing important information. It is far superior to the current interface that the CDC provides for VAERS search.
Reading the prompt, you might notice that it already involved some “prompt engineering”. You might be surprised by how much better the results become when you spend some time engineering your prompt.
For example, you can add LLM-generated summaries of Substack articles on the Kennedy MAHA forum which is built on Discourse.
The transcript should include timestamps for each sentence. If the timestamps are too far apart, the results are not very reliable.
For example, I make heavy use of OutScripts whenever I write rebuttals of videos made by people like Dan “Debunk the Funk” Wilson
By the way, the important thing here isn’t just the fact that it renders better. The underlying information is now in a structured data format which is also easier to search. However, you can clearly see that it omitted some words and also added its own table name, so obviously this does come at a little cost in terms of accuracy, which is generally true for all LLMs. Don’t use it blindly!
The Discourse hosting service I use is called CommuniteQ which includes the AI plugin as part of its least expensive plan.
You can also use other large language models - you need to have an API key (usually paid) and you also need to provide the URL for the appropriate API method
Someone was asking the Discourse developers for this feature. Apparently it is possible to do this with some kind of automation, but I need to pay for the higher tier in order to use it in my demo forum.
The Discourse AI plugin uses “Retrieval Augmented Generation” (aka RAG) under the hood. If you are familiar with how RAG works you can probably infer that GPT4 Turbo did not add all the required information into the prompt, while Claude Sonnet did add the information.
Obviously all of these requests incur costs against your API key. So please remember that before you overuse this feature.
You know, you could be of enormous help to the two old geezers from Trust the Evidence substack.