
Introducing PLIP, a foundation model for pathology

Introduction
The ongoing AI revolution is bringing us innovations in all directions. OpenAI's GPT models are leading the charge and showing how much foundation models can actually make some of our daily tasks easier. From helping us write better to streamlining some of our work, every day we see new models being announced.
Many opportunities are opening up in front of us. AI products that can help us in our work life are going to be some of the most important tools we will get in the coming years.
Where are we going to see the most impactful changes? Where can we help people accomplish their tasks faster? One of the most exciting avenues for AI models is Medical AI tools.
In this blog post, I describe PLIP (Pathology Language and Image Pre-Training), one of the first foundation models for pathology. PLIP is a vision-language model that can be used to embed images and text in the same vector space, thus enabling multi-modal applications. PLIP is derived from the original CLIP model proposed by OpenAI in 2021 and has recently been published in Nature Medicine:
Huang, Z., Bianchi, F., Yuksekgonul, M., Montine, T., Zou, J. A visual–language foundation model for pathology image analysis using medical Twitter. 2023, Nature Medicine.
Some useful links before starting our adventure:
We show that, by collecting data from social media and applying a few additional tricks, we can build a model that can be used for Medical AI pathology tasks with good results, and without the need for annotated data.
While introducing CLIP (the model from which PLIP is derived) and its contrastive loss is a bit out of the scope of this blog post, it is still good to get a first intro/refresher. The very simple idea behind CLIP is that we can build a model that puts images and text in a vector space in which “images and their descriptions are going to be close together”.
The GIF above also shows an example of how a model that embeds images and text in the same vector space can be used for classification: by putting everything in the same vector space, we can associate each image with one or more labels by considering the distance in the vector space: the closer the description is to the image, the better. We expect the closest label to be the actual label of the image.
To be clear: once CLIP is trained, you can embed any image or any text you have. Note that this GIF shows a 2D space, but in general the spaces used in CLIP have much higher dimensionality.
This means that once images and text are in the same vector space, there are many things we can do: from zero-shot classification (finding which text label is most similar to an image) to retrieval (finding which image is most similar to a given description).
How do we train CLIP? To put it simply, the model is fed with MANY image-text pairs and tries to put matching pairs close together (as in the image above) and all the rest far away. The more image-text pairs you have, the better the representation you are going to learn.
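To make this a bit more concrete, here is a minimal sketch of such a contrastive objective in PyTorch. This is not the actual CLIP or PLIP training code; the embedding dimension, batch size, and temperature value are just illustrative assumptions.

import torch
import torch.nn.functional as F

def contrastive_loss(image_embeddings, text_embeddings, temperature=0.07):
    """Symmetric contrastive loss over a batch of matching image-text pairs.

    Both inputs are (batch_size, dim) tensors; row i of each tensor
    is assumed to describe the same image-text pair.
    """
    # Normalize so that dot products are cosine similarities
    image_embeddings = F.normalize(image_embeddings, dim=-1)
    text_embeddings = F.normalize(text_embeddings, dim=-1)

    # (batch_size, batch_size) similarity matrix: entry (i, j) compares image i with text j
    logits = image_embeddings @ text_embeddings.T / temperature

    # The matching pair sits on the diagonal, so the "correct class" for row i is i
    targets = torch.arange(logits.shape[0])

    # Cross-entropy in both directions: images -> texts and texts -> images
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.T, targets)
    return (loss_i + loss_t) / 2

# Toy usage with random embeddings standing in for encoder outputs
images, texts = torch.randn(8, 512), torch.randn(8, 512)
print(contrastive_loss(images, texts))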
We will stop here with the CLIP background; this should be enough to understand the rest of this post. I have a more in-depth blog post about CLIP on Towards Data Science.
CLIP has been trained to be a very general image-text model, but it does not work as well for specific use cases (e.g., fashion (Chia et al., 2022)), and there are also cases in which CLIP underperforms and domain-specific implementations do better (Zhang et al., 2023).
We now describe how we built PLIP, our fine-tuned version of the original CLIP model that is specifically designed for pathology.
Building a Dataset for Pathology Language and Image Pre-Training
We need data, and this data has to be good enough to be used to train a model. The question is: how do we find this data? What we need is images with relevant descriptions, like the ones we saw in the GIF above.
Although there is a large amount of pathology data available on the web, it often lacks annotations and may come in non-standard formats such as PDF files, slides, or YouTube videos.
We need to look somewhere else, and this somewhere else is going to be social media. By leveraging social media platforms, we can potentially access a wealth of pathology-related content. Pathologists use social media to share their own research online and to ask questions to their fellow colleagues (see Isom et al., 2017, for a discussion on how pathologists use social media). There is also a set of generally recommended Twitter hashtags that pathologists can use to communicate.
In addition to Twitter data, we also collect a subset of images from the LAION dataset (Schuhmann et al., 2022), a vast collection of 5B image-text pairs. LAION was collected by scraping the web, and it is the dataset that was used to train many of the popular OpenCLIP models.
Pathology Twitter
We collect more than 100K tweets using pathology Twitter hashtags. The process is rather simple: we use the API to collect tweets that relate to a set of specific hashtags. We remove tweets that contain a question mark because these tweets often contain questions addressed to other pathologists (e.g., “Which kind of tumor is this?”) and not the information we need to build our model.
Sampling from LAION
LAION contains 5B image-text pairs, and our plan to collect data is as follows: we can use our own images coming from Twitter and find similar images in this huge corpus; in this way, we should be able to get reasonably similar images, and hopefully these similar images are also pathology images.
Now, doing this manually would be infeasible: embedding and searching over 5B embeddings is a very time-consuming task. Luckily, there are pre-computed vector indexes for LAION that we can query with actual images using APIs! We thus simply embed our images and use K-NN search to find similar images in LAION. Remember, each of these images comes with a caption, which is perfect for our use case.
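In practice we rely on the pre-computed LAION indexes and their public query APIs, but the underlying idea is just nearest-neighbor search over normalized embeddings. Here is a small illustrative sketch in numpy, with random arrays standing in for real CLIP embeddings:

import numpy as np

def knn_search(query_embeddings, corpus_embeddings, k=5):
    """Return the indices of the k most similar corpus items for each query.

    Embeddings are assumed to be L2-normalized, so the dot product
    is the cosine similarity.
    """
    similarities = query_embeddings @ corpus_embeddings.T  # (n_queries, n_corpus)
    # argsort is ascending, so take the last k columns and reverse them
    return np.argsort(similarities, axis=1)[:, -k:][:, ::-1]

# Toy example: 3 query images against a corpus of 1,000 captioned images
rng = np.random.default_rng(0)
queries = rng.normal(size=(3, 512))
corpus = rng.normal(size=(1000, 512))
queries /= np.linalg.norm(queries, axis=1, keepdims=True)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
print(knn_search(queries, corpus, k=5))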
Ensuring Data Quality
Not all the images we collect are good. For instance, from Twitter we collected a number of group photos from medical conferences. From LAION, we sometimes got fractal-like images that could vaguely resemble a pathology pattern.
What we did was very simple: we trained a classifier using some pathology data as the positive class and ImageNet data as the negative class. This kind of classifier has incredibly high precision (it is actually easy to distinguish pathology images from random images on the web).
In addition to this, for LAION data we apply an English language classifier to remove examples that are not in English.
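As a rough illustration of these two filters (the actual models and thresholds we used may differ), one could train a logistic regression on image embeddings to separate pathology images from everything else, and use an off-the-shelf language detector such as langdetect for the captions; the arrays below are just synthetic stand-ins:

import numpy as np
from sklearn.linear_model import LogisticRegression
from langdetect import detect

# Hypothetical embeddings: pathology images (positives) vs. ImageNet images (negatives)
rng = np.random.default_rng(0)
pathology_embeddings = rng.normal(loc=1.0, size=(200, 512))
imagenet_embeddings = rng.normal(loc=-1.0, size=(200, 512))

X = np.vstack([pathology_embeddings, imagenet_embeddings])
y = np.array([1] * 200 + [0] * 200)

# Image filter: keep only images the classifier flags as pathology
image_filter = LogisticRegression(max_iter=1000).fit(X, y)
candidate_embeddings = rng.normal(size=(10, 512))
keep_image = image_filter.predict(candidate_embeddings) == 1

# Caption filter: keep only English captions
captions = ["adenocarcinoma of the lung, H&E stain", "una imagen de histología"]
keep_caption = [detect(caption) == "en" for caption in captions]
print(keep_image, keep_caption)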
Training Pathology Language and Image Pre-Training
Data collection was the hardest part. Once that is done and we trust our data, we can start training.
To train PLIP we used the original OpenAI code for training: we implemented the training loop, added cosine annealing for the loss, and made a few tweaks here and there to make everything run smoothly and in a verifiable way (e.g., Comet ML tracking).
We trained many different models (hundreds) and compared parameters and optimization techniques. Eventually, we came up with a model we were happy with. There are more details in the paper, but one of the most important aspects of building this kind of contrastive model is making sure the batch size is as large as possible during training; this allows the model to learn to distinguish as many elements as possible.
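For reference, here is a minimal, self-contained skeleton of what such a training loop can look like. It is only a sketch: the encoders and batches are random stand-ins, and the cosine annealing here is applied to the learning rate schedule, which is the most common use of that technique.

import torch
import torch.nn.functional as F

# Toy stand-ins for the real image and text encoders
image_encoder = torch.nn.Linear(2048, 512)
text_encoder = torch.nn.Linear(768, 512)

params = list(image_encoder.parameters()) + list(text_encoder.parameters())
optimizer = torch.optim.AdamW(params, lr=1e-4)

num_epochs, steps_per_epoch, batch_size = 3, 10, 256  # large batches help contrastive learning
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs * steps_per_epoch)

for epoch in range(num_epochs):
    for step in range(steps_per_epoch):
        # Random features standing in for a batch of image-text pairs
        image_features = torch.randn(batch_size, 2048)
        text_features = torch.randn(batch_size, 768)

        image_emb = F.normalize(image_encoder(image_features), dim=-1)
        text_emb = F.normalize(text_encoder(text_features), dim=-1)

        # Symmetric contrastive loss, as in the sketch earlier in this post
        logits = image_emb @ text_emb.T / 0.07
        targets = torch.arange(batch_size)
        loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()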
It is now time to put PLIP to the test. Is this foundation model any good on standard benchmarks?
We run different tests to evaluate the performance of our PLIP model. The three most interesting ones are zero-shot classification, linear probing, and retrieval, but I will mainly focus on the first two here. I will skip the experimental configuration for the sake of brevity, but it is all available in the manuscript.
PLIP as a Zero-Shot Classifier
The GIF below illustrates how to do zero-shot classification with a model like PLIP. We use the dot product as a measure of similarity in the vector space (the higher, the more similar).
In the following plot, you can see a quick comparison of PLIP vs CLIP on one of the datasets we used for zero-shot classification. There is a significant gain in performance when using PLIP instead of CLIP.
PLIP as a Feature Extractor for Linear Probing
Another way to use PLIP is as a feature extractor for pathology images. During training, PLIP sees many pathology images and learns to build vector embeddings for them.
Let's say you have some annotated data and you want to train a new pathology classifier. You can extract image embeddings with PLIP and then train a logistic regression (or any model you like) on top of these embeddings. This is a simple and effective way to perform a classification task.
Why does this work? The idea is that, for training a classifier, PLIP embeddings, being pathology-specific, should be better than CLIP embeddings, which are general purpose.
Here is an example of the comparison between the performance of CLIP and PLIP on two datasets. While CLIP gets good performance, the results we get using PLIP are much better.
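Here is a minimal sketch of linear probing with scikit-learn; the random arrays below stand in for PLIP image embeddings and their labels, which you would extract with the code shown later in this post.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-ins for PLIP image embeddings and their pathology labels
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 512))
labels = rng.integers(0, 2, size=500)

X_train, X_test, y_train, y_test = train_test_split(embeddings, labels, test_size=0.2, random_state=0)

# The "linear probe": a logistic regression trained on frozen embeddings
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", probe.score(X_test, y_test))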
How do you use PLIP? Here are some examples of how to use PLIP in Python, plus a Streamlit demo you can use to play a bit with the model.
Code: APIs to Use PLIP
Our GitHub repository offers a couple of additional examples you can follow. We have built an API that allows you to interact with the model easily:
from plip.plip import PLIP
import numpy as np

plip = PLIP('vinid/plip')

# images is a list of PIL images and texts is a list of strings
# we create image embeddings and text embeddings
image_embeddings = plip.encode_images(images, batch_size=32)
text_embeddings = plip.encode_text(texts, batch_size=32)

# we normalize the embeddings to unit norm (so that we can use the dot product instead of cosine similarity for comparisons)
image_embeddings = image_embeddings/np.linalg.norm(image_embeddings, ord=2, axis=-1, keepdims=True)
text_embeddings = text_embeddings/np.linalg.norm(text_embeddings, ord=2, axis=-1, keepdims=True)
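Continuing from the snippet above, zero-shot classification then reduces to a dot product followed by an argmax, assuming texts holds one candidate label description per class:

# score every image against every candidate label description
similarity = image_embeddings @ text_embeddings.T  # shape: (n_images, n_labels)
predicted_label_ids = similarity.argmax(axis=-1)
predictions = [texts[i] for i in predicted_label_ids]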
You can also use the more standard Hugging Face API to load and use the model:
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("vinid/plip")
processor = CLIPProcessor.from_pretrained("vinid/plip")

image = Image.open("images/image1.jpg")

# score the image against two candidate label descriptions
inputs = processor(text=["a photo of label 1", "a photo of label 2"],
                   images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)  # label probabilities for the image
Demo: PLIP as an Educational Tool
We also believe PLIP and future models can be effectively used as educational tools for Medical AI. PLIP allows users to do zero-shot retrieval: a user can search for specific keywords and PLIP will try to find the most similar/matching image. We built a simple web app in Streamlit that you can find here.
Thanks for reading all of this! We are excited about the possible future evolutions of this technology.
I will close this blog post by discussing some very important limitations of PLIP and by suggesting some additional things I have written that might be of interest.
Limitations
While our results are interesting, PLIP comes with a number of limitations. Data is not enough to learn all the complex aspects of pathology. We have built data filters to ensure data quality, but we need better evaluation metrics to understand what the model gets right and what it gets wrong.
More importantly, PLIP does not solve the current challenges of pathology; PLIP is not a perfect tool and can make many errors that require investigation. The results we see are definitely promising, and they open up a range of possibilities for future models in pathology that combine vision and language. However, there is still a lot of work to do before we can see these tools used in everyday medicine.
Miscellanea
I have a couple of other blog posts regarding CLIP modeling and CLIP limitations. For example:
References
Chia, P.J., Attanasio, G., Bianchi, F., Terragni, S., Magalhães, A.R., Gonçalves, D., Greco, C., & Tagliabue, J. (2022). Contrastive language and vision learning of general fashion concepts. Scientific Reports, 12.
Isom, J.A., Walsh, M., & Gardner, J.M. (2017). Social Media and Pathology: Where Are We Now and Why Does it Matter? Advances in Anatomic Pathology.
Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., Schramowski, P., Kundurthy, S., Crowson, K., Schmidt, L., Kaczmarczyk, R., & Jitsev, J. (2022). LAION-5B: An open large-scale dataset for training next generation image-text models. ArXiv, abs/2210.08402.
Zhang, S., Xu, Y., Usuyama, N., Bagga, J.K., Tinn, R., Preston, S., Rao, R.N., Wei, M., Valluri, N., Wong, C., Lungren, M.P., Naumann, T., & Poon, H. (2023). Large-Scale Domain-Specific Pretraining for Biomedical Vision-Language Processing. ArXiv, abs/2303.00915.