Domain Adaptation of A Large Language Model
Step 1: The Data

Adapt a pre-trained model to a new domain using HuggingFace


Large language models (LLMs) like BERT are often pre-trained on general-domain corpora such as Wikipedia and BookCorpus. When we apply them to more specialized domains such as medicine, their performance usually drops compared with models adapted to those domains.

In this article, we will explore how to adapt a pre-trained LLM such as DeBERTa-base to the medical domain using the HuggingFace Transformers library. Specifically, we will cover an effective technique called intermediate pre-training, where we continue pre-training the LLM on data from our target domain. This adapts the model to the new domain and improves its performance.
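To make the idea concrete, here is a minimal sketch of what such an intermediate pre-training run could look like with HuggingFace, continuing DeBERTa's masked-language-modeling objective on domain text. The file path, one-record-per-line format, and hyperparameters are illustrative assumptions, not the exact setup used in this series:

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Load the general-domain checkpoint and its tokenizer
model_name = "microsoft/deberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Hypothetical plain-text corpus: one medical sequence per line
dataset = load_dataset("text", data_files={"train": "medical_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# The MLM collator randomly masks tokens for the pre-training objective
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="deberta-medical",        # illustrative output directory
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=5e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```

The rest of this article focuses on the first part of that pipeline: preparing the domain data itself.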

This is a simple yet effective way to tune an LLM to your domain and gain significant improvements in downstream task performance.

Let’s start.

The first step in any project is to prepare the data. Since our dataset is in the medical domain, it contains the following fields, among many others:

(Image by author)

Listing the full set of fields here is not practical, as there are many of them. But even this glimpse at the available fields helps us form the input sequence for the LLM.

The first point to keep in mind is that the input must be a sequence, because LLMs read their input as text sequences.

To turn this into a sequence, we can inject special tags that tell the LLM what piece of information comes next. Consider the following example: name: John, surname: Doer, patientID: 1234, age: 34, preceded by a special patient tag that tells the LLM that what follows is information about a patient.
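As a rough sketch of this idea, the snippet below serializes one record into such a tagged sequence. The tag names (`<patient>`, `<diagnosis>`, ...) are hypothetical placeholders chosen for illustration; the actual tags used in this project are the ones shown in the figures:

```python
def record_to_sequence(record: dict) -> str:
    """Serialize a structured patient record into a single tagged text sequence."""
    parts = []
    # Group demographic fields under a patient tag (hypothetical tag name)
    patient_info = (
        f"name: {record['name']}, surname: {record['surname']}, "
        f"patientID: {record['patientID']}, age: {record['age']}"
    )
    parts.append(f"<patient> {patient_info}")
    # Other field groups would follow with their own tags, e.g. <diagnosis>, <medication>, ...
    return " ".join(parts)

example = {"name": "John", "surname": "Doer", "patientID": 1234, "age": 34}
print(record_to_sequence(example))
# -> <patient> name: John, surname: Doer, patientID: 1234, age: 34
```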

So we form the input sequence as follows:

(Image by author)

As you can see, we have injected four tags:

  1. : to contain…
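One practical detail worth noting: these tags are not part of DeBERTa's original vocabulary, so a common approach with HuggingFace is to register them as additional special tokens and resize the model's embedding matrix accordingly. A sketch, again with hypothetical tag names:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")
model = AutoModelForMaskedLM.from_pretrained("microsoft/deberta-base")

# Hypothetical tag names; use whichever tags you defined for your own fields
new_tags = ["<patient>", "<diagnosis>", "<medication>", "<notes>"]
tokenizer.add_special_tokens({"additional_special_tokens": new_tags})

# The embedding matrix must grow to cover the newly added tokens
model.resize_token_embeddings(len(tokenizer))
```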
