Bootstrapping Labels with GPT-4
Leveraging GPT-4’s Predictions for Data Pre-labeling
Reviewing Pre-labeled Data in Label Studio
Cost Analysis
Beyond Sentiment Analysis: Label Any NLP Task
Conclusion

A cheap approach to data labeling

Data labeling is a critical component of machine learning projects. It's governed by the old adage, "garbage in, garbage out." Labeling involves creating annotated datasets for training and evaluation. But this process can be time-consuming and expensive, especially for projects with large amounts of data. What if we could use the advances in LLMs to reduce the cost and effort involved in data labeling tasks?

GPT-4 is a state-of-the-art language model developed by OpenAI. It has a remarkable ability to understand and generate human-like text, and it has been a game changer in the natural language processing (NLP) community and beyond. In this blog post, we'll explore how you can use GPT-4 to bootstrap labels for various tasks. This can significantly reduce the time and cost involved in the labeling process. We'll focus on sentiment classification to demonstrate how prompt engineering lets you create accurate and reliable labels with GPT-4, and how this technique can be used for far more powerful things as well.

Leveraging GPT-4's Predictions for Data Pre-labeling

As in writing, editing is often less strenuous than composing the original work. That's why starting with pre-labeled data is more attractive than starting with a blank slate. Using GPT-4 as a prediction engine to pre-label data stems from its ability to understand context and generate human-like text. This makes it an excellent way to reduce the manual effort required for data labeling, which can lead to cost savings and make the labeling process less mundane.

So how do we do this? If you've used GPT models, you're probably familiar with prompts. Prompts set the context for the model before it begins generating output and can be tweaked and engineered (i.e., prompt engineering) to help the model deliver highly specific results. This means we can create prompts that guide GPT-4 to generate text that looks like model predictions. For our use case, we'll craft our prompts in a way that steers the model toward producing the desired output format as well.

Let's take a simple example: sentiment analysis. If we are trying to classify the sentiment of a given string of text as positive, negative, or neutral, we could provide a prompt like:

"Classify the sentiment of the next text as 'positive', 'negative', or 'neutral': "

Once we have a well-structured prompt, we can use the OpenAI API to generate predictions from GPT-4. Here's an example using Python:

import openai
import re

openai.api_key = ""  # add your OpenAI API key here

def get_sentiment(input_text):
    prompt = f"Respond in the json format: {{'response': sentiment_classification}}\nText: {input_text}\nSentiment (positive, neutral, negative):"
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # or "gpt-4" for higher-quality labels at a higher price
        messages=[
            {"role": "user", "content": prompt}
        ],
        max_tokens=40,
        n=1,
        stop=None,
        temperature=0.5,
    )
    response_text = response.choices[0].message['content'].strip()
    # Extract the sentiment keyword even if the model adds surrounding text
    match = re.search("negative|neutral|positive", response_text)
    sentiment = match.group(0) if match else "neutral"  # fallback if no keyword is found
    # Add input_text back in for the result
    return {"text": input_text, "response": sentiment}

We can run this on a single example to check the output we're getting from the API.

# Test single example
sample_text = "I had a terrible time at the party last night!"
sentiment = get_sentiment(sample_text)
print("Result:\n", f"{sentiment}")

Result:
{'text': 'I had a terrible time at the party last night!', 'response': 'negative'}
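When labeling a whole dataset, individual API calls can occasionally fail due to rate limits or transient errors. A simple retry wrapper keeps a long labeling run from dying midway; this is a sketch (the function name and retry counts are our own) against the same pre-1.0 openai SDK used above:

import time

def get_sentiment_with_retry(input_text, retries=3):
    # Retry with exponential backoff on transient API errors
    for attempt in range(retries):
        try:
            return get_sentiment(input_text)
        except (openai.error.RateLimitError, openai.error.APIError):
            time.sleep(2 ** attempt)
    # Final attempt; let the exception propagate if it still fails
    return get_sentiment(input_text)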

Once we're satisfied with our prompt and the results we're getting, we can scale this up to our entire dataset. Here, we'll assume a text file with one example per line.

import json

input_file_path = "input_texts.txt"
output_file_path = "output_responses.json"

with open(input_file_path, "r") as input_file, open(output_file_path, "w") as output_file:
    examples = []
    for line in input_file:
        text = line.strip()
        if text:
            # convert_ls_format wraps each prediction in Label Studio's task format
            examples.append(convert_ls_format(get_sentiment(text)))
    output_file.write(json.dumps(examples))
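The loop above relies on a convert_ls_format helper to wrap each prediction in Label Studio's pre-annotated task import format (the full version is in the example notebook linked below). A minimal sketch might look like this; note that the from_name/to_name values, "sentiment" and "text", are assumptions that must match the tag names in your labeling configuration:

def convert_ls_format(result):
    # Wrap a prediction in Label Studio's task import format.
    # "sentiment" and "text" must match the <Choices> and <Text> tag
    # names in the project's labeling configuration.
    return {
        "data": {"text": result["text"]},
        "predictions": [{
            "result": [{
                "from_name": "sentiment",
                "to_name": "text",
                "type": "choices",
                # Capitalize to match choice values like "Negative" in the config
                "value": {"choices": [result["response"].capitalize()]},
            }]
        }],
    }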

We can import the data with pre-labeled predictions into Label Studio and have reviewers confirm or correct the labels. This approach significantly reduces the manual work required for data labeling, as human reviewers only need to validate or correct the model-generated labels rather than annotating the entire dataset from scratch. See our full example notebook here.

Note that in most situations, OpenAI is allowed to use any information sent to its APIs to further train its models. So it's important not to send protected or private data to these APIs for labeling if we don't want to expose the information more broadly.

Reviewing Pre-labeled Data in Label Studio

Once we have our pre-labeled data ready, we'll import it into a data labeling tool, such as Label Studio, for review. This section will guide you through setting up a Label Studio project, importing the pre-labeled data, and reviewing the annotations.

Figure 1: Reviewing sentiment classification in Label Studio. (Image by author, screenshot from Label Studio)

Step 1: Install and Launch Label Studio

First, you need to have Label Studio installed on your machine. You can install it using pip:

pip install label-studio

After installing Label Studio, launch it by running the following command:

label-studio

This will open Label Studio in your default web browser.

Step 2: Create a New Project

Click on "Create Project" and enter a project name, such as "Review Bootstrapped Labels." Next, you need to define the labeling configuration. For sentiment analysis, we can use the Sentiment Analysis Text Classification template.

These templates are configurable, so if we want to change any of the properties, it's straightforward. The default labeling configuration looks like the following (per Label Studio's standard sentiment analysis template; exact tag names and choice values may vary by version):

<View>
  <Header value="Choose text sentiment:"/>
  <Text name="text" value="$text"/>
  <Choices name="sentiment" toName="text" choice="single" showInline="true">
    <Choice value="Positive"/>
    <Choice value="Negative"/>
    <Choice value="Neutral"/>
  </Choices>
</View>
Click "Create" to finish setting up the project.

Step 3: Import Pre-labeled Data

To import the pre-labeled data, click the "Import" button and select the pre-labeled data file generated earlier (e.g., "output_responses.json"). The data will be imported along with the pre-populated predictions.

Step 4: Review and Update Labels

After importing the data, you can review the model-generated labels. The annotation interface will display the pre-labeled sentiment for each text sample, and reviewers can either accept or correct the suggested label.

You can improve quality further by having multiple annotators review each example.

By using GPT-4-generated labels as a starting point, the review process becomes far more efficient, and reviewers can focus on validating or correcting the annotations rather than creating them from scratch.

Step 5: Export Labeled Data

Once the review process is complete, you can export the labeled data by clicking the "Export" button in the "Data Manager" tab. Select the desired output format (e.g., JSON, CSV, or TSV), and save the labeled dataset for further use in your machine learning project.
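If you export as JSON, each task carries its reviewed annotations alongside the original text. As a rough sketch (the filename is hypothetical; the structure follows Label Studio's standard JSON export), you can flatten the export into (text, label) pairs for training:

import json

# Hypothetical filename; use whatever you saved the export as
with open("exported_labels.json", "r") as f:
    tasks = json.load(f)

dataset = []
for task in tasks:
    text = task["data"]["text"]
    # Use the first reviewed annotation's choice as the final label
    annotations = task.get("annotations", [])
    if annotations and annotations[0]["result"]:
        label = annotations[0]["result"][0]["value"]["choices"][0]
        dataset.append((text, label))

print(f"Loaded {len(dataset)} labeled examples")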
