From Chaos to Clarity: Streamlining Data Cleansing Using Large Language Models
The Use Case
It's all in the Prompt
API to the Rescue
The Way of the Code
The Execution
In Search of More Nails
The Final Nail
The Co-conspirators
Reference

Cleaning up survey responses using OpenAI's GPT model. Full code with GitHub link.

Image by DALL-E 2. Generated and modified by the author.

In the digital age, accurate and reliable data is paramount for businesses striving to deliver personalized experiences and make informed decisions [1]. However, the sheer volume and complexity of information often pose significant challenges, requiring many hours of tedious and manual work. Enter the game-changing technology of large language models (LLMs). These advanced AI tools, with their natural language processing capabilities and pattern recognition, have the potential to revolutionize the process of cleansing data to make it more usable.

Among the wrenches and screwdrivers in the data scientist's tool chest, LLMs are reshaping activities and harnessing their power to enhance data quality. The proverbial whack of a hammer will unlock actionable insights and ultimately pave the way for better customer experiences.

That said, let's drill right into the use case that I will be using as an illustration today.

The Use Case

Photo by Scott Graham on Unsplash

The worst thing you could do when conducting a survey among students is to leave a factual field as free-form text! You can imagine some of the responses we got.

Jokes aside, one of our clients, Study Fetch, an AI-powered platform that uses course material to create personalized all-in-one study sets for students, conducted a survey among university students. After receiving a whopping 10K+ responses, their CEO and Co-Founder, Esan Durrani, stumbled upon a little hiccup. Turns out, the "major" field in the survey was a free-form text box, meaning respondents could type in whatever they pleased. Now, as data scientists, we know that's not the brightest move if you want to crunch some statistical numbers. So, the raw data from the survey ended up looking like this…

Anthropology 
Chem E
Computer Science
Business and Law
Drama
cs
IMB

Oh my! Get ready to grab your Excel and embark on a sorting adventure that may take you a mere hour or, who knows, possibly even three. Only then will this data heresy be thoroughly purged.

Yet, fear not, for we have the hammer of the Large Language Model (LLM).

As a wise man once said, if a hammer is all you have, everything looks like a nail. And boy, doesn't this data cleansing job look like the most perfect nail?

We can simply ask our friendly neighborhood LLM to classify these into known majors. Specifically, OpenAI's Generative Pre-trained Transformers (GPT), the LLM that powers the popular chatbot app ChatGPT, will work for this case. GPT models use upwards of 175 billion parameters and have been trained on 2.6 billion stored web pages scraped from Common Crawl, an open dataset. Moreover, through a technique known as reinforcement learning from human feedback (RLHF), trainers can nudge and prod the model into providing more accurate and useful responses. [2]

I think for our purpose, 175 billion+ parameters should do just fine, as long as we're able to come up with the right prompt.

It's all in the Prompt

Photo by Kelly Sikkema on Unsplash

Ryan and Esan, from the AI company whose bread and butter is writing great prompts, proposed the first version of our prompt. It was a great one and worked quite well using language inference [3], but there were two things that could be improved:

  • It was written to work for one record
  • It was written as a 'Completion' using the Da Vinci model (my bank account recoiled in fear at the mere mention of it)

It would cost us too much, and that simply wasn't going to do. So, Ryan and I independently rewrote the prompt as a chat prompt using 'gpt-3.5-turbo' to perform bulk action. OpenAI's prompt best practices and the course ChatGPT Prompt Engineering for Developers came in handy for me. We went through a few more iterations of ideate, implement, analyze, and reform, and we had a good working version.

Without further ado, here is the prompt after revision 2:

1. Arts and Humanities: Literature and humanities.
2. Social Sciences: psychology, sociology, and political science
3. Business and Management: marketing, finance, economics, MBA
4. Life Sciences: Biology, environmental, agriculture-related
5. Engineering and Technology: Electrical, mechanical, etc.
6. Medicine Health and Allied Sciences: Nursing, medical school
7. All Other Programs: Everything else.

The above information pertains to categories of undergraduate programs.
Categorize the below data into one of the program categories above.

Desired Format: CSV of data row, program name

Data
###
Drama
Literature
Physics
Health sciences
Mechanical Engineering
###

The response from the LLM for this prompt was:

Drama, Arts and Humanities: Literature and humanities
IMB, All Other Programs: Everything else
Health sciences, Medicine Health and Allied Sciences: Nursing, medical school
MBA, Business and Management: marketing, finance, economics, MBA
Mechanical Engineering, Engineering and Technology: Electrical, mechanical, etc.

This would work, sort of. But I didn't quite like the redundant text with the long program names repeating. With LLMs, text is tokens and tokens cost money. You see, my programming skills were forged in the fiery depths of the Dot-Com Bust. And let me tell you, I never pass up a chance for some cost-saving optimizations.

So, I modified the prompt slightly in the 'Desired Format' section. I asked the model to output just the ordinal number of the survey response (e.g., 1 for Drama above) and the ordinal number of the program (e.g., 1 for the Arts and Humanities category). Then Ryan suggested I should request a JSON output instead of CSV to make it simpler to parse. He also recommended I add an 'example output' section, a good suggestion.

The final prompt is as follows (simplified for clarity):

1. Arts and Humanities: Literature and humanities.
2. Social Sciences: psychology, sociology, and political science
3. Business and Management: marketing, finance, economics, MBA
4. Life Sciences: Biology, environmental, agriculture-related
5. Engineering and Technology: Electrical, mechanical, etc.
6. Medicine Health and Allied Sciences: Nursing, medical school
7. All Other Programs: Everything else.

The above information pertains to categories of undergraduate programs.
Categorize the below data into one of the program categories above.

Desired Format: JSON format of an array with response number, program number

Example output: [[1,7],[2,1],[3,5],[4,8],[5,7]]

Data
###
1. Drama
2. IMB
3. Health sciences
4. MBA
5. Mechanical Engineering
###

The full prompt we used can be viewed at the GitHub link here.

The output from the model:

[
[1, 1],
[2, 3],
[3, 6],
[4, 3],
[5, 5]
]

So, as discussed earlier, the output from the model is a mapping between the ordinal numbers of the survey responses and the categories we defined. Take the first line, for instance: 1,1. That means 1 is the response number and 1 is the corresponding mapped program number. Survey response 1 is "Drama" and the mapped program 1 is "Arts and Humanities". This looks right! Drama in its proper #1 place, all eyes on it.

While the output at first glance looks like the output of embeddings (used in clustering and dimensionality reduction), it is simply the same mapped information with just the ordinal positions. In addition to providing some cost advantages on token usage, the numbers are easier to parse.

We can now translate the original survey responses in the file into meaningful majors, do aggregation, and gain useful, actionable insights.
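To make that translation concrete, here is a minimal sketch of turning the ordinal pairs back into a response-to-major lookup. The surveyResponses and programCategories arrays are illustrative stand-ins for the batch and category list sent in the prompt, not the production data:

// Illustrative only: map [responseNumber, programNumber] pairs back to text.
// Both ordinals are 1-based, matching the numbering used in the prompt.
const surveyResponses = ["Drama", "IMB", "Health sciences", "MBA", "Mechanical Engineering"];
const programCategories = [
  "Arts and Humanities", "Social Sciences", "Business and Management",
  "Life Sciences", "Engineering and Technology",
  "Medicine Health and Allied Sciences", "All Other Programs"
];

const modelOutput = [[1, 1], [2, 3], [3, 6], [4, 3], [5, 5]];

const dataMapping = {};
for (const [responseNum, programNum] of modelOutput) {
  dataMapping[surveyResponses[responseNum - 1]] = programCategories[programNum - 1];
}

console.log(dataMapping["Drama"]); // "Arts and Humanities"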

But wait, I'm not going to sit in front of my computer, type each block of survey responses into the browser, and compile the mappings. Besides being mind-numbing, the error rate would simply not do.

What we need is some good old automation. Enter the API…

API to the Rescue

Photo by Laura Ockel on Unsplash

As you may be aware, an Application Programming Interface (API) allows our program to interact with third-party services efficiently. While many people are accomplishing impressive feats with ChatGPT, the true potential of language models lies in using the API to seamlessly integrate natural language capabilities into an application, making them imperceptible to the users. Much like the incredible science and technology that goes into the phone or computer you're using to read this article.

If you don't already have it, you can request access to the API here: https://openai.com/blog/openai-api [4]. Once you sign up and get your API key, the specification can be found here. Some really helpful examples with code samples can be found here. The playground is a nice feature to test the prompt with various settings before you put it into code [5].

We will be using the chat completions API over REST. A sample payload of the call is as follows:

{
  "model": "gpt-3.5-turbo",
  "temperature": 0,
  "n": 1,
  "messages": [
    {"role": "user", "content": "Hello, I'm a nail."}
  ]
}
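For completeness, here is a minimal sketch of how such a payload could be posted to the chat completions REST endpoint. It assumes Node 18+ (for the global fetch) and an OPENAI_API_KEY environment variable; the actual script described later uses the OpenAI Node library instead:

// Minimal sketch: send the sample payload to the chat completions endpoint.
// Assumes Node 18+ (global fetch) and an OPENAI_API_KEY environment variable.
async function helloNail() {
  const payload = {
    model: "gpt-3.5-turbo",
    temperature: 0,
    n: 1,
    messages: [{ role: "user", content: "Hello, I'm a nail." }]
  };

  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": `Bearer ${process.env.OPENAI_API_KEY}`
    },
    body: JSON.stringify(payload)
  });

  const json = await res.json();
  console.log(json.choices[0].message.content);
}

helloNail();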

Let's take a quick look at the parameters and their effects.

model

The only one open to the general public at this point for chat completions is gpt-3.5-turbo. Esan had access to the GPT-4 model, which I was very jealous about. While gpt-4 is more accurate and hallucinates less [2], it is roughly 20 times more expensive, and for our needs, Mr. Turbo was quite adequate, thank you.

temperature

Next to the prompt, the temperature is one of the most important settings we can pass to the model. It can be set to a value between 0 and 2, as per the API docs. It has a major impact [6] as it controls how much randomness is in the output, sort of like the amount of caffeine in your system before you start writing. A guide to values you can use for each application is given here [7].

For our use case, we simply want no variations. We want the engine to give us the mappings as is, and the same ones every time. So, we used a value of 0.

n

How many chat completion choices to generate? If we were doing this for creative writing and wanted more than one choice to pick from, we could use 2 or even 3. For our case, n=1 (the default) works well.

messages

The role can be system, user, or assistant. The system role provides instructions and sets the context. The user role represents the prompt from the end user. The assistant role carries the responses based on the conversation history. These roles help structure conversations and enable effective interaction between users and the AI assistant.

MODEL MAX TOKENS

This isn't a parameter we pass in the request, though another parameter called max_tokens limits the total length of the response from the chat.

First, a token can be thought of as a piece of a word. One token is roughly 4 characters in English. For example, the quote "The best way to predict the future is to create it," attributed to Abraham Lincoln and others, contains 11 tokens.

Image from OpenAI Tokenizer. Generated by the author.

If you are thinking that a token is exactly a word, here is another example of 64 tokens, to show it isn't all that simple.

Image from OpenAI Tokenizer. Generated by the author.

Brace yourself for a shocking revelation: every emoji you include in your message adds a hefty toll of up to 6 tokens. That's right, your beloved smileys and winks are sneaky little token thieves! 😉💸

The model max token window is a technical limit. Your prompt (including any additional data you place into it) and the answer must all fit within the model max limit listed here. In the case of chat completions, the content, the role, and all of the previous messages consume tokens. If you remove a message from the input or the output (assistant messages), the model will lose all knowledge of it [8]. Like Dory as she helps find Chico, no Fabio, no Bingo, no Harpo, no Elmo?… Nemo!

For gpt-3.5-turbo, the model maximum limit is 4,096 tokens, or roughly 16K characters. For our use case, the prompt is roughly 2,000 characters, each survey response is roughly 20 characters (on average), and the mapping response is 7 characters. So, if we put N survey responses in each prompt, the max characters will be:

2000 + 20*N + 7*N, which should be less than 16,000.

Solving, we get an N value of less than 518, or roughly 500. Technically, we could put 500 survey responses in each request and go through our data in 20 passes. Instead, we chose to put 50 in each request and do it 200 times, as we were receiving abnormal responses intermittently if we put more than 50 survey responses in a single request. Every now and then, the service threw a temper tantrum! We're not sure if it's a chronic case of systemic petulance or if we just happened to stumble upon the grumpy side of luck.
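As a quick sanity check, here is the same back-of-the-envelope math as a tiny sketch; the character counts are just the rough averages quoted above:

// Rough batch-size estimate using the averages quoted above.
const maxChars = 16000;          // ~4096 tokens * ~4 characters per token
const promptChars = 2000;        // fixed prompt text
const perResponseChars = 20 + 7; // survey response + its mapping in the output

const maxBatch = Math.floor((maxChars - promptChars) / perResponseChars);
console.log(maxBatch); // 518 -- we conservatively used 50 per request instead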

So, how do we use this API? Let's get to the good part, the code.

The Way of the Code

Photo by Markus Spiske on Unsplash

Node.js is a JavaScript runtime environment [9]. We will write a Node.js/JavaScript program that performs the actions described in this flowchart:

Flowchart of the program. Image by the author.

My JavaScript skills aren't that great. I can write better Java, PHP, Julia, Go, C#, and even Python. But Esan was insisting on Node, so JavaScript it is.

All the code, the prompt, and the sample input are available at this GitHub link. However, let's take a gander at the juiciest bits:

First, let's see how we read the CSV file in using the 'csv-parser' Node library.

const fs = require('fs');          // file system access
const csv = require('csv-parser'); // CSV parsing library

const rows = [];
fs.createReadStream(fileName)
  .pipe(csv())
  .on('data', (data) => rows.push(data))
  .on('end', async () => {
    // Reading is done. Call the classifier here
    . . .
  });

Next, we call the classifier to generate the mappings.

for (let index = 0; index < totalDataPoints; ++index) {
  dataText += (index + 1).toString() + ". "
            + uniqueData[index] + "\n";
  requestCount++;
  if (requestCount > batchSize || index == totalDataPoints - 1) {
    dataText += "###\n";
    // We have loaded batchSize (50) responses.
    // Now construct the prompt
    ...
  }
}

The prompt is then constructed from the categories, the main prompt text, and the data from the CSV. We then send the prompt to the service using the OpenAI Node library.

let prompt = categoriesText + mainPrompt + dataText;
let payload = {
  model: "gpt-3.5-turbo",
  temperature: 0,
  messages: [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
  ]
};

try {
  const response = await openai.createChatCompletion(payload);
  let mapping = JSON.parse(response.data.choices[0].message.content);
  // Here save the mappings
  ...

} catch (error) {
  console.error('API Error:', error);
}

Finally, when all the iterations are done, we can translate the srcCol text (the survey response) to the targetCol (the normalized program name) and write out the CSV.

// Fill in the target column for every row that has a mapping.
for (let row of rows) {
  srcVal = row[srcCol].trim();
  if (!row[targetCol] && dataMapping[srcVal])
    row[targetCol] = dataMapping[srcVal];
}

// Write the translated rows back out as a CSV file.
stringify(rows, {
  header: true
}, function (err, output) {
  fs.writeFile(__dirname + '/' + destFileName, output,
    function (err, result) {
      if (err) console.log('error', err);
    });
});

That bit of JavaScript wasn't as hairy as I expected, and it got done in 2 to 3 hours. I suppose it always looks daunting until you get into it.

So, now that we have the code ready, it's time for the final execution…

The Execution

Photo by Alexander Grey on Unsplash

Now, we needed a place to run the code. After debating whether we should get a cloud instance to run the load, I did some quick math and realized that I could run it on my laptop in less than an hour. That wasn't so bad.

We began with a test round and noticed that the service, 1 out of 10 times, would respond with the data that was provided to it instead of the mappings. So, we'd just get the list of the survey responses back. Since no mappings were found, those responses in the CSV file would be mapped to an empty string.

Instead of detecting this and retrying in code, I decided to rerun the script but have it only process records for which the target column was empty.

The script would start with the target column in all rows empty and fill in the normalized program name. Due to the error in the response, some of the rows would have the target column unmapped and remain empty. When the script ran the second time, it would construct the prompt for only those responses that weren't processed in the first run. We reran the program a couple of times and got everything mapped out.
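A minimal sketch of that "only process what's still empty" filter might look like this, assuming the same rows, srcCol, and targetCol variables used elsewhere in the script:

// Collect only the unique survey responses whose target column is still empty,
// so a rerun rebuilds prompts just for the rows that were not mapped yet.
const unmapped = [...new Set(
  rows.filter(row => !row[targetCol])
      .map(row => row[srcCol].trim())
)];
// 'unmapped' then feeds the same batching loop shown earlier.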

The multiple runs took roughly half an hour and didn't need much supervision. Here are some of the more interesting mappings from the model:

Sample mappings between input and program name. Image by the author.

Most look right. Not sure if Organizational Behavior is Social Sciences or Business? I suppose either would work.

Each request of about 50 records used a total of roughly 800 tokens. The cost of the entire exercise was 40 cents. We probably spent another 10 cents on testing, reruns, etc. So for a total cost of about 50 cents, about 2½ hours of coding/testing time, and half an hour of runtime, we got the job done.
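As a rough cross-check, the arithmetic works out; note that the per-1K-token price below is my assumption about gpt-3.5-turbo pricing at the time, not a figure from the project notes:

// Back-of-the-envelope cost check (price per 1K tokens is an assumption).
const requests = 200;
const tokensPerRequest = 800;
const pricePer1kTokens = 0.002; // USD, assumed gpt-3.5-turbo pricing

const totalCost = (requests * tokensPerRequest / 1000) * pricePer1kTokens;
console.log(totalCost.toFixed(2)); // "0.32" -- the same ballpark as the 40 cents observed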

Total Cost: Approx. less than $1

Total Time: Approx. 3 hours

Perhaps with manual conversion using Excel, sorting, and drag-and-copy, we could have completed it in the same amount of time and saved a little change. But this was far more fun, we learned something, we have a repeatable script/process, and we got an article out of it. Besides, I have a feeling StudyFetch can afford the 50 cents.

This was a good use case that we tackled efficiently and cost-effectively, but what else can LLMs be used for?

In Search of More Nails

Photo by Marcel Strauß on Unsplash

Adding language capabilities to your applications can have more use cases than the one I illustrated above. Here are more use cases pertaining just to the survey data we were working with:

Data Parsing and Standardization: LLMs can assist in parsing and standardizing data by identifying and extracting relevant information from unstructured or semi-structured data sources like the one we just looked at.

Data Deduplication: LLMs can help identify duplicate records by comparing various data points. For example, we can compare names, majors, and universities in the survey data and flag potential duplicates.

Data Summarization: LLMs can summarize distinct records to get an idea of the responses. For example, for the question "What's the biggest challenge you face while studying?", a large language model can summarize several responses from the same major and university to see if there are any patterns. We can then put all the summarizations into a single request and get an overall list. But I suspect summarization for each customer segment will be more useful.

Sentiment Analysis: LLMs can analyze the responses to determine sentiment and extract useful insights. For the question "Would you pay for a service to help you study?", LLMs can categorize the sentiment from 0 (very negative) to 5 (very positive). We can then use this to analyze student interest in a paid service by segment.
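For illustration only, a sentiment prompt in the same style as our classification prompt might look something like the sketch below; the two sample responses are made up, not taken from the actual survey:

Rate the sentiment of each response below on a scale of 0 (very negative) to 5 (very positive).

Desired Format: JSON format of an array with response number, sentiment score

Data
###
1. Absolutely, it would save me hours every week.
2. No, I can't afford another subscription.
###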

While student surveys are a great example of a smaller microcosm, the broader world has several uses for this technology as well. At AAXIS, where I work, we implement business-to-business as well as business-to-consumer digital commerce solutions. Doing so includes migrating large amounts of data from an existing older system to a newer system with different data structures. We use a variety of data tools to analyze the source data and ensure consistency. The techniques outlined in this article could be of good use for that purpose.

Some of the other digital commerce use cases are checking the product catalog for errors, writing product copy, scanning review responses, and product review summarization, to name a few. All a lot simpler to code for than the murky waters of undergrad creativity when asked for their majors.

Still, it's important to note that while LLMs can be powerful tools for cleansing data, they should be used in conjunction with other techniques and human oversight. Data cleansing processes often require domain expertise, contextual understanding, and manual review to make informed decisions and maintain data integrity. LLMs are also not inference engines [10]. They're next-word predictors. And they tend to offer misinformation very confidently and convincingly (hallucinations) [2][11]. Fortunately, during our testing, we didn't encounter any hallucinations since our use case primarily involved classification.

LLMs can be a great tool in your arsenal if you proceed with caution and are aware of the pitfalls.

The Final Nail

Photo by Paul Szewczyk on Unsplash

We started off, in this article, by looking at a specific use case of data cleansing: normalizing survey responses to a specific set of values. This would allow us to group the responses and gain helpful insights. We used a large language model (LLM), OpenAI's GPT-3.5 Turbo, to help classify these responses. We reviewed the prompt that was used, how to make the API calls using the prompt, and the code needed to automate it all. Finally, we got it all put together and got the job done for a total OpenAI utility cost of less than one dollar.

Did we have a proverbial LLM hammer and find the perfectly shiny nail in free-form survey responses? Possibly. More likely, we had a Swiss army knife and used it to skin and eat some fish. Not quite purpose-built, but still very adequate. And Esan really loves sushi.

What's your use case? We'd love to hear from you!

The Co-conspirators

The main work in this article was performed by me, Esan Durrani, and Ryan Trattner, Co-Founders of StudyFetch, an AI-powered platform that uses course material to create personalized all-in-one study sets for students.

I would like to thank Prashant Mishra, Rajeev Hans, Israel Moura, and Andy Wagner, my colleagues at AAXIS Digital, for their review of this article and their suggestions.

I would also like to thank my friend of thirty years, Kiran Bondalapati, VP of Engineering at TRM Labs, for his initial formative guidance through the world of generative AI and for reviewing this article.

Also, thanks to my editor, Megan Polstra, for making the article look and feel professional as always.

Reference

1. Temu Raitaluoto, "The importance of personalized marketing in the digital age", MarketTailor Blog, May 2023, https://www.markettailor.io/blog/importance-of-personalized-marketing-in-digital-age

2. Ankur A. Patel, Bryant Linton and Dina Sostarec, GPT-4, GPT-3, and GPT-3.5 Turbo: A Review Of OpenAI’s Large Language Models, Apr 2023, Ankur’s Newsletter, https://www.ankursnewsletter.com/p/gpt-4-gpt-3-and-gpt-35-turbo-a-review

3. Alexandra Mendes, Ultimate ChatGPT prompt engineering guide for general users and developers, Jun 2023, Imaginary Cloud Blog, https://www.imaginarycloud.com/blog/chatgpt-prompt-engineering/

4. Sebastian, How to Use OpenAI's ChatGPT API in Node.js, Mar 2023, Medium — Coding the Smart Way, https://medium.com/codingthesmartway-com-blog/how-to-use-openais-chatgpt-api-in-node-js-3f01c1f8d473

5. Tristan Wolff, Liberate Your Prompts From ChatGPT Restrictions With The OpenAI API Playground, Feb 2023, Medium — Tales of Tomorrow, https://medium.com/tales-of-tomorrow/liberate-your-prompts-from-chatgpt-restrictions-with-the-openai-api-playground-a0ac92644c6f

6. AlgoWriting, An easy guide to setting the GPT-3 temperature, Nov 2020, Medium, https://algowriting.medium.com/gpt-3-temperature-setting-101-41200ff0d0be

7. Kane Hooper, Mastering the GPT-3 Temperature Parameter with Ruby, Jan 2023, Plain English, https://plainenglish.io/blog/mastering-the-gpt-3-temperature-parameter-with-ruby

8. OpenAI Authors, GPT Guide — Managing tokens, 2023, OpenAI Documentation, https://platform.openai.com/docs/guides/gpt/managing-tokens

9. Priyesh Patel, What exactly is Node.js?, Apr 2018, Medium — Free Code Camp, https://medium.com/free-code-camp/what-exactly-is-node-js-ae36e97449f5

10. Ben Dickson, Large language models have a reasoning problem, June 2022, Tech Talks Blog, https://bdtechtalks.com/2022/06/27/large-language-models-logical-reasoning/

11. Frank Neugebauer, Understanding LLM Hallucinations, May 2023, Towards Data Science, https://towardsdatascience.com/llm-hallucinations-ec831dcd7786
