End to End ML with GPT-3.5
Extract, transform, load (ETL)
Model Training
Model Serving

Learn how to use GPT-3.5 to do the heavy lifting for data acquisition, preprocessing, model training, and deployment

A lot of repetitive boilerplate code exists within the model development phase of any machine learning application. Popular libraries such as PyTorch Lightning have been created to standardize the operations performed when training/evaluating neural networks, resulting in much cleaner code. However, boilerplate extends far beyond training loops. Even the data acquisition phase of machine learning projects is full of steps that are necessary but time consuming. One way to deal with this challenge would be to create a library similar to PyTorch Lightning for the entire model development process. It would need to be general enough to work with a variety of model types beyond neural networks, and capable of integrating a variety of data sources.

Code examples for extracting data, preprocessing, model training, and deployment are readily available on the web, though gathering them and integrating them into a project takes time. Since such code is on the web, chances are that a large language model (LLM) has been trained on it and can rearrange it in a variety of useful ways through natural language commands. The goal of this post is to show how easy it is to automate many of the steps common to ML projects by using the GPT-3.5 API from OpenAI. I’ll show some failure cases along the way, and how to tune prompts to fix bugs when possible. Starting from scratch, without even so much as a dataset, we’ll end up with a model that is ready to be deployed on AWS SageMaker. If you’re following along, make sure to set up the OpenAI API as follows:

import openai
openai.api_key = "YOUR KEY HERE"

Also, the following utility function is convenient for calling the GPT-3.5 API:

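def get_api_result(prompt):
    # Send a single user message to the chat completion endpoint
    request = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}]
    )

    # The generated text lives in the first choice of the response
    result = request['choices'][0]['message']['content']
    return result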
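Chat models sometimes wrap the code they return in Markdown fences, with a sentence or two of explanation around it. The sketch below is one possible way to keep only the code before running it. The helper name `strip_code_fences` is hypothetical and not part of the OpenAI library, and the assumption is that any fenced block in the response is code we want.

```python
import re

FENCE = "`" * 3  # three backticks, built this way to keep the listing readable
FENCE_PATTERN = re.compile(FENCE + r"[\w+-]*\n(.*?)" + FENCE, re.DOTALL)


def strip_code_fences(response_text):
    """Return only the code from a chat response (hypothetical helper).

    If the response contains fenced blocks, their bodies are joined;
    otherwise the text is returned with outer whitespace removed.
    """
    blocks = FENCE_PATTERN.findall(response_text)
    if blocks:
        return "\n".join(block.strip() for block in blocks)
    return response_text.strip()


example = (
    "Here is the code:\n"
    + FENCE + "python\nimport pandas as pd\n" + FENCE
    + "\nHope that helps!"
)
print(strip_code_fences(example))
```

A response without any fences passes through unchanged, so the helper can be applied to every result of `get_api_result()`.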


Extract, transform, load (ETL)

This section is simplified in that it only considers a single data source, but it can in principle be extended to situations where data comes from multiple sources (csv files, databases, images, etc.). The first step is to extract some data. For the sake of simplicity, I’ll use the Income Prediction¹ dataset, where the goal is to predict whether an individual earns more or less than $50k per year based on their education, job position, industry, etc. The function below will be used to generate the code that downloads our data. Note how the prompt template is designed to bias the API to generate Python code:

def extract(prompt):
    prompt_template = """You are a ChatGPT language model that can generate Python code. Please provide a natural language input text, and I will generate the corresponding Python code.\nInput: {}\nPython code:""".format(prompt)
    # Send the filled-in template to GPT-3.5 and return the generated code
    return get_api_result(prompt_template)
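The templates in this post all share one layout: a role sentence, an "Input:" line, and a trailing label that nudges the model toward a particular kind of output. As a rough sketch, that layout could be factored into a single helper. The function name `build_prompt` and its parameters are hypothetical; only the layout comes from the post.

```python
def build_prompt(task_description, user_input, output_label="Python code"):
    """Assemble a prompt with the same layout as the templates in this post.

    The parameter names are hypothetical; only the layout (role sentence,
    "Input:", then an output label) is taken from the post.
    """
    return (
        f"You are a ChatGPT language model that can generate {task_description}. "
        "Please provide a natural language input text, and I will generate "
        f"the corresponding {output_label}.\n"
        f"Input: {user_input}\n"
        f"{output_label}:"
    )


prompt = build_prompt("Python code", "Load a CSV file into a DataFrame named df")
print(prompt)
```

Ending the prompt with the output label ("Python code:") is what biases the completion toward code rather than prose.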


Here we ask GPT-3.5 to:

Retrieve the adult income prediction dataset from openml using the sklearn fetch_openml function. Make sure to retrieve the data as a single dataframe which includes the target in a column named “target”. Name the resulting dataframe “df”.

which gives us the output:

import pandas as pd
from sklearn.datasets import fetch_openml

adult_income = fetch_openml(name='adult', version=2)
df = pd.DataFrame(adult_income.data, columns=adult_income.feature_names)
df['target'] = adult_income.target

This code is free of bugs and gives us exactly what we want. Had I used a simpler prompt by removing mentions of openml and the function to use for retrieval, we would get:

import pandas as pd

df = pd.read_csv("adult_income_prediction.csv")
df.columns = ["age", "workclass", "fnlwgt", "education", "education-num",
"marital-status", "occupation", "relationship", "race",
"sex", "capital-gain", "capital-loss", "hours-per-week",
"native-country", "target"]

This assumes that the data is locally available. What’s interesting about this result is that it has the correct column names as a list, even though we didn’t include them in the API call! These names, however, are all over the web, in particular in this Medium post, except for the target column, which is added by GPT. The next step is to transform the data into a format that is usable by machine learning models. We’ll need to use a more advanced prompt template here, since GPT needs to be made aware of the column names and their types:

def transform(columns, column_types, prompt):
    prompt_template = """You are a ChatGPT language model that can generate Python code. Please provide a natural language input text, and I will generate the corresponding Python code using Pandas to preprocess the DataFrame. The DataFrame columns are {} and their corresponding dtypes are {}.\nInput: {}\nPython code:""".format(columns, column_types, prompt)
    # Send the filled-in template to GPT-3.5 and return the generated code
    return get_api_result(prompt_template)


I then make the following request:

Preprocess the dataframe by converting all categorical columns to their one-hot encoded equivalents, and normalizing numerical columns. Drop rows which have an NA or NaN value in any column. Drop rows that have numeric column outliers as determined by their z score. A numeric column outlier is a value that is outside of the 1 to 99 inter-quantile range. The numerical columns should be normalized using StandardScaler from sklearn. The values in the target column should be converted to 0 or 1 and should be of type int.

We now get:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the DataFrame
df = pd.read_csv('data.csv')

# Convert categorical columns to one-hot encoded equivalents
df = pd.get_dummies(df, columns=['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country'])

# Normalize numerical columns
scaler = StandardScaler()
numerical_cols = ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])

# Drop rows with NA or NaN values
df.dropna(inplace=True)

# Drop rows with numeric column outliers
z_scores = df[numerical_cols].apply(lambda x: (x - x.mean()) / x.std())
df = df[(z_scores > -3).all(axis=1) & (z_scores < 3).all(axis=1)]

# Convert target column to 0 or 1 and change type to int
df['target'] = df['target'].apply(lambda x: 1 if x == '>50K' else 0).astype(int)

which is almost perfect for our use if we exclude the part that tries to load in the data from a csv file. It is quite difficult to convince GPT to exclude this, even if we explicitly tell it to assume that df exists and should not be loaded. Lastly, we need to load the data into a local database. This is overkill for such a simple use case, but it is a good habit to develop.

def load(prompt):
    prompt_template = """You are a ChatGPT language model that can generate Python code. Please provide a natural language input text, and I will generate the corresponding Python code.\nInput: {}\nPython code:""".format(prompt)
    # Send the filled-in template to GPT-3.5 and return the generated code
    return get_api_result(prompt_template)


I use the following instruction:

Connect to an sqlite database named “data”. Use pandas to insert data from a DataFrame named “df” into a table named “income”. Do not include the index column. Commit the changes before closing the connection.

in order to get:

import sqlite3
import pandas as pd

# Create a connection to the database
connection = sqlite3.connect("data.db")

# Define the DataFrame
df = pd.DataFrame()

# Define the table name
table_name = "income"

# Insert the data into the table
df.to_sql(table_name, connection, if_exists="replace", index=False)

# Commit the changes to the database
connection.commit()

# Close the connection
connection.close()

which is perfect, minus the unnecessary creation of df. With this processed data in hand, we’re ready to train some models.
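Both the transform and the load outputs re-create df even though it already exists. Since prompting alone did not reliably prevent this, one alternative is to clean the generated code afterwards. The sketch below uses Python's ast module to drop top-level statements that define df from scratch. The helper names are hypothetical, and it assumes that df already exists in the calling scope and that the unwanted lines are simple assignments of the form `df = ...`.

```python
import ast


def mentions_name(node, name):
    """Return True if `name` is referenced anywhere inside `node`."""
    return any(isinstance(n, ast.Name) and n.id == name for n in ast.walk(node))


def drop_fresh_assignments(source, variable_name="df"):
    """Remove top-level statements that define `variable_name` from scratch.

    Hypothetical helper: assumes the variable already exists, so any
    assignment that does not build on its current value is unwanted.
    """
    tree = ast.parse(source)
    lines_to_drop = set()
    for statement in tree.body:
        if not isinstance(statement, ast.Assign):
            continue
        assigns_variable = any(
            isinstance(target, ast.Name) and target.id == variable_name
            for target in statement.targets
        )
        # `df = pd.get_dummies(df, ...)` reuses df and must be kept;
        # `df = pd.read_csv(...)` does not, so it is a fresh definition.
        if assigns_variable and not mentions_name(statement.value, variable_name):
            lines_to_drop.update(range(statement.lineno, statement.end_lineno + 1))
    kept_lines = [
        line
        for number, line in enumerate(source.split("\n"), start=1)
        if number not in lines_to_drop
    ]
    return "\n".join(kept_lines)


generated = (
    "import pandas as pd\n"
    "df = pd.read_csv('data.csv')\n"
    "df = pd.get_dummies(df, columns=['sex'])\n"
)
cleaned = drop_fresh_assignments(generated)
print(cleaned)
```

Comments that accompanied a removed line (such as "# Load the DataFrame") are left in place, which is harmless but slightly untidy.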

Model Training

Resources permitting, it’s a good idea to try out a few different model types to identify the one with the right level of complexity for the given task. Therefore, we ask GPT-3.5 to try out a few different models. First, let’s set up the generic prompt template for model training:

def train(prompt):
    prompt_template = """You are a ChatGPT language model that can generate Python code. Focus on using scikit-learn when applicable. Please provide a natural language input text, and I will generate the corresponding Python code.\nInput: {}\nPython code:""".format(prompt)
    # Send the filled-in template to GPT-3.5 and return the generated code
    return get_api_result(prompt_template)


and the prompt we’ll be using is:

Train a variety of classification models to predict the “target” column using all other columns. Do so using 5-fold cross validation to choose the best model and corresponding set of hyperparameters, and return the best overall model and corresponding hyperparameter settings. Select the best model based on accuracy. Assume a dataframe named “df” exists which is to be used for training. Log the entire process using MLFlow. Start logging with mlflow before training any models so only a single run is stored. Make sure that the model is logged using the sklearn module of mlflow. Make sure that only the best overall model is logged, but log metrics for all model types. The mean value of the following metrics on all cross validation folds should be logged: accuracy, AUC, F1 score

Let’s have a look at the generated output, and this time go deeper into why some of the specific instructions had to be provided.

import pandas as pd
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import mlflow
import mlflow.sklearn

# load the dataframe
df = pd.read_csv('path/to/dataframe.csv')

# define the features and target columns
features = df.columns.drop('target')
target = 'target'

# set up mlflow
mlflow.set_tracking_uri('localhost:5000') # or set up your own server

# set up cross-validation
cv = 5

# define the models and corresponding hyperparameters to be tuned
models = [
    {
        'name': 'Logistic Regression',
        'estimator': LogisticRegression(),
        'hyperparameters': {
            'penalty': ['l1', 'l2'],
            'C': [0.1, 1, 10]
        }
    },
    {
        'name': 'Decision Tree',
        'estimator': DecisionTreeClassifier(),
        'hyperparameters': {
            'max_depth': [2, 4, 6, 8]
        }
    },
    {
        'name': 'Random Forest',
        'estimator': RandomForestClassifier(),
        'hyperparameters': {
            'n_estimators': [10, 25, 50, 100],
            'max_depth': [None, 2, 4, 6, 8]
        }
    }
]

# iterate through the models and corresponding hyperparameters to train and tune them
with mlflow.start_run():
    best_model = None
    best_metrics = {}

    for model in models:
        print(f"Training {model['name']} model...")

        # set up grid search for hyperparameter tuning
        gs = GridSearchCV(model['estimator'], model['hyperparameters'], cv=cv, verbose=2, n_jobs=-1, scoring='accuracy')
        gs.fit(df[features], df[target])

        # log metrics for all model types
        metrics = cross_validate(gs.best_estimator_, df[features], df[target], cv=cv, n_jobs=-1,
                                 scoring=["accuracy", "roc_auc", "f1"])
        for metric, values in metrics.items():
            mlflow.log_metric(f"{model['name']}_{metric}", values.mean())

        # check if this is the best model based on accuracy
        if best_model is None or gs.best_score_ > best_metrics['accuracy']:
            best_model = gs.best_estimator_
            best_metrics = {'accuracy': gs.best_score_,
                            'AUC': roc_auc_score(df[target], best_model.predict_proba(df[features])[:, 1]),
                            'F1': f1_score(df[target], best_model.predict(df[features]))}

    # register the best model in mlflow
    mlflow.sklearn.log_model(sk_model=best_model, artifact_path="best_model")

    # log the best model's metrics
    for metric, value in best_metrics.items():
        mlflow.log_metric(f"best_model_{metric}", value)

If we remove the loading of df and the section # set up mlflow, we end up with exactly what is desired: a loop over three different model types, performing a grid search using 5-fold cross validation to identify the best hyperparameters for the given model type, while keeping track of metrics. Without specifying “select the best model based on accuracy”, the generated code will use scoring=["accuracy", "roc_auc", "f1"] for the grid search, which will not work since there is ambiguity as to how to select the best model according to multiple metrics. Without “make sure that the model is logged using the sklearn module of mlflow”, we sometimes end up with mlflow.log_model(), which is wrong. Also, “make sure that only the best overall model is logged” is necessary to avoid storing all models. Overall, this output is acceptable, but it is unstable, and running it multiple times is likely to introduce different bugs. In order to have everything ready for the serving step, it is useful to add the model signature when saving the model. This signature is basically a collection of feature names and their corresponding types. It is a pain to get GPT-3.5 to add this, so some manual labor needs to be done by first adding the import:

from mlflow.models.signature import infer_signature

and then modifying the line of code which logs the model via:

mlflow.sklearn.log_model(sk_model=best_model, artifact_path="best_model", signature=infer_signature(df[features], best_model.predict(df[features])))
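Returning to the multi-metric ambiguity mentioned earlier: scikit-learn's GridSearchCV can accept several scoring metrics at once, as long as the refit argument names the one metric used to pick the best model. This is not what GPT-3.5 generated. It is a minimal sketch of an alternative, run on a small synthetic dataset (an assumption standing in for the income data) with a single model type.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Small synthetic dataset standing in for the income data (assumption).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6, 8]},
    cv=5,
    scoring=["accuracy", "roc_auc", "f1"],  # several metrics are recorded
    refit="accuracy",  # but only this one decides the best model
)
search.fit(X, y)

print(search.best_params_)
for metric in ("accuracy", "roc_auc", "f1"):
    # Mean score across the 5 folds for the selected hyperparameters
    mean_score = search.cv_results_[f"mean_test_{metric}"][search.best_index_]
    print(metric, round(mean_score, 3))
```

With this setup, the per-fold means for all three metrics come out of the same search, so a separate cross_validate call is not strictly needed.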
Model Serving

Since we used MLflow to log the best model, we have a couple of options for serving the model. The simplest option is to host the model locally. Let’s first design the general prompt template for model serving:

def serve_model(model_path, prompt):
    prompt_template = """You are a ChatGPT language model that can generate shell code for deploying models using MLFlow. Please provide a natural language input text, and I will generate the corresponding command to deploy the model. The model is located in the file {}.\nInput: {}\nShell command:""".format(model_path, prompt)
    # Send the filled-in template to GPT-3.5 and return the generated command
    return get_api_result(prompt_template)


and the prompt will be:

Serve the model using port number 1111, and use the local environment manager

By calling serve_model("", query) we get:

mlflow models serve -m  -p 1111 --no-conda

Once we run this command in the shell, we’re ready to make predictions by sending data encoded as JSON to the model. We’ll first generate the command to send data to the model, and then create the JSON payload to be inserted into the command.

def send_request(prompt):
    prompt_template = """You are a ChatGPT language model that can generate code for sending data to deployed MLFlow models. Please provide a natural language input text, and I will generate the corresponding command. \nInput: {}\nCommand:""".format(prompt)
    # Send the filled-in template to GPT-3.5 and return the generated command
    return get_api_result(prompt_template)


The following request will be inserted into the prompt template in send_request():

Use the “curl” command to send data “” to an mlflow model hosted at port 1111 on localhost. Make sure that the content type is “application/json”.

The output generated by GPT-3.5 is:

curl -X POST -H "Content-Type: application/json" -d '' http://localhost:1111/invocations

It’s preferable to have the URL immediately after curl instead of at the very end of the command, i.e.

curl http://localhost:1111/invocations -X POST -H "Content-Type: application/json" -d ''

Getting GPT-3.5 to do this is not easy. Both of the following requests fail to do so:

Use the “curl” command to send data “” to an mlflow model hosted at port 1111 on localhost. Place the URL immediately after “curl”. Make sure that the content type is “application/json”.

Use the “curl” command, with the URL placed before any argument, to send data “” to an mlflow model hosted at port 1111 on localhost. Make sure that the content type is “application/json”.

Perhaps it’s possible to get the desired output if we have GPT-3.5 modify an existing command rather than generate one from scratch. Here is the generic template for modifying commands:

def modify_request(prompt):
    prompt_template = """You are a ChatGPT language model that can modify commands for sending data using "curl". Please provide a natural language instruction, corresponding command, and I will generate the modified command. \nInput: {}\nCommand:""".format(prompt)
    # Send the filled-in template to GPT-3.5 and return the modified command
    return get_api_result(prompt_template)


We will call this function as follows:

code = """curl -X POST -H "Content-Type: application/json" -d '' http://localhost:1111/invocations"""
prompt = """Please modify the following by placing the url before the "-X POST" argument:\n{}""".format(code)

which finally gives us:

curl http://localhost:1111/invocations -X POST -H "Content-Type: application/json" -d ''
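For readers who would rather stay in Python than shell out to curl, the same POST request can be sketched with the standard library. The payload below is only a placeholder with a single made-up column, since the real payload is built from the DataFrame later in the post. The call that actually sends the request is commented out because it needs the model server to be running.

```python
import json
import urllib.request

# Placeholder payload with one made-up column (assumption); the real one
# is built from the DataFrame later in the post.
payload = json.dumps({"dataframe_split": {"columns": ["age"], "data": [[39]]}})

request = urllib.request.Request(
    "http://localhost:1111/invocations",
    data=payload.encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Uncomment once the model server from the previous step is running:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

The URL, HTTP method, and content type mirror the curl command above one to one.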

Now it’s time to create the payload:

def create_payload(prompt):
    prompt_template = """You are a ChatGPT language model that can generate code for sending data to deployed MLFlow models. Please provide a natural language input text, and I will generate the corresponding command. \nInput: {}\nPython code:""".format(prompt)
    # Send the filled-in template to GPT-3.5 and return the generated code
    return get_api_result(prompt_template)


The prompt for this part needed quite a bit of tuning to get the desired output format:

Convert the DataFrame “df” to json format that can be received by a deployed MLFlow model. Wrap the resulting json in an object called “dataframe_split”. The resulting string should not have newlines, and it should not escape quotes. Also, “dataframe_split” should be surrounded by double quotes instead of single quotes. Do not include the “target” column. Use the split “orient” argument

Without the explicit instruction to avoid newlines and escaping quotes, a call to json.dumps() is made, which does not produce the format that the MLflow endpoint expects. The generated code is:

json_data = df.drop("target", axis=1).to_json(orient="split", double_precision=15)
wrapped_data = f'{{"dataframe_split":{json_data}}}'

Before replacing the data placeholder in the curl request with the value of wrapped_data, we probably want to send only a few rows of data for prediction, otherwise the resulting payload is too large. So we modify the above to be:

json_data = df[:5].drop("target", axis=1).to_json(orient="split", double_precision=15)
wrapped_data = f'{{"dataframe_split":{json_data}}}'
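To make the expected shape of the payload concrete, here is a small sketch that builds the same kind of string by hand with the standard library. The two rows and the two-column subset are made up for illustration. Note that json.dumps() is applied to a dict here, not to an already serialized string, so no quotes end up escaped.

```python
import json

# Two made-up rows and a small, hypothetical subset of the feature columns.
columns = ["age", "hours-per-week"]
rows = [[0.03, -0.04], [0.84, -2.22]]

# Same layout as DataFrame.to_json(orient="split"): columns, index, data.
split_layout = {"columns": columns, "index": list(range(len(rows))), "data": rows}

# separators=(",", ":") keeps the string compact, with no newlines or padding.
wrapped_data = json.dumps({"dataframe_split": split_layout}, separators=(",", ":"))
print(wrapped_data)
```

The resulting string has a single top-level key, "dataframe_split", whose value holds the column names and one inner list per row.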

Invoking the model gives:

{"predictions": [0, 0, 0, 1, 0]}

whereas the actual targets are [0, 0, 1, 1, 0].
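As a quick sanity check, the two lists can be compared directly. Keep in mind that these five rows were part of the data the model was trained on, so the figure says nothing about how the model generalizes.

```python
predictions = [0, 0, 0, 1, 0]
targets = [0, 0, 1, 1, 0]

# Count the positions where the prediction equals the actual target
matches = sum(p == t for p, t in zip(predictions, targets))
accuracy = matches / len(targets)
print(f"{matches} of {len(targets)} correct, accuracy = {accuracy:.2f}")
# prints: 4 of 5 correct, accuracy = 0.80
```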

There we have it. At the beginning of this post, we didn’t even have access to a dataset, yet we’ve managed to end up with a deployed model that was chosen to be the best through cross-validation. Importantly, GPT-3.5 did all the heavy lifting, and only required minimal assistance along the way. I did, however, need to specify particular libraries to use and methods to call, but this was mainly required to resolve ambiguities. Had I specified “Log the entire process” instead of “Log the entire process using MLFlow”, GPT-3.5 would have had too many libraries to choose from, and the resulting model format might not have been useful for serving with MLflow. Thus, some knowledge of the tools used to perform the various steps in the ML pipeline is required to have success using GPT-3.5, but it is minimal compared to the knowledge required to code from scratch.

Another option for serving the model is to host it as a SageMaker endpoint on AWS. Despite how easy this may look on the MLflow website, I assure you that, as with many examples on the web involving AWS, things will go wrong. First of all, Docker must be installed in order to generate the Docker image using the command:

mlflow sagemaker build-and-push-container

Second, the Python library boto3 used to communicate with AWS also requires installation. Beyond this, permissions must be properly set up such that the SageMaker, ECR, and S3 services can communicate with each other on behalf of your account. Here are the commands I ended up having to use:

mlflow deployments run-local -t sagemaker -m  --name income_classifier
mlflow deployments create -t sagemaker --name income_classifier -m model/ --config image_url= --config bucket=mlflow-serving --config region_name=us-east-1

along with some manual tinkering behind the scenes to get the S3 bucket to be in the correct region.

With the help of GPT-3.5 we went through the ML pipeline in a (mostly) painless way, though the last mile was a bit trickier. Note how I didn’t use GPT-3.5 to generate the commands for serving the model on AWS. It works poorly for this use case, and creates made-up argument names. I can only speculate that switching to the GPT-4.0 API would help resolve some of the above bugs, and lead to an even easier model development experience.

While the ML pipeline can be fully automated using LLMs, it is not yet safe to have a non-expert be responsible for the process. The bugs in the above code were easily identified because the Python interpreter would throw errors, but there are more subtle bugs that can be harmful. For example, the elimination of outlier values in the preprocessing code could be wrong, leading to too many or too few samples being discarded. In the worst case, it could inadvertently drop entire subgroups of people, exacerbating potential fairness issues.
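The outlier step is a good illustration of how such a bug can hide. The preprocessing prompt described outliers both in terms of a z score and in terms of the 1 to 99 quantile range, while the generated code applied only a z score cutoff of 3. The sketch below, using a made-up column of the integers 1 through 100 and only the standard library, shows that the two rules do not discard the same rows.

```python
import statistics

# Hypothetical numeric column: the integers 1 through 100.
values = list(range(1, 101))

# Rule 1: keep values whose z score is strictly between -3 and 3.
mean = statistics.mean(values)
std = statistics.stdev(values)
kept_by_z = [v for v in values if -3 < (v - mean) / std < 3]

# Rule 2: keep values strictly between the 1st and 99th percentiles.
cut_points = statistics.quantiles(values, n=100)  # 99 cut points
p01, p99 = cut_points[0], cut_points[-1]
kept_by_quantile = [v for v in values if p01 < v < p99]

print(len(kept_by_z), len(kept_by_quantile))
# prints: 100 98
```

Neither result raises an error, so the mismatch between what was asked for and what was implemented would only be noticed by someone who checks the row counts.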

Moreover, the grid search over hyperparameters could have been done over a poorly chosen range, leading to overfitting or underfitting depending on the range. This could be quite tricky to identify for someone with little ML experience, as the code otherwise seems correct, but an understanding of how regularization works in these models is required. Thus, it is not yet appropriate to have an unspecialized software engineer stand in for an ML engineer, but that time is fast approaching.

[1] Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. (CC BY 4.0)

