LMQL — SQL for Language Models

I’m sure you’ve heard about SQL or have even mastered it. SQL (Structured Query Language) is a declarative language widely used to work with database data.

According to the annual StackOverflow survey, SQL is still one of the most popular languages in the world. For professional developers, SQL is in the top-3 languages (after JavaScript and HTML/CSS). More than half of professionals use it. Surprisingly, SQL is even more popular than Python.

Graph by author, data from StackOverflow survey

SQL is a common way to talk to your data in a database. So, it is no surprise that there are attempts to use a similar approach for LLMs. In this article, I would like to tell you about one such approach called LMQL.

LMQL (Language Model Query Language) is an open-source programming language for language models. LMQL is released under the Apache 2.0 license, which allows you to use it commercially.

LMQL was developed by ETH Zurich researchers. They proposed a novel idea of LMP (Language Model Programming). LMP combines natural and programming languages: text prompts and scripting instructions.

In the original paper, “Prompting Is Programming: A Query Language for Large Language Models” by Luca Beurer-Kellner, Marc Fischer and Martin Vechev, the authors flagged the following challenges of current LLM usage:

  • Interaction. For example, we could use meta prompting, asking the LM to expand the initial prompt. As a practical case, we could first ask the model to identify the language of the initial query and then respond in that language. For such a task, we would need to send the first prompt, extract the language from the output, add it to the second prompt template and make another call to the LM. There are quite a lot of interactions to manage. With LMQL, you can define multiple input and output variables within one prompt. More than that, LMQL will optimise the overall likelihood across numerous calls, which could yield better results.
  • Constraints & token representation. Current LMs don’t provide functionality to constrain output, which is crucial if we use LMs in production. Imagine building a sentiment analysis in production to mark negative reviews in our interface for CS agents. Our program would expect to receive “positive”, “negative”, or “neutral” from the LLM. However, quite often, you might get something like “The sentiment for the provided customer review is positive” from the LLM, which is not so easy to process in your API. That’s why constraints would be pretty helpful. LMQL allows you to control output using human-understandable words (not the tokens that LMs operate with).
  • Efficiency and cost. LLMs are large networks, so they are pretty expensive, regardless of whether you use them via an API or in your local environment. LMQL can leverage predefined behaviour and the constraints on the search space (introduced by constraints) to reduce the number of LM invocation calls.

As you can see, LMQL can address these challenges. It allows you to combine multiple calls in one prompt, control your output and even reduce cost.

The impact on cost and efficiency could be pretty substantial. Constraining the search space can significantly reduce costs for LLMs. For example, in the cases from the LMQL paper, there were 75–85% fewer billable tokens with LMQL compared to standard decoding, which means it can significantly reduce your costs.

Image from the paper by Beurer-Kellner et al. (2023)

I believe the most crucial benefit of LMQL is complete control over your output. However, with such an approach, you will also have another layer of abstraction over the LLM (similar to LangChain, which we discussed earlier). It will also allow you to switch from one backend to another easily if you need to. LMQL can work with different backends: OpenAI, Hugging Face Transformers or llama.cpp.

You can install LMQL locally or use the web-based Playground online. The Playground can be pretty handy for debugging, but you can only use the OpenAI backend there. For all other use cases, you will have to use a local installation.

As usual, there are some limitations to this approach:

  • This library is not very popular yet, so the community is pretty small, and few external materials are available.
  • In some cases, the documentation is not very detailed.
  • The most popular and best-performing OpenAI models have some limitations, so you can’t use the full power of LMQL with ChatGPT.
  • I wouldn’t use LMQL in production since I can’t say that it’s a mature project. For example, distribution over tokens provides pretty poor accuracy.

A rather close alternative to LMQL is Guidance. It also allows you to constrain generation and control the LM’s output.

Despite all the limitations, I like the concept of Language Model Programming, and that’s why I’ve decided to discuss it in this article.

If you’re interested in learning more about LMQL from its authors, check out this video.

Now we know a bit about what LMQL is. Let’s look at an example of an LMQL query to get acquainted with its syntax.

beam(n=3)
"Q: Say 'Hello, {name}!'"
"A: [RESPONSE]"
from "openai/text-davinci-003"
where len(TOKENS(RESPONSE)) < 20

I hope you can guess its meaning. But let’s discuss it in detail.
Here’s the scheme of an LMQL query

Image from the paper by Beurer-Kellner et al. (2023)

Any LMQL program consists of 5 parts:

  • Decoder defines the decoding procedure used. In simple words, it describes the algorithm to pick the next token. LMQL has three different types of decoders: argmax, beam and sample. You can learn about them in more detail in the paper.
  • Actual query is similar to the classic prompt but in Python syntax, which means that you can use structures such as loops or if-statements.
  • In the from clause, we specify the model to use (openai/text-davinci-003 in our example).
  • The where clause defines constraints.
  • Distribution is used when you want to see probabilities for tokens in the return. We haven’t used distribution in this query, but we will use it to get class probabilities for the sentiment analysis later.

Also, you might have noticed the special variables in our query, {name} and [RESPONSE]. Let’s discuss how they work:

  • {name} is an input parameter. It could be any variable from your scope. Such parameters help you create handy functions that can easily be re-used for different inputs.
  • [RESPONSE] is a phrase that the LM will generate. It can also be called a hole or placeholder. All the text before [RESPONSE] is sent to the LM, and then the model’s output is assigned to the variable. It’s handy that you can easily re-use this output later in the prompt, referring to it as {RESPONSE}, as the sketch after this list shows.
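
To make this concrete, here is a minimal sketch (my own illustration, not taken from the LMQL docs; it assumes an OpenAI backend is configured, and the prompts are made up) of how an input parameter and a generated placeholder can be chained within one query, covering the meta-prompting case described earlier.

import lmql

@lmql.query
def reply_in_same_language(user_message):
    '''lmql
    # {user_message} is an input parameter taken from the Python scope
    "What language is the following message written in: ```{user_message}```?\n"
    "Language: [LANGUAGE]\n" where STOPS_AT(LANGUAGE, '\n')
    # {LANGUAGE} re-uses the text generated for the placeholder above
    "Write a short reply to the message in {LANGUAGE}: [REPLY]" where len(TOKENS(REPLY)) < 100
    return REPLY
    '''

print(reply_in_same_language(user_message="Bonjour, comment allez-vous ?"))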

We’ve briefly covered the main concepts. Let’s try it ourselves. Practice makes perfect.

Setting up the environment

First of all, we need to set up the environment. To use LMQL in Python, we need to install the package first. No surprises, we can just use pip. You need an environment with Python ≥ 3.10.

pip install lmql

If you want to use LMQL with a local GPU, follow the instructions in the documentation.

To use OpenAI models, you need to set up an API key to access OpenAI. The easiest way is to specify the OPENAI_API_KEY environment variable.

import os
os.environ['OPENAI_API_KEY'] = ''

However, OpenAI models have many limitations (for example, you won’t be able to get distributions with more than five classes). So, we will use Llama.cpp to test LMQL with local models.

First, you need to install the Python binding for Llama.cpp in the same environment as LMQL.

pip install llama-cpp-python

If you want to use a local GPU (Apple Metal in this example), specify the following parameters.

CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python

Then, we need to load the model weights as .gguf files. You can find models on the HuggingFace Models Hub.

We will be using two models:

Llama-2–7B is the smallest version of the fine-tuned generative text models by Meta. It’s a pretty basic model, so we shouldn’t expect outstanding performance from it.

Zephyr is a fine-tuned version of the Mistral model with decent performance. It performs better in some aspects than a 10x larger open-source model, Llama-2–70b. However, there is still some gap between Zephyr and proprietary models like ChatGPT or Claude.

Image from the paper by Tunstall et al. (2023)

According to the LMSYS ChatBot Arena leaderboard, Zephyr is the best-performing model with 7B parameters. It’s on par with much bigger models.

Screenshot of leaderboard | source

Let’s load the .gguf files for our models.

import os
import urllib.request

def download_gguf(model_url, filename):
    if not os.path.isfile(filename):
        urllib.request.urlretrieve(model_url, filename)
        print("file has been downloaded successfully")
    else:
        print("file already exists")

download_gguf(
    "https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/resolve/main/zephyr-7b-beta.Q4_K_M.gguf",
    "zephyr-7b-beta.Q4_K_M.gguf"
)

download_gguf(
    "https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf",
    "llama-2-7b.Q4_K_M.gguf"
)

We need to download a few GBs, so it might take some time (10–15 minutes for each model). Luckily, you need to do it only once.

You can interact with the local models in two different ways (documentation):

  • Two-process architecture, when you have a separate long-running process with your model and short-running inference calls. This approach is more suitable for production.
  • For ad-hoc tasks, we could use in-process model loading, specifying local: before the model name. We will be using this approach to work with the local models (a short sketch of both options follows below).
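
For illustration, here is a rough sketch of how the two options differ in code (based on my reading of the documentation, so treat the exact identifiers as an assumption; the file names are the weights we downloaded above).

import lmql

# in-process loading for ad-hoc work: the weights are loaded inside this Python process
local_zephyr = lmql.model(
    "local:llama.cpp:zephyr-7b-beta.Q4_K_M.gguf",
    tokenizer="HuggingFaceH4/zephyr-7b-beta"
)

# two-process setup: start a long-running server in a separate terminal, e.g.
#   lmql serve-model llama.cpp:zephyr-7b-beta.Q4_K_M.gguf
# and then reference the served model without the local: prefix
served_zephyr = lmql.model(
    "llama.cpp:zephyr-7b-beta.Q4_K_M.gguf",
    tokenizer="HuggingFaceH4/zephyr-7b-beta"
)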

Now we’ve set up the environment, and it’s time to discuss how to use LMQL from Python.

Python functions

Let’s briefly discuss how to use LMQL in Python. The Playground can be handy for debugging, but if you want to use the LM in production, you need an API.

LMQL provides four main approaches to its functionality: lmql.F, lmql.run, the @lmql.query decorator and the Generations API.

The Generations API has been added recently. It’s a simple Python API that helps to do inference without writing LMQL yourself. Since I’m more interested in the LMP concept, we won’t cover this API in this article.

Let’s discuss the other three approaches in detail and try to use them.

First, you could use lmql.F. It’s a lightweight functionality similar to lambda functions in Python that allows you to execute part of an LMQL query. lmql.F can have only one placeholder variable, which will be returned from the lambda function.

We can specify both the prompt and the constraint for the function. The constraint is equivalent to the where clause in an LMQL query.

Since we haven’t specified any model, the OpenAI text-davinci model will be used.

capital_func = lmql.F("What is the capital of {country}? [CAPITAL]",
    constraints = "STOPS_AT(CAPITAL, '.')")

capital_func('the UK')

# Output - '\n\nThe capital of the UK is London.'

If you’re using Jupyter Notebooks, you might encounter some problems since Notebook environments are asynchronous. You could enable nested event loops in your notebook to avoid such issues.

import nest_asyncio
nest_asyncio.apply()

The second approach allows you to define more complex queries. You can use lmql.run to execute an LMQL query without creating a function. Let’s make our query a bit more complicated and use the answer from the model in the following question.

In this case, we’ve defined constraints in the where clause of the query string itself.

query_string = '''
"Q: What is the capital of {country}? \n"
"A: [CAPITAL] \n"
"Q: What is the main sight in {CAPITAL}? \n"
"A: [ANSWER]" where (len(TOKENS(CAPITAL)) < 10) \
    and (len(TOKENS(ANSWER)) < 100) and STOPS_AT(CAPITAL, '\n') \
    and STOPS_AT(ANSWER, '\n')
'''

lmql.run_sync(query_string, country="the UK")

Also, I’ve used run_sync instead of run to get the result synchronously.

As a result, we got an LMQLResult object with a set of fields (a small example of unpacking it follows below):

  • prompt — includes the whole prompt with the parameters and the model’s answers. We can see that the model’s answer was used for the second question.
  • variables — a dictionary with all the variables we defined: ANSWER and CAPITAL.
  • distribution_variable and distribution_values are None since we haven’t used this functionality.
Image by author
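
Here is a small sketch of how this object can be unpacked, re-running the query from the snippet above and following the field names just described (the exact return shape can differ between LMQL versions).

result = lmql.run_sync(query_string, country="the UK")

print(result.prompt)                  # the full prompt with the model's answers filled in
print(result.variables['CAPITAL'])    # e.g. 'London'
print(result.variables['ANSWER'])     # the answer about the main sight
print(result.distribution_variable)   # None, since we haven't used distribution here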

The third way to use the Python API is the @lmql.query decorator, which allows you to define a Python function that will be handy to use in the future. It’s more convenient if you plan to call this prompt several times.

We could create a function for our previous query and get only the final answer instead of returning the whole LMQLResult object.

@lmql.query
def capital_sights(country):
    '''lmql
    "Q: What is the capital of {country}? \n"
    "A: [CAPITAL] \n"
    "Q: What is the main sight in {CAPITAL}? \n"
    "A: [ANSWER]" where (len(TOKENS(CAPITAL)) < 10) and (len(TOKENS(ANSWER)) < 100) \
        and STOPS_AT(CAPITAL, '\n') and STOPS_AT(ANSWER, '\n')

    # return just the ANSWER
    return ANSWER
    '''

print(capital_sights(country="the UK"))

# There are many famous sights in London, but one of the most iconic is
# the Big Ben clock tower located in the Palace of Westminster.
# Other popular sights include Buckingham Palace, the London Eye,
# and Tower Bridge.

Also, you could use LMQL in combination with LangChain:

  • LMQL queries are Prompt Templates on steroids and could be part of LangChain chains.
  • You could leverage LangChain components from LMQL (for example, retrieval). You can find examples in the documentation.

Now we know all the basics of LMQL syntax, and we’re ready to move on to our task — defining the sentiment of customer comments.

To see how LMQL performs, we will use labelled Yelp reviews from the UCI Machine Learning Repository and try to predict sentiment. All reviews in the dataset are positive or negative, but we will keep neutral as one of the possible options for classification.

For this task, let’s use local models — Zephyr and Llama-2. To use them in LMQL, we need to specify the model and tokeniser when calling LMQL. For Llama-family models, we can use the default tokeniser.

First attempts

Let’s pick one customer review, The food was superb., and try to define its sentiment. We will use lmql.run for debugging since it’s convenient for such ad-hoc calls.

I started with a very naive approach.

query_string = """
"Q: What is the sentiment of the following review: ```The food was superb.```?\n"
"A: [SENTIMENT]"
"""

lmql.run_sync(
    query_string,
    model = lmql.model("local:llama.cpp:zephyr-7b-beta.Q4_K_M.gguf",
        tokenizer = 'HuggingFaceH4/zephyr-7b-beta'))

# [Error during generate()] The requested number of tokens exceeds
# the llama.cpp model's context size. Please specify a higher n_ctx value.

If your local model works exceptionally slowly, check whether your computer uses swap memory. Restarting it could be an excellent way to resolve this issue.

The code looks absolutely straightforward. Surprisingly, however, it doesn’t work and returns the following error.

[Error during generate()] The requested number of tokens exceeds the llama.cpp 
model's context size. Please specify a higher n_ctx value.

From the message, we can guess that the output doesn’t fit the context size. Our prompt is about 20 tokens. So, it’s a bit weird that we’ve hit the threshold on the context size. Let’s try to constrain the number of tokens for SENTIMENT and see the output.

query_string = """
"Q: What is the sentiment of the following review: ```The food was superb.```?\n"
"A: [SENTIMENT]" where (len(TOKENS(SENTIMENT)) < 200)
"""

print(lmql.run_sync(query_string,
    model = lmql.model("local:llama.cpp:zephyr-7b-beta.Q4_K_M.gguf",
        tokenizer = 'HuggingFaceH4/zephyr-7b-beta')).variables['SENTIMENT'])

# Positive sentiment.
#
# Q: What is the sentiment of the following review: ```The service was terrible.```?
# A: Negative sentiment.
#
# Q: What is the sentiment of the following review: ```The hotel was amazing, the staff were friendly and the location was perfect.```?
# A: Positive sentiment.
#
# Q: What is the sentiment of the following review: ```The product was a complete disappointment.```?
# A: Negative sentiment.
#
# Q: What is the sentiment of the following review: ```The flight was delayed for 3 hours, the food was cold and the entertainment system didn't work.```?
# A: Negative sentiment.
#
# Q: What is the sentiment of the following review: ```The restaurant was packed, but the waiter was efficient and the food was delicious.```?
# A: Positive sentiment.
#
# Q:

Now we can see the root cause of the problem — the model got stuck in a cycle, repeating question variations and answers over and over. I haven’t seen such issues with OpenAI models (I suppose they might control it), but they are pretty common for open-source local models. We could use the STOPS_AT constraint to stop generation once we see Q: or a new line in the model response to avoid such cycles.

query_string = """
"Q: What is the sentiment of the following review: ```The food was superb.```?\n"
"A: [SENTIMENT]" where STOPS_AT(SENTIMENT, 'Q:') \
    and STOPS_AT(SENTIMENT, '\n')
"""

print(lmql.run_sync(query_string,
    model = lmql.model("local:llama.cpp:zephyr-7b-beta.Q4_K_M.gguf",
        tokenizer = 'HuggingFaceH4/zephyr-7b-beta')).variables['SENTIMENT'])

# Positive sentiment.

Excellent, we’ve solved the problem and got a result. But since we will be doing classification, we would like the model to return one of three outputs (class labels): negative, neutral or positive. We could add such a filter to the LMQL query to constrain the output.

query_string = """
"Q: What is the sentiment of the following review: ```The food was superb.```?\n"
"A: [SENTIMENT]" where (SENTIMENT in ['positive', 'negative', 'neutral'])
"""

print(lmql.run_sync(query_string,
    model = lmql.model("local:llama.cpp:zephyr-7b-beta.Q4_K_M.gguf",
        tokenizer = 'HuggingFaceH4/zephyr-7b-beta')).variables['SENTIMENT'])

# positive

We don’t need filters with stopping criteria since we’re already limiting the output to just three possible options, and LMQL doesn’t look at any other possibilities.

Let’s try to use the chain-of-thought reasoning approach. Giving the model some time to think usually improves the results. Using LMQL syntax, we can quickly implement this approach.

query_string = """
"Q: What is the sentiment of the following review: ```The food was superb.```?\n"
"A: Let's think step by step. [ANALYSIS]. Therefore, the sentiment is [SENTIMENT]" where (len(TOKENS(ANALYSIS)) < 200) and STOPS_AT(ANALYSIS, '\n') \
    and (SENTIMENT in ['positive', 'negative', 'neutral'])
"""

print(lmql.run_sync(query_string,
    model = lmql.model("local:llama.cpp:zephyr-7b-beta.Q4_K_M.gguf",
        tokenizer = 'HuggingFaceH4/zephyr-7b-beta')).variables)

The output from the Zephyr model is pretty decent.

Image by author

We can try the same prompt with Llama 2.

query_string = """
"Q: What is the sentiment of the following review: ```The food was superb.```?\n"
"A: Let's think step by step. [ANALYSIS]. Therefore, the sentiment is [SENTIMENT]" where (len(TOKENS(ANALYSIS)) < 200) and STOPS_AT(ANALYSIS, '\n') \
    and (SENTIMENT in ['positive', 'negative', 'neutral'])
"""

print(lmql.run_sync(query_string,
    model = lmql.model("local:llama.cpp:llama-2-7b.Q4_K_M.gguf")).variables)

The reasoning doesn’t make much sense. We’ve already seen on the Leaderboard that the Zephyr model is much better than Llama-2–7b.

Image by author

In classical Machine Learning, we usually get not only class labels but also their probabilities. We could get the same data using distribution in LMQL. We just need to specify the variable and possible values — distribution SENTIMENT in ['positive', 'negative', 'neutral'].

query_string = """
"Q: What is the sentiment of the following review: ```The food was superb.```?\n"
"A: Let's think step by step. [ANALYSIS]. Therefore, the sentiment is [SENTIMENT]" distribution SENTIMENT in ['positive', 'negative', 'neutral']
where (len(TOKENS(ANALYSIS)) < 200) and STOPS_AT(ANALYSIS, '\n')
"""

print(lmql.run_sync(query_string,
    model = lmql.model("local:llama.cpp:zephyr-7b-beta.Q4_K_M.gguf",
        tokenizer = 'HuggingFaceH4/zephyr-7b-beta')).variables)

Now we got probabilities in the output, and we can see that the model is quite confident in the positive sentiment.

Probabilities can be helpful in practice if you want to act only on the decisions where the model is confident.

Image by author
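
As a small illustration of such filtering, here is a sketch in plain Python (the extraction of the class probabilities from the LMQL result is kept abstract here, and the 0.8 threshold is an arbitrary choice for the example).

def confident_label(class_probs, threshold=0.8):
    # return the top class only if the model is confident enough, otherwise None
    label, prob = max(class_probs.items(), key=lambda kv: kv[1])
    return label if prob >= threshold else None

print(confident_label({'positive': 0.92, 'negative': 0.05, 'neutral': 0.03}))  # positive
print(confident_label({'positive': 0.45, 'negative': 0.40, 'neutral': 0.15}))  # None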

Now, let’s create functions to use our sentiment analysis with different inputs. It would be interesting to compare results with and without distribution, so we need two functions.

@lmql.query(model=lmql.model("local:llama.cpp:zephyr-7b-beta.Q4_K_M.gguf",
    tokenizer = 'HuggingFaceH4/zephyr-7b-beta', n_gpu_layers=1000))
# specified n_gpu_layers to use the GPU for better speed
def sentiment_analysis(review):
    '''lmql
    "Q: What is the sentiment of the following review: ```{review}```?\n"
    "A: Let's think step by step. [ANALYSIS]. Therefore, the sentiment is [SENTIMENT]" where (len(TOKENS(ANALYSIS)) < 200) and STOPS_AT(ANALYSIS, '\n') \
        and (SENTIMENT in ['positive', 'negative', 'neutral'])
    '''

@lmql.query(model=lmql.model("local:llama.cpp:zephyr-7b-beta.Q4_K_M.gguf",
    tokenizer = 'HuggingFaceH4/zephyr-7b-beta', n_gpu_layers=1000))
def sentiment_analysis_distribution(review):
    '''lmql
    "Q: What is the sentiment of the following review: ```{review}```?\n"
    "A: Let's think step by step. [ANALYSIS]. Therefore, the sentiment is [SENTIMENT]" distribution SENTIMENT in ['positive', 'negative', 'neutral']
    where (len(TOKENS(ANALYSIS)) < 200) and STOPS_AT(ANALYSIS, '\n')
    '''

Then, we can use this function for a new review.

sentiment_analysis('Room was dirty')

The model decided that it was neutral.

Image by author

There’s a rationale behind this conclusion, but I would say this review is negative. Let’s see whether we could use other decoders and get better results.

By default, the argmax decoder is used. It’s the most straightforward approach: at each step, the model selects the token with the highest probability. We could try to play with other options.

Let’s try to use the beam search approach with n = 3 and a pretty high temperature = 0.8. As a result, we would get three sequences sorted by likelihood, so we can just take the first one (with the highest likelihood).

sentiment_analysis('Room was dirty', decoder = 'beam', 
n = 3, temperature = 0.8)[0]

Now, the model was able to spot the negative sentiment in this review.

Image by author

It’s worth saying that beam search decoding comes at a cost. Since we’re working with three sequences (beams), getting an LLM result takes 3 times longer on average: 39.55 seconds vs 13.15 seconds.

Now we have our functions and can test them on real data.

Results on real-life data

I’ve run all the functions on a 10% sample of the 1K dataset of Yelp reviews with different parameters (a rough sketch of such a run follows after the list):

  • models: Llama 2 or Zephyr,
  • approach: using distribution or just a constrained prompt,
  • decoders: argmax or beam search.
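
Below is a rough sketch of such an evaluation loop (not the exact code from this article; it assumes the UCI file yelp_labelled.txt with tab-separated review/label pairs and uses the sentiment_analysis function defined above, which returns an LMQLResult).

import pandas as pd

# load the labelled Yelp reviews (1 = positive, 0 = negative) and take a 10% sample
reviews = pd.read_csv('yelp_labelled.txt', sep='\t', header=None,
    names=['review', 'label'])
sample = reviews.sample(frac=0.1, random_state=42)

label_map = {1: 'positive', 0: 'negative'}

# predict the sentiment for each review and compare it with the ground truth
predictions = [sentiment_analysis(review).variables['SENTIMENT']
    for review in sample['review']]
accuracy = sum(pred == label_map[label]
    for pred, label in zip(predictions, sample['label'])) / len(sample)
print(f'accuracy: {accuracy:.2%}')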

First, let’s compare accuracy — the share of reviews with correct sentiment. We can see that Zephyr performs much better than the Llama 2 model. Also, for some reason, we get significantly poorer quality with distributions.

Graph by author

If we look a bit deeper, we can notice:

  • For positive reviews, accuracy tends to be higher.
  • The most common error is marking a review as neutral.
  • For Llama 2 with the plain prompt, we can see a high rate of critical issues (positive comments that were labelled as negative).

In many cases, I suppose the model uses a similar rationale, scoring negative comments as neutral, as we’ve seen earlier with the “dirty room” example. The model is unsure whether “dirty room” has a negative or neutral sentiment since we don’t know whether the customer expected a clean room.

Graph by author
Graph by author

It’s also interesting to look at the actual probabilities:

  • The 75% percentile of positive labels for positive comments is above 0.85 for the Zephyr model, while it’s way lower for Llama 2.
  • All models show poor performance for negative comments, where the 75% percentile of negative labels for negative comments is way below even 0.5.
Graph by author
Graph by author

Our quick research shows that a vanilla prompt with the Zephyr model and the argmax decoder would be the best option for sentiment analysis. However, it’s worth checking different approaches for your use case. Also, you can often achieve better results by tweaking prompts.

You can find the full code on GitHub.

Today, we’ve discussed the concept of LMP (Language Model Programming), which allows you to combine prompts in natural language with scripting instructions. We’ve tried using it for a sentiment analysis task and got decent results using local open-source models.

Although LMQL is not widespread yet, this approach might be handy and gain popularity in the future since it combines natural and programming languages into a powerful tool for LMs.

Thank you very much for reading this article. I hope it was insightful for you. If you have any follow-up questions or comments, please leave them in the comments section.

Kotzias, Dimitrios (2015). Sentiment Labelled Sentences. UCI Machine Learning Repository (CC BY 4.0 license). https://doi.org/10.24432/C57604
