Simple and powerful techniques to make LLMs learn new tasks at inference time
Language modeling (LM) aims to model the generative likelihood of word sequences in order to predict the probabilities of future (or missing) tokens. Language models have revolutionized natural language processing (NLP) in recent years. It is now well known that scaling up language models (e.g., training compute, model parameters) leads to better performance and sample efficiency on a wide range of downstream NLP tasks. The survey paper “A Survey of Large Language Models” [1] covers almost every aspect of large language models. It provides an up-to-date review of the literature on LLMs and details training mechanisms such as pre-training approaches, instruction tuning techniques, and further alignment training with the recent RLHF approach. Instruction tuning and alignment tuning are used to adapt LLMs to specific goals.
After pre-training or adaptation tuning, a major way of using LLMs is to design suitable prompting strategies for solving various tasks. A typical prompting method, also referred to as in-context learning (ICL), formulates the task description and/or demonstrations (examples) in the form of natural language text.
LLMs exhibit an in-context learning (ICL) ability, that is, the ability to learn from a few examples provided in the context. Many studies have shown that LLMs can perform a range of complex tasks through ICL, such as solving mathematical reasoning problems.
The key idea of in-context learning is learning from analogy. The figure below gives an example of how language models make decisions with ICL. First, ICL requires a few examples to form a demonstration context; these examples are usually written in natural language templates. Then, ICL concatenates a query question with the demonstration context to form a prompt, which is fed into the language model for prediction [2].
Unlike supervised learning, which requires a training stage that uses backward gradients to update model parameters, ICL performs no parameter updates and makes predictions directly with the pre-trained language model. The model is expected to learn the pattern hidden in the demonstrations and accordingly make the right prediction.
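To make this concrete, here is a minimal sketch of how demonstrations and a query question are concatenated into a single ICL prompt; the sentiment-classification template and examples are invented purely for illustration, and nothing is fine-tuned.

```python
# Minimal sketch of ICL prompt construction. The template and examples are
# illustrative; no model parameters are updated anywhere in this process.

def build_icl_prompt(demonstrations, query):
    """demonstrations: list of (input_text, label) pairs; query: the new input."""
    parts = [f"Review: {text}\nSentiment: {label}" for text, label in demonstrations]
    parts.append(f"Review: {query}\nSentiment:")  # the model completes the label
    return "\n\n".join(parts)

demos = [
    ("The film was a delight from start to finish.", "positive"),
    ("A tedious, overlong mess.", "negative"),
]
prompt = build_icl_prompt(demos, "Surprisingly moving and well acted.")
# `prompt` is then fed to the frozen, pre-trained LLM for prediction.
```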
What makes ICL attractive?
- Examples written in natural language provide an interpretable interface to communicate with LLMs. This paradigm makes it much easier to incorporate human knowledge into LLMs by changing the examples and templates.
- It resembles the decision process of human beings, who also learn from analogy.
- Compared with supervised training, ICL is a training-free learning framework. This not only greatly reduces the computation cost of adapting the model to new tasks, but also makes language-model-as-a-service possible and easy to apply to large-scale real-world tasks.
But how does this work?
After pre-training, LLMs can exhibit intriguing ICL capabilities (emergent abilities) without being updated [3]. While intuitively reasonable, the working mechanism of ICL remains unclear, and only a few studies have provided preliminary explanations for the two questions below.
How does pre-training affect the ICL ability?
Researchers have suggested that a pre-trained model acquires emergent ICL abilities once it reaches a large scale of pre-training steps or model parameters [3]. Some studies also showed that ICL ability grows as the parameters of LLMs increase from 0.1 billion to 175 billion. Research suggests that the design of training tasks is an important influence factor on the ICL capability of LLMs. Besides training tasks, recent studies have also investigated the relationship between ICL and the pre-training corpora: the performance of ICL has been shown to depend heavily on the source of the pre-training corpora rather than on its scale.
How do LLMs perform ICL during inference?
In the paper “Why Can GPT Learn In-Context?” [4], researchers identified a dual form between Transformer attention and gradient descent and proposed to understand ICL as implicit fine-tuning. They compared GPT-based ICL and explicit fine-tuning on real tasks and found that ICL behaves similarly to fine-tuning from multiple perspectives. Under this framework, the ICL process can be explained as follows: via forward computation, LLMs generate meta-gradients with respect to the demonstrations and implicitly perform gradient descent through the attention mechanism.
Another perspective, from Stanford research [5], explains in-context learning as implicit Bayesian inference. The authors provide a framework in which the LM does in-context learning by using the prompt to “locate” the relevant concept it has learned during pre-training and apply it to the task. Theoretically, this can be viewed as Bayesian inference of a latent concept conditioned on the prompt, a capability that comes from structure (long-term coherence) in the pre-training data.
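Informally, the framework in [5] writes the model's prediction as marginalizing over a latent concept, p(output | prompt) = ∫ p(output | concept, prompt) p(concept | prompt) d(concept), so a good in-context learner is one whose inferred posterior p(concept | prompt) concentrates on the concept that the demonstrations imply.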
Even though there are some answers, this research is still evolving toward a better understanding of the mechanism and its underlying reasons.
Now let us explore some popular ICL methods.
- Chain of Thought (CoT)
- Self-consistency CoT
- Tree of Thoughts
Chain of Thought (CoT)
It has been observed that standard prompting techniques (also referred to as general input-output prompting) do not perform well on complex reasoning tasks, such as arithmetic reasoning, commonsense reasoning, and symbolic reasoning. CoT is an improved prompting technique that boosts the performance of LLMs in such non-trivial cases involving reasoning [6]. Instead of simply constructing the prompt with input-output pairs as in ICL, CoT incorporates into the prompt intermediate reasoning steps that lead to the final output, as can be seen from the example below.
The figure above shows an example of a model producing a chain of thought to solve a math word problem that it would otherwise have gotten wrong. On the left side, with standard ICL, the model is supplied with demonstrations of mathematical reasoning questions and a direct answer, but it is not able to predict the correct answer.
On the right side, with CoT, the demonstration includes the intermediate reasoning steps that lead to its answer. We can see that when the model is then asked a similar reasoning question, it is able to predict the answer accurately, demonstrating the efficacy of the CoT approach for such use cases.
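The sketch below contrasts the two prompt styles using the tennis-ball and cafeteria examples popularized by [6]; the exact wording is paraphrased and only illustrative.

```python
# Standard few-shot demonstration: the question is followed directly by the answer.
standard_demo = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A: The answer is 11.\n\n"
)

# CoT demonstration: the same question, but the answer spells out the reasoning.
cot_demo = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
)

new_question = (
    "Q: The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. "
    "How many apples do they have?\nA:"
)

standard_prompt = standard_demo + new_question  # often fails on multi-step problems
cot_prompt = cot_demo + new_question            # the model imitates the reasoning steps
```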
Notice that CoT and ICL typically provide a few examples to demonstrate the task; this is called few-shot prompting. There is another paper [7] that introduced an interesting prompt, “Let’s think step by step”, used without any demonstrations at all; this is called zero-shot prompting (no examples).
In Zero-shot CoT, the LLM is first prompted with “Let’s think step by step” to generate reasoning steps and then prompted with “Therefore, the answer is” to derive the final answer. The authors find that this strategy drastically boosts performance once the model scale exceeds a certain size, but is not effective with small-scale models, showing the familiar pattern of emergent abilities.
Above: Example inputs and outputs of GPT-3 with (a) standard few-shot (ICL), (b) Few-shot-CoT, (c) standard zero-shot (ICL), and (d) Zero-shot-CoT [7].
Like Few-shot-CoT, Zero-shot-CoT facilitates multi-step reasoning (blue text) and reaches the correct answer where standard prompting fails. Unlike Few-shot-CoT, which uses step-by-step reasoning examples per task, Zero-shot-CoT does not need any examples and just uses the same prompt, “Let’s think step by step”, across all tasks (arithmetic, symbolic, commonsense, and other logical reasoning tasks).
This research shows that LLMs are decent zero-shot reasoners when a simple prompt, “Let’s think step by step”, is added to facilitate step-by-step thinking before answering each question.
Let us see what happens underneath:
While Zero-shot-CoT is conceptually simple, it uses prompting twice to extract both the reasoning and the answer, as explained in the figure below.
The method involves two steps: first, “reasoning extraction” prompts the language model for a full reasoning path, and then “answer extraction” pulls the answer in the correct format out of the reasoning text.
1st prompt — reasoning extraction
In this step, the input question x is first modified into a prompt x’ using a simple template “Q: [X]. A: [T]”, where [X] is an input slot for x and [T] is a slot for a hand-crafted trigger sentence t that elicits a chain of thought to answer the question x. For example, if we use “Let’s think step by step” as the trigger sentence, the prompt x’ becomes “Q: [X]. A: Let’s think step by step.” The prompted text x’ is then fed into the language model, which generates a subsequent sentence z. Any decoding strategy can be used.
Other examples of such trigger prompts:
Let’s think about this logically.
Let’s solve this problem by splitting it into steps.
Let’s think like a detective step-by-step.
Before we dive into the answer.
2nd prompt — answer extraction
In the second step, the generated sentence z, together with the prompted sentence x’, is used to extract the final answer from the language model. Concretely, the three elements are simply concatenated as “[X’] [Z] [A]”: [X’] for the 1st prompt x’, [Z] for the sentence z generated in the first step, and [A] for a trigger sentence that extracts the answer. The prompt for this step is self-augmented, since it contains the sentence z generated by the same language model. In their experiments, the authors use slightly different answer triggers depending on the answer format.
For example, they use “Therefore, among A through E, the answer is” for multiple-choice QA, and “Therefore, the answer (Arabic numerals) is” for math problems requiring a numerical answer.
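Putting the two steps together, a minimal sketch of the pipeline looks as follows, where `generate` is a placeholder for any LLM completion call rather than a specific library API.

```python
# Two-stage Zero-shot-CoT, following the template "Q: [X]. A: [T]" described in [7].

def zero_shot_cot(question, generate,
                  reason_trigger="Let's think step by step.",
                  answer_trigger="Therefore, the answer is"):
    # 1st prompt: reasoning extraction
    reasoning_prompt = f"Q: {question}\nA: {reason_trigger}"
    reasoning = generate(reasoning_prompt)      # z: the generated chain of thought

    # 2nd prompt: answer extraction, self-augmented with the generated reasoning
    answer_prompt = f"{reasoning_prompt} {reasoning}\n{answer_trigger}"
    return generate(answer_prompt)              # final answer in the desired format
```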
The paper [7] covers more interesting ideas, including the performance of various prompts; please read it for more details.
When does CoT work for LLMs?
- CoT only has a positive effect on sufficiently large models (typically those with 10B or more parameters), not on small models. This phenomenon is known as the ‘emergent abilities’ of large language models: an ability is considered emergent if it is not present in smaller models but is present in larger models [3].
- It is mainly effective at improving tasks that require step-by-step reasoning, such as arithmetic reasoning, commonsense reasoning, and symbolic reasoning.
- For other tasks that do not depend on complex reasoning, it may show worse performance than standard prompting. Interestingly, it seems that the performance gain brought by CoT prompting is significant only when standard prompting yields poor results.
Why can LLMs perform CoT reasoning?
- It is widely hypothesized that this ability can be attributed to training on code, since models trained on code show strong reasoning ability. Intuitively, code data is well organized with algorithmic logic and programming flow, which may be useful for improving the reasoning performance of LLMs. However, this hypothesis still lacks publicly reported ablation experiments (with and without training on code).
- The main distinction between CoT prompting and standard prompting is the incorporation of reasoning paths before the final answer. Thus, some researchers have investigated the effect of different components of the reasoning paths. Specifically, a recent study identifies three key components in CoT prompting: symbols (e.g., numerical quantities in arithmetic reasoning), patterns (e.g., equations in arithmetic reasoning), and text (i.e., the rest of the tokens that are neither symbols nor patterns). It is shown that the latter two parts (patterns and text) are essential to model performance, and removing either one leads to a significant performance drop.
This is an active area of research; for an in-depth discussion, please read [2]. There is also interesting research [8] that discusses possible mechanisms behind in-context learning in transformer models.
Self-consistency CoT
The authors of [9] propose a decoding strategy called self-consistency to replace the greedy decoding used in chain-of-thought prompting, further improving language models’ reasoning performance by a significant margin. Self-consistency leverages the intuition that complex reasoning tasks typically admit multiple reasoning paths that reach the correct answer: the more deliberate thinking and analysis a problem requires, the greater the diversity of reasoning paths that can recover the answer.
First, the language model is prompted with chain-of-thought prompting; then, instead of greedily decoding a single reasoning path, the authors propose a “sample-and-marginalize” decoding procedure.
The figure below illustrates the self-consistency method with an example.
First, sample from the language model’s decoder to generate a diverse set of reasoning paths; each reasoning path may lead to a different final answer. Then determine the optimal answer by marginalizing out the sampled reasoning paths and finding the most consistent answer in the final answer set. In other words, by taking a majority vote over the answers, we arrive at the most “consistent” answer among the final answer set.
Such an approach is analogous to the human experience that if multiple different ways of thinking lead to the same answer, one has greater confidence that the final answer is correct. Compared with other decoding methods, self-consistency avoids the repetitiveness and local optimality that plague greedy decoding, while mitigating the stochasticity of a single sampled generation.
Extensive empirical evaluation shows that self-consistency boosts the performance of chain-of-thought prompting by a striking margin on a range of popular arithmetic and commonsense reasoning benchmarks, including GSM8K (+17.9%), SVAMP (+11.0%), AQuA (+12.2%), StrategyQA (+6.4%) and ARC-challenge (+3.9%).
One limitation of self-consistency is that it incurs more computation cost. In practice, a small number of paths (e.g., 5 or 10) can serve as a starting point to realize most of the gains without incurring too much cost, since performance generally saturates quickly.
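A hedged sketch of the procedure, with `generate` and `extract_answer` as placeholders rather than a specific API, might look like this:

```python
from collections import Counter

def self_consistency(cot_prompt, generate, extract_answer, n_paths=10):
    """Sample several reasoning paths and return the majority-vote answer."""
    answers = []
    for _ in range(n_paths):
        path = generate(cot_prompt, temperature=0.7)  # sampled decoding, not greedy
        answers.append(extract_answer(path))          # e.g., parse "The answer is X"
    # Marginalize over reasoning paths: the most frequent final answer wins.
    return Counter(answers).most_common(1)[0][0]
```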
Tree of Thoughts
The authors of [10] propose “Tree of Thoughts” (ToT), which generalizes the “Chain of Thought” approach to prompting language models and enables exploration over coherent units of text (“thoughts”) that serve as intermediate steps toward problem-solving. ToT allows LMs to perform deliberate decision-making by considering multiple reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global decisions. Experiments show that ToT significantly enhances language models’ problem-solving abilities on three novel tasks requiring non-trivial planning or search: Game of 24, Creative Writing, and Mini Crosswords.
Tree of Thoughts (ToT) allows LMs to explore multiple reasoning paths over thoughts (figure above). ToT frames any problem as a search over a tree, where each node is a state s = [x, z1···i] representing a partial solution consisting of the input x and the sequence of thoughts so far z1···i. ToT involves four components: thought decomposition, a thought generator, a state evaluator, and a search algorithm.
1. Thought decomposition: Decompose the intermediate process into thought steps:
While CoT samples thoughts coherently without explicit decomposition, ToT leverages problem properties to design and decompose intermediate thought steps. As Table 1 of [10] shows, depending on the problem, a thought can be a couple of words (Crosswords), a line of an equation (Game of 24), or a whole paragraph of a writing plan (Creative Writing). It is like dividing the question into several tasks, where each task is a step Zn. Note that this part is only about decomposing the question into tasks; it is like planning, and we do not actually generate any thoughts here.
2. Thought generation: After we define the task for each step in thought decomposition, we actually generate the thoughts. We try to generate k candidate thoughts for a given step Zn. There are two ways to generate thoughts: sample and propose.
a. Sample i.i.d. thoughts from a CoT prompt, repeating the generation process k times independently. This works better when the thought space is rich (e.g., each thought is a paragraph), since i.i.d. samples lead to diversity.
The figure above shows a step of deliberate search in a randomly picked Creative Writing task. Given the input, the LM samples 5 different plans, then votes 5 times to decide which plan is best. The majority choice is then used to write the output passage with the same sample-vote procedure.
b. Propose thoughts sequentially using a “propose prompt”. This works better when the thought space is more constrained (e.g., each thought is just a word or a line), since proposing different thoughts in the same context avoids duplication. Here, we generate k thoughts in one inference call, so these k thoughts may not be independent.
3. State evaluation: In this part, we define a state evaluation function v(s). To expand the tree, we use this function to find the promising paths, as in chess programming. We evaluate a given path of the tree s = [x, z1···i]. There are two ways to define the evaluation function:
- Value each state independently: each state ‘s’ (or path) is evaluated on its own. [Example: Game of 24]
- Vote across states: each state ‘s’ is evaluated relative to the set of all states S, i.e., the states in S are compared to one another, much as in self-consistency CoT. [Example: Creative Writing task]
Example: Game of 24
Game of 24 is a mathematical reasoning challenge where the goal is to use 4 numbers and basic arithmetic operations (+ - * /) to obtain 24. For example, given the input “4 9 10 13”, a solution output could be “(10 - 4) * (13 - 9) = 24”.
To frame Game of 24 as a ToT problem, we decompose the thoughts into 3 steps, each an intermediate equation. As shown in figure (a) above, at each tree node we extract the “left” (remaining) numbers and prompt the LM to propose possible next steps. The same “propose prompt” is used for all 3 thought steps, though it only contains one example with 4 input numbers. We perform a breadth-first search (BFS) in ToT, keeping the best b = 5 candidates at each step. For deliberate BFS, as shown in figure (b), we prompt the LM to evaluate each thought candidate as “sure/maybe/impossible” with respect to reaching 24. The aim is to promote correct partial solutions that can be verified within a few look-ahead trials, eliminate impossible partial solutions based on “too big/small” common sense, and keep the rest as “maybe”. Values are sampled 3 times for each thought.
4. Search algorithm: We expand the tree, evaluating each leaf node with the state evaluation function. To choose which leaf node to expand, we use a search algorithm, such as breadth-first search or depth-first search. Different search algorithms can be plugged in depending on the tree structure.
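Putting the four components together, here is a simplified BFS-style sketch; `propose_thoughts` and `evaluate_state` stand in for the LLM calls (the propose prompt and the value prompt) and are assumptions, not the paper’s exact implementation.

```python
def tree_of_thoughts_bfs(x, propose_thoughts, evaluate_state, n_steps=3, breadth=5):
    """x: problem input; returns the best sequence of thoughts found."""
    frontier = [[]]                                        # each state is a list of thoughts
    for _ in range(n_steps):                               # thought decomposition: fixed depth
        candidates = []
        for state in frontier:
            for thought in propose_thoughts(x, state):     # thought generation
                candidates.append(state + [thought])
        # State evaluation: score each partial solution, keep the best b candidates.
        candidates.sort(key=lambda s: evaluate_state(x, s), reverse=True)
        frontier = candidates[:breadth]
    return frontier[0]
```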
Conceptually, ToT has several advantages as a method for general problem-solving with LMs:
- Generality: IO prompting, CoT, CoT-SC, and self-refinement can be seen as special cases of ToT (i.e., trees of limited depth and breadth).
- Modularity: The base LM, as well as the thought decomposition, generation, evaluation, and search procedures, can all be varied independently.
- Adaptability: Different problem properties, LM capabilities, and resource constraints can be accommodated.
- Convenience: No extra training is required; a pre-trained LM is sufficient.
The ToT framework empowers LMs to make decisions and solve problems more autonomously and intelligently.
Limitations: ToT requires more resources (e.g., model API cost) than sampling methods in order to improve task performance, but its modular flexibility allows users to customize such performance-cost tradeoffs, and ongoing open-source efforts should readily reduce such costs in the near future.
Prompt engineering is an empirical science, and the effect of prompt engineering methods can vary widely among models, requiring heavy experimentation and heuristics. Can we automate this process of prompt engineering? This is an active research area, and the following section discusses some attempts at automatic prompt design.
Automatic Prompt Augmentation and Selection with CoT
This approach comes from the paper “Automatic Prompt Augmentation and Selection with Chain-of-Thought from Labeled Data” [11]. Most CoT studies rely on carefully designed, human-annotated rationale chains to prompt the language model, which poses challenges for real-world applications where labeled training data is available without human-annotated rationale chains. To construct chain-of-thought prompts automatically, the authors propose augment-prune-select, a three-step process (a minimal sketch follows the list):
- Augment: Generate multiple pseudo-chains of thought for a given question using few-shot or zero-shot CoT prompts.
- Prune: Prune the pseudo-chains based on whether their generated answers match the ground truths.
- Select: Apply a variance-reduced policy gradient technique to learn a probability distribution over the candidate examples, treating the distribution over examples as the policy and the validation-set accuracy as the reward.
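As a rough illustration, the augment and prune steps could be sketched as below; the policy-gradient selection step is omitted, `generate` is a placeholder LLM call, and `answers_match` is a naive comparison helper invented here.

```python
def augment_and_prune(labeled_questions, generate, n_chains=4):
    """labeled_questions: list of (question, gold_answer) pairs without rationales."""
    pool = []
    for question, gold in labeled_questions:
        for _ in range(n_chains):                              # augment: sample pseudo-chains
            chain = generate(f"Q: {question}\nA: Let's think step by step.",
                             temperature=0.7)
            predicted = chain.split("answer is")[-1].strip().rstrip(".")
            if answers_match(predicted, gold):                 # prune: keep chains whose
                pool.append((question, chain))                 # answer matches the ground truth
    return pool  # candidate demonstrations for the later selection step

def answers_match(predicted, gold):
    # Naive string comparison; real implementations normalize numbers and units.
    return predicted == str(gold)
```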
Auto-CoT: Automatic Chain-of-Thought Prompting
In “Automatic Chain-of-Thought Prompting in Large Language Models” [12], the authors propose the Auto-CoT paradigm to automatically construct demonstrations consisting of questions and reasoning chains. In this technique, the authors adopt clustering to sample questions and then generate chains. They observed that LLMs tend to make certain kinds of mistakes, and questions that provoke one kind of error may be similar in the embedding space and thus get grouped together. By sampling only one or a few questions from each frequent-error cluster, we can prevent too many mistaken demonstrations of one error type and collect a diverse set of examples.
Auto-CoT consists of the following main stages (a rough code sketch follows the list):
- Question clustering: Perform cluster analysis on a given set of questions Q. First, compute a vector representation for each question in Q with Sentence-BERT; the contextualized vectors are averaged to form a fixed-size question representation. Then, the question representations are processed by the k-means clustering algorithm to produce k clusters of questions.
- Demonstration selection: Select a representative question from each cluster, i.e., one demonstration per cluster. Samples in each cluster are sorted by distance to the cluster centroid, and those closer to the centroid are chosen first.
- Rationale generation: Use zero-shot CoT to generate reasoning chains for the selected questions and construct a few-shot prompt to run inference.
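A hedged sketch of this pipeline, assuming the sentence-transformers and scikit-learn libraries and illustrative choices of encoder model and cluster count, could look like this:

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def auto_cot_demonstrations(questions, generate, n_clusters=8):
    """Build a few-shot CoT prompt from an unlabeled question set, Auto-CoT style."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")        # question encoding
    embeddings = encoder.encode(questions)
    kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(embeddings)

    demos = []
    for c in range(n_clusters):
        idx = np.where(kmeans.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[idx] - kmeans.cluster_centers_[c], axis=1)
        q = questions[int(idx[np.argmin(dists)])]            # demonstration selection
        rationale = generate(f"Q: {q}\nA: Let's think step by step.")  # rationale generation
        demos.append(f"Q: {q}\nA: Let's think step by step. {rationale}")
    return "\n\n".join(demos)  # prepend to a new question at inference time
```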
LLMs have shown reasoning capabilities with CoT prompting, but the superior performance of Manual-CoT hinges on hand-crafted demonstrations. To eliminate such manual design, Auto-CoT constructs demonstrations automatically by sampling diverse questions and generating reasoning chains. Experimental results on reasoning datasets show that, with GPT-3, Auto-CoT consistently matches or exceeds the performance of the CoT paradigm that requires manually designed demonstrations.
In-context learning, or prompting, helps us communicate with an LLM to steer its behavior toward desired outcomes. It is an attractive approach because you don’t need a large offline training set, you don’t need offline access to the model, and it feels intuitive even for non-engineers. Prompt engineering aims to use prompting as a way to build reliable functionality for real-world applications. It is an empirical science, and the effect of prompt engineering methods can vary widely among models, requiring heavy experimentation and heuristics. Prompting also requires significant human effort to create and adapt to new datasets: the annotation process is nontrivial because humans must not only select the questions but also carefully design the reasoning steps for each one, so there is a real need for automating prompting techniques.
[1] A Survey of Large Language Models, https://arxiv.org/pdf/2303.18223.pdf
[2] A Survey on In-Context Learning, https://arxiv.org/pdf/2301.00234.pdf
[3] Emergent Abilities of Large Language Models, https://arxiv.org/pdf/2206.07682.pdf
[4] Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers, https://arxiv.org/pdf/2212.10559.pdf
[5] An Explanation of In-context Learning as Implicit Bayesian Inference, http://ai.stanford.edu/blog/understanding-incontext/
[6] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, https://arxiv.org/pdf/2201.11903.pdf
[7] Large Language Models are Zero-shot Reasoners, https://arxiv.org/pdf/2205.11916.pdf
[8] In-context Learning and Induction Heads, Transformer Circuits, 2022, https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html
[9] Self-consistency improves chain-of-thought reasoning in LLM, https://arxiv.org/pdf/2203.11171.pdf
[10] Tree of Thoughts, https://arxiv.org/pdf/2305.10601.pdf
[11] Automatic Prompt Augmentation and Selection with Chain-of-Thought from Labeled Data https://arxiv.org/pdf/2302.12822.pdf
[12] Automatic Chain-of-Thought Prompting in Large Language Models, https://arxiv.org/pdf/2210.03493.pdf
[13] Large Language Models Can Self-Improve, https://www.arxiv-vanity.com/papers/2210.11610/