EPFL researchers, in collaboration with Apple, have introduced a new approach to speculative sampling called Parallel Speculative Sampling (PaSS). The approach drafts multiple tokens concurrently using a single model, combining the advantages of auto-regressive generation and speculative sampling. PaSS was evaluated on text and code completion tasks and showed promising performance without compromising model quality. The team also explored the impact of the number of look-ahead embeddings on the approach and found an optimal number for achieving the best results.
PaSS addresses a key constraint of speculative sampling, which requires two models with the same tokenizer, by drafting multiple tokens in parallel with a single model. Comparative evaluations against auto-regressive generation and a baseline method show a speed advantage for PaSS with comparable performance. Tests on text and code completion tasks yield promising results without compromising overall model quality. The study also explores the impact of sampling schemes and look-ahead embeddings on PaSS performance.
Large language models are limited by auto-regressive generation, which requires a forward pass for every generated token and is costly in memory access and processing time. Speculative sampling offers a solution but requires two models with the same tokenizer, which introduces bottlenecks. PaSS is an alternative that drafts multiple tokens with a single model, eliminating the need for a second model.
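To make the two-model setup concrete, here is a minimal sketch of the accept/reject rule commonly used in speculative sampling. The rule comes from the general speculative sampling literature and is not spelled out in this article; the function name `speculative_accept` and the example distributions are illustrative assumptions.

```python
import random
from typing import List


def speculative_accept(p: List[float], q: List[float], drafted: int,
                       rng: random.Random) -> int:
    """Validate one drafted token (illustrative helper, not from the paper).

    p: next-token distribution of the target model.
    q: next-token distribution of the draft model over the same vocabulary
       (this is why both models must share a tokenizer).
    drafted: token index that was sampled from q.
    """
    # Accept the drafted token with probability min(1, p / q).
    accept_prob = min(1.0, p[drafted] / q[drafted])
    if rng.random() < accept_prob:
        return drafted
    # On rejection, resample from the residual max(0, p - q), renormalized
    # implicitly by rng.choices. This keeps the output distributed as p.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual, k=1)[0]


# Example with assumed toy distributions over a 3-token vocabulary.
rng = random.Random(0)
p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]
drafted = rng.choices(range(3), weights=q, k=1)[0]
token = speculative_accept(p, q, drafted, rng)
```

Over many draws the returned tokens follow the target distribution `p`, which is why speculative sampling can speed up generation without changing the model's output distribution.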
The proposed method uses parallel decoding, which eliminates the need for a second model, and involves two phases: drafting and validation. During the drafting phase, the model produces multiple tokens concurrently using parallel decoding; the first token is excluded from the draft so that the output still matches the model’s distribution if the draft is rejected. During the validation phase, the same model checks the drafted tokens. This approach improves generation speed while maintaining overall model quality.
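The draft-then-validate loop can be sketched with a toy example. This is a simplified greedy illustration under stated assumptions, not the authors’ implementation: `model_next` stands in for a forward pass of the language model, `lookahead_draft` is a hypothetical drafter that is deliberately wrong on some tokens, and each phase is counted as one forward pass.

```python
from typing import List, Tuple

VOCAB = 10  # toy vocabulary size (assumption)


def model_next(context: List[int]) -> int:
    """Toy stand-in for one greedy forward pass: predicts last token + 1."""
    return (context[-1] + 1) % VOCAB


def lookahead_draft(first: int, k: int) -> List[int]:
    """Hypothetical drafter for the k tokens that follow `first`.

    It is deliberately wrong whenever the true token is a multiple of 4,
    so that some drafts get rejected during validation.
    """
    drafts, tok = [], first
    for _ in range(k):
        tok = (tok + 1) % VOCAB
        drafts.append(tok if tok % 4 != 0 else -1)
    return drafts


def autoregressive(prompt: List[int], n: int) -> Tuple[List[int], int]:
    """Baseline: one forward pass per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(model_next(out))
    return out[len(prompt):], n


def pass_style(prompt: List[int], n: int, k: int) -> Tuple[List[int], int]:
    """Greedy draft-then-validate loop; returns (tokens, forward passes)."""
    out, passes = list(prompt), 0
    while len(out) - len(prompt) < n:
        # Drafting pass: the first token comes from the model itself and is
        # not part of the draft; k further tokens are guessed in parallel.
        first = model_next(out)
        drafts = lookahead_draft(first, k)
        passes += 1
        out.append(first)
        # Validation pass: a real model scores all drafted positions in one
        # batched pass; the repeated model_next calls here simulate that.
        passes += 1
        for d in drafts:
            if d != model_next(out):
                break  # first mismatch: discard this and later drafts
            out.append(d)
        # The validation pass also yields the token after the accepted prefix.
        out.append(model_next(out))
    return out[len(prompt):len(prompt) + n], passes


ref, ref_passes = autoregressive([0], 12)
fast, fast_passes = pass_style([0], 12, k=3)
print(fast == ref, ref_passes, fast_passes)
```

In this toy setting the loop returns exactly the tokens of plain auto-regressive decoding while using fewer forward passes; the saving depends entirely on how often the drafted tokens are accepted.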
PaSS was found to be an efficient way to generate text with language models, with a speed-up of up to 30% compared with auto-regressive generation while keeping model performance within the margin of error. Compared with baselines using different sampling schemes, PaSS also generated tokens with lower variance and higher predictability. The study further found that the number of look-ahead steps affected PaSS performance, with running time decreasing steadily up to 6 look-ahead steps.
PaSS is a language model generation technique that uses a parallel drafting approach for token decoding with fine-tuned look-ahead embeddings. Evaluations on text and code completion tasks showed that it generates tokens with low variance and high predictability. The researchers aim to improve performance further through better look-ahead tokens.
As future work, the researchers suggest exploring ways to improve the quality of parallel generation with look-ahead tokens, which they consider a promising avenue for improving PaSS performance. They also emphasize the need for further investigation into the impact of the number of look-ahead steps, since too many steps might negate the approach’s advantages.
Check out the Paper. All credit for this research goes to the researchers of this project.
Hello, my name is Adnan Hassan. I’m a consulting intern at Marktechpost and soon to be a management trainee at American Express. I’m currently pursuing a dual degree at the Indian Institute of Technology, Kharagpur. I’m passionate about technology and want to create new products that make a difference.