
Deploying Large Language Models With HuggingFace TGI


Another option to efficiently host and scale your LLMs with Amazon SageMaker

Image from Unsplash

Large Language Models (LLMs) continue to soar in popularity as a new one is released nearly every week. With the number of these models increasing, so are the options for how we can host them. In my previous article we explored how we could utilize DJL Serving within Amazon SageMaker to efficiently host LLMs. In this article we explore another optimized model server and solution: HuggingFace Text Generation Inference (TGI).

NOTE: For those of you new to AWS, make sure you create an account at the following link if you want to follow along. This article also assumes an intermediate understanding of SageMaker Deployment; I would suggest following this article for a deeper understanding of Deployment/Inference.

DISCLAIMER: I’m a Machine Learning Architect at AWS and my opinions are my own.

Why HuggingFace Text Generation Inference? How Does It Work With Amazon SageMaker?

TGI is a Rust, Python, and gRPC model server created by HuggingFace that can be used to host specific large language models. HuggingFace has long been the central hub for NLP, and TGI incorporates a large set of optimizations specifically for LLMs; a few are listed below (with a request sketch after the list), and the documentation has an extensive list.

  • Tensor Parallelism for efficient hosting across multiple GPUs
  • Token Streaming with SSE
  • Quantization with bitsandbytes
  • Logits warper (different params such as temperature, top-k, top-p, etc.)
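
To make the logits warper parameters above concrete, here is a minimal sketch of a request against a TGI server's /generate route. It assumes a TGI container is already running and reachable at a hypothetical local URL; the model, port, and parameter values are illustrative, not prescribed by the article.

```python
import requests

# Hypothetical endpoint: assumes a TGI container is already running locally,
# e.g. launched with flags such as --num-shard (tensor parallelism) and
# --quantize bitsandbytes (quantization).
TGI_URL = "http://localhost:8080/generate"

payload = {
    "inputs": "What is Amazon SageMaker?",
    "parameters": {
        "max_new_tokens": 128,
        # Logits warper settings exposed by TGI
        "do_sample": True,
        "temperature": 0.7,
        "top_k": 50,
        "top_p": 0.9,
    },
}

response = requests.post(TGI_URL, json=payload, timeout=60)
print(response.json()["generated_text"])
```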

A big positive of this solution that I noted is its simplicity of use. TGI at the moment supports the following optimized model architectures, which you can deploy directly using the TGI containers.
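As a rough sketch of what deploying one of these architectures on SageMaker can look like with the SageMaker Python SDK, the snippet below retrieves the managed TGI container image and deploys a HuggingFaceModel. The model ID, container version, GPU count, and instance type are assumptions for illustration only.

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# Assumes this runs in an environment with a SageMaker execution role attached
role = sagemaker.get_execution_role()

# Retrieve the SageMaker-managed TGI (LLM) container image
image_uri = get_huggingface_llm_image_uri("huggingface", version="0.8.2")

model = HuggingFaceModel(
    image_uri=image_uri,
    env={
        "HF_MODEL_ID": "tiiuae/falcon-7b-instruct",  # example of a TGI-optimized architecture
        "SM_NUM_GPUS": "4",                          # tensor parallel degree across GPUs
    },
    role=role,
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",  # illustrative multi-GPU instance choice
)

print(predictor.predict({"inputs": "What is Amazon SageMaker?"}))
```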
