Learn how to deploy a real ML application using AWS and FastAPI
Introduction
I have always thought that even the best project in the world doesn’t have much value if people cannot use it. That is why it is so important to learn how to deploy Machine Learning models. In this article we focus on deploying a small large language model, TinyLlama, on an AWS EC2 instance.
List of tools I’ve used for this project:
- Deepnote: a cloud-based notebook that is great for collaborative data science projects and well suited to prototyping
- FastAPI: a web framework for building APIs with Python (a minimal serving sketch follows this list)
- AWS EC2: a web service that provides resizable compute capacity in the cloud
- Nginx: an HTTP and reverse proxy server. I use it to connect the FastAPI server to AWS
- GitHub: a hosting service for software projects
- HuggingFace: a platform to host and collaborate on unlimited models, datasets, and applications.
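To make the pieces above concrete, here is a minimal sketch of how the FastAPI serving layer could look. It is illustrative rather than the article's final code: the file name `app.py`, the `/generate` endpoint, the request schema, and the model ID `TinyLlama/TinyLlama-1.1B-Chat-v1.0` are my own assumptions.

```python
# app.py -- minimal serving sketch (endpoint name, schema, and model ID are assumptions)
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()

# Load the model once at startup so every request reuses the same pipeline
pipe = pipeline("text-generation", model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

class Prompt(BaseModel):
    text: str
    max_new_tokens: int = 128

@app.post("/generate")
def generate(prompt: Prompt):
    # The text-generation pipeline returns a list of dicts with "generated_text"
    result = pipe(prompt.text, max_new_tokens=prompt.max_new_tokens)
    return {"completion": result[0]["generated_text"]}
```

On the EC2 instance you would run this with something like `uvicorn app:app --host 0.0.0.0 --port 8000`, with Nginx proxying incoming HTTP traffic to that port.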
About TinyLlama
TinyLlama-1.1B is a project that aims to pretrain a 1.1B-parameter Llama model on 3 trillion tokens. It uses the same architecture as Llama 2.
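As a quick sanity check before wiring anything into an API, the model can be loaded directly from the HuggingFace Hub with the `transformers` library. A minimal sketch, assuming the chat-tuned checkpoint `TinyLlama/TinyLlama-1.1B-Chat-v1.0`:

```python
from transformers import pipeline

# Download the chat-tuned TinyLlama checkpoint from the HuggingFace Hub
pipe = pipeline("text-generation", model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Format the conversation with the model's built-in chat template
messages = [{"role": "user", "content": "What is an EC2 instance?"}]
prompt = pipe.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

output = pipe(prompt, max_new_tokens=128, do_sample=True, temperature=0.7)
print(output[0]["generated_text"])
```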
Today’s large language models have impressive capabilities, but they are extremely expensive in terms of hardware. In many settings the available hardware is limited: think of smartphones or satellites. So there is a lot of research on building smaller models that can be deployed on edge devices.
Here is a list of “small” models that are gaining traction:
- Mobile VLM (Multimodal)
- Phi-2
- Obsidian (Multimodal)