Run multiple open source LLMs locally on your machine
Large Language Models (LLMs) are text-generation models that take a prompt as input and produce a response based on it. Well-known online LLMs include ChatGPT, Claude, Gemini, etc.
The problem with online LLMs is that you need an internet connection to use them, and the handling of your data is not always transparent. The solution is to install open source LLMs locally on your computer and use your own GPU to generate output. The advantage is that your data stays secure and private: no third party can access your queries, which makes this approach useful for privacy-focused users and companies. You can also set up a server comprised of multiple Nvidia GPUs and fine-tune the models for your use case.
I am using a MacBook Pro based on Apple silicon, so for Windows/Linux users the steps may vary.
I also recommend a machine with at least 64GB of RAM if you are going to use a 70B parameter model, as the model will take up to 70–75% of it (44GB–48GB of RAM) to run optimally. You do not want memory swapping, as it will wear out your SSD in the long run and slow the process down. I have added a recommended RAM/model list below.
The model we are going to use is Llama 3. Before we begin, I want to point out that you will need both Node.js and a Conda Python environment configured.
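You can quickly verify both prerequisites from the terminal (this assumes Node and Conda are already installed and on your PATH):
node --version
conda --version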
Step 1: Clone the git repository for Open WebUI into your dev folder.
git clone https://github.com/open-webui/open-webui.git
We can have a peek at the code:
ls open-webui
cd open-webui
code .
As you can see from the package.json file, the frontend is written in JavaScript with the Svelte framework, and the backend is written in Python.
To download Llama 3, we first need Ollama, which helps us configure and manage multiple LLM models.
Step 2: Download Ollama
Choose your operating system
Once downloaded, extract the Ollama-darwin.zip archive and then drag the extracted app to your Applications folder. You should be able to run Ollama by clicking the app icon in Applications.
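If you prefer the command line, Ollama is also available through Homebrew (assuming you already use Homebrew); ollama serve then starts the same local server that the app runs for you:
brew install ollama
ollama serve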
Once the app is running, go to your web browser and open localhost:11434, which will display a message confirming that Ollama is running.
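You can also verify this from the terminal; a plain request to the default Ollama port should return a short "Ollama is running" message:
curl http://localhost:11434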
Use your terminal and type:
ollama pull llama3
You can also download any other model, such as Phi-3, Mixtral, Gemma, etc.
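For example (the model names below follow the tags in the Ollama model library at the time of writing, so double-check them before pulling):
ollama pull phi3
ollama pull mixtral
ollama pull gemma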
To run the model, type:
ollama run llama3
To exit the chat, type:
/bye
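To see which models you have installed locally, and to delete one you no longer need (they can take up tens of gigabytes of disk space), replace <model> with the model's name and run:
ollama list
ollama rm <model>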
Now we need a pretty GUI for our LLMs. In your Open WebUI project directory, enter the following terminal command:
conda create --name openwebui python=3.11
To activate the environment, type:
conda activate openwebui
The prefix in your terminal prompt should change from (base) to (openwebui), i.e. from (base) burhan@ezlo-macbook-pro open-webui ~ % to (openwebui) burhan@ezlo-macbook-pro open-webui ~ %.
Run
python --version
And make sure it's Python 3.11.x.
Now we can go to the backend folder:
cd backend/
(openwebui) burhan@ezlo-macbook-pro backend~ %
And install the requirements:
pip install -r requirements.txt -U
It will take a bit to download as there are a lot of requirements.
Now go back to the base environment in the open-webui root folder, i.e. (base) burhan@ezlo-macbook-pro open-webui ~ %, and run the following commands:
npm install
npm run build
Go back to your backend terminal, i.e. (openwebui) burhan@ezlo-macbook-pro backend~ %.
Now run:
bash start.sh
Go to your web browser and open localhost:8080/auth/ in the address bar.
You can sign up now. Just a heads up: your credentials are not sent anywhere; it's a standard auth process backed by a database stored locally on your machine.
Voilà! You are now signed in and ready to go. Just select the model from the top left corner.
You can now play with the settings to use different models, and even mix models together according to your preference.
You can also pull models from the WebUI, and you can make it multimodal by integrating the AUTOMATIC1111 Stable Diffusion web UI for image generation.
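A rough sketch of that integration, assuming you already have AUTOMATIC1111's Stable Diffusion web UI installed: start it with its API enabled, then point the image generation settings in Open WebUI at its default address, http://localhost:7860.
./webui.sh --api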
In the Playground you can set up your own system prompts to condition your input prompts.
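The same idea works against the Ollama API directly; here is a minimal sketch, assuming Ollama is still running on its default port and you have pulled llama3 (the system message and prompt are just examples):
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "system": "You are a concise assistant that answers in bullet points.",
  "prompt": "Explain what quantization does to an LLM.",
  "stream": false
}'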
You can also download pre-tuned models as Modelfiles created by other people for your specific use case, for example models pre-tuned for coding or data analysis, or tune the models yourself.
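If you want to tune the behaviour yourself, a Modelfile is just a small text file. Here is a minimal sketch; the name coding-assistant, the temperature value and the system prompt are my own example choices:
FROM llama3
PARAMETER temperature 0.2
SYSTEM "You are a senior software engineer. Answer with short, working code examples."
Save it as Modelfile, then build and run it with Ollama:
ollama create coding-assistant -f Modelfile
ollama run coding-assistant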
I would recommend running Ollama in a Docker container so you can easily manage and share your Ollama files, configured for your specific use case, with your colleagues.
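A minimal sketch of that setup, based on the Docker commands from the Ollama and Open WebUI documentation at the time of writing (adjust ports and volume names to your environment, and note that inside Docker on macOS Ollama runs on the CPU only, since the container cannot access the Apple silicon GPU):
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:main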
Model/RAM Recommendation
Also, actual usage depends on your GPU and its VRAM, so this is just an estimate (after quantization). You can use the formula below to determine which model fits your machine.
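A commonly used approximation, and the one that produces the 42GB figure below, is:
Memory (GB) ≈ (P × 4 bytes) / (32 / Q) × 1.2
where P is the number of parameters in billions, Q is the quantization bit width (e.g. 4 for 4-bit quantization), and the factor of 1.2 adds roughly 20% overhead. For a 4-bit quantized 70B model this gives (70 × 4) / (32 / 4) × 1.2 = 42GB.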
For example, based on the formula above, running the 70B parameter Llama 3 model requires approximately 42GB of RAM, which aligns with my testing and the average RAM usage I saw while running the model.