How to run Llama 2
Llama 2 models are trained on 2 trillion tokens and have double the context length of Llama 1. Llama Chat models have additionally been trained on over 1 million new human annotations.
Here is a step-by-step guide for running Meta's Llama 2 model locally:
1. Install Prerequisites
As LLaMA2 is built on PyTorch, you will need to install:
- Python 3.8+
- PyTorch 1.11+
- A compatible version of CUDA and GPU driver for GPU acceleration
- Other Python packages like transformers, tqdm, etc.
It's best to create a new Conda or Python virtualenv environment to install LLaMA2's prerequisites cleanly.
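Once the environment is set up, a quick sanity check helps confirm the prerequisites are in place before moving on. A minimal sketch (the exact versions you need depend on your hardware and driver setup):

```python
# Quick environment check: verify PyTorch, transformers, and GPU availability.
import torch
import transformers

print("PyTorch version:", torch.__version__)
print("Transformers version:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```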
2. Install LLaMA2
LLaMA2's code is open sourced on GitHub. You can install the package directly via:
pip install git+https://github.com/ollama/LLaMA.git
This will install LLaMA2 and all its Python package dependencies.
3. Download Model Checkpoints
The pretrained LLaMA2 model weights are available on the Hugging Face Hub.
Download the model checkpoints to your local machine:
wget https://huggingface.co/ollama/LLaMA2-base/resolve/main/LLaMA2-base-2022-11-05.ckpt
This downloads the checkpoint for the smaller, faster "base" version of LLaMA2.
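If you prefer to stay in Python, the huggingface_hub library can fetch the same file. This is a sketch only; the repo_id and filename below simply mirror the wget URL above and are assumptions about where the weights are hosted:

```python
# Download the checkpoint via the huggingface_hub client instead of wget.
# The repo_id and filename mirror the URL above and are assumptions,
# not verified locations.
from huggingface_hub import hf_hub_download

checkpoint_path = hf_hub_download(
    repo_id="ollama/LLaMA2-base",
    filename="LLaMA2-base-2022-11-05.ckpt",
)
print("Checkpoint saved to:", checkpoint_path)
```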
4. Load Model and Generate Text
You can now load LLaMA2 in Python, set sampling parameters, provide a prompt, and generate text locally:
```python
# Load the checkpoint downloaded in step 3 and generate text locally.
from llama import LLaMA

model = LLaMA.from_pretrained("base", pretrained=False)  # skip the built-in weight download
model.load("LLaMA2-base-2022-11-05.ckpt")                # load the local checkpoint instead

output = model.generate(prompt="Once upon a time", max_new_tokens=50)
print(output)
```
The key things to note are:
- Setting `pretrained=False` and manually loading the checkpoint
- Sampling parameters like `max_new_tokens`
- Providing a prompt and printing the generated text
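If the package above doesn't fit your setup, the same steps can also be done with the Hugging Face transformers library already listed in the prerequisites. This is a sketch, not the method described above; it assumes you have been granted access to the gated meta-llama/Llama-2-7b-hf checkpoint on the Hugging Face Hub:

```python
# Alternative sketch using the transformers library.
# "meta-llama/Llama-2-7b-hf" is a gated repository; access must be requested
# on the Hugging Face Hub before the weights can be downloaded.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```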
Running Llama2 models on a Mac via Ollama
Ollama is an easier way to run models than LLM, although it is also more limited. It currently has versions for macOS and Linux. Its creators say support for Windows is "coming soon."
Download Ollama from its website (https://ollama.ai), then:
- Open Terminal
- Run: ollama run llama2
Installation is an elegant point-and-click experience. And although Ollama is a command-line tool, there's just one command with the syntax ollama run model-name. As with LLM, if the model isn't on your system already, it will automatically download.
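Beyond the command line, a running Ollama instance also serves a local HTTP API (by default on port 11434) that you can call from Python. A minimal sketch, assuming the server is running and the llama2 model has already been pulled:

```python
# Query a locally running Ollama server over its HTTP API.
# Assumes the server is listening on the default port 11434 and that
# "llama2" has already been downloaded with `ollama run llama2`.
import json
import urllib.request

payload = {
    "model": "llama2",
    "prompt": "Why is the sky blue?",
    "stream": False,  # return the full response as a single JSON object
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```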
You can see the list of available models at https://ollama.ai/library, which as of this writing included several versions of Llama-based models such as general-purpose Llama 2, Code Llama, CodeUp from DeepSE, fine-tuned for some programming tasks, and medllama2, which has been fine-tuned to answer medical questions.
The Ollama GitHub repo's README includes a helpful list of some model specs and advice that "You should have at least 8GB of RAM to run the 3B models, 16GB to run the 7B models, and 32GB to run the 13B models." On my 16GB RAM Mac, the 7B Code Llama performance was surprisingly snappy. It will answer questions about bash/zsh shell commands as well as programming languages like Python and JavaScript.
Despite being the smallest model in the family, it was pretty good if imperfect at answering an R coding question that tripped up some larger models: "Write R code for a ggplot2 graph where the bars are steel blue color." The code was correct except for two extra closing parentheses in two of the lines of code, which were easy enough to spot in my IDE. I suspect the larger Code Llama could have done better.
Ollama has some additional features, such as LangChain integration and the ability to run with PrivateGPT, which may not be obvious unless you check the GitHub repo's tutorials page.
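The LangChain integration, for example, lets you use a local Ollama model as an LLM object inside a chain. A rough sketch, assuming a LangChain release that ships the Ollama wrapper under this import path (it has moved between versions):

```python
# Use a local Ollama model through LangChain's Ollama wrapper.
# The import path below is an assumption; in newer releases the class
# lives in langchain_community.llms rather than langchain.llms.
from langchain.llms import Ollama

llm = Ollama(model="llama2")  # talks to the local Ollama server
print(llm("Write a one-line summary of what Ollama does."))
```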