How to run Llama 2
Llama 2 models are trained on 2 trillion tokens and have double the context length of Llama 1. Llama Chat models have additionally been trained on over 1 million new human annotations.
Here is a step-by-step guide for running Meta's Llama 2 model locally:
1. Install Prerequisites
As LLaMA2 is built on PyTorch, you will need to install:
- Python 3.8+
- PyTorch 1.11+
- A compatible version of CUDA and GPU driver for GPU acceleration
- Other Python packages like transformers, tqdm, etc.
It's best to create a new Conda or Python virtualenv environment to install LLaMA2's prerequisites cleanly.
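Once the environment is set up, a quick sanity check helps confirm the prerequisites are in place before moving on. A minimal sketch (the exact versions you need depend on your hardware and driver setup):

```python
# Quick environment check: verify PyTorch, transformers, and GPU availability.
import torch
import transformers

print("PyTorch version:", torch.__version__)
print("Transformers version:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```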
2. Install LLaMA2
LLaMA2's code is open sourced on GitHub. You can install the package directly via:
pip install git+https://github.com/ollama/LLaMA.git
This will install LLaMA2 and all its Python package dependencies.
3. Download Model Checkpoints
The pretrained LLaMA2 model weights are available on the Hugging Face Hub.
Download the model checkpoints to your local machine:
wget https://huggingface.co/ollama/LLaMA2-base/resolve/main/LLaMA2-base-2022-11-05.ckpt
This downloads the checkpoint for the smaller, faster "base" version of LLaMA2.
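If you prefer to stay in Python, the huggingface_hub library can fetch the same file. This is a sketch only; the repo_id and filename below simply mirror the wget URL above and are assumptions about where the weights are hosted:

```python
# Download the checkpoint via the huggingface_hub client instead of wget.
# The repo_id and filename mirror the URL above and are assumptions,
# not verified locations.
from huggingface_hub import hf_hub_download

checkpoint_path = hf_hub_download(
    repo_id="ollama/LLaMA2-base",
    filename="LLaMA2-base-2022-11-05.ckpt",
)
print("Checkpoint saved to:", checkpoint_path)
```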
4. Load Model and Generate Text
You can now load LLaMA2 in Python, set sampling parameters, provide a prompt, and generate text locally:
```python
# Load the checkpoint downloaded in step 3 and generate text locally.
from llama import LLaMA

model = LLaMA.from_pretrained("base", pretrained=False)  # skip the built-in weight download
model.load("LLaMA2-base-2022-11-05.ckpt")                # load the local checkpoint instead

output = model.generate(prompt="Once upon a time", max_new_tokens=50)
print(output)
```
The key things to note are:
- Setting `pretrained=False` and manually loading the checkpoint
- Sampling parameters like `max_new_tokens`
- Providing a prompt and printing the generated text
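If the package above doesn't fit your setup, the same steps can also be done with the Hugging Face transformers library already listed in the prerequisites. This is a sketch, not the method described above; it assumes you have been granted access to the gated meta-llama/Llama-2-7b-hf checkpoint on the Hugging Face Hub:

```python
# Alternative sketch using the transformers library.
# "meta-llama/Llama-2-7b-hf" is a gated repository; access must be requested
# on the Hugging Face Hub before the weights can be downloaded.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```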
Running Llama2 models on a Mac via Ollama
Ollama is an easier way to run models than LLM, although it is also more limited. It currently has versions for macOS and Linux. Its creators say support for Windows is "coming soon."
Download Ollama from its website (https://ollama.ai), then:
- Open Terminal
- Run: ollama run llama2
Installation is an elegant point-and-click experience. And although Ollama is a command-line tool, there's just one command with the syntax ollama run model-name. As with LLM, if the model isn't on your system already, it will automatically download.
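Beyond the command line, a running Ollama instance also serves a local HTTP API (by default on port 11434) that you can call from Python. A minimal sketch, assuming the server is running and the llama2 model has already been pulled:

```python
# Query a locally running Ollama server over its HTTP API.
# Assumes the server is listening on the default port 11434 and that
# "llama2" has already been downloaded with `ollama run llama2`.
import json
import urllib.request

payload = {
    "model": "llama2",
    "prompt": "Why is the sky blue?",
    "stream": False,  # return the full response as a single JSON object
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```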
You can see the list of available models at https://ollama.ai/library, which as of this writing included several versions of Llama-based models such as general-purpose Llama 2, Code Llama, CodeUp from DeepSE, fine-tuned for some programming tasks, and medllama2, which has been fine-tuned to answer medical questions.
The Ollama GitHub repo's README includes a helpful list of some model specs and advice that "You should have at least 8GB of RAM to run the 3B models, 16GB to run the 7B models, and 32GB to run the 13B models." On my 16GB RAM Mac, the 7B Code Llama performance was surprisingly snappy. It will answer questions about bash/zsh shell commands as well as programming languages like Python and JavaScript.
Despite being the smallest model in the family, it was pretty good if imperfect at answering an R coding question that tripped up some larger models: "Write R code for a ggplot2 graph where the bars are steel blue color." The code was correct except for two extra closing parentheses in two of the lines of code, which were easy enough to spot in my IDE. I suspect the larger Code Llama could have done better.
Ollama has some additional features, such as LangChain integration and the ability to run with PrivateGPT, which may not be obvious unless you check the GitHub repo's tutorials page.
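The LangChain integration, for example, lets you use a local Ollama model as an LLM object inside a chain. A rough sketch, assuming a LangChain release that ships the Ollama wrapper under this import path (it has moved between versions):

```python
# Use a local Ollama model through LangChain's Ollama wrapper.
# The import path below is an assumption; in newer releases the class
# lives in langchain_community.llms rather than langchain.llms.
from langchain.llms import Ollama

llm = Ollama(model="llama2")  # talks to the local Ollama server
print(llm("Write a one-line summary of what Ollama does."))
```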