Running LLaMA Locally with Llama.cpp: A Complete Guide
Llama.cpp is a powerful and efficient inference framework for running LLaMA models locally on your machine. Unlike higher-level tools such as Ollama and LM Studio, Llama.cpp is designed to provide high-performance, low-resource inference while offering flexibility for different hardware architectures.

Introduction
Have you ever wanted to run large language models (LLMs) on your own machine without relying on cloud services? Llama.cpp makes this possible! This lightweight yet powerful framework enables high-performance local inference for LLaMA models, giving you full control over execution, performance, and optimization.
In this guide, we’ll walk you through installing Llama.cpp, setting up models, running inference, and interacting with it via Python and HTTP APIs. Whether you’re an AI researcher, developer, or hobbyist, this tutorial will help you get started with local LLMs effortlessly.
Why Choose Llama.cpp?
Before diving into installation, let’s compare Llama.cpp to other solutions:
- Llama.cpp vs. Ollama: While Ollama provides built-in model management with a user-friendly experience, Llama.cpp gives you full control over model execution and hardware acceleration.
- Llama.cpp vs. LM Studio: LM Studio features a GUI, whereas Llama.cpp is designed for CLI and scripting automation, making it ideal for advanced users.
Key Advantages of Llama.cpp:
- Optimized for CPU inference while supporting GPU acceleration.
- Works across Windows, Linux, and macOS.
- Allows fine-tuned control over execution, including server mode and Python integration.
Now, let’s get started with setting up Llama.cpp on your system.

Installation Guide
For detailed build instructions, refer to the official guide: [Llama.cpp Build Instructions]. In the following sections, I will explain the different pre-built binaries you can download from the llama.cpp GitHub repository and how to install them on your machine.
Windows Setup
Choosing the Right Binary
If you’re downloading pre-built binaries from Llama.cpp’s releases page [Link], choose based on your CPU and GPU capabilities:
- AVX (llama-bin-win-avx-x64.zip): For older CPUs with AVX support.
- AVX2 (llama-bin-win-avx2-x64.zip): For Intel Haswell (2013) and later.
- AVX-512 (llama-bin-win-avx512-x64.zip): For Intel Skylake-X and newer.
- CUDA (llama-bin-win-cuda-cu11.7-x64.zip): If using an NVIDIA GPU.
If unsure, start with AVX2 as most modern CPUs support it. For GPUs, ensure your CUDA driver version matches the binary.

For this tutorial, I have CUDA 12.4 installed on my PC, so I downloaded llama-b4676-bin-win-cuda-cu12.4-x64.zip and cudart-llama-bin-win-cu12.4-x64.zip, unzipped them, placed the binaries in a directory, and added that directory to my PATH environment variable.
Linux & macOS Setup
For Linux and macOS, download the appropriate binaries:
- Linux: llama-bin-ubuntu-x64.zip
- macOS (Intel): llama-bin-macos-x64.zip
- macOS (Apple Silicon M1/M2): llama-bin-macos-arm64.zip
After downloading, extract the files and add the directory to your system’s PATH to execute commands globally.
Note that the often-quoted one-line installer below installs Ollama, not llama.cpp, so it is not an alternative to the binaries above:
curl -fsSL https://ollama.com/install.sh | sh
After downloading the right files, unzipping them, and adding the extracted directory to your system's PATH so the executables can run from any location, we are ready to explore the functionality of llama.cpp.
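As a quick sanity check that the binaries are on your PATH, you can print the build information (this assumes the pre-built llama-cli executable from the releases page):
llama-cli --version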
Understanding GGUF, GGML, Hugging Face, and LoRA Formats
What is GGUF?
GGUF is a binary file format designed for running large language models efficiently using Llama.cpp and other frameworks. It is the successor to the older GGML format and improves compatibility and performance by standardizing how model weights and metadata are stored, allowing for efficient inference on different hardware architectures.
What is GGML?
GGML is both the tensor library that powers llama.cpp and the name of the earlier file format used for LLM inference; it supports quantized models, making them more memory-efficient. However, GGUF has largely replaced the GGML file format due to its enhanced features and improved performance.
Converting GGML to GGUF
If you have a GGML model and need to use it with Llama.cpp, you can convert it to GGUF using a conversion script.
Example command:
python convert_llama_ggml_to_gguf.py --input model.ggml --output model.gguf
The convert_llama_ggml_to_gguf.py script lives in the main directory of the llama.cpp GitHub repository.
Hugging Face Format
Hugging Face models are typically stored as PyTorch (.bin) or safetensors (.safetensors) weights. These models can be converted into GGUF format using conversion scripts such as convert_hf_to_gguf.py.
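A rough sketch of the conversion (the output file name and quantization type are illustrative; check python convert_hf_to_gguf.py --help for the exact options in your checkout of the repository):
python convert_hf_to_gguf.py ./path/to/hf-model --outfile model.gguf --outtype q8_0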
LoRA Format
LoRA (Low-Rank Adaptation) is a fine-tuning technique used to efficiently adapt large language models to specific tasks. LoRA adapters store only the fine-tuned weight differences rather than modifying the entire model. To use LoRA with Llama.cpp, you can convert the adapter to GGUF with convert_lora_to_gguf.py (or merge the LoRA weights into the base model before converting the merged model to GGUF).
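As a sketch (the conversion flags below are assumptions and may differ between versions, so verify them with python convert_lora_to_gguf.py --help; the --lora flag for llama-cli is the standard way to apply a converted adapter at runtime):
python convert_lora_to_gguf.py ./path/to/lora-adapter --base ./path/to/base-hf-model --outfile lora-adapter.gguf
llama-cli -m base-model.gguf --lora lora-adapter.gguf -p "Your prompt here"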
Downloading GGUF Model Files from Hugging Face
You can download GGUF model files from Hugging Face and use them with Llama.cpp. Follow these steps:
- Visit the Hugging Face Models Page: Go to Hugging Face and search for LLaMA or any model compatible with GGUF. In this tutorial we will use the Mistral GGUF files downloaded from this link.
- Download the Model: Navigate to the model’s repository and download the GGUF version of the model. If the GGUF format is not available, you may need to convert it manually as explained before.
- Move the File: Place the downloaded or converted GGUF model into your models/ directory.
Run a model
Now we can use the llama-cli command, one of the executables we downloaded earlier. Run llama-cli --help to see all the flags that can be used to run the LLM from a GGUF file.

At the end of the llama-cli help output, there are two example invocations: one for text generation and one for chat.
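They look roughly like the following (substitute the name of the GGUF file you downloaded; the prompt and token count are just placeholders):
llama-cli -m mistral-7b-instruct-v0.2.Q2_K.gguf -p "I believe the meaning of life is" -n 128
llama-cli -m mistral-7b-instruct-v0.2.Q2_K.gguf -cnv -p "You are a helpful assistant"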

Interacting with Llama.cpp in Python
Overview of llama-cpp-python
The llama-cpp-python package provides Python bindings for Llama.cpp, allowing users to:
- Load and run LLaMA models within Python applications.
- Perform text generation tasks using GGUF models.
- Customize inference parameters like temperature, top-k, and top-p for more controlled responses.
- Run models efficiently on both CPU and GPU (if CUDA is enabled).
- Host models as an API server for easy integration into applications.
Installing Required Packages
You can use llama-cpp-python, which provides Python bindings for llama.cpp:
pip install llama-cpp-python
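If you want GPU offloading, llama-cpp-python also needs to be built with CUDA support. At the time of writing this is done by passing CMake flags during installation (check the llama-cpp-python README for the exact flags for your version):
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python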
Running Inference in Python
Now we can load the GGUF model file we downloaded above in Python using the llama_cpp package and call the chat completion function:
from llama_cpp import Llama

llm = Llama(model_path="mistral-7b-instruct-v0.2.Q2_K.gguf")

response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "how big is the sky"}
    ]
)

print(response)
The response will be something like
```plaintext
{
'id': 'chatcmpl-e8879677-7335-464a-803b-30a15d68c015',
'object': 'chat.completion',
'created': 1739218403,
'model': 'mistral-7b-instruct-v0.2.Q2_K.gguf',
'choices': [
{
'index': 0,
'message':
{
'role': 'assistant',
'content': ' The size of the sky is not something that can be measured in a way that
is meaningful to us, as it is not a physical object with defined dimensions.
The sky is the expanse above the Earth, and it includes the atmosphere and the outer
space beyond. It goes on forever in all directions, as far as our current understanding
of the universe extends. So, we cannot assign a specific size to the sky.
Instead, we can describe the size of specific parts of the universe, such as the diameter
of a star or the distance between two galaxies.'
},
'logprobs': None,
'finish_reason': 'stop'
}
],
'usage': {
'prompt_tokens': 13,
'completion_tokens': 112,
'total_tokens': 125
}
}
```
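To get just the generated text, index into the response dictionary, and you can pass the sampling parameters mentioned earlier directly to create_chat_completion. A minimal sketch, assuming the same GGUF file as above (the parameter values are illustrative):

from llama_cpp import Llama

llm = Llama(
    model_path="mistral-7b-instruct-v0.2.Q2_K.gguf",
    n_ctx=2048,       # context window size
    n_gpu_layers=-1,  # offload all layers to the GPU if built with CUDA support
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "how big is the sky"}],
    temperature=0.7,  # sampling temperature
    top_k=40,         # top-k sampling
    top_p=0.9,        # nucleus (top-p) sampling
    max_tokens=128,   # cap on the number of generated tokens
)

# The reply text lives under choices[0]["message"]["content"]
print(response["choices"][0]["message"]["content"])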
Downloading and Using GGUF Models with Llama.from_pretrained
The Llama.from_pretrained method allows users to directly download GGUF models from Hugging Face and use them without manually downloading the files.
Example:
from llama_cpp import Llama

# Download and load a GGUF model directly from Hugging Face
llm = Llama.from_pretrained(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf"
)

response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "How does a black hole work?"}
    ]
)

print(response)
This method simplifies the process by automatically downloading the required model and loading it into memory, eliminating the need to manually place GGUF files in a directory and load them from there. The response has the same structure as the one shown in the previous example.
- You can use the cache_dir parameter to specify the directory where the model will be downloaded and cached.
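For example (the directory name is arbitrary):

from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    cache_dir="./models",  # download and cache the GGUF file here
)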
Running Llama.cpp as a Server
You can run llama.cpp as a server and interact with it via API calls.
Start the Server
llama-server -m mistral-7b-instruct-v0.2.Q2_K.gguf
Launching the model as a server in your terminal prints startup logs showing the host, port, and loaded model. Note that by default llama-server listens on port 8080; pass --port 8000 to match the examples below.
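A few other commonly useful server flags (the values here are illustrative; run llama-server --help for the full list):
llama-server -m mistral-7b-instruct-v0.2.Q2_K.gguf --port 8000 --host 0.0.0.0 -c 4096 -ngl 99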

Send Requests Using Python
import requests

# The OpenAI-compatible completions endpoint exposed by llama-server
url = "http://localhost:8000/v1/completions"

# Define the payload (the server answers with the model it was launched with)
payload = {
    "model": "mistral-7b-instruct-v0.2.Q2_K.gguf",
    "prompt": "How big is the sky?",
    "temperature": 0.7,
    "max_tokens": 50
}
headers = {"Content-Type": "application/json"}

try:
    response = requests.post(url, json=payload, headers=headers)
    # Check if the request was successful
    if response.status_code == 200:
        # Parse the response JSON
        response_data = response.json()
        # Extract the generated text from the response
        choices = response_data.get("choices", [])
        if choices:
            result = choices[0].get("text", "")
            print("Response:", result)
        else:
            print("No choices found in the response.")
    else:
        print(f"Request failed with status code {response.status_code}: {response.text}")
except Exception as e:
    print(f"Error occurred: {e}")
The response will be something like
Response:
The sky is not a tangible object and does not have physical dimensions, so it cannot be measured or quantified in the same way that we measure and quantify objects with size or dimensions. The sky is simply the vast expanse of
Send Requests from Terminal (Linux/macOS) or PowerShell (Windows)
The server also exposes its native /completion endpoint, which uses n_predict instead of max_tokens and returns the generated text in a content field:
curl -X POST "http://localhost:8000/completion" \
-H "Content-Type: application/json" \
-d '{"prompt": "Tell me a fun fact.", "n_predict": 50}'
Conclusion
This tutorial covered installing, running, and interacting with Llama.cpp on different platforms. You can now integrate Llama models into your applications for local inference and API-based interactions.