
OpenVINO Integration with Ollama


    Ollama

    Get up and running with large language models.

    macOS

    Download

    Windows

    Download

    Linux

    curl -fsSL https://ollama.com/install.sh | sh

    Manual install instructions

    Docker

    The official Ollama Docker image ollama/ollama is available on Docker Hub.
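
    For reference, a common way to run the official image is shown below; the named volume, port mapping, and container name are the usual defaults rather than requirements, so adjust them to your setup (see the Docker Hub page for GPU-enabled variants):

    docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama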

    Libraries

    Community

    Quickstart

    To run and chat with Llama 3.2:

    ollama run llama3.2

    Model library

    Ollama supports a list of models available on ollama.com/library

    Here are some example models that can be downloaded:

    Model | Parameters | Size | Download
    DeepSeek-R1 | 7B | 4.7GB | ollama run deepseek-r1
    DeepSeek-R1 | 671B | 404GB | ollama run deepseek-r1:671b
    Llama 3.3 | 70B | 43GB | ollama run llama3.3
    Llama 3.2 | 3B | 2.0GB | ollama run llama3.2
    Llama 3.2 | 1B | 1.3GB | ollama run llama3.2:1b
    Llama 3.2 Vision | 11B | 7.9GB | ollama run llama3.2-vision
    Llama 3.2 Vision | 90B | 55GB | ollama run llama3.2-vision:90b
    Llama 3.1 | 8B | 4.7GB | ollama run llama3.1
    Llama 3.1 | 405B | 231GB | ollama run llama3.1:405b
    Phi 4 | 14B | 9.1GB | ollama run phi4
    Phi 3 Mini | 3.8B | 2.3GB | ollama run phi3
    Gemma 2 | 2B | 1.6GB | ollama run gemma2:2b
    Gemma 2 | 9B | 5.5GB | ollama run gemma2
    Gemma 2 | 27B | 16GB | ollama run gemma2:27b
    Mistral | 7B | 4.1GB | ollama run mistral
    Moondream 2 | 1.4B | 829MB | ollama run moondream
    Neural Chat | 7B | 4.1GB | ollama run neural-chat
    Starling | 7B | 4.1GB | ollama run starling-lm
    Code Llama | 7B | 3.8GB | ollama run codellama
    Llama 2 Uncensored | 7B | 3.8GB | ollama run llama2-uncensored
    LLaVA | 7B | 4.5GB | ollama run llava
    Solar | 10.7B | 6.1GB | ollama run solar

    [!NOTE] You should have at least 8 GB of RAM available to run the 7B models, 16 GB to run the 13B models, and 32 GB to run the 33B models.

    Customize a model

    Import from GGUF

    Ollama supports importing GGUF models in the Modelfile:

    1. Create a file named Modelfile with a FROM instruction that points to the local filepath of the model you want to import.

      FROM ./vicuna-33b.Q4_0.gguf
      
    2. Create the model in Ollama

      ollama create example -f Modelfile
    3. Run the model

      ollama run example

    Import from Safetensors

    See the guide on importing models for more information.

    Customize a prompt

    Models from the Ollama library can be customized with a prompt. For example, to customize the llama3.2 model:

    ollama pull llama3.2

    Create a Modelfile:

    FROM llama3.2
    
    # set the temperature to 1 [higher is more creative, lower is more coherent]
    PARAMETER temperature 1
    
    # set the system message
    SYSTEM """
    You are Mario from Super Mario Bros. Answer as Mario, the assistant, only.
    """
    

    Next, create and run the model:

    ollama create mario -f ./Modelfile
    ollama run mario
    >>> hi
    Hello! It's your friend Mario.
    

    For more information on working with a Modelfile, see the Modelfile documentation.

    CLI Reference

    Create a model

    ollama create is used to create a model from a Modelfile.

    ollama create mymodel -f ./Modelfile

    Pull a model

    ollama pull llama3.2

    This command can also be used to update a local model. Only the diff will be pulled.

    Remove a model

    ollama rm llama3.2

    Copy a model

    ollama cp llama3.2 my-model

    Multiline input

    For multiline input, you can wrap text with """:

    >>> """Hello,
    ... world!
    ... """
    I'm a basic program that prints the famous "Hello, world!" message to the console.
    

    Multimodal models

    ollama run llava "What's in this image? /Users/jmorgan/Desktop/smile.png"
    

    Output: The image features a yellow smiley face, which is likely the central focus of the picture.

    Pass the prompt as an argument

    ollama run llama3.2 "Summarize this file: $(cat README.md)"

    Output: Ollama is a lightweight, extensible framework for building and running language models on the local machine. It provides a simple API for creating, running, and managing models, as well as a library of pre-built models that can be easily used in a variety of applications.

    Show model information

    ollama show llama3.2

    List models on your computer

    ollama list

    List which models are currently loaded

    ollama ps

    Stop a model which is currently running

    ollama stop llama3.2

    Start Ollama

    ollama serve is used when you want to start ollama without running the desktop application.

    Building

    See the developer guide

    Running local builds

    Next, start the server:

    ./ollama serve

    Finally, in a separate shell, run a model:

    ./ollama run llama3.2

    REST API

    Ollama has a REST API for running and managing models.

    Generate a response

    curl http://localhost:11434/api/generate -d '{
      "model": "llama3.2",
      "prompt":"Why is the sky blue?"
    }'

    Chat with a model

    curl http://localhost:11434/api/chat -d '{
      "model": "llama3.2",
      "messages": [
        { "role": "user", "content": "why is the sky blue?" }
      ]
    }'
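
    The API streams its response as newline-delimited JSON by default, so a client reads the reply line by line. Below is a minimal Python sketch for consuming the chat stream, using only the standard library and assuming the default localhost:11434 address shown above:

    import json
    import urllib.request

    # Request a chat completion from the local Ollama server.
    payload = json.dumps({
        "model": "llama3.2",
        "messages": [{"role": "user", "content": "why is the sky blue?"}],
    }).encode("utf-8")

    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=payload,
        headers={"Content-Type": "application/json"},
    )

    # Each streamed line is a JSON object; the final one has "done": true.
    with urllib.request.urlopen(req) as resp:
        for line in resp:
            if not line.strip():
                continue
            chunk = json.loads(line)
            print(chunk.get("message", {}).get("content", ""), end="", flush=True)
            if chunk.get("done"):
                break
    print()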

    See the API documentation for all endpoints.

    Community Integrations

    Web & Desktop

    Cloud

    Terminal

    Apple Vision Pro

    Database

    • pgai - PostgreSQL as a vector database (Create and search embeddings from Ollama models using pgvector)
    • MindsDB (Connects Ollama models with nearly 200 data platforms and apps)
    • chromem-go with example
    • Kangaroo (AI-powered SQL client and admin tool for popular databases)

    Package managers

    Libraries

    Mobile

    • Enchanted
    • Maid
    • Ollama App (Modern and easy-to-use multi-platform client for Ollama)
    • ConfiChat (Lightweight, standalone, multi-platform, and privacy focused LLM chat interface with optional encryption)

    Extensions & Plugins

    Supported backends

    • llama.cpp project founded by Georgi Gerganov.

    Observability

    • Lunary is the leading open-source LLM observability platform. It provides a variety of enterprise-grade features such as real-time analytics, prompt templates management, PII masking, and comprehensive agent tracing.
    • OpenLIT is an OpenTelemetry-native tool for monitoring Ollama Applications & GPUs using traces and metrics.
    • HoneyHive is an AI observability and evaluation platform for AI agents. Use HoneyHive to evaluate agent performance, interrogate failures, and monitor quality in production.
    • Langfuse is an open source LLM observability platform that enables teams to collaboratively monitor, evaluate and debug AI applications.
    • MLflow Tracing is an open source LLM observability tool with a convenient API to log and visualize traces, making it easy to debug and evaluate GenAI applications.

    Ollama-ov

    Get started with large language models using the OpenVINO GenAI backend.

    Ollama-OV

    We provide two ways to download the Ollama executable: from Google Drive or from Baidu Drive.

    Google Drive

    Windows

    Download exe + Download OpenVINO GenAI

    Linux (Ubuntu 22.04)

    Download + Download OpenVINO GenAI

    Baidu Drive

    Windows

    Download exe + Download OpenVINO GenAI

    Linux (Ubuntu 22.04)

    Download + Download OpenVINO GenAI

    Docker

    Linux

    We also provide a Dockerfile so developers can quickly build a Docker image: Dockerfile

    docker build -t ollama_openvino:v1 -f Dockerfile_genai .

    Then run the image:

    docker run --rm -it ollama_openvino:v1
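
    If you want to use an Intel GPU for inference inside the container, the host GPU device node typically needs to be passed through. A hedged example, assuming the standard /dev/dri render device and the image tag built above:

    docker run --rm -it --device /dev/dri ollama_openvino:v1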

    Model library

    Native Ollama only supports models in the GGUF format, whereas Ollama-OV invokes OpenVINO GenAI, which requires models in the OpenVINO format. Therefore, we have added support for OpenVINO model files in Ollama. For public LLMs, you can download OpenVINO IR models from HuggingFace or ModelScope:

    Model | Parameters | Size | Compression | Download | Device
    Qwen3-0.6B-int4-ov | 0.6B | 0.4GB | INT4_ASYM_128 ratio 0.8 | ModelScope | CPU, GPU, NPU(base)
    Qwen3-1.7B-int4-ov | 1.7B | 1.2GB | INT4_ASYM_128 ratio 0.8 | ModelScope | CPU, GPU, NPU(base)
    Qwen3-4B-int4-ov | 4B | 2.6GB | INT4_ASYM_128 ratio 0.8 | ModelScope | CPU, GPU, NPU(base)
    DeepSeek-R1-Distill-Qwen-1.5B-int4-ov | 1.5B | 1.4GB | INT4_ASYM_32 | ModelScope | CPU, GPU, NPU(base)
    DeepSeek-R1-Distill-Qwen-1.5B-int4-ov-npu | 1.5B | 1.1GB | INT4_SYM_CW | ModelScope | NPU(best)
    DeepSeek-R1-Distill-Qwen-7B-int4-ov | 7B | 4.3GB | INT4_SYM_128 | ModelScope | CPU, GPU, NPU(base)
    DeepSeek-R1-Distill-Qwen-7B-int4-ov-npu | 7B | 4.1GB | INT4_SYM_CW | ModelScope | NPU(best)
    DeepSeek-R1-Distill-Qwen-14B-int4-ov | 14B | 8.0GB | INT4_SYM_128 | ModelScope | CPU, GPU, NPU(base)
    DeepSeek-R1-Distill-llama-8B-int4-ov | 8B | 4.5GB | INT4_SYM_128 | ModelScope | CPU, GPU, NPU(base)
    DeepSeek-R1-Distill-llama-8B-int4-ov-npu | 8B | 4.2GB | INT4_SYM_CW | ModelScope | NPU(best)
    llama-3.2-1b-instruct-int4-ov | 1B | 0.8GB | INT4_SYM_128 | ModelScope | CPU, GPU, NPU(base)
    llama-3.2-3b-instruct-int4-ov | 3B | 1.9GB | INT4_SYM_128 | ModelScope | CPU, GPU, NPU(base)
    llama-3.2-3b-instruct-int4-ov-npu | 3B | 1.8GB | INT4_SYM_CW | ModelScope | NPU(best)
    Phi-3.5-mini-instruct-int4-ov | 3.8B | 2.1GB | INT4_ASYM | HF, ModelScope | CPU, GPU
    Phi-3-mini-128k-instruct-int4-ov | 3.8B | 2.5GB | INT4_ASYM | HF, ModelScope | CPU, GPU
    Phi-3-mini-4k-instruct-int4-ov | 3.8B | 2.2GB | INT4_ASYM | HF, ModelScope | CPU, GPU
    Phi-3-medium-4k-instruct-int4-ov | 14B | 7.4GB | INT4_ASYM | HF, ModelScope | CPU, GPU
    Qwen2.5-0.5B-Instruct-openvino-ovms-int4 | 0.5B | 0.3GB | INT4_SYM_128 | ModelScope | CPU, GPU, NPU(base)
    Qwen2.5-1.5B-Instruct-int4-ov | 1.5B | 0.9GB | INT4_SYM_128 | ModelScope | CPU, GPU, NPU(base)
    Qwen2.5-3B-Instruct-gptq-ov | 3B | 2.7GB | INT4_GPTQ | ModelScope | CPU, GPU
    Qwen2.5-7B-Instruct-int4-ov | 7B | 4.3GB | INT4_ASYM | ModelScope | CPU, GPU
    minicpm-1b-sft-int4-ov | 1B | 0.7GB | INT4_SYM | ModelScope | CPU, GPU, NPU(base)
    gemma-2-9b-it-int4-ov | 9B | 5.3GB | INT4_ASYM | HF, ModelScope | CPU, GPU
    gemma-3-1b-it-int4-ov | 1B | 0.7GB | INT4_SYM_128 | ModelScope | CPU, GPU
    TinyLlama-1.1B-Chat-v1.0-int4-ov | 1.1B | 0.6GB | INT4_ASYM | HF, ModelScope | CPU, GPU
    • INT4_SYM_128: INT4 symmetric compression with NNCF, group size 128, all linear layers compressed. Similar to Q4_0 compression.
    • INT4_SYM_CW: INT4 symmetric compression with NNCF, channel-wise, for best NPU performance.
    • INT4_ASYM: INT4 asymmetric compression with NNCF; better accuracy than symmetric, but the NPU does not support asymmetric compression.
    • INT4_GPTQ: INT4 GPTQ compression with NNCF, aligned with HuggingFace.

    The links above are only examples for a subset of models; for other LLMs, check the OpenVINO GenAI model support list. If you have a customized LLM, follow the model conversion steps of GenAI.
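
    For reference, conversion to OpenVINO IR with INT4 weight compression typically goes through the optimum-intel exporter. The sketch below is an assumption based on a recent optimum-intel release (the model ID, weight format, and group size are illustrative); consult the GenAI conversion guide for the options that match your model:

    pip install optimum[openvino]
    optimum-cli export openvino --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --weight-format int4 --group-size 128 DeepSeek-R1-Distill-Qwen-7B-int4-ov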

    Ollama Model File

    We added two new parameters to the Modelfile in addition to the original parameters:

    Parameter | Description | Value Type | Example Usage
    stop_id | Sets the stop ids to use | int | stop_id 151643
    max_new_token | The maximum number of tokens generated by GenAI (default: 2048) | int | max_new_token 4096
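
    Putting these together, a Modelfile for an OpenVINO model might look like the sketch below. The authoritative syntax for the ModelType and InferDevice instructions is the example Modelfile shipped in this repository's example dir (see "Import from OpenVINO IR" below); the FROM path, device, and parameter values here are illustrative assumptions:

      FROM ./DeepSeek-R1-Distill-Qwen-7B-int4-ov.tar.gz
      ModelType "OpenVINO"
      InferDevice "GPU"
      PARAMETER stop_id 151643
      PARAMETER max_new_token 4096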

    Quick start

    Start Ollama

    1. First, set the GODEBUG=cgocheck=0 environment variable:

      Linux

      export GODEBUG=cgocheck=0

      Windows

      set GODEBUG=cgocheck=0
    2. Next, start the server with ollama serve (you must use the Ollama binary built from ollama_ov to start the server):

      ollama serve

    Import from OpenVINO IR

    How to create an Ollama model based on OpenVINO IR

    Example

    Let's take deepseek-ai/DeepSeek-R1-Distill-Qwen-7B as an example.

    1. Download the OpenVINO model

      1. Download from ModelScope: DeepSeek-R1-Distill-Qwen-7B-int4-ov

        pip install modelscope
        modelscope download --model zhaohb/DeepSeek-R1-Distill-Qwen-7B-int4-ov --local_dir ./DeepSeek-R1-Distill-Qwen-7B-int4-ov
      2. If the OpenVINO model is available on HF, you can also download it from there. Here we take TinyLlama-1.1B-Chat-v1.0-int4-ov as an example of how to download a model from HF.

        If your network access to HuggingFace is unstable, you can try using a mirror endpoint to pull the model.

        set HF_ENDPOINT=https://hf-mirror.com
        pip install -U huggingface_hub
        huggingface-cli download --resume-download OpenVINO/TinyLlama-1.1B-Chat-v1.0-int4-ov  --local-dir  TinyLlama-1.1B-Chat-v1.0-int4-ov --local-dir-use-symlinks False
        
    2. Package OpenVINO IR into a tar.gz file

      tar -zcvf DeepSeek-R1-Distill-Qwen-7B-int4-ov.tar.gz DeepSeek-R1-Distill-Qwen-7B-int4-ov
    3. Create a file named Modelfile with a FROM instruction that points to the local filepath of the model you want to import. For convenience, we have put the Modelfile for the DeepSeek-R1-Distill-Qwen-7B-int4-ov model under the example dir (Modelfile for DeepSeek), which you can use directly.

      Note:

      1. The ModelType "OpenVINO" parameter is mandatory and must be explicitly set.
      2. The InferDevice parameter is optional. If it is not specified, the system prioritizes the GPU by default and automatically falls back to the CPU if no GPU is available. If InferDevice is explicitly set, the system strictly uses the specified device; if that device is unavailable, it follows the same fallback strategy as when InferDevice is not set (GPU first, then CPU).
      3. For more information on working with a Modelfile, see the Modelfile documentation.
    4. Unzip the OpenVINO GenAI package and set up the environment

      cd openvino_genai_windows_2025.2.0.0.dev20250513_x86_64
      setupvars.bat
    5. Create the model in Ollama

      ollama create DeepSeek-R1-Distill-Qwen-7B-int4-ov:v1 -f Modelfile

      You will see output similar to the following:

         gathering model components 
         copying file sha256:77acf6474e4cafb67e57fe899264e9ca39a215ad7bb8f5e6b877dcfa0fabf919 100% 
         using existing layer sha256:77acf6474e4cafb67e57fe899264e9ca39a215ad7bb8f5e6b877dcfa0fabf919 
         creating new layer sha256:9b345e4ef9f87ebc77c918a4a0cee4a83e8ea78049c0ee2dc1ddd2a337cf7179 
         creating new layer sha256:ea49523d744c40bc900b4462c43132d1c8a8de5216fa8436cc0e8b3e89dddbe3 
         creating new layer sha256:b6bf5bcca7a15f0a06e22dcf5f41c6c0925caaab85ec837067ea98b843bf917a 
         writing manifest 
         success 
    6. Run the model

      ollama run DeepSeek-R1-Distill-Qwen-7B-int4-ov:v1

    CLI Reference

    Show model information

    ollama show DeepSeek-R1-Distill-Qwen-7B-int4-ov:v1 

    List models on your computer

    ollama list

    List which models are currently loaded

    ollama ps

    Stop a model which is currently running

    ollama stop DeepSeek-R1-Distill-Qwen-7B-int4-ov:v1 

    Building from source

    Install prerequisites:

    • Go
    • C/C++ Compiler e.g. Clang on macOS, TDM-GCC (Windows amd64) or llvm-mingw (Windows arm64), GCC/Clang on Linux.

    Then build and run Ollama from the root directory of the repository:

    Windows

    1. Clone the repo:

      git lfs install
      git lfs clone https://github.com/zhaohb/ollama_ov.git
    2. Enable CGO:

      go env -w CGO_ENABLED=1
    3. Initialize the GenAI environment

      Download the GenAI runtime from GenAI, then extract it to the directory openvino_genai_windows_2025.2.0.0.dev20250513_x86_64.

      cd openvino_genai_windows_2025.2.0.0.dev20250513_x86_64
      setupvars.bat
    4. Set the cgo environment variables:

      set CGO_LDFLAGS=-L%OpenVINO_DIR%\..\lib\intel64\Release
      set CGO_CFLAGS=-I%OpenVINO_DIR%\..\include 
    5. Build Ollama:

      go build -o ollama.exe
    6. If you don't want to recompile Ollama, you can use the prebuilt executable directly: initialize the GenAI environment as in step 3, then run ollama.exe.

      However, if you encounter the following error when executing ollama.exe, it is recommended that you recompile from source:

      This version of ollama.exe is not compatible with the version of Windows you're running. Check your computer's system information and then contact the software publisher.

    Linux

    1. Clone the repo:

      git lfs install
      git lfs clone https://github.com/zhaohb/ollama_ov.git
    2. Enable CGO:

      go env -w CGO_ENABLED=1
    3. Initialize the GenAI environment

      Download the GenAI runtime from GenAI, then extract it to the directory openvino_genai_ubuntu22_2025.2.0.0.dev20250513_x86_64.

      cd openvino_genai_ubuntu22_2025.2.0.0.dev20250513_x86_64
      source setupvars.sh
    4. Set the cgo environment variables:

      export CGO_LDFLAGS=-L$OpenVINO_DIR/../lib/intel64/
      export CGO_CFLAGS=-I$OpenVINO_DIR/../include
    5. Build Ollama:

      go build -o ollama
    6. If you don't want to recompile Ollama, you can use the prebuilt executable directly: initialize the GenAI environment as in step 3, then run ollama.

      If you encounter problems during use, it is recommended to rebuild from source.

    Running local builds

    1. First, set the GODEBUG=cgocheck=0 environment variable:

      Linux

      export GODEBUG=cgocheck=0

      Windows

      set GODEBUG=cgocheck=0
    2. Next, start the server:

      ollama serve
    3. Finally, in a separate shell, run a model:

      ollama run DeepSeek-R1-Distill-Qwen-7B-int4-ov:v1 

    Community Integrations

    Terminal

    Web & Desktop

    Future Development Plan

    Here are some features and improvements planned for future releases:

    1. Multimodal models: Support for multimodal models that can process both text and image data.

Attention:

This repository will no longer be maintained. For the latest code, please refer to: ollama_openvino
