Deploying Large Language Models Locally on Windows

程小虎

Applies to:

Windows 11
AI Agent learning
RAG knowledge-base projects
LangChain / LangGraph
FastAPI
ChromaDB

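I. Install Ollama

Official download page:

https://ollama.com/download

Download the Windows installer and run it.

After installation, open PowerShell and check the version:

ollama --version

Check that the service is running (call curl.exe explicitly, because in Windows PowerShell plain curl is an alias for Invoke-WebRequest):

curl.exe http://localhost:11434

If the response is:

Ollama is running

the installation succeeded.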
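
The same check can be scripted, which is handy before starting a RAG service that depends on Ollama. The following is a minimal sketch using only the standard library; the function names `looks_like_ollama` and `is_ollama_running` are illustrative, not part of Ollama.

```python
import http.client
import urllib.request


def looks_like_ollama(body: str) -> bool:
    """Return True if the body is the plain-text banner shown above."""
    return body.strip() == "Ollama is running"


def is_ollama_running(base_url: str = "http://localhost:11434",
                      timeout: float = 2.0) -> bool:
    """Probe the root URL; any connection problem counts as 'not running'."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            text = resp.read().decode("utf-8", errors="replace")
            return looks_like_ollama(text)
    except (OSError, ValueError, http.client.HTTPException):
        # Refused connections, timeouts, and HTTP errors all land here.
        return False


# Usage (hypothetical): if not is_ollama_running(): print("Start Ollama first")
```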

II. Check Ollama Status

List installed models:

ollama list

List models currently loaded in memory:

ollama ps
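
The two commands above have HTTP counterparts, `/api/tags` and `/api/ps`, which are easier to use from a program. A minimal sketch, assuming the response is a JSON object with a `models` list whose entries carry a `name` field; the helper names are illustrative.

```python
import json
import urllib.request


def model_names(tags_response: dict) -> list[str]:
    """Extract model names from a /api/tags or /api/ps style response."""
    return [m["name"] for m in tags_response.get("models", [])]


def fetch_json(url: str, timeout: float = 5.0) -> dict:
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


# Usage (needs a running Ollama server):
# print(model_names(fetch_json("http://localhost:11434/api/tags")))  # installed
# print(model_names(fetch_json("http://localhost:11434/api/ps")))    # loaded
```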

III. Install ModelScope (魔搭)

pip install modelscope

Verify the installation:

modelscope --help

IV. Download the BGE-M3 Model

Open https://www.modelscope.cn and search for:

bge-m3 gguf

Recommended repository:

ollmOne/bge-m3-GGUF

Download it:

modelscope download --model ollmOne/bge-m3-GGUF --local_dir D:\大模型\
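
After the download it helps to confirm which GGUF files actually arrived, since the exact file name is needed for the import step. A small sketch; `find_gguf_files` is an illustrative helper, and the directory you pass in is whatever you used for `--local_dir`.

```python
from pathlib import Path


def find_gguf_files(download_dir: str) -> list[Path]:
    """Return every .gguf file under download_dir, largest first."""
    files = [p for p in Path(download_dir).rglob("*.gguf") if p.is_file()]
    return sorted(files, key=lambda p: p.stat().st_size, reverse=True)


# Usage (path taken from the download command above):
# for path in find_gguf_files(r"D:\大模型"):
#     print(path.name, path.stat().st_size)
```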

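V. Import into Ollama

In the directory that holds the downloaded GGUF file, create a file named Modelfile with this content (the name after FROM must match the file you downloaded):

FROM ./bge-m3-q8_0.gguf

Then, from the same directory, run:

ollama create bge-m3-local -f Modelfile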
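
If you import several GGUF files, writing the Modelfile by hand gets repetitive. The sketch below generates it next to the GGUF file and builds the matching `ollama create` argument list; both helper names are illustrative, and the command is only constructed here, not executed.

```python
from pathlib import Path


def write_modelfile(gguf_path: str) -> Path:
    """Write a one-line Modelfile next to the GGUF file and return its path."""
    gguf = Path(gguf_path)
    modelfile = gguf.parent / "Modelfile"
    # Use a relative FROM path, as in the hand-written Modelfile above.
    modelfile.write_text(f"FROM ./{gguf.name}\n", encoding="utf-8")
    return modelfile


def create_command(model_name: str, modelfile: Path) -> list[str]:
    """Build the argument list for `ollama create`, ready for subprocess.run."""
    return ["ollama", "create", model_name, "-f", str(modelfile)]


# Usage (run from the GGUF directory so the relative path resolves):
# import subprocess
# mf = write_modelfile("bge-m3-q8_0.gguf")
# subprocess.run(create_command("bge-m3-local", mf), check=True)
```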

VI. Verify the Import

ollama list

If the output includes:

bge-m3-local

the import succeeded.

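VII. Test Embedding

Run in PowerShell:

chcp 65001 > $null; [Console]::InputEncoding = [System.Text.Encoding]::UTF8; [Console]::OutputEncoding = [System.Text.Encoding]::UTF8; $OutputEncoding = [System.Text.Encoding]::UTF8
# The line above switches the console to UTF-8; now build the JSON request body.
$body = @{
    model = "bge-m3-local"
    input = "设备维修工单"   # sample text: "equipment maintenance work order"
} | ConvertTo-Json

# Send the body as UTF-8 bytes so non-ASCII text is not mangled by Windows PowerShell.
Invoke-RestMethod `
    -Uri "http://localhost:11434/api/embed" `
    -Method Post `
    -ContentType "application/json; charset=utf-8" `
    -Body ([System.Text.Encoding]::UTF8.GetBytes($body))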

VIII. API Endpoints

Native endpoint

http://localhost:11434/api/embed

Request body:

{
  "model":"bge-m3-local",
  "input":"设备维修工单"
}

OpenAI-compatible endpoint

http://localhost:11434/v1/embeddings

API key (required by OpenAI clients, ignored by Ollama):

ollama
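
The two endpoints return the vectors under different keys: the native one under `embeddings`, the OpenAI-compatible one under `data[i].embedding`. A small normalizing helper lets the rest of a project ignore that difference. This is a sketch; `extract_vectors` is an illustrative name and the sample dictionaries in the usage comment are made-up data.

```python
def extract_vectors(response: dict) -> list[list[float]]:
    """Return the embedding vectors from either response shape."""
    if "embeddings" in response:
        # Native /api/embed: {"embeddings": [[...], ...]}
        return response["embeddings"]
    if "data" in response:
        # OpenAI-compatible /v1/embeddings: {"data": [{"embedding": [...]}, ...]}
        return [item["embedding"] for item in response["data"]]
    raise ValueError("unrecognized embedding response")


# Usage with made-up data:
# extract_vectors({"embeddings": [[0.1, 0.2]]})
# extract_vectors({"data": [{"embedding": [0.1, 0.2]}]})
```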

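IX. Calling from Python (requests)

import requests

url = "http://localhost:11434/api/embed"

payload = {
    "model": "bge-m3-local",
    "input": "设备维修工单"  # sample text: "equipment maintenance work order"
}

response = requests.post(url, json=payload, timeout=60)
response.raise_for_status()  # fail loudly on HTTP errors such as an unknown model

# /api/embed returns one vector per input under "embeddings"
embedding = response.json()["embeddings"][0]

print(len(embedding))  # dimension of the embedding vector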
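
An embedding becomes useful once two vectors are compared. Cosine similarity is the usual measure for retrieval; the sketch below implements it in plain Python so it runs without NumPy. The function name is illustrative.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)
```

To compare two texts, request both embeddings and pass the two vectors to `cosine_similarity`; values near 1 indicate similar meaning.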

X. Calling from Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"
)

response = client.embeddings.create(
    model="bge-m3-local",
    input="设备维修工单"
)

print(len(response.data[0].embedding))

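XI. Pulling and Running Models Online

Besides importing local GGUF files, Ollama can pull models directly from its official library.

Official model library:

https://ollama.com/library

In the library you can search by model name, for example:

qwen3
qwen2.5
llama3.1
deepseek-r1
bge-m3
nomic-embed-text

If a model has a page in the library, it can generally be downloaded with ollama pull.

1. Pull a model

Example: pull qwen3:8b:

ollama pull qwen3:8b

Example: pull deepseek-r1:7b:

ollama pull deepseek-r1:7b

Example: pull the embedding model bge-m3:

ollama pull bge-m3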
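
Model references have the form `name:tag`, and a reference without a tag, such as `bge-m3`, resolves to the `latest` tag. The sketch below splits a reference the same way, which is useful when comparing a configured model name against the names reported by `ollama list`. The helper name is illustrative.

```python
def split_model_ref(ref: str) -> tuple[str, str]:
    """Split 'name:tag' into (name, tag); a missing tag means 'latest'."""
    # Only look for ':' in the last path segment, so a namespace prefix
    # such as 'library/qwen3:8b' keeps its slash-separated part intact.
    head, slash, last = ref.rpartition("/")
    name, sep, tag = last.partition(":")
    return head + slash + name, (tag if sep and tag else "latest")


# Usage:
# split_model_ref("qwen3:8b")  gives ("qwen3", "8b")
# split_model_ref("bge-m3")    gives ("bge-m3", "latest")
```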

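2. Run a chat model

Once the pull finishes, run the model directly:

ollama run qwen3:8b

At the interactive prompt, type a question:

Explain what RAG is in three sentences.

To leave the interactive session:

/bye

3. Pull and run in one step

If the model is not available locally, ollama run downloads it first and then starts it:

ollama run qwen3:8b

This is simpler, but the first run has to wait for the download to finish.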

4. Show model details

ollama show qwen3:8b

5. List downloaded models

ollama list

6. List running models

ollama ps

7. Stop a model

ollama stop qwen3:8b

8. Delete a model

ollama rm qwen3:8b

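9. Chat models vs. embedding models

Chat models are used with ollama run, for example:

ollama run qwen3:8b

Embedding models are called through the /api/embed endpoint, for example:

$body = '{"model":"bge-m3","input":"设备维修工单"}'
Invoke-RestMethod -Uri "http://localhost:11434/api/embed" -Method Post -ContentType "application/json; charset=utf-8" -Body ([System.Text.Encoding]::UTF8.GetBytes($body))

Do not use ollama run bge-m3 as if it were a chat model.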

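10. Calling the chat API

No ollama run is needed: the API server starts automatically while the Ollama app is running, and models are loaded on demand.

Endpoints

Native Ollama endpoint:

http://localhost:11434/api/chat

OpenAI-compatible endpoint:

http://localhost:11434/v1/chat/completions

API key (required by OpenAI clients, ignored by Ollama):

ollama

PowerShell (native endpoint)

# /api/chat streams by default; "stream": false returns a single JSON object.
$body = '{"model":"qwen3:8b","messages":[{"role":"user","content":"你好"}],"stream":false}'
Invoke-RestMethod -Uri "http://localhost:11434/api/chat" -Method Post -ContentType "application/json; charset=utf-8" -Body ([System.Text.Encoding]::UTF8.GetBytes($body))

PowerShell (OpenAI-compatible endpoint)

$body = '{"model":"qwen3:8b","messages":[{"role":"user","content":"你好"}]}'
Invoke-RestMethod -Uri "http://localhost:11434/v1/chat/completions" -Method Post -ContentType "application/json; charset=utf-8" -Body ([System.Text.Encoding]::UTF8.GetBytes($body))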

Calling from Python (requests)

import json

import requests

url = "http://localhost:11434/api/chat"

payload = {
    "model": "qwen3:8b",
    "messages": [
        {"role": "user", "content": "你好"}
    ],
    "stream": True
}

response = requests.post(url, json=payload, stream=True)
response.raise_for_status()

# The streamed reply arrives as one JSON object per line.
for line in response.iter_lines():
    if line:
        chunk = json.loads(line.decode("utf-8"))
        content = chunk.get("message", {}).get("content", "")
        if content:
            print(content, end="", flush=True)
print()
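
When the full reply is needed as a string rather than printed piece by piece, the same line-by-line parsing can be wrapped in a function. This sketch assumes each streamed line is a JSON object with an optional `message.content` field, that the final object carries `"done": true`, and that failures arrive as an object with an `error` field; `collect_chat_stream` is an illustrative name.

```python
import json
from typing import Iterable


def collect_chat_stream(lines: Iterable[bytes]) -> str:
    """Join the content pieces of a streamed /api/chat reply."""
    parts = []
    for line in lines:
        if not line:
            continue  # skip empty lines between chunks
        chunk = json.loads(line.decode("utf-8"))
        if "error" in chunk:
            raise RuntimeError(chunk["error"])
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # final chunk reached; ignore anything after it
    return "".join(parts)


# Usage with the streaming request shown earlier:
# text = collect_chat_stream(response.iter_lines())
```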

Calling from Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"
)

response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "你好"}]
)

print(response.choices[0].message.content)
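
The chat endpoints are stateless: each request must carry the whole conversation in `messages`. A multi-turn chat therefore keeps a history list and appends every user message and every model reply to it. The helper below is an illustrative sketch that restricts roles to the three common ones.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}


def add_turn(history: list[dict], role: str, content: str) -> list[dict]:
    """Return a new history list with one more message appended."""
    if role not in ALLOWED_ROLES:
        raise ValueError(f"unsupported role: {role}")
    return history + [{"role": role, "content": content}]


# Usage with the client created earlier:
# history = add_turn([], "system", "You are a concise assistant.")
# history = add_turn(history, "user", "What is RAG?")
# reply = client.chat.completions.create(model="qwen3:8b", messages=history)
# history = add_turn(history, "assistant", reply.choices[0].message.content)
```

Returning a new list instead of mutating the argument keeps earlier snapshots of the conversation intact, which makes retries simpler.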

Streaming output

To stream the reply piece by piece as it is generated, pass stream=True:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"
)

stream = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "你好"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

XII. Common Commands

ollama list
ollama ps
ollama stop bge-m3-local
ollama rm bge-m3-local
ollama run qwen3:8b

XIII. Recommended Stack for Learning AI Agents

Embedding: bge-m3-local
LLM: qwen3:8b
Vector database: ChromaDB
Framework: LangGraph
Service layer: FastAPI

Architecture:

Documents/PDF → Chunk → BGE-M3 → ChromaDB → Retrieval → Qwen3:8B → Answer generation
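
The pipeline above can be sketched end to end in plain Python. This is a simplified illustration, not a reference implementation: an in-memory cosine search stands in for ChromaDB, the vectors are assumed to come from the embedding calls shown earlier, and all function names, the chunk sizes, and the prompt wording are assumptions.

```python
import math


def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that overlap."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += size - overlap
    return chunks


def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query_vec: list[float], chunk_vecs: list[list[float]],
          k: int = 3) -> list[int]:
    """Indices of the k chunk vectors most similar to the query vector."""
    scores = [_cosine(query_vec, v) for v in chunk_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def build_prompt(question: str, contexts: list[str]) -> str:
    """Assemble the retrieved chunks and the question into one prompt."""
    joined = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    return ("Answer using only the context below.\n\n"
            f"Context:\n{joined}\n\nQuestion: {question}")
```

In a real project the chunk vectors would be stored in ChromaDB and the prompt sent to qwen3:8b through one of the chat calls shown earlier.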
