AIMA is AI Infrastructure, managed by AI — a single open-source Go binary that detects hardware accelerators, picks an inference engine (vLLM / SGLang / llama.cpp) from a YAML knowledge base, deploys the model, benchmarks it, and writes the winning config back. The loop is driven by a built-in PDCA agent. Apache 2.0, no invite required.

Which hardware does AIMA support?

AIMA has been validated on real hardware: NVIDIA GPU, AMD GPU, Huawei Ascend, Hygon DCU, Moore Threads, MetaX, Apple Silicon, and CPU-only environments.

How is AIMA different from Ollama?

Ollama trades performance for simplicity; bare vLLM trades ease for performance; AIMA replaces the human operator with an agent — Ollama-level TCO, SOTA inference throughput.

Is AIMA an MCP server?

Yes. AIMA is an MCP server out of the box on port 6188. Any MCP-compatible runtime can drive the full AIMA control plane: hardware detection, model deployment, benchmarking, fleet discovery, and knowledge sync.

Open source · Apache 2.0 · MCP-native

AI Infrastructure, managed by AI

One-click deploy. Peak performance.

AIMA detects your hardware, picks the fastest engine, and deploys — a built-in agent keeps tuning, so every run gets faster.

Install AIMA

View on GitHub

AIMA Deploydetect · query KB · optimal config · deploy · ready

live

1✓

Detect hardwareAMD Radeon 8060S

2✓

Query knowledge base23 on-silicon records matched

3✓

Optimal configvLLM · FP842.3 tok/s

4✓

Deploy modelQwen3.6-35B

5✓

API readylocalhost:6188

Scroll

What is AIMA?

AIMA is a single Go binary that runs AI models on your own computer or server. It auto-detects your hardware, picks the best engine and config, deploys the model, and self-tunes — no manual parameter tuning needed.

Ease or performance? AIMA says both.

	Simple stacks (one binary, one engine)	High-performance stacks (raw engines, you tune them)	AIMA
One-line install	✓	—	✓
OpenAI-compatible API	✓	✓	✓
Inference backend	llama.cpp	vLLM / SGLang	vLLM · SGLang · llama.cpp (auto-picked per hardware)
SOTA throughput on discrete GPUs	—	DIY	✓
Multi-vendor silicon	—	DIY	NVIDIA · AMD · Ascend · Hygon DCU · Moore Threads · MetaX · Apple
MCP server out of the box	—	—	✓
Self-tuning loop	—	—	✓
LAN fleet / multi-node	—	DIY	✓

Simple stacks trade performance for low cost. High-performance stacks trade ease for raw throughput. AIMA makes the operator an agent — ease and performance are no longer a tradeoff; "what runs fastest on this machine" accumulates in the knowledge base, not in any engineer's head.

Agent in the loop

A built-in PDCA agent (codename Explorer) keeps cycling: plan a benchmark → deploy a config → sample throughput / TTFT → promote the winner to a shared knowledge base. When a new chip arrives, the agent runs the tuning matrix itself — the more it accumulates, the faster and more precise the next deployment.

Plan

Decide the next benchmark: model, quantization, parallelism

Deploy

Deploy the engine and model with the selected config

Benchmark

Sample throughput / TTFT, compare candidate configs

Learn

Winner goes to the knowledge base — next run is faster

↻ repeats — back to step 1

Validated on silicon, not on slides

NVIDIA, AMD, Huawei Ascend, Hygon DCU, Moore Threads, MetaX, Apple Silicon — all benchmarked on real hardware. CPU-only works too.

NVIDIAAMDHuawei AscendHygon DCUMoore ThreadsMetaXApple SiliconCPU-only

Full benchmark data on the Benchmarks page.

See the benchmarks →

MCP-native

AIMA is an MCP server. Point any MCP-compatible runtime at its port and you get the full operational surface: hardware detection, model scan, engine selection, deployment, benchmark, fleet discovery, knowledge sync — no REST wrapper to write, no official SDK to wait for.

Runs in production as OpenClaw's inference backend — covering LLM, ASR, TTS, image generation, and VLM. Any other MCP-speaking runtime plugs in the same way.

Point any MCP client at AIMA's HTTP endpoint — that's the integration.

mcp config

{
  "mcpServers": {
    "aima": { "type": "http", "url": "http://<aima-host>:6188/mcp" }
  }
}

AIMA Cloud · Works out of the box · 10 free runs included

Hand your device fleet to a cloud AI ops agent

Connect devices to the cloud and let an AI agent install, diagnose, repair, and upgrade them remotely — including one-command install of Dify / ComfyUI / Open WebUI / OpenClaw. Built-in invite code, works out of the box; 10 free service runs included.

Explore AIMA Cloud →See the one-command installs →

Quick start

One command installs the binary — the agent takes it from there.

Get the binary

One command installs AIMA. The China site uses official-site-hosted binaries; the global site keeps GitHub.

# macOS / Linux

Terminal · AIMA

curl -fsSL https://raw.githubusercontent.com/Approaching-AI/AIMA/master/install.sh | sh

# Windows (PowerShell)

PowerShell

irm https://raw.githubusercontent.com/Approaching-AI/AIMA/master/install.ps1 | iex

GitHub Releases remain available as a fallback, or build from source: git clone ... && make build Downloads / Releases →

See what AIMA found / run onboarding

Prints the detected GPU/NPU, driver versions, RAM. Then run onboarding checks (read-only).

Terminal · AIMA

# see what AIMA detected

aima hal detect

# run onboarding checks (read-only — no services installed, no model deployed)

aima onboarding

Run a model / open the API

Resolves the model, picks the engine + config for this host, pulls missing assets, deploys, waits for readiness.

Terminal · AIMA

# resolve model, pick engine, pull assets, deploy, wait for readiness

aima run qwen3-4b

# keep the OpenAI-compatible API and Web UI open in this terminal (http://localhost:6188)

aima serve

Linux shared server: sudo aima init → aima deploy qwen3-4b → aima serve

Read the docs (GitHub README) →Downloads / Releases →