
@wbruna
Contributor

@wbruna wbruna commented Dec 6, 2025

Introduces a new --use-mmap flag that replaces model loading I/O operations with mmap + memcpy.

In my tests, this helps model loading speed slightly, though the gain was never higher than half a second. Its primary benefit right now is validation of the mmap backend implementation. Later, I plan to extend this to allow the mapped file to serve directly as weight storage for backends that use main memory.
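For readers unfamiliar with the pattern, the mmap + memcpy approach can be sketched roughly as follows. This is a minimal POSIX-only illustration, not the code from this PR: `MappedFile` and `copy_to` are hypothetical names, and the real loader's buffer handling is assumed to be more involved.

```cpp
// POSIX-only sketch (Linux/macOS). Windows would need a different mapping API.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of the PR): read-only mapping of a whole file.
class MappedFile {
public:
    explicit MappedFile(const std::string& path) {
        int fd = open(path.c_str(), O_RDONLY);
        if (fd < 0) {
            throw std::runtime_error("open failed: " + path);
        }
        struct stat st;
        if (fstat(fd, &st) != 0) {
            close(fd);
            throw std::runtime_error("fstat failed: " + path);
        }
        size_ = static_cast<size_t>(st.st_size);
        if (size_ > 0) {  // mmap rejects zero-length mappings
            void* p = mmap(nullptr, size_, PROT_READ, MAP_PRIVATE, fd, 0);
            if (p == MAP_FAILED) {
                close(fd);
                throw std::runtime_error("mmap failed: " + path);
            }
            data_ = static_cast<const uint8_t*>(p);
        }
        close(fd);  // the mapping stays valid after the descriptor is closed
    }

    ~MappedFile() {
        if (data_ != nullptr) {
            munmap(const_cast<uint8_t*>(data_), size_);
        }
    }

    MappedFile(const MappedFile&) = delete;
    MappedFile& operator=(const MappedFile&) = delete;

    // Stands in for a seek + read pair: copies n bytes starting at offset
    // into dst. Returns false if the requested range is outside the file.
    bool copy_to(void* dst, size_t offset, size_t n) const {
        if (offset > size_ || n > size_ - offset) {
            return false;
        }
        if (n > 0) {
            std::memcpy(dst, data_ + offset, n);
        }
        return true;
    }

    size_t size() const { return size_; }

private:
    const uint8_t* data_ = nullptr;
    size_t size_ = 0;
};
```

With a helper like this, each tensor read becomes a bounds-checked `copy_to` call into the existing I/O buffer, which is why the change can stay small. Serving the mapped pages directly as weight storage would instead require the mapping to outlive loading.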

I used a non-default flag to be extra safe, but we could arguably follow llama.cpp's approach instead: enable mmap by default and provide a --no-mmap flag to disable it.

I was only able to test (and build...) it under Linux, so additional testing is very welcome 🙂

@Green-Sky
Contributor

How much value would it be if llama.cpp exported the mmap stuff as a library?

@wbruna
Contributor Author

wbruna commented Dec 9, 2025

> How much value would it be if llama.cpp exported the mmap stuff as a library?

I don't think it'd help that much right now. The mmap part itself is more-or-less straightforward; replacing the current alloc+memcpy code with a buffer managed externally will be much trickier.

@valkarias

valkarias commented Dec 10, 2025

Have you experimented with mmapping and then copying to the GPU?
In my experience, it didn't work well, so I've restricted mmap to CPU inference and loading. mmap -> copy to GPU became a bottleneck for some reason (page size, potentially?).

@wbruna
Contributor Author

wbruna commented Dec 10, 2025

> Have you experimented with mmapping and then copying to the GPU? In my experience, it didn't work well, so I've restricted mmap to CPU inference and loading. mmap -> copy to GPU became a bottleneck for some reason (page size, potentially?).

Not yet. Right now I'm just reusing the I/O buffer; adding a separate code path to deliver the mapped area directly to the backend just to avoid a memcpy sounded like too much change for too little potential gain.

That behavior you describe sounds... odd. At least on Linux, large dynamically-allocated memory areas use mmap as their backend anyway, so they should behave the same. Maybe it's a difference between file-backed and anonymous mappings.



3 participants