mirror of
https://github.com/microsoft/BitNet.git
synced 2026-05-03 11:20:36 +00:00
[fix] correct README
This commit is contained in:
@@ -22,7 +22,7 @@ A demo of bitnet.cpp running a BitNet b1.58 3B model on Apple M2:
https://github.com/user-attachments/assets/7f46b736-edec-4828-b809-4be780a3e5b1
## What's New:
- 01/15/2026 [BitNet CPU Inference Optimization](https://github.com/microsoft/BitNet/blob/main/src/README.md) 
- 05/20/2025 [BitNet Official GPU inference kernel](https://github.com/microsoft/BitNet/blob/main/gpu/README.md)
- 04/14/2025 [BitNet Official 2B Parameter Model on Hugging Face](https://huggingface.co/microsoft/BitNet-b1.58-2B-4T)
- 02/18/2025 [Bitnet.cpp: Efficient Edge Inference for Ternary LLMs](https://arxiv.org/abs/2502.11880)
@@ -50,8 +50,13 @@ build/bin/llama-quantize --token-embedding-type Q6_K models/BitNet-b1.58-2B-4T/g
### 1. Weight & Activation Parallelism
The kernel implements two parallelization strategies for weight and activation computation:
- **Weight Parallel:** Reduces kernel launch overhead by processing multiple weight rows/columns in a single kernel call
- **Activation Parallel:** Built on top of weight parallel, further reduces the unpack overhead when reading I2_S format weights by amortizing the unpacking cost across multiple activation elements
- **Recommendation:** For I2_S quantization format, activation parallel is recommended and used in all subsequent benchmarks
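
The amortization idea behind activation parallelism can be sketched as follows. This is an illustrative sketch, not the actual bitnet.cpp kernel: it assumes a hypothetical I2_S-style layout where four 2-bit ternary codes pack into one byte and code 0/1/2 decodes to -1/0/+1 (the real on-disk layout may differ), and all function names are invented for the example:

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical I2_S-style decode: four 2-bit codes per byte, 0/1/2 -> -1/0/+1.
static inline void unpack4(uint8_t packed, int8_t out[4]) {
    for (int i = 0; i < 4; ++i)
        out[i] = static_cast<int8_t>(((packed >> (2 * i)) & 0x3) - 1);
}

// Activation-parallel GEMV sketch: each weight byte is unpacked once and the
// four decoded weights are reused across several activation columns, so the
// unpack cost is amortized rather than paid again for every column.
void gemv_act_parallel(const uint8_t* w_packed, size_t n_bytes,
                       const int8_t* act, size_t n_cols,
                       int32_t* out /* one accumulator per column */) {
    for (size_t b = 0; b < n_bytes; ++b) {
        int8_t w[4];
        unpack4(w_packed[b], w);                  // decode once...
        for (size_t c = 0; c < n_cols; ++c) {     // ...reuse across columns
            const int8_t* a = act + c * n_bytes * 4 + b * 4;
            for (int i = 0; i < 4; ++i)
                out[c] += static_cast<int32_t>(w[i]) * a[i];
        }
    }
}
```

A naive kernel would decode the same weight byte inside the per-column loop, paying the unpack cost `n_cols` times; hoisting the decode out is the amortization described above.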

**Key Optimizations:**

- **Vectorized Operations:** Utilizes SIMD instructions (AVX2 for x86, NEON for ARM) to process multiple elements simultaneously
- **Parallel Accumulation:** Processes multiple weight-activation pairs in parallel, reducing sequential dependencies
- **Reduced Memory Latency:** Optimized memory access patterns minimize cache misses
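
As a minimal illustration of parallel accumulation (not the actual kernel, and with a hypothetical function name), the scalar analogue of what AVX2/NEON lanes provide is keeping several independent accumulator chains instead of one long serial dependency chain; this pattern also gives the compiler an easy auto-vectorization target:

```cpp
#include <cstdint>
#include <cstddef>

// Dot product with four independent accumulator chains, the scalar analogue
// of SIMD lane-wise accumulation: consecutive products no longer depend on
// the same running sum, so they can retire in parallel.
int32_t dot_parallel_accum(const int8_t* w, const int8_t* a, size_t n) {
    int32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {   // four independent chains per iteration
        acc0 += w[i + 0] * a[i + 0];
        acc1 += w[i + 1] * a[i + 1];
        acc2 += w[i + 2] * a[i + 2];
        acc3 += w[i + 3] * a[i + 3];
    }
    for (; i < n; ++i)             // scalar tail for leftover elements
        acc0 += w[i] * a[i];
    return acc0 + acc1 + acc2 + acc3;
}
```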