[fix] correct README

deva100
2026-01-15 03:44:50 +00:00
parent 53ffe5e92b
commit 35b1c28585
2 changed files with 7 additions and 2 deletions
+1 -1
@@ -22,7 +22,7 @@ A demo of bitnet.cpp running a BitNet b1.58 3B model on Apple M2:
https://github.com/user-attachments/assets/7f46b736-edec-4828-b809-4be780a3e5b1
## What's New:
- 01/15/2026 [BitNet CPU Inference Optimization](https://github.com/microsoft/BitNet/blob/main/src/README.md) ![NEW](https://img.shields.io/badge/NEW-red)
- 01/15/2026 [BitNet CPU Inference Optimization](https://github.com/XsquirrelC/BitNet/blob/main/src/README.md) ![NEW](https://img.shields.io/badge/NEW-red)
- 05/20/2025 [BitNet Official GPU inference kernel](https://github.com/microsoft/BitNet/blob/main/gpu/README.md)
- 04/14/2025 [BitNet Official 2B Parameter Model on Hugging Face](https://huggingface.co/microsoft/BitNet-b1.58-2B-4T)
- 02/18/2025 [Bitnet.cpp: Efficient Edge Inference for Ternary LLMs](https://arxiv.org/abs/2502.11880)
+6 -1
@@ -50,8 +50,13 @@ build/bin/llama-quantize --token-embedding-type Q6_K models/BitNet-b1.58-2B-4T/g
### 1. Weight & Activation Parallelism
The key optimization introduces parallel processing paths for weight and activation computation. The kernel implements two parallelization strategies (a sketch follows the list below):
- **Weight Parallel:** Reduces kernel launch overhead by processing multiple weight rows/columns in a single kernel call
- **Activation Parallel:** Builds on weight parallel and further reduces the overhead of unpacking I2_S-format weights by amortizing the unpacking cost across multiple activation elements
- **Recommendation:** For the I2_S quantization format, activation parallel is recommended and is used in all subsequent benchmarks
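To make the two strategies concrete, here is a minimal C++ sketch. The function name, the assumed I2_S packing (four 2-bit ternary codes per byte), the tile sizes, and the data layout are illustrative assumptions, not the actual bitnet.cpp kernel:

```cpp
// Illustrative sketch only; names, the I2_S bit layout, and the loop
// structure are assumptions, not the actual bitnet.cpp kernel.
#include <cstdint>

// Unpack one byte of I2_S into four ternary values {-1, 0, +1}
// (assumes four 2-bit codes per byte).
static inline void unpack_i2s(uint8_t packed, int8_t out[4]) {
    for (int i = 0; i < 4; ++i)
        out[i] = static_cast<int8_t>((packed >> (2 * i)) & 0x3) - 1;
}

// Weight parallel: ROWS > 1 handles several weight rows in one call, so the
// per-call overhead is paid once instead of once per row.
// Activation parallel: COLS > 1 reuses each unpacked weight byte for several
// activation columns, amortizing the unpack cost (weight parallel alone is
// the COLS == 1 case).
template <int ROWS, int COLS>
void bitnet_gemm_sketch(const uint8_t* w_packed, const int8_t* x,
                        int32_t* y, int k) {
    for (int r = 0; r < ROWS; ++r) {
        int32_t acc[COLS] = {0};
        const uint8_t* row = w_packed + r * (k / 4);
        for (int i = 0; i < k / 4; ++i) {
            int8_t w[4];
            unpack_i2s(row[i], w);            // unpacked once per weight byte
            for (int c = 0; c < COLS; ++c)    // ...reused for every column
                for (int j = 0; j < 4; ++j)
                    acc[c] += w[j] * x[c * k + 4 * i + j];
        }
        for (int c = 0; c < COLS; ++c)
            y[c * ROWS + r] = acc[c];         // column-major outputs (assumed)
    }
}
```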
**Key Optimizations:**
- **Vectorized Operations:** Utilizes SIMD instructions (AVX2 for x86, NEON for ARM) to process multiple elements simultaneously (see the sketch after this list)
- **Parallel Accumulation:** Processes multiple weight-activation pairs in parallel, reducing sequential dependencies
- **Reduced Memory Latency:** Optimized memory access patterns minimize cache misses
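As a rough illustration of the vectorized, multi-accumulator pattern (x86/AVX2 path; NEON would follow the same structure), here is a sketch. The int16 inputs, the accumulator count, and the function name are assumptions chosen for clarity, not the kernel's actual data layout:

```cpp
#include <immintrin.h>
#include <cstdint>

// Dot product of two int16 arrays; n must be a multiple of 64.
int32_t dot_i16_avx2(const int16_t* a, const int16_t* b, int n) {
    // Four independent accumulators break the add dependency chain,
    // keeping several multiply-add results in flight per iteration.
    __m256i acc0 = _mm256_setzero_si256();
    __m256i acc1 = _mm256_setzero_si256();
    __m256i acc2 = _mm256_setzero_si256();
    __m256i acc3 = _mm256_setzero_si256();
    for (int i = 0; i < n; i += 64) {   // 4 vectors x 16 int16 lanes
        // madd multiplies 16 pairs and sums adjacent products into 8 int32s.
        acc0 = _mm256_add_epi32(acc0, _mm256_madd_epi16(
            _mm256_loadu_si256((const __m256i*)(a + i)),
            _mm256_loadu_si256((const __m256i*)(b + i))));
        acc1 = _mm256_add_epi32(acc1, _mm256_madd_epi16(
            _mm256_loadu_si256((const __m256i*)(a + i + 16)),
            _mm256_loadu_si256((const __m256i*)(b + i + 16))));
        acc2 = _mm256_add_epi32(acc2, _mm256_madd_epi16(
            _mm256_loadu_si256((const __m256i*)(a + i + 32)),
            _mm256_loadu_si256((const __m256i*)(b + i + 32))));
        acc3 = _mm256_add_epi32(acc3, _mm256_madd_epi16(
            _mm256_loadu_si256((const __m256i*)(a + i + 48)),
            _mm256_loadu_si256((const __m256i*)(b + i + 48))));
    }
    // Reduce the four vector accumulators to one scalar.
    __m256i sum = _mm256_add_epi32(_mm256_add_epi32(acc0, acc1),
                                   _mm256_add_epi32(acc2, acc3));
    int32_t lanes[8];
    _mm256_storeu_si256((__m256i*)lanes, sum);
    int32_t total = 0;
    for (int32_t v : lanes) total += v;
    return total;
}
```

The independent accumulators are the "parallel accumulation" point above: adding every product into a single accumulator would serialize the loop on that one register, while separate accumulators let consecutive iterations overlap.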