mirror of
https://github.com/microsoft/BitNet.git
synced 2026-05-03 11:20:36 +00:00
[fix] correct README
This commit is contained in:
@@ -22,7 +22,7 @@ A demo of bitnet.cpp running a BitNet b1.58 3B model on Apple M2:
https://github.com/user-attachments/assets/7f46b736-edec-4828-b809-4be780a3e5b1
## What's New:
- 01/15/2026 [BitNet CPU Inference Optimization](https://github.com/microsoft/BitNet/blob/main/src/README.md) 
- 05/20/2025 [BitNet Official GPU inference kernel](https://github.com/microsoft/BitNet/blob/main/gpu/README.md)
- 04/14/2025 [BitNet Official 2B Parameter Model on Hugging Face](https://huggingface.co/microsoft/BitNet-b1.58-2B-4T)
- 02/18/2025 [Bitnet.cpp: Efficient Edge Inference for Ternary LLMs](https://arxiv.org/abs/2502.11880)
@@ -50,8 +50,13 @@ build/bin/llama-quantize --token-embedding-type Q6_K models/BitNet-b1.58-2B-4T/g
### 1. Weight & Activation Parallelism
The kernel implements two parallelization strategies for weight and activation computation:
- **Weight Parallel:** Reduces kernel launch overhead by processing multiple weight rows/columns in a single kernel call
- **Activation Parallel:** Built on top of weight parallel, further reduces the unpack overhead when reading I2_S format weights by amortizing the unpacking cost across multiple activation elements
- **Recommendation:** For I2_S quantization format, activation parallel is recommended and used in all subsequent benchmarks
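
The amortization idea behind activation parallelism can be sketched as follows. This is an illustrative sketch, not the actual bitnet.cpp kernel: it assumes a hypothetical I2_S-style layout where four 2-bit ternary codes pack into one byte and code 0/1/2 decodes to -1/0/+1 (the real on-disk layout may differ), and all function names are invented for the example:

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical I2_S-style decode: four 2-bit codes per byte, 0/1/2 -> -1/0/+1.
static inline void unpack4(uint8_t packed, int8_t out[4]) {
    for (int i = 0; i < 4; ++i)
        out[i] = static_cast<int8_t>(((packed >> (2 * i)) & 0x3) - 1);
}

// Activation-parallel GEMV sketch: each weight byte is unpacked once and the
// four decoded weights are reused across several activation columns, so the
// unpack cost is amortized rather than paid again for every column.
void gemv_act_parallel(const uint8_t* w_packed, size_t n_bytes,
                       const int8_t* act, size_t n_cols,
                       int32_t* out /* one accumulator per column */) {
    for (size_t b = 0; b < n_bytes; ++b) {
        int8_t w[4];
        unpack4(w_packed[b], w);                  // decode once...
        for (size_t c = 0; c < n_cols; ++c) {     // ...reuse across columns
            const int8_t* a = act + c * n_bytes * 4 + b * 4;
            for (int i = 0; i < 4; ++i)
                out[c] += static_cast<int32_t>(w[i]) * a[i];
        }
    }
}
```

A naive kernel would decode the same weight byte inside the per-column loop, paying the unpack cost `n_cols` times; hoisting the decode out is the amortization described above.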

**Key Optimizations:**

- **Vectorized Operations:** Utilizes SIMD instructions (AVX2 for x86, NEON for ARM) to process multiple elements simultaneously
- **Parallel Accumulation:** Processes multiple weight-activation pairs in parallel, reducing sequential dependencies
- **Reduced Memory Latency:** Optimized memory access patterns minimize cache misses
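
As a minimal illustration of parallel accumulation (not the actual kernel, and with a hypothetical function name), the scalar analogue of what AVX2/NEON lanes provide is keeping several independent accumulator chains instead of one long serial dependency chain; this pattern also gives the compiler an easy auto-vectorization target:

```cpp
#include <cstdint>
#include <cstddef>

// Dot product with four independent accumulator chains, the scalar analogue
// of SIMD lane-wise accumulation: consecutive products no longer depend on
// the same running sum, so they can retire in parallel.
int32_t dot_parallel_accum(const int8_t* w, const int8_t* a, size_t n) {
    int32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {   // four independent chains per iteration
        acc0 += w[i + 0] * a[i + 0];
        acc1 += w[i + 1] * a[i + 1];
        acc2 += w[i + 2] * a[i + 2];
        acc3 += w[i + 3] * a[i + 3];
    }
    for (; i < n; ++i)             // scalar tail for leftover elements
        acc0 += w[i] * a[i];
    return acc0 + acc1 + acc2 + acc3;
}
```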