diff --git a/README.md b/README.md
index 20acc22..6f949fd 100644
--- a/README.md
+++ b/README.md
@@ -22,7 +22,7 @@ A demo of bitnet.cpp running a BitNet b1.58 3B model on Apple M2:
 https://github.com/user-attachments/assets/7f46b736-edec-4828-b809-4be780a3e5b1
 
 ## What's New:
-- 01/15/2026 [BitNet CPU Inference Optimization](https://github.com/microsoft/BitNet/blob/main/src/README.md) ![NEW](https://img.shields.io/badge/NEW-red)
+- 01/15/2026 [BitNet CPU Inference Optimization](https://github.com/XsquirrelC/BitNet/blob/main/src/README.md) ![NEW](https://img.shields.io/badge/NEW-red)
 - 05/20/2025 [BitNet Official GPU inference kernel](https://github.com/microsoft/BitNet/blob/main/gpu/README.md)
 - 04/14/2025 [BitNet Official 2B Parameter Model on Hugging Face](https://huggingface.co/microsoft/BitNet-b1.58-2B-4T)
 - 02/18/2025 [Bitnet.cpp: Efficient Edge Inference for Ternary LLMs](https://arxiv.org/abs/2502.11880)
diff --git a/src/README.md b/src/README.md
index d2d42af..b7eaef4 100644
--- a/src/README.md
+++ b/src/README.md
@@ -50,8 +50,13 @@ build/bin/llama-quantize --token-embedding-type Q6_K models/BitNet-b1.58-2B-4T/g
 
 ### 1. Weight & Activation Parallelism
 
-The key optimization introduces parallel processing paths for weight and activation computation:
+The kernel implements two parallelization strategies:
+- **Weight Parallel:** Reduces kernel launch overhead by processing multiple weight rows/columns in a single kernel call
+- **Activation Parallel:** Builds on weight parallel and further reduces the unpack overhead of reading I2_S-format weights by amortizing the unpacking cost across multiple activation elements
+- **Recommendation:** For the I2_S quantization format, activation parallel is recommended and is used in all subsequent benchmarks
+
+**Key Optimizations:**
 
 - **Vectorized Operations:** Utilizes SIMD instructions (AVX2 for x86, NEON for ARM) to process multiple elements simultaneously
 - **Parallel Accumulation:** Processes multiple weight-activation pairs in parallel, reducing sequential dependencies
 - **Reduced Memory Latency:** Optimized memory access patterns minimize cache misses
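
The new src/README.md text above describes activation parallelism as amortizing the I2_S unpack cost across multiple activation elements. The sketch below is a minimal scalar C illustration of that idea, not the actual bitnet.cpp kernel: the function names, the 2-bit-to-ternary mapping, and the memory layout are illustrative assumptions, and the real kernel vectorizes this loop with AVX2/NEON rather than scalar code.

```c
/*
 * Illustrative scalar sketch of the "activation parallel" idea (not the
 * bitnet.cpp kernel): each packed weight byte is unpacked once and the
 * unpacked values are reused across several activation columns.
 * Names, packing layout, and the 2-bit -> ternary mapping are assumptions.
 */
#include <stdint.h>
#include <stddef.h>

/* Unpack one byte holding four 2-bit codes into ternary weights {-1, 0, +1}. */
static inline void unpack_i2s_byte(uint8_t byte, int8_t w[4]) {
    for (int i = 0; i < 4; ++i) {
        uint8_t code = (byte >> (2 * i)) & 0x3; /* codes 0,1,2 assumed */
        w[i] = (int8_t)code - 1;                /* 0,1,2 -> -1,0,+1 */
    }
}

/*
 * One weight row (n elements packed into n/4 bytes, n a multiple of 4)
 * multiplied against n_cols activation vectors at once, so the unpack of
 * each byte is shared by every column.
 */
void gemv_row_act_parallel(const uint8_t *w_packed, size_t n,
                           const float *acts /* [n_cols][n] */, size_t n_cols,
                           float *out /* [n_cols] */) {
    for (size_t c = 0; c < n_cols; ++c) out[c] = 0.0f;

    for (size_t j = 0; j < n; j += 4) {
        int8_t w[4];
        unpack_i2s_byte(w_packed[j / 4], w);   /* unpack once ...            */
        for (size_t c = 0; c < n_cols; ++c) {  /* ... reuse for every column */
            const float *x = acts + c * n;
            out[c] += w[0] * x[j + 0] + w[1] * x[j + 1]
                    + w[2] * x[j + 2] + w[3] * x[j + 3];
        }
    }
}
```

Because the unpacked values in `w[4]` are reused for every activation column in the inner loop, the per-byte unpack cost is paid once per weight block rather than once per weight-activation pair, which is the amortization the README text refers to.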