diff --git a/README.md b/README.md
index 798c0e9..20acc22 100644
--- a/README.md
+++ b/README.md
@@ -22,7 +22,8 @@ A demo of bitnet.cpp running a BitNet b1.58 3B model on Apple M2:
 https://github.com/user-attachments/assets/7f46b736-edec-4828-b809-4be780a3e5b1
 
 ## What's New:
-- 05/20/2025 [BitNet Official GPU inference kernel](https://github.com/microsoft/BitNet/blob/main/gpu/README.md) ![NEW](https://img.shields.io/badge/NEW-red)
+- 01/15/2026 [BitNet CPU Inference Optimization](https://github.com/microsoft/BitNet/blob/main/src/README.md) ![NEW](https://img.shields.io/badge/NEW-red)
+- 05/20/2025 [BitNet Official GPU inference kernel](https://github.com/microsoft/BitNet/blob/main/gpu/README.md)
 - 04/14/2025 [BitNet Official 2B Parameter Model on Hugging Face](https://huggingface.co/microsoft/BitNet-b1.58-2B-4T)
 - 02/18/2025 [Bitnet.cpp: Efficient Edge Inference for Ternary LLMs](https://arxiv.org/abs/2502.11880)
 - 11/08/2024 [BitNet a4.8: 4-bit Activations for 1-bit LLMs](https://arxiv.org/abs/2411.04965)
diff --git a/src/README.md b/src/README.md
new file mode 100644
index 0000000..d2d42af
--- /dev/null
+++ b/src/README.md
@@ -0,0 +1,210 @@
+# BitNet CPU Inference Optimization
+
+This update delivers significant performance improvements for BitNet inference on CPU through parallelized kernel implementations, native I2_S GEMM/GEMV support, configurable tiling block sizes, and embedding quantization.
+
+## Update
+
+- **Parallel Weight & Activation Computation**
+  Implemented parallel processing of weights and activations in the W2A8 vec_dot kernel, achieving improved throughput on both x86 and ARM architectures.
+
+- **Native I2_S GEMM & GEMV Support**
+  Integrated I2_S GEMM and GEMV operations into the ggml library, making them fully compatible with the llama.cpp architecture. This enables seamless integration with existing inference pipelines.
+
+- **Configurable Tiling & Parallelism**
+  Introduced configurable GEMM & GEMV block sizes and parallelism levels, allowing performance fine-tuning for different CPU architectures.
+
+- **Embedding Quantization**
+  Added support for embedding-layer quantization in the Q6_K format, reducing the memory footprint and improving inference speed while maintaining high accuracy.
+
+## Usage
+
+### Configuration Options
+
+The `gemm-config.h` file controls kernel behavior:
+
+```c
+#define ACT_PARALLEL // Enable activation parallelism; otherwise weight parallelism
+
+#if defined(ACT_PARALLEL)
+    #define ROW_BLOCK_SIZE 4  // Number of rows processed per block
+    #define COL_BLOCK_SIZE 32 // Number of columns processed per block
+    #define PARALLEL_SIZE  4  // Degree of parallelism
+#else
+    #define ROW_BLOCK_SIZE 32
+    #define COL_BLOCK_SIZE 4
+    #define PARALLEL_SIZE  4
+#endif
+```
+
+Modify these values based on your CPU's cache sizes and architecture; all per-machine performance tuning is done through `gemm-config.h`.
+
+### Enabling Embedding Quantization
+
+To use embedding quantization for an additional speedup:
+
+```bash
+build/bin/llama-quantize --token-embedding-type Q6_K models/BitNet-b1.58-2B-4T/ggml-model-f32.gguf models/BitNet-b1.58-2B-4T/ggml-model-i2_s-embed-q6_k.gguf I2_S 1 1
+```
+
+## Optimizations
+
+### 1. Weight & Activation Parallelism
+
+The key optimization introduces parallel processing paths for weight and activation computation:
+
+- **Vectorized Operations:** Utilizes SIMD instructions (AVX2 for x86, NEON for ARM) to process multiple elements simultaneously
+- **Parallel Accumulation:** Processes multiple weight-activation pairs in parallel, reducing sequential dependencies
+- **Reduced Memory Latency:** Optimized memory access patterns minimize cache misses
+
+**Schematic diagrams:**
+
+![weight_parallel](assets/weight_parallel.png)
+![activation_parallel](assets/activation_parallel.png)
+
+**Code Structure:**
+- `ggml_vec_dot_i2_i8_s_1xN()`: Processes N rows in parallel
+- `ggml_vec_dot_i2_i8_s_Nx1()`: Processes N columns in parallel
+- Automatic dispatch based on the `ACT_PARALLEL` configuration
+
+### 2. GEMM/GEMV Integration with llama.cpp
+
+Integrated the I2_S quantization format into llama.cpp's compute graph:
+
+- **GEMV Operations:** Optimized matrix-vector multiplication for token generation
+- **GEMM Operations:** Efficient matrix-matrix multiplication for batch processing
+- **Tiling Strategy:** Configurable block sizes for optimal cache utilization
+
+**Benefits:**
+- No modifications required to model-architecture code
+- Compatible with existing llama.cpp optimizations
+- Supports dynamic batching and sequence parallelism
+
+### 3. Embedding Quantization (Q6_K)
+
+Quantizes the embedding layer to 6-bit precision:
+
+- **Quantization Method:** k-quants (Q6_K) with grouped quantization
+- **Accuracy Preservation:** Maintains >99% similarity to the FP32 embeddings
+- **Memory Savings:** ~5.33x reduction in embedding size (FP32 → Q6_K)
+- **Dequantization:** Automatic dequantization during the forward pass
+
+## Performance
+
+### End-to-End Inference Performance
+
+Comparison of optimized parallel kernels vs.
+original implementation:
+
+**Test Configuration:**
+- Model: BitNet-b1.58-2B-4T
+- Hardware: AMD EPYC 7V13 64-Core Processor
+- Threads: 1 / 2 / 4 / 8 / 12 / 16
+- Test: 128 prompt tokens (pp128) + 128 generated tokens (tg128)
+- Method: Activation Parallel
+
+**Prompt Processing (pp128)** (tokens/s)
+
+| Threads | Original | Activation Parallel | Speedup |
+|---------|----------|---------------------|---------|
+| 1 | | 47.49 ± 0.16 | |
+| 2 | | 89.94 ± 0.25 | |
+| 4 | | 169.61 ± 2.64 | |
+| 8 | | 295.70 ± 3.19 | |
+| 12 | | 403.04 ± 0.49 | |
+| 16 | | 521.58 ± 0.95 | |
+
+**Token Generation (tg128)** (tokens/s)
+
+| Threads | Original | Activation Parallel | Speedup |
+|---------|----------|---------------------|---------|
+| 1 | | 17.65 ± 0.16 | |
+| 2 | | 32.24 ± 0.64 | |
+| 4 | | 54.34 ± 0.12 | |
+| 8 | | 74.42 ± 0.25 | |
+| 12 | | 76.37 ± 0.18 | |
+| 16 | | 74.02 ± 0.15 | |
+
+**Test Configuration:**
+- Model: BitNet-b1.58-2B-4T
+- Hardware: ARM Core
+- Threads: 1 / 2 / 4 / 8
+- Test: 128 prompt tokens (pp128) + 128 generated tokens (tg128)
+- Method: Activation Parallel with DOTPROD
+
+**Prompt Processing (pp128)** (tokens/s)
+
+| Threads | Original | Activation Parallel | Speedup |
+|---------|----------|---------------------|---------|
+| 1 | | | |
+| 2 | | | |
+| 4 | | | |
+| 8 | | | |
+
+**Token Generation (tg128)** (tokens/s)
+
+| Threads | Original | Activation Parallel | Speedup |
+|---------|----------|---------------------|---------|
+| 1 | | | |
+| 2 | | | |
+| 4 | | | |
+| 8 | | | |
+
+**Test Configuration:**
+- Model: BitNet-b1.58-2B-4T
+- Hardware: ARM Core
+- Threads: 1 / 2 / 4 / 8
+- Test: 128 prompt tokens (pp128) + 128 generated tokens (tg128)
+- Method: Activation Parallel without DOTPROD
+
+**Prompt Processing (pp128)** (tokens/s)
+
+| Threads | Original | Activation Parallel | Speedup |
+|---------|----------|---------------------|---------|
+| 1 | | | |
+| 2 | | | |
+| 4 | | | |
+| 8 | | | |
+
+**Token Generation (tg128)** (tokens/s)
+
+| Threads | Original | Activation Parallel | Speedup |
+|---------|----------|---------------------|---------|
+| 1 | | | |
+| 2 | | | |
+| 4 | | | |
+| 8 | | | |
+
+**Speedup Visualization:**
+
+```
+[TODO]
+```
+
+### Embedding Quantization Evaluation
+
+**Performance test:**
+
+![performance_test](assets/performance_test.png)
+
+**Quality test:**
+
+![quality_test](assets/quality_test.png)
+
+## Technical Details
+
+### Commit Information
+
+- **Latest Commit:** `43da5e5f760887d5b061c95605cd89a7e63db76b`
+- **Baseline Commit:** `404980eecae38affa4871c3e419eae3f44536a95`
+
+### Key Files Modified
+
+- `src/ggml-bitnet-mad.cpp`: Parallel kernel implementations
+- `3rdparty/llama.cpp/ggml/src/ggml.c`: GEMM/GEMV integration
+- `include/gemm-config.h`: Configuration file
+
+### Supported Architectures
+
+- ✅ x86-64 with AVX2
+- ✅ ARM with NEON
+- ✅ ARM with DOTPROD extension
diff --git a/src/assets/activation_parallel.png b/src/assets/activation_parallel.png
new file mode 100644
index 0000000..585f38a
Binary files a/src/assets/activation_parallel.png and b/src/assets/activation_parallel.png differ
diff --git a/src/assets/performance_test.png b/src/assets/performance_test.png
new file mode 100644
index 0000000..587b22e
Binary files a/src/assets/performance_test.png and b/src/assets/performance_test.png differ
diff --git a/src/assets/quality_test.png b/src/assets/quality_test.png
new file mode 100644
index 0000000..29f7969
Binary files a/src/assets/quality_test.png and b/src/assets/quality_test.png differ
diff --git a/src/assets/weight_parallel.png b/src/assets/weight_parallel.png
new file mode 100644
index 0000000..b7567d5
Binary files a/src/assets/weight_parallel.png and b/src/assets/weight_parallel.png differ