diff --git a/README.md b/README.md
index 20acc22..6f949fd 100644
--- a/README.md
+++ b/README.md
@@ -22,7 +22,7 @@ A demo of bitnet.cpp running a BitNet b1.58 3B model on Apple M2:
 https://github.com/user-attachments/assets/7f46b736-edec-4828-b809-4be780a3e5b1
 
 ## What's New:
-- 01/15/2026 [BitNet CPU Inference Optimization](https://github.com/microsoft/BitNet/blob/main/src/README.md) ![NEW](https://img.shields.io/badge/NEW-red)
+- 01/15/2026 [BitNet CPU Inference Optimization](https://github.com/XsquirrelC/BitNet/blob/main/src/README.md) ![NEW](https://img.shields.io/badge/NEW-red)
 - 05/20/2025 [BitNet Official GPU inference kernel](https://github.com/microsoft/BitNet/blob/main/gpu/README.md)
 - 04/14/2025 [BitNet Official 2B Parameter Model on Hugging Face](https://huggingface.co/microsoft/BitNet-b1.58-2B-4T)
 - 02/18/2025 [Bitnet.cpp: Efficient Edge Inference for Ternary LLMs](https://arxiv.org/abs/2502.11880)
diff --git a/src/README.md b/src/README.md
index d2d42af..b7eaef4 100644
--- a/src/README.md
+++ b/src/README.md
@@ -50,8 +50,13 @@ build/bin/llama-quantize --token-embedding-type Q6_K models/BitNet-b1.58-2B-4T/g
 
 ### 1. Weight & Activation Parallelism
 
-The key optimization introduces parallel processing paths for weight and activation computation:
+The kernel implements two parallelization strategies:
+- **Weight Parallel:** Reduces kernel launch overhead by processing multiple weight rows/columns in a single kernel call
+- **Activation Parallel:** Builds on weight parallel and further reduces the unpack overhead of reading I2_S-format weights by amortizing the unpacking cost across multiple activation elements
+- **Recommendation:** For the I2_S quantization format, activation parallel is recommended and is used in all subsequent benchmarks
+
+**Key Optimizations:**
 
 - **Vectorized Operations:** Utilizes SIMD instructions (AVX2 for x86, NEON for ARM) to process multiple elements simultaneously
 - **Parallel Accumulation:** Processes multiple weight-activation pairs in parallel, reducing sequential dependencies
 - **Reduced Memory Latency:** Optimized memory access patterns minimize cache misses
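
The new src/README.md text above describes activation parallelism as amortizing the I2_S unpack cost across multiple activation elements. The sketch below is a minimal scalar C illustration of that idea, not the actual bitnet.cpp kernel: the function names, the 2-bit-to-ternary mapping, and the memory layout are illustrative assumptions, and the real kernel vectorizes this loop with AVX2/NEON rather than scalar code.

```c
/*
 * Illustrative scalar sketch of the "activation parallel" idea (not the
 * bitnet.cpp kernel): each packed weight byte is unpacked once and the
 * unpacked values are reused across several activation columns.
 * Names, packing layout, and the 2-bit -> ternary mapping are assumptions.
 */
#include <stdint.h>
#include <stddef.h>

/* Unpack one byte holding four 2-bit codes into ternary weights {-1, 0, +1}. */
static inline void unpack_i2s_byte(uint8_t byte, int8_t w[4]) {
    for (int i = 0; i < 4; ++i) {
        uint8_t code = (byte >> (2 * i)) & 0x3; /* codes 0,1,2 assumed */
        w[i] = (int8_t)code - 1;                /* 0,1,2 -> -1,0,+1 */
    }
}

/*
 * One weight row (n elements packed into n/4 bytes, n a multiple of 4)
 * multiplied against n_cols activation vectors at once, so the unpack of
 * each byte is shared by every column.
 */
void gemv_row_act_parallel(const uint8_t *w_packed, size_t n,
                           const float *acts /* [n_cols][n] */, size_t n_cols,
                           float *out /* [n_cols] */) {
    for (size_t c = 0; c < n_cols; ++c) out[c] = 0.0f;

    for (size_t j = 0; j < n; j += 4) {
        int8_t w[4];
        unpack_i2s_byte(w_packed[j / 4], w);   /* unpack once ...            */
        for (size_t c = 0; c < n_cols; ++c) {  /* ... reuse for every column */
            const float *x = acts + c * n;
            out[c] += w[0] * x[j + 0] + w[1] * x[j + 1]
                    + w[2] * x[j + 2] + w[3] * x[j + 3];
        }
    }
}
```

Because the unpacked values in `w[4]` are reused for every activation column in the inner loop, the per-byte unpack cost is paid once per weight block rather than once per weight-activation pair, which is the amortization the README text refers to.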