This project is a deep optimization of the KMeans algorithm for modern GPU architectures. Through custom CUDA kernels, I focused on reducing memory latency and maximizing compute utilization.

Kernel Fusion: Merged the “Find Nearest Neighbor” and “Accumulate Centroids” kernels into a single pass. This optimization eliminated redundant global memory passes and reduced kernel launch overhead.
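A fused kernel of this kind might look like the following sketch. The kernel name, memory layout, and signatures are illustrative assumptions, not the project's actual code; each thread assigns one point to its nearest centroid and immediately accumulates into global sums, instead of writing assignments out for a second kernel to re-read.

```cuda
#include <cfloat>
#include <cuda_runtime.h>

// Hypothetical fused pass: nearest-centroid search + accumulation.
// points:    n x d, row-major     centroids: k x d, row-major
// sums:      k x d, zeroed        counts:    k,     zeroed
__global__ void fusedAssignAccumulate(const float* __restrict__ points,
                                      const float* __restrict__ centroids,
                                      float* __restrict__ sums,
                                      int*   __restrict__ counts,
                                      int n, int k, int d)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Pass 1 work: find the nearest centroid for point i.
    int best = 0;
    float bestDist = FLT_MAX;
    for (int c = 0; c < k; ++c) {
        float dist = 0.0f;
        for (int j = 0; j < d; ++j) {
            float diff = points[i * d + j] - centroids[c * d + j];
            dist += diff * diff;
        }
        if (dist < bestDist) { bestDist = dist; best = c; }
    }

    // Pass 2 work, fused in: accumulate while the point is still
    // in registers, avoiding a second trip through global memory.
    for (int j = 0; j < d; ++j)
        atomicAdd(&sums[best * d + j], points[i * d + j]);
    atomicAdd(&counts[best], 1);
}
```

A separate host-side step would then divide each row of `sums` by `counts` to produce the updated centroids.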

Memory Hierarchy Tuning: Optimized the Shared Memory variant to load centroids into on-chip memory, significantly reducing global memory traffic for high-dimensional data.
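The shared-memory variant could be sketched as below (again an illustrative assumption, not the project's code): the block cooperatively stages the centroid table on-chip once, so each point's `k * d` distance reads hit shared memory rather than global memory.

```cuda
#include <cfloat>
#include <cuda_runtime.h>

// Assignment kernel with centroids staged in shared memory.
// Launch with dynamic shared memory sized k * d * sizeof(float):
//   assignShared<<<blocks, threads, k * d * sizeof(float)>>>(...);
__global__ void assignShared(const float* __restrict__ points,
                             const float* __restrict__ centroids,
                             int* __restrict__ assignments,
                             int n, int k, int d)
{
    extern __shared__ float sCentroids[];

    // Cooperative load: threads stride across the centroid table.
    for (int t = threadIdx.x; t < k * d; t += blockDim.x)
        sCentroids[t] = centroids[t];
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int best = 0;
    float bestDist = FLT_MAX;
    for (int c = 0; c < k; ++c) {
        float dist = 0.0f;
        for (int j = 0; j < d; ++j) {
            // Centroid reads now come from on-chip memory.
            float diff = points[i * d + j] - sCentroids[c * d + j];
            dist += diff * diff;
        }
        if (dist < bestDist) { bestDist = dist; best = c; }
    }
    assignments[i] = best;
}
```

This trades shared-memory capacity for bandwidth, so it assumes `k * d * sizeof(float)` fits within the per-block shared-memory limit; larger centroid tables would need tiling.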

Warp Divergence Reduction: Utilized ternary operators and branchless programming to minimize warp divergence during centroid updates.
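As a small illustration of the branchless pattern, the per-centroid minimum update inside the search loop can be written with selects instead of a data-dependent branch, which the compiler can lower to predicated instructions so all threads in a warp follow one instruction stream:

```cuda
// Inside the distance loop, instead of:
//   if (dist < bestDist) { bestDist = dist; best = c; }
// use a branchless select (illustrative snippet):
bool closer = dist < bestDist;
bestDist = closer ? dist : bestDist;   // or fminf(dist, bestDist)
best     = closer ? c    : best;
```

The effect is most visible when neighboring threads' points disagree about which centroid wins, which is exactly when a branch would diverge.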

Bottleneck Analysis: Discovered that kernel fusion was counter-productive in the Shared Memory version: fusing accumulation into the assignment pass funnels every update through atomic operations on a small block-private region, and the resulting contention outweighed the savings from fusion, a critical insight into GPU resource management.
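The contention point can be sketched as follows (an illustrative reconstruction under assumed names and layout, not the measured code): when accumulation is privatized into shared memory per block, every thread's updates serialize on the same small `k * d` region, and popular centroids become atomic hot spots.

```cuda
#include <cuda_runtime.h>

// Fused accumulation into block-private shared buffers.
// Launch with dynamic shared memory of k*d floats + k ints.
__global__ void fusedSharedAccumulate(const float* __restrict__ points,
                                      const int* __restrict__ assignments,
                                      float* __restrict__ sums,
                                      int*   __restrict__ counts,
                                      int n, int k, int d)
{
    extern __shared__ float smem[];
    float* bSums   = smem;                 // k * d partial sums
    int*   bCounts = (int*)(smem + k * d); // k partial counts

    for (int t = threadIdx.x; t < k * d; t += blockDim.x) bSums[t] = 0.0f;
    for (int t = threadIdx.x; t < k;     t += blockDim.x) bCounts[t] = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int c = assignments[i];
        // Hot spot: all threads in the block contend on the same
        // small shared-memory region; points mapped to a popular
        // centroid serialize on identical addresses.
        for (int j = 0; j < d; ++j)
            atomicAdd(&bSums[c * d + j], points[i * d + j]);
        atomicAdd(&bCounts[c], 1);
    }
    __syncthreads();

    // Flush block-private partials to the global accumulators.
    for (int t = threadIdx.x; t < k * d; t += blockDim.x)
        atomicAdd(&sums[t], bSums[t]);
    for (int t = threadIdx.x; t < k;     t += blockDim.x)
        atomicAdd(&counts[t], bCounts[t]);
}
```

With skewed assignments, the serialized shared-memory atomics can cost more than the extra global-memory pass that fusion was meant to remove, which is consistent with the profiling result above.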