Have a look at my projects below. Alternatively, you can find me on Quepayal | GitHub

Accelerated KMeans with CUDA

Optimization of KMeans clustering using Kernel Fusion and Shared Memory on NVIDIA GPUs.

This project involves deep optimization of the KMeans algorithm for modern GPU architectures. I focused on reducing memory latency and maximizing compute utilization through customized CUDA kernels.

Kernel Fusion: Merged the “Find Nearest Neighbor” and “Accumulate Centroids” kernels into a single pass over the data, eliminating a redundant round-trip through global memory and cutting kernel launch overhead.
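A fused kernel of this shape might look like the following simplified sketch (illustrative, not the project's exact code): each thread finds its point's nearest centroid and immediately accumulates into the per-cluster sums with atomics, so the points array is read from global memory only once per iteration.

```cuda
// Illustrative fused kernel: nearest-centroid assignment and centroid
// accumulation in one pass, avoiding a second read of `points`.
__global__ void fused_assign_accumulate(const float *points, const float *centroids,
                                        float *sums, int *counts,
                                        int n, int k, int dim) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Phase 1 (formerly its own kernel): find the nearest centroid.
    int best = 0;
    float bestDist = 3.4e38f;
    for (int c = 0; c < k; ++c) {
        float dist = 0.0f;
        for (int d = 0; d < dim; ++d) {
            float diff = points[i * dim + d] - centroids[c * dim + d];
            dist += diff * diff;
        }
        if (dist < bestDist) { bestDist = dist; best = c; }
    }

    // Phase 2 (formerly a second kernel): accumulate into running sums.
    for (int d = 0; d < dim; ++d)
        atomicAdd(&sums[best * dim + d], points[i * dim + d]);
    atomicAdd(&counts[best], 1);
}
```

The host then divides each cluster's sum by its count to produce the new centroids.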

Memory Hierarchy Tuning: Optimized the Shared Memory variant to load centroids into on-chip memory, significantly reducing global memory traffic for high-dimensional data.

Warp Divergence Reduction: Utilized ternary operators and branchless programming to minimize warp divergence during centroid updates.
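The branchless pattern can be illustrated in plain C (function name hypothetical): selecting via ternary operators maps to predicated instructions rather than a divergent branch, keeping all threads in a warp on the same instruction stream.

```c
#include <float.h>

/* Illustrative branchless nearest-centroid selection: the comparison
 * produces 0/1 and the ternaries compile to conditional selects,
 * so no thread takes a different branch than its warp-mates. */
int nearest_centroid(const float *dists, int k) {
    int best = 0;
    float best_dist = FLT_MAX;
    for (int c = 0; c < k; ++c) {
        int closer = dists[c] < best_dist;        /* 0 or 1, no branch */
        best      = closer ? c : best;            /* predicated select */
        best_dist = closer ? dists[c] : best_dist;
    }
    return best;
}
```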

Bottleneck Analysis: Discovered that kernel fusion was counterproductive in the Shared Memory variant due to atomic contention, a critical insight into GPU resource management.

Concurrent BST Equivalence Analysis

A performance study of parallelization strategies and synchronization primitives in Go.

This project investigates the performance of detecting equivalent Binary Search Trees using various parallelization strategies. I analyzed the trade-offs between different concurrency models and synchronization overhead in the Go runtime.

Synchronization Analysis: Compared Goroutine-per-BST vs. Worker Pool models, identifying the sweet spot between scheduler overhead and task granularity.
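A minimal sketch of the worker-pool side of that comparison (job shape and names hypothetical): a fixed number of goroutines drain a job channel, instead of spawning one goroutine per tree pair, which bounds scheduler overhead when jobs are small.

```go
package main

import (
	"fmt"
	"sync"
)

// job asks whether two in-order traversals are equal.
type job struct{ a, b []int }

func equalTraversals(a, b []int) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// workerPool compares jobs with a fixed number of goroutines and
// returns how many pairs were equivalent.
func workerPool(jobs []job, workers int) int {
	in := make(chan job)
	out := make(chan bool)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range in {
				out <- equalTraversals(j.a, j.b)
			}
		}()
	}
	go func() {
		for _, j := range jobs {
			in <- j
		}
		close(in) // no more work; workers exit their range loops
	}()
	go func() { wg.Wait(); close(out) }()
	matches := 0
	for ok := range out {
		if ok {
			matches++
		}
	}
	return matches
}

func main() {
	jobs := []job{
		{[]int{1, 2, 3}, []int{1, 2, 3}},
		{[]int{1, 2, 3}, []int{1, 3, 2}},
	}
	fmt.Println(workerPool(jobs, 2))
}
```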

Lock Contention Study: Evaluated Channel-based communication vs. Fine-grained locking (sharded mutexes), identifying specific bottlenecks in shared data structure synchronization.

Advanced Data Structures: Implemented a thread-safe Disjoint Set Union (DSU) with path compression for equivalence tracking.
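A coarse-grained version of such a structure can be sketched as follows (a single mutex guarding the parent array, with path compression inside find; the project's fine-grained details are omitted):

```go
package main

import (
	"fmt"
	"sync"
)

// DSU is a thread-safe disjoint set union; all operations take the
// one mutex, which is the simplest correct (coarse-grained) design.
type DSU struct {
	mu     sync.Mutex
	parent []int
}

func NewDSU(n int) *DSU {
	p := make([]int, n)
	for i := range p {
		p[i] = i
	}
	return &DSU{parent: p}
}

// find walks to the root, compressing the path as it goes.
// The caller must hold d.mu.
func (d *DSU) find(x int) int {
	for d.parent[x] != x {
		d.parent[x] = d.parent[d.parent[x]] // path halving
		x = d.parent[x]
	}
	return x
}

func (d *DSU) Union(a, b int) {
	d.mu.Lock()
	defer d.mu.Unlock()
	ra, rb := d.find(a), d.find(b)
	if ra != rb {
		d.parent[ra] = rb
	}
}

func (d *DSU) Same(a, b int) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	return d.find(a) == d.find(b)
}

func main() {
	d := NewDSU(4)
	d.Union(0, 1)
	d.Union(2, 3)
	fmt.Println(d.Same(0, 1), d.Same(1, 2))
}
```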

System Primitives: Utilized mutexes, wait groups, and condition variables to manage complex state transitions across concurrent workers.

Distributed Barnes-Hut N-Body Simulation

High-performance gravitational force approximation using MPI and Hilbert Curve load balancing.

Implemented the Barnes–Hut algorithm to approximate gravitational forces in O(N log N) time. To handle distributed workloads, I utilized a coordinator-worker pattern leveraging MPI for inter-node communication.

Spatial Data Structures: Particles are stored in a spatial quadtree where internal nodes store aggregate mass and center-of-mass data.
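The aggregate data on internal nodes combines as a mass-weighted average; a one-dimensional sketch of the merge step (function names hypothetical):

```c
/* Merging two subtrees' aggregates: total mass adds, and the center
 * of mass is the mass-weighted average of the children's centers
 * (shown here along one axis; the real tree does this per dimension). */
double combined_mass(double m1, double m2) { return m1 + m2; }

double combined_com(double m1, double x1, double m2, double x2) {
    return (m1 * x1 + m2 * x2) / (m1 + m2);
}
```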

Locality Optimization: Computed Hilbert ordering of particle indices to group spatially close particles into contiguous ranges, significantly improving cache locality and reducing communication overhead.
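For the two-dimensional case, the Hilbert index of a grid cell can be computed with the standard bit-twiddling conversion (this is the textbook xy-to-distance routine, not necessarily the project's exact code):

```c
/* Map grid coordinates (x, y) on an n-by-n grid (n a power of two)
 * to a distance along the Hilbert curve. Cells that are close in
 * space tend to get close indices, which is what makes sorting by
 * this value a good partitioning key. */
unsigned xy2d(unsigned n, unsigned x, unsigned y) {
    unsigned d = 0;
    for (unsigned s = n / 2; s > 0; s /= 2) {
        unsigned rx = (x & s) > 0;
        unsigned ry = (y & s) > 0;
        d += s * s * ((3 * rx) ^ ry);
        /* Rotate/reflect the quadrant so the sub-curve lines up. */
        if (ry == 0) {
            if (rx == 1) {
                x = s - 1 - x;
                y = s - 1 - y;
            }
            unsigned t = x; x = y; y = t;
        }
    }
    return d;
}
```

Sorting particles by `xy2d` of their cell and cutting the sorted array into equal ranges gives each MPI rank a spatially compact block.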

Collective Communication: Managed data distribution and updates using MPI_Scatterv and MPI_Allgatherv to ensure synchronized state across the cluster.

Performance Tuning: Balanced the Multipole Acceptance Criterion (MAC) to maintain high accuracy while maximizing computational throughput.
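The MAC itself is a one-line test: a cell of width s at distance d from a particle is approximated as a single pseudo-particle when s/d falls below the opening angle theta. A sketch (parameter names assumed):

```c
/* Opening-angle test: accept the multipole approximation for a cell
 * of width `cell_width` at distance `dist` when cell_width / dist
 * is below theta. Smaller theta means more accuracy but deeper tree
 * traversal; larger theta trades accuracy for throughput. */
int mac_accepts(double cell_width, double dist, double theta) {
    return cell_width / dist < theta;
}
```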

Distributed Two-Phase Commit in Rust

A fault-tolerant distributed transaction coordinator implemented in Rust using IPC channels.

Implemented a robust Two-Phase Commit (2PC) protocol to coordinate atomic transactions across multiple distributed processes. This project focuses on the safety and liveness of distributed state machines.

Process Orchestration: Designed a system involving a Coordinator, N Clients, and M Participants using Rust’s ipc_channel for low-latency bidirectional communication.

Flow Control: Implemented a Sliding Window mechanism with configurable concurrency limits to prevent coordinator saturation and maximize system throughput.
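The flow-control idea can be sketched in plain Rust (the real system uses IPC channels; here a channel of completions stands in for participant replies, and names are hypothetical): new transactions are issued only while fewer than `window` are outstanding.

```rust
use std::sync::mpsc;

// Sliding-window sketch: issue transactions while the in-flight count
// is below `window`, then block on a completion before issuing more.
fn run_with_window(total: usize, window: usize) -> usize {
    let (done_tx, done_rx) = mpsc::channel();
    let mut sent = 0;
    let mut in_flight = 0;
    let mut completed = 0;
    while completed < total {
        while sent < total && in_flight < window {
            // "Issue" a transaction; in this toy model it completes
            // immediately by posting to the completion channel.
            done_tx.send(sent).unwrap();
            sent += 1;
            in_flight += 1;
        }
        done_rx.recv().unwrap(); // wait for one outstanding txn to finish
        in_flight -= 1;
        completed += 1;
    }
    completed
}

fn main() {
    assert_eq!(run_with_window(10, 3), 10);
    println!("all transactions completed within the window");
}
```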

Memory Safety: Leveraged Rust’s ownership model to manage IPC handles and process registration safely across child processes.

Atomic Protocol Logic: Handled the complex state transitions required to ensure “all-or-nothing” properties across independent processes.
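The core of that all-or-nothing rule is small: the coordinator commits only if phase one collected a Yes vote from every participant; a single No, or a timed-out (missing) vote, aborts the whole transaction. A sketch with hypothetical types:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Vote { Yes, No }

#[derive(PartialEq, Debug)]
enum Decision { Commit, Abort }

// Phase-two decision: unanimous Yes commits; anything else aborts.
// `None` models a participant whose vote timed out.
fn decide(votes: &[Option<Vote>]) -> Decision {
    if votes.iter().all(|v| *v == Some(Vote::Yes)) {
        Decision::Commit
    } else {
        Decision::Abort
    }
}

fn main() {
    assert_eq!(decide(&[Some(Vote::Yes), Some(Vote::Yes)]), Decision::Commit);
    assert_eq!(decide(&[Some(Vote::Yes), None]), Decision::Abort);
    println!("2PC decision rule ok");
}
```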