Projects
Take a look at my projects below, or browse my Quepayal | GitHub
Accelerated KMeans with CUDA
This project involves deep optimization of the KMeans algorithm for modern GPU architectures. I focused on reducing memory latency and maximizing compute utilization through customized CUDA kernels.
Kernel Fusion: Merged the “Find Nearest Neighbor” and “Accumulate Centroids” kernels into a single pass. This optimization eliminated redundant global memory passes and reduced kernel launch overhead.
Memory Hierarchy Tuning: Optimized the Shared Memory variant to load centroids into on-chip memory, significantly reducing global memory traffic for high-dimensional data.
Warp Divergence Reduction: Utilized ternary operators and branchless programming to minimize warp divergence during centroid updates.
Bottleneck Analysis: Discovered that kernel fusion was counter-productive in the Shared Memory version due to atomic contention, a critical insight into GPU resource management.
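To illustrate the fusion idea, here is a minimal serial C++ analogue of the fused "assign + accumulate" pass (the real implementation is a CUDA kernel; the function and struct names here are my own):

```cpp
#include <cfloat>
#include <cstddef>
#include <vector>

// One sweep over the points both labels each point with its nearest centroid
// and accumulates per-centroid sums/counts, instead of two separate passes
// over the data (the effect kernel fusion has on global memory traffic).
struct FusedResult {
    std::vector<int> labels;     // nearest-centroid index per point
    std::vector<double> sums;    // k * dim, flattened per-centroid sums
    std::vector<int> counts;     // points assigned to each centroid
};

FusedResult assign_and_accumulate(const std::vector<double>& pts,
                                  const std::vector<double>& centroids,
                                  std::size_t n, std::size_t k, std::size_t dim) {
    FusedResult r{std::vector<int>(n), std::vector<double>(k * dim, 0.0),
                  std::vector<int>(k, 0)};
    for (std::size_t i = 0; i < n; ++i) {
        double best = DBL_MAX;
        int best_c = 0;
        for (std::size_t c = 0; c < k; ++c) {
            double d2 = 0.0;
            for (std::size_t j = 0; j < dim; ++j) {
                double diff = pts[i * dim + j] - centroids[c * dim + j];
                d2 += diff * diff;
            }
            // Ternary selection, mirroring the branchless style used to
            // reduce warp divergence on the GPU.
            best_c = (d2 < best) ? static_cast<int>(c) : best_c;
            best   = (d2 < best) ? d2 : best;
        }
        r.labels[i] = best_c;
        r.counts[best_c] += 1;  // the GPU version uses atomicAdd here
        for (std::size_t j = 0; j < dim; ++j)
            r.sums[best_c * dim + j] += pts[i * dim + j];
    }
    return r;
}
```

The per-centroid accumulation in the serial loop is exactly where the GPU version pays atomic contention, which is why fusion stopped helping in the Shared Memory variant.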
Concurrent BST Equivalence Analysis
This project investigates the performance of detecting equivalent Binary Search Trees using various parallelization strategies. I analyzed the trade-offs between different concurrency models and synchronization overhead in the Go runtime.
Synchronization Analysis: Compared Goroutine-per-BST vs. Worker Pool models, identifying the sweet spot between scheduler overhead and task granularity.
Lock Contention Study: Evaluated Channel-based communication vs. Fine-grained locking (sharded mutexes), identifying specific bottlenecks in shared data structure synchronization.
Advanced Data Structures: Implemented a thread-safe Disjoint Set Union (DSU) with path compression for equivalence tracking.
System Primitives: Utilized mutexes, wait groups, and condition variables to manage complex state transitions across concurrent workers.
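The equivalence-tracking structure can be sketched as follows (in C++ for illustration; the project itself is written in Go, and this coarse-locked variant is a simplification of the sharded-mutex design):

```cpp
#include <mutex>
#include <numeric>
#include <vector>

// Thread-safe Disjoint Set Union with path compression, guarded by a single
// coarse mutex. Two BSTs found to contain the same key set get united; a
// query then answers whether they belong to the same equivalence class.
class ConcurrentDSU {
    std::vector<int> parent;
    std::mutex m;  // coarse lock; a sharded-mutex design splits this up

    int find(int x) {  // path halving: point x partway toward its root
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }

public:
    explicit ConcurrentDSU(int n) : parent(n) {
        std::iota(parent.begin(), parent.end(), 0);  // each element is its own root
    }
    void unite(int a, int b) {
        std::lock_guard<std::mutex> lk(m);
        parent[find(a)] = find(b);
    }
    bool same(int a, int b) {
        std::lock_guard<std::mutex> lk(m);
        return find(a) == find(b);
    }
};
```

In the Go version the lock_guard role is played by sync.Mutex, and the worker pool hands (tree, tree) comparison tasks to a fixed set of goroutines rather than spawning one per BST.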
Distributed Barnes-Hut N-Body Simulation
Implemented the Barnes–Hut algorithm to approximate gravitational forces in O(N log N) time. To handle distributed workloads, I utilized a coordinator-worker pattern leveraging MPI for inter-node communication.
Spatial Data Structures: Particles are stored in a spatial quadtree where internal nodes store aggregate mass and center-of-mass data.
Locality Optimization: Computed Hilbert ordering of particle indices to group spatially close particles into contiguous ranges, significantly improving cache locality and reducing communication overhead.
Collective Communication: Managed data distribution and updates using MPI_Scatterv and MPI_Allgatherv to ensure synchronized state across the cluster.
Performance Tuning: Balanced the Multipole Acceptance Criterion (MAC) to maintain high accuracy while maximizing computational throughput.
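The MAC trade-off can be sketched in a few lines (serial C++ for illustration, not the distributed code; the struct layout here is an assumption):

```cpp
#include <cmath>

// An internal quadtree node summarizes its subtree by total mass and
// center of mass, plus the side length of its square region.
struct Node {
    double mass;        // aggregate mass of the subtree
    double comx, comy;  // center of mass of the subtree
    double size;        // side length of the node's region
};

// Multipole Acceptance Criterion: accept the aggregate when
// size / distance < theta. Smaller theta opens more nodes per force
// evaluation (higher accuracy, lower throughput).
bool mac_accept(const Node& n, double px, double py, double theta) {
    double dx = n.comx - px, dy = n.comy - py;
    double dist = std::sqrt(dx * dx + dy * dy);
    return n.size < theta * dist;
}

// Far-field contribution from an accepted node's aggregate (G omitted).
void accumulate_force(const Node& n, double px, double py,
                      double& fx, double& fy) {
    double dx = n.comx - px, dy = n.comy - py;
    double r2 = dx * dx + dy * dy + 1e-12;  // softening avoids divide-by-zero
    double inv_r3 = n.mass / (r2 * std::sqrt(r2));
    fx += dx * inv_r3;
    fy += dy * inv_r3;
}
```

When mac_accept fails, the traversal recurses into the node's children instead; tuning theta is what balances accuracy against throughput.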
Distributed Two-Phase Commit in Rust
Implemented a robust Two-Phase Commit (2PC) protocol to coordinate atomic transactions across multiple distributed processes. This project focuses on the safety and liveness of distributed state machines.
Process Orchestration: Designed a system involving a Coordinator, N Clients, and M Participants using Rust’s ipc_channel for low-latency bidirectional communication.
Flow Control: Implemented a Sliding Window mechanism with configurable concurrency limits to prevent coordinator saturation and maximize system throughput.
Memory Safety: Leveraged Rust’s ownership model to manage IPC handles and process registration safely across child processes.
Atomic Protocol Logic: Handled the complex state transitions required to ensure “all-or-nothing” properties across independent processes.
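Two pieces of the design can be sketched compactly (in C++ for illustration; the project itself is in Rust on top of ipc_channel, and the type names here are my own):

```cpp
#include <cstddef>
#include <deque>
#include <vector>

enum class Vote { Yes, No };
enum class Decision { Commit, Abort };

// Phase 2 decision rule: the coordinator broadcasts Commit only on a
// unanimous Yes from phase 1 — a single No aborts every participant,
// preserving the all-or-nothing property.
Decision decide(const std::vector<Vote>& votes) {
    for (Vote v : votes)
        if (v == Vote::No) return Decision::Abort;
    return Decision::Commit;
}

// Sliding window: admit a new transaction only while the in-flight count
// is below the configured limit, preventing coordinator saturation.
class Window {
    std::size_t limit;
    std::deque<int> in_flight;  // transaction ids awaiting a decision
public:
    explicit Window(std::size_t lim) : limit(lim) {}
    bool try_admit(int txid) {
        if (in_flight.size() >= limit) return false;
        in_flight.push_back(txid);
        return true;
    }
    void complete(int txid) {  // decision delivered; free a window slot
        for (auto it = in_flight.begin(); it != in_flight.end(); ++it)
            if (*it == txid) { in_flight.erase(it); break; }
    }
};
```

In the Rust version, ownership of the IPC handles makes it a compile-time error for two tasks to race on the same participant channel, which is what keeps the state transitions above safe.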