I'm a sixth-year Computer Science Ph.D. student at the University of Texas at Austin, working under the guidance of Professor Keshav Pingali in the Intelligent Software Systems (ISS) Group. I received my B.Tech. in Computer Science from the Indian Institute of Technology Roorkee, India, in 2012.
Before joining the Ph.D. program at UT Austin, I worked at IBM Research - India. I have also interned at Georgia Tech (2011), VMware (2013), and Facebook (2018).

Research Interests

My research interests are High-Performance Computing, Graph Analytics, Compilers, Runtime Systems, Distributed Computing, and Computer Architecture. Currently, I am working on distributed systems for graph analytics applications at a very large scale, in terms of both the size of the graphs and the number of machines in a distributed cluster. I focus on the performance, scalability, productivity, and reliability of these systems.



(737) 932-0300


Gurbinder Gill, Roshan Dathathri, Loc Hoang, Andrew Lenharth, Keshav Pingali. Abelian: A Compiler for Graph Analytics on Distributed, Heterogeneous Platforms, International European Conference on Parallel and Distributed Computing (Euro-Par), 2018, August 2018.

The trend towards processor heterogeneity and distributed-memory has significantly increased the complexity of parallel programming. In addition, the mix of applications that need to run on parallel platforms today is very diverse, and includes graph applications that typically have irregular memory accesses and unpredictable control-flow. To simplify the programming of graph applications on such platforms, we have implemented a compiler called Abelian that translates shared-memory descriptions of graph algorithms written in the Galois programming model into efficient code for distributed-memory platforms with heterogeneous processors. The compiler manages inter-device synchronization and communication while leveraging state-of-the-art compilers for generating device specific code. The experimental results show that the novel communication optimizations in the Abelian compiler reduce the volume of communication by 23×, enabling the code produced by Abelian to match the performance of handwritten distributed CPU and GPU programs that use the same runtime. The programs produced by Abelian for distributed CPUs are roughly 2.4× faster than those in the Gemini system, a third-party distributed CPU-only system, demonstrating that Abelian can manage heterogeneity and distributed-memory successfully while generating high-performance code.

Roshan Dathathri*, Gurbinder Gill*, Loc Hoang, Hoang-Vu Dang, Alex Brooks, Nikoli Dryden, Marc Snir, Keshav Pingali (* Both authors contributed equally). Gluon: A Communication-Optimizing Substrate for Distributed Heterogeneous Graph Analytics, ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2018, June 2018.

This paper introduces a new approach to building distributed-memory graph analytics systems that exploits heterogeneity in processor types (CPU and GPU), partitioning policies, and programming models. The key to this approach is Gluon, a communication-optimizing substrate. Programmers write applications in a shared-memory programming system of their choice and interface these applications with Gluon using a lightweight API. Gluon enables these programs to run on heterogeneous clusters and optimizes communication in a novel way by exploiting structural and temporal invariants of graph partitioning policies. To demonstrate Gluon’s ability to support different programming models, we interfaced Gluon with the Galois and Ligra shared-memory graph analytics systems to produce distributed-memory versions of these systems named D-Galois and D-Ligra, respectively. To demonstrate Gluon’s ability to support heterogeneous processors, we interfaced Gluon with IrGL, a state-of-the-art single-GPU system for graph analytics, to produce D-IrGL, the first multi-GPU distributed-memory graph analytics system. Our experiments were done on CPU clusters with up to 256 hosts and roughly 70,000 threads and on multi-GPU clusters with up to 64 GPUs. The communication optimizations in Gluon improve end-to-end application execution time by ∼2.6× on the average. D-Galois and D-IrGL scale well and are faster than Gemini, the state-of-the-art distributed CPU graph analytics system, by factors of ∼3.9× and ∼4.9×, respectively, on the average.

Hoang-Vu Dang, Roshan Dathathri, Gurbinder Gill, Alex Brooks, Nikoli Dryden, Andrew Lenharth, Loc Hoang, Keshav Pingali, Marc Snir. A Lightweight Communication Runtime for Distributed Graph Analytics, IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2018, May 2018.

Distributed-memory multi-core clusters enable in-memory processing of very large graphs with billions of nodes and edges. Recent distributed graph analytics systems have been built on top of MPI. However, communication in graph applications is very irregular, and each host exchanges different amounts of non-contiguous data with other hosts. MPI does not support such a communication pattern well, and it has limited ability to integrate communication with serialization, deserialization, and graph computation tasks. In this paper, we describe a lightweight communication runtime called LCI that supports a large number of threads on each host and avoids the semantic mismatches between the requirements of graph computations and the communication library in MPI. The implementation of LCI is informed by lessons learnt from two baseline MPI-based implementations. We have successfully integrated LCI with two state-of-the-art graph analytics systems - Gemini and Abelian. LCI improves the latency up to 3.5× for microbenchmarks compared to MPI solutions and improves the end-to-end performance of distributed graph algorithms by up to 2×.

Gurbinder Singh Gill, Vaibhav Saxena, Rashmi Mittal, Thomas George, Yogish Sabharwal, Lalit Dagar. Evaluation and enhancement of weather application performance on Blue Gene/Q, High Performance Computing (HiPC), 2013, December 2013.

Numerical weather prediction (NWP) models use mathematical models of the atmosphere to predict the weather. Ongoing efforts in the weather and climate community continuously try to improve the fidelity of weather models by employing higher-order numerical methods suitable for solving model equations at high resolutions. In a realistic weather forecasting scenario, simulating and tracking multiple regions of interest (nests) at fine resolutions is important for understanding the interplay between multiple weather phenomena and for comprehensive predictions. These multiple regions of interest in a simulation can be significantly different in resolution and other modeling parameters. Currently, the weather simulations involving these nested regions process them one after the other in a sequential fashion. There exists a lot of prior work in performance evaluation and optimization of weather models; however, most of this work is limited to simulations involving either a single domain or multiple nests with the same resolution and model parameters, such as model physics options. In this paper, we evaluate and enhance the performance of the popular WRF model on the IBM Blue Gene/Q system. We consider nested simulations with multiple child domains and study how parameters such as physics options and simulation time steps for child domains affect the computational requirements. We also analyze how such configurations can benefit from parallel execution of the child domains rather than processing them sequentially. We demonstrate that it is important to allocate processors to nested child domains in proportion to the workload associated with them when executing them in parallel. This ensures that the time spent in the different nested simulations is nearly equal, and the nested domains reach the synchronization step with the parent simulation together.
Our experimental evaluation using a simple heuristic for allocation of nodes shows that the performance of WRF simulations can be improved by up to 14% by parallel execution of sibling domains with different configurations of domain sizes, temporal resolutions, and physics options.

Durgaprasad Gangodkar, Sachin Gupta, Gurbinder Singh Gill, Padam Kumar, Ankush Mittal. Efficient variable size template matching using fast normalized cross correlation on multicore processors, International Conference on Advanced Computing, Networking and Security (ADCONS), 2011, December 2011.

Normalized Cross Correlation (NCC) is an efficient and robust way of finding the location of a template in a given image. However, NCC is computationally expensive. Fast normalized cross correlation (FNCC) makes use of pre-computed sum-tables to improve the computational efficiency of NCC. In this paper, we propose a strategy for parallel implementation of the FNCC algorithm using NVIDIA's Compute Unified Device Architecture (CUDA) for real-time template matching. We also present an approach to make the proposed method adaptable to variable-size templates, which is an important challenge to tackle. Efficient parallelization strategies, adopted for pre-computing sum-tables and extracting data parallelism by dividing the image into a series of blocks, substantially reduce the required computational time. We show that by optimal utilization of the different memories available on a multicore architecture and by exploiting the asynchronous nature of CUDA kernel calls, we can obtain a speedup on the order of 17× compared to the sequential implementation.



My poster titled “Abelian: A Compiler and Runtime for Graph Analytics on Distributed, Heterogeneous Platforms” won the IPDPS 2018 Outstanding Poster Presentation Award, 1st Place.


I received the MCD Fellowship from the Graduate School at The University of Texas at Austin, August 2013 to May 2016.