
Hi 🧌

I am Gokul. I am interested in machine learning, high-performance computing, and compilers. I am a former engineer at Samsung Research on the CPU, GPU, and Compiler team for neural acceleration, with 3 years of experience, and I am currently pursuing a master's in computer science at NYU Courant. I have experience with OpenCL GPU kernels, MLIR, TensorFlow, and Arm NN, and I optimized many USP use cases for flagship Galaxy devices. My passion currently lies in exploring neural accelerators and compilers.

PTX and Matrix Multiplication

This article is about me learning how to optimize matrix multiplication on a GPU using PTX. A GPU is a computing device to which the host machine offloads large data-parallel workloads. Learning about its limitations and writing efficient GPU code is a good skill to have. NVIDIA GPUs are among the industry leaders in this domain. Let's see how we can write matmul code for an NVIDIA GPU using its low-level instruction set. This was inspired by this article by Siboehm. ...
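As a reference point for what any PTX kernel must ultimately compute, here is a naive matrix multiply sketched in Python; the names are illustrative, not taken from the post:

```python
def matmul(A, B):
    """Naive O(n*m*k) matrix multiply: the baseline any optimized kernel must match."""
    n, k = len(A), len(B)
    m = len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):          # dot product of row i of A and column j of B
                acc += A[i][p] * B[p][j]
            C[i][j] = acc
    return C
```

The GPU versions in the article reorganize exactly this triple loop for memory coalescing and reuse.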

June 4, 2025 · 22 min · 4595 words · Gokul

Liveness Analysis and Interference Graph

In this article, I will be noting down my understanding of liveness analysis. We compile a source language to target code via multiple stages. At one stage, the representation resembles assembly language, but uses temporary variables instead of registers. The goal of the next step is to allocate registers for these temporaries and, when no register is available, to spill them to memory. To determine whether two variables can share the same register, we need to perform liveness analysis, which is why this step is crucial. ...
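The backward dataflow behind liveness can be sketched as follows; the IR encoding here (per-instruction defs, uses, and successor indices) is a simplification I am assuming, not the article's actual representation:

```python
def liveness(instrs):
    """Backward liveness to a fixed point.
    instrs: list of (defs, uses, successor_indices).
    live_out[i] = union of live_in over successors;
    live_in[i]  = uses[i] | (live_out[i] - defs[i]).
    """
    n = len(instrs)
    live_in = [set() for _ in range(n)]
    live_out = [set() for _ in range(n)]
    changed = True
    while changed:
        changed = False
        for i in reversed(range(n)):      # backward order converges faster
            defs, uses, succs = instrs[i]
            out = set().union(*(live_in[s] for s in succs)) if succs else set()
            inn = set(uses) | (out - set(defs))
            if inn != live_in[i] or out != live_out[i]:
                live_in[i], live_out[i] = inn, out
                changed = True
    return live_in, live_out

# Example program: a = 1; b = a + 1; c = a + b; return c
EX = [({'a'}, set(), [1]),
      ({'b'}, {'a'}, [2]),
      ({'c'}, {'a', 'b'}, [3]),
      (set(), {'c'}, [])]
LIVE_IN, LIVE_OUT = liveness(EX)
```

Here `a` and `b` are both live after the second instruction, so they interfere and cannot share a register.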

April 16, 2025 · 8 min · 1517 words · Gokul

RISC-V, a brief learning

Reduced Instruction Set Computer version 5 (RISC-V) is an open instruction set architecture. It supports both 32-bit and 64-bit address spaces. In this article, we will be learning the 32-bit version, RV32I, but before that let us explore hardware terms, the execution environment, and a general overview.

Hardware Terminology

Core: a hardware component is called a core if it contains an independent instruction fetch unit.

Hart: each instruction fetch is performed by a hardware thread (hart) on a core. Each core can execute multiple hardware threads (via hyper-threading, multi-threading, etc.).

Coprocessor: a hardware unit attached to a core that has additional architectural state and implements instruction extensions.

Accelerator: a core that can operate autonomously but is specialized for certain tasks. Example: an I/O processor that can offload I/O tasks from the main cores.

Execution Environment

It defines how software interacts with the hardware and system software, ensuring compatibility across different RISC-V implementations. This standardizes execution models, system calls, and memory models to enable smooth operation of applications, operating systems, and hypervisors. RISC-V supports different execution environments based on needs: ...
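As a small taste of RV32I ahead of the full article, here is a sketch of encoding an I-type instruction, assuming the standard field layout (imm[11:0], rs1, funct3, rd, opcode from high to low bits):

```python
def encode_i_type(opcode, rd, funct3, rs1, imm):
    """Pack an RV32I I-type instruction word: imm[11:0] | rs1 | funct3 | rd | opcode."""
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode

# addi x1, x0, 5  — ADDI uses opcode 0x13 and funct3 0b000
word = encode_i_type(0x13, rd=1, funct3=0b000, rs1=0, imm=5)
```

This yields `0x00500093`, the standard encoding of `addi x1, x0, 5`.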

February 5, 2025 · 9 min · 1866 words · Gokul

Short Note on Pratt Parsing

For the past couple of weeks, I have been trying to write an interpreter in C++. One of the challenging aspects was generating the abstract syntax tree (AST). One particularly interesting problem was handling operator precedence to generate a correct abstract syntax tree. For example:

a * b + c = ((a * b) + c)
a + b * c = (a + (b * c))

You can see that we cannot do a + b before multiplying, as that would violate the precedence of operations. The operator with higher precedence should 'sink' to the bottom of the AST: operators at the bottom of the syntax tree are evaluated first. ...
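The 'sinking' behaviour can be sketched with a minimal Pratt parser driven by binding powers. This toy version (in Python rather than the interpreter's C++) handles only single-character operands and the + and * operators:

```python
BP = {'+': 10, '*': 20}  # binding powers: * binds tighter than +

def parse(tokens):
    """Pratt-parse a token list into nested (op, lhs, rhs) tuples."""
    pos = 0
    def expr(min_bp):
        nonlocal pos
        lhs = tokens[pos]; pos += 1                       # consume an operand
        # keep extending lhs while the next operator binds tighter than min_bp
        while pos < len(tokens) and BP.get(tokens[pos], 0) > min_bp:
            op = tokens[pos]; pos += 1
            rhs = expr(BP[op])                            # right side: only tighter ops
            lhs = (op, lhs, rhs)
        return lhs
    return expr(0)
```

On `a*b+c` the `*` node ends up below the `+` node, matching the expected grouping.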

January 18, 2025 · 4 min · 665 words · Gokul

Lowering in MLIR

Introduction

First, let us understand the definition of lowering:

The process of transforming a higher-level representation of an operation into a lower-level, but semantically equivalent, representation

It means we have a representation at a high level of abstraction and we convert it into a lower-level abstraction, but it must remain computationally equivalent (i.e., it has the same meaning and cannot produce a different result). Why are we doing this? Because the entire computing ecosystem is built this way: we write code in programming languages that gets converted to assembly or bytecode, which eventually gets converted to 1s and 0s. Generally, LLVM IR is considered the lowest level in most cases, as we have standard compilers to do further lowering. ...
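As a toy illustration of "semantically equivalent" lowering, here is a hypothetical high-level dot-product op next to a lowered scalar-loop form (plain Python, not MLIR; both functions are illustrative):

```python
def dot_high(a, b):
    """High-level abstraction: one 'dot' op."""
    return sum(x * y for x, y in zip(a, b))

def dot_lowered(a, b):
    """Lowered form: explicit scalar loop of mul/add steps, as a loop-level IR would emit."""
    acc = 0
    i = 0
    while i < len(a):
        t = a[i] * b[i]    # mul
        acc = acc + t      # add
        i = i + 1
    return acc
```

The two are interchangeable precisely because lowering must preserve the computed result.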

December 18, 2024 · 5 min · 1042 words · Gokul

GPU Hardware

In this article, we will have a brief overview of GPU hardware from a programming perspective. I am a software engineer, and I do not have the time or resources to learn the nitty-gritty details of hardware engineering. However, learning about hardware is essential to writing efficient and clean programs; I learned that the hard way during my stint at Samsung. We will look at a discrete GPU setup, then see what modern NVIDIA GPUs look like, and then try to understand each part (from the NVIDIA whitepaper). ...

September 8, 2024 · 7 min · 1396 words · Gokul

OperationPass in MLIR

We will be reviewing the shape inference pass implemented in Toy chapter 4. In this article, we will see how to create an interface for an operation and use that interface to modify the IR. Operations that satisfy the condition for modification must implement this interface. Interface for an Operation: we can create an interface for an operation by inheriting from the OpInterface class. The functions that the interface forces an operation to implement can be declared via InterfaceMethod. ...
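A loose Python analogue of the idea (not the actual MLIR C++/TableGen machinery): ops opt in to an interface, and the pass only touches ops that implement it. All names here are illustrative:

```python
from abc import ABC, abstractmethod

class ShapeInference(ABC):
    """Interface: any op wanting shape inference must implement infer_shapes."""
    @abstractmethod
    def infer_shapes(self): ...

class AddOp(ShapeInference):
    def __init__(self, lhs_shape, rhs_shape):
        self.lhs_shape, self.rhs_shape = lhs_shape, rhs_shape
        self.result_shape = None
    def infer_shapes(self):
        self.result_shape = self.lhs_shape  # elementwise: result matches operands

def run_shape_inference_pass(ops):
    for op in ops:
        if isinstance(op, ShapeInference):  # only ops implementing the interface
            op.infer_shapes()
```

In MLIR the check is done via `dyn_cast` to the interface rather than `isinstance`, but the shape of the pass is the same.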

August 29, 2024 · 3 min · 484 words · Gokul

In-lining in MLIR

In the context of compilers, inlining or inline expansion is a process (or an optimization, depending on the use case) that replaces a function call with the body of the called function. Now let us see how we can inline a function defined in the IR. Prerequisites: before proceeding, note that I am implementing these features in my dialect Glow; please find more information here. One of the main requirements is defining the function feature in the dialect; we will be utilizing traits and TableGen to implement this. ...
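The core rewrite can be sketched on a toy expression IR (entirely hypothetical, not the Glow dialect): a call node is replaced by the callee's body with arguments substituted for parameters:

```python
# Toy IR nodes: ('const', n), ('var', name), ('*', l, r), ('call', fname, [args])
FUNCS = {  # fname -> (params, body expression)
    'square': (['x'], ('*', ('var', 'x'), ('var', 'x'))),
}

def substitute(expr, env):
    """Replace ('var', p) nodes with the bound argument expressions."""
    tag = expr[0]
    if tag == 'var':
        return env.get(expr[1], expr)
    if tag == 'const':
        return expr
    return (tag,) + tuple(substitute(e, env) for e in expr[1:])

def inline(expr):
    """Inline expansion: rewrite call nodes into the callee's substituted body."""
    tag = expr[0]
    if tag in ('var', 'const'):
        return expr
    if tag == 'call':
        _, fname, args = expr
        params, body = FUNCS[fname]
        env = dict(zip(params, [inline(a) for a in args]))
        return substitute(body, env)
    return (tag,) + tuple(inline(e) for e in expr[1:])
```

After inlining, `square(3)` is gone and only the multiply remains, which is exactly what exposes further optimizations.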

August 27, 2024 · 6 min · 1124 words · Gokul

Canonicalization 💣

This article summarizes my understanding of canonical forms from the perspective of intermediate representations (compilers). We also go through how we can use MLIR pattern rewrites to implement canonicalization of operations. What is canonicalization? It is a process which converts data (that can have more than one possible representation) into a 'standard', 'normal', or 'canonical' form. We can see this with a simple arithmetic expression which can have multiple forms: ...
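A toy canonicalizer over such expressions might fold constants and order commutative operands, so that equivalent forms converge to one representative (a sketch of the idea, not MLIR's pattern rewriter):

```python
def canon(expr):
    """Canonicalize (op, a, b) trees over + and *: fold constants, order operands."""
    if not isinstance(expr, tuple):
        return expr                                   # leaf: variable name or int
    op, a, b = expr
    a, b = canon(a), canon(b)
    if isinstance(a, int) and isinstance(b, int):     # constant folding
        return a + b if op == '+' else a * b
    if str(a) > str(b):                               # commutative ops: fix an order
        a, b = b, a
    return (op, a, b)
```

Both `x + (1 + 2)` and `3 + x` canonicalize to the same tree, so later passes can treat them as equal.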

August 25, 2024 · 6 min · 1121 words · Gokul

Torch Script Bazel

In this article, we will be exploring how to use PyTorch in C++. Python carries a lot of overhead and baggage when used in applications where performance is critical; in game engines or embedded-device applications, for example, Python as a front end is a poor fit. PyTorch provides C++ front-end APIs and a library for writing ML applications in a statically compiled language. You can find the documentation here. In this article, we will be using Bazel to build a C++ project which can use the PyTorch APIs. The main goal is to read an ML model that has been exported from PyTorch (Python) using a C++ application. ...

August 14, 2024 · 3 min · 635 words · Gokul