How to set up a CUDA environment on Google Colab?

September 28, 2023

TL;DR

Script for environment setup

!pip install git+https://github.com/andreinechaev/nvcc4jupyter.git
%load_ext nvcc_plugin
!mkdir -p /usr/local/cuda/bin
!ln -s /usr/bin/nvcc /usr/local/cuda/bin/nvcc

Recently I've been working on a research project on AI model privacy. With the theory done, I turned to the implementation, the dirty job. Early on, it always helps to run some tests on a free online platform. Google Colab fits my needs, so it is currently my primary choice.

Colab is mainly designed for hosting Python code. This is reasonable because in practice most ML code is written in Python. However, my work involves a lot of low-level programming, so Python alone is not enough: I need to run CUDA C++ directly on Colab. Fortunately, Colab supports this through the nvcc4jupyter plugin. Simply add %%cu at the beginning of a cell, and the cell will be compiled and run as a CUDA C++ file. There are quite a few tutorials on how to do this (This link is an example, but there are many more). However, all the tutorials I found date back to 2021, and they no longer work! This is because a few things have changed since then:

Google added CUDA toolkit support to Colab, so there is no need to reinstall it as in previous tutorials;
GitHub no longer serves the unauthenticated git:// protocol, so pip installs from git+git:// URLs fail;
Colab changed the nvcc path.

Therefore, we will need an updated version of the setup script.

Previously, the script looked like this:

# Part 1
!apt-get --purge remove cuda nvidia* libnvidia-*
!dpkg -l | grep cuda- | awk '{print $2}' | xargs -n1 dpkg --purge
!apt-get remove cuda-*
!apt autoremove
!apt-get update

# Part 2
!wget https://developer.nvidia.com/compute/cuda/9.2/Prod/local_installers/cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64 -O cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64.deb
!dpkg -i cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64.deb
!apt-key add /var/cuda-repo-9-2-local/7fa2af80.pub
!apt-get update
!apt-get install cuda-9.2

# Check
!nvcc --version

# Part 3
!pip install git+git://github.com/andreinechaev/nvcc4jupyter.git
%load_ext nvcc_plugin

As explained before, parts 1 and 2 are no longer necessary. You can run !nvcc --version directly to verify this. However, if you run part 3, it hangs forever. At first I suspected a network problem, but that did not seem plausible. Then I found the cause: git+git should be git+https, a consequence of the new GitHub policy. Other people have run into this problem as well.
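
The toolkit check can also be done from a regular Python cell. The following is a minimal sketch, not part of the original setup script; the helper name nvcc_version is made up for illustration, and it only assumes that the compiler is reachable through PATH.

```python
import shutil
import subprocess


def nvcc_version(executable="nvcc"):
    """Return the output of `<executable> --version`, or None if it is not found.

    `nvcc_version` is a hypothetical helper name, not part of any library.
    """
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    return result.stdout


# Prints the compiler banner if nvcc is preinstalled, or a short notice otherwise.
print(nvcc_version() or "nvcc is not on PATH")
```

If this prints a version banner, parts 1 and 2 of the old script can be skipped.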

But when you then try to run a %%cu cell, it still won't work: nvcc not found. But why? I verified that nvcc exists just seconds ago. After digging into the nvcc4jupyter plugin, I found that its source code hardcodes compiler = '/usr/local/cuda/bin/nvcc', which does not match the output of !which nvcc, namely /usr/bin/nvcc. This is again because we are now using the prebuilt nvcc provided by Colab rather than one we installed ourselves.
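
The mismatch described above can be confirmed with a few lines of Python. This is only a sketch: the function name diagnose_nvcc and the constant HARDCODED_NVCC are illustrative, and the hardcoded path is the one quoted above from the plugin source, which may differ in other plugin versions.

```python
import os
import shutil

# Path quoted in this post as hardcoded by the plugin (an assumption for other versions).
HARDCODED_NVCC = "/usr/local/cuda/bin/nvcc"


def diagnose_nvcc(hardcoded=HARDCODED_NVCC, search_path=None):
    """Return (nvcc path found on PATH or None, whether the hardcoded path exists)."""
    found = shutil.which("nvcc", path=search_path)
    return found, os.path.exists(hardcoded)


found, hardcoded_ok = diagnose_nvcc()
print("nvcc on PATH:", found)
print("hardcoded path exists:", hardcoded_ok)
```

When the first line shows a path but the second prints False, the plugin will fail to find the compiler even though nvcc itself is installed.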

So a quick fix should solve the problem:

!mkdir -p /usr/local/cuda/bin
!ln -s /usr/bin/nvcc /usr/local/cuda/bin/nvcc
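
One caveat: re-running the ln -s line in the same session reports that the file already exists. If that gets in the way, the same fix can be written so that it is safe to repeat. The sketch below is one possible approach; ensure_symlink is a hypothetical helper name, not something provided by Colab or the plugin.

```python
import os


def ensure_symlink(target, link):
    """Make `link` point at `target`; return True if a link was created or replaced."""
    os.makedirs(os.path.dirname(link), exist_ok=True)
    if os.path.islink(link):
        if os.readlink(link) == target:
            return False  # already correct, nothing to do
        os.remove(link)  # stale link pointing somewhere else
    elif os.path.exists(link):
        return False  # a real file is already there; leave it alone
    os.symlink(target, link)
    return True


# On Colab (paths taken from the quick fix above):
# ensure_symlink("/usr/bin/nvcc", "/usr/local/cuda/bin/nvcc")
```

The function deliberately leaves an existing regular file untouched, so it will not overwrite a compiler that was installed separately.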

And it works!

%%cu

#include <cmath>    // ceil
#include <cstdio>
#include <cstdlib>  // rand
#include <iostream>

using namespace std;

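// Each block scans one 256-element chunk of a and writes the chunk's maximum to b.
// Starting from max = 0 assumes the inputs are non-negative.
__global__ void maxi(int* a, int* b, int n)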
{
	int block = 256 * blockIdx.x;
	int max = 0;

	for (int i = block; i < min(256 + block, n); i++) {

		if (max < a[i]) {
			max = a[i];
		}
	}
	b[blockIdx.x] = max;
}

int main()
{

	int n;
	n = 3 << 2;
	int a[n];

	for (int i = 0; i < n; i++) {
		a[i] = rand() % n;
		cout << a[i] << "\t";
	}

	cudaEvent_t start, end;
	int *ad, *bd;
	int size = n * sizeof(int);
	cudaMalloc(&ad, size);
	cudaMemcpy(ad, a, size, cudaMemcpyHostToDevice);
	int grids = ceil(n * 1.0f / 256.0f);
	cudaMalloc(&bd, grids * sizeof(int));

	// One thread per block: each thread scans its 256-element chunk serially.
	dim3 block(1, 1);

	cudaEventCreate(&start);
	cudaEventCreate(&end);
	cudaEventRecord(start);

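	// Reduce repeatedly until a single value (the overall maximum) remains.
	while (n > 1) {
		grids = ceil(n * 1.0f / 256.0f);  // blocks needed for the current pass
		maxi<<<grids, block>>>(ad, bd, n);
		n = grids;  // each block produced one partial maximum
		cudaMemcpy(ad, bd, n * sizeof(int), cudaMemcpyDeviceToDevice);
	}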

	cudaEventRecord(end);
	cudaEventSynchronize(end);

	float time = 0;
	cudaEventElapsedTime(&time, start, end);

	int ans[1];
	cudaMemcpy(ans, ad, sizeof(int), cudaMemcpyDeviceToHost);

	cout << "The maximum element is : " << ans[0] << endl;

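	cout << "The time required : ";
	cout << time << endl;  // elapsed time in milliseconds

	// Release GPU resources.
	cudaEventDestroy(start);
	cudaEventDestroy(end);
	cudaFree(ad);
	cudaFree(bd);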
}

The code cell above prints:

7	10	9	7	5	7	10	0	9	1	2	7
The maximum element is : 10
The time required : 0.028032