Computer graphics programs usually work with *homogeneous coordinates*, whose
main purpose is to allow for some operations like translation to be linear and
thus vectorizable. Vectorized operations can be heavily parallelized on the
processing unit which results in a tremendous performance boost. Video games make
heavy use of this technique.

Homogeneous coordinates are often introduced as a *trick*. I think this is a
misconception and it paves the way for a lot of misunderstandings and errors
when working with computer graphics.

In the following article I will detail a more natural approach to the concept.

In the following we will consider canonical Euclidean planes and spaces. \(x\), \(y\) and \(z\) are the canonical Euclidean dimensions or the vector components depending on the context.

The initial space is called the projective space.

All the transformations are supposed to be endomorphisms.

Linear transformations in a Euclidean space can be written as a matrix \(M\) so that the transformed point \(P'\) of \(P\) is:

\[ P' = M \cdot P \]In 2D, the matrix is of size \((2,2)\). A few examples follow:

- 2D scaling \((x,y) \to (sx, sy)\):

Rotation of angle \(\theta\):

In 2D:

\[ \begin{pmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \\ \end{pmatrix} \]In 3D, around the \(z\)-axis:

\[ \begin{pmatrix} \cos(\theta) & -\sin(\theta) & 0 \\ \sin(\theta) & \cos(\theta) & 0 \\ 0 & 0 & 1 \\ \end{pmatrix} \]

Other linear transformations include shearing and mirroring.

In the following section we will consider the projective space to be a 2D plane. This is only to make the explanation more visual, since this can be easily generalized to any dimension.

Some operations are not linear in the projective plane. But what if we consider the projective plane as being a plane in a 3D space? Then we can apply any 3D transformation on the points of this plane.

We say that we *promote* the plane to the next dimension.

The *fundamental principle* here is that the higher dimension allows for more
complex transformations, e.g. some specific shearing operations in 3D can be
seen as a translation on the plane.

After the desired transformations have been applied, we project the resulting points back to the plane.

There are some parameters we need to specify:

How do we promote the points? I.e. where does the projective plane lie in 3D?

How do we project the transformation result back to the plane?

In order to answer this we need to analyze how some non-linear operations behave.

Translations are not linear in the 2D-Euclidean space. Now if we promote the input points to a 3D space, they will all lie in a plane. It is obviously simpler to keep the \(x\) and \(y\) vectors identical. Then the only parameter for defining the plane is \(z\).

Let’s have a look at the following shearing of some point \(P=(i,j,k)\) by the scalars \(a\) and \(b\):

\[ \begin{pmatrix} 1 & 0 & a \\ 0 & 1 & b \\ 0 & 0 & 1 \\ \end{pmatrix} \cdot (i, j, k) = (i+ka, j+kb, k) \]If \(k=1\), then the transformed point is \((i+a, j+b, 1)\), which is a translation on the plane \(z=1\).

Considering the projective plane at \(z=1\) makes it easy to write a translation matrix.

Conclusion: the points from a projective space of any dimension can be promoted to the next dimension by adding a coordinate with the scalar value 1.

If the transformed points lie in the plane, the projection back to the projective space is trivial: we simply cut off the last dimension.

However, the transformed points might end up somewhere outside the plane. In that case we need some way to map the 3D points to the 2D points. There are several ways of doing this.

One would be to simply cut off the last dimension.

Another solution would be to divide all dimensions by the last one when it is non-zero. The last dimension would then always be 1, which matches the position of the plane \(z=1\) as we defined it previously.

First let’s consider some point \(P=(i,j,k)\) and let’s see what happens if we apply the same transformation as before:

\[ \begin{pmatrix} 1 & 0 & a \\ 0 & 1 & b \\ 0 & 0 & 1 \\ \end{pmatrix} \cdot (i,j,k) = (i+ka, j+ka, k) = P' \]If we project by cutting off the last coordinate, the projection of \(P\) is \((i, j)\), the projection of \(P'\) is \((i+ka, j+kb)\). The transformation is generally

*not*the translation by \((a, b)\).Using the division by the last coordinate, the projection of \(P\) is \((i/k, j/k)\), the projection of \(P'\) is \((i/k+a, j/k+b)\). This time the transformation is the translation by \((a, b)\).

Conclusion: the projection back to the projective space is done by dividing all dimensions by the last one when non-zero. This allows for translations in the projective space.

We have seen one example that covers it all. But what if the above analysis was good enough for translations only? Let us have a look at a very common operation in computer graphics: perspective correction.

In this section we consider a 3D projective space. The higher dimension is 4D and the 4th coordinates we be written \(w\).

When the camera and the screen are centered on the \(z\)-axis, perspective projection results from simple triangular relations, i.e. by dividing \(x\) and \(y\) by \(\alpha+\beta\cdot z\), where \(\alpha\) and \(\beta\) are scalars depending on the distance from the camera and the screen to the origin.

In the previous scheme, \(i'=i\frac{z_s-z_c}{k-z_c}\) and \(j'=j\frac{z_s-z_c}{k-z_c}\).

In the simplest case the camera is at the origin and the screen is at \(z=1\), as above. We get the simple transformation where \(\alpha=1\) and \(\beta=1\), that is, in our example, \(i'=i/k\) and \(j'=j/k\).

But there is no way a matrix multiplication can induce a division by one of the dimension. After all, matrices are a notation for linear transformations.

There is one way out though: a division by one dimension is performed during the projection back to the projective space. As such, our lead is to “inject” \(z\) into \(w\):

\[ \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & \beta & \alpha \\ \end{pmatrix} \cdot (i, j, k, 1) = (i, j, k, \alpha+\beta k) \]If we project the result back to the projective space, we get the 3D point \((\frac{i}{\alpha+\beta k}, \frac{j}{\alpha+\beta k}, \frac{k}{\alpha+\beta k})\). If we project the result on the plane \(z=z_s\), we obtain the perspective projection of \((i, j, k)\).

Conclusion: one more time, the division by the last coordinate is a projection technique that allows for operations in the next dimension to be properly mapped to the projective space.

The above analysis issued a relation between points in the projective space and points in the next dimension.

Making use of the above reasoning, homogeneous coordinates offer an alternative
*notation* to cartesian coordinates. We insist on the word ‘notation’ as there
are no *new points* nor *new spaces*. It means that \((i,j,k)\) and \((i,j,k,1)_h\)
specify the same point in 3D.

We use the \(_h\) index to avoid confusion between homogeneous coordinates and cartesian coordinates in a higher dimension.

The mapping between the two coordinate systems is central: the cartesian coordinates are obtained by dividing all dimensions by the homogeneous component if not null, and discarding this last component. This has several consequences:

Infinity can be represented using finite coordinates, e.g. \((x, y, 0)_h\) in 2D.

The \(0_h\) vector is ignored.

The origin is the vector where all components are 0 but the last one is 1, e.g. \((0,0,1)_h\) in 2D.

All points on a line going through \(0_h\) represent the same point. A non-null scalar multiplication over homogeneous coordinates does not change the point. For instance \((i, j) = (i, j, 1)_h = (2i, 2j, 2)_h = (\alpha i, \alpha j, \alpha)_h\) for all scalars \(\alpha\).

In computer graphics, homogeneous coordinates play an important role as they allow for perspective transformations in 3D.

The 3D rendering pipeline can be summarized as follows (*C* stands for
*cartesian*, *H* stands for *homogeneous*):

Vertex data (3D C) → Object data (3D H) → Eye coordinates (3D H) → Clip coordinates (3D H) → Normalized device coordinates (3D C) → Window coordinates (2D C)

Vertex data are 3D vectors. The graphic pipeline uses homogeneous coordinates for object data, with the homogeneous component being 1, for the reason detailed above.

Object data is then translated and rotated as needed via \((4,4)_h\)-matrices. This yields the homogeneous eye coordinates.

A \((4,4)_h\)-perspective matrix is applied on the eye coordinates, which leads to the homogeneous clip coordinates.

The homogeneous clip coordinates are converted to cartesian coordinates, a.k.a. normalized device coordinates (NDC), by dividing every coordinate by the last one, as usual.

The NDC are converted to the 2D window coordinates by removing the third dimension and applying some scaling.

Let us consider the following problem: we want to rectify the perspective of a picture, e.g. we want a tilted building facade picture to appear as if it had been taken orthogonally.

We proceed with control points: we select pixels in the input picture and tell the program where we would like it to be on the transformed picture.

A \((2,2)\)-matrix will not do the trick as it is not able to handle translations for instance.

Using homogeneous coordinates, we can transform the entire plane to the orthogonal picture we would like. In 3D, this can be seen as tilting the plane to make it match the screen.

This transformation can be embodied within a \((3,3)_h\)-matrix \(M\). The answer to our problem is \(M\).

Let us consider a set of known point pairs \(\langle \text{ control point } (i,j), \text{ target point } (I,J)\rangle\), so that

\[ \begin{pmatrix} m_{11} & m_{12} & m_{13} \\ m_{21} & m_{22} & m_{23} \\ m_{31} & m_{32} & m_{33} \\ \end{pmatrix} \cdot (i, j, 1)_h = (\alpha I, \alpha J, \alpha)_h \]We could be tempted to think we only need the first 2 rows of the matrix: this is wrong! The result of the matrix multiplication yields the homogeneous coordinates of a point, which still needs to be projected to the 2D plane to be significant. Therefore we need to compute the homogeneous component of every point.

The previous matrix operation yields the following equations:

\[ \left\{ \begin{array}{l l} m_{11}i + m_{12}j + m_{13} &= \alpha I \\ m_{21}i + m_{22}j + m_{23} &= \alpha J \\ m_{31}i + m_{32}j + m_{33} &= \alpha \\ \end{array} \right. \]or, cancelling the scale factor:

\[ \left\{ \begin{array}{l l} \frac{m_{11}i + m_{12}j + m_{13}}{m_{31}i + m_{32}j + m_{33}} &= I \\ \frac{m_{21}i + m_{22}j + m_{23}}{m_{31}i + m_{32}j + m_{33}} &= J \\ \end{array} \right. \]The matrix has 9 unknown coefficients, but since it is homogeneous, it is defined up to a scale factor. For instance, we have \(M=M/m_{11}\) if \(m_{11}≠0\). As such, only 8 coefficient determines a homogeneous matrix uniquely.

We can solve the system of 8 linearly independent equations constructed as above from 4 well chosen points. The above equation pair yields

\[ \left\{ \begin{array}{l l} m_{11}i + m_{12}j + m_{13} - m_{31}iI - m_{32}jI - m_{33}I &= 0 \\ m_{21}i + m_{22}j + m_{23} - m_{31}iJ - m_{32}jJ - m_{33}J &= 0 \\ \end{array} \right. \]which can be written as a multiplication

\[ \left\{ \begin{array}{l l} (i, j, 1, 0, 0, 0, -iI, -jI, -I) \cdot m^T &= 0 \\ (0, 0, 0, i, j, 1, -iJ, -jJ, -J) \cdot m^T &= 0 \\ \end{array} \right. \]with \(m=(m_{11}, m_{12}, m_{13}, m_{21}, m_{22}, m_{23}, m_{31}, m_{32}, m_{33})\).

We can build the following matrix from 4 point sets:

\[ S = \begin{pmatrix} i_1 & j_1 & 1 & 0 & 0 & 0 & -i_1I_1 & -j_1I_1 & -I_1 \\ 0 & 0 & 0 & i_1 & j_1 & 1 & -i_1J_1 & -j_1J_1 & -J_1 \\ ... \\ i_4 & j_4 & 1 & 0 & 0 & 0 & -i_4I_4 & -j_4I_4 & -I_4 \\ 0 & 0 & 0 & i_4 & j_4 & 1 & -i_4J_4 & -j_4J_4 & -J_4 \\ \end{pmatrix} \]If \(det(S) ≠ 0\), then the result can be found by solving

\[ S\cdot m = 0 \]which is equivalent to finding the eigenvalues of \(S\). An SVD will numerically find \(m\).

The present article does not claim to provide any new results when it comes to
homogeneous coordinates. It only aims at explaining the theory more *visually*,
and thus making it easier to understand and remember. It all makes perfect
sense, it is not just a trick. Mathematically speaking, the key argument that I
introduced above is to *consider the projective space as an affine hyperplane in
the next dimension*. Then everything falls into place:

the link between cartesian and homogeneous coordinates;

why this actually works;

when we should manipulate homogeneous coordinates, when we should project back to the projective space.

The central idea can be seen as a hook in the process: we momentarily promote
our data to the next dimension, manipulate it in there, then project the result
back to the original dimension. It generally *does not make sense* to exploit a
result in homogeneous coordinates.