Visualizing complex data in geographical space using PCA colouring

I have recently had some very interesting discussions with scientists in China about visualizing complex AMR data (e.g. patterns of gene abundance or bacterial taxa) in geographical space, to show whether data from nearby locations are similar to or different from each other. Our collaborators had already shown that PCA was successful at separating the data, so the idea I had was to use the PCA scores to colour points, which could then be plotted on a map at the locations from which the data were derived. Nearby points with similar data should then have similar colours, while nearby points with different data should have different colours.

To test the idea, I used a built-in data set in R (state.x77 in the datasets package). This has demographic data about states in the USA. The same package also provides longitude/latitude coordinates for the centres of the states (state.center). In the analysis I have:

1. Run a PCA on the demographic data
2. Normalized the first three PCA scores to be between 0 and 1 (since this is what the rgb() function in R requires to define colours)
3. Used a simple map library to plot a map of the USA including state boundaries
4. Plotted the points into the centres of the states using the rgb() defined colours
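
As a minimal sketch of step 2, the toy example below rescales some made-up scores (not the state data) and turns each row into one colour. The helper name rescale01 and the toy_scores matrix are invented here purely for illustration.

```r
# Hypothetical helper: rescale a numeric vector into [0, 1)
rescale01 <- function(x, eps = 1e-3) {
  (x - min(x)) / (max(x) - min(x) + eps)
}

# Made-up scores for four locations on three principal components
toy_scores <- cbind(PC1 = c(-2.0, -1.9,  0.5,  3.0),
                    PC2 = c( 1.0,  0.9, -1.5,  0.2),
                    PC3 = c( 0.10, 0.12, 0.0, -0.4))

# Rescale each column separately, then turn each row into one colour
scaled <- apply(toy_scores, 2, rescale01)
toy_cols <- rgb(red   = scaled[, "PC1"],
                green = scaled[, "PC3"],
                blue  = scaled[, "PC2"])
print(toy_cols)  # four hex colour strings, one per location
```

The first two rows have nearly the same scores, so they should come out as similar colours, while the other two rows should look clearly different.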

What you can see in the map (below) is that generally nearby states are similar to each other in that they have similar colours – but there are some exceptions where the colours are very different.


The code is:

require(maps) # Simple R interface for maps
require(datasets) # Contains some example data

# This function converts a range into the range [0,1) which we need for the rgb colour map
normalize = function(x,eps=1e-3) { # eps is a small number to ensure the outputs are all <1
    as.numeric((x-min(x))/(max(x)-min(x)+eps)) # the last expression is the return value
}

spc = princomp(state.x77[,3:6]) # this is some demographic data about states in the USA - it is just an example
sred = normalize(spc$scores[,1]) # spc$scores holds the PCA scores, one row per state
sgreen = normalize(spc$scores[,3]) # this order, i.e. using blue on the 2nd PC, is to help red-green colour blind people
sblue = normalize(spc$scores[,2])

map('usa') # draws a simple map of USA
map('state',add=T) # adds state boundaries
points(state.center$x,state.center$y,pch=19,col=rgb(red=sred,green=sgreen,blue=sblue)) # puts coloured points into the centre of each state. An alternative could be to fill the states
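
The last comment mentions filling the states as an alternative. The following is only a rough sketch of how that might look, not something from the original analysis. It assumes the polygon names in the maps 'state' database are lower-case state names with an optional ":suffix" for sub-regions; rescale01, state_cols, poly_state and idx are names made up for this example.

```r
library(maps)      # map() and the 'state' polygon database
library(datasets)  # state.x77 and state.name

# Hypothetical helper: rescale a numeric vector into [0, 1)
rescale01 <- function(x, eps = 1e-3) (x - min(x)) / (max(x) - min(x) + eps)

# One colour per state from the first three PC scores (same colour mapping as above)
sc <- princomp(state.x77[, 3:6])$scores
state_cols <- rgb(red   = rescale01(sc[, 1]),
                  green = rescale01(sc[, 3]),
                  blue  = rescale01(sc[, 2]))

# Polygon names can carry a sub-region suffix after a colon; strip it to get the state
poly_names <- map("state", namesonly = TRUE, plot = FALSE)
poly_state <- sub(":.*", "", poly_names)

# Look up each polygon's state; polygons with no match get NA and stay unfilled
idx <- match(poly_state, tolower(state.name))
map("state", fill = TRUE, col = state_cols[idx])
```

Alaska and Hawaii are not part of the 'state' map, so their colours are simply not drawn in this version.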