There exist publicly accessible data which describe the socio-economic characteristics of a geographic location. In Australia, where I reside, the Government, through the Australian Bureau of Statistics (ABS), regularly collects and publishes individual and household data on income, occupation, education, employment and housing at an area level. Some examples of the published data points include:

- Percentage of individuals on relatively high / low income
- Percentage of individuals classified as managers in their respective occupations
- Percentage of individuals with no formal educational attainment
- Percentage of individuals unemployed
- Percentage of properties with 4 or more bedrooms

Whilst these data points appear to focus heavily on individual people, they reflect people's access to material and social resources, and their ability to take part in society in a specific geographic area, ultimately informing the socio-economic advantage and disadvantage of this area.

Given these data points, is there a technique to derive a score which ranks geographic areas from the most to the least advantaged?

One way to derive a score is to formulate this as a regression problem, where each data point, or feature, is used to predict a target variable, in this scenario a numerical score. This requires the target variable to be available in at least some instances for training the predictive model.

However, as we don't have a target variable to begin with, we have to approach this problem in another way. For instance, under the assumption that each geographic area differs from a socio-economic standpoint, can we instead aim to understand which data points explain the most variation, thereby deriving a score as a numerical combination of those data points?

We can do just that using a technique called Principal Component Analysis (PCA), and this article demonstrates how!

The ABS publishes data points indicating the socio-economic characteristics of a geographic area in the "Data Download" section of this webpage, under the "Standardised Variable Proportions data cube"[1]. These data points are published at the Statistical Area 1 (SA1) level, a digital boundary dividing Australia into areas with populations of roughly 200–800 people each. This is a far more granular digital boundary than the Postcode (Zipcode) or State boundaries.

For the purpose of demonstration in this article, I'll derive a socio-economic score based on 14 of the 44 published data points provided in Table 1 of the data source above (I'll explain why I chose this subset later). These are:

- INC_LOW: Percentage of individuals living in households with stated annual household equivalised income between $1 and $25,999 AUD
- INC_HIGH: Percentage of individuals with stated annual household equivalised income greater than $91,000 AUD
- UNEMPLOYED_IER: Percentage of individuals aged 15 years and over who are unemployed
- HIGHBED: Percentage of occupied private properties with 4 or more bedrooms
- HIGHMORTGAGE: Percentage of occupied private properties paying mortgage greater than $2,800 AUD monthly
- LOWRENT: Percentage of occupied private properties paying rent less than $250 AUD per week
- OWNING: Percentage of occupied private properties without a mortgage
- MORTGAGE: Percentage of occupied private properties with a mortgage
- GROUP: Percentage of occupied private properties which are group occupied private properties (e.g. apartments or units)
- LONE: Percentage of occupied properties which are lone person occupied private properties
- OVERCROWD: Percentage of occupied private properties requiring one or more extra bedrooms (based on the Canadian National Occupancy Standard)
- NOCAR: Percentage of occupied private properties with no cars
- ONEPARENT: Percentage of one-parent families
- UNINCORP: Percentage of properties with at least one person who is a business owner

In this section, I'll step through the Python code for deriving a socio-economic score for an SA1 region in Australia using PCA.

I'll start by loading in the required Python packages and the data.

```python
## Load the required Python packages

### For dataframe operations
import numpy as np
import pandas as pd

### For PCA
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

### For visualization
import matplotlib.pyplot as plt
import seaborn as sns

### For validation
from scipy.stats import pearsonr
```

```python
## Load data
file1 = 'data/standardised_variables_seifa_2021.xlsx'

### Read from Table 1, from row 5 onwards, for columns A to AT
data1 = pd.read_excel(file1, sheet_name = 'Table 1', header = 5,
                      usecols = 'A:AT')

## Remove rows with missing values (113 out of ~60k rows)
data1_dropna = data1.dropna()
```

A crucial cleaning step before performing PCA is to standardise each of the 14 data points (features) to a mean of 0 and a standard deviation of 1. This is primarily to ensure that the loadings assigned to each feature by PCA (think of them as indicators of how important a feature is) are comparable across features. Otherwise, more emphasis, or higher loading, may be given to a feature which is not actually significant, or vice versa.
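To see why this matters, here is a minimal sketch on synthetic toy data (not the ABS dataset): an unstandardised feature on a much larger scale ends up dominating PC1 even though it carries no more signal than the others.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)

# Three equally informative synthetic features, but one on a much larger scale
X = rng.normal(size = (500, 3))
X[:, 0] *= 1000

# Without standardisation, the large-scale feature dominates PC1
share_raw = PCA().fit(X).explained_variance_ratio_[0]

# After standardisation, the variation is shared far more evenly
X_std = (X - X.mean(axis = 0)) / X.std(axis = 0)
share_std = PCA().fit(X_std).explained_variance_ratio_[0]

print(share_raw, share_std)  # ~1.0 versus roughly 1/3
```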

Note that the ABS data source quoted above already has the features standardised. That said, for an unstandardised data source:

```python
## Standardise data for PCA

### Take all but the first column, which is merely a location indicator
data_final = data1_dropna.iloc[:, 1:]

### Perform standardisation of the data
sc = StandardScaler()
sc.fit(data_final)

### Standardised data
data_final = sc.transform(data_final)
```

With the standardised data, PCA can be performed in just a few lines of code:

```python
## Perform PCA
pca = PCA()
pca.fit_transform(data_final)
```

PCA aims to represent the underlying data by Principal Components (PCs). The number of PCs provided in a PCA equals the number of standardised features in the data. In this example, 14 PCs are returned.

Each PC is a linear combination of all the standardised features, differentiated only by its respective loadings on the standardised features. For example, the image below shows the loadings assigned to the first and second PCs (PC1 and PC2) by feature.
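The loadings live in `pca.components_`, whose rows correspond to PCs and columns to features. As an illustrative sketch (with placeholder feature names and random data, not the ABS features), a loadings table like the `df_plot` dataframe used later for the heatmap could be assembled as:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
features = [f'FEAT_{i + 1}' for i in range(14)]  # placeholder feature names
X = rng.normal(size = (300, 14))                 # placeholder standardised data

pca = PCA()
pca.fit(X)

# Rows of pca.components_ are the PCs; transpose so rows become features
df_plot = pd.DataFrame(pca.components_[:2].T,
                       index = features, columns = ['PC1', 'PC2'])
print(df_plot.shape)  # (14, 2)
```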

With 14 PCs, the code below provides a visualization of how much variation each PC explains:

```python
## Create visualization for variation explained by each PC
exp_var_pca = pca.explained_variance_ratio_

plt.bar(range(1, len(exp_var_pca) + 1), exp_var_pca, alpha = 0.7,
        label = '% of Variation Explained', color = 'darkseagreen')
plt.ylabel('Explained Variation')
plt.xlabel('Principal Component')
plt.legend(loc = 'best')
plt.show()
```

As illustrated in the output visualization below, Principal Component 1 (PC1) accounts for the largest proportion of variance in the original dataset, with each subsequent PC explaining less of the variance. To be specific, PC1 explains circa 35% of the variation within the data.

For the purpose of demonstration in this article, PC1 is chosen as the only PC for deriving the socio-economic score, for the following reasons:

- PC1 explains a sufficiently large share of the variation within the data on a relative basis.
- Whilst selecting more PCs potentially allows (marginally) more variation to be explained, it makes interpretation of the score difficult in the context of the socio-economic advantage and disadvantage of a particular geographic area. For instance, as shown in the image below, PC1 and PC2 may provide conflicting narratives as to how a particular feature (e.g. 'INC_LOW') influences the socio-economic variation of a geographic area.
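When weighing up how many PCs to keep, the cumulative share of variation explained is the usual yardstick. A small synthetic sketch (toy data, not the ABS dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# 14 synthetic standardised features, two of which are strongly correlated
X = rng.normal(size = (500, 14))
X[:, 1] += 2 * X[:, 0]
X = (X - X.mean(axis = 0)) / X.std(axis = 0)

cum_var = np.cumsum(PCA().fit(X).explained_variance_ratio_)

# PC1 explains more than an equal 1/14 share; all 14 PCs together explain 100%
print(cum_var[0] > 1 / 14, round(cum_var[-1], 6))  # True 1.0
```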

```python
## Show and compare loadings for PC1 and PC2
### Using the df_plot dataframe per Image 1
sns.heatmap(df_plot, annot = False, fmt = ".1f", cmap = 'summer')
plt.show()
```

To obtain a score for each SA1, we simply multiply the standardised portion of each feature by its PC1 loading and sum. This can be achieved by:

```python
## Obtain raw score based on PC1

### Perform the sum product of the standardised features and the PC1 loadings,
### reversing the sign of the result to make the output more interpretable
pca_data_transformed = -1.0 * pca.fit_transform(data_final)

### Convert to a Pandas dataframe, and join the raw score with the SA1 column
pca1 = pd.DataFrame(pca_data_transformed[:, 0], columns = ['Score_Raw'])
score_SA1 = pd.concat([data1_dropna['SA1_2021'].reset_index(drop = True), pca1],
                      axis = 1)

### Inspect the raw score
score_SA1.head()
```
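As a sanity check on the "sum product" description, the PC1 scores returned by `fit_transform` equal the (mean-centred) standardised features dotted with the PC1 loading vector. A toy sketch with random data (not the ABS dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size = (200, 14))
X = (X - X.mean(axis = 0)) / X.std(axis = 0)  # standardised features

pca = PCA()
scores = pca.fit_transform(X)

# Sum product of the centred features and the PC1 loadings
manual_pc1 = (X - X.mean(axis = 0)) @ pca.components_[0]

print(np.allclose(scores[:, 0], manual_pc1))  # True
```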

The higher the score, the more advantaged an SA1 is in terms of its access to socio-economic resources.

How do we know the score we derived above is even remotely correct?

For context, the ABS actually publishes a socio-economic score called the Index of Economic Resources (IER), defined on the ABS website as:

*"The Index of Economic Resources (IER) focuses on the financial aspects of relative socio-economic advantage and disadvantage, by summarising variables related to income and housing. IER excludes education and occupation variables as they are not direct measures of economic resources. It also excludes assets such as savings or equities which, although relevant, cannot be included as they are not collected in the Census."*

Without disclosing the detailed steps, the ABS stated in their Technical Paper that the IER was derived using the same features (14) and methodology (PCA, PC1 only) as what we have performed above. That is, if we did derive the correct scores, they should be comparable against the IER scores published here ("Statistical Area Level 1, Indexes, SEIFA 2021.xlsx", Table 4).

As the published score is standardised to a mean of 1,000 and a standard deviation of 100, we start the validation by standardising the raw score in the same way:

```python
## Standardise raw scores
### The raw PC1 scores already have a mean of 0, so only scaling is required
score_SA1['IER_recreated'] = (score_SA1['Score_Raw'] /
                              score_SA1['Score_Raw'].std()) * 100 + 1000
```

For comparison, we read in the published IER scores by SA1:

```python
## Read in the ABS published IER scores,
## similarly to how we read in the standardised portion of the features
file2 = 'data/Statistical Area Level 1, Indexes, SEIFA 2021.xlsx'
data2 = pd.read_excel(file2, sheet_name = 'Table 4', header = 5,
                      usecols = 'A:C')

data2.rename(columns = {'2021 Statistical Area Level 1 (SA1)': 'SA1_2021',
                        'Score': 'IER_2021'}, inplace = True)
col_select = ['SA1_2021', 'IER_2021']
data2 = data2[col_select]
ABS_IER_dropna = data2.dropna().reset_index(drop = True)
```

**Validation 1: PC1 Loadings**

As shown in the image below, comparing the PC1 loadings derived above against the PC1 loadings published by the ABS suggests that they differ by a constant factor of -45%. As this is merely a scaling difference, it doesn't impact the derived scores, which are standardised (to a mean of 1,000 and a standard deviation of 100).

(You should be able to verify the 'Derived (A)' column against the PC1 loadings in Image 1.)
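To see numerically why a constant scaling of the loadings washes out, here is a small sketch on random placeholder data (the `-1.0` sign reversal mirrors the scoring step earlier): loadings scaled by -45% produce, after sign correction and standardisation, exactly the same scores.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size = (100, 5))   # placeholder standardised features
w = rng.normal(size = 5)          # one set of PC1-style loadings

raw_a = X @ w                     # scores from the original loadings
raw_b = -1.0 * (X @ (-0.45 * w))  # loadings scaled by -45%, sign reversed

# Standardise both to a mean of 1,000 and a standard deviation of 100
std_a = (raw_a - raw_a.mean()) / raw_a.std() * 100 + 1000
std_b = (raw_b - raw_b.mean()) / raw_b.std() * 100 + 1000

print(np.allclose(std_a, std_b))  # True
```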

**Validation 2: Distribution of Scores**

The code below creates a histogram of both scores, whose shapes look nearly identical.

```python
## Check distribution of scores
score_SA1.hist(column = 'IER_recreated', bins = 100, color = 'darkseagreen')
plt.title('Distribution of recreated IER scores')

ABS_IER_dropna.hist(column = 'IER_2021', bins = 100, color = 'lightskyblue')
plt.title('Distribution of ABS IER scores')
plt.show()
```

**Validation 3: IER Scores by SA1**

As the final validation, let's compare the IER scores by SA1:

```python
## Join the two scores by SA1 for comparison
IER_join = pd.merge(ABS_IER_dropna, score_SA1, how = 'left', on = 'SA1_2021')

## Plot the scores on the x-y axes.
## If the scores are identical, the plot should show a straight line.
plt.scatter('IER_recreated', 'IER_2021', data = IER_join, color = 'darkseagreen')
plt.title('Comparison of recreated and ABS IER scores')
plt.xlabel('Recreated IER score')
plt.ylabel('ABS IER score')
plt.show()
```

A diagonal straight line, as shown in the output image below, supports that the two scores are largely identical.

To add to this, the code below shows the two scores have a correlation close to 1:
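This can be done with `scipy.stats.pearsonr`, imported earlier. A minimal sketch, illustrated here on two nearly identical synthetic score series; in the article it would be applied to the `IER_recreated` and `IER_2021` columns of `IER_join` (after dropping any unmatched rows):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(3)
a = rng.normal(1000, 100, size = 1000)  # synthetic "recreated" scores
b = a + rng.normal(0, 1, size = 1000)   # synthetic "published" scores

# In the article: pearsonr(IER_join['IER_recreated'], IER_join['IER_2021'])
corr, p_value = pearsonr(a, b)
print(corr > 0.99)  # True
```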

The demonstration in this article effectively replicates how the ABS calibrates the IER, one of the four socio-economic indexes it publishes, which can be used to rank the socio-economic status of a geographic area.

Taking a step back, what we've achieved in essence is a reduction of the data's dimension from 14 to 1, losing some of the information conveyed by the data.

Dimensionality reduction techniques such as PCA are also commonly used to reduce high-dimensional spaces, such as text embeddings, to 2–3 (visualizable) Principal Components.
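As a closing sketch, reducing hypothetical high-dimensional embeddings (random placeholders standing in for real text embeddings) to 2 components for plotting takes a single line:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size = (100, 384))  # placeholder "text embeddings"

# Project onto the first 2 Principal Components for visualization
coords_2d = PCA(n_components = 2).fit_transform(embeddings)
print(coords_2d.shape)  # (100, 2)
```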