
A Julia-based approach to constructing a fraud-detection model

This is part 2 of my two-part series on getting started with Julia for applied data science. In the first article, we walked through a few examples of simple data manipulation and exploratory data analysis with Julia. In this blog, we'll carry on with the task of building a fraud detection model to identify fraudulent transactions.
To recap briefly, we used a credit card fraud detection dataset obtained from Kaggle. The dataset contains 30 features, including transaction time, amount, and 28 principal component features obtained with PCA. Below is a screenshot of the first five instances of the dataset, loaded as a dataframe in Julia. Note that the transaction time feature records the elapsed time (in seconds) between the current transaction and the first transaction in the dataset.
Before training the fraud detection model, let's prepare the data for the model to consume. Since the main purpose of this blog is to introduce Julia, we are not going to perform any feature selection or feature synthesis here.
Data splitting
When training a classification model, the data is typically split for training and testing in a stratified manner. The main purpose is to maintain the distribution of the data with respect to the target class variable in both the training and test sets. This is especially necessary when working with an extremely imbalanced dataset. The MLDataUtils package in Julia provides a series of preprocessing functions, including data splitting, label encoding, and feature normalisation. The following code shows how to perform stratified sampling using the stratifiedobs function from MLDataUtils. A random seed can be set so that the same data split can be reproduced.
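A minimal sketch of the split is shown below, assuming the features have already been collected into a matrix X (one row per transaction) and the labels into a vector y; the variable names are my own.

```julia
using MLDataUtils, Random

# Fix the global random seed so the same split can be reproduced.
Random.seed!(42)

# stratifiedobs treats columns as observations by default, so the
# features are transposed on the way in and back out again.
(X_train, y_train), (X_test, y_test) = stratifiedobs((transpose(X), y), p = 0.8)
X_train = transpose(X_train)
X_test  = transpose(X_test)
```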
The usage of the stratifiedobs function is quite similar to that of the train_test_split function from the sklearn library in Python. Take note that the input features X have to be transposed twice to restore the original dimensions of the dataset; this is because stratifiedobs treats the last array dimension (columns) as observations by default, whereas our dataset stores one observation per row. This can be confusing for a Julia novice like me, and I'm not sure why the author of MLDataUtils designed the function this way.
The equivalent Python sklearn implementation is as follows.
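A sketch with the same split ratio and seed as above:

```python
from sklearn.model_selection import train_test_split

# stratify=y preserves the class distribution in both splits,
# mirroring stratifiedobs in the Julia example above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```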
Feature scaling
As a recommended practice in machine learning, feature scaling brings the features onto the same or similar ranges of values or distributions. Feature scaling helps improve the speed of convergence when training neural networks, and also prevents any individual feature from dominating during training.
Although we are not training a neural network model in this work, I'd still like to find out how feature scaling can be performed in Julia. Unfortunately, I couldn't find a Julia library that provides both fitting a scaler and transforming the features. The feature normalisation functions provided in the MLDataUtils package allow users to derive the mean and standard deviation of the features, but they cannot be easily applied to the training / test datasets to transform the features. Since the mean and standard deviation of the features can be easily calculated in Julia, we can implement standard scaling manually.
The following code creates a copy of X_train and X_test, then calculates the mean and standard deviation of each feature in a loop and uses them to transform both sets.
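A minimal sketch of the manual standard scaling, assuming X_train and X_test are Float64 matrices with one feature per column; only training-set statistics are used, to avoid leaking information from the test set.

```julia
using Statistics

# Work on copies so the original splits stay untouched.
X_train_scaled = copy(X_train)
X_test_scaled  = copy(X_test)

# Standardise each feature with the training-set mean and standard
# deviation, then apply the same transformation to the test set.
for i in 1:size(X_train, 2)
    μ = mean(X_train[:, i])
    σ = std(X_train[:, i])
    X_train_scaled[:, i] = (X_train[:, i] .- μ) ./ σ
    X_test_scaled[:, i]  = (X_test[:, i] .- μ) ./ σ
end
```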
The transformed and original features are shown below.
In Python, sklearn provides various options for feature scaling, including normalisation and standardisation. Once a feature scaler is declared, the scaling can be done with two lines of code. The following code gives an example of using a RobustScaler.
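A sketch of the sklearn version; RobustScaler scales each feature using its median and interquartile range, which is less sensitive to the extreme values common in fraud data.

```python
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the fitted statistics
```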
Oversampling (with PyCall)
A fraud detection dataset is typically severely imbalanced; for instance, the ratio of negative to positive examples in our dataset is above 500:1. Since obtaining more data points is not possible, and undersampling would lead to a huge loss of data points from the majority class, oversampling becomes the best option in this case. Here I apply the popular SMOTE method to create synthetic examples for the positive class.
Currently, there is no working Julia library that provides an implementation of SMOTE. The ClassImbalance package has not been maintained for two years and cannot be used with recent versions of Julia. Fortunately, Julia allows us to call ready-to-use Python packages through a wrapper library called PyCall.
To import a Python library into Julia, we need to install PyCall and point it at a Python executable via the PYTHON environment variable. I tried to create a Python virtual environment here, but it didn't work out: for some reason, Julia could not recognise the Python path of the virtual environment, so I had to specify the system default Python path instead. After this, we can import the Python implementation of SMOTE, which is provided in the imbalanced-learn library. The pyimport function provided by PyCall can be used to import Python libraries in Julia. The following code shows how to activate PyCall and ask Python for help from a Julia kernel.
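A sketch of the whole round trip; the Python path is hypothetical and should point to an installation that has imbalanced-learn available. Note that PyCall needs to be rebuilt (and Julia restarted) after the PYTHON variable changes.

```julia
# Point PyCall at the system Python (hypothetical path) and rebuild.
ENV["PYTHON"] = "/usr/bin/python3"
using Pkg
Pkg.build("PyCall")

# After restarting Julia, import SMOTE from imbalanced-learn.
using PyCall
over_sampling = pyimport("imblearn.over_sampling")
smote = over_sampling.SMOTE(random_state = 42)

# fit_resample returns the oversampled features and labels.
X_train_res, y_train_res = smote.fit_resample(X_train_scaled, y_train)
```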
The equivalent Python implementation is as follows; we can see that the fit_resample function is used in exactly the same way as in Julia.
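The same step in native Python, for comparison:

```python
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train_scaled, y_train)
```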
Model training
Now we reach the stage of model training. We will be training a binary classifier, which can be done with a variety of ML algorithms, including logistic regression, decision trees, and neural networks. Currently, the resources for ML in Julia are distributed across multiple Julia libraries; a few of the most popular options, with their specialised sets of models, include GLM.jl (linear and logistic regression), DecisionTree.jl (decision trees and random forests), Flux.jl (neural networks), and XGBoost.jl (gradient-boosted trees).
Here I'm going to choose XGBoost, considering its simplicity and its strong performance on traditional regression and classification problems. The process of training an XGBoost model in Julia is much the same as in Python, albeit with some minor differences in syntax.
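A sketch based on XGBoost.jl's documented interface (keyword names can differ between versions of the package); @time reports the training time used in the comparison below.

```julia
using XGBoost

# Train a binary classifier on the oversampled training data.
@time bst = xgboost(
    (X_train_res, y_train_res);
    num_round = 1000,               # number of boosting rounds (estimators)
    eta = 0.1,                      # learning rate
    objective = "binary:logistic",  # predict probabilities for binary labels
)

# Threshold the predicted probabilities at 0.5 to obtain class labels.
y_prob = predict(bst, X_test_scaled)
y_pred = Int.(y_prob .>= 0.5)
```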
The equivalent Python implementation is as follows.
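A sketch with the same hyperparameters on the Python side:

```python
import time
from xgboost import XGBClassifier

model = XGBClassifier(n_estimators=1000, learning_rate=0.1)

start = time.time()
model.fit(X_train_res, y_train_res)
print(f"training time: {time.time() - start:.1f}s")

y_pred = model.predict(X_test_scaled)
```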
Model evaluation
Finally, let's look at how our model performs in terms of the precision and recall obtained on the test data, as well as the time spent training the model. In Julia, the precision and recall metrics can be calculated using the EvalMetrics library; the MLJBase package can serve the same purpose.
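A minimal sketch with EvalMetrics.jl, assuming the thresholded predictions y_pred from the previous step; the function names follow the package's binary-classification API.

```julia
using EvalMetrics

# Build a binary confusion matrix from the true labels and the
# predictions, then derive precision and recall from it.
cm = ConfusionMatrix(y_test, y_pred)
println("precision = ", precision(cm))
println("recall    = ", recall(cm))
```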
In Python, we can use sklearn to calculate the metrics.
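And with sklearn:

```python
from sklearn.metrics import precision_score, recall_score

print("precision =", precision_score(y_test, y_pred))
print("recall    =", recall_score(y_test, y_pred))
```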
So which is the winner between Julia and Python? To make a fair comparison, the two models were both trained with the default hyperparameters, apart from a learning rate of 0.1 and 1,000 estimators. The performance metrics are summarised in the following table.
It can be observed that the Julia model achieves better precision and recall, with a slightly longer training time. The XGBoost library used for training the Python model is written in C++ under the hood, whereas the Julia XGBoost library is written in Julia, so Julia really does run about as fast as C++, just as it claims!
The hardware used for the aforementioned test: 11th Gen Intel® Core™ i7-1165G7 @ 2.80GHz, 4 cores.
The Jupyter notebook can be found on GitHub.
I'd like to end this series with a summary of the Julia libraries mentioned for the various data science tasks: MLDataUtils for data splitting and preprocessing, PyCall for calling Python libraries such as imbalanced-learn, XGBoost for model training, and EvalMetrics / MLJBase for evaluation metrics.
Due to the lack of community support, the usability of Julia cannot be compared to that of Python at the moment. Nonetheless, given its superior performance, Julia still has great potential for the future.