Predicting Fraud with Autoencoders and Keras

January 30, 2024

Overview

In this post we will train an autoencoder to detect credit card fraud. We will also demonstrate how to train Keras models in the cloud using CloudML.

The basis of our model will be the Kaggle Credit Card Fraud Detection dataset, which was collected during a research collaboration of Worldline and the Machine Learning Group of ULB (Université Libre de Bruxelles) on big data mining and fraud detection.

The dataset contains credit card transactions by European cardholders made over a two day period in September 2013. There are 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for only 0.172% of all transactions.

Reading the data

After downloading the data from Kaggle, you can read it into R with read_csv():

library(readr)
df <- read_csv("data-raw/creditcard.csv", col_types = list(Time = col_number()))

The input variables consist of only numerical values which are the result of a PCA transformation. In order to preserve confidentiality, no more information about the original features was provided. The features V1, …, V28 were obtained with PCA. There are however 2 features (Time and Amount) that were not transformed.
Time is the number of seconds elapsed between each transaction and the first transaction in the dataset. Amount is the transaction amount and could be used for cost-sensitive learning. The Class variable takes value 1 in case of fraud and 0 otherwise.

Autoencoders

Since only 0.172% of the observations are frauds, we have a highly unbalanced classification problem. With this kind of problem, traditional classification approaches usually don’t work very well because we have only a very small sample of the rarer class.

An autoencoder is a neural network that is used to learn a representation (encoding) for a set of data, typically for the purpose of dimensionality reduction. For this problem we will train an autoencoder to encode non-fraud observations from our training set. Since frauds are supposed to have a different distribution than normal transactions, we expect that our autoencoder will have higher reconstruction errors on frauds than on normal transactions. This means that we can use the reconstruction error as a quantity that indicates whether a transaction is fraudulent or not.
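To make the reconstruction-error idea concrete, here is a minimal sketch on made-up data. It swaps the neural network for PCA, which plays the role of a linear encoder and decoder, so it illustrates the principle only and is not the model trained in this post. The toy columns, the single retained component and the helper reconstruction_error() are all assumptions made for the example.

```r
set.seed(1)

# Toy "normal" data: three columns that move together (hypothetical).
n <- 500
z <- rnorm(n)
normal <- cbind(a = z,
                b = 2 * z + rnorm(n, sd = 0.1),
                c = -z + rnorm(n, sd = 0.1))

# Toy "anomalies": the columns no longer follow that structure.
anomalies <- cbind(a = rnorm(20), b = rnorm(20), c = rnorm(20))

# "Train" on normal observations only: keep one principal component.
pca <- prcomp(normal, center = TRUE, scale. = FALSE)
W <- pca$rotation[, 1, drop = FALSE]

# Encode to 1 dimension, decode back to 3, and measure the squared error per row.
reconstruction_error <- function(x) {
  centered <- sweep(x, 2, pca$center)
  reconstructed <- centered %*% W %*% t(W)
  rowSums((centered - reconstructed)^2)
}

err_normal <- reconstruction_error(normal)
err_anomalies <- reconstruction_error(anomalies)

mean(err_normal)
mean(err_anomalies)
```

Because the encoder only ever saw the structure of the normal rows, rows that break that structure are reconstructed poorly, which is the property the autoencoder in this post relies on.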

If you want to learn more about autoencoders, a good starting point is this video from Larochelle on YouTube and Chapter 14 from the Deep Learning book by Goodfellow et al.

Visualization

For an autoencoder to work well we have a strong initial assumption: that the distribution of variables for normal transactions is different from the distribution for fraudulent ones. Let’s make some plots to verify this. Variables were transformed to a [0,1] interval for plotting.

We can see that the distributions of variables for fraudulent transactions are very different from those for normal ones, except for the Time variable, which seems to have exactly the same distribution.

Preprocessing

Before the modeling steps we need to do some preprocessing. We will split the dataset into train and test sets and then we will min-max normalize our data (this is done because neural networks work much better with small input values). We will also remove the Time variable as it has exactly the same distribution for normal and fraudulent transactions.

Based on the Time variable we will use the first 200,000 observations for training and the rest for testing. This is good practice because when using the model we want to predict future frauds based on transactions that happened before.
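A time-ordered split of this kind could be sketched as follows in base R. The tiny data frame and the names df_toy, n_train, df_train_toy and df_test_toy are made up for illustration; on the real data the same pattern with n_train = 200000 would give the training and test frames used in the rest of the post.

```r
# Toy stand-in for the credit card data (column names follow the dataset).
df_toy <- data.frame(
  Time   = c(30, 0, 10, 50, 20, 40),
  Amount = c(5, 12, 7, 3, 9, 1),
  Class  = c(0, 0, 1, 0, 0, 1)
)
n_train <- 4  # the post uses the first 200,000 observations

# Order by Time so that "first" means "earliest".
df_sorted <- df_toy[order(df_toy$Time), ]
train_idx <- seq_len(n_train)

# Earliest rows go to training, later rows to testing; Time itself is dropped.
keep_cols <- names(df_sorted) != "Time"
df_train_toy <- df_sorted[train_idx, keep_cols]
df_test_toy  <- df_sorted[-train_idx, keep_cols]

df_train_toy
df_test_toy
```

Splitting on time instead of at random keeps every test transaction later than every training transaction, which mirrors how the model would be used.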

Now let’s work on normalization of inputs. We created 2 functions to help us. The first one gets descriptive statistics about the dataset that are used for scaling. Then we have a function to perform the min-max scaling. It’s important to note that we applied the same normalization constants for training and test sets.

library(purrr)

#' Gets descriptive statistics for every variable in the dataset.
get_desc <- function(x) {
  map(x, ~list(
    min = min(.x),
    max = max(.x),
    mean = mean(.x),
    sd = sd(.x)
  ))
} 

#' Given a dataset and normalization constants it will create a min-max normalized
#' version of the dataset.
normalization_minmax <- function(x, desc) {
  map2_dfc(x, desc, ~(.x - .y$min)/(.y$max - .y$min))
}

Now let’s create normalized versions of our datasets. We also transformed our data frames to matrices since this is the format expected by Keras.
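The key point, reusing the training-set constants on the test set, can be illustrated with a small base R sketch. The data frames and the helper minmax_with() are hypothetical stand-ins written for this example; with the helpers defined above, the equivalent step would presumably pass the output of get_desc() on the training data to normalization_minmax() for both sets and then call as.matrix().

```r
# Toy training and test sets (made-up values).
train <- data.frame(V1 = c(0, 5, 10), Amount = c(100, 200, 300))
test  <- data.frame(V1 = c(5, 20),    Amount = c(150, 50))

# Normalization constants are computed on the training set only.
mins <- sapply(train, min)
maxs <- sapply(train, max)

# Min-max scale each column with the supplied constants and return a matrix.
minmax_with <- function(x, mins, maxs) {
  m <- as.matrix(x)
  m <- sweep(m, 2, mins, "-")
  sweep(m, 2, maxs - mins, "/")
}

x_train_toy <- minmax_with(train, mins, maxs)
x_test_toy  <- minmax_with(test, mins, maxs)

x_train_toy
x_test_toy
```

Note that test values can land outside [0, 1] when they fall outside the range seen in training; that is expected when the constants come from the training set.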

We will now define our model in Keras, a symmetric autoencoder with 4 dense layers.

library(keras)
model <- keras_model_sequential()
model %>%
  layer_dense(units = 15, activation = "tanh", input_shape = ncol(x_train)) %>%
  layer_dense(units = 10, activation = "tanh") %>%
  layer_dense(units = 15, activation = "tanh") %>%
  layer_dense(units = ncol(x_train))

summary(model)

___________________________________________________________________________________
Layer (type)                         Output Shape                     Param #      
===================================================================================
dense_1 (Dense)                      (None, 15)                       450          
___________________________________________________________________________________
dense_2 (Dense)                      (None, 10)                       160          
___________________________________________________________________________________
dense_3 (Dense)                      (None, 15)                       165          
___________________________________________________________________________________
dense_4 (Dense)                      (None, 29)                       464          
===================================================================================
Total params: 1,239
Trainable params: 1,239
Non-trainable params: 0
___________________________________________________________________________________
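The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-output pair plus one bias per unit. The short sketch below redoes that arithmetic; the variable names are made up, and the 29 inputs are read off the output shape of the last layer in the summary.

```r
n_inputs <- 29                           # width of the input and of the final layer
layer_units <- c(15, 10, 15, n_inputs)   # units in dense_1 .. dense_4

# Each layer's input size is the previous layer's number of units.
inputs_per_layer <- c(n_inputs, head(layer_units, -1))

# weights (inputs * units) + biases (units)
params <- inputs_per_layer * layer_units + layer_units

params       # 450 160 165 464
sum(params)  # 1239
```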

We will then compile our model, using the mean squared error loss and the Adam optimizer for training.

model %>% compile(
  loss = "mean_squared_error", 
  optimizer = "adam"
)

Training the model

We can now train our model using the fit() function. Training the model is reasonably fast (~ 14s per epoch on my laptop). We will only feed to our model the observations of normal (non-fraudulent) transactions.

We will use callback_model_checkpoint() in order to save our model after each epoch. By passing the argument save_best_only = TRUE we will keep on disk only the epoch with the smallest loss value on the test set.
We will also use callback_early_stopping() to stop training if the validation loss stops decreasing for 5 epochs.

checkpoint <- callback_model_checkpoint(
  filepath = "model.hdf5", 
  save_best_only = TRUE, 
  period = 1,
  verbose = 1
)

early_stopping <- callback_early_stopping(patience = 5)

model %>% fit(
  x = x_train[y_train == 0,], 
  y = x_train[y_train == 0,], 
  epochs = 100, 
  batch_size = 32,
  validation_data = list(x_test[y_test == 0,], x_test[y_test == 0,]), 
  callbacks = list(checkpoint, early_stopping)
)

Train on 199615 samples, validate on 84700 samples
Epoch 1/100
199615/199615 [==============================] - 17s 83us/step - loss: 0.0036 - val_loss: 6.8522e-04
val_loss improved from inf to 0.00069, saving model to model.hdf5
Epoch 2/100
199615/199615 [==============================] - 17s 86us/step - loss: 4.7817e-04 - val_loss: 4.7266e-04
val_loss improved from 0.00069 to 0.00047, saving model to model.hdf5
Epoch 3/100
199615/199615 [==============================] - 19s 94us/step - loss: 3.7753e-04 - val_loss: 4.2430e-04
val_loss improved from 0.00047 to 0.00042, saving model to model.hdf5
Epoch 4/100
199615/199615 [==============================] - 19s 94us/step - loss: 3.3937e-04 - val_loss: 4.0299e-04
val_loss improved from 0.00042 to 0.00040, saving model to model.hdf5
Epoch 5/100
199615/199615 [==============================] - 19s 94us/step - loss: 3.2259e-04 - val_loss: 4.0852e-04
val_loss did not improve
Epoch 6/100
199615/199615 [==============================] - 18s 91us/step - loss: 3.1668e-04 - val_loss: 4.0746e-04
val_loss did not improve
...

After training we can get the final loss for the test set by using the evaluate() function.

loss <- evaluate(model, x = x_test[y_test == 0,], y = x_test[y_test == 0,])
loss

        loss 
0.0003534254

Tuning with CloudML

We may be able to get better results by tuning our model hyperparameters. We can tune, for example, the normalization function, the learning rate, the activation functions and the size of hidden layers. CloudML uses Bayesian optimization to tune hyperparameters of models as described in this blog post.

We can use the cloudml package to tune our model, but first we need to prepare our project by creating a training flag for each hyperparameter and a tuning.yml file that will tell CloudML what parameters we want to tune and how.

The full script used for training on CloudML can be found at https://github.com/dfalbel/fraud-autoencoder-example. The most important modifications to the code were adding the training flags:

FLAGS <- flags(
  flag_string("normalization", "minmax", "One of minmax, zscore"),
  flag_string("activation", "relu", "One of relu, selu, tanh, sigmoid"),
  flag_numeric("learning_rate", 0.001, "Optimizer Learning Rate"),
  flag_integer("hidden_size", 15, "The hidden layer size")
)

We then used the FLAGS variable inside the script to drive the hyperparameters of the model, for example:

model %>% compile(
  optimizer = optimizer_adam(lr = FLAGS$learning_rate), 
  loss = 'mean_squared_error'
)

We also created a tuning.yml file describing how hyperparameters should be varied during training, as well as what metric we wanted to optimize (in this case it was the validation loss: val_loss).

tuning.yml

trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  hyperparameters:
    goal: MINIMIZE
    hyperparameterMetricTag: val_loss
    maxTrials: 10
    maxParallelTrials: 5
    params:
      - parameterName: normalization
        type: CATEGORICAL
        categoricalValues: [zscore, minmax]
      - parameterName: activation
        type: CATEGORICAL
        categoricalValues: [relu, selu, tanh, sigmoid]
      - parameterName: learning_rate
        type: DOUBLE
        minValue: 0.000001
        maxValue: 0.1
        scaleType: UNIT_LOG_SCALE
      - parameterName: hidden_size
        type: INTEGER
        minValue: 5
        maxValue: 50
        scaleType: UNIT_LINEAR_SCALE

We describe the type of machine we want to use (in this case a standard_gpu instance), the metric we want to minimize while tuning, and the maximum number of trials (i.e. the number of combinations of hyperparameters we want to test). We then specify how we want to vary each hyperparameter during tuning.

You can learn more about the tuning.yml file in the TensorFlow for R documentation and in Google’s official documentation on CloudML.

Now we are ready to send the job to Google CloudML. We can do this by running:

library(cloudml)
cloudml_train("train.R", config = "tuning.yml")

The cloudml package takes care of uploading the dataset and installing any R package dependencies required to run the script on CloudML. If you are using RStudio v1.1 or higher, it will also allow you to monitor your job in a background terminal. You can also monitor your job using the Google Cloud Console.

After the job is finished we can collect the job results with:
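The call itself is missing from this copy of the post. The cloudml package provides job_collect() for this purpose, so the step presumably looked like the sketch below; calling it with no arguments, so that it targets the most recent job, is an assumption.

```r
library(cloudml)

# Collect the results of the most recently submitted job
# (assumes the tuning job above has finished).
job_collect()
```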

This will copy the files from the job with the best val_loss performance on CloudML to your local system and open a report summarizing the training run.

Since we used a callback to save model checkpoints during training, the model file was also copied from Google CloudML. Files created during training are copied to the “runs” subdirectory of the working directory from which cloudml_train() is called. You can determine this directory for the most recent run with:
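The command that produced the output below is missing from this copy of the post. The tfruns package offers latest_run(), whose result carries a run_dir field, so the lookup was presumably along these lines (an assumption, not a quote from the original):

```r
library(tfruns)

# Directory of the most recent run under "runs/".
latest_run()$run_dir
```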

[1] runs/cloudml_2018_01_23_221244595-03

You can also list all previous runs and their validation losses with:

ls_runs(order = metric_val_loss, decreasing = FALSE)

                    run_dir metric_loss metric_val_loss
1 runs/2017-12-09T21-01-11Z      0.2577          0.1482
2 runs/2017-12-09T21-00-11Z      0.2655          0.1505
3 runs/2017-12-09T19-59-44Z      0.2597          0.1402
4 runs/2017-12-09T19-56-48Z      0.2610          0.1459

Use View(ls_runs()) to view all columns

In our case the job downloaded from CloudML was saved to runs/cloudml_2018_01_23_221244595-03/, so the saved model file is available at runs/cloudml_2018_01_23_221244595-03/model.hdf5. We can now use our tuned model to make predictions.

Making predictions

Now that we have trained and tuned our model we are ready to generate predictions with our autoencoder. We are interested in the MSE for each observation and we expect that observations of fraudulent transactions will have higher MSEs.

First, let’s load our model.

model <- load_model_hdf5("runs/cloudml_2018_01_23_221244595-03/model.hdf5", 
                         compile = FALSE)

Now let’s calculate the MSE for the training and test set observations.

# Row-wise sum of squared reconstruction errors (proportional to the per-row MSE).
pred_train <- predict(model, x_train)
mse_train <- apply((x_train - pred_train)^2, 1, sum)

pred_test <- predict(model, x_test)
mse_test <- apply((x_test - pred_test)^2, 1, sum)

A good measure of model performance in highly unbalanced datasets is the Area Under the ROC Curve (AUC). AUC has a nice interpretation for this problem: it’s the probability that a fraudulent transaction will have a higher MSE than a normal one. We can calculate this using the Metrics package, which implements a wide variety of common machine learning model performance metrics.
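The code for this step is missing from this copy of the post. With the Metrics package the calls would presumably be auc(y_train, mse_train) and auc(y_test, mse_test), since Metrics::auc() takes the actual labels followed by the predicted scores. To show what that number means, the sketch below computes the same rank-based quantity in base R on four made-up observations; auc_rank(), y_toy and mse_toy are hypothetical names.

```r
# Rank-based AUC: the share of (fraud, normal) pairs in which the fraud
# has the higher score. Ties receive average ranks.
auc_rank <- function(actual, score) {
  r <- rank(score)
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

y_toy   <- c(0, 0, 1, 1)             # 1 = fraud
mse_toy <- c(0.1, 0.4, 0.35, 0.8)    # reconstruction errors

# 3 of the 4 (fraud, normal) pairs are ordered correctly.
auc_rank(y_toy, mse_toy)  # 0.75
```

On the real data, the post reports the following values for the training and test sets: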

[1] 0.9546814
[1] 0.9403554

To use the model in practice for making predictions we need to find a threshold k for the MSE: if MSE > k we consider that transaction a fraud (otherwise we consider it normal). To define this value it’s useful to look at precision and recall while varying the threshold k.

library(ggplot2)

possible_k <- seq(0, 0.5, length.out = 100)

# Precision: share of transactions flagged as fraud that really are frauds.
precision <- sapply(possible_k, function(k) {
  predicted_class <- as.numeric(mse_test > k)
  sum(predicted_class == 1 & y_test == 1)/sum(predicted_class)
})

# The `+` has to end the line, otherwise R treats the first line as complete.
qplot(possible_k, precision, geom = "line") +
  labs(x = "Threshold", y = "Precision")

# Recall: share of actual frauds that get flagged.
recall <- sapply(possible_k, function(k) {
  predicted_class <- as.numeric(mse_test > k)
  sum(predicted_class == 1 & y_test == 1)/sum(y_test)
})
qplot(possible_k, recall, geom = "line") +
  labs(x = "Threshold", y = "Recall")

A good starting point would be to choose the threshold with maximum precision, but we could also base our decision on how much money we might lose from fraudulent transactions.

Suppose each manual verification of fraud costs us $1, but if we don’t verify a transaction and it is a fraud we will lose the transaction amount. Let’s find for each threshold value how much money we would lose.

cost_per_verification <- 1

lost_money <- sapply(possible_k, function(k) {
  predicted_class <- as.numeric(mse_test > k)
  sum(cost_per_verification * predicted_class + (predicted_class == 0) * y_test * df_test$Amount) 
})

qplot(possible_k, lost_money, geom = "line") + labs(x = "Threshold", y = "Lost Money")

We can find the best threshold in this case with:
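The command is missing from this copy of the post; it was presumably something like possible_k[which.min(lost_money)], which picks the threshold with the lowest total cost. The sketch below runs the same selection on six made-up transactions so the mechanics are visible; every value and every name ending in _toy is hypothetical.

```r
# Toy reconstruction errors, labels (1 = fraud) and transaction amounts.
mse_toy    <- c(0.001, 0.002, 0.003, 0.21, 0.004, 0.33)
fraud_toy  <- c(0, 0, 0, 1, 0, 1)
amount_toy <- c(10, 20, 15, 500, 30, 250)
cost_per_verification <- 1

thresholds_toy <- seq(0, 0.5, length.out = 11)  # 0, 0.05, ..., 0.5

# Cost at threshold k: $1 per flagged transaction plus the amount of every missed fraud.
lost_money_toy <- sapply(thresholds_toy, function(k) {
  flagged <- as.numeric(mse_toy > k)
  sum(cost_per_verification * flagged + (flagged == 0) * fraud_toy * amount_toy)
})

# which.min() returns the first threshold that reaches the minimum cost.
best_k_toy <- thresholds_toy[which.min(lost_money_toy)]
best_k_toy
```

In the toy data a threshold of 0 flags everything and costs $6, while a slightly higher one flags only the two frauds and costs $2. On the real test set, the post reports: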

[1] 0.005050505

If we needed to manually verify all frauds, it would cost us ~$13,000. Using our model we can reduce this to ~$2,500.
