Fraud Detection with Entity Resolution and Graph Neural Networks

A practical guide to how entity resolution improves machine learning to detect fraud

Representation of a Graph Neural Network (Image generated by the Author using Bing Image Creator)

Online fraud is an ever-growing issue for finance, e-commerce and other related industries. In response to this threat, organizations use fraud detection mechanisms based on machine learning and behavioral analytics. These technologies enable the detection of suspicious patterns, abnormal behaviors, and fraudulent activities in real time.

Unfortunately, often only the current transaction, e.g. an order, is considered, or the process is based solely on historic data from the customer's profile, which is identified by a customer id. However, skilled fraudsters may create customer profiles using low-value transactions to build up a positive image of their profile. Moreover, they may create multiple similar profiles at the same time. Only after the fraud has taken place does the attacked company realize that these customer profiles were related to each other.

Using entity resolution it is possible to easily combine different customer profiles into a single 360° customer view, allowing one to see the full picture of all historic transactions. While using this data in machine learning, e.g. in a neural network or even a simple linear regression, would already provide additional value for the resulting model, the real value arises from also looking at how the individual transactions are connected to each other. This is where graph neural networks (GNN) come into play. Besides looking at features extracted from the transactional records, they also offer the possibility to look at features generated from the graph edges (how transactions are linked with each other) or even just the overall layout of the entity graph.

Before we dive deeper into the details, I have one disclaimer to place here: I am a developer and entity resolution expert, not a data scientist or ML expert. While I think the general approach is correct, I might not be following best practices, nor can I explain certain aspects such as the number of hidden nodes. Use this article as an inspiration and draw upon your own experience when it comes to the GNN layout or configuration.

Example Data

For the purposes of this article I want to focus on the insights gained from the entity graph's layout. For this purpose I created a small Golang script that generates entities. Each entity is labeled as either fraudulent or non-fraudulent and consists of records (orders) and edges (how those orders are linked). See the following example of a single entity:

{
  "fraud":1,
  "records":[
    {
      "id":0,
      "totalValue":85,
      "items":2
    },
    {
      "id":1,
      "totalValue":31,
      "items":4
    },
    {
      "id":2,
      "totalValue":20,
      "items":9
    }
  ],
  "edges":[
    {
      "a":1,
      "b":0,
      "R1":1,
      "R2":1
    },
    {
      "a":2,
      "b":1,
      "R1":0,
      "R2":1
    }
  ]
}

Each record has two (potential) features, the total value and the number of items purchased. However, the generation script completely randomized these values, hence they shouldn't provide any value when it comes to guessing the fraud label. Each edge also comes with two features R1 and R2. These could e.g. represent whether the two records A and B are linked via a similar name and address (R1) or via a similar email address (R2). Moreover, I intentionally left out all the attributes that are not relevant for this example (name, address, email, phone number, etc.), but are usually relevant for the entity resolution process beforehand. As R1 and R2 are also randomized, they don't provide any value for the GNN either. However, based on the fraud label, the edges are laid out in two possible ways: a star-like layout (fraud=0) or a random layout (fraud=1).

The idea is that a non-fraudulent customer is more likely to provide accurate, matching data, usually the same address and the same name, with only a few spelling errors here and there. Hence new transactions would get recognized as duplicates.

Deduplicated Entity (Image by the Author)

A fraudulent customer might want to hide the fact that they are still the same person behind the computer, using various names and addresses. However, entity resolution tools may still recognize the similarity (e.g. geographical and temporal similarity, recurring patterns in the email address, device IDs, etc.), but the entity graph may look more complex.

Complex, Possibly Fraudulent Entity (Image by the Author)

To make it a bit less trivial, the generation script also has a 5% error rate, meaning that some entities are labeled as fraudulent even though they have a star-like layout, and some are labeled as non-fraudulent despite a random layout. Also there are some cases where the data is insufficient to determine the actual layout (e.g. only one or two records), as in the following example:

{
  "fraud":1,
  "records":[
    {
      "id":0,
      "totalValue":85,
      "items":5
    }
  ],
  "edges":[]
}
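For illustration, here is a minimal Python sketch of the layout logic just described. The actual generator is the Golang script mentioned above; the function name is hypothetical and the 5% label-error step is omitted.

import random

# Hypothetical sketch: star-like edges for non-fraudulent entities,
# random attachment for fraudulent ones. R1/R2 carry no signal.
def make_edges(num_records, fraud):
    edges = []
    for i in range(1, num_records):
        # fraud=1: attach to any earlier record; fraud=0: always to record 0
        b = random.randint(0, i - 1) if fraud else 0
        edges.append({
            "a": i,
            "b": b,
            "R1": random.randint(0, 1),
            "R2": random.randint(0, 1),
        })
    return edges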

In reality you would most likely gain valuable insights from all three kinds of features (record attributes, edge attributes and edge layout). The following code examples take this into account, but the generated data doesn't.

The example uses Python (apart from the data generation) and DGL with a PyTorch backend. You can find the full Jupyter notebook, the data and the generation script on GitHub.

Creating the Dataset

Let's start by importing the dataset:

import os

os.environ["DGLBACKEND"] = "pytorch"
import pandas as pd
import torch
import dgl
from dgl.data import DGLDataset

class EntitiesDataset(DGLDataset):
    def __init__(self, entitiesFile):
        self.entitiesFile = entitiesFile
        super().__init__(name="entities")

    def process(self):
        # Each line of the JSON-lines file represents one entity
        entities = pd.read_json(self.entitiesFile, lines=1)

        self.graphs = []
        self.labels = []

        for _, entity in entities.iterrows():
            a = []
            b = []
            r1_feat = []
            r2_feat = []
            for edge in entity["edges"]:
                a.append(edge["a"])
                b.append(edge["b"])
                r1_feat.append(edge["R1"])
                r2_feat.append(edge["R2"])
            a = torch.LongTensor(a)
            b = torch.LongTensor(b)
            # One row of [R1, R2] per edge
            edge_features = torch.LongTensor([r1_feat, r2_feat]).t()

            # One row of [totalValue, items] per record
            node_feat = [
                [node["totalValue"], node["items"]] for node in entity["records"]
            ]
            node_features = torch.tensor(node_feat)

            g = dgl.graph((a, b), num_nodes=len(entity["records"]))
            g.edata["feat"] = edge_features
            g.ndata["feat"] = node_features
            g = dgl.add_self_loop(g)

            self.graphs.append(g)
            self.labels.append(entity["fraud"])

        self.labels = torch.LongTensor(self.labels)

    def __getitem__(self, i):
        return self.graphs[i], self.labels[i]

    def __len__(self):
        return len(self.graphs)

dataset = EntitiesDataset("./entities.jsonl")
print(dataset)
print(dataset[0])

This processes the entities file, which is a JSON-lines file where each row represents a single entity. While iterating over each entity, it generates the edge features (a long tensor with shape [e, 2], e = number of edges) and the node features (a long tensor with shape [n, 2], n = number of nodes). It then proceeds to build the graph based on a and b (long tensors, each with shape [e]) and assigns the edge and node features to that graph. All resulting graphs are then added to the dataset.
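As a quick sanity check (my own addition, assuming the cells above have been run), you can inspect the shapes of the first graph. Note that the self-loops added in process() append zero-valued rows to the edge features:

g, label = dataset[0]
print(g.ndata["feat"].shape)  # [n, 2] node features
print(g.edata["feat"].shape)  # [e + n, 2] edge features, incl. zero rows for self-loops
print(label)                  # 1 = fraudulent, 0 = non-fraudulent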

Model Architecture

Now that we have the data ready, we need to think about the architecture of our GNN. This is what I came up with, but it probably can be adjusted much more to the actual needs:

import torch.nn as nn
import torch.nn.functional as F
from dgl.nn import NNConv, SAGEConv

class EntityGraphModule(nn.Module):
    def __init__(self, node_in_feats, edge_in_feats, h_feats, num_classes):
        super(EntityGraphModule, self).__init__()
        # Maps each edge's features to a weight matrix consumed by NNConv
        lin = nn.Linear(edge_in_feats, node_in_feats * h_feats)
        edge_func = lambda e_feat: lin(e_feat)
        self.conv1 = NNConv(node_in_feats, h_feats, edge_func)

        self.conv2 = SAGEConv(h_feats, num_classes, "pool")

    def forward(self, g, node_features, edge_features):
        h = self.conv1(g, node_features, edge_features)
        h = F.relu(h)
        h = self.conv2(g, h)
        # Average the per-node results into a single prediction per graph
        g.ndata["h"] = h
        return dgl.mean_nodes(g, "h")

The constructor takes the number of node features, the number of edge features, the number of hidden nodes and the number of labels (classes). It then creates two layers: an NNConv layer, which calculates the hidden nodes based on the edge and node features, and a GraphSAGE layer that calculates the resulting label based on the hidden nodes.
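To illustrate the shapes involved, here is a small dry run on a single graph (my own example; the h_feats value of 16 is arbitrary and unrelated to the training below):

g, _ = dataset[0]
m = EntityGraphModule(node_in_feats=2, edge_in_feats=2, h_feats=16, num_classes=2)
logits = m(g, g.ndata["feat"].float(), g.edata["feat"].float())
print(logits.shape)  # torch.Size([1, 2]): one score per class for the whole graph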

Training and Testing

Almost there. Next we prepare the data for training and testing.

from torch.utils.data.sampler import SubsetRandomSampler
from dgl.dataloading import GraphDataLoader

num_examples = len(dataset)
num_train = int(num_examples * 0.8)

train_sampler = SubsetRandomSampler(torch.arange(num_train))
test_sampler = SubsetRandomSampler(torch.arange(num_train, num_examples))

train_dataloader = GraphDataLoader(
    dataset, sampler=train_sampler, batch_size=5, drop_last=False
)
test_dataloader = GraphDataLoader(
    dataset, sampler=test_sampler, batch_size=5, drop_last=False
)

We split with an 80/20 ratio using random sampling and create a data loader for each of the samplers.
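Note that torch.arange makes the split deterministic: the first 80% of the rows always end up in the training set. If the rows in entities.jsonl were not already shuffled by the generator, one could permute the indices first (a variant of my own, not from the notebook):

perm = torch.randperm(num_examples)
train_sampler = SubsetRandomSampler(perm[:num_train])
test_sampler = SubsetRandomSampler(perm[num_train:])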

The last step is to initialize the model with our data, run the training and afterwards test the result.

h_feats = 64
learn_iterations = 50
learn_rate = 0.01

model = EntityGraphModule(
    dataset.graphs[0].ndata["feat"].shape[1],
    dataset.graphs[0].edata["feat"].shape[1],
    h_feats,
    dataset.labels.max().item() + 1,
)
optimizer = torch.optim.Adam(model.parameters(), lr=learn_rate)

for _ in range(learn_iterations):
    for batched_graph, labels in train_dataloader:
        pred = model(
            batched_graph,
            batched_graph.ndata["feat"].float(),
            batched_graph.edata["feat"].float(),
        )
        loss = F.cross_entropy(pred, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

num_correct = 0
num_tests = 0
for batched_graph, labels in test_dataloader:
    pred = model(
        batched_graph,
        batched_graph.ndata["feat"].float(),
        batched_graph.edata["feat"].float(),
    )
    num_correct += (pred.argmax(1) == labels).sum().item()
    num_tests += len(labels)

acc = num_correct / num_tests
print("Test accuracy:", acc)

We initialize the model by providing the feature sizes for nodes and edges (both 2 in our case), the number of hidden nodes (64) and the number of labels (2, because it is either fraud or not). The optimizer is then initialized with a learning rate of 0.01. Afterwards we run a total of 50 training iterations. Once the training is done, we test the results using the test data loader and print the resulting accuracy.
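If you want to watch the training converge, a minimal variant of the loop above logs the average loss per iteration (illustrative, not part of the original notebook):

for iteration in range(learn_iterations):
    total_loss = 0.0
    num_batches = 0
    for batched_graph, labels in train_dataloader:
        pred = model(
            batched_graph,
            batched_graph.ndata["feat"].float(),
            batched_graph.edata["feat"].float(),
        )
        loss = F.cross_entropy(pred, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        num_batches += 1
    print(f"iteration {iteration}: avg loss {total_loss / num_batches:.4f}")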

Over various runs, I got a typical accuracy in the range of 70 to 85%, with a few exceptions going down to something like 55%.

Conclusion

Given that the only usable information in our example dataset is the way the nodes are connected, the initial results look very promising and suggest that higher accuracy rates would be possible with real-world data and more training.

Obviously, when working with real data, the layout is not that consistent and does not provide an obvious correlation between the layout and fraudulent behavior. Hence, you should also take the edge and node features into consideration. The key takeaway from this article should be that entity resolution provides the ideal data for fraud detection using graph neural networks and should be considered part of a fraud detection engineer's arsenal of tools.
