Before AI Predicts Your Next Friend, It Needs to Do This First

In this post we’ll continue working on link prediction with the Twitch dataset: we’ll export the graph data from the Neptune DB cluster to an S3 bucket using the neptune-export utility provided by AWS. We’ll choose the ‘neptune_ml’ profile when we create the data export task and the utility will create the ‘training-data-configuration.json’ file that we’ll use later in the pipeline. The exported data will be ready for feature encoding and data processing, which is the next step required for link prediction.

Read part 1 here.

GRAPH DATA IN NEPTUNE DB

We start with the graph data that we have in Neptune DB after uploading the lists of vertices and edges using the Neptune Bulk Loader API (as described in Part 1 of this guide).

The vertices represent users. All vertices contain the same set of properties, and a single vertex looks like this:

{T.id: '153', T.label: 'user', 'days': 1629, 'mature': True, 'views': 3615, 'partner': False}

All edges have the same label (‘follows’), and each edge connects two users. A single edge looks like this:

{T.id: '0', T.label: 'follows', Direction.IN: {T.id: '255', T.label: 'user'}, Direction.OUT: {T.id: '6194', T.label: 'user'}}
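
As a side note, maps like these can be pulled straight from the cluster’s Gremlin HTTP endpoint with curl. This is a minimal sketch that assumes IAM database authentication is not enabled on the cluster; YOUR_CLUSTER_ENDPOINT is a placeholder, and 8182 is Neptune’s default port:

# Fetch one vertex as a property map (run from a machine inside the VPC).
curl -s https://YOUR_CLUSTER_ENDPOINT:8182/gremlin \
  -d '{"gremlin": "g.V().hasLabel(\"user\").limit(1).elementMap()"}'

# Fetch one edge together with its in- and out-vertices.
curl -s https://YOUR_CLUSTER_ENDPOINT:8182/gremlin \
  -d '{"gremlin": "g.E().hasLabel(\"follows\").limit(1).elementMap()"}'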

Our goal is to export the data so that it can be used in the next part of our data pipeline: preprocessing and feature encoding.

RUNNING THE NEPTUNE-EXPORT UTILITY ON EC2

We’ll use the neptune-export utility provided by AWS to export data from the database. To allow the utility access to the DB, we’ll run it on an EC2 instance inside the VPC where the Neptune DB cluster is. The utility will get the data from the DB, save it in local storage (an EBS volume), and then it will upload the exported data to S3.

Although AWS provides a CloudFormation template that deploys a private API inside your VPC so the export process can be started with an HTTP request, we won’t focus on that this time. Since our goal is to demonstrate how the data pipeline works (and not to set up an API), we’ll simply run the neptune-export utility from a shell session on the EC2 instance. Those commands can also be automated with AWS Systems Manager Run Command and Step Functions.
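
For reference, a minimal sketch of that kind of automation with Run Command could look like the following; the instance ID and the run-export.sh wrapper script are hypothetical, and the SSM agent must be running on the instance:

# Hypothetical wrapper script containing the neptune-export command shown later in this post.
aws ssm send-command \
  --document-name "AWS-RunShellScript" \
  --instance-ids "i-0123456789abcdef0" \
  --parameters 'commands=["bash /home/ubuntu/run-export.sh"]'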

Let’s create the EC2 instance that we’ll run neptune-export on. For AMI, we choose Ubuntu 24.04 LTS. We need to make sure that the Neptune cluster is reachable from the EC2 instance, so we’ll create the instance in the same VPC where the Neptune cluster is, and we’ll configure the security groups to allow network traffic between the instance and the cluster. We also need to attach an EBS volume of sufficient size to contain the exported data. For the dataset that we’re working on, an 8GB volume is enough.
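
If the cluster and the instance use separate security groups, a single ingress rule on the cluster’s security group is enough. Here’s a sketch with placeholder group IDs (8182 is Neptune’s default port):

# Allow the EC2 instance's security group to reach the Neptune cluster on port 8182.
aws ec2 authorize-security-group-ingress \
  --group-id sg-NEPTUNE_CLUSTER_SG \
  --protocol tcp \
  --port 8182 \
  --source-group sg-EC2_INSTANCE_SG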

While the instance is starting, we need to create an IAM role that allows write access to the destination S3 bucket and also a few RDS actions; the RDS part is shown in the policy below. The first statement of the policy is mandatory, while the second one is only needed if you export data from a cloned cluster. Exporting from cloned clusters is discussed later in this post.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "RequiredPart",
            "Effect": "Allow",
            "Action": [
                "rds:ListTagsForResource",
                "rds:DescribeDBInstances",
                "rds:DescribeDBClusters"
            ],
            "Resource": "*"
        },
        {
            "Sid": "OptionalPartOnlyRequiredForExportingFromClonedCluster",
            "Effect": "Allow",
            "Action": [
                "rds:AddTagsToResource",
                "rds:DescribeDBClusters",
                "rds:DescribeDBInstances",
                "rds:ListTagsForResource",
                "rds:DescribeDBClusterParameters",
                "rds:DescribeDBParameters",
                "rds:ModifyDBParameterGroup",
                "rds:ModifyDBClusterParameterGroup",
                "rds:RestoreDBClusterToPointInTime",
                "rds:DeleteDBInstance",
                "rds:DeleteDBClusterParameterGroup",
                "rds:DeleteDBParameterGroup",
                "rds:DeleteDBCluster",
                "rds:CreateDBInstance",
                "rds:CreateDBClusterParameterGroup",
                "rds:CreateDBParameterGroup"
            ],
            "Resource": "*"
        }
    ]
}

You can allow access to just the target cluster (instead of all clusters) by editing the ‘Resource’ field.

The role must also have a trust policy that allows EC2 to assume the role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "ec2.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

Once the EC2 instance and the role are ready, we’ll attach the role to the instance.
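
If you prefer the CLI over the console, the sequence looks roughly like this. It’s a sketch: the role and profile names and the instance ID are placeholders, it assumes the two policies above are saved as local JSON files, and the S3 statement is my assumption about the minimum write access the export needs (adjust it to your bucket):

# Create the role with the trust policy shown above (saved locally as trust-policy.json).
aws iam create-role \
  --role-name neptune-export-role \
  --assume-role-policy-document file://trust-policy.json

# Attach the RDS policy shown above (saved locally as rds-policy.json).
aws iam put-role-policy \
  --role-name neptune-export-role \
  --policy-name neptune-export-rds \
  --policy-document file://rds-policy.json

# Assumed S3 statement: write access to the destination bucket (bucket name is a placeholder).
aws iam put-role-policy \
  --role-name neptune-export-role \
  --policy-name neptune-export-s3 \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::YOUR_TARGET_S3_BUCKET",
        "arn:aws:s3:::YOUR_TARGET_S3_BUCKET/*"
      ]
    }]
  }'

# EC2 picks up the role through an instance profile.
aws iam create-instance-profile --instance-profile-name neptune-export-profile
aws iam add-role-to-instance-profile \
  --instance-profile-name neptune-export-profile \
  --role-name neptune-export-role
aws ec2 associate-iam-instance-profile \
  --instance-id i-0123456789abcdef0 \
  --iam-instance-profile Name=neptune-export-profile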

Next, we need to install the neptune-export utility on the instance. To do that, we’ll log into the instance and use these commands to install JDK 8 and download the utility:

sudo apt update -y
sudo apt install -y openjdk-8-jdk
curl -O https://s3.amazonaws.com/aws-neptune-customer-samples/neptune-export/bin/neptune-export.jar
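
A quick sanity check that both steps worked (assuming the jar was downloaded into the current directory):

java -version
ls -lh neptune-export.jar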

Now that the EC2 instance and the destination S3 bucket are ready and the IAM role that grants write access to the bucket is attached to the instance, we can start exporting data. We’ll use this command to initiate the process, providing the required parameters as a JSON object:

java -jar /home/ubuntu/neptune-export.jar nesvc \
  --root-path /home/ubuntu/neptune-export \
  --json '{
    "command": "export-pg",
    "outputS3Path" : "s3://YOUR_TARGET_S3_BUCKET/neptune-export",
    "params": {
       "endpoint" : "YOUR_CLUSTER_ENDPOINT",
       "profile": "neptune_ml"
    }
  }'

We used only the required parameters here, but the config can easily be extended. The ‘filter’ parameter lets you choose which part of the graph gets exported: specific nodes, edges, and their properties (see the example after the next paragraph).

If you’re exporting data from a live database, you can use the ‘cloneCluster‘ and ‘cloneClusterReplicaCount‘ parameters to make the neptune-export utility take a snapshot of the database, create a new Neptune cluster from that snapshot, deploy read replicas, and use them to export the data. That way, the live database isn’t affected by the additional load of the export.
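
To illustrate both options, here’s what the export command could look like with a filter and cluster cloning enabled for our dataset. The shape of the ‘filter’ object follows the neptune-export documentation, but treat the exact values as an example:

java -jar /home/ubuntu/neptune-export.jar nesvc \
  --root-path /home/ubuntu/neptune-export \
  --json '{
    "command": "export-pg",
    "outputS3Path" : "s3://YOUR_TARGET_S3_BUCKET/neptune-export",
    "params": {
       "endpoint" : "YOUR_CLUSTER_ENDPOINT",
       "profile": "neptune_ml",
       "filter": {
          "nodes": [ { "label": "user", "properties": ["days", "mature", "views", "partner"] } ],
          "edges": [ { "label": "follows" } ]
       },
       "cloneCluster": true,
       "cloneClusterReplicaCount": 1
    }
  }'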

The full list of parameters can be found here (https://docs.aws.amazon.com/neptune/latest/userguide/export-parameters.html).

VIEWING EXPORTED DATA AND NEXT STEPS

When the export process is completed, neptune-export prints some stats including the numbers of vertices and edges:

Source:
  Nodes: 7126
  Edges: 70648
Export:
  Nodes: 7126
  Edges: 70648
  Properties: 28504
Details:
  Nodes: 
    user: 7126
        |_ days {propertyCount=7126, minCardinality=1, maxCardinality=1, recordCount=7126, dataTypeCounts=[Integer:7126]}
        |_ mature {propertyCount=7126, minCardinality=1, maxCardinality=1, recordCount=7126, dataTypeCounts=[Boolean:7126]}
        |_ views {propertyCount=7126, minCardinality=1, maxCardinality=1, recordCount=7126, dataTypeCounts=[Integer:7126]}
        |_ partner {propertyCount=7126, minCardinality=1, maxCardinality=1, recordCount=7126, dataTypeCounts=[Boolean:7126]}
  Edges: 
    (user)-follows-(user): 70648

And then it uploads the exported data to S3.

Let’s look at the files that are created in the target S3 bucket:
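
You can browse the bucket in the AWS console, or list the export prefix from the command line (the prefix below matches the outputS3Path we passed to the export command):

# List everything neptune-export uploaded under the export prefix.
aws s3 ls --recursive s3://YOUR_TARGET_S3_BUCKET/neptune-export/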

The ‘nodes’ and ‘edges’ directories contain CSV files with the lists of nodes and edges that are similar to what we used in Part 1 when we uploaded data. For large graphs, there are multiple files, but our dataset is small and there’s just one file in each directory. There’s also the training-data-configuration.json file that we’ll edit and use in the next step of our process.

If you’re doing a one-time export, it’s now safe to delete the EC2 instance and the EBS volume, since only the files in the target S3 bucket will be used in the next step. Otherwise, you can just stop the EC2 instance to avoid being charged for idle compute time (you’ll still be charged for EBS storage unless you delete the volume).
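
Either way it’s a single command; the instance ID is a placeholder, and note that an attached EBS volume is removed automatically on termination only if ‘delete on termination’ is enabled for it:

# Stop the instance: compute charges stop, EBS storage is still billed.
aws ec2 stop-instances --instance-ids i-0123456789abcdef0

# Or terminate it entirely if this was a one-time export.
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0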

At this point we have the graph data in S3 in the format that can be used in the next step of the process, and we’re ready to do feature encoding and data processing, which will be discussed in our next post.
