For very large migrations, AWS Database Migration Service (AWS DMS) replication can run for hours or days depending on the data being replicated. It’s advisable to monitor the AWS DMS resources for a smooth migration. Monitoring your resources can help you detect anomalies and trigger notifications based on the threshold metrics configured. You can use Amazon CloudWatch to collect, track, and monitor AWS resources using metrics. With CloudWatch, you can create alarms that watch metrics and send notifications when a threshold is breached. You can configure CloudWatch to monitor your replication using the AWS Management Console, the AWS Command Line Interface (AWS CLI), or AWS DMS API.

AWS DMS replication is set up between two databases with multiple tasks performing full load or change data capture (CDC). With AWS DMS, you can perform one-time migrations and replicate ongoing changes to keep sources and targets in sync. These replication tasks run for hours or days, and can run into various replication issues, such as the following:

  • Low memory – The replication instance is running low on memory
  • High CPU – CPU is used at capacity
  • Excessive swap usage – Not enough memory is allocated
  • Network disruptions – Network failure between source and target instances

These issues can force DBAs to repeat work and can delay migration timelines, so it’s best to detect them early, before they disrupt replication. Setting up CloudWatch alarms on the AWS DMS replication instance and its tasks alerts you proactively, so that you can act in time.

This post describes how to set up and configure the monitoring of AWS DMS resources. You can implement the solution with the AWS CLI or the console. For this post, we use the AWS CLI to create our CloudWatch alarms.

Solution overview

The AWS CLI is an open-source tool that enables you to interact with AWS services using commands in your command line shell. For more information about installing and configuring the AWS CLI, see Installing the AWS CLI.

AWS DMS helps you migrate databases to AWS securely. It supports homogeneous and heterogeneous migrations between different database platforms, such as Oracle to Amazon Aurora. AWS DMS supports continuous data replication while maintaining high availability, and has been widely adopted for database migrations because it’s easy to configure. For more information, see What Is AWS Database Migration Service?

To use AWS DMS, you need to create a DMS replication instance, create a source endpoint that connects the source database to read data, and create a target endpoint that connects to the target database and loads the data. You can create one or more replication tasks to replicate and migrate your data between source and target databases. In this post, we describe the metrics that you can monitor on your AWS DMS resources.

An AWS DMS replication instance is an Amazon Elastic Compute Cloud (Amazon EC2) instance that performs the actual data migration between source and target databases. The replication instance also caches the changes during the migration, which is very CPU and memory intensive. You can set up alarms on various replication instance metrics. For this post, we use the following:

  • CPUUtilization
  • FreeableMemory
  • FreeStorageSpace
  • WriteIOPS

AWS DMS replication tasks are responsible for migrating and replicating data between the source and target endpoints. For this post, we set up the following key metrics:

  • FullLoadThroughputRowsSource
  • FullLoadThroughputRowsTarget
  • CDCThroughputRowsSource
  • CDCThroughputRowsTarget

Prerequisites

To follow along with this solution, you need the following resources:

  • An AWS account with permissions to access resources in AWS DMS and CloudWatch
  • Access to a Linux/Unix machine installed and configured with the AWS CLI
  • Necessary AWS Identity and Access Management (IAM) permissions granted to the EC2 instance for accessing the CloudWatch alarms
  • An AWS DMS replication instance running in your AWS account

Setting up your replication instance

To set up monitoring, it’s a best practice to give each CloudWatch alarm a unique name. For this post, we create an AWS DMS replication instance (sample-dms-replication-instance) and an AWS DMS task (dms-task-1), and configure the task’s source and target endpoints. Complete the following steps:

  1. On the AWS DMS console, choose Replication instances.
  2. Choose your desired instance (for this post, we use sample-dms-replication-instance).

To name your alarm uniquely, use the following parameters:

  • team_tag_value – A unique tag value that identifies your team (this post uses team1)
  • resource_name – The unique resource name of your choice (this post uses dms_instance)
  • metric_name – The name of the metric that the alarm watches
  • replication_instance_identifier – The identifier of your replication instance (for this post, we use sample-dms-replication-instance)
  • region – The Region of your AWS DMS replication instance
  • replication_instance_arn – The ARN of your replication instance (for this post, it’s arn:aws:dms:us-east-1:999999999999:rep:G7EBKJQL2EBNETCOE352XV77K4)
  3. Run the following commands on any Linux/Unix system to set up the environment variables referring to the preceding parameters:
    export team_tag_value=team1
    export resource_name=dms_instance 
    export replication_instance_identifier=sample-dms-replication-instance
    export region=us-east-1
    export replication_instance_arn=arn:aws:dms:us-east-1:999999999999:rep:G7EBKJQL2EBNETCOE352XV77K4

  4. To check all the available metrics on your AWS DMS resource, enter the following code:
    aws cloudwatch list-metrics --namespace AWS/DMS --dimensions "Name=ReplicationInstanceIdentifier,Value=$replication_instance_identifier" --region $region
    
    Output:
    {
        "Metrics": [
            {
                "Namespace": "AWS/DMS",
                "MetricName": "FullLoadThroughputBandwidthSource",
                "Dimensions": [
                    {
                        "Name": "ReplicationInstanceIdentifier",
                        "Value": "sample-dms-replication-instance"
                    },
                    {
                        "Name": "ReplicationTaskIdentifier",
                        "Value": "TKTVOIKYZJY665F72C5OLEZJ4HYQX7USNHTZSRQ"
                    }
                ]
            },
            {
                "Namespace": "AWS/DMS",
                "MetricName": "FullLoadThroughputRowsTarget",
                "Dimensions": [
                    {
                        "Name": "ReplicationInstanceIdentifier",
                        "Value": "sample-dms-replication-instance"
                    },
                    {
                        "Name": "ReplicationTaskIdentifier",
                        "Value": "TKTVOIKYZJY665F72C5OLEZJ4HYQX7USNHTZSRQ"
                    }
                ]
            },
            {
                "Namespace": "AWS/DMS",
                "MetricName": "RunCounter",
                "Dimensions": [
                    {
                        "Name": "ReplicationInstanceIdentifier",
                        "Value": "sample-dms-replication-instance"
                    },
                    {
                        "Name": "ReplicationTaskIdentifier",
                        "Value": "TKTVOIKYZJY665F72C5OLEZJ4HYQX7USNHTZSRQ"
                    }
                ]
            },
    
    .
    .
    .
    .
    .
    .
    .
    .
            {
                "Namespace": "AWS/DMS",
                "MetricName": "CDCLatencyTarget",
                "Dimensions": [
                    {
                        "Name": "ReplicationInstanceIdentifier",
                        "Value": "sample-dms-replication-instance"
                    },
                    {
                        "Name": "ReplicationTaskIdentifier",
                        "Value": "TKTVOIKYZJY665F72C5OLEZJ4HYQX7USNHTZSRQ"
                    }
                ]
            },
            {
                "Namespace": "AWS/DMS",
                "MetricName": "CDCLatencySource",
                "Dimensions": [
                    {
                        "Name": "ReplicationInstanceIdentifier",
                        "Value": "sample-dms-replication-instance"
                    },
                    {
                        "Name": "ReplicationTaskIdentifier",
                        "Value": "TKTVOIKYZJY665F72C5OLEZJ4HYQX7USNHTZSRQ"
                    }
                ]
            },
            {
                "Namespace": "AWS/DMS",
                "MetricName": "CDCThroughputRowsTarget",
                "Dimensions": [
                    {
                        "Name": "ReplicationInstanceIdentifier",
                        "Value": "sample-dms-replication-instance"
                    },
                    {
                        "Name": "ReplicationTaskIdentifier",
                        "Value": "TKTVOIKYZJY665F72C5OLEZJ4HYQX7USNHTZSRQ"
                    }
                ]
            },
            {
                "Namespace": "AWS/DMS",
                "MetricName": "FullLoadThroughputRowsSource",
                "Dimensions": [
                    {
                        "Name": "ReplicationInstanceIdentifier",
                        "Value": "sample-dms-replication-instance"
                    },
                    {
                        "Name": "ReplicationTaskIdentifier",
                        "Value": "TKTVOIKYZJY665F72C5OLEZJ4HYQX7USNHTZSRQ"
                    }
                ]
            },
            {
                "Namespace": "AWS/DMS",
                "MetricName": "CDCChangesMemorySource",
                "Dimensions": [
                    {
                        "Name": "ReplicationInstanceIdentifier",
                        "Value": "sample-dms-replication-instance"
                    },
                    {
                        "Name": "ReplicationTaskIdentifier",
                        "Value": "TKTVOIKYZJY665F72C5OLEZJ4HYQX7USNHTZSRQ"
                    }
                ]
            },
            {
                "Namespace": "AWS/DMS",
                "MetricName": "CDCChangesMemoryTarget",
                "Dimensions": [
                    {
                        "Name": "ReplicationInstanceIdentifier",
                        "Value": "sample-dms-replication-instance"
                    },
                    {
                        "Name": "ReplicationTaskIdentifier",
                        "Value": "TKTVOIKYZJY665F72C5OLEZJ4HYQX7USNHTZSRQ"
                    }
                ]
            }
        ]
    }
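The full JSON listing is long. As a rough sketch, it can be condensed to a list of unique metric names with the AWS CLI’s --query and --output options. The function name list_dms_metric_names is an assumption made for this illustration, not an AWS command.

```shell
# Hypothetical helper (not part of the AWS CLI).
#   $1 = replication instance identifier
#   $2 = AWS Region
# Prints each distinct metric name on its own line.
list_dms_metric_names() {
  aws cloudwatch list-metrics --namespace AWS/DMS \
    --dimensions "Name=ReplicationInstanceIdentifier,Value=$1" \
    --query 'Metrics[].MetricName' --output text --region "$2" |
    tr '\t' '\n' | sort -u
}

# Usage (requires AWS credentials):
#   list_dms_metric_names "$replication_instance_identifier" "$region"
```

Text output separates the names with tabs, so tr puts one name per line before sort -u removes the duplicates that appear when several tasks report the same metric.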

Setting up AWS DMS replication instance CloudWatch metrics

The following are some of the significant CloudWatch metrics for monitoring the AWS DMS replication instance:

  • CPUUtilization – Amount of CPU used
  • FreeStorageSpace – Amount of available storage space in bytes
  • FreeableMemory – Amount of available random-access memory
  • WriteIOPS – Average number of disk write I/O operations per second

To set up an alarm for CPU utilization, run the following command. It creates a CloudWatch alarm that fires when the replication instance’s average CPU utilization exceeds 70 percent for three consecutive 5-minute periods:

$ aws cloudwatch put-metric-alarm --alarm-name ${team_tag_value}-${resource_name}-cpu --alarm-description "alarm when cpu is more than threshold" --metric-name CPUUtilization --namespace AWS/DMS --statistic Average --period 300 --threshold 70 --comparison-operator GreaterThanThreshold --dimensions "Name=ReplicationInstanceIdentifier,Value=$replication_instance_identifier" --evaluation-periods 3 --unit Percent  --region=$region

To check the allocated storage (in GB) for your running AWS DMS instance, run the following command:

$ aws dms describe-replication-instances --filters Name=replication-instance-arn,Values=$replication_instance_arn --query "ReplicationInstances[:].{AllocatedStorage:AllocatedStorage}" --region=$region

Output:
[
    {
        "AllocatedStorage": 50
    }
]

To set up an alarm for free storage space (in bytes), run the following command. The threshold of 1.0e+10 bytes (10 GB) is 20 percent of the 50 GB allocated to this instance:

$ aws cloudwatch put-metric-alarm --alarm-name ${team_tag_value}-${resource_name}-FreeStorage --alarm-description "when less than 20% of the allocated storage" --metric-name FreeStorageSpace --namespace AWS/DMS --statistic Average --period 300 --threshold 1.0e+10 --comparison-operator LessThanOrEqualToThreshold --dimensions "Name=ReplicationInstanceIdentifier,Value=$replication_instance_identifier" --evaluation-periods 1 --unit Bytes --region=$region
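A byte threshold for a given percentage of allocated storage can be derived with shell arithmetic rather than by hand. The following is a minimal sketch; the function name storage_threshold_bytes is hypothetical, and it assumes 1 GB = 1,000,000,000 bytes to match the 1.0e+09-per-GB notation used in this post.

```shell
# Hypothetical helper: convert allocated storage and a percentage
# into a byte threshold for the FreeStorageSpace alarm.
#   $1 = allocated storage in GB (as reported by describe-replication-instances)
#   $2 = percentage of allocated storage at which to alarm
# Assumes 1 GB = 1,000,000,000 bytes.
storage_threshold_bytes() {
  allocated_gb=$1
  percent=$2
  echo $(( allocated_gb * 1000000000 * percent / 100 ))
}

# Usage: pass the result to --threshold
#   --threshold "$(storage_threshold_bytes 50 20)"
```

For the 50 GB instance in this post, storage_threshold_bytes 50 20 prints 10000000000, which is 10 GB.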

FreeableMemory

Freeable memory is memory that can be freed and used for other processes; it’s the amount of cache used on the replication instance. Although the FreeableMemory metric doesn’t reflect the actual free memory available, the combination of the FreeableMemory and SwapUsage metrics can indicate whether the replication instance is overloaded.

To set up an alarm for freeable memory less than 1 GB, run the following command:

$ aws cloudwatch put-metric-alarm --alarm-name ${team_tag_value}-${resource_name}-FreeableMemory --alarm-description "Free Memory is lower than 1GB" --metric-name "FreeableMemory" --namespace "AWS/DMS" --statistic Average --period 300 --threshold 1.0e+09 --comparison-operator LessThanOrEqualToThreshold --dimensions "Name=ReplicationInstanceIdentifier,Value=$replication_instance_identifier" --evaluation-periods 3 --unit Bytes --region=$region
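Because FreeableMemory is most informative alongside SwapUsage, a companion alarm on swap may be worth adding. The following is only a sketch: the function name put_swap_alarm and the example threshold are assumptions for illustration, not values recommended by this post.

```shell
# Hypothetical helper: alarm when average SwapUsage (reported in bytes)
# stays at or above a limit for three consecutive 5-minute periods.
#   $1 = swap limit in bytes (an assumed value; tune it for your instance class)
put_swap_alarm() {
  aws cloudwatch put-metric-alarm \
    --alarm-name "${team_tag_value}-${resource_name}-SwapUsage" \
    --alarm-description "Swap usage is at or above $1 bytes" \
    --metric-name SwapUsage --namespace AWS/DMS --statistic Average \
    --period 300 --evaluation-periods 3 \
    --threshold "$1" --comparison-operator GreaterThanOrEqualToThreshold \
    --dimensions "Name=ReplicationInstanceIdentifier,Value=${replication_instance_identifier}" \
    --region "$region"
}

# Usage (requires AWS credentials), for example with a 500 MB limit:
#   put_swap_alarm 500000000
```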

WriteIOPS

To set up the alarm for the average number of WriteIOPS, run the following command:

$ aws cloudwatch put-metric-alarm --alarm-name ${team_tag_value}-${resource_name}-WriteIOPS --alarm-description "High Write IOPs ALARM: More than 1000/Secs" --metric-name WriteIOPS --namespace "AWS/DMS" --statistic Average --period 120 --threshold 1000 --comparison-operator GreaterThanOrEqualToThreshold --evaluation-periods 2 --unit Count/Second --dimensions "Name=ReplicationInstanceIdentifier,Value=$replication_instance_identifier" --region=$region
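The instance-level alarm commands differ only in the metric, comparison operator, threshold, and timing, so they can be wrapped in one function. The following is a sketch under stated assumptions: put_instance_alarm and the DRY_RUN switch are hypothetical names introduced here, and the wrapper omits --alarm-description and --unit for brevity.

```shell
# Hypothetical wrapper (not an AWS CLI command). Arguments:
#   $1 alarm-name suffix   $2 metric name        $3 comparison operator
#   $4 threshold           $5 period in seconds  $6 evaluation periods
# Reads team_tag_value, resource_name, replication_instance_identifier,
# and region from the environment.
# Set DRY_RUN=1 to print the command instead of running it.
put_instance_alarm() {
  set -- aws cloudwatch put-metric-alarm \
    --alarm-name "${team_tag_value}-${resource_name}-$1" \
    --metric-name "$2" --namespace AWS/DMS --statistic Average \
    --comparison-operator "$3" --threshold "$4" \
    --period "$5" --evaluation-periods "$6" \
    --dimensions "Name=ReplicationInstanceIdentifier,Value=${replication_instance_identifier}" \
    --region "$region"
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"    # for inspection only; quoting is not preserved
  else
    "$@"
  fi
}

# Usage:
#   DRY_RUN=1
#   put_instance_alarm cpu CPUUtilization GreaterThanThreshold 70 300 3
```

Running with DRY_RUN=1 first makes it easy to review the generated alarm name and dimensions before anything is created in CloudWatch.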

After you set up these metrics, you can see the alarms in CloudWatch. The following screenshot shows the alarms set up for sample-dms-replication-instance with the state OK.

To describe the alarm that you configured, run the following command (the following code checks the alarm for WriteIOPS):

$ export metric_name=WriteIOPS
$ aws cloudwatch describe-alarms --alarm-names ${team_tag_value}-${resource_name}-${metric_name}  --region=$region

Output:
{
    "MetricAlarms": [
        {
            "AlarmName": "team1-dms_instance-WriteIOPS",
            "AlarmArn": "arn:aws:cloudwatch:us-east-1:999999999999:alarm:team1-dms_instance-WriteIOPS",
            "AlarmDescription": "High Write IOPs ALARM: More than 1000/Secs",
            "AlarmConfigurationUpdatedTimestamp": "2020-10-29T15:43:36.545Z",
            "ActionsEnabled": true,
            "OKActions": [],
            "AlarmActions": [],
            "InsufficientDataActions": [],
            "StateValue": "OK",
            "StateReason": "Threshold Crossed: 2 datapoints [0.0 (29/10/20 15:42:00), 0.10833333333333334 (29/10/20 15:40:00)] were not greater than or equal to the threshold (1000.0).",
            "StateReasonData": "{\"version\":\"1.0\",\"queryDate\":\"2020-10-29T15:44:38.061+0000\",\"startDate\":\"2020-10-29T15:40:00.000+0000\",\"unit\":\"Count/Second\",\"statistic\":\"Average\",\"period\":120,\"recentDatapoints\":[0.10833333333333334,0.0],\"threshold\":1000.0}",
            "StateUpdatedTimestamp": "2020-10-29T15:44:38.066Z",
            "MetricName": "WriteIOPS",
            "Namespace": "AWS/DMS",
            "Statistic": "Average",
            "Dimensions": [
                {
                    "Name": "ReplicationInstanceIdentifier",
                    "Value": "sample-dms-replication-instance"
                }
            ],
            "Period": 120,
            "Unit": "Count/Second",
            "EvaluationPeriods": 2,
            "Threshold": 1000.0,
            "ComparisonOperator": "GreaterThanOrEqualToThreshold"
        }
    ],
    "CompositeAlarms": []
}
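The empty AlarmActions list in this output means the alarm changes state without notifying anyone. One way to receive notifications is to re-create the alarm with an --alarm-actions argument that points to an Amazon SNS topic. The following is a sketch: the function name is hypothetical, and the topic ARN in the usage comment is a placeholder that you must replace with a topic in your own account.

```shell
# Hypothetical helper: re-create the WriteIOPS alarm with a notification action.
# put-metric-alarm replaces an existing alarm of the same name, so every
# original setting is repeated here.
#   $1 = ARN of an existing SNS topic to notify when the alarm fires
put_writeiops_alarm_with_action() {
  aws cloudwatch put-metric-alarm \
    --alarm-name "${team_tag_value}-${resource_name}-WriteIOPS" \
    --alarm-description "High Write IOPs ALARM: More than 1000/Secs" \
    --metric-name WriteIOPS --namespace AWS/DMS --statistic Average \
    --period 120 --threshold 1000 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --evaluation-periods 2 --unit Count/Second \
    --dimensions "Name=ReplicationInstanceIdentifier,Value=${replication_instance_identifier}" \
    --alarm-actions "$1" \
    --region "$region"
}

# Usage (requires AWS credentials; the ARN below is a placeholder):
#   put_writeiops_alarm_with_action arn:aws:sns:us-east-1:999999999999:dms-alarms
```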

Setting up an alarm for replication tasks

After you configure the AWS DMS replication instance alarms, you set up alarms for the replication tasks.

  1. On the AWS DMS console, choose Database migration tasks.
  2. Choose the task you want to set up an alarm for.

You need the following parameters to name your alarm uniquely:

  • replication_task_identifier – Replication task identifier
  • replication_task_arn – Replication task ARN
  • task_name – Name of the replication task

Run the following commands on any Linux/Unix system to set the environment variables for these parameters:

$ export task_name=dms-task-1
$ export replication_task_identifier=TKTVOIKYZJY665F72C5OLEZJ4HYQX7USNHTZSRQ 
$ export replication_task_arn=arn:aws:dms:us-east-1:999999999999:task:TKTVOIKYZJY665F72C5OLEZJ4HYQX7USNHTZSRQ

To check the available metrics for your task, run the following command:

$ aws cloudwatch list-metrics --namespace AWS/DMS --dimensions "Name=ReplicationInstanceIdentifier,Value=$replication_instance_identifier" "Name=ReplicationTaskIdentifier,Value=$replication_task_identifier" --region=$region

Setting up AWS DMS replication tasks CloudWatch metrics

The following are some of the significant metrics for monitoring AWS DMS replication tasks:

  • CDCThroughputRowsTarget – Outgoing task changes for the target in rows per second
  • CDCThroughputRowsSource – Incoming task changes from the source in rows per second

To retrieve the task’s ReplicationTaskIdentifier, ReplicationInstanceArn, and ReplicationTaskArn, enter the following code:

$ aws dms describe-replication-tasks --filters Name=replication-task-arn,Values=$replication_task_arn --query "ReplicationTasks[:].{ReplicationTaskIdentifier:ReplicationTaskIdentifier,ReplicationInstanceArn:ReplicationInstanceArn,ReplicationTaskArn:ReplicationTaskArn}" --region=$region

To set up the alarm for CDCThroughputRowsSource, run the following command:

$ aws cloudwatch put-metric-alarm --alarm-name ${team_tag_value}-${task_name}-cdcthroughputrowssource --alarm-description "Incoming task changes from the source is more than 1000 rows per second" --metric-name CDCThroughputRowsSource --namespace "AWS/DMS" --statistic Average --period 60 --threshold 1000 --comparison-operator GreaterThanOrEqualToThreshold --evaluation-periods 1 --dimensions "Name=ReplicationInstanceIdentifier,Value=$replication_instance_identifier" "Name=ReplicationTaskIdentifier,Value=$replication_task_identifier" --region=$region

To set up the alarm for CDCThroughputRowsTarget, run the following command:

$ aws cloudwatch put-metric-alarm --alarm-name ${team_tag_value}-${task_name}-cdcthroughputrowstarget --alarm-description "Outgoing task changes for the target is more than 1000 rows per second" --metric-name CDCThroughputRowsTarget --namespace "AWS/DMS" --statistic Average --period 60 --threshold 1000 --comparison-operator GreaterThanOrEqualToThreshold --evaluation-periods 1 --dimensions "Name=ReplicationInstanceIdentifier,Value=$replication_instance_identifier" "Name=ReplicationTaskIdentifier,Value=$replication_task_identifier" --region=$region

These alarms stay in the INSUFFICIENT_DATA state until the task is started and begins publishing metrics.
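A quick way to check whether an alarm has started evaluating is to print only its state. The following sketch uses the AWS CLI’s --query option; the function name alarm_state is an assumption made for this illustration.

```shell
# Hypothetical helper: print only the state of one alarm
# (OK, ALARM, or INSUFFICIENT_DATA).
#   $1 = full alarm name
# Reads the Region from the region environment variable.
alarm_state() {
  aws cloudwatch describe-alarms --alarm-names "$1" \
    --query 'MetricAlarms[0].StateValue' --output text --region "$region"
}

# Usage (requires AWS credentials):
#   alarm_state "${team_tag_value}-${task_name}-cdcthroughputrowssource"
```

Before the task starts, an alarm with no data points is expected to report INSUFFICIENT_DATA rather than OK.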

The following screenshot shows the alarms configured after running the alarm setup commands.

To describe the alarms for this task, run the following command:

$ export metric_name=cdcthroughputrowssource
$ aws cloudwatch describe-alarms --alarm-names ${team_tag_value}-${task_name}-${metric_name} --query "MetricAlarms[:].{AlarmArn:AlarmArn,Dimensions:Dimensions,StateReason:StateReason}" --region=$region

You can set up additional alarms based on your requirements, such as the following:

  • CDCLatencySource – The gap, in seconds, between the last event captured from the source endpoint and current system timestamp of the AWS DMS instance. If no changes have been captured from the source due to task scoping, AWS DMS sets this value to zero.
  • CDCLatencyTarget – The gap, in seconds, between the first event timestamp waiting to commit on the target and the current timestamp of the AWS DMS instance. Target latency should never be smaller than the source latency.
  • NetworkTransmitThroughput – The outgoing (transmit) network traffic on the replication instance, including customer database traffic and AWS DMS traffic used for monitoring and replication.
  • NetworkReceiveThroughput – The incoming (receive) network traffic on the replication instance, including customer database traffic and AWS DMS traffic used for monitoring and replication.
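
As an illustration of the latency metrics above, an alarm on CDCLatencySource or CDCLatencyTarget can follow the same pattern as the throughput alarms. This is a sketch only: the function name put_latency_alarm, the 60-second period, and the five evaluation periods are assumptions, and the right latency limit depends on your workload.

```shell
# Hypothetical helper: alarm when a CDC latency metric averages above a limit
# for five consecutive 1-minute periods (assumed values).
#   $1 = metric name (CDCLatencySource or CDCLatencyTarget)
#   $2 = latency limit in seconds
# Reads team_tag_value, task_name, replication_instance_identifier,
# replication_task_identifier, and region from the environment.
put_latency_alarm() {
  aws cloudwatch put-metric-alarm \
    --alarm-name "${team_tag_value}-${task_name}-$1" \
    --alarm-description "$1 is above $2 seconds" \
    --metric-name "$1" --namespace AWS/DMS --statistic Average \
    --period 60 --evaluation-periods 5 \
    --threshold "$2" --comparison-operator GreaterThanThreshold \
    --dimensions "Name=ReplicationInstanceIdentifier,Value=${replication_instance_identifier}" \
                 "Name=ReplicationTaskIdentifier,Value=${replication_task_identifier}" \
    --region "$region"
}

# Usage (requires AWS credentials), for example with a 300-second limit:
#   put_latency_alarm CDCLatencyTarget 300
```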

Summary

This post discussed some of the CloudWatch metrics you can use to monitor your AWS DMS replication instance and replication tasks.

For information about debugging AWS DMS, see Debugging Your AWS DMS Migrations: What to Do When Things Go Wrong (Part 1) and Debugging Your AWS DMS Migrations: What to Do When Things Go Wrong (Part 2).

Try this approach in your environment and see the benefits. We hope this post helps you with your AWS DMS replication monitoring. Please reach out with questions or feature requests via the comments.


About the Authors

Jeevith Anumalla is an Oracle Database Cloud Architect with the Professional Services team at Amazon Web Services. He works as a database migration specialist to help internal and external Amazon customers move their on-premises database environments to AWS data stores.

Sagar Patel is a Database Specialty Architect with the Professional Services team at Amazon Web Services. He works as a database migration specialist to provide technical guidance and help Amazon customers to migrate their on-premises databases to AWS.