Monitor Metrics for Cluster Linking on Confluent Cloud¶

To monitor Cluster Linking on Confluent Cloud, use the Confluent Cloud Metrics API. As described in the sections below, Cluster Linking exposes metrics in the API that report the number of cluster links on a cluster, the number of mirror topics on a cluster, mirroring throughput, and mirroring lag.

Monitoring cluster link status¶

If Cluster Linking powers a business-critical workload, monitor your cluster links with the Confluent Cloud Metrics API to confirm that Cluster Linking is running smoothly and to set up proactive alerts on any issues.

Characteristics of a healthy cluster link¶

During steady state, a healthy cluster link will have the following:

- The cluster link’s link_state is active (see Number of cluster links on a cluster).
- Mirroring lag is at or near zero (0).
- Source and destination throughput (see Mirroring throughput) is predictable and at least as high as the total production (write) throughput on the source topics.

Tip

Extra metadata is mirrored by cluster linking, so the total cluster link throughput is often greater than the total production to the source topics.
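
The three characteristics above can be combined into a single check. The following is a minimal sketch, not an official client: the `LinkSnapshot` class, its field names, and the `lag_tolerance` default are hypothetical, and the values are assumed to have been collected from the metrics described later on this page.

```python
from dataclasses import dataclass


@dataclass
class LinkSnapshot:
    """One steady-state observation of a cluster link (hypothetical shape)."""
    link_state: str                       # value of the link_state label
    max_mirror_lag: float                 # highest mirror topic lag, in messages
    link_bytes_per_sec: float             # mirroring throughput on the link
    source_produce_bytes_per_sec: float   # total writes to the source topics


def health_problems(snap: LinkSnapshot, lag_tolerance: float = 100.0) -> list[str]:
    """Return the healthy-link criteria that the snapshot violates."""
    problems = []
    if snap.link_state != "active":
        problems.append(f"link_state is {snap.link_state!r}, expected 'active'")
    if snap.max_mirror_lag > lag_tolerance:
        problems.append(
            f"mirroring lag {snap.max_mirror_lag:.0f} exceeds {lag_tolerance:.0f}"
        )
    if snap.link_bytes_per_sec < snap.source_produce_bytes_per_sec:
        problems.append("link throughput is below source production throughput")
    return problems
```

An empty list means that all three criteria hold. What counts as "near zero" lag depends on the workload, so `lag_tolerance` is a value to tune rather than a recommended threshold.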

Signs of trouble and solutions¶

Cluster link’s link_state becomes unavailable.

This happens if the destination cluster cannot reach the source cluster, and therefore cannot replicate data, for 5 minutes. Describe the cluster link in the REST API or CLI to get more information on the specific error. There are two possibilities for the error:
- Misconfiguration. If a cluster link’s bootstrap server or authentication credentials are incorrect, the cluster link will be unable to reach the source cluster.
  
  Action: Update the bootstrap server or authentication credentials so the cluster link can reach its source cluster.
- Outage. If the source cluster, or the network between the source and destination cluster, experiences an outage, then the cluster link will not be able to replicate data.
  
  Action: Investigate for other signs of an outage.

Mirroring lag suddenly rises, and is not tied to a rise in production throughput on the source topics.

Take the following actions:
- Check whether either cluster is constrained on its throughput capacity. If the source cluster becomes constrained on its read (consume) capacity, or the destination cluster becomes constrained on its write (produce) capacity, mirroring throughput can drop and lag can rise. If this is the case, either expand the constrained cluster, wait for capacity to free up once the producers or consumers return to normal, or intervene with the consumers or producers to free up capacity.
- Check to see if one particular mirror partition is responsible for the lag. If so, load may need to be redistributed to other partitions.
- Make sure there is no client quota set on the cluster link’s principal on the source cluster, as reaching this quota can create lag.
- If the clusters have plenty of capacity remaining, there may be an issue with the availability of the source cluster, destination cluster, or the network between them. Further investigation is warranted. If both clusters are Confluent Cloud clusters, then contact Confluent Support.

Tip

If the source cluster is unavailable, then the destination cluster will not report any mirroring lag. You can detect this error through the unavailable link state described above, or by observing a sudden, unexplained drop in cluster link throughput on the destination cluster.
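
One way to detect the sudden throughput drop mentioned in this tip is to compare the latest sample against a short trailing baseline. The sketch below assumes that you already have a list of per-minute destination throughput values. The function name, the `window` size, and the `ratio` threshold are illustrative choices, not values recommended by Confluent.

```python
from statistics import mean


def sudden_drop(throughput: list[float], window: int = 5, ratio: float = 0.5) -> bool:
    """True if the latest sample fell below `ratio` times the mean of the
    `window` samples that precede it."""
    if len(throughput) < window + 1:
        return False  # not enough history to judge
    baseline = mean(throughput[-(window + 1):-1])
    return baseline > 0 and throughput[-1] < ratio * baseline
```

A drop that coincides with lower production on the source topics is expected, so an alert built on this check is more useful when it is paired with the source production rate.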

Limitations¶

On Enterprise clusters and on clusters with a high link count, the link_name label is missing from reported metrics.
Bidirectional and source-initiated links are not supported on Enterprise clusters in the same region.

Metrics¶

Number of cluster links on a cluster¶

io.confluent.kafka.server/cluster_link_count: Total number of cluster links (in any state) connected to the cluster. You can filter or group by the direction (source or destination), link name, and link state of the cluster link.

Labels

Label	Description
`mode`	Either `source` or `destination`.
`link_state`	`unavailable`, `paused`, or `active`
`link_name`	Cluster link name

Tip

For links in the unavailable state, use the CLI or REST API to get more detailed error information.

Example 1

Get the count of all cluster links on a cluster, regardless of state.

{
  "aggregations": [
    {
      "metric": "io.confluent.kafka.server/cluster_link_count"
    }
  ],
  "filter": {
    "op": "AND",
    "filters": [
      {
        "field": "resource.kafka.id",
        "op": "EQ",
        "value": "lkc-123"
      }
    ]
  },
  "granularity": "PT1M",
  "group_by": [
    "metric.link_state",
    "metric.link_name",
    "metric.mode"
  ],
  "intervals": [
    "PT5M/now"
  ]
}
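
A query body like the one above is sent to the Metrics API as an HTTP POST. The following sketch uses only the Python standard library. The endpoint URL and the use of HTTP Basic authentication with a Cloud API key and secret are assumptions here and should be checked against the Metrics API reference. The function names are hypothetical.

```python
import base64
import json
import urllib.request

# Assumed query endpoint; verify against the Metrics API reference.
QUERY_URL = "https://api.telemetry.confluent.cloud/v2/metrics/cloud/query"


def build_request(query: dict, api_key: str, api_secret: str) -> urllib.request.Request:
    """Wrap a query body in an authenticated POST request (not yet sent)."""
    token = base64.b64encode(f"{api_key}:{api_secret}".encode()).decode()
    return urllib.request.Request(
        QUERY_URL,
        data=json.dumps(query).encode("utf-8"),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def run_query(query: dict, api_key: str, api_secret: str) -> dict:
    """Send the query and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(query, api_key, api_secret)) as resp:
        return json.load(resp)
```

Building the request separately from sending it makes it possible to inspect or test the request without network access.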

Example 2

Get the count of unavailable links on a cluster.

{
  "aggregations": [
    {
      "metric": "io.confluent.kafka.server/cluster_link_count"
    }
  ],
  "filter": {
    "op": "AND",
    "filters": [
      {
        "field": "resource.kafka.id",
        "op": "EQ",
        "value": "lkc-123"
      },
      {
        "field": "metric.link_state",
        "op": "EQ",
        "value": "unavailable"
      }
    ]
  },
  "granularity": "PT1M",
  "group_by": [
    "metric.link_state",
    "metric.link_name",
    "metric.mode"
  ],
  "intervals": [
    "PT5M/now"
  ]
}
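
The result of this query can drive an alert. The sketch below assumes that the response is a JSON object whose `data` array holds one row per group, with a numeric `value` and the grouped labels as flat keys such as `metric.link_name`. That shape is an assumption and should be confirmed against an actual response. The function name is hypothetical.

```python
def unavailable_links(response: dict) -> set[str]:
    """Names of links that reported a nonzero unavailable count."""
    names = set()
    for row in response.get("data", []):  # assumed response shape
        if row.get("value", 0) > 0:
            # link_name may be absent (see Limitations), so fall back to a marker
            names.add(row.get("metric.link_name", "<unknown>"))
    return names
```

A non-empty result is a signal to describe the affected links with the CLI or REST API for the specific error.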

Number of mirror topics on a cluster¶

io.confluent.kafka.server/cluster_link_mirror_topic_count: The count of mirror topics on the cluster. You can filter or group by the name of the cluster link, or by the state of the mirror topic.

Labels

Label	Description
`link_name`	Name of the cluster link.
`link_mirror_topic_state`	The state the mirror topic is in.

Possible states for mirror topic are as follows:

Mirror Topic State	Description
`Mirror`	Actively mirroring data. Corresponds to the `ACTIVE` state in REST API. Known issue: also contains topics that are in the `SOURCE_UNAVAILABLE` state in REST API.
`PausedMirror`	A user has paused this mirror topic, and it is not mirroring data. Corresponds to the `PAUSED` state in REST API.
`PendingStoppedMirror`	A user has called `promote` on the mirror topic, and the promotion is in progress. Corresponds to the `PENDING_STOPPED` state in REST API.
`StoppedMirror`	A `promote` or `failover` command has completed, and this topic has changed from a mirror topic to a regular topic. Corresponds to the `STOPPED` state in REST API.
`FailedMirror`	The mirror topic has permanently failed, and will no longer mirror data. Corresponds to the `FAILED` state in REST API.
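
When correlating this metric with REST API output, the table above can be captured as a lookup. This is a small sketch transcribed from the table. The names `METRIC_TO_REST_STATE` and `rest_states` are hypothetical.

```python
# Metric label value -> REST API state, transcribed from the table above.
METRIC_TO_REST_STATE = {
    "Mirror": "ACTIVE",
    "PausedMirror": "PAUSED",
    "PendingStoppedMirror": "PENDING_STOPPED",
    "StoppedMirror": "STOPPED",
    "FailedMirror": "FAILED",
}


def rest_states(metric_state: str) -> list[str]:
    """REST API states that a metric state may correspond to."""
    states = [METRIC_TO_REST_STATE[metric_state]]
    if metric_state == "Mirror":
        # Known issue: Mirror also counts topics in SOURCE_UNAVAILABLE.
        states.append("SOURCE_UNAVAILABLE")
    return states
```

Because of the known issue, a count for `Mirror` alone does not prove that every topic is actively mirroring.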

Example

Get the count of active mirror topics over the past hour, grouped by cluster link name.

{
  "aggregations": [
    {
      "metric": "io.confluent.kafka.server/cluster_link_mirror_topic_count"
    }
  ],
  "filter": {
    "op": "AND",
    "filters": [
      {
        "field": "resource.kafka.id",
        "op": "EQ",
        "value": "lkc-52p82"
      },
      {
        "field": "metric.link_mirror_topic_state",
        "op": "EQ",
        "value": "Mirror"
      }
    ]
  },
  "granularity": "PT1M",
  "group_by": [
    "metric.link_name"
  ],
  "intervals": [
    "now-1h/now"
  ],
  "limit": 25
}
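
The query bodies on this page share one shape: a metric, a cluster ID filter, optional label filters, and a grouping. The helper below is a hypothetical convenience for generating them, not part of any Confluent library. Its defaults mirror the example above.

```python
from typing import Optional, Sequence

METRIC_PREFIX = "io.confluent.kafka.server/"


def build_query(
    metric: str,
    cluster_id: str,
    group_by: Sequence[str] = (),
    label_filters: Optional[dict] = None,
    interval: str = "now-1h/now",
    granularity: str = "PT1M",
    limit: int = 25,
) -> dict:
    """Build a query body that filters on one cluster and optional labels."""
    filters = [{"field": "resource.kafka.id", "op": "EQ", "value": cluster_id}]
    for label, value in (label_filters or {}).items():
        filters.append({"field": f"metric.{label}", "op": "EQ", "value": value})
    return {
        "aggregations": [{"metric": METRIC_PREFIX + metric}],
        "filter": {"op": "AND", "filters": filters},
        "granularity": granularity,
        "group_by": [f"metric.{label}" for label in group_by],
        "intervals": [interval],
        "limit": limit,
    }
```

For instance, `build_query("cluster_link_mirror_topic_count", "lkc-52p82", group_by=["link_name"], label_filters={"link_mirror_topic_state": "Mirror"})` produces the same body as the example above.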

Cluster link task states¶

io.confluent.kafka.server/cluster_link_task_count: Monitor the state of link-level tasks, for example, whether consumer offset syncing is working. If a task is in error, a reason code is provided. The full error message is available through the CLI or REST API.

Labels

Label	Description
`mode`	Either `source` or `destination`.
`link_name`	Name of link, based on customer input
`link_task_name`	Can take on values: `auto-create-mirror`, `acl-sync`, `consumer-offset-sync`, `topic-configs-sync`
`link_task_state`	For example, `active`, `in_error`, `not_configured`
`link_task_reason`	For example, `internal`, `authentication`, `authorization`, `remote_link_not_found`

Example

{
 "aggregations": [
   {
     "metric": "io.confluent.kafka.server/cluster_link_task_count"
   }
 ],
 "filter": {
   "op": "AND",
   "filters": [
     {
       "field": "resource.kafka.id",
       "op": "EQ",
       "value": "lkc-71gba"
     }
   ]
 },
 "granularity": "PT1H",
 "group_by": [
   "metric.link_task_name",
   "metric.link_task_state",
   "metric.link_task_reason",
   "metric.link_name"
 ],
 "intervals": [
   "2024-01-07T00:00:00.000Z/2024-01-07T00:02:00.000Z"
 ]
}
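
To turn the result of this query into an actionable list, you can pick out the groups whose task state is `in_error`. The sketch below assumes that each row in the response `data` array carries the grouped labels as flat keys (for example `metric.link_task_state`) next to a numeric `value`. That shape is an assumption, and the function name is hypothetical.

```python
def tasks_in_error(response: dict) -> list[tuple[str, str, str]]:
    """Sorted (link name, task name, reason) triples for tasks in error."""
    failing = {
        (
            row.get("metric.link_name", ""),
            row.get("metric.link_task_name", ""),
            row.get("metric.link_task_reason", ""),
        )
        for row in response.get("data", [])  # assumed response shape
        if row.get("metric.link_task_state") == "in_error"
        and row.get("value", 0) > 0
    }
    return sorted(failing)
```

The reason code narrows down the cause. The full error message still has to be retrieved with the CLI or REST API.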

Mirror topic state transition¶

io.confluent.kafka.server/cluster_link_mirror_transition_in_error: Monitor mirror topic state transition errors, for example, errors that a mirror topic encounters during promotion, while its state is pending_stopped and it is transitioning to stopped.

Labels

Label	Description
`mode`	Either `source` or `destination`.
`link_name`	Name of link, based on customer input
`link_mirror_topic_state`	The state the mirror topic is in.

Example

{
 "aggregations": [
   {
     "metric": "io.confluent.kafka.server/cluster_link_mirror_transition_in_error"
   }
 ],
 "filter": {
   "op": "AND",
   "filters": [
     {
       "field": "resource.kafka.id",
       "op": "EQ",
       "value": "lkc-71gba"
     }
   ]
 },
 "granularity": "PT1H",
 "group_by": [
   "metric.link_mirror_topic_reason",
   "metric.link_mirror_topic_state",
   "metric.link_name"
 ],
 "intervals": [
   "2024-01-07T00:00:00.000Z/2024-01-07T00:02:00.000Z"
 ]
}

Mirroring throughput¶

Source¶

io.confluent.kafka.server/cluster_link_source_response_bytes: Rate of mirroring throughput, in bytes per second, sent by the source. For a maximum of 30 links per cluster, the full link name is reported in the tag for this metric. Beyond this limit, the cluster link name is reported simply as _confluent. This limit can be raised for specific “aggregation” use cases (currently to as much as 100-200 links) by contacting Confluent Support. To learn more, see limits on Cluster types and networking.

Labels

None.

Destination¶

io.confluent.kafka.server/cluster_link_destination_response_bytes: Rate of mirroring throughput, in bytes per second, received by the destination. You can filter or group by cluster link name. For a maximum of 30 links per cluster, the full link name is reported in the tag for this metric. Beyond this limit, the cluster link name is reported simply as _confluent. This limit can be raised for specific “aggregation” use cases (currently to as much as 100-200 links) by contacting Confluent Support. To learn more, see limits on Cluster types and networking.

Labels

Label	Description
`link_name`	Name of the cluster link.

Example

Get mirroring throughput on a destination cluster for the past hour, grouped by cluster link name.

{
  "aggregations": [
    {
      "metric": "io.confluent.kafka.server/cluster_link_destination_response_bytes"
    }
  ],
  "filter": {
    "field": "resource.kafka.id",
    "op": "EQ",
    "value": "lkc-XXXXX"
  },
  "granularity": "PT1M",
  "group_by": [
    "metric.link_name"
  ],
  "intervals": [
    "now-1h/now"
  ],
  "limit": 25
}

Mirror Topics¶

io.confluent.kafka.server/cluster_link_mirror_topic_bytes: The number of bytes sent over each mirror topic on a destination cluster.

Labels

Label	Description
`link_name`	Name of the cluster link.
`topic`	Name of the mirror topic.

Example

Get the total number of bytes sent each day over the last week on a cluster link called from-west, grouped by mirror topic name.

{
  "aggregations": [
      {
          "metric": "io.confluent.kafka.server/cluster_link_mirror_topic_bytes"
      }
  ],
  "filter": {
      "op": "AND",
      "filters": [
          {
              "field": "resource.kafka.id",
              "op": "EQ",
              "value": "lkc-odq3o"
          },
          {
              "field": "metric.link_name",
              "op": "EQ",
              "value": "from-west"
          }
      ]
  },
  "granularity": "P1D",
  "group_by": [
      "metric.topic"
  ],
  "intervals": [
      "now-7d/now"
  ],
  "limit": 25
}
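
The daily rows returned by this query can be summed into a weekly total per mirror topic. This sketch assumes that each row in the response `data` array has a `metric.topic` key and a numeric `value`. The response shape is an assumption, and the function name is hypothetical.

```python
from collections import defaultdict


def bytes_per_topic(response: dict) -> dict[str, float]:
    """Sum the per-day byte counts for each mirror topic."""
    totals: dict[str, float] = defaultdict(float)
    for row in response.get("data", []):  # assumed response shape
        totals[row["metric.topic"]] += row["value"]
    return dict(totals)
```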

Mirroring lag¶

io.confluent.kafka.server/cluster_link_mirror_topic_offset_lag

Mirroring lag indicates how far the destination is behind the source in terms of processing events. It is measured as the maximum number of messages by which any partition of a mirror topic lags.

For example, given a mirror topic with three partitions: one partition lags 4 messages behind the source topic, another lags 24 messages behind, and the third lags 92 messages behind, the mirror topic’s lag is reported as 92.

Each mirror topic’s lag is measured once per minute. If your query’s granularity is coarser than one minute (PT1M), then the API returns the maximum of the per-minute lag values within each granularity period.

If your query does not group by topic, then it will return the maximum lag over all of the mirror topics that match the filter clause. For example, if your query filters on a specific link_name, then it will return the maximum lag among all of that link’s mirror topics.
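
The three levels of aggregation described above all take a maximum. The sketch below reproduces the three-partition example and then extends it with made-up values for a second topic and for extra per-minute samples. Those values and the function names are purely illustrative.

```python
def mirror_topic_lag(partition_lags: list[int]) -> int:
    """Lag of one mirror topic: the lag of its worst partition."""
    return max(partition_lags)


def rollup(lags: list[int]) -> int:
    """Coarser granularity and ungrouped queries also report the maximum."""
    return max(lags)


orders = mirror_topic_lag([4, 24, 92])   # the three-partition example above
payments = mirror_topic_lag([0, 3, 1])   # hypothetical second mirror topic

link_lag = rollup([orders, payments])    # query not grouped by topic
hourly_lag = rollup([orders, 40, 15])    # hypothetical per-minute samples in one period

print(orders, payments, link_lag, hourly_lag)  # prints "92 3 92 92"
```

Because every level reports a maximum, a single lagging partition is enough to raise the value reported for its topic, its link, and any coarser time period.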

Labels

Label	Description
`link_name`	Name of the cluster link.
`topic`	Name of the mirror topic.

Example

Get the maximum mirroring lag for each mirror topic on a destination cluster.

{
    "aggregations": [
        {
            "metric": "io.confluent.kafka.server/cluster_link_mirror_topic_offset_lag"
        }
    ],
    "filter": {
        "field": "resource.kafka.id",
        "op": "EQ",
        "value": "lkc-odq3o"
    },
    "granularity": "PT1M",
    "group_by": [
        "metric.topic"
    ],
    "intervals": [
        "2021-08-14T07:00:00Z/2021-08-14T08:00:00Z"
    ],
    "limit": 25
}

Number of active cluster links on a cluster (DEPRECATED)¶

io.confluent.kafka.server/cluster_active_link_count: Total number of active cluster links connected to the cluster. You can filter or group by the direction (source or destination) of the cluster link.

Labels

Label	Description
`mode`	Either `source` or `destination`.

Example

Get the count of active cluster links on a cluster for the past 24 hours, grouped by the direction of the link.

{
  "aggregations": [
    {
      "metric": "io.confluent.kafka.server/cluster_active_link_count"
    }
  ],
  "filter": {
    "field": "resource.kafka.id",
    "op": "EQ",
    "value": "lkc-XXXXX"
  },
  "granularity": "PT1H",
  "group_by": [
    "metric.mode"
  ],
  "intervals": [
    "now-24h/now"
  ],
  "limit": 25
}
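
Because this metric is deprecated, a similar count can probably be obtained from cluster_link_count by filtering on the active link state. The sketch below builds such a query body. Treating it as the replacement is an assumption based on the labels documented for cluster_link_count on this page, and the function name is hypothetical.

```python
def active_link_count_query(cluster_id: str) -> dict:
    """Query body that counts active links per direction with cluster_link_count."""
    return {
        "aggregations": [
            {"metric": "io.confluent.kafka.server/cluster_link_count"}
        ],
        "filter": {
            "op": "AND",
            "filters": [
                {"field": "resource.kafka.id", "op": "EQ", "value": cluster_id},
                # Assumed equivalent of the deprecated metric: active links only.
                {"field": "metric.link_state", "op": "EQ", "value": "active"},
            ],
        },
        "granularity": "PT1H",
        "group_by": ["metric.mode"],
        "intervals": ["now-24h/now"],
        "limit": 25,
    }
```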