SFTP Source Connector for Confluent Cloud¶

The fully-managed SFTP Source connector for Confluent Cloud watches an SFTP directory for files and reads the data as new files get written to the directory. Each file is parsed based on one of the following property values used with the input.file.parser.format configuration, which are also selectable in the UI.

BINARY
CSV
JSON (the default)
SCHEMALESS_JSON

Once a file has been read, it is placed into a finished.path directory or an error.path directory.

Note

This Quick Start is for the fully-managed Confluent Cloud connector. If you are installing the connector locally for Confluent Platform, see SFTP Source Connector for Confluent Platform.
If you require private networking for fully-managed connectors, make sure to set up the proper networking beforehand. For more information, see Manage Networking for Confluent Cloud Connectors.

Features¶

The SFTP Source connector supports the following features:

At least once delivery: The connector guarantees that records are delivered at least once to the Kafka topic (if the file row parsed is valid).
Supports one task: The connector supports running one task per connector instance.
Supported output data formats: The connector supports Avro, JSON Schema (JSON-SR), Protobuf, JSON (schemaless), Bytes, and String output record value formats. Schema Registry must be enabled to use a Schema Registry-based format (for example, Avro, JSON Schema, or Protobuf).

For more information and examples to use with the Confluent Cloud API for Connect, see the Confluent Cloud API for Connect Usage Examples section.

Limitations¶

Be sure to review the following information.

For connector limitations, see SFTP Source Connector limitations.
If you plan to use one or more Single Message Transforms (SMTs), see SMT Limitations.
If you plan to use Confluent Cloud Schema Registry, see Schema Registry Enabled Environments.

Quick Start¶

Use this quick start to get up and running with the Confluent Cloud SFTP Source connector. The quick start provides the basics of selecting the connector and configuring it to get data from an SFTP host.

Prerequisites

Authorized access to a Confluent Cloud cluster on Amazon Web Services (AWS), Microsoft Azure (Azure), or Google Cloud.
The Confluent CLI installed and configured for the cluster. See Install the Confluent CLI.
Access to an SFTP host.
Schema Registry must be enabled to use a Schema Registry-based format (for example, Avro, JSON_SR (JSON Schema), or Protobuf).
At least one source Kafka topic must exist in your Confluent Cloud cluster before creating the source connector.

Using the Confluent Cloud Console¶

Step 1: Launch your Confluent Cloud cluster¶

See the Quick Start for Confluent Cloud for installation instructions.

Step 2: Add a connector¶

In the left navigation menu, click Connectors. If you already have connectors in your cluster, click + Add connector.

Step 3: Select your connector¶

Click the SFTP Source connector card.

Step 4: Enter the connector details¶

Note

Make sure you have all your prerequisites completed.
An asterisk ( * ) designates a required entry.
The steps provide information about how to use the required configuration properties. See Configuration Properties for other configuration property values and descriptions.

At the Add SFTP Source Connector screen, complete the following:

Select the topic you want to send data to from the Topics list. To create a new topic, click +Add new topic.

Enter the SFTP Details:
- SFTP Host: The host address for the SFTP server. For example, 192.168.1.231.
- SFTP Port: The port number of the SFTP server. Defaults to 22.
- Username: The username the connector will use to connect to the host.
- Password: The password for the SFTP connection. A password is not required if you’re using TLS.
- Upload a PEM file.
- TLS passphrase: Used to decrypt the private key if the given private key is encrypted.
Click Continue.

Add the SFTP directory details:
- Input file parser format: The parser that should be used to fetch files from the SFTP directory.
- Output message format: The connector supports Avro, JSON Schema, Protobuf, JSON, Bytes, and String output Kafka record value formats. Schema Registry must be enabled to use a Schema Registry-based format (for example, Avro, JSON Schema, or Protobuf). See Schema Registry Enabled Environments for additional information.
- Input path: The SFTP directory from which the connector reads files to process. This directory must exist and be writable by the user running Kafka Connect.
- Finished path: The SFTP directory to place files that have been successfully processed. This directory must exist and be writable by the user running Kafka Connect.
- Error path: The SFTP directory to place files in which there are errors. This directory must exist and be writable by the user running Kafka Connect.
- Input file pattern (regex): A regular expression to check input file names against. This expression must match the entire filename. This is equivalent to Matcher.matches(). Using .* accepts all files in the directory.
- Schema generation enabled: Select true or false to indicate whether schemas should be dynamically generated.
Show advanced configurations
- Schema context: Select a schema context to use for this connector, if using a schema-based data format. This property defaults to the Default context, which configures the connector to use the default schema set up for Schema Registry in your Confluent Cloud environment. A schema context allows you to use separate schemas (like schema sub-registries) tied to topics in different Kafka clusters that share the same Schema Registry environment. For example, if you select a non-default context, a Source connector uses only that schema context to register a schema and a Sink connector uses only that schema context to read from. For more information about setting up a schema context, see What are schema contexts and when should you use them?.
- Batch size: The number of records that should be returned with each batch.
- Empty poll wait (ms): The amount of time to wait if a poll returns an empty list of records.
- Cleanup policy: Determines how the connector should clean up the files that have been processed.
- Behavior on error: Whether the task should halt when it encounters an error or continue to next file.
- File minimum age (ms): The amount of time in milliseconds after the file was last written to before the file can be processed.
- Transforms and Predicates: See the Single Message Transforms (SMT) documentation for details. See Unsupported transformations for a list of SMTs that are not supported with this connector.
Click Continue.

Step 5: Check for records¶

Verify that records are being produced in the Kafka topic.

For more information and examples to use with the Confluent Cloud API for Connect, see the Confluent Cloud API for Connect Usage Examples section.

Tip

When you launch a connector, a Dead Letter Queue topic is automatically created. See View Connector Dead Letter Queue Errors in Confluent Cloud for details.

Using the Confluent CLI¶

To set up and run the connector using the Confluent CLI, complete the following steps.

Note

Make sure you have all your prerequisites completed.

Step 1: List the available connectors¶

Enter the following command to list available connectors:

confluent connect plugin list

Step 2: List the connector configuration properties¶

Enter the following command to show the connector configuration properties:

confluent connect plugin describe <connector-plugin-name>

The command output shows the required and optional configuration properties.

Step 3: Create the connector configuration file¶

Create a JSON file that contains the connector configuration properties. The following example shows the required connector properties.

{
  "connector.class": "SftpSource",
  "name": "SftpSourceConnector_0",
  "kafka.api.key": "****************",
  "kafka.api.secret": "*********************************",
  "kafka.topic": "orders",
  "output.data.format": "JSON",
  "input.file.parser.format": "CSV",
  "schema.generation.enabled": "true",
  "sftp.host": "192.168.1.231",
  "sftp.username": "connect-user",
  "sftp.password": "****************",
  "input.path": "/path/to/data",
  "finished.path": "/path/to/finished",
  "error.path": "/path/to/error",
  "input.file.pattern": "csv-sftp-source.csv",
  "tasks.max": "1"
}
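
As a minimal sketch, the configuration file can also be generated from a script, which guarantees the result is syntactically valid JSON (no trailing commas, no stray characters in property names). The helper names build_config and write_config and the placeholder credentials are illustrative assumptions, not part of the connector or the CLI. The schema.generation.enabled spelling follows the Configuration Properties section.

```python
import json


def build_config(api_key: str, api_secret: str, sftp_password: str) -> dict:
    """Return the required SFTP Source properties from the example as a dict."""
    return {
        "connector.class": "SftpSource",
        "name": "SftpSourceConnector_0",
        "kafka.api.key": api_key,
        "kafka.api.secret": api_secret,
        "kafka.topic": "orders",
        "output.data.format": "JSON",
        "input.file.parser.format": "CSV",
        "schema.generation.enabled": "true",
        "sftp.host": "192.168.1.231",
        "sftp.username": "connect-user",
        "sftp.password": sftp_password,
        "input.path": "/path/to/data",
        "finished.path": "/path/to/finished",
        "error.path": "/path/to/error",
        "input.file.pattern": "csv-sftp-source.csv",
        "tasks.max": "1",
    }


def write_config(path: str, config: dict) -> None:
    """Serialize the configuration to a JSON file (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=2)
        handle.write("\n")
```

Calling write_config("sftp-source-config.json", build_config(...)) with real credentials would produce a file suitable for the --config-file option used in Step 4.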

Note the following property definitions:

"connector.class": Identifies the connector plugin name.
"name": Sets a name for your new connector.

"kafka.auth.mode": Identifies the connector authentication mode you want to use. There are two options: SERVICE_ACCOUNT or KAFKA_API_KEY (the default). To use an API key and secret, specify the configuration properties kafka.api.key and kafka.api.secret, as shown in the example configuration (above). To use a service account, specify the Resource ID in the property kafka.service.account.id=<service-account-resource-ID>. To list the available service account resource IDs, use the following command:
```
confluent iam service-account list
```
For example:
```
confluent iam service-account list

   Id     | Resource ID |       Name        |    Description
+---------+-------------+-------------------+-------------------
   123456 | sa-l1r23m   | sa-1              | Service account 1
   789101 | sa-l4d56p   | sa-2              | Service account 2
```
"kafka.topic": Enter the topic name or a comma-separated list of topic names.
"output.data.format": The connector supports Avro, JSON Schema (JSON_SR), Protobuf, JSON (schemaless), Bytes, and String output Kafka record value formats. Schema Registry must be enabled to use a Schema Registry-based format (for example, Avro, JSON Schema, or Protobuf). See Schema Registry Enabled Environments for additional information.
Note

Note the following relationship between output.data.format and the input.file.parser.format property.
- If you use BINARY for input.file.parser.format, you must use BYTES for output.data.format.
- If you use SCHEMALESS_JSON for input.file.parser.format, you must use STRING for output.data.format.
- If you leave input.file.parser.format set to JSON (the default) or set it to CSV, you can use any format for output.data.format.
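
The relationship in this note can be checked before submitting a configuration. The following is a sketch only: the function name check_format_pair is hypothetical, and the check runs locally rather than calling any Confluent API.

```python
# Output formats listed for output.data.format in this document.
ALL_OUTPUT_FORMATS = frozenset(
    {"AVRO", "JSON_SR", "PROTOBUF", "JSON", "STRING", "BYTES"}
)

# Allowed output.data.format values for each input.file.parser.format.
ALLOWED_OUTPUT_FORMATS = {
    "BINARY": frozenset({"BYTES"}),
    "SCHEMALESS_JSON": frozenset({"STRING"}),
    "JSON": ALL_OUTPUT_FORMATS,
    "CSV": ALL_OUTPUT_FORMATS,
}


def check_format_pair(config: dict) -> None:
    """Raise ValueError if the parser/output format combination is not allowed."""
    # input.file.parser.format defaults to JSON when it is not set.
    parser = config.get("input.file.parser.format", "JSON")
    output = config.get("output.data.format")
    if parser not in ALLOWED_OUTPUT_FORMATS:
        raise ValueError(f"unknown input.file.parser.format: {parser!r}")
    if output not in ALLOWED_OUTPUT_FORMATS[parser]:
        allowed = ", ".join(sorted(ALLOWED_OUTPUT_FORMATS[parser]))
        raise ValueError(
            f"output.data.format {output!r} is not valid with parser "
            f"{parser!r}; expected one of: {allowed}"
        )
```

For example, a configuration with BINARY as the parser and JSON as the output format is rejected, while CSV with AVRO passes.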

"input.file.parser.format": The parser format used to parse fetched files from the SFTP directory. Defaults to JSON. Options are BINARY, CSV, JSON, SCHEMALESS_JSON.

Important

If you use JSON (the default) or CSV as the input.file.parser.format, you must add the schema.generation.enabled property and set it to true. If you set this property to false, you must provide a key.schema and a value.schema.

The key.schema and value.schema properties require the actual schema, not the schema ID. To generate the schema, use the tool available here: https://github.com/jcustenborder/kafka-connect-spooldir?tab=readme-ov-file#tip-1

Schema example:

key.schema={\"name\" : \"com.example.users.UserKey\",\"type\" : \"STRUCT\",\"isOptional\" : false,
  \"fieldSchemas\" : {\"id\" : {\"type\" : \"INT64\",\"isOptional\" : false}}}
value.schema={\"name\" : \"com.example.users.User\",\"type\" : \"STRUCT\",\"isOptional\" : false,
  \"fieldSchemas\" : {\"id\" : {\"type\" : \"INT64\",\"isOptional\" : false},
  \"first_name\" : {\"type\" : \"STRING\",\"isOptional\" : true},\"last_name\" : {\"type\" : \"STRING\",
  \"isOptional\" : true},\"email\" : {\"type\" : \"STRING\",\"isOptional\" : true},
  \"gender\" : {\"type\" : \"STRING\",\"isOptional\" : true},\"ip_address\" : {\"type\" : \"STRING\",
  \"isOptional\" : true},\"last_login\" : {\"type\" : \"STRING\",\"isOptional\" : true},
  \"account_balance\" : {\"name\" : \"org.apache.kafka.connect.data.Decimal\",\"type\" : \"BYTES\",
  \"version\" : 1,\"parameters\" : {\"scale\" : \"2\"},\"isOptional\" : true},
  \"country\" : {\"type\" : \"STRING\",\"isOptional\" : true},\"favorite_color\" : {\"type\" : \"STRING\",
  \"isOptional\" : true}}}

"sftp.host": Enter the host address for the SFTP server. For example 192.168.1.231. Note that the port defaults to 22. To change this, add the property "sftp.port".
"sftp.username": Enter the user name that the connector will use to connect to the host. The "sftp.password" property is not required if a PEM file is used for key based authentication to the host.
"input.path": Add the SFTP directory where the connector places files that are successfully processed. This directory must exist and be writable by the connector.
"finished.path": Add the SFTP directory from which the connector reads files that will be processed. This directory must exist and be writable by the connector.
"error.path": Add the SFTP directory where the connector places files in which there are errors. This directory must exist and be writable by the connector.
"input.file.pattern": Add a regular expression to check input file names against. This expression must match the entire filename. The equivalent of Matcher.matches(). Using .* accepts all files in the directory.
"tasks.max": The connector supports running one tasks per connector.

Single Message Transforms: See the Single Message Transforms (SMT) documentation for details about adding SMTs using the CLI.

See Configuration Properties for all property values and descriptions.

Step 4: Load the properties file and create the connector¶

Enter the following command to load the configuration and start the connector:

confluent connect cluster create --config-file <file-name>.json

For example:

confluent connect cluster create --config-file sftp-source-config.json

Example output:

Created connector SftpSourceConnector_0 lcc-do6vzd

Step 5: Check the connector status¶

Enter the following command to check the connector status:

confluent connect cluster list

Example output:

ID           |             Name            | Status  | Type   | Trace
+------------+-----------------------------+---------+--------+-------+
lcc-do6vzd   | SftpSourceConnector_0       | RUNNING | source |       |

Step 6: Check the Kafka topic¶

Verify that records are being produced in the Kafka topic.

For more information and examples to use with the Confluent Cloud API for Connect, see the Confluent Cloud API for Connect Usage Examples section.

Tip

When you launch a connector, a Dead Letter Queue topic is automatically created. See View Connector Dead Letter Queue Errors in Confluent Cloud for details.

Configuration Properties¶

Use the following configuration properties with the fully-managed connector. For self-managed connector property definitions and other details, see the connector docs in Self-managed connectors for Confluent Platform.

How should we connect to your data?¶

name

Sets a name for your connector.

Type: string
Valid Values: A string at most 64 characters long
Importance: high

Kafka Cluster credentials¶

kafka.auth.mode

Kafka Authentication mode. It can be one of KAFKA_API_KEY or SERVICE_ACCOUNT. It defaults to KAFKA_API_KEY mode.

Type: string
Default: KAFKA_API_KEY
Valid Values: KAFKA_API_KEY, SERVICE_ACCOUNT
Importance: high

kafka.api.key

Kafka API Key. Required when kafka.auth.mode==KAFKA_API_KEY.

Type: password
Importance: high

kafka.service.account.id

The Service Account that will be used to generate the API keys to communicate with Kafka Cluster.

Type: string
Importance: high

kafka.api.secret

Secret associated with Kafka API key. Required when kafka.auth.mode==KAFKA_API_KEY.

Type: password
Importance: high
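
The two authentication modes require different properties. The sketch below assembles the credential properties for either mode. The function name kafka_auth_properties is hypothetical, and the validation is a local convenience, not the connector's own logic.

```python
from typing import Optional


def kafka_auth_properties(
    mode: str = "KAFKA_API_KEY",
    *,
    api_key: Optional[str] = None,
    api_secret: Optional[str] = None,
    service_account_id: Optional[str] = None,
) -> dict:
    """Return the Kafka credential properties for the chosen kafka.auth.mode."""
    if mode == "KAFKA_API_KEY":
        # kafka.api.key and kafka.api.secret are required in this mode.
        if not api_key or not api_secret:
            raise ValueError("KAFKA_API_KEY mode needs api_key and api_secret")
        return {
            "kafka.auth.mode": mode,
            "kafka.api.key": api_key,
            "kafka.api.secret": api_secret,
        }
    if mode == "SERVICE_ACCOUNT":
        # The service account is identified by its resource ID, e.g. sa-l1r23m.
        if not service_account_id:
            raise ValueError("SERVICE_ACCOUNT mode needs service_account_id")
        return {
            "kafka.auth.mode": mode,
            "kafka.service.account.id": service_account_id,
        }
    raise ValueError(f"unsupported kafka.auth.mode: {mode!r}")
```

The returned dictionary can be merged into the rest of the connector configuration before it is written to the JSON file.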

Which topic do you want to send data to?¶

kafka.topic

Identifies the topic name to write the data to.

Type: string
Importance: high

Schema Config¶

schema.context.name

Add a schema context name. A schema context represents an independent scope in Schema Registry. It is a separate sub-schema tied to topics in different Kafka clusters that share the same Schema Registry instance. If not used, the connector uses the default schema configured for Schema Registry in your Confluent Cloud environment.

Type: string
Default: default
Importance: medium

Output messages¶

output.data.format

Sets the output message format. Valid entries are AVRO, JSON_SR, PROTOBUF, JSON, STRING, or BYTES. Note that you must have Confluent Cloud Schema Registry configured if using a schema-based message format such as AVRO, JSON_SR, or PROTOBUF.

Type: string
Importance: high

Input file parser format¶

input.file.parser.format

Parser used to parse files fetched from the SFTP directory. Options are BINARY, CSV, JSON, and SCHEMALESS_JSON.

Type: string
Default: JSON
Importance: high

SFTP Details¶

sftp.host

Host address of the SFTP server.

Type: string
Importance: high

sftp.port

Port number of the SFTP server.

Type: int
Default: 22
Importance: medium

sftp.username

Username for the SFTP connection.

Type: string
Importance: high

sftp.password

Password for the SFTP connection (not required if using TLS).

Type: password
Importance: high

tls.pemfile

PEM file to be used for authentication via TLS.

Type: password
Importance: high

tls.passphrase

Passphrase that will be used to decrypt the private key if the given private key is encrypted.

Type: password
Importance: high

SFTP directory¶

input.path

The SFTP directory from which the connector reads files to process. This directory must exist and be writable by the user running Kafka Connect.

Type: string
Importance: high

finished.path

The SFTP directory to place files that have been successfully processed. This directory must exist and be writable by the user running Kafka Connect.

Type: string
Importance: high

error.path

The SFTP directory to place files in which there are error(s). This directory must exist and be writable by the user running Kafka Connect.

Type: string
Importance: high

File System¶

cleanup.policy

Determines how the connector should clean up the files that have been successfully processed. NONE leaves the files in place, which could cause them to be reprocessed if the connector is restarted. DELETE removes the file from the filesystem. MOVE moves the file to the finished directory.

Type: string
Default: MOVE
Importance: medium

input.file.pattern

Regular expression to check input file names against. This expression must match the entire filename. The equivalent of Matcher.matches().

Type: string
Importance: high
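
Because the pattern must match the entire filename, a candidate expression can be tried locally before deploying. The sketch below uses Python's re.fullmatch as an analog of Java's Matcher.matches(). This is an approximation: the connector uses Java regular expressions, whose syntax differs from Python's in some advanced features. The function name matches_input_pattern is hypothetical.

```python
import re


def matches_input_pattern(pattern: str, filename: str) -> bool:
    """Return True only if the pattern matches the whole filename."""
    # re.fullmatch anchors at both ends, like Matcher.matches() in Java.
    return re.fullmatch(pattern, filename) is not None


# A pattern that only matches a prefix does not accept the file.
print(matches_input_pattern("orders", "orders.csv"))
# ".*" accepts every filename in the directory.
print(matches_input_pattern(".*", "orders.csv"))
# An escaped dot restricts the match to files ending in ".csv".
print(matches_input_pattern(r".*\.csv", "orders.csv.tmp"))
```

Note that an unescaped dot, as in csv-sftp-source.csv, matches any single character, so escaping it (csv-sftp-source\.csv) gives a stricter match.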

behavior.on.error

Whether the task should halt when it encounters an error or continue to the next file.

Type: string
Default: FAIL
Importance: high

file.minimum.age.ms

The amount of time in milliseconds after the file was last written to before the file can be processed. With the default of 0, the connector processes all files regardless of age.

Type: long
Default: 0
Importance: low
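
The age check can be pictured as a simple comparison between the current time and the file's last-modified time. This is an illustrative sketch, not the connector's implementation. The function name is hypothetical, and whether the real comparison is inclusive at the boundary is an assumption.

```python
def is_old_enough(
    last_modified_ms: int, now_ms: int, file_minimum_age_ms: int = 0
) -> bool:
    """Return True if the file has been idle for at least the minimum age."""
    # Assumption: a file exactly at the minimum age counts as eligible.
    return now_ms - last_modified_ms >= file_minimum_age_ms
```

Setting a non-zero minimum age is a common way to avoid picking up files that an upstream process is still writing.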

Connection details¶

batch.size

The number of records that should be returned with each batch.

Type: int
Default: 1000
Importance: low

empty.poll.wait.ms

The amount of time to wait if a poll returns an empty list of records.

Type: long
Default: 250
Importance: low

Schema¶

key.schema

The schema for the key written to Kafka. Set the actual schema, not the schema ID. To generate the schema, use the tool available here: https://github.com/jcustenborder/kafka-connect-spooldir?tab=readme-ov-file#tip-1

Type: string
Importance: high

value.schema

The schema for the value written to Kafka. Set the actual schema, not the schema ID. To generate the schema, use the tool available here: https://github.com/jcustenborder/kafka-connect-spooldir?tab=readme-ov-file#tip-1

Type: string
Importance: high
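
Because key.schema and value.schema hold a full schema as a string, the schema's own quotes must be escaped when it is embedded in a JSON configuration file. As a sketch, serializing twice with a JSON library handles the escaping automatically. The key schema below is the one from the schema example in this document, and the helper name schema_property is hypothetical.

```python
import json

# Key schema taken from the schema example in this document.
key_schema = {
    "name": "com.example.users.UserKey",
    "type": "STRUCT",
    "isOptional": False,
    "fieldSchemas": {"id": {"type": "INT64", "isOptional": False}},
}


def schema_property(schema: dict) -> str:
    """Serialize a schema dict to the string stored in key.schema/value.schema."""
    return json.dumps(schema)


# The outer json.dumps escapes the inner quotes as \" in the config file text.
config_fragment = {"key.schema": schema_property(key_schema)}
print(json.dumps(config_fragment, indent=2))
```

Parsing the stored string with json.loads returns the original schema structure, which is a quick way to confirm that nothing was lost in the escaping.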

Schema Generation¶

schema.generation.enabled

Flag to determine if schemas should be dynamically generated. If set to true, key.schema and value.schema can be omitted, but schema.generation.key.name and schema.generation.value.name must be set.

Type: boolean
Importance: medium

schema.generation.key.fields

The field(s) to use to build a key schema. This is only used during schema generation.

Type: list
Importance: medium

schema.generation.key.name

The name of the generated key schema.

Type: string
Importance: medium

schema.generation.value.name

The name of the generated value schema.

Type: string
Importance: medium

Timestamps¶

timestamp.mode

Determines how the connector will set the timestamp for the ConnectRecord. If set to FIELD then the timestamp will be read from a field in the value. This field cannot be optional and must be a Timestamp. Specify the field in timestamp.field. If set to FILE_TIME then the last modified time of the file will be used. If set to PROCESS_TIME the time the record is read will be used.

Type: string
Importance: medium

timestamp.field

The field in the value schema that will contain the parsed timestamp for the record. This field cannot be marked as optional and must be a Timestamp (https://kafka.apache.org/0102/javadoc/org/apache/kafka/connect/data/Schema.html).

Type: string
Importance: medium
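
The three timestamp.mode values can be summarized as a selection between three time sources. The sketch below is only a model of the behavior described above. The function name and parameters are hypothetical, and the parsed record value is represented as a plain dictionary.

```python
import time
from datetime import datetime
from typing import Optional


def record_timestamp_ms(
    mode: str,
    *,
    value: Optional[dict] = None,
    timestamp_field: Optional[str] = None,
    file_mtime_ms: Optional[int] = None,
    now_ms: Optional[int] = None,
) -> int:
    """Pick the record timestamp (epoch milliseconds) for a timestamp.mode."""
    if mode == "FIELD":
        # The field named by timestamp.field must be present and a timestamp.
        field_value = (value or {}).get(timestamp_field)
        if not isinstance(field_value, datetime):
            raise TypeError("timestamp.field must hold a non-null timestamp")
        return int(field_value.timestamp() * 1000)
    if mode == "FILE_TIME":
        # Use the last modified time of the source file.
        if file_mtime_ms is None:
            raise ValueError("FILE_TIME needs the file's last modified time")
        return file_mtime_ms
    if mode == "PROCESS_TIME":
        # Use the time at which the record is read.
        return now_ms if now_ms is not None else int(time.time() * 1000)
    raise ValueError(f"unsupported timestamp.mode: {mode!r}")
```

In this model a timezone-aware datetime is recommended for FIELD mode, because Python interprets naive datetimes in the local timezone.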

parser.timestamp.timezone

The timezone that all of the dates will be parsed with.

Type: string
Importance: low

parser.timestamp.date.formats

The date formats that are expected in the file. This is a list of strings that will be used to parse the date fields in order. The most accurate date format should be the first in the list. Take a look at the Java documentation for more info. https://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html

Type: list
Importance: low
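
The "try each format in order" behavior can be sketched as follows. This is a Python analog, not the connector's code: it uses strptime directives such as %Y-%m-%d, whereas the connector expects Java SimpleDateFormat patterns such as yyyy-MM-dd. The function name parse_with_formats is hypothetical.

```python
from datetime import datetime
from typing import Sequence


def parse_with_formats(text: str, formats: Sequence[str]) -> datetime:
    """Return the result of the first format in the list that parses the text."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            # This format did not match; fall through to the next one.
            continue
    raise ValueError(f"no configured format matches {text!r}")


# The most specific format is listed first, then progressively looser ones.
FORMATS = ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d"]
print(parse_with_formats("2024-01-02 03:04:05", FORMATS))
print(parse_with_formats("2024-01-02", FORMATS))
```

Listing the most specific format first matters because the first format that succeeds determines the parsed value.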

CSV Parsing¶

csv.skip.lines

Number of lines to skip in the beginning of the file.

Type: int
Default: 0
Importance: low

csv.separator.char

The character that separates each field in the form of an integer. Typically in a CSV this is a ,(44) character. A TSV would use a tab(9) character. If csv.separator.char is defined as a null(0), then the RFC 4180 parser must be utilized by default. This is the equivalent of csv.rfc.4180.parser.enabled = true.

Type: int
Default: 44
Importance: low

csv.quote.char

The character that is used to quote a field. Typically in a CSV this is a " (34) character. Quoting is typically needed when the csv.separator.char character appears within the data.

Type: int
Default: 34
Importance: low

csv.escape.char

The character, as an integer, to use when a special character is encountered. The default escape character is typically a \ (92).

Type: int
Default: 92
Importance: low
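
The separator, quote, and escape properties take the integer character code rather than the character itself. As a small sketch, the codes can be looked up with Python's ord. The helper name char_code and the tab-separated override dictionary are illustrative assumptions.

```python
def char_code(character: str) -> int:
    """Return the integer code for a single character."""
    if len(character) != 1:
        raise ValueError("expected exactly one character")
    return ord(character)


# Hypothetical overrides for a tab-separated file; values are written as
# strings, matching the style of the example connector configuration.
tsv_overrides = {
    "csv.separator.char": str(char_code("\t")),
    "csv.quote.char": str(char_code('"')),
    "csv.escape.char": str(char_code("\\")),
}
print(tsv_overrides)
```

The codes for comma, tab, double quote, and backslash are 44, 9, 34, and 92, which match the defaults and examples listed for these properties.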

csv.strict.quotes

Sets the strict quotes setting - if true, characters outside the quotes are ignored.

Type: string
Default: false
Importance: low

csv.ignore.leading.whitespace

Sets the ignore leading whitespace setting - if true, white space in front of a quote in a field is ignored.

Type: string
Importance: low

csv.ignore.quotations

Sets the ignore quotations mode - if true, quotations are ignored.

Type: string
Default: false
Importance: low

csv.keep.carriage.return

Flag to determine if the carriage return at the end of the line should be maintained.

Type: string
Default: false
Importance: low

csv.null.field.indicator

Indicator to determine how the CSV Reader can determine if a field is null. Valid values are EMPTY_SEPARATORS, EMPTY_QUOTES, BOTH, NEITHER. For more information see http://opencsv.sourceforge.net/apidocs/com/opencsv/enums/CSVReaderNullFieldIndicator.html.

Type: string
Default: NEITHER
Importance: low

csv.first.row.as.header

Flag to indicate if the first row of data contains the header of the file. If true, the position of the columns is determined by the first row of the CSV. The column position is inferred from the position of the schema supplied in value.schema. If set to true, the number of columns must be greater than or equal to the number of fields in the schema.

Type: string
Importance: medium

csv.file.charset

Character set to read the file with.

Type: string
Default: UTF-8
Importance: low

ui.csv.pre.validate.file.enabled

Flag to enable validating the integrity of all records in the CSV file before processing any of its records. For example, if any of the records have a linefeed within an unquoted field, which would incorrectly break the record at that point, then the entire file is considered erroneous and no records from that file are processed. The failed file is moved to the configured error path. Important: If the number of records in a file is larger than the configured batch size, then portions of the file may be retrieved from the SFTP server by the connector more than once.

Type: string
Default: NO
Valid Values: NO, YES
Importance: low

Number of tasks for this connector¶

tasks.max

Maximum number of tasks for the connector.

Type: int
Valid Values: [1,…,1]
Importance: high

Next Steps¶

For an example that shows fully-managed Confluent Cloud connectors in action with Confluent Cloud ksqlDB, see the Cloud ETL Demo. This example also shows how to use Confluent CLI to manage your resources in Confluent Cloud.