SFTP Source Connector for Confluent Cloud¶
The fully-managed SFTP Source connector for Confluent Cloud watches an SFTP directory
for files and reads the data as new files get written to the directory. Each
file is parsed based on one of the following property values used with the
input.file.parser.format configuration property. These values are also selectable in the UI:
- BINARY
- CSV
- JSON (the default)
- SCHEMALESS_JSON
Once a file has been read, it is placed into the finished.path directory or the error.path directory.
Note
This is a Quick Start for the fully-managed cloud connector. If you are installing the connector locally for Confluent Platform, see SFTP Source Connector for Confluent Platform.
Features¶
The SFTP Source connector supports the following features:
- At least once delivery: The connector guarantees that records are delivered at least once to the Kafka topic (if the file row parsed is valid).
- Supports one task: The connector supports running one task per connector instance.
- Supported output data formats: The connector supports Avro, JSON Schema (JSON-SR), Protobuf, JSON (schemaless), Bytes, and String output record value formats. Schema Registry must be enabled to use a Schema Registry-based format (for example, Avro, JSON Schema, or Protobuf).
For more information and examples to use with the Confluent Cloud API for Connect, see the Confluent Cloud API for Managed and Custom Connectors section.
Limitations¶
Be sure to review the following information.
- For connector limitations, see SFTP Source Connector limitations.
- If you plan to use one or more Single Message Transforms (SMTs), see SMT Limitations.
- If you plan to use Confluent Cloud Schema Registry, see Schema Registry Enabled Environments.
Quick Start¶
Use this quick start to get up and running with the Confluent Cloud SFTP Source connector. The quick start provides the basics of selecting the connector and configuring it to get data from an SFTP host.
- Prerequisites
- Authorized access to a Confluent Cloud cluster on Amazon Web Services (AWS), Microsoft Azure (Azure), or Google Cloud.
- The Confluent CLI installed and configured for the cluster. See Install the Confluent CLI.
- Access to an SFTP host.
- Schema Registry must be enabled to use a Schema Registry-based format (for example, Avro, JSON_SR (JSON Schema), or Protobuf).
- At least one source Kafka topic must exist in your Confluent Cloud cluster before creating the source connector.
Using the Confluent Cloud Console¶
Step 1: Launch your Confluent Cloud cluster¶
See the Quick Start for Confluent Cloud for installation instructions.
Step 2: Add a connector¶
In the left navigation menu, click Connectors. If you already have connectors in your cluster, click + Add connector.
Step 3: Select your connector¶
Click the SFTP Source connector card.
Note
- Make sure you have all your prerequisites completed.
- An asterisk ( * ) designates a required entry.
- The steps provide information about how to use the required configuration properties. See Configuration Properties for other configuration property values and descriptions.
At the Add SFTP Source Connector screen, complete the following:
- Select the way you want to provide Kafka Cluster credentials. You can
choose one of the following options:
- My account: This setting allows your connector to globally access everything that you have access to. With a user account, the connector uses an API key and secret to access the Kafka cluster. This option is not recommended for production.
- Service account: This setting limits the access for your connector by using a service account. This option is recommended for production.
- Use an existing API key: This setting allows you to specify an API key and a secret pair. You can use an existing pair or create a new one. This method is not recommended for production environments.
- Click Continue.
- Enter the SFTP Details:
- SFTP Host: The host address for the SFTP server. For example, 192.168.1.231.
- SFTP Port: The port number of the SFTP server. Defaults to 22.
- Username: The username the connector will use to connect to the host.
- Password: The password for the SFTP connection. A password is not required if you’re using TLS.
- Upload a PEM file.
- TLS passphrase: Used to decrypt the private key if the given private key is encrypted.
- Click Continue.
Add the SFTP directory details:
- Input file parser format: The parser that should be used to fetch files from the SFTP directory.
- Output message format: The connector supports Avro, JSON Schema, Protobuf, JSON, Bytes, and String output Kafka record value formats. Schema Registry must be enabled to use a Schema Registry-based format (for example, Avro, JSON Schema, or Protobuf). See Schema Registry Enabled Environments for additional information.
- Input path: The SFTP directory from which to read files that will be processed. This directory must exist and be writable by the user running Kafka Connect.
- Finished path: The SFTP directory to place files that have been successfully processed. This directory must exist and be writable by the user running Kafka Connect.
- Error path: The SFTP directory to place files in which there are errors. This directory must exist and be writable by the user running Kafka Connect.
- Input file pattern (regex): A regular expression to check input file names against. This expression must match the entire filename, which is equivalent to Matcher.matches(). Using .* accepts all files in the directory.
- Schema generation enabled: Select whether schemas should be dynamically generated (true or false).
Show advanced configurations
Schema context: Select a schema context to use for this connector, if using a schema-based data format. This property defaults to the Default context, which configures the connector to use the default schema set up for Schema Registry in your Confluent Cloud environment. A schema context allows you to use separate schemas (like schema sub-registries) tied to topics in different Kafka clusters that share the same Schema Registry environment. For example, if you select a non-default context, a Source connector uses only that schema context to register a schema and a Sink connector uses only that schema context to read from. For more information about setting up a schema context, see What are schema contexts and when should you use them?.
Batch size: The number of records that should be returned with each batch.
Empty poll wait (ms): The amount of time to wait if a poll returns an empty list of records.
Cleanup policy: Determines how the connector should clean up the files that have been processed.
Behavior on error: Whether the task should halt when it encounters an error or continue to next file.
File minimum age (ms): The amount of time in milliseconds after the file was last written to before the file can be processed.
Transforms and Predicates: See the Single Message Transforms (SMT) documentation for details. See Unsupported transformations for a list of SMTs that are not supported with this connector.
Click Continue.
Based on the number of topic partitions you select, you will be provided with a recommended number of tasks.
- To change the number of tasks, use the Range Slider to select the desired number of tasks.
- Click Continue.
Verify the connection details by previewing the running configuration.
Once you’ve validated that the properties are configured to your satisfaction, click Launch.
The status for the connector should go from Provisioning to Running.
Step 5: Check for records¶
Verify that records are being produced in the Kafka topic.
For more information and examples to use with the Confluent Cloud API for Connect, see the Confluent Cloud API for Managed and Custom Connectors section.
Tip
When you launch a connector, a Dead Letter Queue topic is automatically created. See Confluent Cloud Dead Letter Queue for details.
Using the Confluent CLI¶
To set up and run the connector using the Confluent CLI, complete the following steps.
Note
Make sure you have all your prerequisites completed.
Step 1: List the available connectors¶
Enter the following command to list available connectors:
confluent connect plugin list
Step 2: List the connector configuration properties¶
Enter the following command to show the connector configuration properties:
confluent connect plugin describe <connector-plugin-name>
The command output shows the required and optional configuration properties.
Step 3: Create the connector configuration file¶
Create a JSON file that contains the connector configuration properties. The following example shows the required connector properties.
{
  "connector.class": "SftpSource",
  "name": "SftpSourceConnector_0",
  "kafka.api.key": "****************",
  "kafka.api.secret": "*********************************",
  "kafka.topic": "orders",
  "output.data.format": "JSON",
  "input.file.parser.format": "CSV",
  "schema.generation.enabled": "true",
  "sftp.host": "192.168.1.231",
  "sftp.username": "connect-user",
  "sftp.password": "****************",
  "input.path": "/path/to/data",
  "finished.path": "/path/to/finished",
  "error.path": "/path/to/error",
  "input.file.pattern": "csv-sftp-source.csv",
  "tasks.max": "1"
}
Note the following property definitions:
- "connector.class": Identifies the connector plugin name.
- "name": Sets a name for your new connector.
- "kafka.auth.mode": Identifies the connector authentication mode you want to use. There are two options: SERVICE_ACCOUNT or KAFKA_API_KEY (the default). To use an API key and secret, specify the configuration properties kafka.api.key and kafka.api.secret, as shown in the example configuration (above). To use a service account, specify the Resource ID in the property kafka.service.account.id=<service-account-resource-ID>. To list the available service account resource IDs, use the following command:
confluent iam service-account list
For example:
confluent iam service-account list
   Id     | Resource ID |       Name        |    Description
+---------+-------------+-------------------+-------------------
   123456 | sa-l1r23m   | sa-1              | Service account 1
   789101 | sa-l4d56p   | sa-2              | Service account 2
- "kafka.topic": Enter the topic name or a comma-separated list of topic names.
- "output.data.format": The connector supports Avro, JSON Schema (JSON_SR), Protobuf, JSON (schemaless), Bytes, and String output Kafka record value formats. Schema Registry must be enabled to use a Schema Registry-based format (for example, Avro, JSON Schema, or Protobuf). See Schema Registry Enabled Environments for additional information.
Note
Note the following relationship between output.data.format and the input.file.parser.format property.
- If you use BINARY for input.file.parser.format, you must use BYTES for output.data.format.
- If you use SCHEMALESS_JSON for input.file.parser.format, you must use STRING for output.data.format.
- If you leave this set to JSON (the default) or use CSV for input.file.parser.format, you can use any format for output.data.format.
- "input.file.parser.format": The parser format used to parse fetched files from the SFTP directory. Defaults to JSON. Options are BINARY, CSV, JSON, and SCHEMALESS_JSON.
Important
If you use JSON (the default) or CSV as the input.file.parser.format, then you must add the property schema.generation.enabled and set it to true. If you set this property to false, you must provide a key.schema and a value.schema. The key.schema and value.schema properties require the actual schema, not the schema ID. To generate the schema, use the tool available here.
Schema example:
key.schema={\"name\" : \"com.example.users.UserKey\",\"type\" : \"STRUCT\",\"isOptional\" : false, \"fieldSchemas\" : {\"id\" : {\"type\" : \"INT64\",\"isOptional\" : false}}}
value.schema={\"name\" : \"com.example.users.User\",\"type\" : \"STRUCT\",\"isOptional\" : false, \"fieldSchemas\" : {\"id\" : {\"type\" : \"INT64\",\"isOptional\" : false}, \"first_name\" : {\"type\" : \"STRING\",\"isOptional\" : true},\"last_name\" : {\"type\" : \"STRING\", \"isOptional\" : true},\"email\" : {\"type\" : \"STRING\",\"isOptional\" : true}, \"gender\" : {\"type\" : \"STRING\",\"isOptional\" : true},\"ip_address\" : {\"type\" : \"STRING\", \"isOptional\" : true},\"last_login\" : {\"type\" : \"STRING\",\"isOptional\" : true}, \"account_balance\" : {\"name\" : \"org.apache.kafka.connect.data.Decimal\",\"type\" : \"BYTES\", \"version\" : 1,\"parameters\" : {\"scale\" : \"2\"},\"isOptional\" : true}, \"country\" : {\"type\" : \"STRING\",\"isOptional\" : true},\"favorite_color\" : {\"type\" : \"STRING\", \"isOptional\" : true}}}
- "sftp.host": Enter the host address for the SFTP server. For example, 192.168.1.231. Note that the port defaults to 22. To change this, add the property "sftp.port".
- "sftp.username": Enter the user name that the connector will use to connect to the host. The "sftp.password" property is not required if a PEM file is used for key-based authentication to the host.
- "input.path": Add the SFTP directory from which the connector reads files that will be processed. This directory must exist and be writable by the connector.
- "finished.path": Add the SFTP directory where the connector places files that are successfully processed. This directory must exist and be writable by the connector.
- "error.path": Add the SFTP directory where the connector places files in which there are errors. This directory must exist and be writable by the connector.
- "input.file.pattern": Add a regular expression to check input file names against. This expression must match the entire filename, which is equivalent to Matcher.matches(). Using .* accepts all files in the directory.
- "tasks.max": The connector supports running one task per connector.
Single Message Transforms: See the Single Message Transforms (SMT) documentation for details about adding SMTs using the CLI.
See Configuration Properties for all property values and descriptions.
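The configuration file for this step can also be generated programmatically, which avoids hand-editing mistakes such as trailing commas or malformed keys. The following Python sketch is one possible approach, not part of the connector or the Confluent CLI. The helper name build_sftp_source_config, the REQUIRED_KEYS list, and the placeholder credentials are hypothetical; the property names are the ones used in this quick start, with the schema generation flag spelled as listed under Configuration Properties.

```python
import json

# Properties this quick start treats as required when authenticating with an
# API key (an assumed list for this sketch, based on the example above).
REQUIRED_KEYS = (
    "connector.class", "name", "kafka.api.key", "kafka.api.secret",
    "kafka.topic", "output.data.format", "input.file.parser.format",
    "sftp.host", "sftp.username", "input.path", "finished.path",
    "error.path", "input.file.pattern", "tasks.max",
)


def build_sftp_source_config(overrides):
    """Merge overrides into a few defaults and check required properties."""
    config = {
        "connector.class": "SftpSource",
        "output.data.format": "JSON",
        "input.file.parser.format": "CSV",
        "tasks.max": "1",  # the connector runs one task per instance
    }
    config.update(overrides)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError("missing required properties: " + ", ".join(missing))
    return config


config = build_sftp_source_config({
    "name": "SftpSourceConnector_0",
    "kafka.api.key": "<my-kafka-api-key>",        # placeholder
    "kafka.api.secret": "<my-kafka-api-secret>",  # placeholder
    "kafka.topic": "orders",
    "schema.generation.enabled": "true",
    "sftp.host": "192.168.1.231",
    "sftp.username": "connect-user",
    "sftp.password": "<my-sftp-password>",        # placeholder
    "input.path": "/path/to/data",
    "finished.path": "/path/to/finished",
    "error.path": "/path/to/error",
    "input.file.pattern": "csv-sftp-source.csv",
})

# json.dumps always emits syntactically valid JSON.
config_json = json.dumps(config, indent=2)
print(config_json)
```

The printed text can be saved to a file (for example, sftp-source-config.json) and passed to the CLI in the next step.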
Step 4: Load the properties file and create the connector¶
Enter the following command to load the configuration and start the connector:
confluent connect cluster create --config-file <file-name>.json
For example:
confluent connect cluster create --config-file sftp-source-config.json
Example output:
Created connector SftpSourceConnector_0 lcc-do6vzd
Step 5: Check the connector status¶
Enter the following command to check the connector status:
confluent connect cluster list
Example output:
ID | Name | Status | Type | Trace
+------------+-----------------------------+---------+--------+-------+
lcc-do6vzd | SftpSourceConnector_0 | RUNNING | source | |
Step 6: Check the Kafka topic¶
Verify that records are being produced at the Kafka topic.
For more information and examples to use with the Confluent Cloud API for Connect, see the Confluent Cloud API for Managed and Custom Connectors section.
Tip
When you launch a connector, a Dead Letter Queue topic is automatically created. See Confluent Cloud Dead Letter Queue for details.
Configuration Properties¶
Use the following configuration properties with the fully-managed connector. For self-managed connector property definitions and other details, see the connector docs in Self-managed connectors for Confluent Platform.
How should we connect to your data?¶
name
Sets a name for your connector.
- Type: string
- Valid Values: A string at most 64 characters long
- Importance: high
Kafka Cluster credentials¶
kafka.auth.mode
Kafka Authentication mode. It can be one of KAFKA_API_KEY or SERVICE_ACCOUNT. It defaults to KAFKA_API_KEY mode.
- Type: string
- Default: KAFKA_API_KEY
- Valid Values: KAFKA_API_KEY, SERVICE_ACCOUNT
- Importance: high
kafka.api.key
Kafka API Key. Required when kafka.auth.mode==KAFKA_API_KEY.
- Type: password
- Importance: high
kafka.service.account.id
The Service Account that will be used to generate the API keys to communicate with Kafka Cluster.
- Type: string
- Importance: high
kafka.api.secret
Secret associated with Kafka API key. Required when kafka.auth.mode==KAFKA_API_KEY.
- Type: password
- Importance: high
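As an illustration of how the credential properties above fit together, the following Python sketch selects the properties that go with each kafka.auth.mode value. It is a hedged example only: the function name auth_properties is hypothetical, and the rule that a service account ID must accompany SERVICE_ACCOUNT mode is inferred from the CLI quick start rather than from a documented validation rule.

```python
def auth_properties(mode="KAFKA_API_KEY", api_key=None, api_secret=None,
                    service_account_id=None):
    """Return the credential properties that match the chosen auth mode."""
    if mode == "KAFKA_API_KEY":
        # kafka.api.key and kafka.api.secret are required in this mode.
        if not api_key or not api_secret:
            raise ValueError("KAFKA_API_KEY mode needs an API key and secret")
        return {
            "kafka.auth.mode": mode,
            "kafka.api.key": api_key,
            "kafka.api.secret": api_secret,
        }
    if mode == "SERVICE_ACCOUNT":
        # Assumption: the service account resource ID is supplied here.
        if not service_account_id:
            raise ValueError("SERVICE_ACCOUNT mode needs a service account ID")
        return {
            "kafka.auth.mode": mode,
            "kafka.service.account.id": service_account_id,
        }
    raise ValueError("unknown kafka.auth.mode: " + str(mode))


service_account_props = auth_properties(
    mode="SERVICE_ACCOUNT", service_account_id="sa-l1r23m")
print(service_account_props)
```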
Which topic do you want to send data to?¶
kafka.topic
Identifies the topic name to write the data to.
- Type: string
- Importance: high
Schema Config¶
schema.context.name
Add a schema context name. A schema context represents an independent scope in Schema Registry. It is a separate sub-schema tied to topics in different Kafka clusters that share the same Schema Registry instance. If not used, the connector uses the default schema configured for Schema Registry in your Confluent Cloud environment.
- Type: string
- Default: default
- Importance: medium
Output messages¶
output.data.format
Sets the output message format. Valid entries are AVRO, JSON_SR, PROTOBUF, JSON, STRING, or BYTES. Note that you need to have Confluent Cloud Schema Registry configured if using a schema-based message format such as AVRO, JSON_SR, or PROTOBUF.
- Type: string
- Importance: high
Input file parser format¶
input.file.parser.format
Parser that should be used to parse fetched files from the SFTP directory.
- Type: string
- Default: JSON
- Importance: high
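The CLI quick start describes a dependency between input.file.parser.format and output.data.format: BINARY requires BYTES, SCHEMALESS_JSON requires STRING, and JSON or CSV allow any output format. A minimal Python sketch of that rule, useful as a pre-flight check before creating a connector, might look like the following. The names ALLOWED_OUTPUT_FORMATS and is_compatible are hypothetical, and the check is not performed by any Confluent tool shown here.

```python
ALL_OUTPUT_FORMATS = {"AVRO", "JSON_SR", "PROTOBUF", "JSON", "STRING", "BYTES"}

# Allowed output.data.format values for each input.file.parser.format value.
ALLOWED_OUTPUT_FORMATS = {
    "BINARY": {"BYTES"},
    "SCHEMALESS_JSON": {"STRING"},
    "JSON": ALL_OUTPUT_FORMATS,
    "CSV": ALL_OUTPUT_FORMATS,
}


def is_compatible(parser_format, output_format):
    """Return True if the output format is allowed for the parser format."""
    allowed = ALLOWED_OUTPUT_FORMATS.get(parser_format)
    if allowed is None:
        raise ValueError("unknown input.file.parser.format: " + parser_format)
    return output_format in allowed


print(is_compatible("CSV", "AVRO"))
print(is_compatible("BINARY", "JSON"))
```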
SFTP Details¶
sftp.host
Host address of the SFTP server.
- Type: string
- Importance: high
sftp.port
Port number of the SFTP server.
- Type: int
- Default: 22
- Importance: medium
sftp.username
Username for the SFTP connection.
- Type: string
- Importance: high
sftp.password
Password for the SFTP connection (not required if using TLS).
- Type: password
- Importance: high
tls.pemfile
PEM file to be used for authentication via TLS.
- Type: password
- Importance: high
tls.passphrase
Passphrase that will be used to decrypt the private key if the given private key is encrypted.
- Type: password
- Importance: high
SFTP directory¶
input.path
The SFTP directory to read files that will be processed. This directory must exist and be writable by the user running Kafka Connect.
- Type: string
- Importance: high
finished.path
The SFTP directory to place files that have been successfully processed. This directory must exist and be writable by the user running Kafka Connect.
- Type: string
- Importance: high
error.path
The SFTP directory to place files in which there are error(s). This directory must exist and be writable by the user running Kafka Connect.
- Type: string
- Importance: high
File System¶
cleanup.policy
Determines how the connector should clean up the files that have been successfully processed. NONE leaves the files in place, which could cause them to be reprocessed if the connector is restarted. DELETE removes the file from the filesystem. MOVE moves the file to the finished directory.
- Type: string
- Default: MOVE
- Importance: medium
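To make the three cleanup.policy values concrete, the following Python sketch mimics them on a local temporary directory. It is only an analogy under stated assumptions: the connector operates on a remote SFTP server, and the function apply_cleanup_policy and the file orders.csv are hypothetical names invented for this example.

```python
import shutil
import tempfile
from pathlib import Path


def apply_cleanup_policy(policy, processed_file, finished_dir):
    """Mimic NONE, DELETE, and MOVE on a local file (illustration only)."""
    processed_file = Path(processed_file)
    if policy == "NONE":
        return processed_file  # left in place; may be read again later
    if policy == "DELETE":
        processed_file.unlink()
        return None
    if policy == "MOVE":
        target = Path(finished_dir) / processed_file.name
        shutil.move(str(processed_file), str(target))
        return target
    raise ValueError("unknown cleanup.policy: " + str(policy))


with tempfile.TemporaryDirectory() as root:
    input_dir = Path(root) / "data"
    finished_dir = Path(root) / "finished"
    input_dir.mkdir()
    finished_dir.mkdir()
    source = input_dir / "orders.csv"
    source.write_text("id,amount\n1,9.99\n")

    moved_to = apply_cleanup_policy("MOVE", source, finished_dir)
    still_in_input = source.exists()
    present_in_finished = moved_to.exists()

print(still_in_input, present_in_finished)
```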
input.file.pattern
Regular expression to check input file names against. This expression must match the entire filename. The equivalent of Matcher.matches().
- Type: string
- Importance: high
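Because the pattern must match the entire filename, a partial match is not enough, and an unescaped dot matches any character. The following Python sketch illustrates the full-match behavior using re.fullmatch, which plays the same role as Java's Matcher.matches(). Note the assumption: the connector evaluates Java regular expressions, and Python's syntax differs in some details, so this is an approximation for simple patterns. The helper name accepts and the filenames are made up for the example.

```python
import re


def accepts(pattern, filename):
    """Return True only if the pattern matches the whole filename."""
    return re.fullmatch(pattern, filename) is not None


# ".*" accepts every file in the directory.
print(accepts(r".*", "anything.bin"))

# A pattern that covers only part of the name is rejected.
print(accepts(r"orders", "orders.csv"))

# An unescaped "." matches any character, so escape it to match a literal dot.
print(accepts(r"csv-sftp-source.csv", "csv-sftp-sourceXcsv"))
print(accepts(r"csv-sftp-source\.csv", "csv-sftp-sourceXcsv"))
```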
behavior.on.error
Whether the task should halt when it encounters an error or continue to the next file.
- Type: string
- Default: FAIL
- Importance: high
file.minimum.age.ms
The amount of time in milliseconds after the file was last written to before the file can be processed. With the default of 0, the connector processes all files irrespective of age.
- Type: long
- Default: 0
- Importance: low
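The minimum-age check can be pictured as a comparison between the current time and the file's last-modified time. The sketch below is a hedged illustration in Python: the function name is_old_enough is hypothetical, and whether the connector uses an inclusive (>=) or exclusive (>) comparison is an assumption, not something stated here.

```python
def is_old_enough(last_modified_ms, now_ms, minimum_age_ms=0):
    """Return True if the file has been untouched for at least minimum_age_ms.

    Assumption: the comparison is inclusive. With the default of 0, every
    file qualifies regardless of age.
    """
    return now_ms - last_modified_ms >= minimum_age_ms


# A file written 500 ms ago is skipped when the minimum age is 1000 ms.
print(is_old_enough(last_modified_ms=1_000, now_ms=1_500, minimum_age_ms=1_000))

# The same file qualifies once a full second has passed.
print(is_old_enough(last_modified_ms=1_000, now_ms=2_000, minimum_age_ms=1_000))
```

A nonzero value can help avoid picking up files that an upstream process is still writing.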
Connection details¶
batch.size
The number of records that should be returned with each batch.
- Type: int
- Default: 1000
- Importance: low
empty.poll.wait.ms
The amount of time to wait if a poll returns an empty list of records.
- Type: long
- Default: 250
- Importance: low
Schema¶
key.schema
The schema for the key written to Kafka.
- Type: string
- Importance: high
value.schema
The schema for the value written to Kafka.
- Type: string
- Importance: high
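The schema strings shown in the CLI quick start are JSON documents with name, type, isOptional, and fieldSchemas entries. As a hedged convenience, the following Python sketch builds such a string from a small field description instead of writing it by hand. The helper struct_schema is hypothetical, and the layout is copied from the schema example in this document rather than from a formal specification.

```python
import json


def struct_schema(name, fields):
    """Build a STRUCT schema dict; fields maps name -> (type, is_optional)."""
    return {
        "name": name,
        "type": "STRUCT",
        "isOptional": False,
        "fieldSchemas": {
            field: {"type": field_type, "isOptional": optional}
            for field, (field_type, optional) in fields.items()
        },
    }


key_schema = struct_schema("com.example.users.UserKey", {"id": ("INT64", False)})
key_schema_str = json.dumps(key_schema)
print(key_schema_str)

# Embedding the schema string in a JSON configuration file escapes its
# quotes, which is why the example schemas contain \" sequences.
config_fragment = json.dumps({"key.schema": key_schema_str})
print(config_fragment)
```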
Schema Generation¶
schema.generation.enabled
Flag to determine if schemas should be dynamically generated. If set to true, key.schema and value.schema can be omitted, but schema.generation.key.name and schema.generation.value.name must be set.
- Type: boolean
- Importance: medium
schema.generation.key.fields
The field(s) to use to build a key schema. This is only used during schema generation.
- Type: list
- Importance: medium
schema.generation.key.name
The name of the generated key schema.
- Type: string
- Importance: medium
schema.generation.value.name
The name of the generated value schema.
- Type: string
- Importance: medium
Timestamps¶
timestamp.mode
Determines how the connector will set the timestamp for the ConnectRecord. If set to FIELD then the timestamp will be read from a field in the value. This field cannot be optional and must be a Timestamp. Specify the field in timestamp.field. If set to FILE_TIME then the last modified time of the file will be used. If set to PROCESS_TIME the time the record is read will be used.
- Type: string
- Importance: medium
timestamp.field
The field in the value schema that will contain the parsed timestamp for the record. This field cannot be marked as optional and must be a Timestamp (https://kafka.apache.org/0102/javadoc/org/apache/kafka/connect/data/Schema.html).
- Type: string
- Importance: medium
parser.timestamp.timezone
The timezone that all of the dates will be parsed with.
- Type: string
- Importance: low
parser.timestamp.date.formats
The date formats that are expected in the file. This is a list of strings that will be used to parse the date fields in order. The most accurate date format should be the first in the list. For more information, see the Java SimpleDateFormat documentation: https://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html
- Type: list
- Importance: low
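The "in order" behavior means the first format in the list that parses a value wins, so more specific formats should come first. The Python sketch below illustrates only that ordering idea. It is an analogy under a clear assumption: the connector expects Java SimpleDateFormat patterns (such as yyyy-MM-dd), whereas this example uses Python strptime directives, and the function name parse_with_formats is hypothetical.

```python
from datetime import datetime


def parse_with_formats(text, formats):
    """Try each format in order and return the first successful parse."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue  # this format did not match; try the next one
    raise ValueError("no configured format matches: " + text)


# Most specific format first, as the property description recommends.
formats = ["%Y-%m-%dT%H:%M:%S", "%Y-%m-%d"]

with_time = parse_with_formats("2024-03-01T12:30:00", formats)
date_only = parse_with_formats("2024-03-01", formats)
print(with_time, date_only)
```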
CSV Parsing¶
csv.skip.lines
Number of lines to skip in the beginning of the file.
- Type: int
- Default: 0
- Importance: low
csv.separator.char
The character that separates each field, expressed as an integer. Typically in a CSV this is a comma (44). A TSV would use a tab (9). If csv.separator.char is set to null (0), the RFC 4180 parser is used by default. This is the equivalent of csv.rfc.4180.parser.enabled = true.
- Type: int
- Default: 44
- Importance: low
csv.quote.char
The character that is used to quote a field, expressed as an integer. Typically in a CSV this is a double quote (34). Quoting is typically needed when the csv.separator.char character appears within the data.
- Type: int
- Default: 34
- Importance: low
csv.escape.char
The character, as an integer, to use when a special character is encountered. The default escape character is typically a backslash, \ (92).
- Type: int
- Default: 92
- Importance: low
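The separator, quote, and escape characters are configured as integer character codes rather than as literal characters. As a small aid, the following Python sketch derives those codes with the built-in ord function; the dictionary name CSV_CHAR_CODES is made up for this example.

```python
# Integer codes for the default CSV characters described above.
CSV_CHAR_CODES = {
    "csv.separator.char": ord(","),   # comma
    "csv.quote.char": ord('"'),       # double quote
    "csv.escape.char": ord("\\"),     # backslash
}

# A tab-separated file would use the code for a tab character instead.
tab_separator = ord("\t")

print(CSV_CHAR_CODES, tab_separator)
```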
csv.strict.quotes
Sets the strict quotes setting - if true, characters outside the quotes are ignored.
- Type: string
- Default: false
- Importance: low
csv.ignore.leading.whitespace
Sets the ignore leading whitespace setting - if true, white space in front of a quote in a field is ignored.
- Type: string
- Importance: low
csv.ignore.quotations
Sets the ignore quotations mode - if true, quotations are ignored.
- Type: string
- Default: false
- Importance: low
csv.keep.carriage.return
Flag to determine if the carriage return at the end of the line should be maintained.
- Type: string
- Default: false
- Importance: low
csv.null.field.indicator
Indicator to determine how the CSV Reader can determine if a field is null. Valid values are EMPTY_SEPARATORS, EMPTY_QUOTES, BOTH, NEITHER. For more information see http://opencsv.sourceforge.net/apidocs/com/opencsv/enums/CSVReaderNullFieldIndicator.html.
- Type: string
- Default: NEITHER
- Importance: low
csv.first.row.as.header
Flag to indicate if the first row of data contains the header of the file. If true, the position of the columns will be determined by the first row of the CSV. The column position will be inferred from the position of the schema supplied in value.schema. If set to true, the number of columns must be greater than or equal to the number of fields in the schema.
- Type: string
- Importance: medium
csv.file.charset
Character set to read the file with.
- Type: string
- Default: UTF-8
- Importance: low
ui.csv.pre.validate.file.enabled
Flag to enable validating the integrity of all records in the CSV file before processing any of its records. For example, if any of the records have a linefeed within an unquoted field, which would incorrectly break the record at that point, then the entire file will be considered erroneous and no records from that file will be processed. The failed file would be moved to the configured error path. Important: If the number of records in a file is larger than the configured batch size, then portions of the file may be retrieved from the SFTP server by the connector more than once.
- Type: string
- Default: NO
- Valid Values: NO, YES
- Importance: low
Number of tasks for this connector¶
tasks.max
Maximum number of tasks for the connector.
- Type: int
- Valid Values: [1,…,1]
- Importance: high
Next Steps¶
For an example that shows fully-managed Confluent Cloud connectors in action with Confluent Cloud ksqlDB, see the Cloud ETL Demo. This example also shows how to use Confluent CLI to manage your resources in Confluent Cloud.