Chapter 4. About Kafka Connect
Kafka Connect is an integration toolkit for streaming data between Kafka brokers and other systems. The other system is typically an external data source or target, such as a database.
Kafka Connect uses a plugin architecture. Plugins allow connections to other systems and provide additional configuration to manipulate data. Plugins include connectors and other components, such as data converters and transforms. A connector operates with a specific type of external system. Each connector defines a schema for its configuration. You supply the configuration to Kafka Connect to create a connector instance.
Connector instances then define a set of tasks for moving data between systems. AMQ Streams operates Kafka Connect in distributed mode, distributing data streaming tasks across one or more worker pods. A Kafka Connect cluster comprises a group of worker pods. Each connector is instantiated on a single worker. Each connector comprises one or more tasks that are distributed across the group of workers. Distribution across workers permits highly scalable pipelines.
Workers convert data from one format into another format that’s suitable for the source or target system. Depending on the configuration of the connector instance, workers might also apply transforms (also known as Single Message Transforms, or SMTs). Transforms adjust messages, such as filtering certain data, before they are converted. Kafka Connect has some built-in transforms, but other transformations can be provided by plugins if necessary.
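For example, a plugin might supply a custom transform. The following is a minimal sketch, assuming the standard Kafka Connect transforms API, of an SMT that filters out records with a null value; the class name is illustrative.

import java.util.Map;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.transforms.Transformation;

/**
 * Hypothetical single message transform (SMT) that drops records with a
 * null value. Returning null from apply() tells the runtime to filter
 * the record out before it is converted.
 */
public class DropNullValues<R extends ConnectRecord<R>> implements Transformation<R> {

    @Override
    public void configure(Map<String, ?> configs) {
        // No configuration options in this sketch.
    }

    @Override
    public R apply(R record) {
        return record.value() == null ? null : record;
    }

    @Override
    public ConfigDef config() {
        return new ConfigDef();
    }

    @Override
    public void close() {
        // Nothing to clean up.
    }
}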
4.1. How Kafka Connect streams data
Kafka Connect uses connector instances to integrate with other systems and stream data.
Kafka Connect loads existing connector instances on startup and distributes data streaming tasks and connector configuration across worker pods. Workers run the tasks for the connector instances.
Each worker runs as a separate pod to make the Kafka Connect cluster more fault tolerant. If there are more tasks than workers, workers are assigned multiple tasks. If a worker fails, its tasks are automatically reassigned to active workers in the Kafka Connect cluster.
The main Kafka Connect components used in streaming data are as follows:
Connectors to create tasks
Tasks to move data
Workers to run tasks
Transforms to manipulate data
Converters to convert data
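To illustrate the last of these components, the following is a minimal sketch of a custom converter, assuming the standard Kafka Connect converter API. In practice you would usually use a built-in converter, such as the JSON or String converter; the class name here is illustrative.

import java.nio.charset.StandardCharsets;
import java.util.Map;

import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaAndValue;
import org.apache.kafka.connect.storage.Converter;

/**
 * Hypothetical converter that serializes record values as UTF-8 strings.
 * Workers call a converter to translate between Connect's internal data
 * format and the bytes stored in Kafka.
 */
public class Utf8StringConverter implements Converter {

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        // No configuration options in this sketch.
    }

    @Override
    public byte[] fromConnectData(String topic, Schema schema, Object value) {
        return value == null ? null : value.toString().getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public SchemaAndValue toConnectData(String topic, byte[] value) {
        String text = value == null ? null : new String(value, StandardCharsets.UTF_8);
        return new SchemaAndValue(Schema.OPTIONAL_STRING_SCHEMA, text);
    }
}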
4.1.1. Connectors
Connectors can be one of the following types:
Source connectors that push data into Kafka
Sink connectors that extract data out of Kafka
Plugins provide the implementation for Kafka Connect to run connector instances. Connector instances create the tasks required to transfer data in and out of Kafka. The Kafka Connect runtime orchestrates the tasks to split the work required between the worker pods.
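As an illustration of that contract, the following is a minimal sketch of a source connector class, assuming the standard Kafka Connect connector API. The class names are hypothetical, and a matching task sketch appears in Section 4.1.2.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.source.SourceConnector;

/**
 * Hypothetical source connector. The runtime calls taskConfigs() to ask the
 * connector instance for the task configurations it should distribute to
 * the workers; maxTasks reflects the tasksMax setting described in
 * Section 4.1.2.
 */
public class ExampleSourceConnector extends SourceConnector {

    private Map<String, String> connectorConfig;

    @Override
    public void start(Map<String, String> props) {
        this.connectorConfig = props;
    }

    @Override
    public Class<? extends Task> taskClass() {
        // Hypothetical companion task class; a sketch appears in Section 4.1.2.
        return ExampleSourceTask.class;
    }

    @Override
    public List<Map<String, String>> taskConfigs(int maxTasks) {
        // Create up to maxTasks task configurations; a real connector would
        // split the source data (for example, tables or files) between them.
        List<Map<String, String>> configs = new ArrayList<>();
        for (int i = 0; i < maxTasks; i++) {
            Map<String, String> taskConfig = new HashMap<>(connectorConfig);
            taskConfig.put("task.id", Integer.toString(i));
            configs.add(taskConfig);
        }
        return configs;
    }

    @Override
    public void stop() {
        // Release any resources held by the connector instance.
    }

    @Override
    public ConfigDef config() {
        return new ConfigDef();
    }

    @Override
    public String version() {
        return "1.0";
    }
}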
MirrorMaker 2.0 also uses the Kafka Connect framework. In this case, the external data system is another Kafka cluster. Specialized connectors for MirrorMaker 2.0 manage data replication between source and target Kafka clusters.
In addition to the MirrorMaker 2.0 connectors, Kafka provides two built-in connectors as examples:
FileStreamSourceConnector
streams data from a file on the worker’s filesystem to Kafka, reading the input file and sending each line to a given Kafka topic.
FileStreamSinkConnector
streams data from Kafka to the worker’s filesystem, reading messages from a Kafka topic and writing a line for each in an output file.
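As a concrete example, the following sketch registers a FileStreamSourceConnector instance through the Kafka Connect REST API, one of the two management options described in the process flows that follow. The Connect host name, file path, and topic name are placeholders.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/**
 * Registers a FileStreamSourceConnector instance through the Kafka Connect
 * REST API. The host name, file path, and topic are placeholders.
 */
public class CreateFileSourceConnector {

    public static void main(String[] args) throws Exception {
        String json = """
            {
              "name": "file-source",
              "config": {
                "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
                "tasks.max": "1",
                "file": "/tmp/input.txt",
                "topic": "file-lines"
              }
            }
            """;

        HttpRequest request = HttpRequest.newBuilder(URI.create("http://my-connect-cluster:8083/connectors"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(json))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println(response.statusCode() + " " + response.body());
    }
}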
The following source connector diagram shows the process flow for a source connector that streams records from an external data system. A Kafka Connect cluster might operate source and sink connectors at the same time. Workers run in distributed mode in the cluster, and each worker can run one or more tasks for more than one connector instance.
Source connector streaming data to Kafka
A plugin provides the implementation artifacts for the source connector
A single worker initiates the source connector instance
The source connector creates the tasks to stream data
Tasks run in parallel to poll the external data system and return records
Transforms adjust the records, such as filtering or relabelling them
Converters put the records into a format suitable for Kafka
The source connector is managed using KafkaConnectors or the Kafka Connect API
The following sink connector diagram shows the process flow when streaming data from Kafka to an external data system.
Sink connector streaming data from Kafka
A plugin provides the implementation artifacts for the sink connector
A single worker initiates the sink connector instance
The sink connector creates the tasks to stream data
Tasks run in parallel to poll Kafka and return records
Converters put the records into a format suitable for the external data system
Transforms adjust the records, such as filtering or relabelling them
The sink connector is managed using KafkaConnectors or the Kafka Connect API
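To relate this flow to the Java API, the following is a minimal sketch of a sink task, assuming the standard Kafka Connect API: the worker polls Kafka, applies converters and transforms, and then delivers batches of records to the task’s put() method. The class name is hypothetical and the write to the external system is simulated.

import java.util.Collection;
import java.util.Map;

import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

/**
 * Hypothetical sink task. The worker consumes records from Kafka, applies
 * converters and transforms, and then hands batches of records to put()
 * for writing to the external system.
 */
public class ExampleSinkTask extends SinkTask {

    @Override
    public void start(Map<String, String> props) {
        // Open a connection to the external system here.
    }

    @Override
    public void put(Collection<SinkRecord> records) {
        for (SinkRecord record : records) {
            // Write each record to the external system; printing stands in
            // for a real write in this sketch.
            System.out.printf("topic=%s value=%s%n", record.topic(), record.value());
        }
    }

    @Override
    public void stop() {
        // Close the connection to the external system.
    }

    @Override
    public String version() {
        return "1.0";
    }
}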
4.1.2. Tasks
Data transfer orchestrated by the Kafka Connect runtime is split into tasks that run in parallel. A task is started using the configuration supplied by a connector instance. Kafka Connect distributes the task configurations to workers, which instantiate and execute tasks.
A source connector task polls the external data system and returns a list of records that a worker sends to the Kafka brokers.
A sink connector task receives Kafka records from a worker for writing to the external data system.
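To make the source side of this contract concrete, here is a minimal sketch of a source task, assuming the standard Kafka Connect API. It pairs with the connector sketch in Section 4.1.1; the topic name and record content are placeholders.

import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

/**
 * Hypothetical source task, the companion to the connector sketch in
 * Section 4.1.1. The worker calls poll() repeatedly and sends the returned
 * records to the Kafka brokers.
 */
public class ExampleSourceTask extends SourceTask {

    private Map<String, String> taskConfig;

    @Override
    public void start(Map<String, String> props) {
        this.taskConfig = props;
    }

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        // A real task would read from the external system here; this sketch
        // returns a single hard-coded record for a placeholder topic.
        Map<String, String> sourcePartition = Collections.singletonMap("source", "example");
        Map<String, Long> sourceOffset = Collections.singletonMap("position", 0L);
        SourceRecord record = new SourceRecord(
            sourcePartition, sourceOffset, "example-topic",
            Schema.STRING_SCHEMA, "example value");
        return Collections.singletonList(record);
    }

    @Override
    public void stop() {
        // Stop reading from the external system.
    }

    @Override
    public String version() {
        return "1.0";
    }
}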
For sink connectors, the number of tasks created relates to the number of partitions being consumed. For source connectors, how the source data is partitioned is defined by the connector. You can control the maximum number of tasks that can run in parallel by setting tasksMax in the connector configuration. The connector might create fewer tasks than the maximum setting, for example, if it’s not possible to split the source data into that many partitions.
In the context of Kafka Connect, a partition can mean a topic partition or a shard of data in an external system.