Thursday, March 28, 2019

Apache Kafka

What is Kafka?

Kafka is a distributed, real-time streaming platform.
It has three key capabilities:

    1. Publish and subscribe to streams of records (like a message queue).
    2. Store streams of records in a fault-tolerant way.
    3. Process streams of records in real time.

Application areas:

  • Building real-time streaming data pipelines.
  • Building real-time streaming applications.

Concepts:
Kafka runs as a cluster on one or more servers, which can span multiple data centers.
The cluster stores streams of records in categories called topics.
Each record consists of:

  • key
  • value
  • timestamp

Core APIs

  • Producer API
  • Consumer API
  • Streams API
  • Connector API

The communication between the client and the servers is done with TCP.

Source: https://kafka.apache.org/intro
Topics in Kafka
A topic is a category or feed name to which records are published.
Kafka topics are always multi-subscriber: a topic can have zero, one, or many consumers.
For each topic, the Kafka cluster maintains a partitioned log.
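To make the idea of a partitioned log concrete, here is a minimal in-memory sketch. It is only a toy model, not how Kafka is implemented. The names Record and PartitionedLog are made up for illustration, and the sketch assumes that each partition is an append-only sequence in which a record's position (its offset) identifies it.

```python
import time
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class Record:
    """A record as described above: key, value, and timestamp."""
    key: Optional[str]
    value: Any
    timestamp: float = field(default_factory=time.time)


class PartitionedLog:
    """Toy model of one topic: a fixed number of append-only partitions."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions: List[List[Record]] = [[] for _ in range(num_partitions)]

    def append(self, partition: int, record: Record) -> int:
        """Append a record to a partition and return its offset."""
        log = self.partitions[partition]
        log.append(record)
        return len(log) - 1

    def read(self, partition: int, offset: int) -> List[Record]:
        """Return all records of a partition starting at the given offset."""
        return self.partitions[partition][offset:]


topic = PartitionedLog(num_partitions=2)
first = topic.append(0, Record("user-1", "login"))
second = topic.append(0, Record("user-1", "click"))
other = topic.append(1, Record("user-2", "login"))
```

Offsets are counted per partition, so the first record in partition 1 gets offset 0 even though partition 0 already holds two records.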







Distribution
The partitions of the log are distributed over the servers in a Kafka cluster.
Each server in the Kafka cluster can handle data and requests for a share of the partitions. Each partition is replicated for fault tolerance.
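Each partition has:

  • one "leader", which handles all read and write requests for the partition,
  • zero or more "followers", which passively replicate the leader.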

Geo Replication
Kafka MirrorMaker provides geo-replication support for clusters. With MirrorMaker, messages are replicated across multiple data centers or cloud regions.

Producers
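Producers publish data to the topics of their choice. The producer is responsible for choosing which partition within the topic each record is assigned to. This can be done round-robin to balance load, or by a partition function based on, say, the record key.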
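The partition choice can be sketched as a small function. This is an illustration under assumptions, not Kafka's actual partitioner: the function name choose_partition is hypothetical, and CRC32 is used only as a stand-in hash (Kafka's Java client uses a murmur2 hash by default for keyed records).

```python
import zlib
from itertools import count
from typing import Optional

# Shared counter used to spread records without a key across partitions.
_round_robin = count()


def choose_partition(key: Optional[bytes], num_partitions: int) -> int:
    """Pick a partition index for a record.

    Records with a key always map to the same partition (stand-in CRC32 hash).
    Records without a key are distributed round-robin.
    """
    if key is None:
        return next(_round_robin) % num_partitions
    return zlib.crc32(key) % num_partitions


partition = choose_partition(b"user-42", 3)
```

Because the partition depends only on the key and the partition count, all records with the same key land in the same partition, which keeps them in order relative to each other.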

Consumers
Source: https://kafka.apache.org/intro
Consumers label themselves with a consumer group name. Each record published to a topic is delivered to exactly one consumer instance within each subscribing consumer group.
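The delivery rule can be simulated in a few lines. This is a simplified sketch: the function deliver and the group and consumer names are invented for the example, and records are handed out per record in round-robin order, whereas a real Kafka cluster divides the partitions of a topic among the consumers of a group.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def deliver(
    records: List[str], groups: Dict[str, List[str]]
) -> Dict[Tuple[str, str], List[str]]:
    """Deliver every record to exactly one consumer in each group."""
    received: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for i, record in enumerate(records):
        for group, consumers in groups.items():
            # Within a group, only one consumer gets this record.
            consumer = consumers[i % len(consumers)]
            received[(group, consumer)].append(record)
    return dict(received)


# Hypothetical groups: "billing" has two instances, "audit" has one.
groups = {"billing": ["b1", "b2"], "audit": ["a1"]}
received = deliver(["r0", "r1", "r2", "r3"], groups)
```

In this run the two "billing" consumers split the records between them, while the single "audit" consumer sees all four, so every group still receives every record exactly once.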










Kafka for Stream Processing
Kafka as a Storage System

Kafka as a Messaging System
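Messaging traditionally has two models: queuing and publish-subscribe. In a queue, a pool of consumers reads from a server and each record goes to one of them. In publish-subscribe, each record is broadcast to all consumers. The consumer group concept in Kafka generalizes these two models: if all consumers share one group, records are load-balanced across them as in a queue; if every consumer has its own group, each record is broadcast to all of them.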

Kafka Architecture
Source: https://kafka.apache.org/21/documentation/streams/architecture

How to configure a Kafka environment on Ubuntu?

Prerequisites

  • Updated package information on the system (apt update)
  • Java installed (the system's default Java version)

Download and extract the binary distribution of Kafka:

https://www.apache.org/dyn/closer.cgi?path=/kafka/2.2.0/kafka_2.11-2.2.0.tgz


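Start Kafka server

Because the Kafka server depends on ZooKeeper, start the ZooKeeper server first. The distribution bundles a script for this, bin/zookeeper-server-start.sh, which takes config/zookeeper.properties as its argument.

Then start the Kafka server with bin/kafka-server-start.sh and config/server.properties.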

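Create A Topic
Topics are created with the bin/kafka-topics.sh script included in the distribution.

Send Messages To Kafka
The bundled console producer, bin/kafka-console-producer.sh, sends each line typed on standard input to a topic as a message.

Using Kafka Consumer
The bundled console consumer, bin/kafka-console-consumer.sh, prints the messages of a topic to standard output.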


How was it invented?
Kafka was originally developed at LinkedIn as a message queue.
