Thursday, March 28, 2019

Apache Kafka

What is Kafka?

It is a distributed, real-time streaming platform.
It has three capabilities:

    1. Publish and subscribe to streams of records (like a message queue).
    2. Store streams of records with fault tolerance.
    3. Process streams of records in real time.

Application Areas;

  • Building real-time streaming data pipelines.
  • Building real-time streaming applications.

Concepts;
Runs as a cluster (which can span multiple data centers).
Stores streams of records in categories called topics.
A record consists of:

  • key
  • value
  • timestamp

Core APIs

  • Producer API
  • Consumer API
  • Streams API
  • Connector API

Communication between the clients and the servers is done over TCP.

Source: https://kafka.apache.org/intro
Topics in Kafka
A category or feed name to which records are published.
Kafka topics are always multi-subscriber; a topic can have zero or many consumers.
For each topic, the Kafka cluster maintains a partitioned log.

Distribution
The partitions of the log are distributed over the servers in a Kafka cluster.
Each server in the Kafka cluster can handle data and requests for a share of the partitions. Each partition is replicated for fault tolerance.
A partition has:

  • one "leader",
  • zero or more "followers"
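As a quick illustration (the broker address localhost:9092 and the topic name test are assumptions, not from the original post), the console tool that ships with Kafka can show the leader and followers of each partition when run from the Kafka installation directory:

    bin/kafka-topics.sh --describe --bootstrap-server localhost:9092 --topic test
    # each partition line lists its Leader, Replicas and Isr (in-sync replicas)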

Geo Replication
Kafka MirrorMaker provides geo-replication support for clusters. With MirrorMaker, messages are replicated across multiple data centers or cloud regions.

Producers
Producers publish data to the topics of their choice. The producer is responsible for choosing which record to assign to which partition within the topic.

Consumers
Consumers label themselves with a consumer group name. Each record published to a topic is delivered to one consumer instance within each subscribing consumer group.
Source: https://kafka.apache.org/intro


Kafka for Stream Processing
Kafka as a Storage System

Kafka as a Messaging System
Traditionally, messaging has two models: queuing and publish-subscribe. In a queue, a pool of consumers reads from a server and each record goes to one of them. Publish-subscribe broadcasts each record to multiple processes. The consumer group concept in Kafka generalizes these two models.
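A rough way to see both behaviours with the bundled console consumer (the topic name test and the group names groupA/groupB are just assumptions for the sketch; run each command in its own terminal): consumers started with the same --group share the topic like a queue, while a consumer in a different group receives its own full copy of the stream.

    # queue-style: two consumers in the SAME group split the records between them
    bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --group groupA
    bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --group groupA
    # publish-subscribe style: a consumer in a DIFFERENT group also gets every record
    bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --group groupB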

Kafka Architecture
Source: https://kafka.apache.org/21/documentation/streams/architecture

How to configure a Kafka environment in Ubuntu?

Prerequisites

  • updated package information in the system (apt update)
  • a default Java version installed
Download and extract the binary distribution of Kafka:

https://www.apache.org/dyn/closer.cgi?path=/kafka/2.2.0/kafka_2.11-2.2.0.tgz
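For example (the archive URL below is just one possible mirror; the link above lets you pick a closer one):

    wget https://archive.apache.org/dist/kafka/2.2.0/kafka_2.11-2.2.0.tgz
    tar -xzf kafka_2.11-2.2.0.tgz
    cd kafka_2.11-2.2.0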


Start Kafka server

As the Kafka server uses ZooKeeper, we have to start the ZooKeeper server first:
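One way to do this, using the single-node ZooKeeper configuration bundled with the distribution:

    bin/zookeeper-server-start.sh config/zookeeper.properties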

Then start the Kafka server:
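Again using the default properties file shipped with the distribution:

    bin/kafka-server-start.sh config/server.properties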

Create A Topic
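For example, a single-partition topic named test (the topic name is only for illustration):

    bin/kafka-topics.sh --create --bootstrap-server localhost:9092 \
        --replication-factor 1 --partitions 1 --topic test
    # on releases before 2.2, use --zookeeper localhost:2181 instead of --bootstrap-server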

Send Messages To Kafka
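The bundled console producer lets you type messages interactively (topic test assumed from the previous step):

    bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
    # each line typed here is sent to the topic as a separate record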

Using Kafka Consumer
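And the console consumer prints the records back out:

    bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning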


How was it invented?
It was originally developed at LinkedIn as a message queue.

Wednesday, March 20, 2019

Redis Cluster

What is a Redis Cluster?

Source: https://redislabs.com/redis-features/redis-cluster
Redis Cluster is an active/passive cluster implementation in which the data set is automatically sharded across multiple Redis nodes. It consists of master and slave nodes, and it gives the ability to continue operations when a subset of the nodes is experiencing failures or is unable to communicate with the rest of the cluster.

How does Redis Cluster manage storage in a distributed setup?

Redis Cluster does not use consistent hashing, but a different form of sharding in which every key is conceptually part of what we call a hash slot. The cluster splits the key space into 16,384 hash slots; to compute the hash slot of a given key, we simply take the CRC16 of the key modulo 16384. Every node in a Redis Cluster is responsible for a subset of the hash slots, so for example you may have a cluster with 3 nodes, where:
  • Node A contains hash slots from 0 to 5500.
  • Node B contains hash slots from 5501 to 11000.
  • Node C contains hash slots from 11001 to 16383. 
Distribution of 10 hash keys across three nodes:

Image Source :https://blog.usejournal.com/first-step-to-redis-cluster-7712e1c31847?fbclid=IwAR1xXC1vvD9QBEkfDtFh8BpnC7mQsyhQ3mG4eEET_pFLIZoIilhBNNfYcIc
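redis-cli can report the slot a key hashes to with the CLUSTER KEYSLOT command (the port 7000 and the key name below are assumptions for illustration):

    redis-cli -p 7000 CLUSTER KEYSLOT user:1000
    # returns the hash slot number, i.e. CRC16("user:1000") mod 16384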

This makes it easy to add and remove nodes in the cluster. For example, if I want to add a new node D, I need to move some hash slots from nodes A, B and C to D. Similarly, if I want to remove node A from the cluster, I can just move the hash slots served by A to B and C; once node A is empty, I can remove it from the cluster completely. Each slave replicates a specific master and can be reassigned to replicate another master, or be elected master, as needed. Replication is completely asynchronous and does not block the master or the slave. Masters receive all read and write requests for their slots; slaves never communicate with clients.

Because moving hash slots from one node to another does not require stopping operations, adding and removing nodes, or changing the percentage of hash slots held by each node, does not require any downtime.
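With the redis-cli bundled with Redis 5 this looks roughly as follows (older releases use the redis-trib.rb script instead; the addresses are assumptions):

    # add a new, empty master D to an existing cluster
    redis-cli --cluster add-node 127.0.0.1:7006 127.0.0.1:7000
    # then interactively move some hash slots from the existing masters to D
    redis-cli --cluster reshard 127.0.0.1:7000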


How does it handle failover?

Redis Cluster master-slave model



Figure demonstrating master-slave across three servers
Source: https://www.linode.com/docs/applications/big-data/how-to-install-and-configure-a-redis-cluster-on-ubuntu-1604/
In order to remain available when a subset of master nodes is failing or unable to communicate with the majority of nodes, Redis Cluster uses a master-slave model in which every hash slot has from 1 (the master itself) to N replicas (N-1 additional slave nodes).
In the example cluster above with nodes A, B and C, if node B fails the cluster is not able to continue, since we no longer have a way to serve the hash slots in the range 5501-11000.
However, if when the cluster is created (or at a later time) we add a slave node to every master, so that the final cluster is composed of master nodes A, B, C and slave nodes A1, B1, C1, the system is able to continue if node B fails.
Since node B1 replicates B, when B fails the cluster promotes B1 to be the new master and continues to operate correctly.
Note, however, that if nodes B and B1 fail at the same time, Redis Cluster is not able to continue to operate.


Figure demonstrating server3 failure
Source: https://www.linode.com/docs/applications/big-data/how-to-install-and-configure-a-redis-cluster-on-ubuntu-1604/
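Such a 3-master/3-slave cluster can be created in one step; a sketch with Redis 5's redis-cli (the ports are assumptions, and each address must already be running a cluster-enabled Redis instance):

    redis-cli --cluster create 127.0.0.1:7000 127.0.0.1:7001 127.0.0.1:7002 \
        127.0.0.1:7003 127.0.0.1:7004 127.0.0.1:7005 --cluster-replicas 1
    # --cluster-replicas 1 assigns one slave to every master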

However, Redis Cluster is not able to guarantee strong consistency. In practical terms this means that under certain conditions it is possible for Redis Cluster to lose writes that were acknowledged by the system to the client.

The first reason why Redis Cluster can lose writes is because it uses asynchronous replication. This means that during writes the following happens:
  • Client writes to the master B.
  • The master B replies OK to client.
  • The master B propagates the write to its slaves B1, B2 and B3.
As you can see, B does not wait for an acknowledgement from B1, B2 and B3 before replying to the client, since this would impose a prohibitive latency penalty on Redis. So if the client writes something, B acknowledges the write but crashes before it can send the write to its slaves, and one of the slaves (which did not receive the write) is then promoted to master, the write is lost forever.

This is very similar to what happens with most databases that are configured to flush data to disk every second, so it is a scenario you can already reason about from experience with traditional, non-distributed database systems. You can improve consistency by forcing the database to flush data to disk before replying to the client, but this usually results in prohibitively low performance. That would be the equivalent of synchronous replication in the case of Redis Cluster.

Basically there is a trade-off to take between performance and consistency.

Redis Cluster supports synchronous writes when absolutely needed, implemented via the WAIT command. This makes losing writes a lot less likely; however, note that Redis Cluster does not implement strong consistency even when synchronous replication is used: under more complex failure scenarios it is still possible for a slave that was not able to receive the write to be elected master.
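A minimal sketch of WAIT inside an interactive redis-cli session (the port 7000 and the key are assumptions):

    redis-cli -c -p 7000        # connect to one of the masters
    SET mykey "some-value"      # a normal write
    WAIT 1 100                  # block until at least 1 replica acknowledges, or 100 ms pass;
                                # the reply is the number of replicas that acknowledged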
There is another notable scenario in which Redis Cluster will lose writes: a network partition where a client is isolated with a minority of instances that includes at least one master.
Take as an example a 6-node cluster composed of A, B, C, A1, B1, C1, with 3 masters and 3 slaves. There is also a client, which we will call Z1.
After a partition occurs, it is possible that in one side of the partition we have A, C, A1, B1, C1, and in the other side we have B and Z1.
Z1 is still able to write to B, which will accept its writes. If the partition heals in a very short time, the cluster will continue normally. However, if the partition lasts long enough for B1 to be promoted to master on the majority side of the partition, the writes that Z1 sent to B will be lost.


Note that there is a maximum window to the amount of writes Z1 will be able to send to B: if enough time has elapsed for the majority side of the partition to elect a slave as master, every master node in the minority side stops accepting writes. This amount of time is a very important configuration directive of Redis Cluster, and is called the node timeout.

After node timeout has elapsed, a master node is considered to be failing and can be replaced by one of its replicas. Similarly, if node timeout elapses without a master being able to sense the majority of the other master nodes, it enters an error state and stops accepting writes.
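The directive is cluster-node-timeout in redis.conf (in milliseconds, 15000 by default); it can also be inspected, and on recent versions changed, at runtime, for example:

    redis-cli -p 7000 CONFIG GET cluster-node-timeout
    redis-cli -p 7000 CONFIG SET cluster-node-timeout 15000   # 15 seconds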


How does it achieve good performance?

In a Redis Cluster, replication is completely asynchronous and does not block the master or the slave.
This improves performance when writing data to and reading data from the database, because the asynchronous replication happens separately from the in-memory interactions with clients.

