Kafka from a Distributed Systems Perspective

There are two basic tasks that any computer system needs to accomplish:

  • storage
  • computation

Distributed programming is the art of solving the same problem that you can solve on a single computer using multiple computers - usually, because the problem no longer fits on a single computer.

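The passage above takes a high-level view and explains concisely what a distributed system is. What do we need to care about in a distributed system? In short: scalability, performance, and availability.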

Scalability

Scalability is the ability of a system, network, or process to handle a growing amount of work in a capable manner.

The same problem is trivial when the data set is small but becomes much harder as the volume grows. Take counting, for example: It's easy to count how many people are in a room, and hard to count how many people are in a country.

Performance

Performance is characterized by the amount of useful work accomplished by a computer system compared to the time and resources used.

Performance can be summed up as "more, faster, better, cheaper". Specifically:

  1. Short response time/low latency for a given piece of work
  2. High throughput (rate of processing work)
  3. Low utilization of computing resource(s)
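
As a rough illustration of the first two items, latency and throughput can be computed from the same batch of measurements. This is a minimal sketch: the function name `summarize`, the field names, and the nearest-rank percentile are choices made for this example, and resource utilization is left out because it needs OS-level metrics.

```python
import math
from typing import Dict, List


def summarize(latencies_ms: List[float], wall_clock_s: float) -> Dict[str, float]:
    """Summarize a batch of completed requests.

    latencies_ms: per-request response times in milliseconds.
    wall_clock_s: elapsed time over which all the requests completed.
    """
    if not latencies_ms or wall_clock_s <= 0:
        raise ValueError("need at least one sample and a positive duration")
    ordered = sorted(latencies_ms)
    # Nearest-rank 99th percentile: the smallest sample such that
    # at least 99% of all samples are less than or equal to it.
    p99_index = math.ceil(0.99 * len(ordered)) - 1
    return {
        "mean_latency_ms": sum(ordered) / len(ordered),
        "p99_latency_ms": ordered[p99_index],
        "throughput_per_s": len(ordered) / wall_clock_s,
    }


stats = summarize([10.0, 20.0, 30.0, 40.0], wall_clock_s=2.0)
```

Latency and throughput are related but not interchangeable: batching requests often raises throughput while also raising the response time of each individual request.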

Availability

Availability: the proportion of time a system is in a functioning condition. If a user cannot access the system, it is said to be unavailable.

Fault tolerance: the ability of a system to behave in a well-defined manner once faults occur.
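
The availability definition can be made concrete as uptime divided by total observed time. The sketch below assumes that reading; the function names and the 365-day year are choices made for this example.

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Fraction of the observed time the system was in a functioning condition."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return uptime_hours / total


def allowed_downtime_hours_per_year(target: float) -> float:
    """Downtime budget for a target availability (e.g. 0.999), assuming a 365-day year."""
    return (1.0 - target) * 365 * 24


observed = availability(uptime_hours=99.0, downtime_hours=1.0)  # 0.99
budget = allowed_downtime_hours_per_year(0.999)  # about 8.76 hours
```

Each extra "nine" in the target cuts the downtime budget by a factor of ten, which is why higher availability targets usually call for redundancy rather than more reliable single machines.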

Divide and conquer - I mean, partition and replicate.

There are two basic techniques that can be applied to a data set. It can be split over multiple nodes (partitioning) to allow for more parallel processing. It can also be copied or cached on different nodes to reduce the distance between the client and the server and for greater fault tolerance (replication).
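
The two techniques can be sketched in a few lines. This is an illustrative toy model, not how any particular system places data: nodes are plain integers, and the helper names `partition`, `replica_nodes`, and `live_replica` are hypothetical.

```python
from typing import List, Set


def partition(records: List[int], num_nodes: int) -> List[List[int]]:
    """Partitioning: give each node a disjoint slice of the data set."""
    return [records[node::num_nodes] for node in range(num_nodes)]


def replica_nodes(shard_id: int, num_nodes: int, copies: int) -> List[int]:
    """Replication: place a shard on `copies` consecutive nodes, wrapping around."""
    if not 1 <= copies <= num_nodes:
        raise ValueError("copies must be between 1 and num_nodes")
    return [(shard_id + k) % num_nodes for k in range(copies)]


def live_replica(shard_id: int, num_nodes: int, copies: int, failed: Set[int]) -> int:
    """Return a surviving node that holds the shard, or raise if every copy is lost."""
    for node in replica_nodes(shard_id, num_nodes, copies):
        if node not in failed:
            return node
    raise RuntimeError(f"all replicas of shard {shard_id} are down")


records = list(range(10))
shards = partition(records, num_nodes=4)
# Each node can count its own shard independently; the partial counts add up.
total = sum(len(shard) for shard in shards)
# With two copies per shard, shard 3 stays readable after node 3 fails.
fallback = live_replica(3, num_nodes=4, copies=2, failed={3})
```

Partitioning is what makes the per-shard counts computable in parallel, while replication is what keeps a shard readable after one of its nodes fails.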

As the figure shows, a topic has 4 partitions, and each partition has two replicas.

How exactly is a message sent to a topic assigned to one of its partitions?

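By key hash or by round robin: a message with a key goes to the partition selected by hashing that key, while messages without a key are spread across the partitions in turn.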
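
The two strategies can be sketched as a small partitioner. This is a simplified model rather than Kafka's actual implementation: `make_partitioner` is a hypothetical helper, and CRC32 stands in for the murmur2 hash that Kafka's Java client uses, so the partition numbers it produces will not match a real cluster.

```python
import itertools
import zlib
from typing import Callable, Iterator, Optional


def make_partitioner(num_partitions: int) -> Callable[[Optional[bytes]], int]:
    """Return a function that picks a partition for each record key."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    round_robin: Iterator[int] = itertools.cycle(range(num_partitions))

    def choose(key: Optional[bytes]) -> int:
        if key is None:
            # No key: hand out partitions one after another.
            return next(round_robin)
        # Keyed record: the same key always maps to the same partition.
        # CRC32 is an assumption made for this sketch, not Kafka's hash.
        return zlib.crc32(key) % num_partitions

    return choose


choose = make_partitioner(4)
unkeyed = [choose(None) for _ in range(5)]  # cycles through 0, 1, 2, 3, 0
same_key = choose(b"user-42") == choose(b"user-42")  # True
```

Hashing the key keeps all records for one key in a single partition, which preserves their relative order; round robin gives up that property in exchange for an even spread.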

Reference: How Kafka Works (kafka工作原理)
