Segment: Managing and Monitoring Over 16,000 Containers on ECS
In this video, Calvin French-Owen, co-founder and CTO of Segment, explains the practices and tools the company uses to manage and monitor roughly 16,000 containers on ECS (Elastic Container Service).
At 0:22 he lists Segment's chief objective: providing a single API (Application Programming Interface) through which customers can manage their own customers' data. At 0:36 Calvin describes the development environment, noting that the company has used ECS for nearly two years and that all production workloads run in containers managed by ECS. He highlights that roughly 350 different ECS services run about 16,000 containers.
At 1:05 Calvin begins to explain the monitoring tools Segment built to manage and monitor this large number of containers.
He states that the entire monitoring process begins with the logging pipeline, which addresses three main use cases: auditing, searching, and tailing. At 1:46 Calvin explains how the logs are generated: they originate inside the container and, once created, are sent to the Docker daemon. At 2:17 he stresses that this daemon is configured by ECS, and that the logging driver can be chosen on a per-container or per-service basis, determining where and how the container output is routed.
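In ECS, the per-container logging-driver choice described above is made through the `logConfiguration` field of a task definition. A minimal sketch, expressed as a Python dict (the field names follow the ECS task-definition schema; the service name and syslog address are hypothetical placeholders, not values from the talk):

```python
# Fragment of an ECS task-definition container definition, as a Python dict.
# The logConfiguration block selects the Docker logging driver per container.
container_definition = {
    "name": "my-service",  # hypothetical service name
    "logConfiguration": {
        "logDriver": "syslog",  # route this container's output to a syslog endpoint
        "options": {
            # Hypothetical local endpoint, e.g. a rate-limiting log proxy on the host.
            "syslog-address": "tcp://127.0.0.1:514",
            # Docker template: tag each log line with the container ID.
            "tag": "{{.ID}}",
        },
    },
}
```

Because the driver is set per container definition, two services on the same host can route their logs to entirely different destinations.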
At 2:40 he notes that the Docker daemon emits the logs in a machine-readable format containing all the metadata, and that these logs are typically sent to journald, where logs live on modern Linux hosts.
Furthermore, he stresses that this arrangement is risky, since logs arriving from the Docker daemon are not attributed on a per-process basis; a single misbehaving container on a host can starve the others and potentially bring down the entire system. At 3:40 Calvin states that this failure mode was resolved by building a rate-limiting log proxy, which exposes a syslog server that Docker communicates with directly.
At 3:58 he states that when Docker communicates with the proxy, incoming logs are tagged with their container ID. At 4:13 he explains that, using these tags, the rate-limiting proxy keeps track of which containers are logging and how much each one logs. If a container logs excessively, the proxy limits it and records that it was rate-limited, thereby providing visibility and isolation and preventing a collapse of the entire container system.
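The per-container accounting the proxy performs can be sketched as a token bucket keyed by container ID. This is an illustrative assumption, not Segment's actual implementation: the talk does not specify the limiting algorithm, and the rate and burst parameters here are invented.

```python
import time
from collections import defaultdict


class PerContainerRateLimiter:
    """Sketch of per-container log rate limiting (assumed token-bucket scheme).

    Each container ID gets its own bucket, so one noisy container is
    throttled without affecting the others on the same host.
    """

    def __init__(self, lines_per_sec: float = 100.0, burst: float = 200.0):
        self.rate = lines_per_sec
        self.burst = burst
        self.tokens = defaultdict(lambda: burst)      # current budget per container
        self.last = defaultdict(time.monotonic)       # last refill time per container

    def allow(self, container_id: str) -> bool:
        """Return True if this container's log line should be forwarded."""
        now = time.monotonic()
        elapsed = now - self.last[container_id]
        self.last[container_id] = now
        # Refill the bucket proportionally to elapsed time, capped at burst.
        self.tokens[container_id] = min(
            self.burst, self.tokens[container_id] + elapsed * self.rate
        )
        if self.tokens[container_id] >= 1:
            self.tokens[container_id] -= 1
            return True
        # Dropped: a real proxy would also emit a "container was rate-limited"
        # notice here, giving operators the visibility Calvin describes.
        return False
```

Because the buckets are independent, exhausting one container's budget leaves every other container's logs flowing normally, which is exactly the isolation property the talk emphasizes.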
At 5:23 Calvin introduces a critical tool called ecs-logs, an open-source binary that tails journald looking for entries tagged with the container ID and the service ID, and takes care of shipping the logs on to other platforms. At 6:16 he states that from ecs-logs, everything is sent to a handful of providers, the most prominent being CloudWatch. At 6:30 he adds that a log group is created for each service, with the container ID as the log stream, so all the logs from a service's containers can be followed in one place.
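The group-per-service, stream-per-container mapping can be sketched as a small function over a journald entry. The field names below mirror what Docker's journald driver records; the `"service/container-id"` tag layout and the `/ecs/...` group-naming scheme are illustrative assumptions, not necessarily Segment's exact convention.

```python
def cloudwatch_destination(entry: dict) -> tuple:
    """Map a journald entry to a (log group, log stream) pair.

    CONTAINER_TAG and CONTAINER_ID are fields the Docker journald
    driver attaches; the tag is assumed (hypothetically) to be of the
    form "service/container-id".
    """
    service = entry["CONTAINER_TAG"].split("/")[0]
    container_id = entry["CONTAINER_ID"]
    group = f"/ecs/{service}"   # one CloudWatch log group per service
    stream = container_id       # one log stream per container
    return group, stream


# Usage: a (simplified, hypothetical) journald entry for one log line.
entry = {
    "CONTAINER_TAG": "api/abc123",
    "CONTAINER_ID": "abc123",
    "MESSAGE": "request handled",
}
group, stream = cloudwatch_destination(entry)
```

Grouping by service while streaming by container means an operator can scan one group to see a whole service, then drill into a single container's stream when debugging.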
At 7:02 he mentions that these open-source tools, such as ecs-logs and the rate-limiting proxy, can be found on GitHub under the segmentio organization.