Google's Container Management Paper (Borg Paper, Kubernetes)

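Late last month (2015-04-17), Google published a paper titled “Large-scale cluster management at Google with Borg”. [1]
I have recently been evaluating kubernetes [2] for work, so
- I was reading the kubernetes code and the Borg paper to answer my own questions, and keeping notes as I went.
- I will no longer be doing container-related work at the company, and it seemed a waste to throw the notes away, so I am leaving them here on the blog.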

Abstract

Google’s Borg system is a cluster manager that runs hundreds of thousands of jobs, from many thousands of different applications, across a number of clusters each with up to tens of thousands of machines. It achieves high utilization by combining admission control, efficient task-packing, over-commitment, and machine sharing with process-level performance isolation. It supports high-availability applications with runtime features that minimize fault-recovery time, and scheduling policies that reduce the probability of correlated failures. Borg simplifies life for its users by offering a declarative job specification language, name service integration, real-time job monitoring, and tools to analyze and simulate system behavior. We present a summary of the Borg system architecture and features, important design decisions, a quantitative analysis of some of its policy decisions, and a qualitative examination of lessons learned from a decade of operational experience with it.

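That is the abstract.
In short, Google's Borg system is a cluster manager that runs hundreds of thousands of jobs across clusters of up to tens of thousands of machines each.
According to the paper, Borg is written in C++, and kubernetes is the open-source project built by applying that know-how.
- kubernetes is currently built on the Go language and docker.
- Reportedly, most of the engineers who built Borg have moved over to kubernetes.
Google reportedly runs almost all of its servers in containers and starts 2 billion containers every week. [3]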

Only the parts of the paper that caught my interest.

2.1 The workload

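The workload is split into the following two types.
- long-running services (prod)
  - Must never go down.
  - Fast response times (a few μs to a few hundred ms)
  - e.g. Gmail, Google Docs, web search, and internal infrastructure services such as BigTable.
  - Allocated about 70% of total CPU and accounts for about 60% of total CPU usage.
  - Allocated about 55% of total memory and accounts for about 85% of total memory usage.
- batch jobs (non-prod)
  - Anything from jobs that finish in a few seconds to jobs that take several days.
The paper says the two types are run together in the same cell in order to make the most of the available resources.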

2.5 Priority, quota, and admission control

There are per-user quotas and job priorities, but I did not read this part closely.

2.6 Naming and monitoring

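Google built the “Borg Name Service” (BNS). Every task gets a name that includes the cell name, job name, and task number.
- Purpose: so that clients and other systems can still find a task after it has been rescheduled onto a different machine.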
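The idea of a stable name that survives rescheduling can be sketched as a tiny in-memory registry. This is only an illustration: the `TaskName` and `NameService` classes, the dotted name layout, and the host and port values are all assumptions made for this sketch, not the actual BNS format or API.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TaskName:
    """Hypothetical task name built from cell name, job name, and task number."""
    cell: str
    job: str
    task_index: int

    def __str__(self) -> str:
        # The dotted layout is an assumption made for this sketch.
        return f"{self.task_index}.{self.job}.{self.cell}"


class NameService:
    """Toy registry mapping a stable task name to its current endpoint."""

    def __init__(self) -> None:
        self._endpoints: dict[str, tuple[str, int]] = {}

    def register(self, name: TaskName, host: str, port: int) -> None:
        # Registering the same name again overwrites the old endpoint,
        # which models a task being moved to another machine.
        self._endpoints[str(name)] = (host, port)

    def resolve(self, name: TaskName) -> tuple[str, int]:
        return self._endpoints[str(name)]


ns = NameService()
task = TaskName(cell="cc", job="jfoo", task_index=50)
ns.register(task, "machine-a", 8080)
ns.register(task, "machine-b", 9090)  # the task was rescheduled elsewhere
print(task, "->", ns.resolve(task))
```

Callers only ever hold the name, so they never need to know which machine the task is on at the moment.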

3.2 Scheduling

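If the machine selected in the scoring phase does not have enough free resources to run the new task,
Borg kills lower-priority tasks and takes over their resources (preempts) until the new task fits.
The killed tasks go back into the scheduler's pending queue and, as far as I can tell, get to run again at some later point.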
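The preemption step described above can be sketched roughly as follows. This is a simplification under stated assumptions: a single CPU dimension stands in for all resources, a larger `priority` number means more important, victims are evicted lowest priority first, and the up-front feasibility check is my own addition. None of the names come from Borg itself.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    priority: int  # higher number = more important (assumption)
    cpu: float


@dataclass
class Machine:
    capacity: float
    running: list = field(default_factory=list)

    def free(self) -> float:
        return self.capacity - sum(t.cpu for t in self.running)


def place_with_preemption(machine, new_task, pending):
    """Evict lower-priority tasks, lowest first, until new_task fits."""
    victims = sorted(
        (t for t in machine.running if t.priority < new_task.priority),
        key=lambda t: t.priority,
    )
    # Check feasibility first so nothing is killed for no benefit.
    reclaimable = sum(t.cpu for t in victims)
    if machine.free() + reclaimable < new_task.cpu:
        return False
    for victim in victims:
        if machine.free() >= new_task.cpu:
            break
        machine.running.remove(victim)
        pending.append(victim)  # preempted tasks wait to be rescheduled
    machine.running.append(new_task)
    return True


m = Machine(capacity=4.0, running=[
    Task("batch-1", 0, 2.0), Task("batch-2", 1, 1.5), Task("web", 10, 0.5),
])
pending = []
ok = place_with_preemption(m, Task("search", 10, 2.0), pending)
print(ok, [t.name for t in m.running], [t.name for t in pending])
```

In this run only `batch-1` has to be evicted, because freeing its 2.0 CPUs is already enough for the new task.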
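Median task startup latency is about 25 seconds.
- About 80% of that is package installation. (The bottleneck is write performance on the local disk.)
As an aside, I have also spent quite a while lately reading through the kubernetes source code.
- When several candidates have the same score, the kubernetes scheduler is implemented to pick one of them at random. [4]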
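The tie-breaking behaviour can be illustrated with a few lines of Python. This is not the kubernetes code (which is written in Go); `pick_host` and the host names are hypothetical and only show the idea of choosing at random among equally scored candidates.

```python
import random


def pick_host(scores, rng=random):
    """Return one host with the highest score, chosen at random among ties."""
    if not scores:
        raise ValueError("no candidate hosts")
    best = max(scores.values())
    # Sort so the candidate order does not depend on dict insertion order.
    tied = sorted(host for host, score in scores.items() if score == best)
    return rng.choice(tied)


scores = {"node-a": 7, "node-b": 9, "node-c": 9}
chosen = pick_host(scores, random.Random(0))
print(chosen)
```

Passing a seeded `random.Random` makes a run reproducible, while the default module-level generator gives a fresh choice each time.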

3.3 Borglet

The Borgmaster polls each Borglet every few seconds to fetch the current state of its machine.
- Purpose: to keep the volume of communication under control.
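A pull-based loop like this keeps the rate of communication in the hands of the poller. The sketch below is a toy model, assuming a hypothetical `Borglet.report()` method and a fixed interval; the real protocol is not described in these notes.

```python
import time


class Borglet:
    """Stand-in for a per-machine agent; report() is a hypothetical method."""

    def __init__(self, machine: str) -> None:
        self.machine = machine
        self.polls = 0

    def report(self) -> dict:
        self.polls += 1
        return {"machine": self.machine, "polls_seen": self.polls}


def poll_loop(borglets, interval_s, rounds, sleep=time.sleep):
    """Pull state from every agent once per round; the caller sets the pace."""
    latest = {}
    for _ in range(rounds):
        for agent in borglets:
            latest[agent.machine] = agent.report()
        sleep(interval_s)
    return latest


agents = [Borglet("m1"), Borglet("m2")]
state = poll_loop(agents, interval_s=0.01, rounds=3)
print(state)
```

Because the agents never send anything on their own, a burst of machines coming back at once cannot flood the master; it simply sees them on its next round.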

3.4 Scalability

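The authors are not sure where the scalability limit of Borg's centralized architecture lies.
So far, every time they have approached a limit, they have managed to remove it.
A busy Borgmaster uses 10 to 14 CPU cores and up to 50 GiB of memory.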

4. Availability

The Borgmaster is said to achieve 99.99% availability.
- How?
  - replication for machine failures
  - admission control to avoid overload
  - deploying instances using simple, low-level tools to minimize external dependencies.
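To get a feel for what 99.99% means, the figure can be turned into a downtime budget. This is plain arithmetic, assuming availability is measured as a fraction of wall-clock time; `allowed_downtime_minutes` is a helper made up for this note.

```python
def allowed_downtime_minutes(availability: float, days: float) -> float:
    """Minutes of downtime permitted over `days` at the given availability."""
    return (1.0 - availability) * days * 24 * 60


per_year = allowed_downtime_minutes(0.9999, 365)
per_month = allowed_downtime_minutes(0.9999, 30)
print(f"{per_year:.2f} min/year, {per_month:.2f} min/30 days")
# prints "52.56 min/year, 4.32 min/30 days"
```

So the budget is under an hour per year, or a little over four minutes in a 30-day month.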

6. Isolation

50% of Google's machines run 9 or more tasks.
A 90%ile machine has about 25 tasks and runs about 4500 threads.

6.2 Performance isolation

compressible resources
- Resources that can be reclaimed from a task by lowering its quality of service, without killing it.
- e.g. CPU cycles, IO bandwidth
non-compressible resources
- Resources that cannot be reclaimed without killing the task.
- e.g. memory, disk space
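The distinction boils down to which action is available when a machine needs a resource back. A minimal sketch, where the resource keys and the `reclaim_action` function are names chosen for illustration only:

```python
COMPRESSIBLE = {"cpu", "io_bandwidth"}
NON_COMPRESSIBLE = {"memory", "disk"}


def reclaim_action(resource: str) -> str:
    """Say how a machine could get `resource` back from a running task."""
    if resource in COMPRESSIBLE:
        return "throttle"  # the task keeps running with degraded service
    if resource in NON_COMPRESSIBLE:
        return "kill"  # the resource only comes back when the task dies
    raise ValueError(f"unknown resource: {resource}")


for name in ("cpu", "memory"):
    print(name, "->", reclaim_action(name))
```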