Kubernetes
Advanced
Posit Package Manager can be configured to run on AWS in a Kubernetes cluster with EKS for a non-air-gapped environment. In this architecture, Package Manager can handle a large number of users and benefits from deploying across multiple availability zones.
This configuration is suitable for teams of hundreds of data scientists who want or require multiple availability zones.
Most companies don’t need to run in this configuration unless they have many concurrent package uploads/downloads or are required to run across multiple availability zones for compliance reasons. For small teams without these requirements, the single server architecture of Package Manager is more suitable.
Architecture Overview#
This Posit Package Manager implementation deploys the application in an EKS cluster following the Kubernetes installation instructions. It additionally leverages:
- AWS Application Load Balancer (ALB) to route requests to the Posit Package Manager service.
- AWS Elastic Kubernetes Service (EKS) to provision and manage the Kubernetes cluster.
- AWS Relational Database Service (RDS) for PostgreSQL, serving as the application database for Posit Package Manager.
- AWS Simple Storage Service (S3) for Posit Package Manager’s object storage.
Architecture Diagram#
EKS Cluster#
The Kubernetes cluster can be provisioned using AWS Elastic Kubernetes Service (EKS).
Nodes#
We recommend three worker nodes across multiple availability zones. We have tested with c6a.4xlarge instances (16 vCPUs, 32 GiB memory) for each node; this configuration can serve 30 million package installs per month, or 1 million package installs per day. It can also handle 100 Git builders concurrently building packages from Git repositories.
Note
Each Posit Package Manager user could be downloading dozens or hundreds of packages a day. There are also other usage patterns, such as an admin uploading local packages or the server building packages for Git builders, but package installations give a good idea of the load and throughput this configuration can handle. This is the configuration the Posit Public Package Manager service currently runs on, so we don’t anticipate any individual customer needing to scale beyond it.
This reference architecture does not assume autoscaling node groups; it assumes you have a fixed number of nodes within your node group. However, it is safe for Posit Package Manager pods to run on autoscaling nodes. If a pod is evicted from a node due to a scale-down event, any long-running jobs (e.g., Git builds) that are in progress will be restarted on a different pod. All long-running jobs are tracked in a database queue.
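As a rough sketch of how a fixed-size node group like this could be provisioned with eksctl, the configuration might look like the following. The cluster name, region, and availability zones are placeholders; adjust them to your environment.
# Hypothetical eksctl cluster definition matching the sizing above
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: package-manager          # placeholder cluster name
  region: us-east-1              # placeholder region

availabilityZones: ["us-east-1a", "us-east-1b", "us-east-1c"]

managedNodeGroups:
  - name: package-manager-nodes
    instanceType: c6a.4xlarge    # 16 vCPUs, 32 GiB memory
    desiredCapacity: 3           # one node per availability zone
    minSize: 3
    maxSize: 3                   # fixed size; no autoscaling assumed
    privateNetworking: true      # nodes run in the private subnets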
Database#
This configuration uses RDS for PostgreSQL on a db.t3.xlarge instance with 100 GB of storage and Multi-AZ enabled. Multi-AZ runs the RDS instance in an active/passive configuration across two availability zones, with automatic failover when the primary instance goes down. This is a very generous configuration: in our testing, the Postgres database handled 1,000,000+ package installs per day without exceeding 10-20% CPU utilization.
The RDS instance should be configured with an empty Postgres database for the Posit Package Manager metadata. To handle a higher number of concurrent users, the configuration option PostgresPool.MaxOpenConnections
should be increased to 50.
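As an illustrative sketch, these database settings can be supplied through the Helm chart’s config values. The connection URL and database name below are placeholders, and the option names should be verified against the Package Manager admin guide.
# Illustrative database settings in values.yaml
config:
  Database:
    Provider: postgres
  Postgres:
    URL: "postgres://packagemanager@<rds-endpoint>:5432/packagemanager"  # placeholder URL
    # Supply the database password separately (e.g., from a Kubernetes secret)
    # rather than committing it to values.yaml.
  PostgresPool:
    MaxOpenConnections: 50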
Storage#
The S3 bucket is used to store data about packages and sources, as well as cached metadata to decrease response times for requests. S3 can also be used with KMS for client-side encryption.
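As an illustrative sketch (the bucket name and region are placeholders; verify the option names against the admin guide), the storage backend can likewise be configured through the Helm chart’s config values:
# Illustrative S3 storage settings in values.yaml
config:
  Storage:
    Default: s3
  S3Storage:
    Bucket: "my-package-manager-bucket"   # placeholder bucket name
    Region: "us-east-1"                   # placeholder region
    # Grant access via an IAM role (e.g., IRSA) instead of static credentials.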
Networking#
Posit Package Manager should be deployed in an EKS cluster with the control plane and node group in private subnets, with ingress provided by an Application Load Balancer in a public subnet. The deployment should span multiple availability zones.
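A minimal sketch of such an ingress, assuming the AWS Load Balancer Controller is installed in the cluster, is shown below. The hostname, service name, and port are placeholders and should match your Helm release.
# Illustrative Ingress for the AWS Load Balancer Controller
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: rstudio-pm
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing   # ALB in the public subnets
    alb.ingress.kubernetes.io/target-type: ip           # route directly to pod IPs
spec:
  ingressClassName: alb
  rules:
    - host: packages.example.com            # placeholder hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: rstudio-pm            # placeholder: service created by the Helm chart
                port:
                  number: 80                # placeholder: match the chart's service port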
Configuration Details#
The configuration of Package Manager is managed through the official Helm chart: https://github.com/rstudio/helm/tree/main/charts/rstudio-pm. For complete details, refer to the Kubernetes installation steps.
Encryption key#
When running with more than one replica, it is important that each replica has the same encryption key. To ensure that each replica has access to the same encryption key, create a Kubernetes secret and then expose it as an environment variable using the values.yaml file. For more details on the encryption key, see the Configuration Encryption page.
First, create an encryption key and Kubernetes secret:
# Create an encryption key
openssl rand -hex 64
# Create a secret. Replace 'xxxx' with your encryption key.
kubectl create secret generic ppm-secrets --from-literal=ppm-encryption-key-user='xxxx'
Then, update your values.yaml
file to use the secret:
# How to use the secret in your values.yaml
pod:
  env:
    - name: PACKAGEMANAGER_ENCRYPTION_KEY
      valueFrom:
        secretKeyRef:
          name: ppm-secrets
          key: ppm-encryption-key-user
Replicas#
This reference architecture uses three replicas for the Posit Package Manager service. To ensure that the replicas run on different nodes, set topologySpreadConstraints:
replicas: 3
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        # The Helm chart will add this label to the pods automatically
        app.kubernetes.io/name: rstudio-pm
Resiliency and Availability#
This configuration of Posit Package Manager has been deployed on the Posit Public Package Manager service. As a publicly available service, the architecture is tested by the R and Python communities that use it. Public Package Manager is used by many more users than any private Posit Package Manager instance. The current uptime for the Posit Public Package Manager service can be found on the status page.
FAQ#
See the Frequently Asked Questions page for more information.