Infrastructure Overview¶
The infrastructure runs on EPFL's Kubernetes platform (ENAC-K8S) with PostgreSQL for persistence and monitoring via Prometheus/Grafana. This overview covers deployment, hosting, and operational procedures.
For system architecture and deployment decisions, see:
- Deployment Topology - Infrastructure architecture
- Environments - Environment configuration
- CI/CD Pipeline - Automated deployment
- Tech Stack - Infrastructure technology choices
Platform¶
- Kubernetes: EPFL internal cluster (ENAC-K8S)
- Container Runtime: Docker
- Orchestration: Kubernetes with Helm charts
- Load Balancing: EPFL load balancer (internal only, not publicly accessible)
- Storage: RCP NAS for persistent volumes
- Monitoring: Prometheus, Grafana, OpenTelemetry
- Secrets Management: Infisical
- Deployment: ArgoCD (GitOps) + GitHub Actions
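As a concrete illustration of the GitOps flow, an ArgoCD Application for the dev environment might look like the sketch below. The repository URL is a placeholder; the chart path and value files follow the Helm layout shown later on this page.
# Hypothetical ArgoCD Application wiring the Helm chart to the dev namespace
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: co2-calculator-dev
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/<org>/co2-calculator.git  # placeholder repository
    targetRevision: main
    path: helm
    helm:
      valueFiles:
        - values-dev.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: co2-calculator-dev
  syncPolicy:
    automated:
      prune: true   # remove resources deleted from Git
      selfHeal: true  # revert out-of-band changes to the cluster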
Network Architecture¶
Internet
↓
EPFL VPN (required for external access)
↓
EPFL Load Balancer (internal only)
↓
Kubernetes Ingress Controller
↓
Services (frontend, backend, database)
Access: The application is NOT publicly accessible outside the EPFL network. Users must connect via the EPFL VPN or be on the EPFL campus network.
Deployment¶
Helm Chart Structure¶
helm/
├── Chart.yaml            # Chart metadata
├── values.yaml           # Default configuration
├── values-dev.yaml       # Development overrides
├── values-staging.yaml   # Staging overrides
├── values-prod.yaml      # Production overrides
└── templates/
    ├── frontend-deployment.yaml
    ├── backend-deployment.yaml
    ├── postgres-statefulset.yaml
    ├── redis-deployment.yaml
    ├── celery-deployment.yaml
    ├── ingress.yaml
    ├── services.yaml
    ├── configmaps.yaml
    └── secrets.yaml      # managed by Infisical
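Before deploying, the chart can be validated and rendered locally; a quick sanity check might look like this:
# Validate chart syntax and render the dev manifests locally for inspection
helm lint ./helm --values helm/values-dev.yaml
helm template co2-calculator ./helm --values helm/values-dev.yaml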
Deploying to Environments¶
Via ArgoCD (Recommended):
ArgoCD automatically syncs the Git repository to the Kubernetes cluster.
# View ArgoCD applications
argocd app list
# Sync application manually
argocd app sync co2-calculator-dev
# View deployment status
argocd app get co2-calculator-dev
Via Helm (Manual):
# Deploy to development
helm upgrade --install co2-calculator ./helm \
  --namespace co2-calculator-dev \
  --values helm/values-dev.yaml
# Deploy to staging
helm upgrade --install co2-calculator ./helm \
  --namespace co2-calculator-staging \
  --values helm/values-staging.yaml
# Deploy to production
helm upgrade --install co2-calculator ./helm \
  --namespace co2-calculator-prod \
  --values helm/values-prod.yaml
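If a manual release misbehaves, Helm keeps a revision history that can be rolled back (the revision number below is a placeholder):
# Inspect release history and roll back to a previous revision
helm history co2-calculator -n co2-calculator-prod
helm rollback co2-calculator <revision> -n co2-calculator-prod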
Deployment Checklist¶
Before deploying to production:
- All tests pass in CI/CD pipeline
- Staging deployment tested and validated
- Database migrations reviewed and tested
- Environment variables configured correctly
- Secrets rotated if needed
- Monitoring dashboards ready
- Rollback plan documented
- Stakeholders notified of deployment window
Kubernetes Resources¶
Namespaces¶
- co2-calculator-dev: Development environment
- co2-calculator-staging: Staging environment
- co2-calculator-prod: Production environment
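If the namespaces do not exist yet (e.g., on a fresh cluster), they can be created once up front:
# One-time namespace creation
kubectl create namespace co2-calculator-dev
kubectl create namespace co2-calculator-staging
kubectl create namespace co2-calculator-prod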
Deployments¶
# Frontend (Nginx + Vue SPA)
frontend:
  replicas: 2
  resources:
    requests: { cpu: 100m, memory: 128Mi }
    limits: { cpu: 500m, memory: 256Mi }
# Backend (FastAPI + Uvicorn)
backend:
  replicas: 3
  resources:
    requests: { cpu: 200m, memory: 512Mi }
    limits: { cpu: 1000m, memory: 1Gi }
# Celery Workers
celery-worker:
  replicas: 2
  resources:
    requests: { cpu: 200m, memory: 512Mi }
    limits: { cpu: 1000m, memory: 2Gi }
# Redis (task queue)
redis:
  replicas: 1
  resources:
    requests: { cpu: 100m, memory: 256Mi }
    limits: { cpu: 500m, memory: 512Mi }
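Alongside the resource settings above, each deployment should declare health probes. A minimal sketch for the backend, assuming the /health endpoint used under Health Checks later on this page (the timing values are illustrative):
# Sketch: backend liveness/readiness probes (assumes the /health endpoint shown under Health Checks)
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 10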
StatefulSet¶
# PostgreSQL
postgres:
  replicas: 1
  storage: 50Gi  # RCP NAS persistent volume
  resources:
    requests: { cpu: 500m, memory: 1Gi }
    limits: { cpu: 2000m, memory: 4Gi }
Services¶
# Service definitions
frontend-service:
  type: ClusterIP
  port: 80
backend-service:
  type: ClusterIP
  port: 8000
postgres-service:
  type: ClusterIP
  port: 5432
redis-service:
  type: ClusterIP
  port: 6379
Ingress¶
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: co2-calculator-ingress
  annotations:
    kubernetes.io/ingress.class: nginx
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  tls:
    - hosts:
        - co2calculator.epfl.ch
      secretName: co2calculator-tls
  rules:
    - host: co2calculator.epfl.ch
      http:
        paths:
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: backend-service
                port:
                  number: 8000
          - path: /
            pathType: Prefix
            backend:
              service:
                name: frontend-service
                port:
                  number: 80
Configuration Management¶
ConfigMaps¶
# Application configuration (non-sensitive)
apiVersion: v1
kind: ConfigMap
metadata:
  name: backend-config
data:
  DATABASE_HOST: postgres-service
  DATABASE_PORT: "5432"
  DATABASE_NAME: co2calculator
  REDIS_HOST: redis-service
  REDIS_PORT: "6379"
  LOG_LEVEL: INFO
  CORS_ORIGINS: https://co2calculator.epfl.ch
Secrets (Infisical)¶
Secrets are managed via Infisical and injected into pods:
apiVersion: v1
kind: Secret
metadata:
  name: backend-secrets
type: Opaque
data:
  DATABASE_PASSWORD: <base64-encoded>
  SECRET_KEY: <base64-encoded>
  OIDC_CLIENT_SECRET: <base64-encoded>
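The backend deployment consumes both the ConfigMap above and this Secret; a minimal sketch of the relevant part of the container spec:
# Sketch: injecting config and secrets into the backend container as environment variables
containers:
  - name: backend
    envFrom:
      - configMapRef:
          name: backend-config
      - secretRef:
          name: backend-secrets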
Management:
# View secrets (requires cluster admin access)
kubectl get secrets -n co2-calculator-prod
# Describe secret (no values shown)
kubectl describe secret backend-secrets -n co2-calculator-prod
# Rotate secret
# Update in Infisical → ArgoCD auto-syncs → Restart deployment
kubectl rollout restart deployment backend -n co2-calculator-prod
Storage¶
Persistent Volumes¶
# PostgreSQL persistent volume claim
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: rcp-nas  # EPFL RCP NAS storage
  resources:
    requests:
      storage: 50Gi
Storage Classes¶
- rcp-nas: RCP NAS storage (default, for database)
- local-path: Local node storage (for temporary files)
Backup Storage¶
- Database backups: RCP NAS (/backups/postgres/)
- Application files: RCP NAS (/data/uploads/)
- Long-term archives: EPFL S3-compatible storage (if configured)
Monitoring & Observability¶
Prometheus Metrics¶
Backend metrics (/metrics endpoint):
# Request metrics
http_requests_total{method="GET", endpoint="/api/v1/labs", status="200"}
http_request_duration_seconds_bucket{le="0.1"}
# Application metrics
db_connections_active
db_query_duration_seconds
celery_tasks_running
celery_tasks_failed_total
# Resource metrics
process_cpu_seconds_total
process_resident_memory_bytes
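To verify that metrics are actually being exported, the endpoint can be queried from a throwaway pod inside the cluster (same pattern as the debug pod used under Health Checks):
# Spot-check the /metrics endpoint from inside the cluster
kubectl run -it --rm metrics-check --image=curlimages/curl --restart=Never -n co2-calculator-prod -- \
  curl -s http://backend-service:8000/metrics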
Scrape configuration:
# prometheus-config
scrape_configs:
  - job_name: "co2-calculator-backend"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: backend
        action: keep
Grafana Dashboards¶
Main Dashboards:
- Application Overview
  - Request rate, latency, error rate
  - Active users, database connections
  - Celery queue length, task processing time
- Infrastructure Health
  - CPU, memory, disk usage per pod
  - Network I/O, request throughput
  - Pod restart count, OOMKills
- Database Performance
  - Query duration, slow queries
  - Connection pool usage
  - Table sizes, index usage
Access: https://grafana.epfl.ch (requires EPFL credentials)
Alerting¶
Critical Alerts (PagerDuty/email):
- Service down (all replicas unhealthy)
- Database connection failure
- Disk space > 90%
- Error rate > 5% for 5 minutes
Warning Alerts (Slack/email):
- High memory usage (> 80%)
- Slow response time (> 1s p95)
- Celery queue backlog (> 1000 tasks)
- Certificate expiry (< 7 days)
Configuration: Alert rules defined in helm/templates/prometheus-rules.yaml
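As an illustration, the "error rate > 5% for 5 minutes" alert could be expressed in that file roughly as follows. This is a sketch assuming the Prometheus Operator's PrometheusRule CRD and the http_requests_total metric shown above:
# Sketch: one alert rule as it might appear in helm/templates/prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: co2-calculator-alerts
spec:
  groups:
    - name: co2-calculator
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Backend error rate above 5% for 5 minutes"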
Logging¶
Log Aggregation: Logs are shipped to a centralized logging stack (ELK or similar)
# View pod logs
kubectl logs -n co2-calculator-prod deployment/backend --tail=100 -f
# View logs from all backend replicas
kubectl logs -n co2-calculator-prod -l app=backend --tail=50
# Search logs (if Kibana available)
# Access Kibana UI and filter by namespace: co2-calculator-prod
Log Format: Structured JSON logs
{
  "timestamp": "2025-11-11T10:15:30Z",
  "level": "INFO",
  "logger": "app.services.labs",
  "message": "Laboratory created",
  "lab_id": "abc123",
  "user_id": "user456",
  "trace_id": "xyz789"
}
Scaling¶
Horizontal Pod Autoscaler (HPA)¶
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: backend-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
Trigger: Auto-scales when CPU > 70% or memory > 80%
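Current autoscaler behavior can be inspected at any time:
# Check HPA status, observed metrics, and replica counts
kubectl get hpa backend-hpa -n co2-calculator-prod
kubectl describe hpa backend-hpa -n co2-calculator-prod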
Manual Scaling¶
# Scale backend replicas
kubectl scale deployment backend --replicas=5 -n co2-calculator-prod
# Scale Celery workers
kubectl scale deployment celery-worker --replicas=4 -n co2-calculator-prod
Operations¶
Common kubectl Commands¶
# View all resources in namespace
kubectl get all -n co2-calculator-prod
# Check pod status
kubectl get pods -n co2-calculator-prod
# Describe pod (troubleshooting)
kubectl describe pod backend-xxxxx -n co2-calculator-prod
# View pod logs
kubectl logs backend-xxxxx -n co2-calculator-prod
# Execute command in pod
kubectl exec -it backend-xxxxx -n co2-calculator-prod -- /bin/bash
# Port forward for local access
kubectl port-forward svc/backend-service 8000:8000 -n co2-calculator-prod
# Restart deployment
kubectl rollout restart deployment backend -n co2-calculator-prod
# View rollout status
kubectl rollout status deployment backend -n co2-calculator-prod
# Rollback deployment
kubectl rollout undo deployment backend -n co2-calculator-prod
Health Checks¶
# Check ingress
kubectl get ingress -n co2-calculator-prod
# Test backend health endpoint
curl https://co2calculator.epfl.ch/api/health
# Test from inside cluster
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -n co2-calculator-prod -- \
  curl http://backend-service:8000/health
Database Operations¶
# Connect to PostgreSQL
kubectl exec -it postgres-0 -n co2-calculator-prod -- psql -U postgres -d co2calculator
# Backup database
kubectl exec postgres-0 -n co2-calculator-prod -- \
  pg_dump -U postgres co2calculator > backup_$(date +%Y%m%d).sql
# Restore database
kubectl exec -i postgres-0 -n co2-calculator-prod -- \
  psql -U postgres co2calculator < backup.sql
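To avoid relying on manual dumps, the backup could be scheduled in-cluster. A sketch follows: the backups-pvc claim name and the postgres:16 image are assumptions, while the target path matches the RCP NAS backup location listed under Storage.
# Hypothetical CronJob running a nightly pg_dump to the RCP NAS backup volume
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
  namespace: co2-calculator-prod
spec:
  schedule: "0 2 * * *"  # nightly at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: pg-dump
              image: postgres:16  # assumed image; match the running server version
              command: ["/bin/sh", "-c"]
              args:
                - pg_dump -h postgres-service -U postgres co2calculator > /backups/postgres/backup_$(date +%Y%m%d).sql
              env:
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: backend-secrets
                      key: DATABASE_PASSWORD
              volumeMounts:
                - name: backups
                  mountPath: /backups/postgres
          volumes:
            - name: backups
              persistentVolumeClaim:
                claimName: backups-pvc  # hypothetical PVC on RCP NAS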
Troubleshooting¶
Pod Not Starting¶
# Check pod events
kubectl describe pod backend-xxxxx -n co2-calculator-prod
# Common issues:
# - ImagePullBackOff: Check image name/tag, registry credentials
# - CrashLoopBackOff: Check logs, environment variables
# - Pending: Check resource requests, node capacity
Service Not Reachable¶
# Check service endpoints
kubectl get endpoints backend-service -n co2-calculator-prod
# Check ingress configuration
kubectl describe ingress co2-calculator-ingress -n co2-calculator-prod
# Test service from inside cluster
kubectl run -it --rm test --image=curlimages/curl --restart=Never -n co2-calculator-prod -- \
  curl http://backend-service:8000/health
Database Connection Issues¶
# Check PostgreSQL pod status
kubectl get pod postgres-0 -n co2-calculator-prod
# Check PostgreSQL logs
kubectl logs postgres-0 -n co2-calculator-prod
# Test connection from backend pod
kubectl exec -it backend-xxxxx -n co2-calculator-prod -- \
  psql "postgresql://postgres:<password>@postgres-service:5432/co2calculator" -c "SELECT 1;"
High Resource Usage¶
# Check resource usage
kubectl top pods -n co2-calculator-prod
# Check resource limits
kubectl describe pod backend-xxxxx -n co2-calculator-prod | grep -A 5 "Limits"
# View historical metrics in Grafana
# Or use kubectl metrics (if metrics-server installed)
Security¶
Network Policies¶
# Restrict backend to only accept traffic from frontend and ingress
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-network-policy
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
        - namespaceSelector:  # the ingress controller runs in its own namespace, so a podSelector alone would not match it
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
      ports:
        - protocol: TCP
          port: 8000
Pod Security¶
- Run as non-root: All containers run as non-root user
- Read-only filesystem: Containers have read-only root filesystem where possible
- Drop capabilities: Unnecessary Linux capabilities dropped
- Resource limits: CPU and memory limits enforced
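A container-level securityContext implementing the points above might look like this sketch (the UID is an assumption and should match the image's non-root user):
# Sketch: container securityContext enforcing the hardening measures listed above
securityContext:
  runAsNonRoot: true
  runAsUser: 1000  # assumed UID
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]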
TLS/SSL¶
- Certificate management: cert-manager with Let's Encrypt
- TLS termination: At ingress controller
- Internal communication: Unencrypted (within cluster network)
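The letsencrypt-prod issuer referenced by the ingress annotation is a cert-manager ClusterIssuer; a sketch is shown below (the contact address is a placeholder):
# Sketch of the letsencrypt-prod ClusterIssuer assumed by the ingress annotation
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: <ops-contact>@epfl.ch  # placeholder contact address
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - http01:
          ingress:
            class: nginx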
Additional Resources¶
Architecture Documentation¶
- Deployment Topology - Full deployment architecture
- Scalability Strategy - Scaling patterns
- CI/CD Pipeline - Deployment automation
EPFL Resources¶
- EPFL Kubernetes Documentation (internal)
- ENAC-IT Support (for infrastructure issues)
- RCP NAS Documentation (for storage)
External Documentation¶
- Kubernetes - https://kubernetes.io/docs/
- Helm - https://helm.sh/docs/
- ArgoCD - https://argo-cd.readthedocs.io/
- Prometheus - https://prometheus.io/docs/
Last Updated: November 11, 2025
Reading time: ~10 minutes