
ArchiveBox

Self-hosted web archiving platform. ArchiveBox captures each website in multiple formats in one pass (HTML, PDF, screenshot, WARC, media extraction, git clone) and renders pages with a built-in headless Chromium browser. All archived content and the SQLite database are stored on a single large persistent volume.

Single instance only — no horizontal scaling

ArchiveBox uses SQLite as its database, and SQLite allows only one writer at a time. Running multiple replicas against the same database file can corrupt it, so keep replicaCount at 1.

Key Features

  • Multi-format archiving — HTML, PDF, screenshot, WARC, media, git clone in one pass
  • Chromium headless — full JavaScript rendering with /dev/shm tmpfs for stability
  • Three search backends — ripgrep (default), sqlite, or Sonic for full-text search
  • Access control — configurable public/private access for index, snapshots, and adding links
  • Non-root by default — runs as UID 911 out of the box
  • S3 backup — full /data directory backup (SQLite + all archived files) to S3-compatible storage
  • Persistent storage — single large PVC for all archived content and database

Installation

HTTPS repository:

helm repo add helmforge https://repo.helmforge.dev
helm repo update
helm install archivebox helmforge/archivebox -f values.yaml

OCI registry:

helm install archivebox oci://ghcr.io/helmforgedev/helm/archivebox -f values.yaml

Deployment Examples

# values.yaml — Basic ArchiveBox with Traefik ingress
archivebox:
  adminUsername: admin
  adminPassword: 'my-secure-password'

persistence:
  enabled: true
  size: 100Gi

ingress:
  enabled: true
  ingressClassName: traefik
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
  hosts:
    - host: archive.example.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: archivebox-tls
      hosts:
        - archive.example.com

# values.yaml: Private instance (no public access)
# Recommended for internet-facing deployments
archivebox:
  adminUsername: admin
  adminPassword: 'my-secure-password'
  allowedHosts: 'archive.example.com'
  publicIndex: 'False'
  publicSnapshots: 'False'
  publicAddLinks: 'False'

persistence:
  enabled: true
  size: 100Gi

ingress:
  enabled: true
  ingressClassName: traefik
  hosts:
    - host: archive.example.com
      paths:
        - path: /
          pathType: Prefix

# values.yaml: Production setup with explicit resource limits
# Chromium requires at least 2Gi RAM to archive pages reliably
archivebox:
  adminUsername: admin
  adminPassword: 'my-secure-password'
  timeout: '120'
  mediaMaxSize: '1g'

resources:
  requests:
    cpu: 500m
    memory: 2Gi
  limits:
    cpu: 2000m
    memory: 4Gi

persistence:
  enabled: true
  size: 200Gi

ingress:
  enabled: true
  ingressClassName: traefik
  hosts:
    - host: archive.example.com
      paths:
        - path: /
          pathType: Prefix

# values.yaml: Daily S3 backup of the full /data directory
# Backup includes both the SQLite database and all archived files.
archivebox:
  adminUsername: admin
  adminPassword: 'my-secure-password'

persistence:
  enabled: true
  size: 100Gi

backup:
  enabled: true
  schedule: '0 3 * * *'
  s3:
    endpoint: https://s3.amazonaws.com
    bucket: my-archivebox-backups
    prefix: daily
    existingSecret: archivebox-s3

ingress:
  enabled: true
  ingressClassName: traefik
  hosts:
    - host: archive.example.com
      paths:
        - path: /
          pathType: Prefix

Configuration Reference

Security Scan

Kubescape scan results from the chart standards backfill:

Framework Score
MITRE 100%
NSA 65%
SOC2 80%

Production hardening should include explicit CPU and memory limits, a restrictive NetworkPolicy for the namespace where required, and existing Secrets for the admin and S3 credentials.
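
One way to keep the admin credentials out of values.yaml is to create the Secret yourself and point the chart at it through archivebox.existingSecret. The sketch below assumes a hypothetical Secret name, archivebox-admin, and a placeholder password; the key names are the defaults listed for archivebox.existingSecretUsernameKey and archivebox.existingSecretPasswordKey.

```yaml
# secret.yaml: apply with kubectl before installing the chart.
# The name "archivebox-admin" is a hypothetical example.
apiVersion: v1
kind: Secret
metadata:
  name: archivebox-admin
type: Opaque
stringData:
  admin-username: admin
  admin-password: change-me   # placeholder, replace before applying
---
# values.yaml fragment: reference the Secret instead of an inline password.
archivebox:
  existingSecret: archivebox-admin
  existingSecretUsernameKey: admin-username
  existingSecretPasswordKey: admin-password
```

The two documents belong in separate files; they are shown together only to make the matching names and keys easy to compare.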

Core

Parameter Type Default Description
nameOverride string "" Override the chart name.
fullnameOverride string "" Override the full release name.
commonLabels object {} Extra labels added to all resources.

Image

Parameter Type Default Description
image.repository string docker.io/archivebox/archivebox ArchiveBox container image.
image.tag string "0.7.4" Image tag.
image.pullPolicy string IfNotPresent Image pull policy.
imagePullSecrets array [] Pull secrets for private registries.

ArchiveBox Configuration

Parameter Type Default Description
archivebox.port integer 8000 Internal HTTP port.
archivebox.adminUsername string admin Admin account username, created on first run.
archivebox.adminPassword string "" Admin account password. Auto-generated if empty.
archivebox.existingSecret string "" Existing Kubernetes Secret containing admin credentials.
archivebox.existingSecretUsernameKey string admin-username Key in the existing secret for the admin username.
archivebox.existingSecretPasswordKey string admin-password Key in the existing secret for the admin password.
archivebox.allowedHosts string "*" Comma-separated allowed hostnames. Set to your domain in production.
archivebox.publicIndex string "True" Allow unauthenticated access to the archive index page.
archivebox.publicSnapshots string "True" Allow unauthenticated access to archived snapshots.
archivebox.publicAddLinks string "False" Allow unauthenticated access to the add URL view (PUBLIC_ADD_VIEW).
archivebox.searchBackendEngine string ripgrep Search backend: ripgrep (default), sqlite, or sonic.
archivebox.mediaMaxSize string "750m" Maximum size for media downloads (e.g. 750m, 1g).
archivebox.timeout string "60" Timeout per URL archiving job, in seconds.
archivebox.timezone string UTC Timezone for scheduled tasks and timestamps.
archivebox.extraEnv array [] Extra environment variables for advanced configuration.

Restrict public access before exposing to the internet

By default, publicIndex and publicSnapshots are both "True". Anyone who reaches your ArchiveBox URL can browse your entire archive and view all captured pages. For internet-facing deployments, set both to "False" and restrict allowedHosts to your exact domain.

Search backend trade-offs
  • ripgrep (default): fast grep-based full-text search, no extra dependencies, searches HTML files directly
  • sqlite: uses SQLite FTS5, no extra setup, slower on large archives
  • sonic: fastest on large archives, requires a separate Sonic server deployed alongside ArchiveBox
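
Switching backends is a values change. The fragment below is a minimal sketch for the sonic option, with several assumptions: that archivebox.extraEnv accepts standard Kubernetes env entries, that a Sonic server is already reachable at the hypothetical Service name sonic on port 1491, and that its password lives in a hypothetical Secret called sonic-auth. The SEARCH_BACKEND_* variable names follow ArchiveBox's own configuration options and should be checked against the documentation for the image version in use.

```yaml
# values.yaml fragment: full-text search through an external Sonic server.
# For the SQLite FTS5 backend, set searchBackendEngine: sqlite and drop extraEnv.
archivebox:
  searchBackendEngine: sonic
  extraEnv:
    - name: SEARCH_BACKEND_HOST_NAME
      value: sonic              # hypothetical Service name of the Sonic server
    - name: SEARCH_BACKEND_PORT
      value: '1491'             # Sonic's default port
    - name: SEARCH_BACKEND_PASSWORD
      valueFrom:
        secretKeyRef:
          name: sonic-auth      # hypothetical Secret holding the Sonic password
          key: password
```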

Persistence

The PVC stores the entire /data directory: SQLite database, all archived files (HTML, PDFs, screenshots, WARCs, media), and ArchiveBox configuration. Size it generously — archived pages with media can consume gigabytes quickly.

Parameter Type Default Description
persistence.enabled boolean true Enable a PVC for /data (database + all archived content).
persistence.size string 50Gi PVC size. Plan for 100Gi or more for active archiving.
persistence.storageClass string "" StorageClass for the PVC.
persistence.accessModes array ["ReadWriteOnce"] PVC access modes.
persistence.existingClaim string "" Use an existing PVC instead of creating one.

NFS storage may require fsGroup configuration

ArchiveBox runs as UID/GID 911 by default (podSecurityContext.fsGroup: 911). Some NFS provisioners ignore fsGroup and may cause permission errors on the /data directory. If using NFS, configure your provisioner to support fsGroup or override podSecurityContext and securityContext accordingly.
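
If the provisioner cannot honor fsGroup, one option is to run the pod as the UID/GID that already owns the NFS export. The fragment below is a sketch under that assumption; 1000 is a placeholder, and the source does not confirm that the image behaves correctly under a UID other than 911, so test this before relying on it. Changing the export's ownership to 911:911 on the NFS server avoids the override entirely.

```yaml
# values.yaml fragment: match the pod identity to the owner of the NFS export.
# 1000 is a placeholder; substitute the UID/GID that owns your export.
podSecurityContext:
  fsGroup: 1000
securityContext:
  runAsUser: 1000
  runAsGroup: 1000
  runAsNonRoot: true
```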

Backup

The S3 backup archives the full /data directory — including the SQLite database and all archived files. This is a complete backup of the entire ArchiveBox dataset, unlike other charts where only media files are backed up.

Parameter Type Default Description
backup.enabled boolean false Enable scheduled S3 backup CronJob.
backup.schedule string "0 3 * * *" Cron schedule for backups.
backup.suspend boolean false Suspend the CronJob without deleting it.
backup.concurrencyPolicy string Forbid CronJob concurrency policy.
backup.successfulJobsHistoryLimit integer 3 Number of successful Job records to keep.
backup.failedJobsHistoryLimit integer 3 Number of failed Job records to keep.
backup.backoffLimit integer 1 Job retry limit.
backup.archivePrefix string archivebox Prefix for backup archive filenames.
backup.images.tar string docker.io/library/alpine:3.22 Image used for tar archive.
backup.images.uploader string docker.io/helmforge/mc:1.0.0 Image used for S3 upload.
backup.resources object {} Resources for backup containers.
backup.s3.endpoint string "" S3-compatible endpoint URL.
backup.s3.bucket string "" Target bucket name.
backup.s3.prefix string archivebox Key prefix within the bucket.
backup.s3.createBucketIfNotExists boolean true Create the bucket automatically if it does not exist.
backup.s3.existingSecret string "" Existing secret containing S3 access and secret keys.
backup.s3.existingSecretAccessKeyKey string access-key Key in the existing secret for the S3 access key.
backup.s3.existingSecretSecretKeyKey string secret-key Key in the existing secret for the S3 secret key.
backup.s3.accessKey string "" Inline S3 access key (ignored when existingSecret is set).
backup.s3.secretKey string "" Inline S3 secret key (ignored when existingSecret is set).
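
For backup.s3.existingSecret, the Secret only needs the two keys named in the table above. The sketch below uses the Secret name archivebox-s3 from the backup example and the default key names access-key and secret-key; the credential values are placeholders.

```yaml
# s3-secret.yaml: apply with kubectl before enabling the backup CronJob.
apiVersion: v1
kind: Secret
metadata:
  name: archivebox-s3
type: Opaque
stringData:
  access-key: REPLACE_WITH_ACCESS_KEY   # placeholder
  secret-key: REPLACE_WITH_SECRET_KEY   # placeholder
```

With existingSecret set, the inline backup.s3.accessKey and backup.s3.secretKey values are ignored, so they can stay empty.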

Service

Parameter Type Default Description
service.type string ClusterIP Kubernetes service type.
service.port integer 80 Service port exposed to the cluster.
service.annotations object {} Annotations for the Service.

Ingress

Parameter Type Default Description
ingress.enabled boolean false Enable an Ingress resource.
ingress.ingressClassName string traefik Ingress class name.
ingress.annotations object {} Annotations for the Ingress (e.g. cert-manager).
ingress.hosts array [] Ingress host and path rules.
ingress.tls array [] TLS configuration (secret name and hosts).

Probes

Probes use the /health/ endpoint.

Parameter Type Default Description
probes.startup.enabled boolean true Enable startup probe.
probes.startup.initialDelaySeconds integer 15 Startup probe initial delay.
probes.startup.periodSeconds integer 5 Startup probe period.
probes.startup.timeoutSeconds integer 3 Startup probe timeout.
probes.startup.failureThreshold integer 30 Startup probe failure threshold.
probes.liveness.enabled boolean true Enable liveness probe.
probes.liveness.initialDelaySeconds integer 0 Liveness probe initial delay.
probes.liveness.periodSeconds integer 15 Liveness probe period.
probes.liveness.timeoutSeconds integer 5 Liveness probe timeout.
probes.liveness.failureThreshold integer 3 Liveness probe failure threshold.
probes.readiness.enabled boolean true Enable readiness probe.
probes.readiness.initialDelaySeconds integer 0 Readiness probe initial delay.
probes.readiness.periodSeconds integer 10 Readiness probe period.
probes.readiness.timeoutSeconds integer 5 Readiness probe timeout.
probes.readiness.failureThreshold integer 3 Readiness probe failure threshold.
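
With the defaults above, the startup probe gives the container roughly initialDelaySeconds + failureThreshold × periodSeconds = 15 + 30 × 5 = 165 seconds before Kubernetes restarts it. If the first start takes longer in your environment (slow storage is one plausible cause, though that is an assumption here), raising failureThreshold widens the window without affecting the liveness or readiness probes. A minimal sketch:

```yaml
# values.yaml fragment: allow roughly 15 + 60 x 5 = 315 seconds for startup.
probes:
  startup:
    enabled: true
    initialDelaySeconds: 15
    periodSeconds: 5
    failureThreshold: 60
```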

Resources and Security

ArchiveBox uses Chromium internally to render pages, and Chromium needs at least 2 GiB of RAM to run reliably. Without an explicit memory request and limit, the pod may be scheduled onto a node with too little free memory and be OOMKilled while archiving JavaScript-heavy pages.

Parameter Type Default Description
resources object {} CPU and memory requests and limits. Recommended: 2–4 Gi RAM.
podSecurityContext object { fsGroup: 911 } Pod-level security context.
securityContext object { runAsUser: 911, runAsGroup: 911, runAsNonRoot: true } Container-level security context.

Service Account

Parameter Type Default Description
serviceAccount.create boolean false Create a dedicated ServiceAccount.
serviceAccount.name string "" Override the ServiceAccount name.
serviceAccount.annotations object {} Annotations for the ServiceAccount.

Scheduling

Parameter Type Default Description
nodeSelector object {} Node selector for scheduling.
tolerations array [] Tolerations for scheduling.
affinity object {} Affinity rules.
topologySpreadConstraints array [] Topology spread constraints.
priorityClassName string "" PriorityClass for the pod.
terminationGracePeriodSeconds integer 30 Termination grace period.
podLabels object {} Extra labels for the pod.
podAnnotations object {} Extra annotations for the pod.

Extra

Parameter Type Default Description
extraVolumes array [] Extra volumes to attach to the pod.
extraVolumeMounts array [] Extra volume mounts for the container.
extraManifests array [] Extra Kubernetes manifests deployed alongside the chart.

Common Issues

Pod OOMKilled during archiving

Archiving JavaScript-heavy or media-rich pages triggers full Chromium rendering. Without explicit resource requests and limits, the container may be OOMKilled. Set at least memory: 2Gi in resources.requests and memory: 4Gi in resources.limits, and monitor memory usage during peak archiving.

Archiving times out on slow or complex pages

The default archivebox.timeout is 60 seconds. Pages with slow external resources or heavy JavaScript may time out before the snapshot is complete. Increase archivebox.timeout to '120' or '180' for more reliable archiving of complex pages.
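
As a sketch, the change is a single quoted value, since the chart types archivebox.timeout as a string:

```yaml
# values.yaml fragment: give each URL up to three minutes to archive.
archivebox:
  timeout: '180'   # seconds per URL archiving job
```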

More Information