Data storage and processing in AWS

Context and Problem Statement

We need to store:

  • Uploaded user imagery (raw drone images).
  • Final imagery products after processing (orthos, point clouds, 3D products).

Storage demands grow quickly over time: raw images are ~5–15 MB each, and projects commonly reach hundreds of GB or more.

We need to process:

  • Imagery --> point cloud, orthomosaic, 3D products.

This requires significant compute and memory, with large projects potentially taking many hours or days to process.

In addition, we need to support:

  • Fast uploads to improve user experience.
  • Cost-effective downloads for UI display and sharing.
  • Long-term archiving of old projects at low cost.

Note: historically, we have benefited from free AWS credits as an NGO working on humanitarian use cases, but the system should remain cost-efficient even without credits.

Considered Options

  • On-prem compute + on-prem MinIO instance (S3-compatible storage).
  • On-prem compute + Cloudflare R2 storage.
  • AWS compute + S3 storage (with CloudFront for delivery).

The key factors at play here are:

  • Actual cost of services (compute, storage, requests).
  • The cost of egress (outbound network traffic), which can easily dominate total costs at multi-TB scale.
  • Data locality: processing imagery where it is stored to avoid repeated large data transfers.
  • Operational complexity of running and maintaining on-prem infrastructure versus managed cloud services.

While on-prem compute and non-AWS storage may appear cheaper at first glance, transferring large datasets between clouds, or from cloud storage to on-prem compute, quickly incurs egress fees that outweigh any savings.
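
As a rough illustration (assuming AWS's standard internet egress rate of roughly $0.09/GB, which varies by region and volume tier): downloading a single 500 GB project in full once would cost about 500 GB × $0.09/GB ≈ $45, and repatriating a 10 TB archive to on-prem hardware would cost on the order of $900. Crucially, these costs recur every time the data crosses the cloud boundary, e.g. for reprocessing.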

For a full cost breakdown and assumptions, see: https://github.com/hotosm/drone-tm/issues/708

Decision Outcome

  • Host and process imagery entirely inside AWS networks, using AWS services.
  • Use S3 as the primary object store for both raw uploads and processed outputs.
  • Use S3 Transfer Acceleration for uploads to improve user experience (see the upload sketch after this list).
  • Use CloudFront in front of S3 for downloads and UI access to reduce egress costs and benefit from caching.
  • Use S3 lifecycle policies (e.g. Intelligent Tiering / Glacier) to automatically archive old imagery at very low cost (see the lifecycle sketch after this list).
  • Processing overspill may use on-prem volunteer hardware only in exceptional cases (e.g. very large, one-off projects), with the understanding that data transfer costs must be carefully managed.
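
A minimal sketch of the accelerated upload path, assuming boto3 and a bucket with Transfer Acceleration already enabled; the bucket name and object key are placeholders, not the project's actual values:

```python
import boto3
from botocore.config import Config

# Route requests through the s3-accelerate endpoint so uploads enter
# AWS's network at the nearest edge location rather than travelling
# over the public internet all the way to the bucket's region.
s3 = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))

# Hypothetical bucket and key, for illustration only.
url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "dronetm-imagery", "Key": "projects/123/raw/img_0001.jpg"},
    ExpiresIn=3600,  # presigned URL valid for one hour
)
# The browser or mobile client PUTs the raw image directly to this URL,
# so large payloads never pass through the API server.
```

Enabling acceleration on the bucket itself is a one-time call, e.g. `s3.put_bucket_accelerate_configuration(Bucket="dronetm-imagery", AccelerateConfiguration={"Status": "Enabled"})`.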
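
Likewise, a sketch of a lifecycle rule that archives imagery as it ages, again using boto3; the day thresholds, key prefix, and bucket name are illustrative assumptions, not the actual policy:

```python
import boto3

s3 = boto3.client("s3")

# Move objects to Intelligent-Tiering after 30 days and to Glacier
# Deep Archive after a year. Thresholds here are examples only; the
# real policy should be tuned to observed access patterns.
s3.put_bucket_lifecycle_configuration(
    Bucket="dronetm-imagery",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-projects",
                "Status": "Enabled",
                "Filter": {"Prefix": "projects/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    },
)
```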

Consequences

  • ✅ Much cheaper network traffic by avoiding cross-cloud or cloud-to-on-prem data transfer.
  • ✅ Processing happens close to the data, improving throughput and reducing overall runtime.
  • ✅ Simpler networking and architecture (single cloud boundary).
  • ✅ Ability to scale compute elastically using AWS (e.g. spot instances, burst capacity).
  • ❌ Increased reliance on AWS services.
  • ❌ Some vendor lock-in, though most components (Kubernetes, S3-compatible tooling) could be migrated if needed.