AWS Batch with EFS mount
Oct 22, 2020
I'm using AWS Batch for ML model-runs on Gnothi. I'm using huggingface/transformers, UKPLab/sentence-transformers, Gensim, spaCy, and more, which download large model artifacts (ml-tools). I could add these to the Dockerfile run by Batch via their CLI download commands (eg `python -m spacy download en`), but the Dockerfile would be huge, incurring unnecessary provisioning uptime cost on every Batch run. It's preferable to download these models once to an externally mounted file system, then re-mount & re-use them across all Batch runs. AWS's own tutorial on this was pretty lacking, so here are my steps.
- Create two security groups (SG).
- One for the Batch compute environment. Mine is called "ml_jobs". Outbound=* (0.0.0.0/0); inbound, you'll likely want SSH (0.0.0.0/0).
- One for EFS. Outbound=*. Mine's called "EFS".
- Modify each SG to allow Inbound=* from the other's SG. You only actually need NFS (and maybe SSH?), but hey, they're only talking to each other; you're safe.
Discussion here.
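If you'd rather script this step than click through the Console, the two SGs can be sketched with the AWS CLI roughly like so (the VPC ID is a placeholder; double-check flags against your CLI version):

```shell
# Sketch only: the VPC ID is a placeholder for your own.
VPC_ID=vpc-0123456789abcdef0

ML_SG=$(aws ec2 create-security-group --group-name ml_jobs \
  --description "Batch compute environment" --vpc-id "$VPC_ID" \
  --query GroupId --output text)
EFS_SG=$(aws ec2 create-security-group --group-name EFS \
  --description "EFS mount targets" --vpc-id "$VPC_ID" \
  --query GroupId --output text)

# Optional: SSH into the Batch instances from anywhere
aws ec2 authorize-security-group-ingress --group-id "$ML_SG" \
  --protocol tcp --port 22 --cidr 0.0.0.0/0

# Let the two groups talk to each other. NFS is port 2049; to match
# the post's looser Inbound=* you'd use --protocol -1 instead.
aws ec2 authorize-security-group-ingress --group-id "$EFS_SG" \
  --protocol tcp --port 2049 --source-group "$ML_SG"
aws ec2 authorize-security-group-ingress --group-id "$ML_SG" \
  --protocol tcp --port 2049 --source-group "$EFS_SG"
```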
- Create an EFS file system (link)
- At the security-group step, x out the default SG it suggests for each subnet, and replace it with the EFS SG you created in (1).
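The same step as a CLI sketch (creation token and subnet ID are placeholders; $EFS_SG is the "EFS" security group from step 1):

```shell
# Create the file system, then one mount target per subnet
# your Batch instances can land in.
FS_ID=$(aws efs create-file-system --creation-token gnothi-models \
  --performance-mode generalPurpose --query FileSystemId --output text)

aws efs create-mount-target --file-system-id "$FS_ID" \
  --subnet-id subnet-0123456789abcdef0 \
  --security-groups "$EFS_SG"
```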
- Create a launch template. Link here, this is the tutorial provided by AWS (the only info I found in this adventure).
- Per the rest of this post, I did mine in the Console (not API), so just copy/paste that big text-blob from the link into (bottom of launch-template page) > Advanced details > User data (replacing the file_system_id_01).
- Add Storage (volumes) to reflect that tutorial. That is, Volume=EBS, Delete on termination=Yes, Device name=/dev/xvda, Volume type=gp2.
- Make sure everything else on this page, including sub-fields of Storage, is set to "Don't include in launch template".
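For reference, the blob you're pasting has roughly this shape (a paraphrase from memory of the AWS tutorial; use the exact version from the link, with file_system_id_01 swapped for your EFS ID):

```
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="==MYBOUNDARY=="

--==MYBOUNDARY==
Content-Type: text/cloud-config; charset="us-ascii"

packages:
- amazon-efs-utils

runcmd:
- mkdir -p /mnt/efs
- echo "file_system_id_01:/ /mnt/efs efs tls,_netdev" >> /etc/fstab
- mount -a -t efs defaults

--==MYBOUNDARY==--
```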
- Set up Batch. Create a compute environment.
- Add an EC2 keypair if you want to SSH in. You likely will, since you'll want to `scp` in files.
- Probably select managed/spot, 100% maximum price. That's my setup; up to you.
- p2 family or greater (I'm using p2.xlarge). Don't use the g family; the Nvidia drivers aren't compatible! Make sure you remove "Optimal".
- In one of the advanced bits, select your launch template from (3), version=$Latest
- VPC ID = the same VPC you used for your EFS, select all subnets
- Specify the SG from (1.1).
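If you ever want to reproduce this compute environment outside the Console, it maps to a create-compute-environment call roughly like this (every ID, ARN, role name, and vCPU count below is a placeholder for your own values):

```shell
# Sketch of the compute environment above via CLI.
aws batch create-compute-environment \
  --compute-environment-name ml_jobs \
  --type MANAGED \
  --service-role arn:aws:iam::123456789012:role/AWSBatchServiceRole \
  --compute-resources '{
    "type": "SPOT",
    "bidPercentage": 100,
    "instanceTypes": ["p2.xlarge"],
    "minvCpus": 0,
    "maxvCpus": 16,
    "subnets": ["subnet-0123456789abcdef0"],
    "securityGroupIds": ["sg-0123456789abcdef0"],
    "instanceRole": "ecsInstanceRole",
    "ec2KeyPair": "my-keypair",
    "spotIamFleetRole": "arn:aws:iam::123456789012:role/AmazonEC2SpotFleetRole",
    "launchTemplate": {"launchTemplateId": "lt-0123456789abcdef0", "version": "$Latest"}
  }'
```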
- Job definition
- Volumes. Name=efs, Source path=/mnt/efs
- Mount points. Source volume=efs, Container path=/storage (or wherever your container expects the mount)
- Number of GPUs=1
- Privileged=true
Details on the above. Volumes specifies the name you'll use to refer to this mount path on the host, and Mount points references that name to place it into the container at "Container path". Number of GPUs requests a GPU, and an Nvidia driver from the host. Privileged is required for this setup, as /mnt/efs will be chown root.
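Put together, the job-definition settings above correspond to a register-job-definition call roughly like this (a sketch; the image, vcpus, and memory are placeholders):

```shell
aws batch register-job-definition \
  --job-definition-name ml-run \
  --type container \
  --container-properties '{
    "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/ml-tools:latest",
    "vcpus": 4,
    "memory": 16000,
    "privileged": true,
    "resourceRequirements": [{"type": "GPU", "value": "1"}],
    "volumes": [{"name": "efs", "host": {"sourcePath": "/mnt/efs"}}],
    "mountPoints": [{"sourceVolume": "efs", "containerPath": "/storage"}]
  }'
```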