1. Initial set up
First of all, please remember that this tutorial must be run using an interactive session in ShARC. For that, you should log in into ShARC with ssh
, and then request an interactive session with qrsh
. Your shell prompt should show sharc-nodeXXX
(XXX being a number between 001 and 172) and not @sharc-login1
nor @sharc-login2
.
For this particular practical, we are going to create and work on a directory called varcal
in your /fastdata/$USER directory:
cd /fastdata/$USER
mkdir varcal
cd varcal
Remember you can check your current working directory anytime with the command pwd
.
It should show something like:
pwd
´/fastdata/myuser/varcal
´
Programmes
There are many different tools for SNP calling, we are going to use three of the most popular: bcftools, GATK, and, optionally, ANGSD. Additionally, we will use bcftools and awk for manipulating VCF/BCF files. Additional exercises will use R, NGSadmix (part of ANGSD), and gemma. All of them are already installed as part of the Genomics Software Repository.
Data
We need to copy alignments in BAM format. We are not going to use the same alignments used in the previous sessions, because we need them to contain certain specific information to run some of the exercises for this practical. You can copy them with rsync, which allows resuming in case something goes wrong and the transfer is interrupted:
rsync -avPh /fastdata/SNPgenocall/alignments ./
For speed the files have been made temporarily available in the /fastdata
directory where you copied them from above. However, if you are returning to this tutorial after the day of the course you should be able to copy the files from here:
rsync -avPh /usr/local/extras/Genomics/workshops/NGS_AdvSta_2020/SNPgenocall/alignments ./
It is important to have all the BAM files indexed. To speed things up, we can use an SGE array job to index files in parallel:
#!/bin/bash
#$ -l rmem=2g
#$ -l h_rt=1:00:00
#$ -t 1-32
#$ -j y
#$ -o indexbam.log
#$ -N indexbam
# Load genomics software repository
# (safer to load it here in case the worker nodes don't inherit the environment)
source /usr/local/extras/Genomics/.bashrc
I=$(($SGE_TASK_ID-1))
BAMS=($(ls alignments/*.bam))
samtools index ${BAMS[$I]}
When you have finished editing the bash script, save it as indexbam.sh
, make it executable with chmod
and submit it to the job queue with qsub
:
chmod +x indexbam.sh
qsub indexbam.sh
Indexing should be finished in a few minutes and the files should look like this:
ls -lh alignments | head -n 9
total 59G
-rwx------ 1 myuser cs 1.9G Feb 14 14:46 60A.bam
-rw-r--r-- 1 myuser cs 486K Feb 15 03:18 60A.bam.bai
-rwx------ 1 myuser cs 1.8G Feb 14 14:44 60I.bam
-rw-r--r-- 1 myuser cs 471K Feb 15 03:18 60I.bam.bai
-rwx------ 1 myuser cs 2.1G Feb 14 14:39 61A.bam
-rw-r--r-- 1 myuser cs 590K Feb 15 03:18 61A.bam.bai
-rwx------ 1 myuser cs 1.9G Feb 14 14:41 61I.bam
-rw-r--r-- 1 myuser cs 505K Feb 15 03:18 61I.bam.bai
We also need to download the reference genome, you can do it using wget (remember that you can also read the manual on the command line with man wget
):
mkdir genome
wget http://download.lepbase.org/v4/sequence/Heliconius_melpomene_melpomene_Hmel2_-_scaffolds.fa.gz -O genome/Hmel2.fa.gz