Helen Hipperson, Nicola Nadeau & Alison Wright
Department of Animal and Plant Sciences, University of Sheffield
This practical is part of the module Advanced Data Analysis - Introduction to NGS data analysis. The aim is to learn how to call single nucleotide polymorphisms (SNPs) and genotypes, that is, to identify variable sites and determine the genotype of each individual at each site. We will use a dataset of whole-genome sequence data from 32 individuals of Heliconius melpomene. After calling SNPs, we will subset and filter the variants and carry out a few example analyses.
Originally written by Victor Soria-Carrasco
Table of contents
- Initial set up
- SNP and genotype calling with BCFtools
- VCF and BCF format
- SNP and genotype calling with GATK
- Operations with BCF files
- Practical application: Population structure with NGSadmix
- Practical application: PCA of genotypes with R
Extras:
- SNP and genotype calling with ANGSD
- Genetic architecture of traits with GEMMA
- Delimitation of contiguous regions of differentiation using Hidden Markov Models
Resources
- samtools manual
- bcftools manual
- bcftools howto
- GATK User Guide
- ANGSD manual
- SAM/BAM format specification
- VCF/BCF format specification
- VCF/BCF format visual explanation - Highly recommended
References
- Li 2011 - bcftools mpileup implementation
- Danecek et al. 2014 - bcftools multiallelic caller
- Van der Auwera et al. 2013 - GATK Best Practices for Variant Discovery for beginners. Note that some procedures may be outdated; check the current documentation.
- Novembre et al. 2008 - Example of using PCA of genotypes to investigate population structure
- Korneliussen et al. 2014 - ANGSD publication
- Bhatia et al. 2013 - Excellent paper about FST estimation and interpretation.