VCFtools is an essential command-line suite used by geneticists to filter, summarize, and manipulate genomic variant data in Variant Call Format (VCF) files. Mastering its syntax drastically speeds up quality control and downstream population genetic analyses.
Below are the top 10 VCFtools commands every genetic researcher should know, organized by utility.
1. Basic Quality Filtering (--minQ)
Eliminates low-confidence variants by filtering out sites below a specific Phred-scaled quality threshold.
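For intuition about the threshold, a Phred-scaled score Q corresponds to an error probability of 10^(-Q/10). The snippet below is a minimal sketch of that conversion; the helper name phred_to_error is hypothetical and not part of VCFtools.

```shell
# Hypothetical helper (not part of VCFtools): convert a Phred-scaled
# quality score Q into the implied error probability 10^(-Q/10).
phred_to_error() {
  awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

# Show the error probability implied by a few common thresholds.
for q in 10 20 30 40; do
  printf 'Q%s -> error probability %s\n' "$q" "$(phred_to_error "$q")"
done
```

A score of 30 therefore implies roughly a 1 in 1,000 chance that the call is wrong.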
vcftools --vcf input.vcf --minQ 30 --recode --out filtered_q30
--minQ 30: Keeps only sites with a QUAL score of 30 or higher, which corresponds to an estimated 99.9% probability that the variant call is correct.
--recode: Tells VCFtools to write a new VCF file containing the filtered results (here, filtered_q30.recode.vcf).
--out: Specifies the output file prefix.
2. Minor Allele Frequency Filtering (--maf)
Filters out rare variants or sequencing errors by retaining sites where the minor allele appears above a designated frequency.
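To make the quantity being filtered concrete, here is a minimal sketch of how a minor allele frequency can be computed for a biallelic site from counts of called alleles. The helper minor_allele_freq is hypothetical and only illustrates the arithmetic; it is not how VCFtools is invoked.

```shell
# Hypothetical helper (not part of VCFtools): minor allele frequency of a
# biallelic site, given counts of called reference and alternate alleles.
minor_allele_freq() {
  awk -v ref="$1" -v alt="$2" 'BEGIN {
    ref += 0; alt += 0                      # force numeric comparison
    total = ref + alt
    if (total == 0) { print "NA"; exit }    # no called alleles at this site
    minor = (ref < alt) ? ref : alt
    printf "%.4f\n", minor / total
  }'
}

# 100 diploid samples = 200 called alleles.
minor_allele_freq 192 8     # prints 0.0400, below a 0.05 cutoff
minor_allele_freq 30 170    # prints 0.1500, the reference allele is the minor one
```

Note that the denominator is the number of called alleles, so missing genotypes shrink it.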
vcftools --vcf input.vcf --maf 0.05 --recode --out common_variants
--maf 0.05: Keeps a site only if its minor allele accounts for at least 5% of the called alleles across the samples.
3. Handling Missing Data (--max-missing)
Controls dataset completeness by filtering out variant positions that have too many uncalled genotypes among samples.
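As a rough illustration of what completeness means for a single site, the sketch below computes the fraction of samples with a called genotype. The helper site_call_rate is hypothetical, the example records are made up, and the definition of "missing" is simplified, so its numbers may not match VCFtools exactly in edge cases.

```shell
# Hypothetical helper (not part of VCFtools): print CHROM, POS, and the
# fraction of samples with a called genotype for each site of a VCF on stdin.
# Simplification: a genotype counts as missing only if GT is ".", "./." or ".|.".
site_call_rate() {
  awk -F'\t' '
    /^#/ { next }                        # skip header lines
    {
      called = 0
      n = NF - 9                         # sample columns start at field 10
      for (i = 10; i <= NF; i++) {
        split($i, parts, ":")            # GT is the first sub-field
        gt = parts[1]
        if (gt != "." && gt != "./." && gt != ".|.") called++
      }
      if (n > 0) printf "%s\t%s\t%.2f\n", $1, $2, called / n
    }'
}

# Made-up example with 4 samples: the first site is fully called (1.00),
# the second has one missing genotype (0.75).
{
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\tS4\n'
  printf '20\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/0\t0/1\t1/1\t0/0\n'
  printf '20\t200\t.\tC\tT\t50\tPASS\t.\tGT\t0/0\t./.\t0/1\t0/0\n'
} | site_call_rate
```

With a 0.9 threshold, the second site in this toy example would be dropped because only 75% of samples are called.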
vcftools --vcf input.vcf --max-missing 0.9 --recode --out complete_sites
--max-missing 0.9: Counterintuitively, the value is the minimum fraction of samples that must have a called genotype, not the maximum fraction allowed to be missing. It ranges from 0 (sites may be completely missing) to 1 (no missing data allowed), so 0.9 retains a site only if it is called in at least 90% of the samples.
4. Extracting Genomic Regions (--chr, --from-bp, --to-bp)
Subsets data to a specific chromosome or precise chromosomal window, which is ideal for target gene or locus-specific studies.
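Before subsetting, it can help to check how chromosomes are named in the file, since a value such as 20 is unlikely to match records labelled chr20. The sketch below uses a hypothetical helper, list_chroms, built from standard Unix tools rather than VCFtools itself.

```shell
# Hypothetical helper: list the distinct chromosome names found in the
# CHROM column (first column) of an uncompressed VCF file.
list_chroms() {
  grep -v '^#' "$1" | cut -f1 | sort -u
}

# Usage, assuming input.vcf exists in the current directory.
if [ -f input.vcf ]; then
  list_chroms input.vcf
fi
```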
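vcftools --vcf input.vcf --chr 20 --from-bp 1000000 --to-bp 2000000 --recode --out region_chr20
--chr 20: Targets chromosome 20. The value must match the chromosome name as written in the VCF (some references use chr20 instead).
--from-bp / --to-bp: Set the lower and upper genomic coordinates of the window. Both bounds are inclusive, and these options only work together with --chr.
5. Filtering Samples/Individuals (--keep or --remove)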
Subsets a massive, multi-sample VCF file to include or exclude specific cohorts based on a text file list.
vcftools --vcf input.vcf --keep sample_list.txt --recode --out cohort_subset
--keep: Takes a text file (sample_list.txt) containing one sample ID per line and discards all unlisted samples. Use --remove instead to exclude the listed samples and keep everyone else.
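The sample IDs in the list must match those in the VCF header. One way to see them, sketched here with a hypothetical helper named list_samples, is to read the columns that follow FORMAT on the #CHROM header line. The CASE_ prefix and file names below are placeholders.

```shell
# Hypothetical helper (not part of VCFtools): print the sample IDs of an
# uncompressed VCF, one per line, from the columns after FORMAT on the
# #CHROM header line.
list_samples() {
  awk -F'\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }' "$1"
}

# Example: build a keep-list of samples whose IDs start with "CASE_".
if [ -f input.vcf ]; then
  list_samples input.vcf | grep '^CASE_' > sample_list.txt
fi
```

The resulting file has one ID per line, which is the layout described above.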
6. Isolating Variant Types (--remove-indels or --keep-only-indels)
Separates Single Nucleotide Polymorphisms (SNPs) from small Insertions and Deletions (Indels) to clean up analysis workflows.
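The command for this step is not shown in this excerpt, so the following is only a sketch that follows the pattern of the earlier commands. It assumes vcftools is on the PATH; the wrapper name split_snps_indels and the output prefixes snps_only and indels_only are placeholders.

```shell
# Hypothetical wrapper; assumes vcftools is on PATH.
# Output prefixes (snps_only, indels_only) are placeholders.
split_snps_indels() {
  vcf="$1"
  # SNPs only: drop every site that contains an indel.
  vcftools --vcf "$vcf" --remove-indels --recode --out snps_only
  # Indels only: drop every site that does not contain an indel.
  vcftools --vcf "$vcf" --keep-only-indels --recode --out indels_only
}

# Usage (uncomment once input.vcf is available):
# split_snps_indels input.vcf
```

As with the earlier examples, --recode should write the results to snps_only.recode.vcf and indels_only.recode.vcf.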