SNPsift:Case-Control representation, an index to simplify the complex affair

Hello genomics folks,

Exploring Case-Control Studies with SnpSift: A Deep Dive into Genetic Associations:

If you're diving into the world of genetic associations, particularly case-control studies, you might find SnpSift to be a game changer. This powerful tool integrates seamlessly into the SnpEff pipeline, offering an accessible way to analyze the relationship between genotypes and binary outcomes—like whether an individual is a case or a control for disease or any phenotypic observation.

Understanding the CC_TREND Model:

At the heart of this analysis is the Cochran-Armitage Trend Model (CC_TREND). This model elegantly summarizes how different genotypes relate to cases versus controls. It does this by assigning weights:

0 for homozygous reference (A/A)
1 for heterozygous (a/A)
2 for homozygous non-reference (a/a)

These weights reflect the increasing risk associated with non-reference alleles, providing a clearer picture of how these variants might contribute to disease.

Setting Your Cut-offs for Candidate Genes:

To identify candidate genes effectively, you'll want to apply two essential cut-offs during your analysis:

p-value < 0.05: This threshold suggests that as the number of non-reference alleles increases, so does the likelihood of being a case. In other words, there's a statistically significant association at play.
Genotype Weightage: This arbitrary cut-off highlights the presumed increased risk linked with non-reference alleles (three comma-separated values: homozygous variant, heterozygous variant, total variants).

Once you've filtered your data based on these criteria, you can streamline your analysis.

Visualizing Your Results:

Now, let's talk about visualization—specifically, the Manhattan plot. This is a fantastic way to represent the results of your CC_TREND analysis. However, plotting genotypes (three comma-separated values) can be a bit tricky, especially when chromosomes are on the X-axis and p-values are on the Y-axis.

To enhance your visualization, consider creating a variant-index from the output values:

X: Homozygous variants
Y: Heterozygous variants
Z: Total number of variants

One such variant_index derivation is presented here:

variant_index=0.1-((X+Y)*Z/1000)

NOTE: Lower the index, higher the case or control genotype weightage.

By using this single index, you can represent the three comma-separated values as geom_point in ggplot with a color gradient. This approach will help illustrate whether cases or controls have a higher prevalence of homozygous variants, adding another layer of insight to your analysis.

Conclusion:

In summary, SnpSift offers a robust platform for conducting case-control studies, allowing researchers to delve deep into genetic associations. By applying thoughtful cut-offs and employing effective visualization techniques, you can uncover valuable insights into the genetic underpinnings of disease. Happy analyzing!

The sample data and the R script for the Manhattan plot here with:

# Load necessary libraries

library(readr)

library(ggplot2)

# Read the data from a table

data <- read_table("CHROM POS CC_TREND Variant_index Gene

Chr_1 102770243 0.008509 0.888 a1

Chr_1 90037554 0.01117 0.935 a2

Chr_1 125775496 0.03389 0.935 a3

Chr_1 125909874 0.04598 0.935 a4

Chr_1 79639522 0.03389 0.935 a5

Chr_1 12685606 0.04986 0.93 a6

Chr_1 12685678 0.04986 0.93 a7

Chr_1 157299035 0.03389 0.935 a8

Chr_1 135449418 0.04684 0.952 a9

Chr_1 20220906 0.02334 0.888 a10

Chr_1 20220826 0.04598 0.91 a11

Chr_1 20220870 0.04598 0.91 a12

Chr_1 20230638 0.04598 0.91 a13

Chr_1 31754571 0.02334 0.888 a14

Chr_1 54522367 0.00604 0.91 a15

Chr_1 49401944 0.01141 0.93 a16

Chr_1 18984107 0.01563 0.935 a17

Chr_1 57498666 0.02395 0.935 a18

Chr_1 49401685 0.03995 0.935 a19

Chr_1 7262872 0.03389 0.888 a20

Chr_1 114788183 0.01527 0.888 a21

Chr_1 9032685 0.0455 0.91 a22

Chr_1 95477399 0.01141 0.93 a23

Chr_1 95480555 0.01497 0.93 a24

Chr_1 71931108 0.02395 0.935 a25

Chr_1 22049874 0.04771 0.952 a26

Chr_1 70328001 0.01497 0.888 a27

Chr_1 68815909 0.0181 0.93 a28

Chr_1 98451861 0.01842 0.93 a29

Chr_1 105687254 0.008041 0.935 a30

Chr_1 13497314 0.04684 0.952 a31

Chr_1 67988379 0.01431 0.888 a32

Chr_1 8668908 0.02334 0.888 a33

Chr_1 8671839 0.008509 0.888 a34

Chr_1 8672282 0.008509 0.888 a35

Chr_1 20729241 0.01091 0.935 a36

Chr_1 13578541 0.03389 0.935 a37

Chr_1 13635482 0.03389 0.935 a38

Chr_1 58268962 0.04499 0.916 a39

Chr_1 11557849 0.02334 0.888 a40

Chr_1 13361983 0.01497 0.93 a41

Chr_1 57651313 0.0181 0.91 a42

Chr_1 50589808 0.01968 0.935 a43

Chr_1 31028255 0.03479 0.888 a44

Chr_1 7328672 0.04598 0.91 a45

Chr_1 42124982 0.04598 0.91 a46

Chr_1 42126061 0.005479 0.935 a47

Chr_1 54382541 0.0455 0.944 a48

Chr_1 54327533 0.02092 0.916 a49")

# Transform data

mid <- mean(data$Variant_index)

data$CC_TREND <- -log10(data$CC_TREND)

# Create the plot

chr1 <- ggplot(data, aes(x = POS, y = CC_TREND)) +

geom_point(aes(color = Variant_index)) +

scale_color_gradient2(midpoint = mid, low = "blue", mid = "green", high = "red", space = "Lab") +

geom_hline(yintercept = 1.3010, linetype = "dashed", color = "black", size = 0.5) +

xlab("Position on Chromosome 1") +

annotate("text", fontface = 2, x = 75000000, y = 5.0, label = paste("n= ", nrow(data)), size = 2) +

geom_text(aes(label = ifelse(Variant_index < 0.910, as.character(Gene), '')), hjust = 0, vjust = 0, size = 2, fontface = "italic")

# Display the plot

print(chr1)

The Manhattan Plot: The CC_TREND p-values are represented using a color gradient in the geom_point. Genes with a variant_index less than 0.910, indicating the maximum homozygosity and minimum heterozygosity for the case/control, are labeled on the plot based on this calculation

HAPPY ANALYZING!

References:

1. Clarke GM, Anderson CA, Pettersson FH, Cardon LR, Morris AP, Zondervan KT. Basic statistical analysis in genetic case-control studies. Nat Protoc. 2011;6(2):121-133. doi:10.1038/nprot.2010.182

2. https://pcingola.github.io/SnpEff/snpsift/casecontrol/

Search This Blog

Bioinformatics for everyone