Introduction


Genome-wide association studies (GWASs) have been used to analyse the
genetic architecture of common diseases and quantitative traits (Visscher et
al., 2012). These studies assess common variants, that is, variants with a minor
allele frequency (MAF) >5% in the human genome. They have been completed for most
common diseases and numerous associated traits, and they have uncovered more than
two thousand common variants related to disease (NHGRI, 2015). However, these
common variants have very small effect sizes and contribute only modestly to
predicting disease risk or quantitative traits. For example, a large meta-analysis
of GWASs of type 2 diabetes (T2D) in more than 10,128 people identified more than 18
SNPs associated with the disease, but these sites explain only 6% of the heritability
of T2D and do not explain its causal biology (Zeggini et al., 2008). Similarly,
in Crohn's disease, a GWAS meta-analysis in more than 210,000 people
identified 70 loci associated with the disease, but these explain only 23% of
the increased disease risk between relatives (Franke et al., 2010). In general,
most of the common variants identified through GWASs
have shed little light on the causal biology of the disease or trait. This problem
is referred to as missing heritability. Low-frequency and rare variants might
account for a portion of the missing heritability. Thus, it is reasonable that analyses of
low-frequency variants (0.5% ≤ MAF < 5%) and rare variants (MAF < 0.5%) could
help to explain disease risk or quantitative traits (Lee et al., 2014). Advances
in sequencing technologies allow in-depth examination of the genetic
contribution of rare variants to complex traits.

This essay looks into the challenges of studying rare variants and the
sequencing approaches and statistical methods that can be used for rare variant
association testing. It also mentions some current studies that have discovered
rare variants.

Rare variants

Theory states that purifying selection keeps rare variants with strong effects
at low frequency in the population. Highly penetrant rare variants play an
essential role in many Mendelian disorders and rare forms of complex diseases.
Genotyping arrays have ignored this part of the allele frequency spectrum
because there were no systematic catalogues of rare variants to support array
design. Because searching for rare variants would have needed multiple assays
that the arrays did not support, it was reasonable to focus first on common
variants. However, accelerating advances in sequencing technologies help to
locate and identify low-frequency and rare variants and then to investigate
their effects on complex traits.

Next-generation sequencing (NGS) technologies can generate a substantial amount
of sequence data in a relatively short time at a reasonable cost, and they have
revolutionized genome research in recent years. NGS produces billions of short
reads; these reads are aligned to a reference genome so that researchers can
identify and genotype the sites where sequenced people differ. The cost of
sequencing has fallen, allowing exome and whole-genome sequencing studies of
common diseases. Examples of exome sequencing studies include the NHLBI Exome
Sequencing Project, the UK10K project, and the T2D-GENES project.
These exome sequencing projects and others have contributed more than 60 million
genetic variants to dbSNP, most of which are rare (Lee et al., 2014).

However, detecting low-frequency and rare variants in common diseases presents
substantial challenges, despite the unique chance that sequencing provides to
investigate their functions. Deep whole-genome sequencing (WGS) requires a large
number of individuals, and this is currently expensive. Because of this
limitation, more efficient alternatives have been proposed, including low-depth
WGS, exome sequencing, targeted sequencing, and custom arrays (Lee et al., 2014).
For example, researchers have used genotyping arrays, such as the Affymetrix and
Illumina exome chips, to examine protein-coding variants that have been
identified previously across the allele frequency spectrum.

Moreover, classical single-variant tests are underpowered for low-frequency and
rare variants unless sample sizes are very large. To solve this problem,
researchers have developed statistical approaches designed specifically for rare
variant analysis. These approaches assess association for several variants
jointly in a target region, such as a gene, instead of examining the effect of
each variant individually.
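
The frequency classes used in this essay, and the reason single-variant tests
struggle with rare variants, can be illustrated with a minimal sketch. The
function names are hypothetical, and assigning a MAF of exactly 5% to the common
class is an assumption, because the thresholds quoted above leave that boundary
open.

```python
def classify_variant(maf):
    """Assign a frequency class using the thresholds quoted in the essay.

    Assumption: a MAF of exactly 5% is treated as common.
    """
    if not 0.0 <= maf <= 0.5:
        raise ValueError("MAF must lie between 0 and 0.5")
    if maf >= 0.05:
        return "common"
    if maf >= 0.005:
        return "low-frequency"
    return "rare"


def expected_minor_alleles(maf, n_individuals):
    """Expected number of minor allele copies among 2N chromosomes."""
    return 2 * n_individuals * maf


# Illustrative values only: a rare variant is carried by very few people
# in a sample of 1,000, so a test of that variant alone has little data.
for maf in (0.20, 0.01, 0.001):
    print(maf, classify_variant(maf), expected_minor_alleles(maf, 1000))
```

With only a handful of expected minor alleles in the sample, a test of a single
rare variant has very few observations to work with, which is the motivation for
the region-based tests discussed later.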




Arrays and sequencing platforms for rare variant analysis

Sequencing studies require multiple data processing and analysis steps.
These include rigorous planning of platform and sample selection,
quality control (QC), the choice of statistical tests, the choice of which
variants to test for association, and prioritization for replication.

Deep WGS of a large number of
individuals gives much information for association studies of complex traits and
diseases. For instance, sequencing one individual at 30x read depth, which
means sequencing each base redundantly with an average of 30 reads to
differentiate sequencing errors from true polymorphisms, results in more than
99% genotyping accuracy (Bentley et al., 2008). However, WGS
has not been widely used in practice because of its high cost. Therefore, several
alternative sequencing strategies have been suggested and used to reduce
the cost.


Low-depth WGS has been used to sequence a large number of individuals at low cost.
It is possible to sequence 7 or 8 individuals at 4x read depth (each position
covered by an average of 4 reads) for the same cost as
sequencing one individual at 30x read depth using deep WGS (Lee et al., 2014).
Low-depth WGS is useful for discovering and genotyping shared
variants, as the 1000 Genomes Project indicates (McVean et
al., 2012).

Low-depth WGS relies on linkage disequilibrium-based methods that combine
information across individuals to improve variant
detection and genotype estimation. However, low-depth sequencing has higher
genotyping error rates than deep sequencing. Early studies indicated
that low-depth WGS of larger sample sizes can be more powerful than deep WGS
of smaller sample sizes, for both variant detection and follow-up disease
association studies.
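
The trade-off between depth and sample size can be made concrete with a rough
sketch. It assumes that cost is proportional to the total number of bases
sequenced, that every site is covered by exactly the stated number of error-free
reads, and that each read samples either allele of a heterozygous site with
probability 1/2. Real data violate all three assumptions, so the numbers are
only indicative.

```python
def samples_for_budget(deep_depth, low_depth):
    """Number of low-depth genomes that cost as much as one deep genome,
    assuming cost is proportional to the total bases sequenced."""
    return deep_depth / low_depth


def prob_het_missed(depth):
    """Chance that all reads at a heterozygous site show the same allele,
    so the site looks homozygous (simplified model, see assumptions above)."""
    if depth < 1:
        return 1.0
    return 2 * 0.5 ** depth


print(samples_for_budget(30, 4))  # 7.5, i.e. "7 or 8 individuals"
print(prob_het_missed(4))         # 0.125
print(prob_het_missed(30))        # negligible at deep coverage
```

Under these assumptions, one in eight heterozygous sites would be miscalled from
the reads alone at 4x, which is why low-depth designs borrow information from
other individuals through linkage disequilibrium.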


Another widely used strategy is exome
sequencing. It is used to capture and sequence the protein-coding regions,
which make up 1%-2% of the genome (Bamshad et al., 2011). Exome
sequencing has been used to identify many rare variants associated with
Mendelian disorders. It is also effective at detecting unidentified variants
that might be present in complex diseases or in familial conditions with many
affected individuals. The first successful application was reported by Ng et
al. (2010). They studied four patients of European ancestry from three different
families who had Miller syndrome of unknown cause. Miller
syndrome is an extremely rare Mendelian disorder characterized by many features,
including cleft lip, hypoplasia, and micrognathia. They captured and
sequenced the protein-coding regions at 40x read depth and then used the HapMap and
dbSNP databases as filters to eliminate common variants. They detected
DHODH variants in each of the four patients, missense mutations predicted to be
deleterious, and used Sanger
sequencing to validate their findings. The candidate gene DHODH encodes an
enzyme in the de novo pyrimidine biosynthesis pathway. Many other causal variants
for other Mendelian disorders have been identified, for example in Kabuki syndrome (MIM
147920).
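
The filtering idea described above (discard variants already catalogued as
common, then look for genes hit in every patient) can be sketched with set
operations. This is an illustration of the general strategy, not the pipeline
used by Ng et al.; the function name, variant tuples and gene labels are all
hypothetical.

```python
def candidate_genes(patient_variants, known_variants, gene_of):
    """Return genes that carry at least one novel variant in every patient.

    patient_variants: list of sets of variants, one set per patient
    known_variants:   set of variants already present in public catalogues
    gene_of:          dict mapping each variant to a gene label
    """
    if not patient_variants:
        return set()
    genes_per_patient = []
    for variants in patient_variants:
        novel = variants - known_variants          # drop catalogued variants
        genes_per_patient.append({gene_of[v] for v in novel})
    return set.intersection(*genes_per_patient)   # genes shared by all patients


# Hypothetical toy data: (chromosome, position, ref, alt) tuples.
v1, v2, v3, v4 = ("1", 100, "A", "G"), ("2", 200, "C", "T"), \
                 ("2", 250, "G", "A"), ("3", 300, "T", "C")
gene_of = {v1: "GENE_A", v2: "GENE_B", v3: "GENE_B", v4: "GENE_C"}
patients = [{v1, v2}, {v1, v3, v4}]
print(candidate_genes(patients, {v1}, gene_of))
```

Matching at the gene level rather than the variant level matters here, because
unrelated patients may carry different damaging variants in the same gene.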


Currently, several empirical studies use exome sequencing in an attempt to
detect genes and variants that are related to complex diseases. The NHLBI ESP has
sequenced the exomes of approximately 6,500 people to study phenotypes
related to heart attack, blood pressure, stroke, blood lipid levels, chronic
obstructive pulmonary disease and obesity (Fu et al.,
2012, Tennessen et al., 2012). The T2D-GENES Consortium aims to
identify the genetic variants related to T2D and metabolic phenotypes, and it
has sequenced the exomes of roughly 10,000 people across five ancestry groups.


Exome sequencing performed at high coverage, with an average depth of 60x-80x in
the targeted regions, gives more than 20x coverage in about
90% of the coding regions (Do, Kathiresan and Abecasis, 2012). Exome sequencing
also produces some off-target reads; however, these
reads are useful for assessing sequence quality and inferring population structure.

The main limitation of exome sequencing is that it covers only the genetic
variation in the exome. Non-coding regions can have a significant role in common
diseases and traits, and some findings from the ENCODE Project suggest that
non-coding regions may play an essential biological role. Overall, the low cost and the
focus on protein-coding regions make exome sequencing an important
approach for studying rare variants (Lee et al., 2014).


A recently published study by Sims et al. (2017) aimed to
reveal the genes and variants associated with Alzheimer’s disease. It was
carried out as a three-stage case-control study of over 85,000 individuals. In
the first stage, they genotyped over 16,000 late-onset Alzheimer’s patients
and over 18,000 controls using the Illumina HumanExome microarray. After quality
control of the variants, they analysed common variants using a
classic regression model in each sample group and combined the results using
METAL. They analysed low-frequency and rare variants using a score test and
combined the results using SeqMeta. In this stage they detected 43
candidate variants after removing known risk loci. In the second stage,
they tested these candidate variants for association in a separate group of over
14,000 patients and over 21,000 controls, using de novo genotyping and imputation. The
variants from stage two were then carried forward to stage three for testing in
a group of 6,652 cases and 8,345 controls, which were imputed using the Haplotype
Reference Consortium resource. From
these analyses, they uncovered four rare coding variants associated with
late-onset Alzheimer’s disease: a missense variant in PLCG2 (associated with
reduced risk of the disease), a missense change in ABI3 (which showed evidence of
raising the disease risk), and two independent variants in TREM2, one of which
was previously recognised. These genes are highly expressed in microglia, and
the analysis of protein-protein interactions indicated that they interact
with other genes associated with Alzheimer’s disease.



Genotype imputation relies
on known haplotypes in a population (a reference panel) to
predict the genotypes of SNPs that are missing from the data generated by the
genotyping arrays used for GWAS. These predicted SNPs can then be tested for association
with traits. The missing SNPs can be predicted from reference panels such as HapMap,
the 1000 Genomes Project or UK10K. Zheng et al. (2015) showed that a combined
1000 Genomes Project/UK10K reference panel made it possible to identify rare
variants related to bone mineral density. Genotype imputation thus boosts
the coverage of variation, which makes it possible to examine more SNPs than are
obtained from the original microarray. Imputation is also beneficial for meta-analysis
studies because it increases the overlap of variants among arrays.
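
The core idea of imputation, copying untyped alleles from a reference haplotype
that matches at the typed sites, can be shown with a deliberately naive sketch.
Real imputation software uses probabilistic haplotype models rather than a
single best match; the function name, the `?` marker for untyped sites and the
toy panel below are all assumptions made for illustration.

```python
def impute(target, reference_panel, missing="?"):
    """Fill untyped sites of `target` from the closest reference haplotype.

    Haplotypes are strings of alleles of equal length; `missing` marks sites
    that the array did not genotype. Ties go to the first closest haplotype.
    """
    if any(len(ref) != len(target) for ref in reference_panel):
        raise ValueError("all haplotypes must have the same length")

    def mismatches(ref):
        # Count disagreements at the sites that were actually typed.
        return sum(1 for t, r in zip(target, ref) if t != missing and t != r)

    best = min(reference_panel, key=mismatches)
    return "".join(r if t == missing else t for t, r in zip(target, best))


panel = ["ACGTA", "ACCTG", "TCGAA"]   # hypothetical reference haplotypes
print(impute("A?GT?", panel))         # ACGTA
```

The sketch also shows why the reference panel matters: a rare allele can only
be imputed if some haplotype in the panel carries it.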


Custom genotyping arrays are not ideal for capturing enough low-frequency and rare variants, but
they are a cost-effective alternative for assaying regions of
interest. Metabochip is an example of a custom genotyping array used for
cardiovascular and metabolic diseases, and Immunochip is used for inflammatory and
autoimmune diseases. These chips were designed around the high-priority variants
from sequencing studies and GWASs. They contain a selection of common variants
to replicate novel GWAS signals and a selection of low-frequency and common
variants to allow comprehensive testing of many regions linked to a
particular phenotype. Other
inexpensive arrays are the Illumina and Affymetrix exome chips (Do, Kathiresan and
Abecasis, 2012).



Association tests for low-frequency and rare variants

To analyse the low-frequency and rare variants identified in
sequencing studies, new statistical methods are needed for examining
single and multiple variants. The regression models typically used in
GWASs to test the association of genetic variants with a phenotype
are poorly suited to rare variants. For example, the Wald test is widely used for
testing common variants because of its computational speed and broad
applicability. However, the Wald test has reduced power for detecting rare
variants. To increase power, many alternative multivariate tests
have been designed. These tests collapse rare variants together,
for example across a gene, so that when several causal
variants are present there is more power to detect association. The proposed
rare variant tests fall into four
categories (Table 1). However, not all variants in a tested set affect the phenotype.


Burden tests, such as the ARIEL
test, CAST and the CMC method, collapse rare variants into a single predictor and
then compare its distribution between cases and controls. These tests are powerful
when a large fraction of the variants are causal. Each type of burden test uses a
different collapsing rule. For instance, the simplest burden test counts
the number of minor alleles across all variants in the set, creating a
score for each individual. The CAST test sets the score to 1 if at least one
rare variant is present in the region assessed and to 0 otherwise.
Madsen and Browning proposed the weighted sum statistic (WSS), which takes
the frequencies of all variants into account and does not require a fixed threshold
to separate rare from common variants, as CAST does. The limitation of burden
tests is that they make a strong assumption that all variants have the same
direction and magnitude of effect, and they lose power when this does not hold.
They assume that the
variants tested in the functional region are all causal and related to the trait.
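
The three collapsing rules mentioned above can be sketched as per-individual
scores. Genotypes are assumed to be coded as minor allele counts (0, 1 or 2),
with one row per individual and one column per variant. The weighting in
`weighted_scores` follows the general idea of down-weighting more frequent
variants using control allele frequencies; the exact formula here is an
assumption of this sketch and should be checked against the original WSS paper
before use.

```python
import math


def count_scores(genotypes):
    """Simple burden score: total minor alleles carried across the region."""
    return [sum(row) for row in genotypes]


def cast_scores(genotypes):
    """CAST-style indicator: 1 if any rare variant is present, else 0."""
    return [1 if any(g > 0 for g in row) else 0 for row in genotypes]


def weighted_scores(genotypes, is_case):
    """Frequency-weighted score in the spirit of WSS (assumed weighting)."""
    n = len(genotypes)
    controls = [row for row, case in zip(genotypes, is_case) if not case]
    weights = []
    for j in range(len(genotypes[0])):
        minor = sum(row[j] for row in controls)
        # Pseudo-counts keep the estimated frequency strictly inside (0, 1).
        q = (minor + 1) / (2 * len(controls) + 2)
        weights.append(math.sqrt(n * q * (1 - q)))
    return [sum(g / w for g, w in zip(row, weights)) for row in genotypes]


# Hypothetical toy data: 4 individuals, 3 rare variants.
genotypes = [[0, 1, 0], [0, 0, 0], [1, 0, 2], [0, 0, 0]]
is_case = [True, False, True, False]
print(count_scores(genotypes))   # [1, 0, 3, 0]
print(cast_scores(genotypes))    # [1, 0, 1, 0]
print(weighted_scores(genotypes, is_case))
```

Whichever score is chosen, it is then compared between cases and controls, or
used as a single predictor in a regression model.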


Adaptive burden tests have been developed to
address the limitations of the basic burden tests. They are
robust to the presence of null variants and allow for multiple effect
directions. For example, the data-adaptive sum test (aSum) developed by Han et
al. (2010) estimates the effect direction of each variant in a marginal model
and then performs the burden test with the estimated directions. This approach needs
permutation to estimate P-values. The limitations of these tests are that the marginal
models are unstable, even though the tests are more robust, and that permutation
requires intensive computation.


Variance-component tests, such as C-alpha, SKAT and SSU, have been
designed for the specific scenario in which both protective and risk
variants might be present within a gene or functional unit. They test the
distribution of genetic effects within a collection of variants. This method
is flexible and allows for a mixture of effects in the rare variant
collection. SKAT is the most popular test; it can take into account weightings of rare
variants, covariates, and family structure, and it was developed mainly for
quantitative traits.
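
Why a variance-component statistic tolerates mixed effect directions can be seen
in a small sketch. It computes a SKAT-style statistic, the weighted sum of
squared per-variant score statistics, for a quantitative trait with no
covariates. The published method obtains P-values analytically; the permutation
P-value below is a simplification used only to keep the example short, and the
function names and toy data are hypothetical.

```python
import random


def skat_style_q(genotypes, phenotype, weights=None):
    """Weighted sum of squared per-variant scores (no covariates)."""
    n_variants = len(genotypes[0])
    if weights is None:
        weights = [1.0] * n_variants
    mean_y = sum(phenotype) / len(phenotype)
    residuals = [y - mean_y for y in phenotype]
    q = 0.0
    for j in range(n_variants):
        score_j = sum(row[j] * r for row, r in zip(genotypes, residuals))
        q += (weights[j] * score_j) ** 2   # squaring removes the sign
    return q


def permutation_p_value(genotypes, phenotype, n_perm=1000, seed=0):
    """Share of phenotype shuffles with a statistic at least as large."""
    rng = random.Random(seed)
    observed = skat_style_q(genotypes, phenotype)
    shuffled = list(phenotype)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if skat_style_q(genotypes, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy data: variant 1 raises the trait, variant 2 lowers it.
genotypes = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 0], [0, 0], [0, 0], [0, 0]]
phenotype = [2.0, 2.0, -2.0, -2.0, 0.0, 0.0, 0.0, 0.0]
burden_signal = sum(sum(row) * y for row, y in zip(genotypes, phenotype))
print(burden_signal)                       # 0.0, opposite effects cancel
print(skat_style_q(genotypes, phenotype))  # 32.0, squared scores add up
```

In this toy example the summed burden score carries no signal because the
protective and risk variants cancel, while the squared scores still accumulate.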


Overall, these tests examine the combined effect of several rare variants as a
group, not their individual effects. Therefore, if an
association with rare variants is identified, further analyses are needed to
establish which variants in the group cause the association. Also, these tests cannot
estimate the heritability of rare variants; additional heritability analyses
using an appropriate method may be required.