Why is VNTR important?

We cross-validate our approach by leaving one sample out of the pan-genome database and evaluating the prediction accuracy on the excluded sample. For comparison, VNTR lengths were also estimated by a read depth method. For each VNTR region, the read depth, computed with samtools bedcov -j, was divided by the global read depth, computed from the nonrepetitive regions, to give the length estimate.
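The read-depth baseline above reduces to one division. As a minimal sketch, assuming the per-region value is the summed per-base depth over the VNTR interval (the function and argument names below are illustrative, not taken from the original pipeline):

```python
def read_depth_length(region_depth_sum: float, global_depth: float) -> float:
    """Estimate VNTR length in bases from read depth.

    region_depth_sum: summed per-base depth over the VNTR region
                      (assumed to be what the per-region depth tool reports).
    global_depth:     mean depth over nonrepetitive regions.
    """
    if global_depth <= 0:
        raise ValueError("global_depth must be positive")
    return region_depth_sum / global_depth


# A region accumulating 30,000 bases of coverage at 30x global depth
# is estimated to be 1,000 bases long.
estimate = read_depth_length(30_000, 30)
```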

To compare in a similar setting, danbing-tk was run with the options -gc -thcth -k 21 -cth 45 -rth 0. V_ST was calculated according to Redon et al.
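For orientation, V_ST as defined by Redon et al. is (V_T - V_S) / V_T, where V_T is the variance across all individuals and V_S is the within-population variance averaged with weights proportional to population size. The sketch below illustrates that definition; the use of population (rather than sample) variance and the input layout are assumptions.

```python
from statistics import pvariance


def vst(values_by_pop: dict[str, list[float]]) -> float:
    """V_ST = (V_T - V_S) / V_T for one locus.

    values_by_pop maps a population label to the per-individual values
    (for example, estimated VNTR lengths) observed in that population.
    """
    all_values = [v for vals in values_by_pop.values() for v in vals]
    v_total = pvariance(all_values)
    if v_total == 0:
        return 0.0  # no variation at all, so no population differentiation
    n = len(all_values)
    # Size-weighted average of the within-population variances.
    v_within = sum(len(vals) * pvariance(vals) for vals in values_by_pop.values()) / n
    return (v_total - v_within) / v_total
```

A locus whose values separate completely by population gives V_ST = 1, while populations with identical distributions give V_ST = 0.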

Among this set, the number of times each genome appeared as an outlier was used to select a set of genomes with an overabundant contribution to fragile loci. Any candidate locus with an individual that was an outlier in at least four other loci was removed from the candidate list. The loci were compared to GENCODE v34, excluding readthrough, pseudogenes, noncoding RNA, and nonsense transcripts.

We use the EAS population as the reference for measuring differential motif usage and expansion. Initially, a lasso fit is performed using the OLS function in Python statsmodels v0. The k-mer with the highest weight is denoted as the most informative k-mer (mi-kmer) for the locus.

VNTR lengths are genotyped using danbing-tk with options: -gc -thcth 50 -cth 45 -rth 0. All the k-mer counts of a locus are summed and adjusted by global read depth and ploidy to represent the approximate length of a locus.
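The length proxy described above can be sketched as follows. The exact form of the adjustment is not spelled out in the text, so dividing the summed count by the product of global read depth and ploidy is an assumption, as are the function and argument names.

```python
def approx_locus_length(kmer_counts: dict[str, int],
                        global_depth: float,
                        ploidy: int = 2) -> float:
    """Approximate per-haplotype locus length from genotyped k-mer counts.

    Assumption: the adjustment is a plain division by depth and ploidy.
    """
    if global_depth <= 0 or ploidy <= 0:
        raise ValueError("global_depth and ploidy must be positive")
    return sum(kmer_counts.values()) / (global_depth * ploidy)


# Two k-mers seen 60 times each at 30x depth in a diploid sample.
length = approx_locus_length({"ACG": 60, "CGT": 60}, global_depth=30, ploidy=2)
```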

Adjusted values are then z-score normalized as input for eQTL mapping. The downloaded expression matrices are already preprocessed such that outliers are rejected and expression counts are quantile normalized to a standard normal distribution. Confounding factors such as sex, sequencing platform, amplification method, technical variations and population structure are removed prior to eQTL mapping to avoid spurious associations.

Population structure is corrected with the top 10 principal components (PCs) from the SNP matrix of all samples. This is done by first using CrossMap v0.

The normalized expression matrices are residualized with the above covariates using the following formula:

The residualized expression values are z-score normalized as the input of eQTL mapping. Linear regression was done using statsmodels. Nominal P-values are computed by performing t-tests on the slope.
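To make the slope test concrete, the following standard-library sketch fits a simple linear regression of expression on VNTR length and returns the slope with its t statistic. The analysis above used statsmodels; this stand-in only illustrates the arithmetic, and the function name is hypothetical. The nominal P-value would then be read from a t distribution with n - 2 degrees of freedom.

```python
import math


def slope_t_statistic(x: list[float], y: list[float]) -> tuple[float, float]:
    """Fit y = a + b*x by least squares; return (b, t statistic of b)."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least three paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has no variance")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    if std_err == 0:
        return slope, math.inf  # perfect fit
    return slope, slope / std_err
```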

Adjusted P-values are computed by Bonferroni correction on nominal P-values. Specifically, the adjusted P-values of the lead VNTR for each gene are taken as input for the Benjamini-Hochberg procedure using statsmodels.

Further information on research design is available in the Nature Research Reporting Summary linked to this article. Data accession IDs are given in Supplementary Table 4. The whole-genome sequencing and expression data of GTEx samples phs The Source Data for Figs.
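The two-stage correction (Bonferroni within each gene, then Benjamini-Hochberg across the lead VNTRs) can be sketched in plain Python. This is an illustration rather than the statsmodels-based code used in the analysis, and it assumes the Bonferroni factor for a gene is the number of VNTRs tested against that gene.

```python
def bonferroni(pvals: list[float]) -> list[float]:
    """Multiply each P-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals: list[float]) -> list[float]:
    """Return BH-adjusted P-values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def lead_vntr_fdr(pvals_by_gene: dict[str, list[float]]) -> dict[str, float]:
    """Bonferroni per gene, keep the lead (smallest) value, then BH across genes."""
    genes = list(pvals_by_gene)
    lead = [min(bonferroni(pvals_by_gene[g])) for g in genes]
    return dict(zip(genes, benjamini_hochberg(lead)))
```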

Source data are provided with this paper.

Consortium, I. Initial sequencing and analysis of the human genome. Nature.
Viguera, E. Replication slippage involves DNA polymerase pausing and dissociation. EMBO J.
Gatchel, J. Diseases of unstable repeat expansion: mechanisms and common principles.
Hannan, A. Tandem repeats mediating genetic plasticity in health and disease.
Mallick, S. The Simons Genome Diversity Project: genomes from diverse populations.
Fotsing, S. The impact of short tandem repeat variation on gene expression.
Gymrek, M. Abundant contribution of short tandem repeats to gene expression variation in humans.
Bakhtiari, M. Targeted genotyping of variable number tandem repeats with adVNTR. Genome Res.
Dolzhenko, E. ExpansionHunter: a sequence-graph-based tool to analyze variation in short tandem repeat regions. Bioinformatics 35.
A global reference for human genetic variation. Nature, 68-74.
Taliun, D.
Consortium, G. Genetic effects on gene expression across human tissues.
Li, H. A synthetic-diploid benchmark for accurate variant-calling evaluation. Nat. Methods 15.
Chaisson, M. Multi-platform discovery of haplotype-resolved structural variation in human genomes.
Mousavi, N. Profiling the genome-wide landscape of tandem repeat expansions. Nucleic Acids Res.
Koren, S. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.
Chin, C. Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods 13.
Song, J. Characterization of a human-specific tandem repeat associated with bipolar disorder and schizophrenia.
Du, Z. Whole genome analyses of Chinese population and de novo assembly of a northern Han genome. Genomics Proteom.
Shi, L. Long-read sequencing and de novo assembly of a Chinese genome.
Hickey, G. Genotyping structural variants in pangenome graphs using the vg toolkit. Genome Biol.
Audano, P. Characterizing the major structural variant alleles of the human genome. Cell.
Chen, S. Paragraph: a graph-based structural variant genotyper for short-read sequence data.
Saini, S. A reference haplotype panel for genome-wide imputation of short tandem repeats.
Interpreting short tandem repeat variations in humans using mutational constraint.
Eggertsson, H. GraphTyper2 enables population-scale genotyping of structural variation using pangenome graphs.
Garrison, E. Variation graph toolkit improves read mapping by representing genetic variation in the reference.
Pevzner, P. De novo repeat classification and fragment assembly.
Jiang, Z. Ancestral reconstruction of segmental duplications reveals punctuated cores of human genome evolution.
Raphael, B. A novel method for multiple alignment of sequences with repeated and shuffled elements.
Iqbal, Z. High-throughput microbial population genomics using the Cortex variation assembler. Bioinformatics 29.
De novo assembly and genotyping of variants using colored de Bruijn graphs.
Seo, J. De novo assembly and phasing of a Korean human genome.
Zook, J. A robust benchmark for detection of germline large deletions and insertions.
Porubsky, D. Dense and accurate whole-chromosome haplotyping of individual genomes.
Kolmogorov, M. Assembly of long, error-prone reads using repeat graphs.
Benson, G. Tandem repeats finder: a program to analyze DNA sequences.
Rakocevic, G. Fast and accurate genomic analyses using genome graphs.
Rautiainen, M. Bit-parallel sequence-to-graph alignment.
Fairley, S.
Redon, R. Global variation in copy number in the human genome.
Sudmant, P. Global diversity, population stratification, and selection of human copy-number variation. Science, aab
Variable number tandem repeats mediate the expression of proximal genes.
Wellcome Trust Case Control Consortium. Association scan of 14, nonsynonymous SNPs in four diseases identifies autoimmunity variants.
Franke, A.
Ye, C. Genetic analysis of isoform usage in the human anti-viral response reveals influenza-specific regulation of transcripts under balancing selection.
Koolen, D. Clinical and molecular delineation of the 17q
Witoelar, A. Genome-wide pleiotropy between Parkinson disease and autoimmune diseases. JAMA Neurol.
Trends Genet.
Fully phased human genome assembly without parental data using single-cell strand sequencing and long reads.
LaPierre, N. Identifying causal variants by fine mapping across multiple studies.
Braida, C. Variant CCG and GGC repeats within the CTG expansion dramatically modify mutational dynamics and likely contribute toward unusual symptoms in some myotonic dystrophy type 1 patients.
Paten, B. Genome graphs and the evolution of genome inference.
The design and construction of reference pangenome graphs with minigraph.

Seabold, S. Statsmodels: econometric and statistical modeling with python. In Proc.
Lu, T.

Katherine M. Munson, Alexandra P.

Correspondence to Mark J.

Peer review information: Nature Communications thanks Sai Chen, Erik Garrison and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Peer reviewer reports are available.

Lu, TY. Profiling variable-number tandem repeat variation across populations using repeat-pangenome graphs. Nat Commun 12. Received: 24 December. Accepted: 10 June. Published: 12 July.

Any filter is characterized by three parameters:

The total genotyping time for n VNTRs is given by:

The running time is:

Note that a read can be assigned to multiple VNTR loci, or to none. As an initial step toward this task, we perform fast string matching based on a prefix tree (trie) to assign each read to the VNTR loci that share an exact match with the read. For efficient matching, we generate a separate Aho-Corasick trie [63] using every k-mer in the VNTR loci as the dictionary X.

A trie is a rooted tree where each edge is labeled with a symbol, and the concatenation of the edge symbols on the path from the root to a leaf gives a unique word (k-mer) in X.

The concatenation of the edge symbols from the root to an internal node, on the other hand, gives a unique prefix of a word in X, called the string represented by the node.

We add extra internal edges, called failure edges, that point to nodes in other branches of the trie sharing a common prefix. These allow fast transitions after a failed string match without the need for backtracking. The overall complexity of this algorithm is linear in the length of the original dictionary (the VNTRs in the database), needed to build the trie, plus the length of the queries (sequencing reads) and of the recovered matches. Hence, after construction of the trie, the running time is proportional to just reading in the sequences.
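The trie-with-failure-edges idea can be illustrated with a compact Aho-Corasick automaton. This is a minimal sketch and not the implementation used in the tool: the mapping from k-mers to locus labels, and all function names, are hypothetical.

```python
from collections import deque


def build_automaton(kmer_to_loci: dict[str, set[str]]):
    """Build goto/fail/output tables for a dictionary of k-mers."""
    goto, fail, out = [{}], [0], [set()]
    for kmer, loci in kmer_to_loci.items():
        node = 0
        for ch in kmer:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node] |= set(loci)
    # Breadth-first pass: set failure edges (depth-1 nodes fail to the root).
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] |= out[fail[child]]  # inherit matches ending at the suffix
            queue.append(child)
    return goto, fail, out


def assign_read(read: str, automaton) -> set[str]:
    """Return every locus that shares at least one exact k-mer with the read."""
    goto, fail, out = automaton
    node, hits = 0, set()
    for ch in read:
        while node and ch not in goto[node]:
            node = fail[node]  # follow failure edges instead of backtracking
        node = goto[node].get(ch, 0)
        hits |= out[node]
    return hits
```

Each read is scanned once, so after the automaton is built the cost per read is proportional to its length plus the number of matches reported.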

To further reduce the set of reads assigned to each VNTR, we use a 2-layer feedforward neural network to compute f_i, using a k-mer based embedding to encode DNA strings.

Each read R can be defined by a collection of overlapping k-mers. Details of the neural network architecture and hyper-parameters are presented below. Let v denote the mapping of a read. We use a shallow architecture with an input layer used to present v to the network. We add two layers of fully connected nodes as the hidden layers, with each node using a ReLU activation. In the output layer, there are two nodes, zero and one, which specify whether the read should be classified as true (containing the VNTR) or false (Fig.).
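As a rough illustration of the architecture described above, the sketch below maps a read to a k-mer count vector of size 4^k and passes it through fully connected ReLU layers to two output scores. Whether the embedding uses counts or presence flags, and how the output scores are normalized, is not stated in the text; both choices here are assumptions, and all names are hypothetical.

```python
from itertools import product


def kmer_embedding(read: str, k: int) -> list[int]:
    """Count each of the 4**k possible k-mers among the overlapping k-mers of a read."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    v = [0] * len(index)
    for i in range(len(read) - k + 1):
        j = index.get(read[i:i + k])
        if j is not None:  # skip k-mers containing N or other symbols
            v[j] += 1
    return v


def dense(x, weights, bias):
    """One fully connected layer: weights holds one row per output node."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def forward(v, layers):
    """Apply ReLU hidden layers, then a linear output layer with two nodes."""
    h = v
    for weights, bias in layers[:-1]:
        h = [max(0.0, a) for a in dense(h, weights, bias)]
    weights, bias = layers[-1]
    return dense(h, weights, bias)  # scores for the nodes [zero, one]
```

The read is classified as true when the score of node one exceeds that of node zero.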

We used the training set to train the network with the Adam optimization algorithm. The numbers of nodes in the hidden layers, N_1 and N_2, were chosen empirically. Too many nodes would increase both training time and test time and possibly cause over-fitting.

The choice of k-mer length is important. Increasing the k-mer size could decrease sensitivity in our case, as small variations will significantly change the k-mer composition, whereas lowering the k-mer size reduces the features that are discriminative for a pattern. In addition, our embedding size grows exponentially with k, so there is also a practical upper bound on k.

The accuracy remains comparable in this range (Fig.). To choose the best loss function, we examined three regression loss functions, mean squared error (MSE), mean squared logarithmic error (MSLE), and mean absolute error (MAE), as well as three binary classification loss functions (hinge, squared hinge, and binary cross-entropy).
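For reference, the six losses named above have the following standard definitions. The sketch uses plain Python lists; the hinge variants assume labels in {-1, +1}, and binary cross-entropy assumes labels in {0, 1} with predicted probabilities strictly between 0 and 1 after clipping.

```python
import math


def _mean(values):
    values = list(values)
    return sum(values) / len(values)


def mse(y, p):
    return _mean((a - b) ** 2 for a, b in zip(y, p))


def msle(y, p):
    return _mean((math.log1p(a) - math.log1p(b)) ** 2 for a, b in zip(y, p))


def mae(y, p):
    return _mean(abs(a - b) for a, b in zip(y, p))


def hinge(y, p):
    return _mean(max(0.0, 1.0 - a * b) for a, b in zip(y, p))


def squared_hinge(y, p):
    return _mean(max(0.0, 1.0 - a * b) ** 2 for a, b in zip(y, p))


def binary_cross_entropy(y, p, eps=1e-12):
    clipped = [min(max(b, eps), 1.0 - eps) for b in p]
    return _mean(-(a * math.log(b) + (1 - a) * math.log(1 - b))
                 for a, b in zip(y, clipped))
```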

We compared the validation performance of our models for these six different loss functions. Each distribution in Supplementary Fig. S24 shows the accuracy on the validation set across genomic loci. We analyzed these distributions using one-way analysis of variance (ANOVA), and none of them was significantly better than the others. We chose binary cross-entropy as it obtained the highest mean accuracy. The running time using the two filters could be modeled as:

For each locus, we labeled reads as true reads or not, based on exact location.

We trained all neural network models using the training and validation sets, and reported performance on the test dataset. To augment the data, we added random single nucleotide variations to the genome sequences of the dataset before simulating the sequencing reads. For each sequence in the dataset, we replaced each of its nucleotides with a random one with probability r_m.

To test and compare genotyping accuracy against VNTRseek v1. As a result, target VNTRs remained. We used ART [67] to generate heterozygous samples by simulating 15X coverage reads from each modified haplotype, which contained a non-reference allele, and combined those with 15X reads that were simulated from the reference.

Together, this provided six diploid simulated datasets for each locus, at 30X coverage. Similarly, to test and compare genotyping accuracy against GangSTR [40] v2. A total of target VNTRs remained. As genotyping VNTRs remains computationally expensive, we focused on the smaller set of VNTRs located within coding, untranslated, or promoter regions of genes, which are most likely to be involved in regulation. Overall, this procedure identified 13, VNTRs, of which 10, were within the size range for short-read genotyping (Fig.).

We subsequently added two VNTRs previously linked to a human disease to obtain 10, target loci [38]. In addition, we computed the base-pair difference that VNTRs make in the genome of each individual by taking the copy number difference between the reference and the sample at each VNTR and multiplying it by the pattern length of the locus. We computed how many loci on average differed between an individual and the reference by combining all non-reference calls in at least one haplotype from all individuals and dividing by the number of called variants.

For the remaining genes, we quantile-normalized the RPKM values of each tissue to a normal distribution. Before analyzing the association between VNTR genotypes and gene expression levels, we adjusted gene expression levels for each tissue to control for the covariates of sex, population structure, and technical variation in measuring expression. For population structure, we used the top ten PCs from a principal components analysis (PCA) on the matrix of SNP genotypes.

To correct for non-genetic factors such as technical variations in measuring RNA expression levels e. We removed the effect of covariates by regressing them out of the RNA expression matrix of each tissue, subtracting their contributions, and using the residuals for all eQTL association analyses.
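Regressing out covariates amounts to fitting an ordinary least-squares model of expression on the covariates and keeping the residuals. The sketch below does this for one gene with the standard library only, solving the normal equations by Gaussian elimination; it illustrates the idea and is not the code used in the study, and the function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariates are collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def residualize(expression, covariates):
    """Return expression minus its least-squares fit on an intercept plus covariates.

    expression: one value per individual.
    covariates: one row per individual (for example sex, PCs, technical factors).
    """
    design = [[1.0] + list(row) for row in covariates]
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(design, expression)) for i in range(p)]
    beta = solve_linear_system(xtx, xty)
    return [y - sum(b * x for b, x in zip(beta, r)) for r, y in zip(design, expression)]
```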

We normalized the individual raw gene expression values to N(0, 1) by subtracting the mean and dividing by the standard deviation of the expression values for that cohort. For a gene-VNTR pair v, let y_iv denote the normalized expression value of the gene in v for individual i and x_iv denote the genotype of the VNTR in v for individual i.
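The normalization to N(0, 1) is a z-score transform within a cohort. A short sketch, assuming the population standard deviation is used (the text does not specify which estimator):

```python
from statistics import mean, pstdev


def zscore(values: list[float]) -> list[float]:
    """Subtract the cohort mean and divide by the cohort standard deviation."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        raise ValueError("values have zero variance")
    return [(v - mu) / sd for v in values]
```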

Overall, significant tests were observed from a total of 73, tests in all tissues and unique VNTRs passed the significance test in at least one tissue. We performed a similar correction for the Geuvadis cohort. For the Icelandic cohort, only the VNTRs that showed significant associations in GTEx were tested, using unmapped reads plus reads mapped to those specific loci.

Hence, we used the conservative P value cutoff from whole blood tissue of the smaller GTEx cohort. Then, we ranked all variants based on their association P value. We further used a fine-mapping method, CAVIAR, as an orthogonal method to identify the causal variant for the change in gene expression level.

CAVIAR is a statistical method that quantifies the probability that a variant is causal by combining association signals i. We ran CAVIAR with parameter -c 1 to identify the most likely causal variant, along with the causality probability distribution for each variant site. Further information on research design is available in the Nature Research Reporting Summary linked to this article. The analyses presented in this paper are based on the use of GTEx study data downloaded from the dbGaP web site, under phs Source data are provided with this paper.

Willems, T. The landscape of human STR variation. Genome Res.
Gymrek, M. A genomic view of short tandem repeats.
Li, M.
Gemayel, R. Variable tandem repeats accelerate evolution of coding and regulatory sequences.
Vafiadis, P.
Brookes, K. The VNTR in complex disorders: the forgotten polymorphisms? A functional way forward? Genomics.
Capurso, C. Psychiatry 34.
Lalioti, M. Dodecamer repeat expansion in cystatin B gene in progressive myoclonus epilepsy. Nature.
Fondon, J. Molecular origins of rapid and continuous morphological evolution. Proc. Natl Acad. Sci. USA.
A mutation in hairless dogs implicates FOXI3 in ectodermal development. Science.
Vogler, A. Mutations, mutation rates, and evolution at the hypervariable VNTR loci of Yersinia pestis.
Supply, P. Automated high-throughput genotyping for study of global epidemiology of Mycobacterium tuberculosis based on mycobacterial interspersed repetitive units.
Sonay, T. Tandem repeat variation in human and great ape populations and its impact on gene expression divergence.
Sulovari, A. Human-specific tandem repeat expansion and differential gene expression during primate evolution.
Nicolae, D. PLoS Genet.
Nica, A. Candidate causal regulatory effects by integration of expression QTLs with complex trait genetic associations.
Gilad, Y. Revealing the architecture of gene regulation: the promise of eQTL studies. Trends Genet.
Battle, A. Genetic effects on gene expression across human tissues. Nature.
Borel, C.
Dolzhenko, E. Detection of long repeat expansions from PCR-free whole-genome sequence data.
Bakhtiari, M. Targeted genotyping of variable number tandem repeats with adVNTR.
Gelfand, Y. VNTRseek: a computational tool to detect tandem repeat variants in high-throughput sequencing data. Nucleic Acids Res.
De Roeck, A. Nanosatellite: accurate characterization of expanded tandem repeat length and sequence through whole genome long-read sequencing on PromethION. Genome Biol.
Mitsuhashi, S. Tandem-genotypes: robust detection of tandem repeat expansions from long DNA reads.
Lappalainen, T. Transcriptome and genome sequencing uncovers functional variation in humans.
Chiang, C. The impact of structural variation on human gene expression.
Quilez, J. Polymorphic tandem repeats within gene promoters act as modifiers of gene expression and DNA methylation in humans.
Abundant contribution of short tandem repeats to gene expression variation in humans.
Fotsing, S. The impact of short tandem repeat variation on gene expression.
Grundberg, E. Mapping cis- and trans-regulatory effects across multiple tissues in twins.
Wright, F. Heritability and genomics of gene expression in peripheral blood.
Manolio, T. Finding the missing heritability of complex diseases.
Hannan, A.
Benson, G. Tandem repeats finder: a program to analyze DNA sequences.
Ebbert, M. Molecular Neurodegeneration 13, 46.
Wang, Y. Endothelial nitric oxide synthase gene polymorphism in intron 4 affects the progression of renal failure in non-diabetic renal diseases.
Langmead, B. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9.
Mousavi, N. Profiling the genome-wide landscape of tandem repeat expansions.
A global reference for human genetic variation. Nature, 68-74.
Gudbjartsson, D. Large-scale whole-genome sequencing of the Icelandic population.
Stegle, O. Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses.
Stranger, B. Patterns of cis regulatory variation in diverse human populations.
Urbut, S. Flexible statistical methods for estimating and testing effects in genomic studies with multiple conditions.
Bomba, L. The impact of rare and low-frequency genetic variants in common disease.
Hormozdiari, F. Identifying causal variants at loci with multiple signals of association. Genetics.
Hao, R. Gene expression profiles indicate tissue-specific obesity regulation changes and strong obesity relevant tissues. Cell Metab.
Li, G. Lean rats with hypothalamic pro-opiomelanocortin overexpression exhibit greater diet-induced obesity and impaired central melanocortin responsiveness. Diabetologia 50.
Savino, A. Network analysis allows to unravel breast cancer molecular features and to identify novel targets.
Skubitz, A. Differential gene expression identifies subgroups of ovarian carcinoma.
Marioni, R. Psychiatry 8, 99.
Pimenova, A. Psychiatry 83.
Lee, D.
Givalos, N. Replication protein A is an independent prognostic indicator with potential therapeutic implications in colon cancer.
Tomioka, Y. Decreased serum pyridoxal levels in schizophrenia: meta-analysis and Mendelian randomization analysis. Psychiatry Neurosci.
Sato, N. Genes Chromosomes Cancer 49.
Gylfe, A. Eleven candidate susceptibility genes for common familial colorectal cancer.
Morales, F. A polymorphism in the MSH3 mismatch repair gene is associated with the levels of somatic instability of the expanded CTG repeat in the blood DNA of myotonic dystrophy type 1 patients. DNA Repair 40, 57-66.
Williams, G. MSH3 promotes dynamic behavior of trinucleotide repeat tracts in vivo.
Aho, A. Efficient string matching: an aid to bibliographic search. Communications of the ACM 18.
Kingma, D.


