APORC Document Center: VPECA

vPECA: variants interpretation method by Paired Expression and Chromatin Accessibility data

Version 1.0
Last updated: June 6, 2019

Reference

Jingxue Xin, Hui Zhang, Yaoxi He, Zhana Duren, Chaoying Cui, Lang Chen, Xin Luo, Dong-Sheng Yan, Chaoyu Zhang, Xiang Zhu, Qiuyue Yuan, Xuebing Qi, Ouzhuluobu, Wing Hung Wong, Yong Wang, Bing Su. Chromatin accessibility landscape and regulatory network of high-altitude hypoxia adaptation. (In submission).

Method

We developed a new method called vPECA (Variants interpretation model by Paired Expression and Chromatin Accessibility data) to model genome-wide chromatin accessibility profiles for high-altitude hypoxia adaptation in HUVEC and to reveal the causal SNPs, active regulatory elements (REs), and active selected REs for a given gene. vPECA integrates our measured paired expression and chromatin accessibility data with available public data, including population genetics data, functional genomics data from ENCODE, and Hi-C data for HUVEC. Our previous work, PECA, integrates paired expression and chromatin accessibility data across diverse cellular contexts and models the localization of chromatin regulators (CRs) to REs, the activation of REs by the CRs localized to them, and the effect of transcription factors (TFs) bound to activated REs on the transcription of target genes (TGs) [18]. Our innovation here is to extend PECA to interpret genetic variants using population genetics and matched WGS data. vPECA models how positively selected noncoding SNPs affect an RE's selection status, chromatin accessibility, and activity, and in turn determine target gene expression. The statistical model allows us to systematically identify active REs, active selected REs, and the gene regulatory network used to interpret variants.

Processing data

The vPECA model requires sample-matched time-series RNA-seq and ATAC-seq data and individual-matched DNA-seq data, together with selection scores calculated for each SNP from public data. For the RNA-seq data, we first processed raw reads into an expression matrix with genes as rows and samples as columns. The chromatin accessibility (ATAC-seq) data are processed into a matrix with elements as rows and samples as columns. Candidate RE-TG pairs, chosen by genomic distance, are collected in Element_gene in Data_prior.mat. The SNPs located on REs and their corresponding selection scores are stored in a text file named element_SNP_use.txt. If a prior (TF-TG, TG-RE) learned from public data is not available, it can be set to a fixed number. TF binding strengths are calculated with a motif scan algorithm.
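
The distance-based collection of candidate RE-TG pairs can be illustrated with a short sketch. vPECA itself is written in MATLAB and stores these pairs in Element_gene. The Python below is not the vPECA code, only a minimal illustration of the idea. The 1 Mb window, the use of the element midpoint and the gene TSS to measure distance, the tuple layouts, and all element and gene names are assumptions made for this example; the source does not specify them.

```python
from bisect import bisect_left, bisect_right

# Hypothetical cutoff: the documentation does not state the distance window.
MAX_DISTANCE = 1_000_000


def candidate_pairs(elements, genes, max_distance=MAX_DISTANCE):
    """Pair each regulatory element with nearby genes on the same chromosome.

    elements: list of (element_id, chrom, start, end)
    genes:    list of (gene_id, chrom, tss)
    Returns a list of (element_id, gene_id, distance), where distance is
    measured from the element midpoint to the gene TSS (an assumption).
    """
    # Group TSS positions by chromosome and sort them for binary search.
    by_chrom = {}
    for gene_id, chrom, tss in genes:
        by_chrom.setdefault(chrom, []).append((tss, gene_id))
    index = {}
    for chrom, entries in by_chrom.items():
        entries.sort()
        index[chrom] = ([pos for pos, _ in entries],
                        [gid for _, gid in entries])

    pairs = []
    for element_id, chrom, start, end in elements:
        if chrom not in index:
            continue
        positions, gene_ids = index[chrom]
        midpoint = (start + end) // 2
        # Locate the slice of TSSs inside [midpoint - d, midpoint + d].
        lo = bisect_left(positions, midpoint - max_distance)
        hi = bisect_right(positions, midpoint + max_distance)
        for pos, gene_id in zip(positions[lo:hi], gene_ids[lo:hi]):
            pairs.append((element_id, gene_id, abs(pos - midpoint)))
    return pairs


# Toy input with placeholder names, for illustration only.
elements = [
    ("RE1", "chr1", 1_000, 1_500),
    ("RE2", "chr1", 5_000_000, 5_000_400),
    ("RE3", "chr2", 200, 700),
]
genes = [
    ("GENE_A", "chr1", 200_000),
    ("GENE_B", "chr1", 4_500_000),
    ("GENE_C", "chr2", 3_000_000),
]

pairs = candidate_pairs(elements, genes)
for element_id, gene_id, distance in pairs:
    print(element_id, gene_id, distance)
```

Sorting the TSS positions once per chromosome and using binary search keeps the pairing fast when there are many elements. In this toy input, RE3 gets no candidate gene because the only gene on its chromosome lies outside the assumed window.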

Running vPECA

All the main programs are in the main_PECA.m file. Run the script and retrieve the results from the folder called Output. The selection status of each RE and the TF-RE-TG triplets are listed