LDpred2-auto for EWAS #473
LDpred and lassosum are basically trying to approximate a penalized linear regression.
OK, I've gone through the tutorial and it seems feasible. Of course, in methylation data there are no a0 and a1, but if I understand it correctly these are just used to match the test-data variants with the sumstats ones? And since rsIDs (cgIDs for methylation) can be used, this should be fine. Something I don't understand at the moment is the difference between a continuous and a binary phenotype. The "affection" field is the continuous phenotype info, right? Do both LDpred and lassosum automatically switch to logistic regression in the case of a binary phenotype? I had some trouble understanding this in the lassosum paper too, where the objective function f(β) = (y − Xβ)^T (y − Xβ) makes no sense to me if y is a binary vector. Am I confused about something here?
Yes, you can probably just match on the IDs. Both LDpred and lassosum try to approximate a linear regression using summary statistics.
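To spell out why a binary y is not a problem in this framework (standard algebra, not a quote from either paper): expanding the least-squares objective shows that it depends on the data only through X^T y and X^T X,

$$
f(\beta) = (y - X\beta)^\top (y - X\beta) = y^\top y \;-\; 2\,\beta^\top X^\top y \;+\; \beta^\top X^\top X\,\beta .
$$

With standardized columns, X^T y / n is (up to scaling) what the marginal summary statistics provide, and X^T X / n is the correlation matrix that the LD (or co-methylation) reference approximates. So with a binary y these methods fit a linear approximation (a linear probability model) rather than switching to logistic regression.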
Any update on this?
I have managed to get as far as snp_ldpred2_auto() with some test data. For now I'm just using a regular Pearson correlation matrix. At the moment I'm getting only NA values in beta_est; probably the models don't converge. I am only running this on my laptop so far, so I suspect that my fake test set (500 × 3000) has too little data. Does this sound likely? My data is randomly generated, so it might also be that it's lacking patterns. I will move over to a server and try it on a larger test set to see if it works. I'll put my pipeline on GitHub and share it here if you're interested in taking a look. Cheers
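In case it helps with debugging the NAs, here is a rough, self-contained sketch of that kind of simulated run (all object names and the identity "LD" matrix are placeholders, not the actual pipeline): if LDpred2-auto converges here, the NAs in beta_est more likely come from the correlation matrix or the summary statistics than from the size of the test set.

```r
library(bigstatsr)
library(bigsnpr)
library(Matrix)

set.seed(1)
n <- 500; p <- 3000
meth <- FBM(n, p, init = rnorm(n * p))                  # fake methylation levels
beta_true <- rnorm(p, sd = 0.1) * rbinom(p, 1, 0.02)    # a few true effects
pheno <- big_prodVec(meth, beta_true) + rnorm(n)

# EWAS-like marginal summary statistics
ewas <- big_univLinReg(meth, pheno)
df_beta <- data.frame(beta = ewas$estim, beta_se = ewas$std.err, n_eff = n)

# simplest possible correlation matrix: all probes treated as independent
corr0 <- sparseMatrix(i = 1:p, j = 1:p, x = 1)
corr  <- bigsparser::as_SFBM(corr0, tempfile())

# heritability estimate from LD score regression
ld <- colSums(corr0^2)
ldsc <- snp_ldsc(ld, length(ld), chi2 = (df_beta$beta / df_beta$beta_se)^2,
                 sample_size = df_beta$n_eff, blocks = NULL)

multi_auto <- snp_ldpred2_auto(corr, df_beta, h2_init = ldsc[["h2"]],
                               vec_p_init = seq_log(1e-4, 0.5, 10))
beta_auto <- rowMeans(sapply(multi_auto, function(auto) auto$beta_est))
summary(beta_auto)   # all-NA output here would point at a deeper problem
```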
Small update: I'm trying to create a correlation matrix without any genomic context from my test set (1227 samples/rows, 760668 probes/cols) with the following:
You wouldn't happen to have any intuition as to whether this will be feasible or whether it will take too long? I've queued it on the HPC now, requesting 24 h on 64 cores and 500 GB of RAM, so I'll get some diagnostics. So far I don't have any chr or pos information, so I've tried to hack around the snp_cor() function. I don't know how it would behave even if I could find the positions, since genetic distance might or might not apply to methylation.
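For what it's worth, here is one way a position-free sparse correlation matrix could be sketched: keep correlations only between probes within a fixed window of column indices and store the result sparsely. The function name and window size are made up, and whether index adjacency means anything biologically is exactly the open question.

```r
library(Matrix)

# `meth`: n x p matrix of methylation values (probes in columns).
# Only correlations between probes at most `window` columns apart are kept.
windowed_cor <- function(meth, window = 50) {
  p <- ncol(meth)
  triplets <- do.call(rbind, lapply(seq_len(p), function(j) {
    k <- j:min(j + window, p)
    cbind(i = j, j = k, x = as.vector(cor(meth[, j], meth[, k])))
  }))
  corr <- sparseMatrix(i = triplets[, "i"], j = triplets[, "j"],
                       x = triplets[, "x"], dims = c(p, p))
  forceSymmetric(corr, uplo = "U")   # only the upper triangle was filled
}
```

Note that zeroing out correlations like this can still make the matrix non-positive definite, which connects to the warning in the next reply.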
Be careful: adding sparsity to your correlation matrix by keeping only the largest absolute values is probably a bad idea; you will get a matrix that does not behave well in algorithms because of the negative eigenvalues this thresholding introduces. I would first have a look at the correlation structure.
Thanks for the warning, this was just something I improvised; I wasn't even aware that this was a common method for inducing sparsity. Would you say it's never a good idea, or could it be OK if the matrix is still positive definite? The main challenge will be to find something equivalent to LD in methylation, some biological motivation for blocks/clumping. I know some people have used co-methylation; I'm looking into this right now: https://pubmed.ncbi.nlm.nih.gov/31985744/
When you compute a correlation matrix using fewer samples than variables, you already have a non-positive-definite matrix. It is only positive semi-definite (some eigenvalues are 0), which can already be a problem. I think any thresholding would probably introduce negative eigenvalues pretty quickly on such matrices. You can always regularize the correlation matrix to counter this effect, e.g. Rs = s R + (1 − s) I, choosing s = (ε − 1) / (λ − 1), where λ is the smallest eigenvalue of R (probably negative) and ε is the new smallest eigenvalue of Rs that you choose (positive).
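A small sketch of that regularization (the function name is mine); it shifts all eigenvalues of R linearly so the smallest becomes the chosen ε, and in practice it would be applied per correlation block rather than to one huge matrix.

```r
# Rs = s * R + (1 - s) * I, with s = (eps - 1) / (lambda_min - 1) so that the
# smallest eigenvalue of Rs equals `eps` (requires lambda_min < 1, which holds
# whenever R has zero or negative eigenvalues).
regularize_cor <- function(R, eps = 1e-4) {
  lambda_min <- min(eigen(R, symmetric = TRUE, only.values = TRUE)$values)
  s <- (eps - 1) / (lambda_min - 1)
  s * R + (1 - s) * diag(nrow(R))
}
```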
Yes, if all blocks are smaller than the number of individuals, that should be fine.
I have managed to get results with LDpred2-auto, but they are quite abysmal at the moment. The pcor() is -0.006, and the measure we have used for our previous models, glm(factor(phenotype) ~ risk_scores_vector, family = binomial), performs equally poorly, with a calculated Nagelkerke R2 of 9e-6. The R2 from this assessment is 0.03 when running the unadjusted effect sizes for comparison. The problem most likely stems from my correlation matrix. I have created a matrix with blocks based on co-methylation, but since only about 4% of probes are clustered into blocks and 96% are singletons, my matrix ends up being essentially a zero matrix with ones on the diagonal, plus small correlation blocks on the diagonal. The correlation blocks are very small, with 3-6 probes in each cluster. Do you have any sense of how the algorithm would behave with such a matrix?
Yes, I did double-check that my samples and probes are ordered correctly, especially since the comeback package that I used to create my co-methylated regions jumbles the order of probes. So if I fed the algorithm an identity matrix of size length(probes) × length(probes), meaning all probes are independent, I would simply get a transformed set of effect sizes that would basically have the same relationship among them as the original effect sizes? This would mean the effect sizes are not improved at all? This matrix is the result of the comeback package, which takes the test data as input and determines co-methylated regions, i.e. regions that are methylated or unmethylated together. It determines these regions by looking at probe bp distance, probe density, and correlation of methylation levels. With my current settings, probes are considered a co-methylated region if they are less than 100,000 bp apart, with a density of probes in between them of < 400 bp and a correlation of methylation levels > 0.3. But only about 4% of the probes are currently assigned to these co-methylated regions; that's why I get so few correlation blocks. I am only calculating within-region correlations, and most of the matrix is just empty, since most probes are considered independent in this case. So what I need is some block correlation structure that covers all or most of the probes in my set?
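If useful, here is a rough sketch of how such a block-diagonal matrix could be assembled and converted for snp_ldpred2_auto(); the `blocks` list (one vector of column indices per co-methylated region) and the function name are assumptions about how the comeback output might be reshaped, not the actual code.

```r
library(Matrix)

# `meth`: n x p matrix of methylation values; probes not in any block are
# treated as independent (a 1 on the diagonal).
build_block_corr <- function(meth, blocks) {
  p <- ncol(meth)
  in_block <- unlist(blocks)
  singletons <- setdiff(seq_len(p), in_block)
  block_mats <- c(lapply(blocks, function(ind) cor(meth[, ind])),
                  list(Diagonal(length(singletons))))
  corr0 <- bdiag(block_mats)              # sparse block-diagonal matrix
  ord <- order(c(in_block, singletons))   # map back to the original probe order
  corr0[ord, ord]
}

# corr0 <- build_block_corr(meth, comethylation_blocks)
# corr  <- bigsparser::as_SFBM(corr0, tempfile())   # format expected by snp_ldpred2_auto()
```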
You need the correlation matrix, but that's complicated if you have 760,668 variables and no way to pre-determine that some of them are mostly uncorrelated.
Yes, that seems to be the main issue. If I manage to get a complete probe-probe correlation matrix, meaning no zero values but an expected high number of near-zero values, do you think the algorithm would respond well?
The problem is that you cannot get a 760K × 760K dense matrix; that's too large.
Right, 4 bytes × 760,668² ≈ 2.3 TB. Should have thought of that, of course...
The default is even 8 bytes per value, so about 4.6 TB; that's a lot, yes.
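For reference, the back-of-the-envelope arithmetic with the 760,668 probes mentioned above:

```r
p <- 760668
p^2 * 4 / 1e12   # ~2.3 TB as 4-byte floats
p^2 * 8 / 1e12   # ~4.6 TB as 8-byte doubles (the default in R)
```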
By applying some less strict thresholds for co-methylation, I created a correlation matrix where about 17% of probes are in non-singleton blocks. This improved the pcor to -0.022, so it seems that the more probes I get into blocks, the better the result. I currently have a calculated h2 of 29, which seems extremely high; I was expecting h2 to be between 0 and 1.
So a negative correlation would be a sign of something going wrong in the model, not a true reflection of the data relation? The pcor results are -0.0225 -0.0783 0.0336. Does the h2 used in snp_ldpred2_auto() work mathematically in any case? I calculate it with snp_ldsc(), but my df_beta only contains placeholder values for a1 = "B" and a0 = "C", so I kind of force it into the correct format. As I understand it, snp_ldsc() doesn't make use of a1 and a0 anyway. Yes. Instead of big_prodVec() I simply do
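For completeness, a plain in-memory equivalent of big_prodVec() would be an ordinary matrix-vector product (object names here are placeholders, not the code actually used):

```r
# Score each sample with the adjusted effect sizes; `meth` is the n x p
# methylation matrix, with columns aligned with beta_auto.
pred_auto <- drop(meth %*% beta_auto)
pcor(pred_auto, pheno, NULL)   # estimate plus 95% CI, as in the tutorial
```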
Have you looked at the QC step comparing standard deviations?
That's a good idea. Is there a function implementing the QC check from the tutorial? I did a run now with probes randomly assigned to correlation blocks, and my pcor output is 0.16 0.11 0.22, so that's a lot better. As for the R2 value, it's 0.036, which is higher than for the unadjusted scores and all previous models. This is good news. It means that as long as I can find a biological motivation for assigning all probes to blocks, I can probably get some good numbers.
This is the part where it is described: https://privefl.github.io/bigsnpr/articles/LDpred2.html#quality-control-of-gwas-summary-statistics.
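For a continuous phenotype, that QC step amounts to comparing the probe standard deviations observed in the test data with those implied by the summary statistics. A sketch adapted to methylation follows; the sd_ss formula is the tutorial's continuous-trait one, and the cutoff constants are the genotype-tuned examples from there, so they would likely need adjusting for methylation-scale SDs.

```r
library(bigstatsr)

# `meth`: FBM of methylation values; `df_beta`: beta, beta_se, n_eff;
# `sd_y`: standard deviation of the phenotype used in the EWAS.
sd_obs <- sqrt(big_colstats(meth)$var)                           # SDs in the test data
sd_ss  <- with(df_beta, sd_y / sqrt(n_eff * beta_se^2 + beta^2)) # SDs implied by sumstats

is_bad <- sd_ss < (0.5 * sd_obs) | sd_ss > (sd_obs + 0.1) | sd_ss < 0.05
plot(sd_obs, sd_ss, pch = 20, col = is_bad + 1L)   # flagged probes in red
```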
Here's an update: I used TAD regions to create blocks in my correlation matrix. This worked very well and gave the best pcor and R2 values so far. I will clean up the code and post it somewhere in a few weeks. Right now I'm trying to grasp a bit of what is happening mathematically for my thesis report. Is it so that snp_ldsc() uses the equation from the LDpred1 (Vilhjalmsson et al.) paper under the heading "Estimation of the Heritability Parameter"?
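For reference (this is the standard LD score regression model, not a quote from the LDpred1 paper): snp_ldsc() regresses the chi-squared statistics on the LD scores,

$$
\mathbb{E}\!\left[\chi^2_j\right] \;=\; \frac{N h^2}{M}\,\ell_j \;+\; N a \;+\; 1,
\qquad \ell_j = \sum_k r_{jk}^2,
$$

where N is the sample size, M the number of variants (probes), ℓ_j the LD score of variant j, and the intercept term absorbs confounding; h² is then read off the slope. The LDpred1 heritability estimate is based on essentially this same relation.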
OK, I think I get it. Another thing I wondered about is where the penalization, or the approximation of penalized regression, happens. Is it in the incorporation of the zero values from the correlation matrix into the betas in the sampler (equation 5 in the LDpred2 paper)?
No, the zero values in the correlation matrix are there for scalability. Two things can be thought of as regularization:
We will try to publish a research note in BMC Bioinformatics about my EWAS project, and I'd like to include you in the acknowledgements if that's OK?
Yes
Hello!
I have run into the problem of constructing risk scores from EWAS summary statistics without having access to the individual samples. Is there an equivalent of LDpred2-auto for methylation data? Alternatively, do you think it would be possible to adapt the pipeline for EWAS? This is something I could be interested in doing, depending on time and skill constraints.
Thanks!
/Kristoffer