Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LDpred2-auto for EWAS #473

Open
KristofferSandas opened this issue Jan 15, 2024 · 33 comments
Open

LDpred2-auto for EWAS #473

KristofferSandas opened this issue Jan 15, 2024 · 33 comments

Comments

@KristofferSandas
Copy link

Hello!

I have run into the problem of constructing risk scores from EWAS summary statistics without having access to the individual samples. Is there an equivalent to LDpred2-auto for methylation data? Alternatively, do you think it would be possible to adapt the pipeline for EWAS? This is something I could be interested in doing, depending on time and skill restraints.

Thanks!
/Kristoffer

@privefl
Copy link
Owner

privefl commented Jan 16, 2024

LDpred and lassosum are basically trying to approximate a penalized linear regression.
So, if you have the correlation matrix for your methylation data and some validation data, you could probably use LDpred2-grid and lassosum2.
For LDpred2-auto, I'm not sure how the estimation of parameters would behave in such a case, but you can definitively try it.

@KristofferSandas
Copy link
Author

Ok, I've gone through the tutorial and it seems feasible.

Of course, in methylation data there is no a0 and a1, but if i understand it correctly these are just used to match the test data variants with the sumstats ones? And since rsID (cgID for methylation) can be used this should be fine.

Something I dont understand at the moment is the difference when the phenotype is continuous and binary. The "affection" is the continuous phenotype info right? Does both LDpred and lassosum automatically switch to logistic regression in the case of a binary phenotype? I had some trouble understanding this in the lassosum paper too, where the objective function f (β) = (y − Xβ)T (y − Xβ) makes no sense to me if y is a binary vector. Am I confused about something here?

@privefl
Copy link
Owner

privefl commented Jan 25, 2024

Yes, you can probably just match() on the cgIDs.

Both LDpred and lassosum try to approximate a linear regression using summary statistics.
Usually it makes very little difference whether you're using a linear or logistic regression model, and PGS methods actually implementing a logistic model are very scarce and do not perform better.

@privefl
Copy link
Owner

privefl commented Feb 6, 2024

Any update on this?

@KristofferSandas
Copy link
Author

I have managed to get as far as snp_ldpred2_auto() with some test data . For now Im just using a regular pearson correlation matrix. At the moment i'm getting only NA values in beta_est. Probably the models dont converge. I am only running this on my laptop so far so I'm suspecting that my fake test set (500x3000) has too little data. Does this sound likely? My data is randomly generated so it might also be that its lacking patterns. I will move over to a server and try it on a larger test set to see if it works. I'll put my pipeline on github and share it here if you're interested to take a look.

cheers

@KristofferSandas
Copy link
Author

Small update: I'm trying to create a correlation matrix without any genomic context from my test set (1227 samples/rows, 760668 probes/cols) with the following:

df<- test_data
num_features <- ncol(df)
tmp <- tempfile(tmpdir = "tmp-pipeline-test-1")

for (i in 1:num_features) {
  if (i %% 1000 == 0) {
    cat("Number of features processed:", i, "out of", num_features, "\n")
  }
  
  # Extract one feature at a time
  current_feature <- df[, i]
  
  # Initialize correlations matrix of zeros
  correlations <- sparseMatrix(i = integer(0), j = integer(0), x = double(0), 
                              dims = c(num_features, 1))
  
  # Calculate correlations with all other features
  for (j in 1:num_features) {
    # Populate correlations
    correlations[j, ] <- cor(current_feature, df[, j])
    }

  # Set sparsity for correlations
  # we want to keep only the top 40% largest absolute value correlations, 
  # and set all other correlations to zero.
  # This is just to create a sparse matrix and make it more manageable,
  # it has no biological rationale.
  sparsity_cutoff <- 0.60  
  cutoff <- quantile(abs(correlations), sparsity_cutoff)
  
  # Zero out correlations below the cutoff
  correlations[abs(correlations) < cutoff] <- 0
  
  if (i == 1) {
    ld <- Matrix::colSums(correlations^2)
    corr <- as_SFBM(correlations, tmp, compact = TRUE)
  } else {
    ld <- c(ld, Matrix::colSums(correlations^2))
    corr$add_columns(correlations, 0)
  }
}

You wouldn't happen to have any intuition if this will be feasible or if it will take too long? I've queued it on the HPC now requesting 24h of 64 cores and 500Gb RAM so I'll get some diagnostics. So far I dont have any chr or pos information so I've tried to hack around the snp_cor() function. Dont know how it would behave even if I can find the positions, since genetic distance might or might not apply for methylation.

@privefl
Copy link
Owner

privefl commented Feb 16, 2024

Be careful, adding sparsity to your correlation matrix by keeping the largest absolute values only is probably a bad idea; you will get some matrix that does behave well in algorithms due to the negative eigenvalues this thresholding introduces.

I would first look at the correlation structure with e.g. Matrix::image(corr^2). To see whether there is some block structure, whether it makes sense to use some distance threshold, whether you have inter-chromosomal correlations, etc.

@KristofferSandas
Copy link
Author

Thanks for the warning, this was just something I improvised, I wasnt even aware that this was a common method for sparsity induction. Would you say its never a good idea or could it be ok if the matrix is still positive definite?

The main challenge will be to find something equivalent to LD in methylation, some biological motivation for blocks/clumping. I know some people have used co-metylation, I'm looking into this right now: https://pubmed.ncbi.nlm.nih.gov/31985744/

@privefl
Copy link
Owner

privefl commented Feb 16, 2024

When you compute a correlation matrix using less samples than variables, you already have a non-positive definite matrix. It is only a semi-positive definite matrix (some eigenvalues are 0), which can already be a problem. I think any thresholding would probably introduce negative eigenvalues pretty quickly on such matrices. You can always regularize the correlation matrix to counter this effect (e.g. Rs = s R + (1 - s) I, and choose s = (ϵ−1) / (λ−1), where λ is the smallest eigenvalue of R (probably negative), and ϵ as the new smallest eigenvalue of Rs that you choose (positive).)

@KristofferSandas
Copy link
Author

Ok, so now Ive clustered my methylation data per-chromosome according to co-methylation and I did correlation only within clusters. I ended up with a matrix with most non-zero values aoound the diagonal, this is from my test run:
image
So now the matrix is a composite of smaller correlation matrices with fewer variables than samples, will i still need to think about semi-positivity for the matrix in its entirety and regularize?

@privefl
Copy link
Owner

privefl commented Feb 19, 2024

Yes, if all blocks are smaller than the number of individuals, that should be fine.
In LDpred2-auto, you can always slightly regularize using shrink_corr = 0.99.

@KristofferSandas
Copy link
Author

I have managed to get results with LDpred2-auto, but they are quite abysmal at the moment. The pcor() is -0.006 and the measure we have used for our previous models, glm(factor(phenotype)~risk_scores_vector, family=binomial), performs equally poorly with a calculated Nagelkerke R2 of 9e-6. The R2 from this assessment is 0.03 when running the unadjusted effect sizes for comparison.

The problem most likely stems from my correlation matrix. I have created a matrix with blocks based on co-methylation, but since only about 4% of probes are clustered into blocks and 96% are singeltons, my matrix ends up looking kind of like this:

image

With the squares on the diagonal symbolizing correlation blocks and the matrix mostly only being a zero matrix with ones on the diag. Correlation blocks are very small with 3-6 probes in each cluster. Do you have any sense of how the algorithm would behave with such a matrix?

@privefl
Copy link
Owner

privefl commented Feb 27, 2024

  • If you consider variants to be independent, what you get is probably just a monotonic transformation of GWAS/EWAS effect sizes.
  • How can you get such a structure?
  • Often an R2 close to 0 results from an error in the order of individuals provided for the PGS and the phenotype (as if you had sampled one of them, without applying the same ordering to the other).

@KristofferSandas
Copy link
Author

Yes, i did double check that my samples and probes are ordered correctly, especially since the comeback package that I used to create my co-methylated regions jumbles the order or probes.

So if i would feed the algorithm an identity matrix length(probes) x length(probes), meaning all probes are independent, i would simply get a transformed set of effect sizes that would basically have the same relationship among them as the original effect sizes? This would mean the effect sizes are not improved at all?

This matrix is the result of the comeback package which takes the test data as input and determines co-methylated regions, i.e. regions that are methylated or unmethylated together. It determines these regions by looking at probe bp distance, probe density, and correlation of expression levels. With my current settings if probes have a distance of less than 100k bp, with a density of probes in between them of <400bp and correlation of methylation levels >0.3, they are considered a co-methylated region. But only about 4% of the probes are currently assigned to these co-methylated regions, thats why i get so few correlation blocks. I am only calculating within-region correlations. Most of the matrix is just empty, since most probes are considered independent in this case.

So what I need is some block correlation structure that covers all or most of the probes in my set?

@privefl
Copy link
Owner

privefl commented Feb 27, 2024

You need the correlation matrix, but that's complicated if you have 760,668 variables and no way to pre-determine that some of them are mostly uncorrelated.

@KristofferSandas
Copy link
Author

Yes, that seems to be the main issue. If I manage to get a complete probe-probe correlation matrix, meaning no zero values but an expected high number of near-zero values, do you think the algorithm would respond well?

@privefl
Copy link
Owner

privefl commented Feb 27, 2024

The problem is that you cannot get a 760Kx760K dense matrix, that's too large.

@KristofferSandas
Copy link
Author

Right, 4 bytes x 760000² = 2.9TB. Should have thought of that of course...

@privefl
Copy link
Owner

privefl commented Feb 27, 2024

The default is even 8 bytes per value, so 6 TB, that's a lot, yes.

@KristofferSandas
Copy link
Author

By applying some less strict thresholds for co-methylation I created a correlation matrix where about 17% of probes are non-singleton blocks. This improved the pcor to -0.022, so it seems that the more of the probes I get into blocks, the better the result.

I currently have a calculated h2 of 29, this seems extremely high. I was expecting h2 to be between 0-1.

@privefl
Copy link
Owner

privefl commented Mar 5, 2024

  • A negative correlation is not good, unless you know you can choose the right direction. What is the confidence interval?

  • I don't think any measure of h2 would make sense here since you do not have genotype data.

  • Maybe other things are going wrong. Did you check range, keep, and some plot like this?
    unnamed-chunk-25-1

@KristofferSandas
Copy link
Author

So a negative correlation would be a sign of something going wrong in the model, not a true reflection of the data relation? The pcor results are -0.0225 -0.0783 0.0336.

Does the h2 used in snp_ldpred2_auto() work mathematically in any case? I calculate it with snp_ldsc() but my df_beta only contains placeholder values for a1="B" and a0="C" so I kind of force it into the correct format. As I understand it snp_ldsc doesnt make use of a1 and a0 anyway.

Yes, keep keeps all 30 chains using standard thresholds, so they all seem to converge, the plots look good too. range ranges lie between 0.3252 and 0.3312 with a mean=median=0.3280. This is where I had problems in my initial run with test data, the beta_est were all NA, but at least I get numbers now.

Instead of big_prodVec() i simply do risk_scores<- t(test_data) %*% beta_auto where test_data is a df with methylation levels (floating points), rows are probes, columns are samples. I order the probes according to the ld vector from when I created the corr SFBM, which should be in the same order as beta_auto. Could something be going wrong here? Im not sure how the G variable is processed in big_prodVec, but G is always values of 0,1,2 right?

@privefl
Copy link
Owner

privefl commented Mar 5, 2024

  • Here you basically have a 0 correlation (not significantly different from 0), and it is negative by chance (you probably have a small sample size given the large CI). This can result from an error in matching either probes or samples.
  • snp_ldsc() should not depend on the sign of the effects.
  • You should probably get the order of probes from df_beta[["_NUM_ID_"]].

@KristofferSandas
Copy link
Author

KristofferSandas commented Mar 6, 2024

Im trying to exclude some possible problems.

Instead of genotype data I have M-values:
Methylation levels are measured for each CpG site with a probe that captures methylated CpGs and one that captures unmethylated. An M-value around 0 means that the methylated and unmethylated signals are close to equal (in theory, one allele is methylated). A positive M-value means that the methylated signal is higher, while a negative M-value means that the unmethylated signal is higher. Example of what the data looks like after accounting for covariates by residualization:
image
So Im dealing with very small floating points instead of 0,1,2.

The sumstats effect sizes are logFC of M-values (schizophrenia cases vs controls), and look like this:
image
I use the random effect sizes from this meta-EWAS.

After formatting my data to create df_beta, all chr and pos values are zero since I dont have locations in my data. I set all a1 to "B" and all a0 to "C". I calculate n_eff according to the recommendation for meta analyses in the LDpred.R doc.

My test sample size is 1227.

Does any of these things seem like they could mess with the algorithm?

@privefl
Copy link
Owner

privefl commented Mar 6, 2024

Have you looked at the QC step comparing standard deviations?
Usually there are problems detected using the random effects.
I would try using the fixed effects instead.

@KristofferSandas
Copy link
Author

Thats a good idea. Is there an function with the QC check from the tutorial implemented?

I did a run now with probes randomly assigned to correlation blocks, and my pcor output is: 0.16 0.11 0.22 so thats a lot better. When it comes to the R2 value its 0.036, which is higher than for the unadjusted scores and all previous models. This is good news. It means that as long as I can find a biological motivation for assigning all probes to blocks I can probably get some good numbers.

@privefl
Copy link
Owner

privefl commented Mar 6, 2024

This is the part where it is described: https://privefl.github.io/bigsnpr/articles/LDpred2.html#quality-control-of-gwas-summary-statistics.
And I think there is some link to some code example that I used in some paper.

@KristofferSandas
Copy link
Author

Heres an update: I used TAD regions to create blocks in my correlation matrix:
https://cb.csail.mit.edu/cb/tadmap/

This worked very well and gave the best pcor and R2 values so far. I will clean up the code and post it somewhere in a few weeks.

Right now Im trying to grasp a bit of what is happening mathemtically for my thesis report. Is it so that snp_ldsc() uses the equation from the LDpred1 Vilhjalmsson paper under headline Estimation of the Heritability Parameter in the following way:
N=n_eff
beta/se(beta) instead of chi^2
ld instead of the l parameter

@privefl
Copy link
Owner

privefl commented Mar 28, 2024

snp_ldsc() implements LD score regression as described in the LDSc regression papers.

@KristofferSandas
Copy link
Author

Ok, I think I get it. Another thing I wondered about is where the penalization or approximation of penalized regression happens? Is it in the incorporation of the zero values from the correlation matrix into the betas in the sampler? equation 5 in LDpred2 paper

@privefl
Copy link
Owner

privefl commented Apr 2, 2024

No, the zero values in the correlation matrix are there for scalability.

Two things can be thought as regularization:

  • the fact that we sample effects with probability pj (therefore prob of 1-pj of being 0)
  • the scaling factor 1 / (1 + M p / (N h2))

@KristofferSandas
Copy link
Author

We will try to publish a research note in BMC bioinformatics about my EWAS project and I'd like to add you in acknowledgements if thats ok?

@privefl
Copy link
Owner

privefl commented Dec 8, 2024

Yes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants