Add ability to read from cloud storage #22

Closed
hammer opened this issue Dec 4, 2023 · 16 comments
Labels
enhancement New feature or request

Comments

@hammer

hammer commented Dec 4, 2023

Perhaps with the object_store crate?

@CarlKCarlK CarlKCarlK added the enhancement New feature or request label Dec 4, 2023
@CarlKCarlK
Contributor

Very interesting idea. Thanks!

Object_store's vectored_read seems especially on-point.
This would be most useful if Python users of bed-reader could use it, too. I wonder if there are any examples of Python crates that under-the-covers use object_store.

Timeline: I should be able to at least take a more detailed look in a few weeks.
(Also, either here or privately, please let me know how useful/important this would be to you/your project/your users.)

@hammer
Author

hammer commented Dec 4, 2023

https://github.com/roeap/object-store-python seems to be a Python interface.

Thanks for the consideration Carl!

@hammer
Author

hammer commented Dec 4, 2023

Also, some details on our use case: we are finally writing up a publication on sgkit at https://github.com/pystatgen/sgkit-publication. I'm hoping to demonstrate how to run a GWAS using a large synthetic dataset called HAPNEST: sgkit-dev/sgkit-publication#43. The BED files for the largest chromosomes are over 100 GB, and I'm hoping to read them in from cloud storage rather than copy them to a local filesystem to read. I suppose I could explore ways of mounting cloud storage on the local filesystem, but it would be nicer I think to just use the cloud storage APIs themselves.

@CarlKCarlK
Contributor

A request for more info to allow eventual performance testing ...
So, the largest bed file is 1 million samples by ??? variants. Would a typical read be all the samples x some slice of variants? If so, how many variants at a time would be typical?
Which cloud provider would be your first choice? (I'll need to test on one and it might as well be the same as yours. I'm especially interested in, and worried about, seeing the vectored_read work efficiently.)

@hammer
Author

hammer commented Dec 7, 2023

1 million samples by 6.8 million variants. Our use case is to read the entire BED file and write it back out in the Zarr storage format. Currently we don't do anything intelligent to read and write in blocks, unfortunately, but that's something we should probably be able to add in our library, as we do when reading large VCF files.
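A rough sketch of what reading and writing in blocks could look like, using only the standard library. The names `variant_blocks` and `copy_in_blocks` are hypothetical helpers, not part of sgkit or bed-reader, and the reader and writer are passed in as plain callables so the loop itself stays independent of either library.

```python
def variant_blocks(sid_count, block_size):
    """Yield (start, stop) variant ranges that cover 0..sid_count."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    for start in range(0, sid_count, block_size):
        yield start, min(start + block_size, sid_count)


def copy_in_blocks(read_block, write_block, sid_count, block_size):
    """Copy variants block by block so only one block is in memory at a time.

    read_block(start, stop) returns the data for that variant range;
    write_block(start, stop, data) stores it. Both are supplied by the caller.
    """
    for start, stop in variant_blocks(sid_count, block_size):
        write_block(start, stop, read_block(start, stop))


# Toy demonstration with Python lists standing in for the BED file and the output.
source = list(range(10))
dest = [None] * len(source)


def write_to_dest(start, stop, data):
    dest[start:stop] = data


copy_in_blocks(lambda start, stop: source[start:stop], write_to_dest, len(source), 4)
print(list(variant_blocks(10, 4)))  # [(0, 4), (4, 8), (8, 10)]
```

In real use, `read_block` might wrap a call that reads all samples for a slice of variants, and `write_block` might assign into the matching slice of the output array; both are assumptions about how the pieces would be wired together.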

@hammer
Author

hammer commented Dec 7, 2023

Oh and for first choice cloud provider I'm currently using Google, but I'd say Amazon is the most common and I'd be happy to switch to them if you prefer.

@CarlKCarlK
Contributor

CarlKCarlK commented Dec 7, 2023 via email

@CarlKCarlK
Contributor

I'm making progress. Below is some (working) example code.

Current API: a Bed struct (that contains a file path)
Proposed, Additional API:

  • a BedCloud struct (that contains an object store and a store path).
  • anything that might involve "reading" must be .awaited
  • You can control the number of concurrent async calls to the cloud
  • You can control the size of the buffer (number of SNPs in each async call to the cloud)

Example:
    let object_store = Arc::new(LocalFileSystem::new());
    let file_path = sample_bed_file("plink_sim_10s_100v_10pmiss.bed")?;
    let store_path = StorePath::from_filesystem_path(file_path).map_err(BedErrorPlus::from)?;
    
    let mut bed_cloud = BedCloud::new((object_store, store_path)).await?;
    let val = bed_cloud.read::<i8>().await?;
    println!("{val:?}");

Another example:

    let object_path = sample_bed_object_path("plink_sim_10s_100v_10pmiss.bed")?;

    let mut bed_cloud = BedCloud::new(object_path).await?;
    let val = ReadOptions::builder()
        .count_a2()
        .i8()
        .read_cloud(&mut bed_cloud)
        .await?;

    let mean = val.mapv(|elem| elem as f64).mean().unwrap();
    println!("{mean:?}");
    assert!(mean == -13.274); // really shouldn't do mean on data where -127 represents missing

@hammer Let me know if this looks like it would work well with your use case.
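For readers following along from Python, the two tuning knobs listed above (the number of concurrent calls and the number of SNPs per call) can be illustrated with a generic asyncio sketch. This is not how bed-reader is implemented; `fetch_block` and `read_in_blocks` are hypothetical names, and the fetch is a stand-in that returns the range it was asked for instead of contacting any cloud service.

```python
import asyncio


async def fetch_block(start, stop):
    """Stand-in for one async cloud call that would fetch SNPs start..stop."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return (start, stop)


async def read_in_blocks(sid_count, block_size, max_concurrent):
    """Fetch all SNPs in blocks, with at most max_concurrent calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(start, stop):
        # The semaphore caps how many fetches run at the same time.
        async with semaphore:
            return await fetch_block(start, stop)

    tasks = [
        bounded(start, min(start + block_size, sid_count))
        for start in range(0, sid_count, block_size)
    ]
    # gather returns results in the order the tasks were listed.
    return await asyncio.gather(*tasks)


blocks = asyncio.run(read_in_blocks(sid_count=100, block_size=30, max_concurrent=2))
print(blocks)  # [(0, 30), (30, 60), (60, 90), (90, 100)]
```

A larger `block_size` means fewer, bigger requests; a larger `max_concurrent` means more requests in flight at once. Which combination works best would have to be measured against a real store.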

@hammer
Author

hammer commented Dec 21, 2023

Thanks Carl! Let me cc @tomwhite as I believe he wrote the BED reader code in sgkit. I assume you'd wrap these Rust interfaces with a Python interface in your library?

@CarlKCarlK
Contributor

Yeah, I kind of forgot that everyone (including my other projects) uses the library from Python and not from Rust. :-)

My next steps:

  • (Holiday stuff)
  • Get it working well from AWS, not just object-store's LocalFileSystem
  • Create the Python interface
  • Get Write (not just Read) working.

@hammer or @tomwhite, if you have any or know of any large non-private PLINK Bed files on AWS (or Google if necessary) that I can use for performance testing, please let me know.

@hammer
Author

hammer commented Jan 11, 2024

@CarlKCarlK Happy new year! I've been using the files from https://www.ebi.ac.uk/biostudies/studies/S-BSST936 for testing. I have all of the smaller files (600 samples) and chr20 for the 1,000,000 samples on Google Cloud Storage if that would be helpful. It took me about 30 minutes to get chr20 off of FTP and into the cloud, so it may just be easier to do yourself too.

@CarlKCarlK
Contributor

@hammer & @tomwhite

Please see the end of this note for something for you to try.

Meanwhile, some observations. I'm finding this feature somewhat frustrating. Let me share some of those frustrations and perhaps you can make suggestions (or just offer sympathy 😊).

  • I would like my documentation to offer my users working examples, but as far as I can tell all the cloud providers require authentication even to access "public data", making simple examples impossible to offer.
  • I depend on the Rust version of object_store. It is OK and I'm using it extensively under the covers. However, its documentation is very bare bones (perhaps in part because of the first point above).
  • The Python version of object_store is more limited, so instead of using it, I'm currently just having Python pass URLs and option strings. I hope that is OK with you and other future users.
  • Sadly, I can't find any nice documentation on creating cloud access URLs and option strings. This puts me in the position of telling my users "just create a URL for the cloud access", but not being able to point them to any good instructions on how to create such a URL.
  • I'd like to test and tune the performance of downloading parts of big files from the cloud. However, I'm afraid to host the data myself for fear of an unexpected bill. (Last year, I ran up a $200 bill on AWS when testing the Mac M2 version of BedReader -- I begged, and AWS did kindly forgive it.)
  • This feature increases the size of the bed-reader download from 1.5 MB to 7.5 MB. I need to investigate to see why. Maybe I misconfigured something or maybe this is just the cost of adding cloud support.
  • For a while, I thought I could expose async on the Python side. Instead, I now offer only a regular (non-async) interface which is much, much simpler for users. Under the covers, in the Rust code, it does use and offer async. I hope no Python users need direct async access.

None of this means the feature isn't worth adding. I just think the state of cloud/async/etc. is a bit more primitive than I would have thought.

  • Carl

============
Please try this beta version with cloud support:

pip install bed-reader[samples,sparse]==1.0.1b1

The documentation is here: https://fastlmm.github.io/bed-reader/1.0.1beta/

Here is sample usage:

import numpy as np
from bed_reader import open_bed

# Somehow, get your AWS credentials
import configparser, os  
config = configparser.ConfigParser()  
_ = config.read(os.path.expanduser("~/.aws/credentials"))  

# Create a dictionary with your AWS credentials and the AWS region.
cloud_options = {  
    "aws_access_key_id": config["default"].get("aws_access_key_id"),  
    "aws_secret_access_key": config["default"].get("aws_secret_access_key"),  
    "aws_region": "us-west-2"}  

# Open the bed file with a URL and any needed cloud options, then use as before.
with open_bed("s3://bedreader/v1/toydata.5chrom.bed", cloud_options=cloud_options) as bed:  
    val = bed.read(np.s_[:10, :10])  
val
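If the credentials live in environment variables rather than in ~/.aws/credentials, the same dictionary could be built as in the sketch below. The helper name `cloud_options_from_env` is hypothetical (it is not part of bed-reader); the dictionary keys are the ones used in the sample above, and the default region is simply the one from that sample.

```python
import os


def cloud_options_from_env(env=os.environ, default_region="us-west-2"):
    """Build a cloud_options dict from the standard AWS environment variables."""
    required = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
    missing = [name for name in required if name not in env]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {
        "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
        # Fall back to the default when AWS_REGION is not set.
        "aws_region": env.get("AWS_REGION", default_region),
    }


# Demonstration with a fake environment; these are placeholders, not real credentials.
example_options = cloud_options_from_env(
    {"AWS_ACCESS_KEY_ID": "EXAMPLE_KEY_ID", "AWS_SECRET_ACCESS_KEY": "example-secret"}
)
print(sorted(example_options))
```

The resulting dictionary would then be passed as `cloud_options=` exactly as in the sample usage.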

@hammer
Author

hammer commented Jan 11, 2024

Thanks Carl, I'll check it out! To your point about the state of cloud/async being more primitive than expected, we've found the same thing. As one example, the primary AWS Python library does not support asyncio: boto/botocore#458.

@CarlKCarlK
Contributor

I'm very excited because bed-reader can now efficiently read from regular web servers. This lets us, for example, read a SNP--almost instantly--directly from the S-BSST936 website!

import numpy as np
from bed_reader import open_bed
with open_bed(
    "https://www.ebi.ac.uk/biostudies/files/S-BSST936/genotypes/synthetic_v1_chr-10.bed",
    skip_format_check=True,
    iid_count=1_008_000,
    sid_count=361_561,
    ) as bed:
    val = bed.read(index=np.s_[:, 100_000], dtype=np.float32)
    np.mean(val) 
# outputs 0.033913...

More examples here: https://fastlmm.github.io/bed-reader/1.0.1beta/cloud_urls.html
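Some back-of-the-envelope arithmetic on why a single-SNP read can be so quick. In a SNP-major PLINK .bed file, a 3-byte header is followed by one fixed-size record per variant, with 2 bits per sample. The bytes for one variant are therefore a small contiguous range at a predictable offset. That the library fetches only this range over HTTP is an assumption here, and `variant_byte_range` is a hypothetical helper, not a bed-reader function.

```python
BED_HEADER_BYTES = 3  # two magic-number bytes plus one mode byte


def variant_byte_range(variant_index, iid_count):
    """Return (offset, length) in bytes of one variant in a SNP-major .bed file."""
    bytes_per_variant = (iid_count + 3) // 4  # 4 samples per byte, rounded up
    offset = BED_HEADER_BYTES + variant_index * bytes_per_variant
    return offset, bytes_per_variant


# Values from the example above: variant 100_000 of a file with 1_008_000 samples.
offset, length = variant_byte_range(100_000, 1_008_000)
print(offset, length)  # 25200000003 252000
```

With the counts given above (361_561 variants), the whole file is about 91 GB, while the one variant of interest occupies roughly 252 KB of it.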

Please install with
pip install bed-reader[samples,sparse]==1.0.1b2

The full docs are here: https://fastlmm.github.io/bed-reader/1.0.1beta/

  • Carl

@hammer
Author

hammer commented Jan 22, 2024

Thanks Carl! I'm on vacation this week but am excited to try this out next week when I'm back.

@CarlKCarlK
Contributor

Support for cloud files is now released as version 1.02 on PyPI. Thanks for your suggestion!

(I also used this as an excuse to write an article about adding this feature to the Rust side of the code. It will hopefully soon be published on https://medium.com/@carlmkadie)
