Nine Rules for Accessing Cloud Files from Your Rust Code
Rule 1: Use crate object_store (and, perhaps, cloud-file) to sequentially read the bytes of a cloud file.
Rule 2: Sequentially read text lines from cloud files via two nested loops.
Rule 3: Randomly access cloud files, even giant ones, with range methods, while respecting server-imposed limits.
Rule 4: Use URL strings and option strings to access HTTP, Local Files, AWS S3, Azure, and Google Cloud.
Rule 5: Test via tokio::test on http and local files. Rule 6: For maximum performance, add cloud-file support to your Rust library via an async API. Rule 7: Alternatively, for convenience, add cloud-file support to your Rust library via a traditional (“synchronous”) API. Rule 8: Follow the rules of good API design in part by using hidden lines in your doc tests.
Rule 9: Include a runtime, but optionally.
Conclusion

Practical lessons from upgrading Bed-Reader, a bioinformatics library

Towards Data Science
Rust and Python reading DNA data directly from the cloud — Source: https://openai.com/dall-e-2/. All other figures from the author.

Would you like your Rust program to seamlessly access data from files in the cloud? When I refer to “files in the cloud,” I mean data housed on web servers or inside cloud storage solutions like AWS S3, Azure Blob Storage, or Google Cloud Storage. The term “read”, here, encompasses both the sequential retrieval of file contents — be they text or binary, from beginning to end — and the ability to pinpoint and extract specific sections of the file as needed.

Upgrading your program to access cloud files can reduce annoyance and complication: the annoyance of downloading to local storage and the complication of periodically checking that a local copy is up to date.

Sadly, upgrading your program to access cloud files may also increase annoyance and complication: the annoyance of URLs and credential information, and the complication of asynchronous programming.

Bed-Reader is a Python package and Rust crate for reading PLINK Bed Files, a binary format used in bioinformatics to store genotype (DNA) data. At a user’s request, I recently updated Bed-Reader to optionally read data directly from cloud storage. Along the way, I learned nine rules that can help you add cloud-file support to your programs. The rules are:

  1. Use crate object_store (and, perhaps, cloud-file) to sequentially read the bytes of a cloud file.
  2. Sequentially read text lines from cloud files via two nested loops.
  3. Randomly access cloud files, even giant ones, with “range” methods, while respecting server-imposed limits.
  4. Use URL strings and option strings to access HTTP, Local Files, AWS S3, Azure, and Google Cloud.
  5. Test via tokio::test on http and local files.

If other programs call your program — in other words, if your program offers an API (application programming interface) — four additional rules apply:

6. For maximum performance, add cloud-file support to your Rust library via an async API.

7. Alternatively, for convenience, add cloud-file support to your Rust library via a traditional (“synchronous”) API.

8. Follow the rules of good API design in part by using hidden lines in your doc tests.

9. Include a runtime, but optionally.

Aside: To avoid wishy-washiness, I call these “rules”, but they are, of course, just suggestions.

The powerful object_store crate provides full content access to files stored on http, AWS S3, Azure, Google Cloud, and local files. It is part of the Apache Arrow project and has over 2.4 million downloads.

For this article, I also created a new crate called cloud-file. It simplifies the use of the object_store crate. It wraps and focuses on a useful subset of object_store’s features. You can either use it directly, or pull out its code for your own use.

Let’s look at an example. We’ll count the lines of a cloud file by counting the number of newline characters it contains.

use cloud_file::{CloudFile, CloudFileError};
use futures_util::StreamExt; // Enables `.next()` on streams.

async fn count_lines(cloud_file: &CloudFile) -> Result<usize, CloudFileError> {
    let mut chunks = cloud_file.stream_chunks().await?;
    let mut newline_count: usize = 0;
    while let Some(chunk) = chunks.next().await {
        let chunk = chunk?;
        newline_count += bytecount::count(&chunk, b'\n');
    }
    Ok(newline_count)
}

#[tokio::main]
async fn main() -> Result<(), CloudFileError> {
    let url = "https://raw.githubusercontent.com/fastlmm/bed-sample-files/main/toydata.5chrom.fam";
    let options = [("timeout", "10s")];
    let cloud_file = CloudFile::new_with_options(url, options)?;
    let line_count = count_lines(&cloud_file).await?;
    println!("line_count: {line_count}");
    Ok(())
}

When we run this code, it returns:

line_count: 500

Some points of interest:

  • We use async (and, here, tokio). We’ll discuss this choice more in Rules 6 and 7.
  • We turn a URL string and string options into a CloudFile instance with CloudFile::new_with_options(url, options)?. We use ? to catch malformed URLs.
  • We create a stream of binary chunks with cloud_file.stream_chunks().await?. This is the first place that the code tries to access the cloud file. If the file doesn’t exist or we can’t open it, the ? will return an error.
  • We use chunks.next().await to retrieve the file’s next binary chunk. (Note the use futures_util::StreamExt;.) The next method returns None after all chunks have been retrieved.
  • What if there is a next chunk but also a problem retrieving it? We’ll catch any problem with let chunk = chunk?;.
  • Finally, we use the fast bytecount crate to count newline characters.
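
If you want to try this example yourself, the code above needs four crates in Cargo.toml. Here is a minimal sketch; the version numbers are assumptions, so check crates.io for current releases:

[dependencies]
cloud-file = "0.1"   # version is an assumption; check crates.io
tokio = { version = "1", features = ["full"] }
futures-util = "0.3"
bytecount = "0.6"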

In contrast with this cloud solution, think about how you would write a simple line counter for a local file. You might write this:

use std::fs::File;
use std::io::{self, BufRead, BufReader};

fn main() -> io::Result<()> {
    let path = "examples/line_counts_local.rs";
    let reader = BufReader::new(File::open(path)?);
    let mut line_count = 0;
    for line in reader.lines() {
        let _line = line?;
        line_count += 1;
    }
    println!("line_count: {line_count}");
    Ok(())
}

Between the cloud-file version and the local-file version, three differences stand out. First, we can easily read local files as text. By default, we read cloud files as binary (but see Rule 2). Second, by default, we read local files synchronously, blocking program execution until completion. On the other hand, we usually access cloud files asynchronously, allowing other parts of the program to continue running while waiting for the relatively slow network access to finish. Third, iterators such as lines() support for. However, streams such as stream_chunks() don’t, so we use while let.

I mentioned earlier that you didn’t need to use the cloud-file wrapper and that you could use the object_store crate directly. Let’s see what it looks like when we count the newlines in a cloud file using only object_store methods:

use futures_util::StreamExt; // Enables `.next()` on streams.
pub use object_store::path::Path as StorePath;
use object_store::{parse_url_opts, ObjectStore};
use std::sync::Arc;
use url::Url;

async fn count_lines(
    object_store: &Arc<Box<dyn ObjectStore>>,
    store_path: StorePath,
) -> Result<usize, anyhow::Error> {
    let mut chunks = object_store.get(&store_path).await?.into_stream();
    let mut newline_count: usize = 0;
    while let Some(chunk) = chunks.next().await {
        let chunk = chunk?;
        newline_count += bytecount::count(&chunk, b'\n');
    }
    Ok(newline_count)
}

#[tokio::main]
async fn main() -> Result<(), anyhow::Error> {
    let url = "https://raw.githubusercontent.com/fastlmm/bed-sample-files/main/toydata.5chrom.fam";
    let options = [("timeout", "10s")];

    let url = Url::parse(url)?;
    let (object_store, store_path) = parse_url_opts(&url, options)?;
    let object_store = Arc::new(object_store); // enables cloning and borrowing
    let line_count = count_lines(&object_store, store_path).await?;
    println!("line_count: {line_count}");
    Ok(())
}

You’ll see the code is very similar to the cloud-file code. The differences are:

  • Instead of one CloudFile input, most methods take two inputs: an ObjectStore and a StorePath. Because ObjectStore is a non-cloneable trait, here the count_lines function specifically uses &Arc<Box<dyn ObjectStore>>. Alternatively, we could make the function generic and use &Arc<impl ObjectStore>.
  • Creating the ObjectStore instance, the StorePath instance, and the stream requires a few extra steps compared to creating a CloudFile instance and a stream.
  • Instead of dealing with one error type (namely, CloudFileError), multiple error types are possible, so we fall back to using the anyhow crate.

Whether you use object_store (with 2.4 million downloads) directly or indirectly via cloud-file (currently, with 124 downloads 😀) is up to you.

For the rest of this article, I’ll focus on cloud-file. If you want to translate a cloud-file method into pure object_store code, look up the cloud-file method’s documentation and follow the “source” link. The source is usually only a line or two.
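
For instance, here is a minimal sketch of what a thin wrapper method like stream_chunks boils down to, assuming the CloudFile struct holds the same Arc<Box<dyn ObjectStore>> and StorePath pair we used above (an illustration, not the crate’s exact source):

use std::sync::Arc;
use bytes::Bytes;
use futures_util::stream::BoxStream;
use object_store::{path::Path as StorePath, ObjectStore};

// Illustrative only: one object_store call does the real work.
async fn stream_chunks(
    object_store: &Arc<Box<dyn ObjectStore>>,
    store_path: &StorePath,
) -> Result<BoxStream<'static, object_store::Result<Bytes>>, object_store::Error> {
    Ok(object_store.get(store_path).await?.into_stream())
}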

We’ve seen how to sequentially read the bytes of a cloud file. Let’s look next at sequentially reading its lines.

We often want to sequentially read the lines of a cloud file. Doing that with cloud-file (or object_store) requires two nested loops.

The outer loop yields binary chunks, as before, but with a key difference: we now ensure that each chunk only contains complete lines, starting from the first character of a line and ending with a newline character. In other words, chunks may consist of one or more complete lines but no partial lines. The inner loop turns the chunk into text and iterates over the resulting one or more lines.

In this example, given a cloud file and a number n, we find the line at index position n:

use cloud_file::CloudFile;
use futures::StreamExt; // Enables `.next()` on streams.
use std::str::from_utf8;

async fn nth_line(cloud_file: &CloudFile, n: usize) -> Result<String, anyhow::Error> {
    // Each binary line_chunk contains one or more lines; that is, each chunk ends with a newline.
    let mut line_chunks = cloud_file.stream_line_chunks().await?;
    let mut index_iter = 0usize..;
    while let Some(line_chunk) = line_chunks.next().await {
        let line_chunk = line_chunk?;
        let lines = from_utf8(&line_chunk)?.lines();
        for line in lines {
            let index = index_iter.next().unwrap(); // safe because we know the iterator is infinite
            if index == n {
                return Ok(line.to_string());
            }
        }
    }
    Err(anyhow::anyhow!("Not enough lines in the file"))
}

#[tokio::main]
async fn main() -> Result<(), anyhow::Error> {
    let url = "https://raw.githubusercontent.com/fastlmm/bed-sample-files/main/toydata.5chrom.fam";
    let n = 4;

    let cloud_file = CloudFile::new(url)?;
    let line = nth_line(&cloud_file, n).await?;
    println!("line at index {n}: {line}");
    Ok(())
}

The code prints:

line at index 4: per4 per4 0 0 2 0.452591

Some points of interest:

  • The key method is .stream_line_chunks().
  • We must also call std::str::from_utf8 to create text. (Possibly returning a Utf8Error.) Also, we call the .lines() method to create an iterator of lines.
  • If we want a line index, we must make it ourselves. Here we use:
let mut index_iter = 0usize..;
...
let index = index_iter.next().unwrap(); // safe because we know the iterator is infinite

Aside: Why two loops? Why doesn’t cloud-file define a new stream that returns one line at a time? Because I don’t know how. If anyone can figure it out, please send me a pull request with the solution!

I wish this were simpler. I’m happy it’s efficient. Let’s return to simplicity by looking next at randomly accessing cloud files.

I work with a genomics file format called PLINK Bed 1.9. Files can be as large as 1 TB. Too big for web access? Not necessarily. We sometimes only need a fraction of the file. Moreover, modern cloud services (including most web servers) can efficiently retrieve regions of interest from a cloud file.

Let’s look at an example. This test code uses a CloudFile method called read_range_and_file_size. It reads a *.bed file’s first 3 bytes, checks that the file starts with the expected bytes, and then checks for the expected length.

#[tokio::test]
async fn check_file_signature() -> Result<(), CloudFileError> {
    let url = "https://raw.githubusercontent.com/fastlmm/bed-sample-files/main/plink_sim_10s_100v_10pmiss.bed";
    let cloud_file = CloudFile::new(url)?;
    let (bytes, size) = cloud_file.read_range_and_file_size(0..3).await?;

    assert_eq!(bytes.len(), 3);
    assert_eq!(bytes[0], 0x6c);
    assert_eq!(bytes[1], 0x1b);
    assert_eq!(bytes[2], 0x01);
    assert_eq!(size, 303);
    Ok(())
}

Notice that in a single web call, this method returns not only the bytes requested, but also the size of the whole file.
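
For an HTTP server, this is possible because a range request’s response reports the total size alongside the requested bytes. Roughly, the exchange looks like this (an illustration of the headers involved, not a captured trace):

GET /fastlmm/bed-sample-files/main/plink_sim_10s_100v_10pmiss.bed HTTP/1.1
Host: raw.githubusercontent.com
Range: bytes=0-2

HTTP/1.1 206 Partial Content
Content-Range: bytes 0-2/303
Content-Length: 3

The “/303” at the end of Content-Range is the total file size that read_range_and_file_size hands back along with the bytes.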

The high-level CloudFile methods each retrieve what they need in a single web call. The ones used in this article are read_file_size, read_range_and_file_size, read_ranges, stream_chunks, and stream_line_chunks.

These methods can run into two problems if we ask for too much data at a time. First, our cloud service may limit the number of bytes we can retrieve in one call. Second, we may get faster results by making multiple simultaneous requests rather than just one at a time.

Consider this example: We want to gather statistics on the frequency of adjacent ASCII characters in a file of any size. For example, in a random sample of 10,000 adjacent character pairs, perhaps “th” appears 171 times.

Suppose our web server is happy with 10 concurrent requests but only wants us to retrieve 750 bytes per call. (8 MB would be a more typical limit.)

Thanks to Ben Lichtman (B3NNY) at the Seattle Rust Meetup for pointing me in the right direction on adding limits to async streams.

Our main function could look like this:

#[tokio::main]
async fn main() -> Result<(), anyhow::Error> {
    let url = "https://www.gutenberg.org/cache/epub/100/pg100.txt";
    let options = [("timeout", "30s")];
    let cloud_file = CloudFile::new_with_options(url, options)?;

    let seed = Some(0u64);
    let sample_count = 10_000;
    let max_chunk_bytes = 750; // 8_000_000 is a good default when chunks are bigger.
    let max_concurrent_requests = 10; // 10 is a good default

    count_bigrams(
        cloud_file,
        sample_count,
        seed,
        max_concurrent_requests,
        max_chunk_bytes,
    )
    .await?;

    Ok(())
}

The count_bigrams function can start by creating a random number generator and making a call to find the size of the cloud file:

#[cfg(not(target_pointer_width = "64"))]
compile_error!("This code requires a 64-bit target architecture.");

use cloud_file::CloudFile;
use futures::pin_mut;
use futures_util::StreamExt; // Enables `.next()` on streams.
use rand::{rngs::StdRng, Rng, SeedableRng};
use std::{cmp::max, collections::HashMap, ops::Range};

async fn count_bigrams(
    cloud_file: CloudFile,
    sample_count: usize,
    seed: Option<u64>,
    max_concurrent_requests: usize,
    max_chunk_bytes: usize,
) -> Result<(), anyhow::Error> {
    // Create a random number generator
    let mut rng = if let Some(s) = seed {
        StdRng::seed_from_u64(s)
    } else {
        StdRng::from_entropy()
    };

    // Find the document size
    let file_size = cloud_file.read_file_size().await?;
//...

Next, based on the file size, the function can create a vector of 10,000 random two-byte ranges.

    // Randomly select the two-byte ranges to sample
    let range_samples: Vec<Range<usize>> = (0..sample_count)
        .map(|_| rng.gen_range(0..file_size - 1))
        .map(|start| start..start + 2)
        .collect();

For example, it might produce the vector [4122418..4122420, 4361192..4361194, 145726..145728, …]. But retrieving 20,000 bytes at once (we’re pretending) is too much. So, we divide the vector into 27 chunks of no more than 750 bytes each:

    // Divide the ranges into chunks respecting the max_chunk_bytes limit
    const BYTES_PER_BIGRAM: usize = 2;
    let chunk_count = max(1, max_chunk_bytes / BYTES_PER_BIGRAM);
    let range_chunks = range_samples.chunks(chunk_count);

Using a little async magic, we create an iterator of future work for each of the 27 chunks and then we turn that iterator into a stream. We tell the stream to do up to 10 simultaneous calls. Also, we say that out-of-order results are fine.

    // Create an iterator of future work
    let work_chunks_iterator = range_chunks.map(|chunk| {
        let cloud_file = cloud_file.clone(); // by design, clone is cheap
        async move { cloud_file.read_ranges(chunk).await }
    });

    // Create a stream of futures to run out-of-order and with constrained concurrency.
    let work_chunks_stream =
        futures_util::stream::iter(work_chunks_iterator).buffer_unordered(max_concurrent_requests);
    pin_mut!(work_chunks_stream); // The compiler says we need this

In the last section of code, we first do the work in the stream and — as we get results — tabulate. Finally, we sort and print the top results.

    // Run the futures and, as result bytes come in, tabulate.
    let mut bigram_counts = HashMap::new();
    while let Some(result) = work_chunks_stream.next().await {
        let bytes_vec = result?;
        for bytes in bytes_vec.iter() {
            let bigram = (bytes[0], bytes[1]);
            let count = bigram_counts.entry(bigram).or_insert(0);
            *count += 1;
        }
    }

    // Sort the bigrams by count and print the top 10
    let mut bigram_count_vec: Vec<(_, usize)> = bigram_counts.into_iter().collect();
    bigram_count_vec.sort_by(|a, b| b.1.cmp(&a.1));
    for (bigram, count) in bigram_count_vec.into_iter().take(10) {
        let char0 = (bigram.0 as char).escape_default();
        let char1 = (bigram.1 as char).escape_default();
        println!("Bigram ('{}{}') occurs {} times", char0, char1, count);
    }
    Ok(())
}

The output is:

Bigram ('\r\n') occurs 367 times
Bigram ('e ') occurs 221 times
Bigram (' t') occurs 184 times
Bigram ('th') occurs 171 times
Bigram ('he') occurs 158 times
Bigram ('s ') occurs 143 times
Bigram ('.\r') occurs 136 times
Bigram ('d ') occurs 133 times
Bigram (', ') occurs 127 times
Bigram (' a') occurs 121 times

The code for the Bed-Reader genomics crate uses the same technique to retrieve information from scattered DNA regions of interest. As the DNA information comes in, perhaps out of order, the code fills in the correct columns of an output array.
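
As a hypothetical sketch of that pattern — not Bed-Reader’s actual code — each future can carry its own column index, so results that complete out of order still land in the right column. The read_columns name, the one-range-per-column layout, and the use of read_ranges on a single range are all illustrative assumptions:

use cloud_file::{CloudFile, CloudFileError};
use futures_util::{pin_mut, StreamExt};
use std::ops::Range;

// Hypothetical sketch, not Bed-Reader's actual code: read one byte range per
// column of interest with bounded concurrency, returning (column index, bytes)
// pairs so the caller can fill the right columns even when results arrive
// out of order.
async fn read_columns(
    cloud_file: &CloudFile,
    ranges: &[Range<usize>], // one byte range per column of interest (assumption)
    max_concurrent_requests: usize,
) -> Result<Vec<(usize, bytes::Bytes)>, CloudFileError> {
    let work_iter = ranges.iter().cloned().enumerate().map(|(col, range)| {
        let cloud_file = cloud_file.clone(); // by design, clone is cheap
        async move {
            let mut bytes_vec = cloud_file.read_ranges(&[range]).await?;
            Ok::<_, CloudFileError>((col, bytes_vec.pop().expect("one range requested")))
        }
    });
    let stream = futures_util::stream::iter(work_iter).buffer_unordered(max_concurrent_requests);
    pin_mut!(stream);

    let mut results = Vec::new();
    while let Some(result) = stream.next().await {
        results.push(result?); // (column index, raw bytes), possibly out of order
    }
    Ok(results)
}

The caller can then decode each (column, bytes) pair and write it into the matching column of the output array.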

Aside: This method uses an iterator, a stream, and a loop. I wish it were simpler. If you can figure out a simpler way to retrieve a vector of regions while limiting the maximum chunk size and the maximum number of concurrent requests, please send me a pull request.

That covers access to files stored on an HTTP server, but what about AWS S3 and other cloud services? What about local files?

The object_store crate (and the cloud-file wrapper crate) supports specifying files either via a URL string or via structs. I recommend sticking with URL strings, but the choice is yours.

Let’s consider an AWS S3 example. As you can see, AWS access requires credential information.

use cloud_file::CloudFile;
use rusoto_credential::{CredentialsError, ProfileProvider, ProvideAwsCredentials};

#[tokio::main]
async fn main() -> Result<(), anyhow::Error> {
    // get credentials from ~/.aws/credentials
    let credentials = if let Ok(provider) = ProfileProvider::new() {
        provider.credentials().await
    } else {
        Err(CredentialsError::new("No credentials found"))
    };

    let Ok(credentials) = credentials else {
        eprintln!("Skipping example because no AWS credentials found");
        return Ok(());
    };

    let url = "s3://bedreader/v1/toydata.5chrom.bed";
    let options = [
        ("aws_region", "us-west-2"),
        ("aws_access_key_id", credentials.aws_access_key_id()),
        ("aws_secret_access_key", credentials.aws_secret_access_key()),
    ];
    let cloud_file = CloudFile::new_with_options(url, options)?;

    assert_eq!(cloud_file.read_file_size().await?, 1_250_003);
    Ok(())
}

The key part is:

    let url = "s3://bedreader/v1/toydata.5chrom.bed";
    let options = [
        ("aws_region", "us-west-2"),
        ("aws_access_key_id", credentials.aws_access_key_id()),
        ("aws_secret_access_key", credentials.aws_secret_access_key()),
    ];
    let cloud_file = CloudFile::new_with_options(url, options)?;

If we want to use structs instead of URL strings, this becomes:

    use object_store::{aws::AmazonS3Builder, path::Path as StorePath};

    let s3 = AmazonS3Builder::new()
        .with_region("us-west-2")
        .with_bucket_name("bedreader")
        .with_access_key_id(credentials.aws_access_key_id())
        .with_secret_access_key(credentials.aws_secret_access_key())
        .build()?;
    let store_path = StorePath::parse("v1/toydata.5chrom.bed")?;
    let cloud_file = CloudFile::from_structs(s3, store_path);

I prefer the URL approach over structs. I find URLs slightly simpler, much more uniform across cloud services, and vastly easier for interop (with, for example, Python).

Here are example URL forms for the services:
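
The scheme determines which backend object_store uses; the bucket, account, container, and file names below are placeholders, not working examples:

  • HTTP or HTTPS web server: https://www.example.com/data/file.bed
  • Local file, written as a URL: file:///home/user/data/file.bed
  • AWS S3: s3://bucket-name/path/to/file.bed
  • Azure Blob Storage: az://account/container/path/to/file.bed
  • Google Cloud Storage: gs://bucket-name/path/to/file.bed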

Local files don’t need options. For the other services, the object_store documentation lists their supported options; the AWS example above shows a typical selection.

Now that we can specify and read cloud files, we should create tests.

The object_store crate (and cloud-file) supports any async runtime. For testing, the Tokio runtime makes it easy to test your code on cloud files. Here is a test on an http file:

#[tokio::test]
async fn cloud_file_extension() -> Result<(), CloudFileError> {
    let url = "https://raw.githubusercontent.com/fastlmm/bed-sample-files/main/plink_sim_10s_100v_10pmiss.bed";
    let mut cloud_file = CloudFile::new(url)?;
    assert_eq!(cloud_file.read_file_size().await?, 303);
    cloud_file.set_extension("fam")?;
    assert_eq!(cloud_file.read_file_size().await?, 130);
    Ok(())
}

Run this test with:

cargo test

If you don’t want to hit an outside web server with your tests, you can instead test against local files as if they were in the cloud.

#[tokio::test]
async fn local_file() -> Result<(), CloudFileError> {
    use std::env;

    let apache_url = abs_path_to_url_string(env::var("CARGO_MANIFEST_DIR").unwrap()
        + "/LICENSE-APACHE")?;
    let cloud_file = CloudFile::new(&apache_url)?;
    assert_eq!(cloud_file.read_file_size().await?, 9898);
    Ok(())
}

This uses the standard Rust environment variable CARGO_MANIFEST_DIR to find the full path to a text file. It then uses cloud_file::abs_path_to_url_string to correctly encode that full path into a URL.
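
Roughly, abs_path_to_url_string can be thought of as the following sketch built on the url crate (an illustration, not the crate’s exact source; the error type is simplified):

use std::path::Path;

// Turn an absolute path into a percent-encoded `file://` URL string.
// Illustrative sketch; the real function's error type will differ.
fn abs_path_to_url_string(path: impl AsRef<Path>) -> Result<String, String> {
    url::Url::from_file_path(path.as_ref())
        .map(|url| url.to_string())
        .map_err(|_| format!("expected an absolute path, got {:?}", path.as_ref()))
}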

Whether you test on http files or local files, the power of object_store means that your code should work on any cloud service, including AWS S3, Azure, and Google Cloud.

If you only need to access cloud files for your own use, you can stop reading the rules here and skip to the conclusion. If you are adding cloud access to a library (Rust crate) for others, keep reading.

If you offer a Rust crate to others, supporting cloud files offers great convenience to your users, but not without a cost. Let’s look at Bed-Reader, the genomics crate to which I added cloud support.

As previously mentioned, Bed-Reader is a library for reading and writing PLINK Bed Files, a binary format used in bioinformatics to store genotype (DNA) data. Files in Bed format can be as large as a terabyte. Bed-Reader gives users fast, random access to large subsets of the data. It returns a 2-D array in the user’s choice of int8, float32, or float64. Bed-Reader also gives users access to 12 pieces of metadata, six related to individuals and six related to SNPs (roughly speaking, DNA locations). The genotype data is often 100,000 times larger than the metadata.

PLINK stores genotype data and metadata. (Figure by author.)

Aside: In this context, an “API” refers to an Application Programming Interface. It is the public structs, methods, etc., provided by library code such as Bed-Reader for another program to call.

Here is some sample code using Bed-Reader’s original “local file” API. This code lists the first five individual ids, the first five SNP ids, and every unique chromosome number. It then reads every genomic value in chromosome 5:

#[test]
fn lib_intro() -> Result<(), Box<BedErrorPlus>> {
    let file_name = sample_bed_file("some_missing.bed")?;

    let mut bed = Bed::new(file_name)?;
    println!("{:?}", bed.iid()?.slice(s![..5])); // Outputs ndarray: ["iid_0", "iid_1", "iid_2", "iid_3", "iid_4"]
    println!("{:?}", bed.sid()?.slice(s![..5])); // Outputs ndarray: ["sid_0", "sid_1", "sid_2", "sid_3", "sid_4"]
    println!("{:?}", bed.chromosome()?.iter().collect::<HashSet<_>>());
    // Outputs: {"12", "10", "4", "8", "19", "21", "9", "15", "6", "16", "13", "7", "17", "18", "1", "22", "11", "2", "20", "3", "5", "14"}
    let _ = ReadOptions::builder()
        .sid_index(bed.chromosome()?.map(|elem| elem == "5"))
        .f64()
        .read(&mut bed)?;

    Ok(())
}

And here is the same code using the new cloud file API:

#[tokio::test]
async fn cloud_lib_intro() -> Result<(), Box<BedErrorPlus>> {
    let url = "https://raw.githubusercontent.com/fastlmm/bed-sample-files/main/some_missing.bed";
    let cloud_options = [("timeout", "10s")];

    let mut bed_cloud = BedCloud::new_with_options(url, cloud_options).await?;
    println!("{:?}", bed_cloud.iid().await?.slice(s![..5])); // Outputs ndarray: ["iid_0", "iid_1", "iid_2", "iid_3", "iid_4"]
    println!("{:?}", bed_cloud.sid().await?.slice(s![..5])); // Outputs ndarray: ["sid_0", "sid_1", "sid_2", "sid_3", "sid_4"]
    println!(
        "{:?}",
        bed_cloud.chromosome().await?.iter().collect::<HashSet<_>>()
    );
    // Outputs: {"12", "10", "4", "8", "19", "21", "9", "15", "6", "16", "13", "7", "17", "18", "1", "22", "11", "2", "20", "3", "5", "14"}
    let _ = ReadOptions::builder()
        .sid_index(bed_cloud.chromosome().await?.map(|elem| elem == "5"))
        .f64()
        .read_cloud(&mut bed_cloud)
        .await?;

    Ok(())
}

When switching to cloud data, a Bed-Reader user must make these changes:

  • They must run in an async environment, here #[tokio::test].
  • They must use a new struct, BedCloud, instead of Bed. (Also, not shown, BedCloudBuilder rather than BedBuilder.)
  • They give a URL string and optional string options rather than a local file path.
  • They must use .await in many, somewhat unpredictable, places. (Happily, the compiler gives a good error message if they miss a spot.)
  • The ReadOptionsBuilder gets a new method, read_cloud, to go along with its previous read method.

From the library developer’s point of view, adding the new BedCloud and BedCloudBuilder structs costs many lines of main and test code. In my case, 2,200 lines of new main code and 2,400 lines of new test code.

Aside: Also, see Mario Ortiz Manero’s article “The bane of my existence: Supporting both async and sync code in Rust”.

The benefit users get from these changes is the ability to read data from cloud files with async’s high efficiency.

Is this benefit worth it? If not, there is an alternative that we’ll look at next.

If adding an efficient async API seems like too much work for you or too confusing for your users, there is an alternative. Namely, you can offer a traditional (“synchronous”) API. I do this for the Python version of Bed-Reader and for the Rust code that supports the Python version.

Aside: See: Nine Rules for Writing Python Extensions in Rust: Practical Lessons from Upgrading Bed-Reader, a Python Bioinformatics Package in Towards Data Science.

Here is the Rust function that Python calls to check whether a *.bed file starts with the correct file signature.

use tokio::runtime;
// ...
#[pyfn(m)]
fn check_file_cloud(location: &str, options: HashMap<&str, String>) -> Result<(), PyErr> {
    runtime::Runtime::new()?.block_on(async {
        BedCloud::new_with_options(location, options).await?;
        Ok(())
    })
}

Notice that this is not an async function. It is a normal “synchronous” function. Inside this synchronous function, Rust makes an async call:

BedCloud::new_with_options(location, options).await?;

We make the async call synchronous by wrapping it in a Tokio runtime:

use tokio::runtime;
// ...

runtime::Runtime::new()?.block_on(async {
    BedCloud::new_with_options(location, options).await?;
    Ok(())
})

Bed-Reader’s Python users could previously open a local file for reading with the command open_bed(file_name_string). Now, they can also open a cloud file for reading with the same command open_bed(url_string). The only difference is the format of the string they pass in.

Here is the example from Rule 6, in Python, using the updated Python API:

with open_bed(
    "https://raw.githubusercontent.com/fastlmm/bed-sample-files/main/some_missing.bed",
    cloud_options={"timeout": "30s"},
) as bed:
    print(bed.iid[:5])
    print(bed.sid[:5])
    print(np.unique(bed.chromosome))
    val = bed.read(index=np.s_[:, bed.chromosome == "5"])
    print(val.shape)

Notice that the Python API also offers a new optional parameter called cloud_options. Also, behind the scenes, a tiny bit of new code distinguishes between strings representing local files and strings representing URLs.
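
Here is a hypothetical sketch of that dispatch — not Bed-Reader’s actual code — that treats anything with a multi-character URL scheme as a cloud location and everything else as a local path:

// Hypothetical helper: `true` for "https://...", "s3://...", etc.;
// `false` for plain paths like "/home/user/data.bed". A Windows path such as
// "C:\data\data.bed" either fails to parse or yields a one-letter "scheme",
// so it is also treated as a local file.
fn looks_like_url(location: &str) -> bool {
    match url::Url::parse(location) {
        Ok(url) => url.scheme().len() > 1,
        Err(_) => false,
    }
}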

In Rust, you can use the same trick to make calls to object_store synchronous. Specifically, you can wrap async calls in a runtime. The benefit is a simpler interface and less library code. The cost is less efficiency compared to offering an async API.
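
For example, here is a sketch of a synchronous wrapper around the async count_lines function from Rule 1 (it assumes that function is in scope):

use cloud_file::{CloudFile, CloudFileError};
use tokio::runtime::Runtime;

// Synchronous wrapper: create a runtime and block on the async call.
fn count_lines_sync(cloud_file: &CloudFile) -> Result<usize, CloudFileError> {
    Runtime::new()
        .expect("failed to create Tokio runtime")
        .block_on(count_lines(cloud_file))
}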

If you decide against the “synchronous” alternative and choose to offer an async API, you’ll discover a new problem: providing async examples in your documentation. We will look at that issue next.

All the rules from the article Nine Rules for Elegant Rust Library APIs: Practical Lessons from Porting Bed-Reader, a Bioinformatics Library, from Python to Rust in Towards Data Science apply. Of particular importance are these two:

Write good documentation to keep your design honest.
Create examples that don’t embarrass you.

These suggest that we should give examples in our documentation, but how can we do that with async methods and awaits? The trick is “hidden lines” in our doc tests. For example, here is the documentation for CloudFile::read_ranges:

/// Return the `Vec` of [`Bytes`](https://docs.rs/bytes/latest/bytes/struct.Bytes.html) from specified ranges.
///
/// # Example
/// ```
/// use cloud_file::CloudFile;
///
/// # Runtime::new().unwrap().block_on(async {
/// let url = "https://raw.githubusercontent.com/fastlmm/bed-sample-files/main/plink_sim_10s_100v_10pmiss.bim";
/// let cloud_file = CloudFile::new(url)?;
/// let bytes_vec = cloud_file.read_ranges(&[0..10, 1000..1010]).await?;
/// assert_eq!(bytes_vec.len(), 2);
/// assert_eq!(bytes_vec[0].as_ref(), b"1\t1:1:A:C\t");
/// assert_eq!(bytes_vec[1].as_ref(), b":A:C\t0.0\t4");
/// # Ok::<(), CloudFileError>(())}).unwrap();
/// # use {tokio::runtime::Runtime, cloud_file::CloudFileError};
/// ```

The doc test starts with ```. Inside the doc test, lines starting with /// # disappear from the rendered documentation.

The hidden lines, however, will still be run by cargo test.

In my library crates, I try to include a working example with every method. If such an example seems overly complex or otherwise embarrassing, I try to fix the issue by improving the API.

Notice that in this rule and the previous Rule 7, we added a runtime to the code. Unfortunately, including a runtime can easily double the size of your users’ programs, even if they don’t read files from the cloud. Making this extra size optional is the topic of Rule 9.

If you follow Rule 6 and provide async methods, your users gain the freedom to choose their own runtime. Choosing a runtime like Tokio may significantly increase their compiled program’s size. However, if they use no async methods, choosing a runtime becomes unnecessary, keeping the compiled program lean. This embodies the “zero cost principle”, where one incurs costs only for the features one uses.

On the other hand, if you follow Rule 7 and wrap async calls inside traditional, “synchronous” methods, then you must provide a runtime. This will increase the size of the resulting program. To mitigate this cost, you should make the inclusion of any runtime optional.

Bed-Reader includes a runtime under two conditions. First, when used as a Python extension. Second, when testing the async methods. To handle the first condition, we create a Cargo feature called extension-module that pulls in the optional dependencies pyo3 and tokio. Here are the relevant sections of Cargo.toml:

[features]
extension-module = ["pyo3/extension-module", "tokio/full"]
default = []

[dependencies]
#...
pyo3 = { version = "0.20.0", features = ["extension-module"], optional = true }
tokio = { version = "1.35.0", features = ["full"], optional = true }

Also, because I’m using Maturin to create a Rust extension for Python, I include this text in pyproject.toml:

[tool.maturin]
features = ["extension-module"]

I put all the Rust code related to extending Python in a file called python_modules.rs. It starts with this conditional compilation attribute:

#![cfg(feature = "extension-module")] // ignore file if feature not 'on'

This starting line ensures that the compiler includes the extension code only when needed.

With the Python extension code taken care of, we turn next to providing an optional runtime for testing our async methods. I again choose Tokio as the runtime. I put the tests for the async code in their own file called tests_api_cloud.rs. To ensure that the async tests run only when the tokio dependency feature is “on”, I start the file with this line:

#![cfg(feature = "tokio")]

As per Rule 8, we should also include examples in our documentation of the async methods. These examples also serve as “doc tests”. The doc tests need conditional compilation attributes. Below is the documentation for the method that retrieves chromosome metadata. Notice that the example includes two hidden lines that start with
/// # #[cfg(feature = "tokio")]

/// Chromosome of each SNP (variant)
/// [...]
///
/// # Example:
/// ```
/// use ndarray as nd;
/// use bed_reader::{BedCloud, ReadOptions};
/// use bed_reader::assert_eq_nan;
///
/// # #[cfg(feature = "tokio")] Runtime::new().unwrap().block_on(async {
/// let url = "https://raw.githubusercontent.com/fastlmm/bed-sample-files/main/small.bed";
/// let mut bed_cloud = BedCloud::new(url).await?;
/// let chromosome = bed_cloud.chromosome().await?;
/// println!("{chromosome:?}"); // Outputs ndarray ["1", "1", "5", "Y"]
/// # Ok::<(), Box<BedErrorPlus>>(())}).unwrap();
/// # #[cfg(feature = "tokio")] use {tokio::runtime::Runtime, bed_reader::BedErrorPlus};
/// ```

In this doc test, when the tokio feature is ‘on’, the example uses tokio and runs four lines of code inside a Tokio runtime. When the tokio feature is ‘off’, the code within the #[cfg(feature = "tokio")] block disappears, effectively skipping the asynchronous operations.

When formatting the documentation, Rust includes documentation for all features by default, so we see the four lines of code in the rendered docs.

To summarize Rule 9: By using Cargo features and conditional compilation, we can ensure that users only pay for the features that they use.

So, there you have it: nine rules for reading cloud files in your Rust program. Thanks to the power of the object_store crate, your programs can move beyond your local drive and load data from the web, AWS S3, Azure, and Google Cloud. To make this a little simpler, you can also use the new cloud-file wrapping crate that I wrote for this article.

I should also mention that this article explored only a subset of object_store’s features. In addition to what we’ve seen, the object_store crate also handles writing files and working with folders and subfolders. The cloud-file crate, on the other hand, only handles reading files. (But, hey, I’m open to pull requests.)

Should you add cloud file support to your program? It, of course, depends. Supporting cloud files offers a big convenience to your program’s users. The cost is the additional complexity of using/providing an async interface. The cost also includes the increased file size of runtimes like Tokio. On the other hand, I think the tools for adding such support are good and trying them is easy, so give it a try!

Thanks for joining me on this journey into the cloud. I hope that if you decide to support cloud files, these steps will help you do it.

Please follow Carl on Medium. I write on scientific programming in Rust and Python, machine learning, and statistics. I tend to write about one article per month.
