convert_genome
convert_genome converts direct-to-consumer (DTC) genotype exports (23andMe, AncestryDNA, etc.) into standard VCF or BCF files. The converter understands remote references, streams compressed archives, and now includes high-performance, parallel processing suitable for multi-million record datasets.
Features
- Parallel conversion pipeline powered by `rayon`, scaling across all cores.
- Thread-safe reference genome access with a shared LRU cache for rapid base lookups (sketched after this list).
- Remote reference support for `http://` and `https://` URLs with transparent decompression of `.gz` and `.zip` archives.
- Robust parsing with property-based and integration tests covering malformed input, missing fields, and concurrent access.
- Benchmark suite (Criterion) to track performance of parsing, reference lookups, and pipeline throughput.
- Comprehensive CI/CD across Linux, macOS, and Windows with formatting, linting, testing, coverage, and benchmarks.
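As a rough illustration of the shared reference cache mentioned above, the sketch below shows one way a thread-safe LRU cache for reference bases could be structured. It assumes the `lru` crate; `ReferenceCache` and `lookup` are illustrative names, not this crate's actual API.

```rust
use std::num::NonZeroUsize;
use std::sync::Mutex;

use lru::LruCache;

/// Key: (contig name, 0-based position). Value: the reference base at that site.
type Key = (String, u64);

/// A shared, thread-safe LRU cache in the spirit of the feature list above.
pub struct ReferenceCache {
    inner: Mutex<LruCache<Key, u8>>,
}

impl ReferenceCache {
    /// The Performance section mentions a cache sized for 128k entries.
    pub fn new() -> Self {
        Self {
            inner: Mutex::new(LruCache::new(NonZeroUsize::new(128 * 1024).unwrap())),
        }
    }

    /// Return the cached base, or compute it with `fetch` (e.g. a FASTA read) and cache it.
    pub fn lookup(&self, contig: &str, pos: u64, fetch: impl FnOnce() -> u8) -> u8 {
        let mut cache = self.inner.lock().unwrap();
        if let Some(&base) = cache.get(&(contig.to_string(), pos)) {
            return base;
        }
        let base = fetch();
        cache.put((contig.to_string(), pos), base);
        base
    }
}
```

A single `Mutex` keeps the example simple; under heavy contention a real implementation might shard the cache or batch lookups per worker.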
Installation
The project targets Rust nightly (see rust-toolchain.toml). Install the converter directly from the repository:
cargo install --path .
Alternatively, build the binary without installing:
cargo build --release
The resulting executable lives at target/release/convert_genome.
Usage
The CLI accepts both local files and remote resources. A minimal invocation converts a DTC file to VCF:
convert_genome \
--input data/genotypes.txt \
--reference GRCh38.fa \
--output genotypes.vcf \
--sample SAMPLE_ID
Generate a BCF with explicit assembly metadata and keep homozygous reference calls:
convert_genome \
--input https://example.org/sample.txt.gz \
--reference https://example.org/GRCh38.fa.gz \
--output sample.bcf \
--output-format bcf \
--assembly GRCh38 \
--sample SAMPLE_ID \
--include-reference-sites
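For orientation, here is a rough sketch of how the options shown above could be declared with `clap`'s derive API. It simply mirrors the flags documented in this README and is not the crate's actual `src/cli.rs`.

```rust
use clap::Parser;

/// Hypothetical argument struct mirroring the documented flags.
#[derive(Parser)]
#[command(name = "convert_genome")]
struct Args {
    /// Local path or http(s) URL of the DTC export.
    #[arg(long)]
    input: String,
    /// Local path or http(s) URL of the reference FASTA.
    #[arg(long)]
    reference: String,
    /// Output VCF/BCF path.
    #[arg(long)]
    output: String,
    /// Sample name written into the output header.
    #[arg(long)]
    sample: String,
    /// Output format: vcf (default) or bcf.
    #[arg(long, default_value = "vcf")]
    output_format: String,
    /// Assembly name recorded in the header, e.g. GRCh38.
    #[arg(long)]
    assembly: Option<String>,
    /// Keep homozygous reference calls in the output.
    #[arg(long)]
    include_reference_sites: bool,
}

fn main() {
    let args = Args::parse();
    println!("converting {} -> {}", args.input, args.output);
}
```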
If a `.fai` index is not provided, the converter generates one next to the FASTA automatically.
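To make that indexing step concrete, the sketch below derives the five `.fai` columns (NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH) from an uncompressed FASTA using only the standard library. `write_fai` is a hypothetical helper; the converter's actual implementation may differ.

```rust
use std::fs::File;
use std::io::{BufRead, BufReader, Write};
use std::path::Path;

/// Minimal sketch: write a samtools-style `.fai` next to an uncompressed FASTA.
fn write_fai(fasta: &Path) -> std::io::Result<()> {
    let reader = BufReader::new(File::open(fasta)?);
    let mut out = File::create(format!("{}.fai", fasta.display()))?;

    // (name, length, offset of first base, bases per line, bytes per line)
    let mut current: Option<(String, u64, u64, u64, u64)> = None;
    let mut offset: u64 = 0;

    for line in reader.lines() {
        let line = line?;
        let bytes = line.len() as u64 + 1; // assumes '\n' line endings
        if let Some(header) = line.strip_prefix('>') {
            if let Some((n, len, off, lb, lw)) = current.take() {
                writeln!(out, "{n}\t{len}\t{off}\t{lb}\t{lw}")?;
            }
            let name = header.split_whitespace().next().unwrap_or("").to_string();
            current = Some((name, 0, offset + bytes, 0, 0));
        } else if let Some(rec) = current.as_mut() {
            if rec.3 == 0 {
                rec.3 = line.len() as u64; // LINEBASES from the first sequence line
                rec.4 = bytes;             // LINEWIDTH includes the newline
            }
            rec.1 += line.len() as u64;
        }
        offset += bytes;
    }
    if let Some((n, len, off, lb, lw)) = current {
        writeln!(out, "{n}\t{len}\t{off}\t{lb}\t{lw}")?;
    }
    Ok(())
}
```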
Performance
Reference lookups use a shared, thread-safe LRU cache sized for 128k entries, dramatically reducing random I/O. The conversion pipeline collects DTC records, sorts them for cache locality, and processes them in parallel; results are written sequentially to keep deterministic ordering.
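The following sketch captures that shape with `rayon`: sort once for locality, convert in parallel, then write in order. `DtcRecord`, `VcfLine`, and `convert_record` are stand-ins, not the crate's real types.

```rust
use std::io::Write;

use rayon::prelude::*;

/// Hypothetical record and output types standing in for the converter's real ones.
struct DtcRecord {
    chrom: String,
    pos: u64,
}
struct VcfLine(String);

fn convert_record(rec: &DtcRecord) -> VcfLine {
    // Placeholder for reference lookup + genotype translation.
    VcfLine(format!("{}\t{}\t.\t.\t.\t.\t.\t.", rec.chrom, rec.pos))
}

/// Sort for cache locality, convert in parallel, then emit sequentially so
/// output order stays deterministic.
fn convert_all(mut records: Vec<DtcRecord>, out: &mut impl Write) -> std::io::Result<()> {
    records.sort_by(|a, b| (&a.chrom, a.pos).cmp(&(&b.chrom, b.pos)));

    // `par_iter().map().collect()` preserves input order in the output Vec.
    let lines: Vec<VcfLine> = records.par_iter().map(convert_record).collect();

    for VcfLine(line) in &lines {
        writeln!(out, "{line}")?;
    }
    Ok(())
}
```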
The Criterion benchmarks can be executed with:
cargo bench
Benchmarks cover:
- Cached vs. uncached reference lookups.
- DTC parsing throughput.
- Full conversion pipeline comparisons (parallel vs. single-threaded execution).
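A minimal Criterion benchmark in that shape might look like the sketch below; `parse_dtc_line` is a hypothetical stand-in for the crate's parser, and the file would be registered as a `[[bench]]` target with `harness = false` in Cargo.toml.

```rust
// benches/parsing.rs (sketch)
use std::hint::black_box;

use criterion::{criterion_group, criterion_main, Criterion};

/// Hypothetical parser: rsid, chromosome, position, genotype separated by tabs.
fn parse_dtc_line(line: &str) -> Option<(String, String, u64, String)> {
    let mut f = line.split('\t');
    Some((
        f.next()?.to_string(),
        f.next()?.to_string(),
        f.next()?.parse().ok()?,
        f.next()?.to_string(),
    ))
}

fn bench_parsing(c: &mut Criterion) {
    let line = "rs123\t1\t752566\tAG";
    c.bench_function("parse_dtc_line", |b| b.iter(|| parse_dtc_line(black_box(line))));
}

criterion_group!(benches, bench_parsing);
criterion_main!(benches);
```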
Testing
Unit, integration, and property-based tests ensure correctness across a wide surface area:
cargo test # Debug builds, all tests
cargo test --release # Property tests under release optimizations
Ignored integration tests in tests/remote_download.rs exercise real-world genome downloads; run them manually as needed.
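For a flavour of the property-based style, the sketch below uses `proptest` to check a round-trip property over well-formed lines; the inline `parse_dtc_line` is a hypothetical stand-in for the crate's real parser.

```rust
use proptest::prelude::*;

/// Hypothetical parser standing in for the crate's real DTC line parser.
fn parse_dtc_line(line: &str) -> Option<(String, String, u64, String)> {
    let mut f = line.split('\t');
    Some((
        f.next()?.to_string(),
        f.next()?.to_string(),
        f.next()?.parse().ok()?,
        f.next()?.to_string(),
    ))
}

proptest! {
    // Property: a well-formed line always parses back to its original fields.
    #[test]
    fn parses_well_formed_lines(
        rsid in "rs[0-9]{1,8}",
        chrom in "(1[0-9]?|2[0-2]?|[3-9]|X|Y|MT)",
        pos in 1u64..=250_000_000,
        gt in "[ACGT-]{1,2}",
    ) {
        let line = format!("{rsid}\t{chrom}\t{pos}\t{gt}");
        prop_assert_eq!(parse_dtc_line(&line), Some((rsid, chrom, pos, gt)));
    }
}
```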
Continuous Integration
See .github/workflows/ci.yml. The workflow performs:
- Formatting (`cargo fmt --check`)
- Linting (`cargo clippy --all-targets -- -D warnings`)
- Cross-platform builds
- Test suites (debug + release/property)
- Benchmarks (`cargo bench --no-fail-fast`)
- Coverage reporting via `cargo tarpaulin` on Linux
Project Architecture
- `src/cli.rs` – Argument parsing and top-level command dispatch.
- `src/conversion.rs` – Conversion pipeline, header construction, and record translation.
- `src/dtc.rs` – Streaming parser for DTC genotype exports.
- `src/reference.rs` – Reference genome loader, contig metadata, and cached base access.
- `src/remote.rs` – Remote fetching with HTTP(S) support and archive extraction.
Additional resources:
- `tests/` – Integration and property-based test suites.
- `benches/` – Criterion benchmarks for core subsystems.
Contributing
- Install the nightly toolchain (`rustup toolchain install nightly`).
- Run formatting and linting before submitting: `cargo fmt` and `cargo clippy --all-targets -- -D warnings`.
- Execute the full test suite (debug + release) and benchmarks.
- For large datasets or new reference assemblies, add integration tests with representative fixtures.
Issues and pull requests are welcome! Please include benchmark results when proposing performance-sensitive changes.