convert_genome
convert_genome converts direct-to-consumer (DTC) genotype exports (23andMe, AncestryDNA, etc.) into standard VCF or BCF files. The converter understands remote references, streams compressed archives, and now includes high-performance, parallel processing suitable for multi-million record datasets.
Features
- Parallel conversion pipeline powered by `rayon`, scaling across all cores.
- Thread-safe reference genome access with a shared LRU cache for rapid base lookups (sketched after this list).
- Remote reference support for `http://` and `https://` URLs with transparent decompression of `.gz` and `.zip` archives.
- Robust parsing with property-based and integration tests covering malformed input, missing fields, and concurrent access.
- Benchmark suite (Criterion) to track performance of parsing, reference lookups, and pipeline throughput.
- Comprehensive CI/CD across Linux, macOS, and Windows with formatting, linting, testing, coverage, and benchmarks.
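As a rough illustration of the shared reference cache mentioned above, the sketch below shows one way a thread-safe LRU cache for reference bases could be structured. It assumes the `lru` crate; `ReferenceCache` and `lookup` are illustrative names, not this crate's actual API.

```rust
use std::num::NonZeroUsize;
use std::sync::Mutex;

use lru::LruCache;

/// Key: (contig name, 0-based position). Value: the reference base at that site.
type Key = (String, u64);

/// A shared, thread-safe LRU cache in the spirit of the feature list above.
pub struct ReferenceCache {
    inner: Mutex<LruCache<Key, u8>>,
}

impl ReferenceCache {
    /// The Performance section mentions a cache sized for 128k entries.
    pub fn new() -> Self {
        Self {
            inner: Mutex::new(LruCache::new(NonZeroUsize::new(128 * 1024).unwrap())),
        }
    }

    /// Return the cached base, or compute it with `fetch` (e.g. a FASTA read) and cache it.
    pub fn lookup(&self, contig: &str, pos: u64, fetch: impl FnOnce() -> u8) -> u8 {
        let mut cache = self.inner.lock().unwrap();
        if let Some(&base) = cache.get(&(contig.to_string(), pos)) {
            return base;
        }
        let base = fetch();
        cache.put((contig.to_string(), pos), base);
        base
    }
}
```

A single `Mutex` keeps the example simple; under heavy contention a real implementation might shard the cache or batch lookups per worker.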
Installation
The project targets Rust nightly (see rust-toolchain.toml). Install the converter directly from the repository:
cargo install --path .
Alternatively, build the binary without installing:
cargo build --release
The resulting executable lives at target/release/convert_genome.
Usage
The CLI accepts both local files and remote resources. A minimal invocation converts a DTC file to VCF:
convert_genome \
--input data/genotypes.txt \
--reference GRCh38.fa \
--output genotypes.vcf \
--sample SAMPLE_ID
Generate a BCF with explicit assembly metadata and keep homozygous reference calls:
convert_genome \
--input https://example.org/sample.txt.gz \
--reference https://example.org/GRCh38.fa.gz \
--output sample.bcf \
--output-format bcf \
--assembly GRCh38 \
--sample SAMPLE_ID \
--include-reference-sites
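For orientation, here is a rough sketch of how the options shown above could be declared with `clap`'s derive API. It simply mirrors the flags documented in this README and is not the crate's actual `src/cli.rs`.

```rust
use clap::Parser;

/// Hypothetical argument struct mirroring the documented flags.
#[derive(Parser)]
#[command(name = "convert_genome")]
struct Args {
    /// Local path or http(s) URL of the DTC export.
    #[arg(long)]
    input: String,
    /// Local path or http(s) URL of the reference FASTA.
    #[arg(long)]
    reference: String,
    /// Output VCF/BCF path.
    #[arg(long)]
    output: String,
    /// Sample name written into the output header.
    #[arg(long)]
    sample: String,
    /// Output format: vcf (default) or bcf.
    #[arg(long, default_value = "vcf")]
    output_format: String,
    /// Assembly name recorded in the header, e.g. GRCh38.
    #[arg(long)]
    assembly: Option<String>,
    /// Keep homozygous reference calls in the output.
    #[arg(long)]
    include_reference_sites: bool,
}

fn main() {
    let args = Args::parse();
    println!("converting {} -> {}", args.input, args.output);
}
```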
If a `.fai` index is not provided, the converter generates one next to the FASTA automatically.
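To make that indexing step concrete, the sketch below derives the five `.fai` columns (NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH) from an uncompressed FASTA using only the standard library. `write_fai` is a hypothetical helper; the converter's actual implementation may differ.

```rust
use std::fs::File;
use std::io::{BufRead, BufReader, Write};
use std::path::Path;

/// Minimal sketch: write a samtools-style `.fai` next to an uncompressed FASTA.
fn write_fai(fasta: &Path) -> std::io::Result<()> {
    let reader = BufReader::new(File::open(fasta)?);
    let mut out = File::create(format!("{}.fai", fasta.display()))?;

    // (name, length, offset of first base, bases per line, bytes per line)
    let mut current: Option<(String, u64, u64, u64, u64)> = None;
    let mut offset: u64 = 0;

    for line in reader.lines() {
        let line = line?;
        let bytes = line.len() as u64 + 1; // assumes '\n' line endings
        if let Some(header) = line.strip_prefix('>') {
            if let Some((n, len, off, lb, lw)) = current.take() {
                writeln!(out, "{n}\t{len}\t{off}\t{lb}\t{lw}")?;
            }
            let name = header.split_whitespace().next().unwrap_or("").to_string();
            current = Some((name, 0, offset + bytes, 0, 0));
        } else if let Some(rec) = current.as_mut() {
            if rec.3 == 0 {
                rec.3 = line.len() as u64; // LINEBASES from the first sequence line
                rec.4 = bytes;             // LINEWIDTH includes the newline
            }
            rec.1 += line.len() as u64;
        }
        offset += bytes;
    }
    if let Some((n, len, off, lb, lw)) = current {
        writeln!(out, "{n}\t{len}\t{off}\t{lb}\t{lw}")?;
    }
    Ok(())
}
```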
Performance
Reference lookups use a shared, thread-safe LRU cache sized for 128k entries, dramatically reducing random I/O. The conversion pipeline collects DTC records, sorts them for cache locality, and processes them in parallel; results are written sequentially to keep deterministic ordering.
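The following sketch captures that shape with `rayon`: sort once for locality, convert in parallel, then write in order. `DtcRecord`, `VcfLine`, and `convert_record` are stand-ins, not the crate's real types.

```rust
use std::io::Write;

use rayon::prelude::*;

/// Hypothetical record and output types standing in for the converter's real ones.
struct DtcRecord {
    chrom: String,
    pos: u64,
}
struct VcfLine(String);

fn convert_record(rec: &DtcRecord) -> VcfLine {
    // Placeholder for reference lookup + genotype translation.
    VcfLine(format!("{}\t{}\t.\t.\t.\t.\t.\t.", rec.chrom, rec.pos))
}

/// Sort for cache locality, convert in parallel, then emit sequentially so
/// output order stays deterministic.
fn convert_all(mut records: Vec<DtcRecord>, out: &mut impl Write) -> std::io::Result<()> {
    records.sort_by(|a, b| (&a.chrom, a.pos).cmp(&(&b.chrom, b.pos)));

    // `par_iter().map().collect()` preserves input order in the output Vec.
    let lines: Vec<VcfLine> = records.par_iter().map(convert_record).collect();

    for VcfLine(line) in &lines {
        writeln!(out, "{line}")?;
    }
    Ok(())
}
```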
The Criterion benchmarks can be executed with:
cargo bench
Benchmarks cover:
- Cached vs. uncached reference lookups.
- DTC parsing throughput.
- Full conversion pipeline comparisons (parallel vs. single-threaded execution).
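A minimal Criterion benchmark in that shape might look like the sketch below; `parse_dtc_line` is a hypothetical stand-in for the crate's parser, and the file would be registered as a `[[bench]]` target with `harness = false` in Cargo.toml.

```rust
// benches/parsing.rs (sketch)
use std::hint::black_box;

use criterion::{criterion_group, criterion_main, Criterion};

/// Hypothetical parser: rsid, chromosome, position, genotype separated by tabs.
fn parse_dtc_line(line: &str) -> Option<(String, String, u64, String)> {
    let mut f = line.split('\t');
    Some((
        f.next()?.to_string(),
        f.next()?.to_string(),
        f.next()?.parse().ok()?,
        f.next()?.to_string(),
    ))
}

fn bench_parsing(c: &mut Criterion) {
    let line = "rs123\t1\t752566\tAG";
    c.bench_function("parse_dtc_line", |b| b.iter(|| parse_dtc_line(black_box(line))));
}

criterion_group!(benches, bench_parsing);
criterion_main!(benches);
```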
Testing
Unit, integration, and property-based tests ensure correctness across a wide surface area:
cargo test # Debug builds, all tests
cargo test --release # Property tests under release optimizations
Ignored integration tests in tests/remote_download.rs exercise real-world genome downloads; run them manually as needed.
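For a flavour of the property-based style, the sketch below uses `proptest` to check a round-trip property over well-formed lines; the inline `parse_dtc_line` is a hypothetical stand-in for the crate's real parser.

```rust
use proptest::prelude::*;

/// Hypothetical parser standing in for the crate's real DTC line parser.
fn parse_dtc_line(line: &str) -> Option<(String, String, u64, String)> {
    let mut f = line.split('\t');
    Some((
        f.next()?.to_string(),
        f.next()?.to_string(),
        f.next()?.parse().ok()?,
        f.next()?.to_string(),
    ))
}

proptest! {
    // Property: a well-formed line always parses back to its original fields.
    #[test]
    fn parses_well_formed_lines(
        rsid in "rs[0-9]{1,8}",
        chrom in "(1[0-9]?|2[0-2]?|[3-9]|X|Y|MT)",
        pos in 1u64..=250_000_000,
        gt in "[ACGT-]{1,2}",
    ) {
        let line = format!("{rsid}\t{chrom}\t{pos}\t{gt}");
        prop_assert_eq!(parse_dtc_line(&line), Some((rsid, chrom, pos, gt)));
    }
}
```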
Continuous Integration
See .github/workflows/ci.yml. The workflow performs:
- Formatting (`cargo fmt --check`)
- Linting (`cargo clippy --all-targets -- -D warnings`)
- Cross-platform builds
- Test suites (debug + release/property)
- Benchmarks (`cargo bench --no-fail-fast`)
- Coverage reporting via `cargo tarpaulin` on Linux
Project Architecture
- `src/cli.rs` – Argument parsing and top-level command dispatch.
- `src/conversion.rs` – Conversion pipeline, header construction, and record translation.
- `src/dtc.rs` – Streaming parser for DTC genotype exports.
- `src/reference.rs` – Reference genome loader, contig metadata, and cached base access.
- `src/remote.rs` – Remote fetching with HTTP(S) support and archive extraction.
Additional resources:
- `tests/` – Integration and property-based test suites.
- `benches/` – Criterion benchmarks for core subsystems.
Contributing
- Install the nightly toolchain (`rustup toolchain install nightly`).
- Run formatting and linting before submitting: `cargo fmt` and `cargo clippy --all-targets -- -D warnings`.
- Execute the full test suite (debug + release) and benchmarks.
- For large datasets or new reference assemblies, add integration tests with representative fixtures.
Issues and pull requests are welcome! Please include benchmark results when proposing performance-sensitive changes.