Skip to content

Commit 0f8e9e4

Browse files
committed
feat: introduces support for specific reference genomes
Previously, the tool simply had a hardcoded set of PRIMARY_CHROMOSOMES that were hardcoded to the hg38 primary chromosomes. Now, the tool has a supported set of reference genomes, namely (to start): * GRCh38NoAlt (from the NCBI) * hs37d5 (from the 1000 Genomes Project) These two genomes were selected simply because (a) GRCh38NoAlt is probably the most popular GRCh38 genome and (b) hs37d5 is the genome used for phase 2 and phase 3 of the 1000 Genomes project: a fairly popular publicly available resource and the subject of many QC papers. Introducing a reference genome into the code required multiple QC facets to be updated to use this functionality. For each of these, I chose to simply pass the reference genome to the initialization function for the facet: it's up to the facet to take what it needs from the reference genome and store it for later use (as opposed to adding a lifecycle hook injecting it). Other notable, related changes: * I include now a check at the beginning of the `qc` command to ensure that the sequences in the header of the file match the reference genome the user specified on the commmand line. In the future, I also plan to add checks that the actual FASTA file matches the specified reference genome (if provided) _and_ that the GFF file matches the specified reference genome (if provided). There were some other changes that are introduced in this changeset that, at first, don't appear directly related: * We've now moved away from using `async`/`await` for the `qc` subcommand, as there is an obscure bug that doesn't allow two generic lifetimes and one static lifetime with an `async` function. Thus, I decided to just move away from using `async`/`await` altogether, as I had been considering that regardless (we already moved away from using the lazy evaluation facilities in noodles). See issues rust-lang/rust#63033 and rust-lang/rust#99190 for more details. * In testing this code, I was running into an error where a record fell outside of the valid range of a sequence. This was annoying, so I just decided to fix it as part of this changeset. There is no other deep reason why those changes are included here.
1 parent b644e47 commit 0f8e9e4

File tree

8 files changed

+820
-48
lines changed

8 files changed

+820
-48
lines changed

src/commands/qc.rs

Lines changed: 83 additions & 32 deletions
Original file line numberDiff line numberDiff line change
@@ -1,11 +1,9 @@
11
use bam::bai;
2-
use futures::TryStreamExt;
32
use noodles_sam::Header;
4-
use tokio::fs::File;
53

6-
use std::path::PathBuf;
4+
use std::{fs::File, path::PathBuf, rc::Rc};
75

8-
use anyhow::Context;
6+
use anyhow::{bail, Context};
97
use clap::{value_parser, Arg, ArgMatches, Command};
108
use noodles_bam as bam;
119
use noodles_core::{Position, Region};
@@ -19,7 +17,12 @@ use crate::lib::{
1917
quality_scores::QualityScoreFacet, results::Results, template_length::TemplateLengthFacet,
2018
RecordBasedQualityCheckFacet, SequenceBasedQualityCheckFacet,
2119
},
22-
utils::formats::sam::parse_header,
20+
utils::{
21+
formats::sam::parse_header,
22+
genome::{
23+
get_all_reference_genomes, get_all_sequences, get_reference_genome, ReferenceGenome,
24+
},
25+
},
2326
};
2427

2528
/// A utility struct for passing feature name arguments from the command line
@@ -68,6 +71,7 @@ pub fn get_record_based_qc_facets<'a>(
6871
features_gff: Option<&str>,
6972
feature_names: &'a FeatureNames,
7073
header: &'a Header,
74+
reference_genome: Rc<Box<dyn ReferenceGenome>>,
7175
) -> anyhow::Result<Vec<Box<dyn RecordBasedQualityCheckFacet + 'a>>> {
7276
// Default facets that are loaded within the qc subcommand.
7377
let mut facets: Vec<Box<dyn RecordBasedQualityCheckFacet>> = vec![
@@ -83,6 +87,7 @@ pub fn get_record_based_qc_facets<'a>(
8387
s,
8488
feature_names,
8589
header,
90+
reference_genome,
8691
)?));
8792
}
8893

@@ -92,10 +97,11 @@ pub fn get_record_based_qc_facets<'a>(
9297
pub fn get_sequence_based_qc_facets<'a>(
9398
reference_fasta: Option<&PathBuf>,
9499
header: &'a Header,
100+
reference_genome: Rc<Box<dyn ReferenceGenome>>,
95101
) -> anyhow::Result<Vec<Box<dyn SequenceBasedQualityCheckFacet<'a> + 'a>>> {
96102
// Default facets that are loaded within the qc subcommand.
97103
let mut facets: Vec<Box<dyn SequenceBasedQualityCheckFacet<'_>>> =
98-
vec![Box::new(CoverageFacet::default())];
104+
vec![Box::new(CoverageFacet::new(reference_genome))];
99105

100106
if let Some(fasta) = reference_fasta {
101107
facets.push(Box::new(EditsFacet::try_from(fasta, header)?))
@@ -122,6 +128,13 @@ pub fn get_command<'a>() -> Command<'a> {
122128
.value_parser(value_parser!(PathBuf))
123129
.takes_value(true),
124130
)
131+
.arg(
132+
Arg::new("reference-genome")
133+
.long("--reference-genome")
134+
.help("Reference genome used as the basis for the file.")
135+
.takes_value(true)
136+
.required(true),
137+
)
125138
.arg(
126139
Arg::new("features-gff")
127140
.long("--features-gff")
@@ -213,10 +226,29 @@ pub fn get_command<'a>() -> Command<'a> {
213226
/// Prepares the arguments for running the main `qc` subcommand.
214227
pub fn qc(matches: &ArgMatches) -> anyhow::Result<()> {
215228
info!("Starting qc command...");
229+
216230
let src: &PathBuf = matches
217231
.get_one("src")
218232
.expect("Could not parse the arguments that were passed in for src.");
219233

234+
let provided_reference_genome = matches
235+
.get_one::<String>("reference-genome")
236+
.expect("Did not receive a reference genome.");
237+
238+
let reference_genome = match get_reference_genome(provided_reference_genome) {
239+
Some(s) => Rc::new(s),
240+
None => bail!(
241+
"reference genome is not supported: {}. List of supported reference \
242+
genomes is: {}.",
243+
provided_reference_genome,
244+
get_all_reference_genomes()
245+
.iter()
246+
.map(|s| s.name())
247+
.collect::<Vec<&str>>()
248+
.join(", ")
249+
),
250+
};
251+
220252
let reference_fasta = matches.get_one("reference-fasta");
221253
let features_gff = matches.value_of("features-gff");
222254

@@ -272,22 +304,16 @@ pub fn qc(matches: &ArgMatches) -> anyhow::Result<()> {
272304
.expect("Could not create output directory.");
273305
}
274306

275-
let rt = tokio::runtime::Builder::new_multi_thread()
276-
.enable_all()
277-
.build()
278-
.unwrap();
279-
280-
let app = app(
307+
app(
281308
src,
282309
reference_fasta,
283310
features_gff,
311+
reference_genome,
284312
output_prefix,
285313
output_directory,
286314
num_records,
287315
feature_names,
288-
);
289-
290-
rt.block_on(app)
316+
)
291317
}
292318

293319
/// Runs the `qc` subcommand.
@@ -297,6 +323,7 @@ pub fn qc(matches: &ArgMatches) -> anyhow::Result<()> {
297323
/// * `src` — The filepath to the NGS file to run QC on.
298324
/// * `reference_fasta` — Optionally, the path to a
299325
/// when you want to run the Genomic Features facet.
326+
/// * `reference_genome` — The reference genome to be used.
300327
/// * `features_gff` — Optionally, the path to a GFF gene model file. Useful
301328
/// when you want to run the Genomic Features facet.
302329
/// * `output_prefix` — Output prefix for all files generated by this
@@ -305,31 +332,54 @@ pub fn qc(matches: &ArgMatches) -> anyhow::Result<()> {
305332
/// * `num_records` — Maximum number of records to process. Anything less than 0
306333
/// is considered infinite.
307334
/// * `feature_names` — Feature names for lookup within the GFF file.
308-
async fn app(
335+
#[allow(clippy::too_many_arguments)]
336+
fn app(
309337
src: &PathBuf,
310338
reference_fasta: Option<&PathBuf>,
311339
features_gff: Option<&str>,
340+
reference_genome: Rc<Box<dyn ReferenceGenome>>,
312341
output_prefix: &str,
313342
output_directory: PathBuf,
314343
num_records: i64,
315344
feature_names: FeatureNames,
316345
) -> anyhow::Result<()> {
317-
//==================================================//
318-
// First pass: set up file handles and prepare file //
319-
//==================================================//
346+
//=====================================================//
347+
// Preprocessing: set up file handles and prepare file //
348+
//=====================================================//
320349

321-
let mut reader = File::open(src).await.map(bam::AsyncReader::new)?;
350+
let mut reader = File::open(src).map(bam::Reader::new)?;
322351

323-
let ht = reader.read_header().await?;
352+
let ht = reader.read_header()?;
324353
let header = parse_header(ht);
325354

326-
reader.read_reference_sequences().await?;
355+
let reference_sequences = reader.read_reference_sequences()?;
356+
357+
//=====================================================//
358+
// Preprocessing: reference sequence concordance check //
359+
//=====================================================//
360+
361+
let supported_sequences = get_all_sequences(Rc::clone(&reference_genome));
362+
363+
for (sequence, _) in reference_sequences {
364+
if !supported_sequences
365+
.iter()
366+
.map(|s| s.name())
367+
.any(|x| x == sequence)
368+
{
369+
bail!("Sequence \"{}\" not found in specified reference genome. Did you set the right reference genome?", sequence);
370+
}
371+
}
327372

328373
//===========================================================//
329374
// First pass: print out which facets we're going to analyze //
330375
//===========================================================//
331376

332-
let mut record_facets = get_record_based_qc_facets(features_gff, &feature_names, &header)?;
377+
let mut record_facets = get_record_based_qc_facets(
378+
features_gff,
379+
&feature_names,
380+
&header,
381+
Rc::clone(&reference_genome),
382+
)?;
333383
info!("");
334384
info!("First pass with the following facets enabled:");
335385
info!("");
@@ -344,9 +394,10 @@ async fn app(
344394

345395
debug!("Starting first pass for QC stats.");
346396
let mut record_count = 0;
347-
let mut records = reader.records();
348397

349-
while let Some(record) = records.try_next().await? {
398+
for result in reader.records() {
399+
let record = result?;
400+
350401
for facet in &mut record_facets {
351402
match facet.process(&record) {
352403
Ok(_) => {}
@@ -387,7 +438,8 @@ async fn app(
387438
// Second pass: print out which facets we're going to analyze //
388439
//============================================================//
389440

390-
let mut sequence_facets = get_sequence_based_qc_facets(reference_fasta, &header)?;
441+
let mut sequence_facets =
442+
get_sequence_based_qc_facets(reference_fasta, &header, Rc::clone(&reference_genome))?;
391443
info!("");
392444
info!("Second pass with the following facets enabled:");
393445
info!("");
@@ -400,10 +452,8 @@ async fn app(
400452
// Second pass: set up file handles and prepare file //
401453
//===================================================//
402454

403-
let mut reader = File::open(src).await.map(bam::AsyncReader::new)?;
404-
let index = bai::r#async::read(src.with_extension("bam.bai"))
405-
.await
406-
.with_context(|| "bam index")?;
455+
let mut reader = File::open(src).map(bam::Reader::new)?;
456+
let index = bai::read(src.with_extension("bam.bai")).with_context(|| "bam index")?;
407457

408458
for (name, seq) in header.reference_sequences() {
409459
let start = Position::MIN;
@@ -419,7 +469,7 @@ async fn app(
419469
}
420470
}
421471

422-
let mut query = reader
472+
let query = reader
423473
.query(
424474
header.reference_sequences(),
425475
&index,
@@ -428,7 +478,8 @@ async fn app(
428478
.unwrap();
429479

430480
debug!(" [*] Processing records from sequence.");
431-
while let Some(record) = query.try_next().await.unwrap() {
481+
for result in query {
482+
let record = result?;
432483
for facet in &mut sequence_facets {
433484
if facet.supports_sequence_name(name) {
434485
facet.process_record(seq, &record).unwrap();

src/lib/qc/coverage.rs

Lines changed: 46 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -1,18 +1,28 @@
1-
use std::collections::HashMap;
1+
use std::{collections::HashMap, rc::Rc};
22

33
use noodles_sam::header::ReferenceSequence;
44
use serde::Serialize;
5+
use tracing::error;
56

6-
use crate::lib::utils::{genome::PRIMARY_CHROMOSOMES, histogram::SimpleHistogram};
7+
use crate::lib::utils::{
8+
genome::{get_primary_assembly, ReferenceGenome, Sequence},
9+
histogram::SimpleHistogram,
10+
};
711

812
use super::SequenceBasedQualityCheckFacet;
913

14+
#[derive(Clone, Default, Serialize)]
15+
pub struct IgnoredMetrics {
16+
nonsensical_records: usize,
17+
pileup_too_large_positions: HashMap<String, usize>,
18+
}
19+
1020
#[derive(Clone, Default, Serialize)]
1121
pub struct CoverageMetrics {
1222
mean_coverage: HashMap<String, f64>,
1323
median_coverage: HashMap<String, f64>,
1424
median_over_mean_coverage: HashMap<String, f64>,
15-
ignored: HashMap<String, usize>,
25+
ignored: IgnoredMetrics,
1626
histograms: HashMap<String, SimpleHistogram>,
1727
}
1828

@@ -21,10 +31,20 @@ pub struct CoverageHistograms<'a> {
2131
storage: HashMap<&'a str, SimpleHistogram>,
2232
}
2333

24-
#[derive(Default)]
2534
pub struct CoverageFacet<'a> {
2635
by_position: CoverageHistograms<'a>,
2736
metrics: CoverageMetrics,
37+
primary_assembly: Vec<Sequence>,
38+
}
39+
40+
impl<'a> CoverageFacet<'a> {
41+
pub fn new(reference_genome: Rc<Box<dyn ReferenceGenome>>) -> Self {
42+
Self {
43+
by_position: CoverageHistograms::default(),
44+
metrics: CoverageMetrics::default(),
45+
primary_assembly: get_primary_assembly(reference_genome),
46+
}
47+
}
2848
}
2949

3050
impl<'a> SequenceBasedQualityCheckFacet<'a> for CoverageFacet<'a> {
@@ -37,7 +57,10 @@ impl<'a> SequenceBasedQualityCheckFacet<'a> for CoverageFacet<'a> {
3757
}
3858

3959
fn supports_sequence_name(&self, name: &str) -> bool {
40-
PRIMARY_CHROMOSOMES.contains(&name)
60+
self.primary_assembly
61+
.iter()
62+
.map(|s| s.name())
63+
.any(|x| x == name)
4164
}
4265

4366
fn setup_sequence(&mut self, _: &ReferenceSequence) -> anyhow::Result<()> {
@@ -62,7 +85,20 @@ impl<'a> SequenceBasedQualityCheckFacet<'a> for CoverageFacet<'a> {
6285
let record_end = usize::from(record.alignment_end().unwrap());
6386

6487
for i in record_start..=record_end {
65-
h.increment(i).unwrap();
88+
if h.increment(i).is_err() {
89+
error!(
90+
"Record crosses the sequence boundaries in an expected way. \
91+
This usually means that the record is malformed. Please examine \
92+
the record closely to ensure it fits within the sequence. \
93+
Ignoring record. Read name: {}, Start Alignment: {}, End \
94+
Alignment: {}, Cigar: {}",
95+
record.read_name().unwrap(),
96+
record.alignment_start().unwrap(),
97+
record.alignment_end().unwrap(),
98+
record.cigar()
99+
);
100+
self.metrics.ignored.nonsensical_records += 1;
101+
}
66102
}
67103

68104
Ok(())
@@ -99,7 +135,10 @@ impl<'a> SequenceBasedQualityCheckFacet<'a> for CoverageFacet<'a> {
99135
self.metrics
100136
.histograms
101137
.insert(seq.name().to_string(), coverages);
102-
self.metrics.ignored.insert(seq.name().to_string(), ignored);
138+
self.metrics
139+
.ignored
140+
.pileup_too_large_positions
141+
.insert(seq.name().to_string(), ignored);
103142

104143
Ok(())
105144
}

0 commit comments

Comments
 (0)