Error codes

Error messages are recorded in several output files, and it’s not always clear what each message means. They are all defined in the primer_finder_errors.py file; this section lists which errors can appear in which file, along with their meanings.

— `contigs_primer_analysis.csv` —

The denovo assembler can assemble several contigs per sample. In a successful sample, one contig has a BLAST match to HIV, and any other small contigs, such as human sequence or primer artifacts, can be ignored.

The possible errors are:

no contig/conseq constructed — there was a sample in the cascade.csv file that had no entries in contigs.csv.
sequence is non-hiv — none of the sample’s contigs were a BLAST match for HIV.
primer was not found — could not even find a minimum match of two bases from the end of the primer.
primer failed validation — found part of the primer, but the neighbouring sequence didn’t align to the expected part of HXB2.

— `conseqs_primer_analysis.csv` —

This file analyses the consensus of all the reads that map to each assembled contig. Coverage must be at least 100 reads for a position to be included in the consensus, and positions with lower coverage are reported as an X.
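
The coverage rule above can be illustrated with a minimal sketch. It assumes the per-position read counts are already available as a list of dictionaries; the function name, the constant name, and that input shape are all hypothetical, and mixture codes for ties are not handled.

```python
MIN_COVERAGE = 100  # threshold from the text above; the name is an assumption


def build_consensus(position_counts, min_coverage=MIN_COVERAGE):
    """Build a consensus string from a list of {base: read_count} dicts.

    Positions covered by fewer than min_coverage reads are reported as 'X'.
    Ties are not handled: the first base with the top count wins.
    """
    consensus = []
    for counts in position_counts:
        coverage = sum(counts.values())
        if coverage < min_coverage:
            consensus.append("X")
        else:
            consensus.append(max(counts, key=counts.get))
    return "".join(consensus)


counts = [
    {"A": 150, "C": 2},  # 152 reads: well covered
    {"G": 40, "T": 30},  # 70 reads: below the threshold
    {"T": 99, "A": 1},   # exactly 100 reads: just enough
]
print(build_consensus(counts))  # AXT
```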

All the same errors can be reported as for contigs, plus these:

not MAX — the conseqs.csv file contains several consensus sequences with mixtures reported at different cutoff levels. MAX reports the base with maximum prevalence at each position, with mixture codes only used for exact ties. This error message means that the consensus was ignored, because it was one of the cutoff levels that more often include mixtures.
sequence is non-proviral — marks V3LOOP sequences generated by G2P.
sequence contained non-TCGA/gap — sequence contained bases beyond TCGA and the gap marker. Most likely cause: an exact tie in counts that generates a mixture code.
low internal read coverage — there are X’s farther from the ends of the sequence than the region where we look for primers (50 bases).
low end read coverage — the analysis trims from each end toward the midpoint until it has removed all X’s. If that leaves less than one sixth of the original length, it generates this error.
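
The two low coverage checks can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in primer_finder_errors.py: the function and constant names are hypothetical, and so is the order of the checks (internal X’s first, then trimming).

```python
PRIMER_SEARCH_WINDOW = 50  # bases searched for primers at each end (from the text)


def check_coverage(seq, window=PRIMER_SEARCH_WINDOW):
    """Return (sequence, error); error is None when coverage is acceptable."""
    # Any X outside the primer search windows counts as internal.
    if "X" in seq[window:len(seq) - window]:
        return seq, "low internal read coverage"

    # Trim from each end toward the midpoint until no X remains.
    midpoint = len(seq) // 2
    last_left_x = seq.rfind("X", 0, midpoint)  # -1 if the left half is clean
    first_right_x = seq.find("X", midpoint)    # -1 if the right half is clean
    start = last_left_x + 1
    end = first_right_x if first_right_x != -1 else len(seq)
    trimmed = seq[start:end]

    # Less than one sixth of the original length left: report the error.
    if len(trimmed) * 6 < len(seq):
        return trimmed, "low end read coverage"
    return trimmed, None


print(check_coverage("XX" + "A" * 200 + "X")[1])           # None
print(check_coverage("A" * 100 + "X" + "A" * 100)[1])      # low internal read coverage
print(check_coverage("X" * 28 + "ACGT" + "X" * 28))        # ('ACGT', 'low end read coverage')
```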

— `outcome_summary.csv` —

This file tries to find a usable result: first from conseqs, then from contigs. The seqtype column shows where it found the result. If both failed, it will summarize the errors:

no contig/conseq constructed — there was a sample in the cascade.csv file that had no entries in contigs.csv.
sequence is non-hiv — none of the sample’s contigs were a BLAST match for HIV.
sample has multiple QC-passed sequences — more than one conseq or more than one contig passed quality control.
primer error — there was one HIV conseq, and it had one of the primer errors: primer was not found, primer failed validation, or low end read coverage. The contigs didn’t pass, either.
low coverage — either the HIV contigs didn’t have coverage above 100 reads, or there was one HIV contig that generated one of the low coverage errors.
multiple contigs — there were multiple contigs, and all of them failed due to some combination of primer errors and low coverage.
hiv but failed — unexpected failure, such as sequence contained non-TCGA/gap.
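
The "conseqs first, then contigs" fallback might look something like the sketch below. The row layout (dicts with `sequence` and `error` keys), the seqtype values, and the choice to stop as soon as one source has more than one passing sequence are all assumptions made for illustration; summarizing the errors when both sources fail is left to the caller.

```python
MULTIPLE_PASSED = "sample has multiple QC-passed sequences"


def pick_result(conseq_rows, contig_rows):
    """Return (seqtype, sequence, error), or None if nothing passed QC.

    Each row is a dict with hypothetical 'sequence' and 'error' keys;
    an empty error means the sequence passed quality control.
    """
    for seqtype, rows in (("conseqs", conseq_rows), ("contigs", contig_rows)):
        passed = [row for row in rows if not row["error"]]
        if len(passed) > 1:
            # Assumption: an ambiguous source ends the search.
            return None, None, MULTIPLE_PASSED
        if passed:
            return seqtype, passed[0]["sequence"], None
    # Both sources failed; the caller summarizes the individual errors.
    return None


conseqs = [{"sequence": "ACGT", "error": "primer was not found"}]
contigs = [{"sequence": "ACGTAC", "error": ""}]
print(pick_result(conseqs, contigs))  # ('contigs', 'ACGTAC', None)
```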

— `study_summary.csv` —

Summarizes the sample counts and error counts by run folder, by participant id, and as a grand total. It reports the following errors, grouping the details from the outcome summary file:

no_sequence — no contig/conseq constructed
non_hiv — sequence is non-hiv
no_primer — primer error
low_cov — low end read coverage, low internal read coverage, or low coverage
multiple_contigs — multiple contigs or sample has multiple QC-passed sequences
hiv_but_failed — hiv but failed
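
The grouping above amounts to a lookup table. A minimal sketch, assuming the outcome errors arrive as a plain list of message strings (the names `ERROR_GROUPS` and `count_errors` are hypothetical):

```python
from collections import Counter

# Outcome summary message -> study summary column, as listed above.
ERROR_GROUPS = {
    "no contig/conseq constructed": "no_sequence",
    "sequence is non-hiv": "non_hiv",
    "primer error": "no_primer",
    "low end read coverage": "low_cov",
    "low internal read coverage": "low_cov",
    "low coverage": "low_cov",
    "multiple contigs": "multiple_contigs",
    "sample has multiple QC-passed sequences": "multiple_contigs",
    "hiv but failed": "hiv_but_failed",
}


def count_errors(outcome_errors):
    """Count errors per study summary column; unknown messages raise KeyError."""
    return Counter(ERROR_GROUPS[message] for message in outcome_errors)


totals = count_errors(["primer error", "low coverage", "low end read coverage"])
print(totals["low_cov"], totals["no_primer"])  # 2 1
```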
