Data Preparation

To analyze proviral sequences, the pipeline expects specific input files generated by MiCall, a related tool used for assembling contigs. Familiarity with MiCall is assumed, as it handles the initial stages of data preparation for proviral analysis. Below is a brief overview of the input files required:

Input Files

  1. contigs.csv: Contains assembled and possibly merged contigs, along with their best BLAST results.
    • ref: Reference name with the best BLAST match.
    • match: The fraction of the contig matched in BLAST (negative values indicate reverse-complemented matches).
    • group_ref: The reference name selected to best align with all contigs in a sample.
    • contig: The nucleotide sequence of the assembled contig.

In MiCall’s output, this file is called unstitched_contigs.csv.
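As a rough illustration of the columns described above, the following sketch reads a contigs.csv-style table and interprets the sign of `match`. It assumes the four column names listed above; the helper `read_contigs`, the reference name `REF_A`, and the two data rows are hypothetical and exist only for this example.

```python
import csv
import io

# Column names taken from the description above.
REQUIRED_COLUMNS = ("ref", "match", "group_ref", "contig")


def read_contigs(handle):
    """Read contig rows from an open CSV handle (hypothetical helper)."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        raise ValueError("missing columns: " + ", ".join(missing))

    rows = []
    for row in reader:
        match = float(row["match"])
        rows.append({
            "ref": row["ref"],
            "group_ref": row["group_ref"],
            "contig": row["contig"],
            # The magnitude is the fraction of the contig matched in BLAST.
            "match_fraction": abs(match),
            # A negative value marks a reverse-complemented match.
            "reverse_complemented": match < 0,
        })
    return rows


# Made-up rows for illustration only; real files come from MiCall.
example = io.StringIO(
    "ref,match,group_ref,contig\n"
    "REF_A,1.0,REF_A,ACGTACGT\n"
    "REF_A,-0.75,REF_A,TTGACA\n"
)
contigs = read_contigs(example)
for entry in contigs:
    print(entry["ref"], entry["match_fraction"], entry["reverse_complemented"])
```

To read a real file, pass an open file object instead of the in-memory example, for instance `open("contigs.csv", newline="")`.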

  2. conseqs.csv: Contains refined/improved versions of contigs.

In MiCall’s output, this file is called unstitched_conseq.csv.

  3. cascade.csv: Tracks the number of read pairs processed through different stages of the pipeline.
    • demultiplexed: Raw FASTQ count.
    • v3loop: Reads aligned with V3LOOP.
    • g2p: Valid reads count in G2P.
    • prelim_map: Initially mapped to references in the first pass.
    • remap: Reads remapped to other references.
    • aligned: Aligned reads merged with their mate.
  4. sample_info.csv: Provides auxiliary run metadata. Unlike all the previous files, this one is not generated by MiCall. This file can be empty.
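The cascade.csv columns can be read as a funnel: each stage count can be compared with the demultiplexed count to see how many read pairs survive. The sketch below assumes one row per sample with integer counts under the six column names listed above; the helper `stage_fractions` and the numbers are made up for illustration and are not part of the pipeline.

```python
import csv
import io

# Stage columns in the order listed above.
STAGES = ("demultiplexed", "v3loop", "g2p", "prelim_map", "remap", "aligned")


def stage_fractions(row):
    """Express each stage count as a fraction of the demultiplexed count."""
    counts = {stage: int(row[stage]) for stage in STAGES}
    total = counts["demultiplexed"]
    if total == 0:
        # Avoid dividing by zero when a sample has no reads at all.
        return {stage: 0.0 for stage in STAGES}
    return {stage: counts[stage] / total for stage in STAGES}


# Made-up counts for illustration only.
example = io.StringIO(
    "demultiplexed,v3loop,g2p,prelim_map,remap,aligned\n"
    "1000,0,0,800,900,850\n"
)
for row in csv.DictReader(example):
    fractions = stage_fractions(row)
    for stage in STAGES:
        print(f"{stage:>13}: {row[stage]:>6} read pairs ({fractions[stage]:.1%})")
```

A sharp drop between two stages in such a summary is a hint to inspect the sample before running the proviral analysis on it.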

Keep these files organized in your dataset directory. When the data comes from MiCall, the directory is often already structured this way.

Utilizing Example Data for Practice

To experiment, visit the example directory, which contains sample inputs that match these specifications. Working with these samples is a quick way to become familiar with the structure and usage of the input files. If you have trouble downloading the sample inputs, follow these instructions.


Next: running the pipeline.