HIVIntact workflow
Preparation
Before analyzing a sequence, HIVIntact does some initial preprocessing that is used later in the analysis.
Alignment to Reference
For each input sequence, HIVIntact uses the mafft software to align it to its subtype sequence.
This operation is repeated with the reverse complement (RC) of the input sequence to determine if the fit is better.
If the RC provides a better alignment, HIVIntact uses it instead.
This ensures that the direction in which the original sequence is read does not affect the analysis.
The alignment is global, and it never fails.
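The orientation choice can be sketched as follows. This is only an illustration: `toy_score` is a hypothetical stand-in for whatever score the mafft alignment yields, and `pick_orientation` is not a function taken from HIVIntact itself.

```python
# Complement table for the four standard bases plus N (both cases).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def toy_score(query: str, reference: str) -> int:
    """Hypothetical stand-in for an alignment score: counts matching positions."""
    return sum(q == r for q, r in zip(query, reference))


def pick_orientation(query: str, reference: str, score=toy_score):
    """Score both orientations and keep the one that fits the reference better."""
    forward = score(query, reference)
    rc = reverse_complement(query)
    reverse = score(rc, reference)
    if reverse > forward:
        return rc, "reverse"
    return query, "forward"
```

For example, `pick_orientation("CGTT", "AACG")` picks the reverse complement, because `"CGTT"` read in the opposite direction is exactly `"AACG"`.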
ORF Detection
(TODO: describe the logic)
Detection never fails, but it can output ORFs that have a length of 0.
The outputs from this procedure are used to produce the orfs.json file.
BLAST Analysis
Optionally, HIVIntact calls NCBI’s blastn program to obtain alignment data that is region-based, as opposed to the global alignment provided by mafft.
Analysis steps
HIVIntact performs multiple analysis steps that are independent of each other. Most of these analyses are optional and are controlled by passing a command line option to the main program.
The table below lists all of the independent steps:
Name | Enabled by Default? | Command Line Option |
---|---|---|
PSI check | Yes | --exclude-packaging-signal |
RRE check | Yes | --exclude-rre |
MSD check | Yes | --ignore-major-splice-donor-site |
Hypermutation check | No | --run-hypermut |
Large deletion check | No | --check-long-deletion |
NonHIV check | No | --check-nonhiv |
Scramble check | No | --check-scramble |
Inversion check | No | --check-internal-inversion |
Large ORFs analysis | Yes | |
Small ORFs analysis | No | --include-small-orfs |
Each step works on a single input sequence and has a set of potential errors it can detect for that sequence.
We describe the logic of each step below.
PSI check
Determines the presence and possible intactness of the HIV Packaging Signal Region.
Based on the alignment, HIVIntact locates the PSI region in the input sequence and checks its length.
For lengths smaller than the tolerable limit, an error with code PackagingSignalDeletion is reported.
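A minimal sketch of such a length check is shown below. It assumes the alignment is available as two gapped strings of equal length, and the function names, region coordinates and `min_length` parameter are hypothetical; the actual coordinates and tolerances used by HIVIntact are not given here.

```python
def region_length_in_query(aligned_ref: str, aligned_query: str,
                           ref_start: int, ref_end: int) -> int:
    """Count query bases aligned within reference positions [ref_start, ref_end)."""
    ref_pos = 0  # index of the next reference base (gaps do not advance it)
    length = 0
    for r, q in zip(aligned_ref, aligned_query):
        inside = ref_start <= ref_pos < ref_end
        if inside and q != "-":
            length += 1
        if r != "-":
            ref_pos += 1
    return length


def check_region(aligned_ref, aligned_query, ref_start, ref_end,
                 min_length, error_code):
    """Return error_code if the region is shorter than min_length, else None."""
    length = region_length_in_query(aligned_ref, aligned_query, ref_start, ref_end)
    return error_code if length < min_length else None
```

The same helper would serve any region that is checked by length alone, with only the coordinates, limit and error code changing.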
RRE check
Determines the presence and possible intactness of the HIV Rev Response Element.
The analysis performed is the same as the PSI check, but the error code is RevResponseElementDeletion.
MSD check
Determines whether the Major Splice Donor site is mutated.
Based on the alignment, HIVIntact locates the region that is expected to contain the MSD subsequence.
If the found subsequence is anything but G followed by T, an error with the MajorSpliceDonorSiteMutated error code is reported.
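The check can be illustrated with the sketch below. It assumes a pairwise alignment given as two gapped strings, and `msd_start` is a hypothetical 0-based reference coordinate of the donor site; the real coordinate used by HIVIntact is not stated here.

```python
def query_bases_at(aligned_ref: str, aligned_query: str,
                   ref_start: int, ref_end: int) -> str:
    """Collect query bases aligned to reference positions [ref_start, ref_end)."""
    ref_pos = 0
    bases = []
    for r, q in zip(aligned_ref, aligned_query):
        if r == "-":
            continue  # insertion relative to the reference: no reference coordinate
        if ref_start <= ref_pos < ref_end and q != "-":
            bases.append(q)
        ref_pos += 1
    return "".join(bases).upper()


def check_msd(aligned_ref: str, aligned_query: str, msd_start: int):
    """Return the error code unless the query has GT at the expected site."""
    site = query_bases_at(aligned_ref, aligned_query, msd_start, msd_start + 2)
    return None if site == "GT" else "MajorSpliceDonorSiteMutated"
```

A deleted base at the site yields a subsequence shorter than two bases, which also fails the comparison with `"GT"`.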
Hypermutation check
APOBEC3G/F hypermutation scan and test based on Rose and Korber, Bioinformatics (2000). Briefly, this step scans the reference for possible APOBEC signature and non-signature sites and performs Fisher's exact test on the ratio of G->A changes from reference to query at these sites.
If there is enough evidence that the sequence is hypermutated (p-value < 0.05), this step outputs the APOBECHypermutationDetected error code.
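The statistical part of this step can be sketched as a one-sided Fisher's exact test on a 2x2 table of site counts. The sketch assumes the four counts have already been collected; how signature and control sites are recognised is left out, and the function names are hypothetical.

```python
from math import comb


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test for the table [[a, b], [c, d]].

    a: signature sites with G->A      b: signature sites without G->A
    c: control sites with G->A        d: control sites without G->A
    Returns the probability of seeing a or more mutated signature sites
    under the hypergeometric null (all margins fixed).
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(row2, col1 - k) for k in range(a, upper + 1))
    return tail / denom


def check_hypermutation(a, b, c, d, alpha=0.05):
    """Return the error code when the p-value is below alpha, else None."""
    p_value = fisher_exact_greater(a, b, c, d)
    return "APOBECHypermutationDetected" if p_value < alpha else None
```

With ten mutated signature sites out of ten and no mutated control sites out of ten, the p-value is 1/184756, so the error code is returned; with one mutated site in each group the p-value is far above 0.05 and nothing is reported.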
Large deletion check
If the input sequence is shorter than 8000 nucleotide bases, an error with code LongDeletion is produced.
NonHIV check
(TODO: describe the logic)
This analysis step is copied from HIVSeqinR software.
It outputs the NonHIV error code.
Scramble check
(TODO: describe the logic)
This analysis step is copied from HIVSeqinR software.
It outputs the Scramble error code.
Inversion check
(TODO: describe the logic)
This analysis step is copied from HIVSeqinR software.
It outputs the InternalInversion error code.
Large ORFs analysis
Large ORFs are gag, pol and env.
Using data from the ORF detection procedure, HIVIntact goes through each of the listed ORFs and checks two things:
- their lengths
- and possible out-of-frame indels.
The length check compares each ORF's length to predefined length limits.
See the cutoffs page for the limits that HIVIntact adheres to.
If the length is too long, the error code is InsertionInOrf.
If the length is too short, then HIVIntact also checks if there is an internal stop codon in the analyzed ORF.
Depending on that, the code is either DeletionInOrf or InternalStopInOrf.
Notably, internal stop codons that do not make the resulting protein too short are ignored.
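The decision logic of the length check can be summarised in a short sketch. The function name and the `min_length` and `max_length` parameters are hypothetical, and how the internal stop codon is detected is not shown; the sketch assumes that a too-short ORF with an internal stop maps to InternalStopInOrf and one without maps to DeletionInOrf.

```python
def classify_orf_length(length: int, min_length: int, max_length: int,
                        has_internal_stop: bool):
    """Map an ORF length to an error code, or None if the length is acceptable."""
    if length > max_length:
        return "InsertionInOrf"
    if length < min_length:
        # The stop codon only matters once the ORF is already too short.
        return "InternalStopInOrf" if has_internal_stop else "DeletionInOrf"
    # Within limits: an internal stop codon is ignored here.
    return None
```

Note that `has_internal_stop` is consulted only in the too-short branch, which mirrors the rule that internal stop codons are ignored when the resulting protein is not too short.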
Out-of-frame indel detection assumes that such indels are common, but that not all of them render the respective ORF dysfunctional. HIVIntact therefore estimates the impact of the detected indels with the following algorithm:
- Redo the alignment, not global this time, but only for the current ORF.
- Iterate through each position in the alignment, keeping the index of the “current frame” and counting all nucleotides that are not in the initial frame as we go:
  - if an insertion is encountered, change the current frame by +1,
  - if a deletion is encountered, change the current frame by -1.
- Compare the number of out-of-frame nucleotides to a predefined limit.
If the impact is large enough, meaning that too many nucleotides got frame shifted, then output an error with code FrameshiftInOrf.
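The counting procedure above can be sketched as follows. The sketch assumes the ORF alignment is given as two gapped strings, treats a frame offset that is a multiple of 3 as being back in the initial frame, and counts query nucleotides only; these details, the function names and the `max_out_of_frame` parameter are assumptions rather than the documented behaviour.

```python
def count_out_of_frame(aligned_ref: str, aligned_query: str) -> int:
    """Count query nucleotides that are read outside the initial frame."""
    frame = 0          # offset of the current frame relative to the initial one
    out_of_frame = 0
    for r, q in zip(aligned_ref, aligned_query):
        if r == "-" and q != "-":
            frame += 1  # insertion in the query
        elif q == "-" and r != "-":
            frame -= 1  # deletion in the query
        if q != "-" and frame % 3 != 0:
            out_of_frame += 1
    return out_of_frame


def check_frameshift(aligned_ref: str, aligned_query: str, max_out_of_frame: int):
    """Return the error code when too many nucleotides are frame shifted."""
    shifted = count_out_of_frame(aligned_ref, aligned_query)
    return "FrameshiftInOrf" if shifted > max_out_of_frame else None
```

For the alignment of reference `ATG-AAACCC` with query `ATGTAA-CCC`, the insertion shifts the frame and the deletion three columns later restores it, so only the three nucleotides in between are counted. A single uncompensated insertion, by contrast, leaves every downstream nucleotide out of frame.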
(TODO: show example)
Small ORFs analysis
Small ORFs are vif, vpr, tat, vpu, rev and nef.
The analysis performed for them is the same as the Large ORFs analysis, and this step can output the same error codes.