presto.Sequence

Sequence processing functions

presto.Sequence.calculateDiversity(seq_list, score_dict=getDNAScoreDict())

Determine the average pairwise error rate for a list of sequences

Parameters:
    seq_list – List of SeqRecord objects to score
    score_dict – Optional dictionary of alignment scores as {(char1, char2): score}
Returns:	Average pairwise error rate for the list of sequences
Return type:	float
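
To make the quantity concrete, here is a minimal standalone sketch of an average pairwise error rate. It is not presto's implementation: it works on plain aligned strings rather than SeqRecord objects, scores by exact character identity instead of a score dictionary, and the helper names pairwise_error and average_diversity are hypothetical.

```python
from itertools import combinations


def pairwise_error(seq1, seq2):
    """Fraction of aligned positions at which two strings differ.

    Exact character identity is an assumption of this sketch; no
    ambiguity codes or score dictionary are considered.
    """
    length = min(len(seq1), len(seq2))
    if length == 0:
        return 0.0
    mismatches = sum(1 for a, b in zip(seq1, seq2) if a != b)
    return mismatches / length


def average_diversity(seqs):
    """Mean pairwise error rate over all unordered pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        return 0.0
    return sum(pairwise_error(a, b) for a, b in pairs) / len(pairs)


reads = ["ACGT", "ACGA", "ACGT"]
diversity = average_diversity(reads)
print(round(diversity, 4))
```

With three reads there are three unordered pairs, so a single divergent read contributes to two of the three pairwise comparisons.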

presto.Sequence.calculateSetError(seq_list, ref_seq, ignore_chars=['n', 'N'], score_dict=getDNAScoreDict())

Counts the occurrence of nucleotide mismatches from a reference in a set of sequences

Parameters:
    seq_list – List of SeqRecord objects with aligned sequences
    ref_seq – SeqRecord object containing the reference sequence to match against
    ignore_chars – List of characters to exclude from mismatch counts
    score_dict – Optional dictionary of alignment scores as {(char1, char2): score}
Returns:	Error rate for the set
Return type:	float

presto.Sequence.checkSeqEqual(seq1, seq2, ignore_chars={'.', '-', 'n', 'N'})

Determine if two sequences are equal, excluding missing positions

Parameters:
    seq1 – SeqRecord object
    seq2 – SeqRecord object
    ignore_chars – Set of characters to ignore
Returns:	True if the sequences are equal
Return type:	bool
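
The comparison described above can be sketched on plain strings. This is an illustrative approximation, not presto's code: the function name seqs_equal is hypothetical, and the sketch assumes the two inputs are already aligned and of equal length.

```python
def seqs_equal(seq1, seq2, ignore_chars=frozenset(".-nN")):
    """True if two aligned strings agree at every position where
    neither string carries an ignored (missing) character."""
    return all(
        a == b
        for a, b in zip(seq1, seq2)
        if a not in ignore_chars and b not in ignore_chars
    )


print(seqs_equal("ACGT", "ACNT"))
print(seqs_equal("ACGT", "ACTT"))
```

A position is skipped as soon as either sequence has a missing character there, so "ACNT" and "ACGT" compare as equal.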

presto.Sequence.compilePrimers(primers)

Translates IUPAC Ambiguous Nucleotide characters to regular expressions and compiles them

Parameters:	primers – Dictionary of sequences to translate
Returns:	Dictionary of compiled regular expressions
Return type:	dict
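
As a rough illustration of the idea, the sketch below expands IUPAC ambiguity codes into regular expression character classes and compiles one pattern per primer. The mapping table follows the standard IUPAC nucleotide codes; the names IUPAC_DNA and compile_primers_sketch are hypothetical, and presto's own translation table may differ in detail (for example in how N or gap characters are expanded).

```python
import re

# Standard IUPAC nucleotide ambiguity codes as regex character classes.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def compile_primers_sketch(primers):
    """Map primer name -> compiled regex built from its IUPAC sequence."""
    compiled = {}
    for name, seq in primers.items():
        pattern = "".join(IUPAC_DNA[ch] for ch in seq.upper())
        compiled[name] = re.compile(pattern)
    return compiled


primers = {"P1": "ACRT", "P2": "GGNY"}
regexes = compile_primers_sketch(primers)
print(regexes["P1"].pattern)
print(bool(regexes["P1"].match("ACGT")))
```

Compiling once and reusing the pattern objects avoids re-translating each primer for every read that is searched.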

presto.Sequence.deleteSeqPositions(seq, positions)

Deletes a list of positions from a SeqRecord

Parameters:
    seq – SeqRecord object
    positions – Set of positions (indices) to delete
Returns:	Modified SeqRecord with the specified positions removed
Return type:	SeqRecord

presto.Sequence.findGapPositions(seq_list, max_gap, gap_chars={'.', '-'})

Finds positions in a set of aligned sequences with a high number of gap characters.

Parameters:
    seq_list – List of SeqRecord objects with aligned sequences
    max_gap – Float of the maximum gap frequency to consider a position as non-gapped
    gap_chars – Set of characters to consider as gaps
Returns:	Positions (indices) with gap frequency greater than max_gap
Return type:	list
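
The gap filter and the position deletion described above fit together naturally. The following is a minimal sketch on plain aligned strings, not presto's implementation; find_gap_positions and delete_positions are hypothetical names, and the sketch assumes all sequences have the same length.

```python
def find_gap_positions(seqs, max_gap, gap_chars=frozenset(".-")):
    """Indices of alignment columns whose gap frequency exceeds max_gap."""
    n = len(seqs)
    positions = []
    for i, column in enumerate(zip(*seqs)):
        gap_freq = sum(ch in gap_chars for ch in column) / n
        if gap_freq > max_gap:
            positions.append(i)
    return positions


def delete_positions(seq, positions):
    """Copy of seq with the given indices removed."""
    drop = set(positions)
    return "".join(ch for i, ch in enumerate(seq) if i not in drop)


aligned = ["AC-GT", "AC-GT", "ACTG-"]
gapped = find_gap_positions(aligned, max_gap=0.5)
trimmed = [delete_positions(s, gapped) for s in aligned]
print(gapped, trimmed)
```

Only the third column is removed here: two of three sequences carry a gap there (frequency 2/3), while the last column has a gap frequency of 1/3, which is below the cutoff.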

presto.Sequence.frequencyConsensus(seq_list, min_freq=0.6, ignore_chars={'.', '-', 'n', 'N'})

Builds a consensus sequence from a set of sequences using character frequencies

Parameters:
    seq_list – List of SeqRecord objects
    min_freq – Frequency cutoff to assign a base
    ignore_chars – Set of characters to exclude when building a consensus sequence
Returns:	Consensus SeqRecord object
Return type:	SeqRecord
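
A frequency-based consensus can be sketched as a per-column majority vote with a cutoff. This is an illustration only, with several assumptions that may not match presto: the inputs are plain aligned strings, the frequency is computed over non-ignored characters in the column, and positions that fail the cutoff are written as N. The name frequency_consensus is hypothetical.

```python
from collections import Counter


def frequency_consensus(seqs, min_freq=0.6, ignore_chars=frozenset(".-nN")):
    """Column-wise consensus; emits 'N' where no base reaches min_freq."""
    consensus = []
    for column in zip(*seqs):
        counts = Counter(ch for ch in column if ch not in ignore_chars)
        total = sum(counts.values())
        if total == 0:
            # Every character in this column was ignored (assumption: call N).
            consensus.append("N")
            continue
        base, count = counts.most_common(1)[0]
        consensus.append(base if count / total >= min_freq else "N")
    return "".join(consensus)


reads = ["ACGT", "ACGA", "ACTA", "AC-A"]
print(frequency_consensus(reads))
```

In the third column the gap is excluded, leaving G at 2/3, which clears the default cutoff of 0.6.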

presto.Sequence.getAAScoreDict(mask_score=None, gap_score=None)

Generates a score dictionary for pairs of amino acid characters

Parameters:
    mask_score – Tuple of length two defining scores for all matches against an X character for (a, b), with the score for character (a) taking precedence; if None, score symmetrically according to IUPAC character identity
    gap_score – Tuple of length two defining scores for all matches against a gap (-, .) character for (a, b), with the score for character (a) taking precedence; if None, score symmetrically according to IUPAC character identity
Returns:	Score dictionary with keys (char1, char2) mapping to scores
Return type:	dict

presto.Sequence.getDNAScoreDict(mask_score=None, gap_score=None)

Generates a score dictionary for pairs of nucleotide characters

Parameters:
    mask_score – Tuple of length two defining scores for all matches against an N character for (a, b), with the score for character (a) taking precedence; if None, score symmetrically according to IUPAC character identity
    gap_score – Tuple of length two defining scores for all matches against a gap (-, .) character for (a, b), with the score for character (a) taking precedence; if None, score symmetrically according to IUPAC character identity
Returns:	Score dictionary with keys (char1, char2) mapping to scores
Return type:	dict

presto.Sequence.indexSeqSets(seq_dict, field='BARCODE', delimiter=('|', '=', ', '))

Identifies sets of sequences with the same ID field

Parameters:
    seq_dict – Dictionary index of sequences returned from SeqIO.index()
    field – Annotation field containing set IDs
    delimiter – Tuple of delimiters for (fields, values, value lists)
Returns:	Dictionary mapping set name to a list of record names
Return type:	dict
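
The grouping step can be sketched without Biopython by working directly on header strings. The sketch assumes headers of the form ID|FIELD=value|FIELD=value, which is an assumption inferred from the delimiter description above; parse_annotation and index_sets are hypothetical names, and the third delimiter (value lists) is not used here.

```python
from collections import defaultdict


def parse_annotation(header, delimiter=("|", "=", ",")):
    """Split a header into a dict of annotation fields (assumed format)."""
    fields = header.split(delimiter[0])
    annotation = {"ID": fields[0]}
    for field in fields[1:]:
        key, _, value = field.partition(delimiter[1])
        annotation[key] = value
    return annotation


def index_sets(headers, field="BARCODE", delimiter=("|", "=", ",")):
    """Map each value of `field` to the list of headers that carry it."""
    sets = defaultdict(list)
    for header in headers:
        annotation = parse_annotation(header, delimiter)
        if field in annotation:
            sets[annotation[field]].append(header)
    return dict(sets)


headers = ["r1|BARCODE=AAA|PRIMER=V1", "r2|BARCODE=CCC", "r3|BARCODE=AAA"]
barcode_sets = index_sets(headers)
print(barcode_sets)
```

Storing record names rather than the records themselves keeps the index small, which matches the documented return value of set name to list of record names.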

presto.Sequence.qualityConsensus(seq_list, min_qual=20, min_freq=0.6, dependent=False, ignore_chars={'.', '-', 'n', 'N'})

Builds a consensus sequence from a set of sequences using quality scores and character frequencies

Parameters:
    seq_list – List of SeqRecord objects
    min_qual – Quality cutoff to assign a base
    min_freq – Frequency cutoff to assign a base
    dependent – If False assume sequences are independent for quality calculation
    ignore_chars – Set of characters to exclude when building a consensus sequence
Returns:	Consensus SeqRecord object
Return type:	SeqRecord

presto.Sequence.reverseComplement(seq)

Takes the reverse complement of a sequence

Parameters:	seq – SeqRecord object, Seq object or string to reverse complement
Returns:	Object of the same type as the input with the reverse complement sequence
Return type:	SeqRecord, Seq or str, matching the input type
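
For the plain string case, the operation can be sketched with a translation table. This is a standalone illustration rather than presto's code: the name reverse_complement is hypothetical, and only the four unambiguous bases are complemented, with any other character (such as N) passed through unchanged.

```python
# Complement table for unambiguous bases, preserving case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("AACG"))
```

Applying the function twice returns the original string, which is a convenient sanity check.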

presto.Sequence.scoreAA(a, b, mask_score=None, gap_score=None)

Returns the score for a pair of IUPAC Extended Protein characters

Parameters:
    a – First character
    b – Second character
    mask_score – Tuple of length two defining scores for all matches against an X character for (a, b), with the score for character (a) taking precedence; if None, score symmetrically according to IUPAC character identity
    gap_score – Tuple of length two defining scores for all matches against a gap (-, .) character for (a, b), with the score for character (a) taking precedence; if None, score symmetrically according to IUPAC character identity
Returns:	Score for the character pair
Return type:	int

presto.Sequence.scoreDNA(a, b, mask_score=None, gap_score=None)

Returns the score for a pair of IUPAC Ambiguous Nucleotide characters

Parameters:
    a – First character
    b – Second character
    mask_score – Tuple of length two defining scores for all matches against an N character for (a, b), with the score for character (a) taking precedence; if None, score symmetrically according to IUPAC character identity
    gap_score – Tuple of length two defining scores for all matches against a gap (-, .) character for (a, b), with the score for character (a) taking precedence; if None, score symmetrically according to IUPAC character identity
Returns:	Score for the character pair
Return type:	int
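
The scoring rule and the dictionary built from it can be sketched as follows. This is a simplified illustration under stated assumptions, not presto's implementation: two characters are taken to match when their IUPAC base sets overlap, gap handling (gap_score) is omitted entirely, and the names IUPAC_SETS, score_dna_sketch and dna_score_dict_sketch are hypothetical.

```python
from itertools import product

# Bases represented by each IUPAC nucleotide code.
IUPAC_SETS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def score_dna_sketch(a, b, mask_score=None):
    """Return 1 for a match and 0 for a mismatch (overlap rule, an assumption).

    If mask_score is given, an N in position a takes precedence over
    an N in position b, mirroring the documented tuple semantics.
    """
    a, b = a.upper(), b.upper()
    if mask_score is not None:
        if a == "N":
            return mask_score[0]
        if b == "N":
            return mask_score[1]
    return 1 if set(IUPAC_SETS[a]) & set(IUPAC_SETS[b]) else 0


def dna_score_dict_sketch(mask_score=None):
    """Precompute scores for every ordered character pair."""
    chars = "".join(IUPAC_SETS)
    return {
        (a, b): score_dna_sketch(a, b, mask_score)
        for a, b in product(chars, repeat=2)
    }


scores = dna_score_dict_sketch(mask_score=(0, 1))
print(scores[("N", "A")], scores[("A", "N")])
```

Precomputing the dictionary turns each pairwise comparison into a single lookup keyed by (char1, char2), and an asymmetric mask_score lets an N in one input be penalized while an N in the other is tolerated.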

presto.Sequence.scoreSeqPair(seq1, seq2, ignore_chars=set(), score_dict=getDNAScoreDict())

Determine the error rate for a pair of sequences

Parameters:
    seq1 – SeqRecord object
    seq2 – SeqRecord object
    ignore_chars – Set of characters to ignore when scoring and counting the weight
    score_dict – Optional dictionary of alignment scores
Returns:	Tuple of the (score, minimum weight, error rate) for the pair of sequences
Return type:	tuple
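
One plausible way the three returned values relate is sketched below on plain strings. The relationship error = 1 - score / weight is an assumption of this sketch, as are the hypothetical names weight_seq and score_pair and the fallback to exact identity scoring when no score dictionary is supplied.

```python
def weight_seq(seq, ignore_chars=frozenset()):
    """Number of characters in seq that are not ignored."""
    return sum(1 for ch in seq if ch not in ignore_chars)


def score_pair(seq1, seq2, ignore_chars=frozenset(), score_dict=None):
    """Return (score, minimum weight, error rate) for two aligned strings."""
    weight = min(weight_seq(seq1, ignore_chars), weight_seq(seq2, ignore_chars))
    pairs = [
        (a, b)
        for a, b in zip(seq1, seq2)
        if a not in ignore_chars and b not in ignore_chars
    ]
    if score_dict is None:
        # Assumption: exact identity scoring when no dictionary is given.
        score = sum(1 for a, b in pairs if a == b)
    else:
        score = sum(score_dict[(a, b)] for a, b in pairs)
    error = 1.0 - score / weight if weight > 0 else 1.0
    return score, weight, error


print(score_pair("ACGT", "ACGA"))
print(score_pair("ACNT", "ACGT", ignore_chars={"N"}))
```

Using the smaller of the two weights as the denominator means a sequence with many ignored characters is judged only on the positions it actually covers.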

presto.Sequence.subsetSeqIndex(seq_dict, field, values, delimiter=('|', '=', ', '))

Subsets a sequence index by annotation value

Parameters:
    seq_dict – Dictionary index of sequences returned from SeqIO.index()
    field – Annotation field to select keys by
    values – List of annotation values that define the retained keys
    delimiter – Tuple of delimiters for (annotations, field/values, value lists)
Returns:	List of keys
Return type:	list

presto.Sequence.subsetSeqSet(seq_iter, field, values, delimiter=('|', '=', ', '))

Subsets a sequence set by annotation value

Parameters:
    seq_iter – Iterator or list of SeqRecord objects
    field – Annotation field to select by
    values – List of annotation values that define the retained sequences
    delimiter – Tuple of delimiters for (annotations, field/values, value lists)
Returns:	Modified list of SeqRecord objects
Return type:	list

presto.Sequence.translateAmbigDNA(key)

Translates IUPAC Ambiguous Nucleotide characters to or from character sets

Parameters:	key – String or re.search object containing the character set to translate
Returns:	Character translation
Return type:	str

presto.Sequence.weightSeq(seq, ignore_chars=set())

Returns the length of a sequence excluding ignored characters

Parameters:
    seq – SeqRecord or Seq object
    ignore_chars – Set of characters to ignore when counting sequence length
Returns:	Sum of the character scores for the sequence
Return type:	int