CollapseSeq
Removes duplicate sequences from FASTA/FASTQ files
usage: CollapseSeq [-h] -s SEQ_FILES [SEQ_FILES ...] [--fasta] [--failed] [--log LOG_FILE] [--delim DELIMITER DELIMITER DELIMITER] [--outdir OUT_DIR] [--outname OUT_NAME] [--version] [-n MAX_MISSING] [--uf UNIQ_FIELDS [UNIQ_FIELDS ...]] [--cf COPY_FIELDS [COPY_FIELDS ...]] [--act {min,max,sum,set} [{min,max,sum,set} ...]] [--inner] [--keepmiss] [--maxf MAX_FIELD | --minf MIN_FIELD]
-
-h, --help  show this help message and exit
-
-s <seq_files>  A list of FASTA/FASTQ files containing sequences to process.
-
--fasta  Specify to force output as FASTA rather than FASTQ.
-
--failed  If specified, create files containing records that fail processing.
-
--log <log_file>  Specify to write verbose logging to a file. May not be specified with multiple input files.
-
--delim <delimiter>  A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
-
--outdir <out_dir>  Changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname <out_name>  Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--version  show program’s version number and exit
-
-n <max_missing>  Maximum number of missing nucleotides to consider for collapsing sequences. A sequence will be considered undetermined if it contains too many missing nucleotides.
-
--uf <uniq_fields>  Specifies a set of annotation fields that must match for sequences to be considered duplicates.
-
--cf <copy_fields>  Specifies a set of annotation fields to copy into the unique sequence output.
-
--act {min,max,sum,set}  List of actions to take for each copy field, defining how each annotation will be combined into a single value. The actions “min”, “max”, and “sum” perform the corresponding mathematical operation on numeric annotations. The action “set” collapses annotations into a comma-delimited list of unique values.
-
--inner  If specified, exclude consecutive missing characters at either end of the sequence.
-
--keepmiss  If specified, sequences with more missing characters than the threshold set by the -n parameter will be written to the unique sequence output file with a DUPCOUNT=1 annotation. If not specified, such sequences will be written to a separate file.
-
--maxf <max_field>  Specify the field whose maximum value determines the retained sequence; mutually exclusive with --minf.
-
--minf <min_field>  Specify the field whose minimum value determines the retained sequence; mutually exclusive with --maxf.
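The four --act actions described above can be illustrated with a minimal Python sketch. This is not the tool's own implementation: the function name `combine_annotations` is hypothetical, numeric values are assumed to parse as floats, and the ordering of values produced by "set" (first seen first) is an assumption.

```python
def combine_annotations(values, action):
    """Combine the copy-field values from one set of duplicates into a single value.

    Hypothetical helper for illustration only; not part of CollapseSeq.
    """
    if action == "min":
        return min(float(v) for v in values)
    if action == "max":
        return max(float(v) for v in values)
    if action == "sum":
        return sum(float(v) for v in values)
    if action == "set":
        # Unique values in first-seen order (ordering is an assumption),
        # joined into a comma-delimited list.
        return ",".join(dict.fromkeys(str(v) for v in values))
    raise ValueError(f"unknown action: {action!r}")


# Three duplicate reads carrying a numeric field and a categorical field.
counts = ["2", "3", "1"]
primers = ["IGHM", "IGHG", "IGHM"]

total = combine_annotations(counts, "sum")
smallest = combine_annotations(counts, "min")
primer_set = combine_annotations(primers, "set")
```

Here `total` is the summed count across the duplicates, while `primer_set` keeps each distinct categorical value once.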
- output files:
- collapse-unique
- unique sequences. Contains one representative from each set of duplicate sequences. The retained representative is determined by user-defined criteria.
- collapse-duplicate
- raw reads which are duplicates of the sequences retained in the collapse-unique file.
- collapse-undetermined
- raw reads which were excluded from consideration due to having too many N characters in the sequence.
- output annotation fields:
- DUPCOUNT
- total number of sequences within the set of duplicates for each retained unique sequence, i.e. the copy number of each unique sequence within the data file.
- <user defined>
- annotation fields specified by the --cf parameter.
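How the three output files and the DUPCOUNT field relate can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the tool's algorithm: the function `collapse` is hypothetical, duplicates are detected by exact string match only, the first record seen is kept as the representative (the real tool selects it by user-defined criteria such as --maxf or --minf), and only N is counted as a missing character.

```python
def collapse(records, max_missing=0):
    """Split (id, sequence) records into unique, duplicate and undetermined sets.

    Hypothetical simplification: exact-match duplicates, first record retained.
    """
    groups = {}          # sequence -> list of records sharing that sequence
    undetermined = []    # records with too many N characters

    for rec_id, seq in records:
        if seq.upper().count("N") > max_missing:
            undetermined.append((rec_id, seq))
            continue
        groups.setdefault(seq.upper(), []).append((rec_id, seq))

    unique, duplicate = [], []
    for members in groups.values():
        rep_id, rep_seq = members[0]
        # DUPCOUNT is the size of the duplicate set, including the representative.
        unique.append((rep_id, rep_seq, len(members)))
        duplicate.extend(members[1:])
    return unique, duplicate, undetermined


reads = [("r1", "ACGT"), ("r2", "ACGT"), ("r3", "ACNN"), ("r4", "TTTT")]
unique, duplicate, undetermined = collapse(reads, max_missing=1)
```

With these inputs, `unique` corresponds to collapse-unique (each entry carrying its DUPCOUNT), `duplicate` to collapse-duplicate, and `undetermined` to collapse-undetermined.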