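Overview
Scope and Features
pRESTO performs all stages of sequence processing prior to alignment against germline reference sequences. The toolkit is easy to use, but some familiarity with command-line programs is expected. Rather than providing a fixed solution for a small number of common workflows, we have designed pRESTO to be as flexible as possible. This design philosophy makes pRESTO suitable for many existing protocols and adaptable to future technologies, but it requires users to assemble a series of commands and parameters tailored to their experimental protocol.
pRESTO is composed of a set of standalone tools that each perform a specific task, often through a series of subcommands with different behaviors. The table below gives a brief description of each tool.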
| Tool | Subcommand | Description |
|---|---|---|
| AlignSets | | Performs multiple alignment on sets of sequences sharing the same annotation |
| | muscle | Uses the program MUSCLE to align reads |
| | offset | Uses a table of primer alignments to align the 5’ region |
| | table | Creates a table of primer alignments for the offset subcommand |
| AssemblePairs | | Assembles paired-end reads into a complete sequence |
| | align | Assembles paired-end reads by aligning the sequence ends |
| | join | Concatenates paired-end reads with intervening gaps |
| | reference | Assembles paired-end reads using V-segment references |
| BuildConsensus | | Constructs UMI consensus sequences |
| ClusterSets | | Clusters UMI read groups into smaller sub-groups |
| CollapseSeq | | Removes duplicate sequences |
| ConvertHeaders | | Converts sequence headers to the pRESTO format |
| | generic | Converts sequence headers with an unknown annotation system |
| | 454 | Converts Roche 454 sequence headers |
| | genbank | Converts NCBI GenBank and RefSeq sequence headers |
| | illumina | Converts Illumina sequence headers |
| | imgt | Converts sequence headers output by IMGT/GENE-DB |
| | sra | Converts NCBI SRA sequence headers |
| EstimateError | | Estimates error rates for UMI data |
| FilterSeq | | Removes or modifies low quality reads |
| | length | Removes sequences under a defined length |
| | maskqual | Masks low Phred quality score positions with Ns |
| | missing | Removes sequences with a high number of Ns |
| | repeats | Removes sequences with long repeats of a single nucleotide |
| | quality | Removes sequences with low Phred quality scores |
| | trimqual | Trims sequences to segments with high Phred quality scores |
| MaskPrimers | | Identifies and removes primer regions, MIDs and UMI barcodes |
| | align | Matches primers by local alignment and reorients sequences |
| | score | Matches primers at a fixed user-defined start position |
| PairSeq | | Sorts paired-end reads and copies annotations between them |
| ParseHeaders | | Manipulates sequence annotations |
| | add | Adds a field and value annotation pair to all reads |
| | collapse | Compresses a set of annotation fields into a single field |
| | copy | Copies values between annotation fields |
| | delete | Deletes an annotation from all reads |
| | expand | Expands a field with multiple values into separate annotations |
| | rename | Renames annotation fields |
| | table | Outputs sequence annotations as a data table |
| ParseLog | | Converts the log output of pRESTO scripts into data tables |
| SplitSeq | | Performs conversion, sorting, and subsetting of sequence files |
| | count | Splits files into smaller files |
| | group | Splits files based on numerical or categorical annotation |
| | sample | Randomly samples sequences from a file |
| | samplepair | Randomly samples paired-end reads from two files |
| | sort | Sorts sequences based on annotations |
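Input/Output
All tools take standard FASTA or FASTQ formatted files as input and output files in the same formats. This allows pRESTO to work seamlessly with other sequence analysis tools that use either format; if desired, any step within a pRESTO workflow can be replaced by another tool.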
Each tool appends a specific suffix to its output files describing the step and
output. For example, MaskPrimers will append _primers-pass to the output
file containing successfully aligned sequences and _primers-fail to the file
containing unaligned sequences.
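The suffix convention can be illustrated with a short sketch. It assumes the suffix is placed between the base file name and the extension; the helper `suffixed_name` and the file name `sample.fastq` are hypothetical and are not part of pRESTO.

```python
from pathlib import Path


def suffixed_name(input_file: str, suffix: str) -> str:
    """Return input_file with suffix inserted before the file extension.

    Assumption: the suffix goes between the base name and the extension.
    This is an illustrative helper, not pRESTO code.
    """
    path = Path(input_file)
    return str(path.with_name(f"{path.stem}{suffix}{path.suffix}"))


# Names a MaskPrimers-style step might produce for a hypothetical input file
print(suffixed_name("sample.fastq", "_primers-pass"))
print(suffixed_name("sample.fastq", "_primers-fail"))
```

The suffixes used by each tool are listed in that tool's own usage documentation, so the values above should be checked there before being used in a script.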
See also
Details regarding the suffixes used by pRESTO tools can be found in the Commandline Usage documentation for each tool.
Annotation Scheme
The majority of pRESTO tools manipulate and add sequence-specific annotations
as part of their processing functions using the scheme shown below. Each
annotation is delimited using a reserved character (| by default), with the
annotation field name and values separated by a second reserved character
(= by default), and each value within a field is separated by a third
reserved character (, by default). These annotations follow the sequence
identifier, which itself immediately follows the > (FASTA) or @ (FASTQ)
symbol denoting the beginning of a new sequence entry. The sequence identifier
is given the reserved field name ID. To mitigate potential analysis
errors, each pRESTO tool appends values to an annotation field when that field
already exists, and will not overwrite or delete annotations unless explicitly
instructed to do so through the ParseHeaders tool. All reserved characters
can be redefined using command line options.
>SEQUENCE_ID|PRIMER=IgHV-6,IgHC-M|BARCODE=DAY7|DUPCOUNT=8
NNNNCCACGATTGGTGAAGCCCTCGCAGACCCTCTCACTCACCTGTGCCATCTCCGGGGACAGTGTTTCTACCAAAA
@SEQUENCE_ID|PRIMER=IgHV-6,IgHC-M|BARCODE=DAY7|DUPCOUNT=8
NNNNCCACGATTGGTGAAGCCCTCGCAGACCCTCTCACTCACCTGTGCCATCTCCGGGGACAGTGTTTCTACCAAAA
+
!!!!nmoomllmlooj\Xlnngookkikloommononnoonnomnnlomononoojlmmkiklonooooooooomoo
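As a minimal sketch of how this scheme can be read and extended outside of pRESTO, the Python below parses a header line into fields and values using the default reserved characters, then appends a value to an existing field rather than overwriting it. The function names (`parse_header`, `format_header`, `add_annotation`) and the added value `DAY14` are hypothetical illustrations, not pRESTO's own API or data.

```python
def parse_header(header, field_delim="|", value_delim="=", list_delim=","):
    """Parse a pRESTO-style header line into a dict of {field: [values]}."""
    text = header.strip()
    # Drop the leading > (FASTA) or @ (FASTQ) record marker if present
    if text[:1] in (">", "@"):
        text = text[1:]
    parts = text.split(field_delim)
    # The sequence identifier comes first and gets the reserved name ID
    annotations = {"ID": [parts[0]]}
    for part in parts[1:]:
        field, _, value = part.partition(value_delim)
        annotations.setdefault(field, []).extend(value.split(list_delim))
    return annotations


def format_header(annotations, field_delim="|", value_delim="=", list_delim=","):
    """Rebuild the header text (without the > or @ marker)."""
    parts = [annotations["ID"][0]]
    for field, values in annotations.items():
        if field != "ID":
            parts.append(field + value_delim + list_delim.join(values))
    return field_delim.join(parts)


def add_annotation(annotations, field, value):
    """Append a value to a field, creating the field if it does not exist."""
    annotations.setdefault(field, []).append(str(value))
    return annotations


header = ">SEQUENCE_ID|PRIMER=IgHV-6,IgHC-M|BARCODE=DAY7|DUPCOUNT=8"
ann = parse_header(header)
print(ann["PRIMER"])   # the two primer values as a list

# Appending keeps the existing BARCODE value instead of replacing it
add_annotation(ann, "BARCODE", "DAY14")
print(format_header(ann))
```

Storing every field as a list keeps single-valued and multi-valued fields uniform, which makes the append-only behavior a one-line operation.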
See also
- Details regarding the annotations added by pRESTO tools can be found in the Commandline Usage documentation for each tool.
- The ParseHeaders tool provides a number of options for manipulating annotations in the pRESTO format.
- The ConvertHeaders tool allows you to convert several common annotation schemes into the pRESTO annotation format.