Ideas/To Do
This is a rather unsorted list of features that would be nice to have, of things that could be improved in the source code, and of possible algorithmic improvements.
- show average error rate
- In colorspace and probably also for Illumina data, gapped alignment is not necessary
- --progress
- run pylint, pychecker
- length histogram
- check whether input is FASTQ although -f fasta is given
- search for adapters in the order in which they are given on the command line
- more tests for the alignment algorithm
- deprecate --rest-file
- --detect prints out the best guess as to which of the given adapters is the correct one
- alignment algorithm: make a ‘banded’ version
- it seems the str.find optimization isn’t very helpful. In any case, it should be moved into the Aligner class.
- allow removing not the adapter itself, but the sequence before or after it
- convert adapter to lowercase
- warn when given adapter sequence contains non-IUPAC characters
- try multithreading again, this time use os.pipe()
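The non-IUPAC warning from the list above could look roughly like the following. This is only a sketch: the names `IUPAC_CODES`, `non_iupac_characters` and `check_adapter` are made up for illustration and do not exist in the code base, and the check is assumed to be case-insensitive.

```python
import warnings

# IUPAC nucleotide codes: the four bases, U for RNA, and the ambiguity codes.
IUPAC_CODES = frozenset("ACGTURYSWKMBDHVN")


def non_iupac_characters(adapter):
    """Return a sorted list of the characters in `adapter` that are not IUPAC codes."""
    return sorted(set(adapter.upper()) - IUPAC_CODES)


def check_adapter(adapter):
    """Warn if `adapter` contains non-IUPAC characters; return True if it is clean."""
    bad = non_iupac_characters(adapter)
    if bad:
        warnings.warn(
            "adapter %r contains non-IUPAC characters: %s" % (adapter, ", ".join(bad))
        )
    return not bad
```

Calling `check_adapter` once per adapter while parsing the command line would be enough, since the sequences do not change afterwards.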
Specifying adapters
The idea is to deprecate the -b and -g parameters. Only -a is used, with a special syntax for each adapter type. This makes it a bit easier to add new adapter types in the future.
adapter type | current syntax | proposed syntax |
back | -a ADAPTER | -a ADAPTER or -a ...ADAPTER |
suffix | -a ADAPTER$ | -a ...ADAPTER$ |
front | -g ADAPTER | -a ADAPTER... |
prefix | -g ^ADAPTER | -a ^ADAPTER... |
anywhere | -b ADAPTER | -a ...ADAPTER... ??? |
paired | (not implemented) | -a ADAPTER...ADAPTER or -a ^ADAPTER...ADAPTER |
Or add only -a ADAPTER... as an alias for -g ^ADAPTER and -a ...ADAPTER as an alias for -a ADAPTER.
The ... would be equivalent to N* as in regular expressions.
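To see whether the proposed -a syntax can be parsed unambiguously, here is a minimal sketch that maps a spec string to one of the adapter types from the table. The function name `classify_adapter` and the returned type labels are assumptions made for this example. The "anywhere" form is still marked with question marks above, so that branch is especially tentative.

```python
def classify_adapter(spec):
    """Map a proposed -a argument to (adapter_type, sequences).

    Sketch only: the rules follow the table of proposed syntax, nothing more.
    """
    leading = spec.startswith("...")
    if leading:
        spec = spec[3:]
    trailing = spec.endswith("...")
    if trailing:
        spec = spec[:-3]
    if not spec:
        raise ValueError("no adapter sequence given")

    if "..." in spec:
        # Two sequences joined by '...': the (not yet implemented) paired type.
        if leading or trailing:
            raise ValueError("unsupported combination of '...'")
        first, second = spec.split("...", 1)
        if not first or not second:
            raise ValueError("both adapter sequences must be non-empty")
        # A '^' anchor on the first sequence is kept as part of it.
        return "paired", (first, second)

    if leading and trailing:
        return "anywhere", (spec,)
    if trailing:
        if spec.startswith("^"):
            return "prefix", (spec[1:],)
        return "front", (spec,)
    # No trailing '...': a 3' adapter, optionally anchored with '$'.
    if spec.endswith("$"):
        return "suffix", (spec[:-1],)
    return "back", (spec,)
```

For example, `classify_adapter("^ADAPTER...")` yields the prefix type, and both `"ADAPTER"` and `"...ADAPTER"` yield the back type, matching the two spellings listed in the table.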
Another idea: Allow something such as -a ADAP$TER or -a ADAPTER$NNN. This would be a way to specify less strict anchoring.
Make it possible to specify that the rightmost or leftmost match should be picked. Default right now: Leftmost, even for -g adapters.
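For exact matches, the difference between the two choices can be illustrated with str.find and str.rfind. This is a simplified sketch that ignores errors and partial matches; `find_adapter` and its `rightmost` parameter are hypothetical names, not part of the Aligner class.

```python
def find_adapter(read, adapter, rightmost=False):
    """Return the start of an exact occurrence of `adapter` in `read`, or -1.

    With rightmost=False the leftmost occurrence is reported,
    otherwise the rightmost one.
    """
    if rightmost:
        return read.rfind(adapter)
    return read.find(adapter)
```

With a read that contains the adapter twice, the two settings report different start positions, which changes how much of the read is removed.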
Allow N{3,10} as in regular expressions (for a variable-length sequence).
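One way to explore the N{3,10} idea is to translate an adapter spec into a Python regular expression. The sketch below does that for exact matching only (no errors allowed), which is not how the alignment algorithm works. The names `IUPAC_TO_CLASS`, `TOKEN` and `adapter_to_regex` are invented for this example, and only DNA codes are handled.

```python
import re

# Each IUPAC DNA code and the regex fragment matching the bases it stands for.
IUPAC_TO_CLASS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

# One letter, optionally followed by a {m} or {m,n} repeat count.
TOKEN = re.compile(r"([A-Z])(\{\d+(?:,\d+)?\})?")


def adapter_to_regex(spec):
    """Compile an adapter spec such as 'ACGN{3,10}TT' into a regex (sketch)."""
    spec = spec.upper()
    parts = []
    pos = 0
    for match in TOKEN.finditer(spec):
        if match.start() != pos:
            # Something between two tokens did not parse.
            raise ValueError("cannot parse adapter at position %d" % pos)
        letter, repeat = match.group(1), match.group(2) or ""
        if letter not in IUPAC_TO_CLASS:
            raise ValueError("not an IUPAC DNA code: %r" % letter)
        parts.append(IUPAC_TO_CLASS[letter] + repeat)
        pos = match.end()
    if pos != len(spec):
        raise ValueError("cannot parse adapter at position %d" % pos)
    return re.compile("".join(parts))
```

Because the repeat count is passed through unchanged, the same syntax would also work for other letters, for example A{10} for a poly-A stretch.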
Paired-end trimming
- Could also use a paired-end read merger, then remove adapters with -a and -g
- Should minimum overlap be sum of the two overlaps in each read?
Single-letter command-line options
Remaining characters: All uppercase letters except A, B, G, M, N, O.
Lowercase letters: i, j, k, l, s, w