ULYSSES : Accurate detection of structural variants from large insert mate-pair next generation sequencing
Version 1.0, September 2014
Contacts: Alexandre Gillet-Markowska and Ingrid Lafontaine
alexandre.gillet-markowska / at / upmc.fr
ingrid.lafontaine / at / upmc.fr
Laboratoire de Biologie Computationnelle et Quantitative.
UPMC UFR 927. CNRS UMR 7238
If you use Ulysses for your publications, please read and cite:
Alexandre Gillet-Markowska, Hugues Richard, Gilles Fischer, Ingrid Lafontaine.
Ulysses: accurate detection of low-frequency structural variations in large insert-size sequencing libraries.
Bioinformatics 2015; doi: 10.1093/bioinformatics/btu730
Install R
sudo apt-get update
sudo apt-get install r-base
Install R “qvalue” and “IRanges” libraries
sudo R
source("http://bioconductor.org/biocLite.R")
biocLite("IRanges")
biocLite("qvalue")
Install the Python “pysam” API:
sudo pip install pysam
The Python NetworkX extension is also required:
sudo pip install networkx
The Python "NumPy" extension is also required:
sudo apt-get install python-numpy
Usage example
wget https://github.com/gillet/ulysses/archive/ulysses-v1.0.zip
unzip ulysses-v1.0.zip
cd ulysses-master/example
../ReadBAM.py example.bam -p example_params
../Ulysses.py -p example_params
ULYSSES detects the full spectrum of structural variants (SV) from a paired-sequence next generation sequencing library, including deletions (DEL), segmental duplications (DUP), inversions (INV), insertions (INS), reciprocal translocations (RT) and non-reciprocal translocations (NRT).
ULYSSES takes as input a library of aligned read-pairs (RP) in SAM/BAM format. It is composed of 2 main parts (part 1: library parsing and part 2: SV detection).
You can download the Ulysses code from GitHub, or directly at the command line:
wget https://github.com/gillet/ulysses/archive/ulysses-v1.0.zip
Please see the official installation instructions for your operating system, following the links given below. Detailed instructions for R and the required R packages are given at the end of the manual.
R packages:
IRanges and qvalue packages from Bioconductor http://www.bioconductor.org/
See http://www.bioconductor.org/install/index.html#install-bioconductor-packages
Three additional Python extensions are required: “pysam”: https://github.com/pysam-developers/pysam,
"NumPy": http://www.numpy.org/ and NetworkX.
Ulysses is written in Python 2.7 and in R version 2.14. It has been tested only on Linux computers running Ubuntu.
Unzip the archive. A directory "ulysses-master" containing the Ulysses scripts will be created:
unzip ulysses-v1.0.zip
ULYSSES takes as input a library of paired sequences mapped onto a reference genome, in BAM/SAM format. The BAM/SAM file must contain a header section and be sorted by read coordinates.
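The two input requirements above (a header section and coordinate sorting) can be checked before launching the program. The following is a minimal sketch, not part of Ulysses: it assumes the header lines are already available as text (for a BAM file they can be printed with `samtools view -H`), and the function names are hypothetical. It relies only on the SAM convention that the `@HD` line carries an `SO:` tag.

```python
def sam_sort_order(header_lines):
    """Return the SO (sort order) value of the @HD header line, or None."""
    for line in header_lines:
        if line.startswith("@HD"):
            # Header fields are tab-separated TAG:VALUE pairs.
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
    return None


def looks_usable(header_lines):
    """True if a header is present and declares coordinate sorting."""
    return sam_sort_order(header_lines) == "coordinate"


# Made-up header used only for illustration.
header = [
    "@HD\tVN:1.4\tSO:coordinate",
    "@SQ\tSN:chrA\tLN:200000",
]
print(looks_usable(header))
```

A file sorted by coordinate without an `SO:` tag would be rejected by this check, so treat a negative answer as a prompt to inspect the file rather than as proof that it is unsorted.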
A parameter file is required to launch the program. A default file (ulysses_params) is proposed. Users can set some parameter values by modifying them in the file or with the command line options.
parameter | type | default value | description and notes
in | string | none | Name of the input BAM/SAM file
mapq | integer | 20 | Minimum read mapping quality score. Both reads of an RP must have a mapping quality score > mapq to be considered
out | string | in+"_Ulysses" | Prefix of the detection output files. The file names are completed by the type of detected SV and the type of result (by SV unit or by RP unit)
stats | string | in+"_stats.txt" | Created by ULYSSES. Contains the information required for launching the detection and setting the detection parameters (see the description of the statistics file)
range | string | all | Chromosomes on which the detection must be performed. The default value "all" runs the detection on all chromosomes. To restrict the detection, indicate the chromosome IDs: use a dash to define a continuous series (1-5 for chromosomes 1, 2, 3, 4 and 5) and a comma to separate chromosomes or series (A,D,E,G-H). There is no need to specify the inter-chromosomal pairs because they are all considered for detection.
Nsv | integer | 10000 | Maximum number of detected SV
fdr | real | 0.01 | Upper threshold for statistical significance of a given SV candidate
n | integer | 6 | Multiplicative factor for MAD, used to define detection parameters dn and ISCn
annotation | string | N/A | (optional) GFF annotation file. Improves the distinction between reciprocal translocations and inversions.
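To make the syntax of the range parameter concrete, here is a small sketch of how such a string could be expanded into a list of chromosome IDs. It is an illustration only, not the parser used by Ulysses: the function name is hypothetical, the dash is assumed to be inclusive on both ends, and chromosome IDs are assumed not to contain dashes or commas themselves.

```python
def expand_range(spec, chromosomes):
    """Expand a range string such as "A,D,E,G-H" into chromosome IDs.

    `chromosomes` is the ordered list of IDs (for example in BAM header
    order). A dash selects every ID between its endpoints, inclusive.
    """
    if spec == "all":
        return list(chromosomes)
    selected = []
    for item in spec.split(","):
        item = item.strip()
        if "-" in item:
            first, last = item.split("-", 1)
            # list.index raises ValueError for an unknown chromosome ID.
            start = chromosomes.index(first.strip())
            end = chromosomes.index(last.strip())
            block = chromosomes[start:end + 1]
        else:
            if item not in chromosomes:
                raise ValueError("unknown chromosome: %s" % item)
            block = [item]
        for chrom in block:
            if chrom not in selected:  # keep each chromosome once
                selected.append(chrom)
    return selected


print(expand_range("A,D,E,G-H", list("ABCDEFGH")))
print(expand_range("1-5", ["1", "2", "3", "4", "5", "6"]))
```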
Before performing the detection of the structural variants (SV), ULYSSES reads and parses the original BAM/SAM file to estimate statistics on the library (mean, MAD and standard deviation of the RP insert sizes) and to select the discordant RP potentially describing an SV. It then creates one temporary BAM file per chromosome, containing the selected discordant RP for intra-chromosomal SV candidates, and one BAM file containing the selected discordant RP for inter-chromosomal SV.
To launch this step, run the ReadBAM program:
[PATH]/ReadBAM.py bamfile_name
Replace [PATH] with the path to the directory ULYSSES.v1.0. To run the program from within the directory ULYSSES.v1.0, the command line becomes:
./ReadBAM.py bamfile_name
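The library statistics estimated during this parsing step (mean, median, MAD and standard deviation of the insert sizes) can be illustrated with the Python standard library. The sketch below is not the Ulysses implementation. In particular, the exact definitions of dn and ISCn are not given in this manual, so the interval `median ± n × MAD` is only an assumed example of how a MAD multiplied by n can separate concordant from discordant insert sizes; the function names and the insert sizes are made up.

```python
from statistics import mean, median, stdev


def insert_size_stats(insert_sizes):
    """Summarise insert sizes: mean, median, MAD and standard deviation."""
    med = median(insert_sizes)
    # MAD: median of the absolute deviations from the median.
    mad = median([abs(size - med) for size in insert_sizes])
    return {
        "mean": mean(insert_sizes),
        "median": med,
        "mad": mad,
        "stdev": stdev(insert_sizes),
    }


def assumed_concordant_interval(stats, n=6):
    """Assumed rule for illustration: median - n*MAD to median + n*MAD."""
    return (stats["median"] - n * stats["mad"],
            stats["median"] + n * stats["mad"])


sizes = [4900, 5000, 5100, 5050, 4950, 12000]  # made-up insert sizes
stats = insert_size_stats(sizes)
low, high = assumed_concordant_interval(stats, n=6)
outliers = [s for s in sizes if not low <= s <= high]
print(stats["median"], stats["mad"], (low, high), outliers)
```

The median and MAD are barely affected by the single very large insert size, which is why robust statistics are convenient for libraries that contain discordant pairs.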
Optional arguments for ReadBAM
option | type | default value | description and notes
p | string | ulysses_params | Name of the parameter file. If the file does not exist, it is created with the default parameters.
mapq | integer | 20 | Minimum read mapping quality score. Both reads of an RP must have a mapping quality score > mapq to be considered
n | integer | 6 | Multiplicative factor for MAD, used to define detection parameters dn and ISCn
out | string | in+"_ulysses" | Prefix of the detection output files. The file names are completed by the type of detected SV and the type of result (by SV unit or by RP unit)
stats | string | in+"_stats.txt" | Created by ULYSSES. Contains the information required for launching the detection and setting the detection parameters (see the description of the statistics file)
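The mapq rule in the table is strict: both reads of a pair must score above the threshold. A minimal sketch of that rule follows; the function name and the read-pair tuples are hypothetical and serve only as an illustration.

```python
def passes_mapq(mapq_read1, mapq_read2, mapq=20):
    """Keep a read pair only if BOTH reads have mapping quality > mapq."""
    return mapq_read1 > mapq and mapq_read2 > mapq


# (name, MAPQ of read1, MAPQ of read2), made-up values.
pairs = [("rp1", 60, 60), ("rp2", 60, 20), ("rp3", 21, 37)]
kept = [name for name, q1, q2 in pairs if passes_mapq(q1, q2)]
print(kept)  # ['rp1', 'rp3']
```

With the default value of 20, a pair in which one read has a quality of exactly 20 is discarded, because the comparison is "greater than" and not "greater than or equal to".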
To launch this step, run the Ulysses.py program:
Default:
[PATH]/Ulysses.py
Replace [PATH] with the path to the directory ULYSSES.v1.0.
To run the program from within the directory ULYSSES.v1.0, the command line becomes:
./Ulysses.py
Command line options for ULYSSES
option | type | default value | description and notes
n | integer | 6 | Multiplicative factor for MAD, used to define detection parameters dn and ISCn
p | string | ulysses_params | Name of the parameter file.
out | string | in+"_ulysses" | Prefix of the detection output files. The file names are completed by the type of detected SV and the type of result (by SV unit or by RP unit)
stats | string | in+"_stats.txt" | Created by ULYSSES. Contains the information required for launching the detection and setting the detection parameters (see the description of the statistics file)
range | string | all | Chromosomes on which the detection must be performed. The default value "all" runs the detection on all chromosomes. To restrict the detection, indicate the chromosome IDs: use a dash to define a continuous series (1-5 for chromosomes 1, 2, 3, 4 and 5) and a comma to separate chromosomes or series (A,D,E,G-H). There is no need to specify the inter-chromosomal pairs because they are all considered for detection. (TODO: perhaps keep an option to restrict the pairs.)
Nsv | integer | 10000 | Maximum number of detected SV
fdr | real | 0.01 | Upper threshold for statistical significance of a given SV candidate
a | string | N/A | (optional) GFF annotation file. Improves the distinction between reciprocal translocations and inversions.
typesv | string | ALL | DUP, DEL, INV, INTER (detection for the given SV type only) or ALL (detection for all SV types). "--stats" can be appended to any SV type to re-launch the statistical analysis (e.g. DUP--stats); alternatively, use "-statsmod True".
statsmod | string | FALSE | If TRUE, only the statistical validation module is run on the detected SV candidates
vcf | string | FALSE | Providing a library name activates VCF output.
Note that the parsing is performed with the specified value of the parameter n. If you want to run the detection with another value of n, the parsing step must be done again.
You can run the detection step several times from the same parsing step if:
You want to modify the Nsv parameter.
You want to perform the detection on some chromosomes.
You want to perform the detection of the SV of your choice.
You want to run only the statistical validation module, for instance after modifying the fdr parameter. In that case, add "--stats" to the typesv value.
For example:
[PATH]/Ulysses.py -typesv DEL--stats -fdr 0.001
Four output files are created for each SV type: duplication (DUP), deletion (DEL), small insertion (sINS), inversion (INV), non-reciprocal translocation (NRT) and reciprocal translocation (RT).
Two output files give the SV candidate properties, for all detected SV (_bySV.csv) and for only those SV that passed the statistical validation tests (_bySV.stats.csv). Similarly, two output files contain both the SV properties and the description of each corresponding RP, for all SV candidates (_byRP.csv) and for only those SV that passed the statistical validation tests (_byRP.stats.csv).
VCF output can optionally be produced if the "-vcf LIB_NAME" option is specified.
The annotation file is optional. If given, ULYSSES will use the position of the centromere to better distinguish inversions from reciprocal translocations.
The accepted format is GFF (see http://www.sanger.ac.uk/resources/software/gff/spec.html)
The fields of a GFF file are:
<seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments]
If you have an annotation file in EMBL or GENBANK format, you can convert it online:
http://www.ebi.ac.uk/Tools/sfc/readseq/,
or within a genome browser like Artemis :
http://www.sanger.ac.uk/resources/software/artemis/
If you have an annotation file in BED format, you can convert it in Galaxy http://main.g2.bx.psu.edu/
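The GFF fields listed above are tab-separated, so centromere positions can be pulled out of an annotation file with a few lines of Python. This is a sketch under stated assumptions, not the code used by Ulysses: the function names are hypothetical, and the feature name "centromere" is an assumption about how centromeres are labelled in your annotation file. The example line is made up.

```python
from collections import namedtuple

GffRecord = namedtuple(
    "GffRecord",
    "seqname source feature start end score strand frame attributes",
)


def parse_gff_line(line):
    """Parse one GFF line; return None for blank lines and comments."""
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 tab-separated fields")
    fields += [""] * (9 - len(fields))  # the attributes field is optional
    seqname, source, feature, start, end, score, strand, frame, attrs = fields[:9]
    return GffRecord(seqname, source, feature, int(start), int(end),
                     score, strand, frame, attrs)


def centromeres(lines, feature_name="centromere"):
    """Map each sequence name to the (start, end) of its centromere."""
    found = {}
    for line in lines:
        record = parse_gff_line(line)
        if record is not None and record.feature == feature_name:
            found[record.seqname] = (record.start, record.end)
    return found


example = [
    "##gff-version 2",
    "chrA\texample\tgene\t100\t900\t.\t+\t.\tID=geneA",
    "chrA\texample\tcentromere\t1000\t1100\t.\t+\t.\tID=CEN_A",
]
print(centromeres(example))
```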
A statistics file is automatically created during the parsing of the library.
It contains the following information:
Information | Description
RP_type | Type of the paired sequences (mate pairs or paired ends) in the library
Read_length | Length of the reads
Chromosome_prefix | Prefix of the chromosome IDs
Mean | Mean of the insert sizes (IS)
Median | Median of the IS
MAD | Median absolute deviation of the IS
stdev | Standard deviation of the IS
nDUP | Number of discordant RP compatible with a duplication
nDEL | Number of discordant RP compatible with a deletion
nINV | Number of discordant RP compatible with an inversion
nInter | Number of discordant RP compatible with an inter-chromosomal SV
genome_length | Genome length retrieved from the BAM file header
chromosome name / length | List of all chromosomes and their lengths retrieved from the BAM file header
The output files describing one SV per line (_bySV.csv and _bySV.stats.csv) contain the first 18 columns described in the table below.
The output files describing one RP of each SV per line (_byRP.csv and _byRP.stats.csv) contain all the columns described in the table below.
The .stats.csv files contain only those SV that passed the statistical validation steps.
Column | Name | Description
1 | Library | Name of the library
2 | (pair of) chromosome(s) | Chromosome (or chromosome pair for inter-chromosomal SV)
3 | ID | ID of the SV (by chromosome)
4 | nbRP | Number of RP describing the SV
5 | nbA | Number of RP describing the SV on junction A
6 | nbB | Number of RP describing the SV on junction B (only for SV with 2 junctions; set to -1 for SV with 1 junction)
7 | left_borderA | 5' border of the SV on junction A
8 | right_borderA | 3' border of the SV on junction A
9 | deltaA | SV range on junction A between the right and left borders
10 | cen_posA | Four-character code giving the position of the reads with respect to the centromere (L for left arm and R for right arm) and their strand orientation (+/-). The first two characters describe read1 and the last two describe read2
11 | left_borderB | 5' border of the SV on junction B
12 | right_borderB | 3' border of the SV on junction B
13 | deltaB | SV range on junction B between the right and left borders
14 | cen_posB | Four-character code giving the position of the reads with respect to the centromere (L for left arm and R for right arm) and their strand orientation (+/-). The first two characters describe read1 and the last two describe read2
15 | SV_size_min | Minimum estimated SV size
16 | SV_size_max | Maximum estimated SV size
17 | p-balanced | For SV with two junctions only. P-value of the binomial test (see methods)
18 | cov | Local sequence coverage
19 | AvrQual | Average RP quality
20 | p-value | P-value of the SV statistical validation test
21 | RP | Read-pair name
22 | str1 | Strand orientation of read1
23 | chr1 | Chromosome of read1
24 | pos1 | Coordinate of read1
25 | str2 | Strand orientation of read2
26 | chr2 | Chromosome of read2
27 | pos2 | Coordinate of read2
28 | L | Insert size
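As an illustration of how the column numbers in the table can be used, the sketch below reads a _byRP file and compares, for each SV, the declared nbRP (column 4) with the number of RP lines actually present. It is not part of Ulysses. The field delimiter and the presence of a header row are not specified in this manual, so both are exposed as parameters with assumed defaults, and the function name is hypothetical.

```python
import csv
import io
from collections import Counter


def check_rp_counts(handle, delimiter=",", has_header=True):
    """Return {(chromosome, sv_id): (declared, observed)} for the SV whose
    nbRP column differs from the number of RP rows found in the file."""
    reader = csv.reader(handle, delimiter=delimiter)
    if has_header:
        next(reader, None)  # skip the assumed header row
    observed = Counter()
    declared = {}
    for row in reader:
        if len(row) < 4:
            continue  # ignore blank or truncated lines
        key = (row[1], row[2])       # columns 2 and 3: chromosome(s), SV ID
        observed[key] += 1           # one row per RP
        declared[key] = int(row[3])  # column 4: nbRP
    return {key: (declared[key], observed[key])
            for key in declared if declared[key] != observed[key]}


# Made-up rows, cut to the first four columns for brevity.
sample = io.StringIO(
    "Library,chromosomes,ID,nbRP\n"
    "lib1,chr1,1,2\n"
    "lib1,chr1,1,2\n"
    "lib1,chr2,1,3\n"
    "lib1,chr2,1,3\n"
)
mismatches = check_rp_counts(sample)
print(mismatches)
```

The SV ID is only unique within a chromosome (or chromosome pair), which is why the grouping key combines columns 2 and 3.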
3) R
Linux Ubuntu
sudo apt-get update
sudo apt-get install r-base r-base-dev
Mac OS X
See cran.r-project.org/doc/manuals/R-admin.html#Installing-R-under-OS-X
3) IRanges and qvalue R packages
Linux Ubuntu and Mac OS X (on a terminal window)
sudo R
source("http://bioconductor.org/biocLite.R")
biocLite("IRanges")
biocLite("qvalue")
4) Three additional Python extensions are required:
“pysam”: https://github.com/pysam-developers/pysam
sudo pip install pysam
"NumPy": http://www.numpy.org/
sudo apt-get install python-numpy
and NetworkX
sudo pip install networkx