ULYSSES detects the full spectrum of structural variants (SV) from a paired-sequence next-generation sequencing library, including deletions (DEL), segmental duplications (DUP), inversions (INV), insertions (INS), reciprocal translocations (RT) and non-reciprocal translocations (NRT).

ULYSSES takes as input a library of aligned read-pairs (RP) in SAM/BAM format. It is composed of two main parts: library parsing (part 1) and SV detection (part 2).

Download

You can download Ulysses code here or directly at the command line:

wget https://github.com/gillet/ulysses/archive/ulysses-v1.0.zip

Dependencies

Please see the official installation instructions for your operating system, following the links given below. Detailed instructions for R and the required R packages are given at the end of the manual.

R: http://www.R-project.org/

R packages:

IRanges and qvalue packages from Bioconductor http://www.bioconductor.org/

See http://www.bioconductor.org/install/index.html#install-bioconductor-packages

Additional Python extensions are required: "pysam": https://github.com/pysam-developers/pysam

"NumPy": http://www.numpy.org/

and NetworkX (see the detailed installation section at the end of the manual).

Installation

ULYSSES is written in Python 2.7 and R version 2.14. It has been tested only on Linux computers running Ubuntu.

Unzip the archive. A directory "ulysses-master" containing the ULYSSES scripts will be created:

unzip ulysses-v1.0.zip

The input data FILE

ULYSSES takes as input a library of paired sequences mapped onto a reference genome, in BAM/SAM format. The file must contain a header section and be sorted by read coordinates.
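The sort-order requirement can be checked before launching ULYSSES. The following is a minimal sketch, assuming the header has been exported as text (for example with "samtools view -H") and that the file declares its sort order in the standard @HD header line (SO:coordinate). The function name is hypothetical and is not part of ULYSSES.

```python
def is_coordinate_sorted(header_lines):
    """Return True if a SAM header declares coordinate sort order.

    header_lines: iterable of SAM header lines (those starting with '@').
    Hypothetical helper, not part of ULYSSES.
    """
    for line in header_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "@HD":
            # The SO tag of the @HD line holds the declared sort order.
            return "SO:coordinate" in fields[1:]
    # No @HD line: the sort order is not declared.
    return False


# Illustrative header, not taken from a real library.
header = [
    "@HD\tVN:1.4\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
]
print(is_coordinate_sorted(header))
```

This only checks what the header declares; it does not verify that the records themselves are in coordinate order.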

The parameter FILE

A parameter file is required to launch the program. A default file (ulysses_params) is provided. Users can set parameter values by editing the file or with command line options.

parameter	type	default value	description and notes
in	string	none	Name of the input BAM/SAM file
mapq	integer	20	Minimum read mapping quality score. Both reads of an RP must have a mapping quality score > mapq to be considered
out	string	in+"_Ulysses"	Prefix of the detection output files. The file names are completed by the type of detected SV and by the result unit (by SV or by RP)
stats	string	in+"_stats.txt"	Created by ULYSSES. Contains information required for launching the detection and setting the detection parameters (see the description of the statistics file)
range	string	all	Chromosomes on which the detection must be performed. The default value "all" runs the detection on all chromosomes. To select specific chromosomes, give their IDs: use a dash to define a continuous series (1-5 for chromosomes 1, 2, 3, 4 and 5) and a comma to separate chromosomes or series (e.g. A,D,E,G-H). There is no need to specify the inter-chromosomal pairs because they are all considered for detection.
Nsv	integer	10000	Maximum number of detected SV
fdr	real	0.01	Upper threshold for statistical significance of a given SV candidate
n	integer	6	Multiplicative factor for MAD, used to define detection parameters d_n and ISC_n
annotation	string	N/A	(optional) GFF annotation file. Improves the distinction between reciprocal translocations and inversions.
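The syntax of the range parameter can be illustrated with a short sketch. It assumes that a series is inclusive of both ends and is resolved against the ordered list of chromosome IDs (for instance as found in the BAM header). The function is hypothetical and ULYSSES may resolve series differently.

```python
def expand_range(spec, chromosomes):
    """Expand a range string such as "A,D,E,G-H" into chromosome IDs.

    spec: value of the range parameter ("all" selects every chromosome).
    chromosomes: ordered list of chromosome IDs.
    Hypothetical helper; series are assumed to include both ends.
    """
    if spec == "all":
        return list(chromosomes)
    selected = []
    for item in spec.split(","):
        item = item.strip()
        if item in chromosomes:
            # Exact chromosome ID (it may itself contain a dash).
            selected.append(item)
        elif "-" in item:
            first, last = item.split("-", 1)
            start, end = chromosomes.index(first), chromosomes.index(last)
            selected.extend(chromosomes[start:end + 1])
        else:
            raise ValueError("unknown chromosome: %s" % item)
    return selected


print(expand_range("1-5", ["1", "2", "3", "4", "5", "6"]))
print(expand_range("A,D,E,G-H", list("ABCDEFGH")))
```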

USAGE

Parsing of the library in BAM/SAM format

Before detecting structural variants (SV), ULYSSES reads and parses the original BAM/SAM file to estimate library statistics (mean, MAD and standard deviation of RP insert sizes) and to select the discordant RP that potentially describe an SV. It then creates one temporary BAM file per chromosome, containing the selected discordant RP for intra-chromosomal SV candidates, and one BAM file containing the selected discordant RP for inter-chromosomal SV.
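The library statistics mentioned above can be sketched as follows (Python 3, standard library only). The cutoff "median + n * MAD" is an assumption made purely for illustration: this manual does not give the definitions of d_n and ISC_n, so the sketch should not be read as the ULYSSES selection rule. The insert sizes are invented values.

```python
import statistics


def insert_size_stats(insert_sizes, n=6):
    """Summarise insert sizes and derive an illustrative cutoff.

    The cutoff median + n * MAD is an assumption of this sketch;
    it is not the documented definition of d_n or ISC_n.
    """
    median = statistics.median(insert_sizes)
    # MAD: median of the absolute deviations from the median.
    mad = statistics.median([abs(x - median) for x in insert_sizes])
    return {
        "mean": statistics.mean(insert_sizes),
        "median": median,
        "mad": mad,
        "stdev": statistics.stdev(insert_sizes),
        "upper_cutoff": median + n * mad,
    }


sizes = [480, 495, 500, 505, 510, 520, 5000]  # invented insert sizes
stats = insert_size_stats(sizes)
print(stats["upper_cutoff"])  # 565
# RP whose insert size exceeds the illustrative cutoff.
print([x for x in sizes if x > stats["upper_cutoff"]])  # [5000]
```

The median and MAD are barely affected by the single outlier (5000), whereas the mean and standard deviation are strongly inflated by it.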

To launch this step, run the ReadBAM program:

[PATH]/ReadBAM.py bamfile_name

Replace [PATH] by the path of the directory ULYSSES.v1.0. To run the program within the directory ULYSSES.v1.0, the command line becomes:

./ReadBAM.py bamfile_name

Optional arguments for ReadBAM

option	type	default value	description and notes
p	string	ulysses_params	Name of the parameter file. If the file does not exist, it is created with default parameters.
mapq	integer	20	Minimum read mapping quality score. Both reads of an RP must have a mapping quality score > mapq to be considered
n	integer	6	Multiplicative factor for MAD, used to define detection parameters d_n and ISC_n
out	string	in+"_ulysses"	Prefix of the detection output files. The file names are completed by the type of detected SV and by the result unit (by SV or by RP)
stats	string	in+"_stats.txt"	Created by ULYSSES. Contains information required for launching the detection and setting the detection parameters (see the description of the statistics file)

Detection and statistical validation of SV

To launch this step, run the Ulysses.py program.

Default:

[PATH]/Ulysses.py

Replace [PATH] by the path of the directory ULYSSES.v1.0.

To run the program within the directory ULYSSES.v1.0, the command line becomes:

./Ulysses.py

Command line options for ULYSSES

option	type	default value	description and notes
n	integer	6	Multiplicative factor for MAD, used to define detection parameters d_n and ISC_n
p	string	ulysses_params	Name of the parameter file.
out	string	in+"_ulysses"	Prefix of the detection output files. The file names are completed by the type of detected SV and by the result unit (by SV or by RP)
stats	string	in+"_stats.txt"	Created by ULYSSES. Contains information required for launching the detection and setting the detection parameters (see the description of the statistics file)
range	string	all	Chromosomes on which the detection must be performed. The default value "all" runs the detection on all chromosomes. To select specific chromosomes, give their IDs: use a dash to define a continuous series (1-5 for chromosomes 1, 2, 3, 4 and 5) and a comma to separate chromosomes or series (e.g. A,D,E,G-H). There is no need to specify the inter-chromosomal pairs because they are all considered for detection.
Nsv	integer	10000	Maximum number of detected SV
fdr	real	0.01	Upper threshold for statistical significance of a given SV candidate
a	string	N/A	(optional) GFF annotation file. Improves the distinction between reciprocal translocations and inversions.
typesv	string	ALL	SV types to detect: DUP, DEL, INV or INTER (detection for that type only), or ALL (detection for all SV types). "--stats" can be appended to any SV type to re-launch only the statistical analysis (e.g. DUP--stats); alternatively, use "-statsmod True".
statsmod	string	FALSE	If TRUE, only the statistical validation module is run on the detected SV candidates
vcf	string	FALSE	Name of the library. Providing a library name activates VCF output.

Note that the parsing is performed with the specified value of the parameter n. If you want to run the detection with another n value, the parsing step must be done again.

You can run the detection step several times from the same parsing step if:

You want to modify the Nsv parameter.
You want to perform the detection on some chromosomes.
You want to perform the detection of the SV of your choice.
You want to run only the statistical validation module, for example after modifying the fdr parameter. In that case, add "--stats" to the typesv value.

For example:

[PATH]/Ulysses.py -typesv DEL--stats -fdr 0.001
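When the detection step is re-run many times with different settings, the command line can be assembled programmatically. The sketch below is a hypothetical helper, not part of ULYSSES. It assumes that options take a single leading hyphen, as in the "-vcf LIB_NAME" and "-statsmod True" examples of this manual, and it only builds the argument list without running anything.

```python
def ulysses_command(path, typesv="ALL", fdr=None, stats_only=False):
    """Build the argument list for a Ulysses.py run.

    Assumes single-hyphen options, as in the "-vcf" and "-statsmod"
    examples of this manual. Hypothetical helper.
    """
    if stats_only:
        # Appending "--stats" re-launches only the statistical validation.
        typesv += "--stats"
    args = [path + "/Ulysses.py", "-typesv", typesv]
    if fdr is not None:
        args += ["-fdr", str(fdr)]
    return args


cmd = ulysses_command("[PATH]", typesv="DEL", fdr=0.001, stats_only=True)
print(" ".join(cmd))
```

The returned list can be passed to subprocess.call once [PATH] has been replaced by the real directory.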

Output

Four output files are created for each SV type: duplication (DUP), deletion (DEL), small insertion (sINS), inversion (INV), non-reciprocal translocation (NRT) and reciprocal translocation (RT).

Two output files give the SV candidate properties, for all detected SV (_bySV.csv) and for only those SV that passed the statistical validation tests (_bySV.stats.csv). Similarly, two output files contain both the SV properties and the description of each corresponding RP, for all SV candidates (_byRP.csv) and for only those SV that passed the statistical validation tests (_byRP.stats.csv).

VCF output can optionally be produced if the "-vcf LIB_NAME" option is specified.

File formats

Annotation file

The annotation file is optional. If given, ULYSSES will use the position of the centromere to better distinguish inversions from reciprocal translocations.

The accepted format is GFF (see http://www.sanger.ac.uk/resources/software/gff/spec.html)

The fields of a GFF file are:

<seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments]
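As an illustration of this layout, the sketch below splits one GFF line into named fields. It assumes tab-separated fields, as required by the GFF specification, and ignores the optional trailing comments. The function name and the example line (including its coordinates) are invented for the illustration.

```python
GFF_FIELDS = ("seqname", "source", "feature", "start", "end",
              "score", "strand", "frame", "attributes")


def parse_gff_line(line):
    """Split one tab-separated GFF line into a dict keyed by field name."""
    values = line.rstrip("\n").split("\t")
    # zip stops at the shorter sequence, so a missing attributes
    # field or an extra comments field does not raise an error.
    record = dict(zip(GFF_FIELDS, values))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    return record


# Invented example line describing a centromere feature.
line = "chr1\texample\tcentromere\t1000\t1100\t.\t+\t.\tID=CEN1"
record = parse_gff_line(line)
print(record["feature"], record["start"], record["end"])
```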

If you have an annotation file in EMBL or GenBank format, you can convert it online:

http://www.ebi.ac.uk/Tools/sfc/readseq/

or within a genome browser such as Artemis:

http://www.sanger.ac.uk/resources/software/artemis/

If you have an annotation file in BED format, you can convert it in Galaxy http://main.g2.bx.psu.edu/

Statistics file

A statistics file is automatically created during the parsing of the library.

It contains the following information:

Information	Description
RP_type	Type of the Paired sequences (mate pairs or paired ends) in the library
Read_length	Length of the reads
Chromosome_prefix	Prefix of chromosome ID
Mean	Mean of the Insert sizes (IS)
Median	Median of the IS
MAD	Median absolute deviation of the IS
stdev	Standard deviation of the IS
nDUP	Number of discordant RP compatible with a duplication
nDEL	Number of discordant RP compatible with a deletion
nINV	Number of discordant RP compatible with an inversion
nInter	Number of discordant RP compatible with an inter-chromosomal SV
genome_length	Genome length retrieved from the BAM file header
chromosome name / length	List of all chromosomes and their length retrieved from the BAM file header

Output files:

The output files describing one SV per line (_bySV.csv and _bySV.stats.csv) contain the first 18 columns described in the table below.

The output files describing one RP of each SV per line (_byRP.csv and _byRP.stats.csv) contain all the columns described in the table below.

The .stats.csv files contain only those SV that have passed the statistical validation steps.

Column	Name	Description
1	Library	Name of the library
2	(pair of) chromosome(s)	Chromosome (or chromosome pair for interchromosomal SV)
3	ID	ID of the SV (by chromosome)
4	nbRP	Number of RP describing the SV
5	nbA	Number of RP describing the SV on junction A
6	nbB	Number of RP describing the SV on junction B (only for SV with two junctions; set to -1 for SV with one junction)
7	left_borderA	5' border of the SV on junction A
8	right_borderA	3' border of the SV on junction A
9	deltaA	SV range on junction A between right and left borders
10	cen_posA	Four-character code giving the position of the reads with respect to the centromere (L for left arm and R for right arm) and their strand orientation (+/-). The first two characters describe read1 and the last two describe read2
11	left_borderB	5' border of the SV on junction B
12	right_borderB	3' border of the SV on junction B
13	deltaB	SV range on junction B between right and left borders
14	cen_posB	Four-character code giving the position of the reads with respect to the centromere (L for left arm and R for right arm) and their strand orientation (+/-). The first two characters describe read1 and the last two describe read2
15	SV_size_min	Minimum estimated SV size
16	SV_size_max	Maximum estimated SV size
17	p-balanced	For SV with two junctions only. P-value of the binomial test (see methods)
18	cov	Local sequence coverage
19	AvrQual	Average RP qualities
20	p-value	P-value of the SV statistical validation test
21	RP	Read-Pair name
22	str1	Strand orientation of read1
23	chr1	Chromosome of read1
24	pos1	Coordinate of read1
25	str2	Strand orientation of read2
26	chr2	Chromosome of read2
27	pos2	Coordinate of read2
28	L	Insert size of the RP
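The cen_posA and cen_posB codes (columns 10 and 14) can be decoded with a few lines of Python. The sketch below assumes that each read is written as the arm (L or R) followed by the strand (+ or -), for example "L+R-". The order of the two characters within each pair is an assumption of this sketch, and the function is hypothetical.

```python
def decode_cen_pos(code):
    """Decode a four-character cen_pos code such as "L+R-".

    Assumes each read is written as arm (L/R) followed by strand (+/-);
    the order within each pair is an assumption of this sketch.
    """
    if len(code) != 4:
        raise ValueError("expected four characters, got %r" % code)
    arms = {"L": "left", "R": "right"}
    reads = {}
    # The first two characters describe read1, the last two read2.
    for name, pair in (("read1", code[:2]), ("read2", code[2:])):
        arm, strand = pair[0], pair[1]
        if arm not in arms or strand not in ("+", "-"):
            raise ValueError("unexpected pair %r" % pair)
        reads[name] = {"arm": arms[arm], "strand": strand}
    return reads


print(decode_cen_pos("L+R-"))
```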

Detailed description for installation of dependencies

1) R

Linux Ubuntu

sudo apt-get update
sudo apt-get install r-base r-base-dev

Mac OS X

See cran.r-project.org/doc/manuals/R-admin.html#Installing-R-under-OS-X

2) IRanges and qvalue R packages

Linux Ubuntu and Mac OS X (in a terminal window)

sudo R
source("http://bioconductor.org/biocLite.R")
biocLite("IRanges")
biocLite("qvalue")

3) Three additional Python extensions are required:

"pysam": https://github.com/pysam-developers/pysam

sudo pip install pysam

"NumPy": http://www.numpy.org/

and "NetworkX"