ULYSSES : Accurate detection of structural variants from large insert mate-pair next generation sequencing


Version 1.0, September 2014

Contacts: Alexandre Gillet-Markowska and Ingrid Lafontaine


alexandre.gillet-markowska / at / upmc.fr

ingrid.lafontaine / at / upmc.fr


Laboratoire de Biologie Computationnelle et Quantitative.

UPMC UFR 927. CNRS UMR 7238




Cite Ulysses in your papers

If you use Ulysses for your publications, please read and cite:

Ulysses: accurate detection of low-frequency structural variations in large insert-size sequencing libraries
Alexandre Gillet-Markowska; Hugues Richard; Gilles Fischer; Ingrid Lafontaine
Bioinformatics 2015;
doi: 10.1093/bioinformatics/btu730




Quick start


Install R

sudo apt-get update
sudo apt-get install r-base

Install R “qvalue” and “IRanges” libraries

sudo R


source("http://bioconductor.org/biocLite.R")
biocLite("IRanges")
biocLite("qvalue")

Install the Python “pysam” API:

sudo pip install pysam

The Python NetworkX extension is also required.

sudo pip install networkx

 Python "NumPy" extension is also required.

sudo apt-get install python-numpy


Usage example


wget https://github.com/gillet/ulysses/archive/ulysses-v1.0.zip
unzip ulysses-v1.0.zip
cd ulysses-master/example

../ReadBAM.py example.bam -p example_params

../Ulysses.py -p example_params






Introduction


ULYSSES detects the full spectrum of structural variants (SV) from a paired-sequence next generation sequencing library, including deletions (DEL), segmental duplications (DUP), inversions (INV), insertions (INS), reciprocal translocations (RT) and non-reciprocal translocations (NRT).

ULYSSES takes as input a library of aligned read-pairs (RP) in SAM/BAM format. It is composed of two main parts: library parsing (part 1) and SV detection (part 2).

Download


You can download the Ulysses code from GitHub or directly at the command line:


wget https://github.com/gillet/ulysses/archive/ulysses-v1.0.zip

Dependencies


Please see the official installation instructions for your operating system, following the links given below. Detailed instructions for R and the required R packages are given at the end of the manual.


R: http://www.R-project.org/

R packages:

IRanges and qvalue packages from Bioconductor http://www.bioconductor.org/

See http://www.bioconductor.org/install/index.html#install-bioconductor-packages


Three additional Python extensions are required: “pysam”: https://github.com/pysam-developers/pysam

"NumPy": http://www.numpy.org/

and NetworkX.


Installation

Ulysses is written in Python 2.7 and R version 2.14. It has been tested only on Linux computers running Ubuntu.


Unzip the downloaded archive. A directory "ulysses-master" containing the Ulysses scripts will be created.

unzip ulysses-v1.0.zip


The input data FILE 


ULYSSES takes as input a library of paired sequences mapped onto a reference genome, in BAM/SAM format. The BAM/SAM file must contain a header section and be sorted by read coordinates.
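As a quick check before parsing, the sort order can be read from the @HD line of the SAM header (for example, the text printed by `samtools view -H`). The sketch below is not part of ULYSSES: `is_coordinate_sorted` is a hypothetical helper name, and it only inspects the declared sort order (the `SO` tag), not the records themselves.

```python
def is_coordinate_sorted(header_text):
    """Return True if a SAM header declares coordinate sorting (SO:coordinate)."""
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            # Header fields after the record type are tab-separated TAG:VALUE pairs.
            fields = dict(
                field.split(":", 1)
                for field in line.split("\t")[1:]
                if ":" in field
            )
            return fields.get("SO") == "coordinate"
    # No @HD line: the sort order is not declared in the header.
    return False


# Illustrative header text (made-up reference name and length).
header = "@HD\tVN:1.4\tSO:coordinate\n@SQ\tSN:chr1\tLN:230218\n"
print(is_coordinate_sorted(header))
```

A file whose header lacks the @HD line or declares another order would need to be re-sorted by coordinate before running ReadBAM.py.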

The parameter FILE


A parameter file is required to launch the program. A default file (ulysses_params) is provided. Users can set parameter values by editing the file or by using the command line options.




| parameter | type | default value | description and notes |
| --- | --- | --- | --- |
| in | string | none | Name of the input BAM/SAM file |
| mapq | integer | 20 | Minimal read mapping quality score. Both reads of an RP must have a mapping quality score > mapq to be considered |
| out | string | in+"_Ulysses" | Prefix of the detection output files. The file names are completed with the type of detected SV and the type of result (by SV unit or by RP unit) |
| stats | string | in+"_stats.txt" | Created by ULYSSES. Contains the information required for launching the detection and setting the detection parameters (see the description of the statistics file) |
| range | string | all | Indicates the chromosomes on which the detection must be performed. The default value "all" runs the detection on all chromosomes. To specify the desired chromosomes, give their IDs: use a dash to define a continuous series (1-5 for chromosomes 1, 2, 3, 4 and 5) and a comma to separate chromosomes or series (A,D,E,G-H). There is no need to specify the inter-chromosomal pairs because they are all considered for detection. |
| Nsv | integer | 10000 | Maximum number of detected SV |
| fdr | real | 0.01 | Upper threshold for statistical significance of a given SV candidate |
| n | integer | 6 | Multiplicative factor for MAD, used to define the detection parameters dn and ISCn |
| annotation | string | N/A | (optional) GFF annotation file. Improves the distinction between reciprocal translocations and inversions. |
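The syntax of the range parameter can be illustrated with a small parser. This is only a sketch of the notation described above, not the code used by ULYSSES: `expand_range` is a hypothetical name, series are assumed to be inclusive at both ends, and the ordered list of chromosome IDs is assumed to come from the BAM header. IDs that themselves contain a dash are not handled.

```python
def expand_range(spec, chromosomes):
    """Expand a range string such as "A,D,E,G-H" against an ordered list of IDs."""
    if spec == "all":
        return list(chromosomes)
    selected = []
    for item in spec.split(","):
        item = item.strip()
        if "-" in item:
            # A dash defines a continuous series, assumed inclusive at both ends.
            first, last = (part.strip() for part in item.split("-", 1))
            start, end = chromosomes.index(first), chromosomes.index(last)
            selected.extend(chromosomes[start:end + 1])
        elif item:
            if item not in chromosomes:
                raise ValueError("unknown chromosome ID: %s" % item)
            selected.append(item)
    return selected


print(expand_range("A,D,E,G-H", list("ABCDEFGH")))
print(expand_range("1-5", [str(i) for i in range(1, 9)]))
```

An unknown ID raises ValueError, both for a single chromosome and for either end of a series.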



USAGE

Parsing of the library in BAM/SAM format

Before performing the detection of the structural variants (SV), ULYSSES reads and parses the original BAM/SAM file to estimate statistics on the library (mean, MAD and standard deviation of the RP insert sizes) and to select the discordant RP potentially describing an SV. It then creates one temporary BAM file per chromosome, containing the selected discordant RP for intra-chromosomal SV candidates, and one BAM file containing the selected discordant RP for inter-chromosomal SV.
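The library statistics mentioned above can be sketched with the Python standard library. The function names are hypothetical, and the outlier rule (an insert size more than n × MAD away from the median) is only an illustrative assumption: this manual does not give the exact definitions of dn and ISCn, so this is not the selection rule implemented in ReadBAM.py.

```python
import statistics


def insert_size_stats(insert_sizes):
    """Return mean, median, MAD and standard deviation of a list of insert sizes."""
    median = statistics.median(insert_sizes)
    # MAD: median of the absolute deviations from the median.
    mad = statistics.median([abs(size - median) for size in insert_sizes])
    return {
        "mean": statistics.mean(insert_sizes),
        "median": median,
        "mad": mad,
        "stdev": statistics.stdev(insert_sizes),
    }


def is_size_outlier(insert_size, stats, n=6):
    """Assumed rule: flag sizes further than n * MAD from the median."""
    return abs(insert_size - stats["median"]) > n * stats["mad"]


# Made-up insert sizes for a large-insert library, with one aberrant pair.
sizes = [4800, 4900, 5000, 5100, 5200, 25000]
stats = insert_size_stats(sizes)
print(stats["median"], stats["mad"])
print([size for size in sizes if is_size_outlier(size, stats)])
```

The MAD is far less sensitive to a few aberrant pairs than the standard deviation, which is presumably why the threshold factor n applies to it.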


To launch this step, run the ReadBAM program:

[PATH]/ReadBAM.py bamfile_name

Replace [PATH] with the path of the directory ULYSSES.v1.0. To run the program from within the directory ULYSSES.v1.0, the command line becomes:

./ReadBAM.py bamfile_name


Optional arguments for ReadBAM


| option | type | default value | description and notes |
| --- | --- | --- | --- |
| p | string | ulysses_params | Name of the parameter file. If the file does not exist, it is created with the default parameters. |
| mapq | integer | 20 | Minimal read mapping quality score. Both reads of an RP must have a mapping quality score > mapq to be considered |
| n | integer | 6 | Multiplicative factor for MAD, used to define the detection parameters dn and ISCn |
| out | string | in+"_ulysses" | Prefix of the detection output files. The file names are completed with the type of detected SV and the type of result (by SV unit or by RP unit) |
| stats | string | in+"_stats.txt" | Created by ULYSSES. Contains the information required for launching the detection and setting the detection parameters (see the description of the statistics file) |



Detection and statistical validation of SV


To launch this step, run the Ulysses.py program:


Default:

[PATH]/Ulysses.py

Replace [PATH] with the path of the directory ULYSSES.v1.0.

To run the program from within the directory ULYSSES.v1.0, the command line becomes:

./Ulysses.py



Command line options for ULYSSES

| option | type | default value | description and notes |
| --- | --- | --- | --- |
| n | integer | 6 | Multiplicative factor for MAD, used to define the detection parameters dn and ISCn |
| p | string | ulysses_params | Name of the parameter file. |
| out | string | in+"_ulysses" | Prefix of the detection output files. The file names are completed with the type of detected SV and the type of result (by SV unit or by RP unit) |
| stats | string | in+"_stats.txt" | Created by ULYSSES. Contains the information required for launching the detection and setting the detection parameters (see the description of the statistics file) |
| range | string | all | Indicates the chromosomes on which the detection must be performed. The default value "all" runs the detection on all chromosomes. To specify the desired chromosomes, give their IDs: use a dash to define a continuous series (1-5 for chromosomes 1, 2, 3, 4 and 5) and a comma to separate chromosomes or series (A,D,E,G-H). There is no need to specify the inter-chromosomal pairs because they are all considered for detection. (TODO: perhaps provide an option to restrict the pairs.) |
| Nsv | integer | 10000 | Maximum number of detected SV |
| fdr | real | 0.01 | Upper threshold for statistical significance of a given SV candidate |
| a | string | N/A | (optional) GFF annotation file. Improves the distinction between reciprocal translocations and inversions. |
| typesv | string | ALL | DUP, DEL, INV, INTER (detection for the given SV type only) or ALL (detection for all SV types). "--stats" can be appended to any SV type to re-launch the statistical analysis (e.g. DUP--stats); alternatively, use "-statsmod True". |
| statsmod | string | FALSE | If TRUE, only the statistical validation module is run on the candidate SV already detected |
| vcf | string | FALSE | Providing a library name activates VCF output. |


Note that the parsing is performed with the specified value of the parameter n. If you want to run the detection with another value of n, the parsing step must be done again.


You can run the detection step several times from the same parsing step as long as the value of n is unchanged, for example:

[PATH]/Ulysses.py -typesv DEL--stats -fdr 0.001
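Re-running the detection with several thresholds from one parsing step is easy to script. The helper below is a hypothetical convenience wrapper, not part of ULYSSES; it only assembles argument lists using option names listed in this manual (p, typesv, fdr), written with a single leading dash as in "-p" and "-vcf". The launch itself is left commented out.

```python
def build_ulysses_command(script="./Ulysses.py", **options):
    """Build an argument list for Ulysses.py from option names and values."""
    command = [script]
    for name, value in options.items():
        # Options are written with a single leading dash, as in "-p example_params".
        command.extend(["-" + name, str(value)])
    return command


# Re-run the statistical validation of deletions at two FDR thresholds.
for fdr in (0.01, 0.001):
    command = build_ulysses_command(p="example_params", typesv="DEL--stats", fdr=fdr)
    print(" ".join(command))
    # To actually launch the detection:
    # import subprocess; subprocess.call(command)
```

Passing the command as a list avoids shell quoting issues if a file name contains spaces.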



Output


Four output files are created for each SV type: duplication (DUP), deletion (DEL), small insertion (sINS), inversion (INV), non-reciprocal translocation (NRT), reciprocal translocation (RT).

Two output files give the SV candidate properties, for all detected SV (_bySV.csv) and for only those SV that successfully passed the statistical validation tests (_bySV.stats.csv). Similarly, two output files contain both the SV properties and the description of each corresponding RP, for all SV candidates (_byRP.csv) and for only those SV that successfully passed the statistical validation tests (_byRP.stats.csv).

VCF output can optionally be produced if the "-vcf LIB_NAME" option is specified.

File formats

Annotation file


The annotation file is optional. If given, ULYSSES will use the position of the centromere to better distinguish inversions from reciprocal translocations.

The accepted format is GFF (see http://www.sanger.ac.uk/resources/software/gff/spec.html)

The fields of a GFF file are:

<seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments]
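To make the field layout concrete, here is a minimal sketch that splits one tab-separated GFF line into the fields listed above. `GffRecord` and `parse_gff_line` are hypothetical names, the example line is made up, and the "centromere" feature name is an assumption; this manual does not state which feature type ULYSSES looks for.

```python
from collections import namedtuple

GffRecord = namedtuple(
    "GffRecord",
    "seqname source feature start end score strand frame attributes",
)


def parse_gff_line(line):
    """Parse one tab-separated GFF line; return None for comments and blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 tab-separated fields, got %d" % len(fields))
    # The attributes column is optional.
    attributes = fields[8] if len(fields) > 8 else ""
    return GffRecord(
        fields[0], fields[1], fields[2],
        int(fields[3]), int(fields[4]),
        fields[5], fields[6], fields[7], attributes,
    )


record = parse_gff_line("chr1\texample\tcentromere\t1000\t1200\t.\t+\t.\tID=CEN1")
print(record.seqname, record.feature, record.start, record.end)
```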


If you have an annotation file in EMBL or GENBANK format, you can convert it online:

http://www.ebi.ac.uk/Tools/sfc/readseq/,

or within a genome browser like Artemis :

http://www.sanger.ac.uk/resources/software/artemis/


If you have an annotation file in BED format, you can convert it in Galaxy http://main.g2.bx.psu.edu/


Statistics file

A statistics file is automatically created during the parsing of the library.

It contains the following information:


| Information | Description |
| --- | --- |
| RP_type | Type of the paired sequences (mate pairs or paired ends) in the library |
| Read_length | Length of the reads |
| Chromosome_prefix | Prefix of the chromosome IDs |
| Mean | Mean of the insert sizes (IS) |
| Median | Median of the IS |
| MAD | Median absolute deviation of the IS |
| stdev | Standard deviation of the IS |
| nDUP | Number of discordant RP compatible with a duplication |
| nDEL | Number of discordant RP compatible with a deletion |
| nINV | Number of discordant RP compatible with an inversion |
| nInter | Number of discordant RP compatible with an inter-chromosomal SV |
| genome_length | Genome length retrieved from the BAM file header |
| chromosome name / length | List of all chromosomes and their lengths retrieved from the BAM file header |


Output files:


The output files describing one SV per line (_bySV.csv and _bySV.stats.csv) contain the first 18 columns described in the table below.

The output files describing one RP of each SV per line (_byRP.csv and _byRP.stats.csv) contain all the columns described in the table below.

The .stats.csv files contain only those SV that have successfully passed the statistical validation steps.


| Column | Name | Description |
| --- | --- | --- |
| 1 | Library | Name of the library |
| 2 | (pair of) chromosome(s) | Chromosome (or chromosome pair for inter-chromosomal SV) |
| 3 | ID | ID of the SV (by chromosome) |
| 4 | nbRP | Number of RP describing the SV |
| 5 | nbA | Number of RP describing the SV on junction A |
| 6 | nbB | Number of RP describing the SV on junction B (only for SV with 2 junctions; set to -1 for SV with 1 junction) |
| 7 | left_borderA | 5' border of the SV on junction A |
| 8 | right_borderA | 3' border of the SV on junction A |
| 9 | deltaA | SV range on junction A between right and left borders |
| 10 | cen_posA | Four-digit code giving the position of reads with respect to the centromere (L for left arm and R for right arm) and strand orientation (+/-). The first two digits are for read1 and the last two digits for read2 |
| 11 | left_borderB | 5' border of the SV on junction B |
| 12 | right_borderB | 3' border of the SV on junction B |
| 13 | deltaB | SV range on junction B between right and left borders |
| 14 | cen_posB | Four-digit code giving the position of reads with respect to the centromere (L for left arm and R for right arm) and strand orientation (+/-). The first two digits are for read1 and the last two digits for read2 |
| 15 | SV_size_min | Minimum estimated SV size |
| 16 | SV_size_max | Maximum estimated SV size |
| 17 | p-balanced | For SV with two junctions only. P-value of the binomial test (see methods) |
| 18 | cov | Local sequence coverage |
| 19 | AvrQual | Average RP qualities |
| 20 | p-value | P-value of the SV statistical validation test |
| 21 | RP | Read-pair name |
| 22 | str1 | Strand orientation of read1 |
| 23 | chr1 | Chromosome of read1 |
| 24 | pos1 | Coordinate of read1 |
| 25 | str2 | Strand orientation of read2 |
| 26 | chr2 | Chromosome of read2 |
| 27 | pos2 | Coordinate of read2 |
| 28 | L | Length of the insert size |
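The cen_posA and cen_posB codes can be decoded with a few lines of Python. This sketch rests on an assumption that the manual does not spell out: within each two-character half, the chromosome arm (L or R) comes first and the strand (+ or -) second, as in "L+R-". `decode_cen_pos` is a hypothetical helper name.

```python
ARMS = {"L": "left arm", "R": "right arm"}
STRANDS = {"+", "-"}


def decode_cen_pos(code):
    """Decode a four-character code such as "L+R-" into (arm, strand) per read."""
    if len(code) != 4:
        raise ValueError("expected a four-character code, got %r" % code)
    reads = []
    # First two characters describe read1, last two describe read2.
    for arm, strand in (code[0:2], code[2:4]):
        if arm not in ARMS or strand not in STRANDS:
            raise ValueError("unexpected cen_pos code: %r" % code)
        reads.append((ARMS[arm], strand))
    return {"read1": reads[0], "read2": reads[1]}


print(decode_cen_pos("L+R-"))
```

If the actual files use the opposite order within each half, only the unpacking order in the loop needs to change.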




Detailed description for installation of dependencies


1) R

Linux Ubuntu

sudo apt-get update
sudo apt-get install r-base r-base-dev

Mac OS X

See cran.r-project.org/doc/manuals/R-admin.html#Installing-R-under-OS-X



2) IRanges and qvalue R packages

Linux Ubuntu and Mac OS X (in a terminal window)

sudo R
source("http://bioconductor.org/biocLite.R")
biocLite("IRanges")
biocLite("qvalue")

3) Three additional Python extensions are required:

“pysam”: https://github.com/pysam-developers/pysam

sudo pip install pysam

"NumPy": http://www.numpy.org/

sudo apt-get install python-numpy

and NetworkX:

sudo pip install networkx


