Scripts

We have also provided standalone scripts for splicing and productivity analysis of quantified isoforms from flair-collapse output.

predictProductivity.py

Uses annotated start codons to identify the longest ORF for each isoform and predict whether the isoform is productive. Requires three arguments: (1) isoforms in psl or bed format, (2) a gtf genome annotation, and (3) the genome sequence in fasta format. Bedtools must be in your $PATH for predictProductivity.py to run properly.

Usage:

python predictProductivity.py -i <isoforms.bed>|<isoforms.psl> -g annotation.gtf -f genome.fa --longestORF > productivity.bed

Outputs a bed file with one of the values PRO (productive), PTC (premature termination codon, i.e. unproductive), NGO (no start codon), or NST (has start codon but no stop codon) appended to the end of each isoform name. When the isoforms are visualized in the UCSC genome browser or IGV, they are colored according to this value and drawn with thicker exons to denote the coding region.
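As a rough illustration of working with this output, the sketch below tallies isoforms by productivity label. It assumes only what is described above: the label ends the name field (column 4) of each tab-separated bed line. The function name count_productivity is hypothetical and not part of FLAIR.

```python
from collections import Counter

# Labels described above for predictProductivity.py output.
LABELS = ("PRO", "PTC", "NGO", "NST")

def count_productivity(bed_lines):
    """Tally isoforms by the productivity label that ends the bed name field.

    Hypothetical helper, not part of FLAIR. Assumes tab-separated bed lines
    with the isoform name in column 4.
    """
    counts = Counter()
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip header lines and anything too short to carry a name field.
        if len(fields) < 4 or line.startswith(("#", "track", "browser")):
            continue
        name = fields[3]
        label = next((lab for lab in LABELS if name.endswith(lab)), "unlabeled")
        counts[label] += 1
    return counts
```

Because the function accepts any iterable of lines, an open file handle for productivity.bed can be passed to it directly.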

mark_intron_retention.py

Identifies intron retentions in isoforms. Requires three positional arguments: (1) isoforms in psl or bed format, (2) the psl output filename, and (3) the txt output filename for the coordinates of the introns found.

Usage:

python mark_intron_retention.py <isoforms.psl>|<isoforms.bed> out_isoforms.psl out_coords.txt

Outputs (1) an extended psl with an additional column containing 0 or 1, classifying the isoform as spliced or intron-retaining, respectively; and (2) a txt file of intron retentions with the columns: isoform name, chromosome, intron 5' coordinate, intron 3' coordinate. Note: A psl or bed file with additional columns will not be displayed in the genome browser, but can be displayed in IGV.
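The extended psl can be split into spliced and intron-retaining isoforms with a few lines of Python. This is a minimal sketch under stated assumptions: the file is tab-separated, the 0/1 flag is the last column after the 21 standard psl columns, and the isoform name is in the standard psl qName column (column 10). The function name split_by_retention is hypothetical.

```python
def split_by_retention(psl_lines):
    """Return (spliced, retaining) lists of isoform names from an extended psl.

    Hypothetical helper. Assumes tab-separated lines with the 21 standard psl
    columns followed by a final 0/1 intron-retention flag.
    """
    spliced, retaining = [], []
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 22:  # 21 psl columns plus the appended flag
            continue
        name, flag = fields[9], fields[-1]  # qName is the 10th psl column
        if flag == "1":
            retaining.append(name)
        elif flag == "0":
            spliced.append(name)
    return spliced, retaining
```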

diff_iso_usage.py

Identifies alternative isoform usage between two samples and calculates its significance using Fisher’s exact tests. Requires four positional arguments: (1) counts_matrix.tsv from flair-quantify, (2) the name of the column of the first sample, (3) the name of the column of the second sample, and (4) the txt output filename, which will contain the p-value associated with differential isoform usage for each isoform. The more an isoform's usage differs between the two samples, the lower its p-value.

Usage:

python diff_iso_usage.py counts_matrix.tsv colname1 colname2 diff_isos.txt

The output file columns are as follows: (1) gene name, (2) isoform name, (3) p-value, (4) sample1 isoform count, (5) sample2 isoform count, (6) sample1 alternative isoforms for gene count, (7) sample2 alternative isoforms for gene count.
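To make the test concrete, the sketch below computes a two-sided Fisher's exact p-value for a single isoform using only the Python standard library. It is not the script's own code. It assumes the four count columns form the 2x2 table [[sample1 isoform, sample2 isoform], [sample1 alternative, sample2 alternative]], and the function name fisher_exact_two_sided is hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Illustrative only; the table layout used by diff_iso_usage.py is assumed.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is as likely as, or less likely than, the observed
    # one; the small tolerance guards against floating-point ties.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= observed * (1 + 1e-9))
    return min(p, 1.0)

# Example with made-up counts: the isoform has 30 reads in sample1 and 10 in
# sample2, while the gene's other isoforms have 70 and 90 reads.
p_value = fisher_exact_two_sided(30, 10, 70, 90)
```

A table whose proportions are identical in both samples gives a p-value of 1, and the p-value falls as the proportions diverge.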

plot_isoform_usage.py

Visualization script for FLAIR isoform structures and the percent usage of each isoform in each sample for a given gene. If you supply the isoforms.bed file from running predictProductivity.py, then isoforms will be filled according to the predicted productivity (solid for PRO, hatched for PTC, faded for NGO or NST). The gene name supplied should correspond to a gene name in your isoform file and counts file.

Usage:

python plot_isoform_usage.py <isoforms.psl>|<isoforms.bed> counts_matrix.tsv gene_name

Outputs (1) gene_name_isoforms.png of isoform structures and (2) gene_name_usage.png of isoform usage by sample.

For example:

_images/toy_diu_isoforms.png
_images/toy_diu_usage.png
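The percent usage shown in the usage plot can be sketched as follows. This assumes percent usage means an isoform's count divided by the summed counts of all isoforms of the same gene in that sample; the script's exact calculation may differ. The function name percent_usage and the input layout are hypothetical.

```python
def percent_usage(counts_by_isoform):
    """Convert {isoform: [count per sample]} for one gene to percent usage.

    Hypothetical helper. Every isoform must have one count per sample, in the
    same sample order.
    """
    if not counts_by_isoform:
        return {}
    n_samples = len(next(iter(counts_by_isoform.values())))
    # Total reads assigned to the gene in each sample.
    totals = [sum(c[i] for c in counts_by_isoform.values()) for i in range(n_samples)]
    return {
        iso: [100.0 * c[i] / totals[i] if totals[i] else 0.0 for i in range(n_samples)]
        for iso, c in counts_by_isoform.items()
    }

# Made-up counts for two isoforms of one gene across two samples.
usage = percent_usage({"iso1": [30, 10], "iso2": [70, 90]})
```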

diffsplice_fishers_exact.py

Identifies and calculates the significance of alternative splicing events between two samples without replicates using Fisher’s exact tests. Requires four positional arguments: (1) flair-diffSplice tsv of alternative splicing calls for a splicing event type, (2) the name of the column of the first sample, (3) the name of the column of the second sample, and (4) tsv output filename containing the p-values from Fisher’s exact tests of each event.

Usage:

python diffsplice_fishers_exact.py events.quant.tsv colname1 colname2 out.fishers.tsv

The output file contains the original columns with an additional column containing the p-values appended.