Introduction to RNA-Seq using high-performance computing – ARCHIVED

Approximate time: 75 minutes

Learning Objectives:

  • understand how counting tools work
  • generate a count matrix using featureCounts

Counting reads as a measure of gene expression

Once we have our reads aligned to the genome, the next step is to count how many reads have mapped to each gene. There are many tools that can use BAM files as input and output the number of reads (counts) associated with each feature of interest (genes, exons, transcripts, etc.). Two commonly used counting tools are featureCounts and htseq-count.

  • The above tools only report the “raw” counts of reads that map to a single location (uniquely mapping) and are best at counting at the gene level. Essentially, total read count associated with a gene (meta-feature) = the sum of reads associated with each of the exons (feature) that “belong” to that gene.

  • There are other tools available that are able to account for multiple transcripts for a given gene. In this case the counts are not whole numbers, but have fractions. In the simplest case, if 1 read is associated with 2 transcripts, it can get counted as 0.5 and 0.5, and the resulting count for each transcript is not a whole number.
  • In addition, there are other tools that will count multimapping reads, but this is a dangerous thing to do since you will be overcounting the total number of reads, which can cause issues with normalization and eventually with the accuracy of differential gene expression results.
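The counting arithmetic described in the list above can be sketched with toy data. This is a simplified illustration and not how featureCounts is implemented: the gene names, exon names and numbers are made up, and the sum assumes each read has already been assigned to exactly one exon.

```shell
# Toy exon-level counts (hypothetical names, made-up numbers).
# Columns: gene_id, exon_id, reads assigned to that exon.
exon_counts=$(printf '%s\t%s\t%s\n' \
  geneA exon1 10 \
  geneA exon2 5 \
  geneB exon1 7)

# Gene-level (meta-feature) count = sum over the exons (features) of that gene.
gene_counts=$(printf '%s\n' "$exon_counts" |
  awk -F'\t' '{ total[$1] += $3 } END { for (g in total) print g "\t" total[g] }' |
  sort)
printf '%s\n' "$gene_counts"

# Fractional counting: one read compatible with 2 transcripts adds 1/2 to each.
n_transcripts=2
share=$(awk -v n="$n_transcripts" 'BEGIN { print 1 / n }')
echo "each transcript receives $share of the read"
```

With these toy values geneA ends up with 15 reads and geneB with 7, while the shared read contributes 0.5 to each transcript.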

Input for counting = multiple BAM files + 1 GTF file
Simply speaking, the genomic coordinates of where the read is mapped (BAM) are cross-referenced with the genomic coordinates of whichever feature you are interested in counting expression of (GTF); it can be exons, genes or transcripts.
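To make the cross-referencing idea concrete, here is a minimal sketch that counts reads overlapping each feature by comparing start and end coordinates. The records are a made-up stand-in for GTF and BAM content (rows tagged F for feature and R for read); real tools also consider chromosome, strand and ambiguity, which this sketch ignores.

```shell
# Toy records: F = feature (name, start, end), R = read (name, start, end).
# All names and coordinates are hypothetical.
records=$(printf '%s\t%s\t%s\t%s\n' \
  F geneA 100 200 \
  F geneB 300 400 \
  R r1 120 170 \
  R r2 150 199 \
  R r3 310 360 \
  R r4 500 550)

# A read overlaps a feature when read_start <= feature_end
# and read_end >= feature_start.
overlap_counts=$(printf '%s\n' "$records" | awk -F'\t' '
  $1 == "F" { n++; name[n] = $2; fstart[n] = $3 + 0; fend[n] = $4 + 0; next }
  $1 == "R" {
    for (i = 1; i <= n; i++)
      if (($3 + 0) <= fend[i] && ($4 + 0) >= fstart[i]) count[name[i]]++
  }
  END { for (i = 1; i <= n; i++) print name[i] "\t" (count[name[i]] + 0) }
')
printf '%s\n' "$overlap_counts"
```

Read r4 falls outside both features, so it is not counted for either gene.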

Output of counting = A count matrix, with genes as rows and samples as columns
These are the “raw” counts and will be used in statistical programs downstream for differential gene expression.

Counting using featureCounts

Today, we will be using the featureCounts tool to get the gene counts. We picked this tool because it is accurate, fast and relatively easy to use. It counts reads that map to a single location (uniquely mapping) and follows a fixed scheme (shown as a figure in the lesson) for assigning reads to a gene/exon.

featureCounts can also take into account whether your data are stranded or not. If strandedness is specified, then in addition to considering the genomic coordinates it will also take the strand into account for counting. If your data are stranded, always specify it.
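Strandedness is passed to featureCounts through the -s option, which takes 0 (unstranded), 1 (stranded) or 2 (reversely stranded). The small helper below is a hypothetical convenience, not part of featureCounts; it only maps a library description to the matching value so the choice is explicit in a script.

```shell
# Map a library description to the value passed to featureCounts -s.
# strand_flag is a hypothetical helper written for this sketch.
strand_flag() {
  case "$1" in
    unstranded) echo 0 ;;
    forward)    echo 1 ;;
    reverse)    echo 2 ;;
    *) echo "unknown library type: $1" >&2; return 1 ;;
  esac
}

library_type=reverse   # assumption: set this from your library prep protocol
echo "featureCounts -s $(strand_flag "$library_type") ..."
```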

Setting up to run featureCounts

First things first, start an interactive session with 4 cores:

 $ srun --pty -p short -t 0-12:00 -c 4 --mem 8G --reservation=HBC /bin/bash

Now, change directories to your rnaseq directory and start by creating 2 directories: (1) a directory for the output and (2) a directory for the BAM files:

 $ cd ~/unix_lesson/rnaseq/
 $ mkdir results/counts results/STAR/bams
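If you are not sure whether the parent directories already exist, a slightly more defensive variant uses mkdir -p. The sketch below runs in a throwaway directory (an assumption made so it does not touch your real data).

```shell
# Work in a throwaway directory so the sketch does not touch real data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# -p creates missing parent directories (results/ and results/STAR/)
# and does not fail if a directory already exists.
mkdir -p results/counts results/STAR/bams

ls -d results/counts results/STAR/bams
```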

Rather than using the BAM file we generated in the last lesson, let’s copy over all of the BAM files that we have already generated for you:

 $ cp /n/groups/hbctraining/intro_rnaseq_hpc/bam_STAR38/*bam ~/unix_lesson/rnaseq/results/STAR/bams

featureCounts is not available as a module on O2, but we have already added the path for it to our $PATH variable last time.

 $ echo $PATH   # You should see /n/app/bcbio/tools/bin/ among other paths

** If you don’t see /n/app/bcbio/tools/bin/ in your $PATH variable, add the following export command to your ~/.bashrc file using vim: export PATH=/n/app/bcbio/tools/bin/:$PATH. **
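Scanning a long $PATH by eye is error-prone, so the check can also be scripted. The sketch below uses a hypothetical helper, path_contains, and only prints the export line instead of editing ~/.bashrc for you.

```shell
tools_dir=/n/app/bcbio/tools/bin

# Succeeds when the directory in $1 (with or without a trailing slash)
# is one of the colon-separated entries of the PATH-like string in $2.
# path_contains is a hypothetical helper written for this sketch.
path_contains() {
  case ":$2:" in
    *":$1:"* | *":$1/:"*) return 0 ;;
    *) return 1 ;;
  esac
}

if path_contains "$tools_dir" "$PATH"; then
  echo "$tools_dir is already on PATH"
else
  echo "$tools_dir is missing; add this line to ~/.bashrc:"
  echo "export PATH=$tools_dir/:\$PATH"
fi
```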

Running featureCounts

How do we use this tool, what is the command and what options/parameters are available to us?

 $ featureCounts

So, it looks like the usage is featureCounts [options] -a <annotation_file> -o <output_file> input_file1 [input_file2] ... , where -a, -o and input files are required.

We are going to use the following options:
-T 4 # specify 4 cores
-s 2 # these data are "reverse"ly stranded

and the following are the values for the required parameters:
-a ~/unix_lesson/rnaseq/reference_data/chr1-hg19_genes.gtf # required option for specifying path to GTF
-o ~/unix_lesson/rnaseq/results/counts/Mov10_featurecounts.txt # required option for specifying path to, and name of, the text output (count matrix)
~/unix_lesson/rnaseq/results/STAR/bams/*bam # the list of all the BAM files we want to collect count information for

Let’s run this now:

 $ featureCounts -T 4 -s 2 \
   -a /n/groups/hbctraining/intro_rnaseq_hpc/reference_data_ensembl38/Homo_sapiens.GRCh38.92.gtf \
   -o ~/unix_lesson/rnaseq/results/counts/Mov10_featurecounts.txt \
   ~/unix_lesson/rnaseq/results/STAR/bams/*.out.bam

If you want to collect the information that is printed on the screen as the job runs, you can modify the command and add the 2> redirection at the end. This type of redirection sends standard error, which is where featureCounts writes its on-screen report, into a file.

 # note the last line of the command below
 $ featureCounts -T 4 -s 2 \
   -a /n/groups/hbctraining/intro_rnaseq_hpc/reference_data_ensembl38/Homo_sapiens.GRCh38.92.gtf \
   -o ~/unix_lesson/rnaseq/results/counts/Mov10_featurecounts.txt \
   ~/unix_lesson/rnaseq/results/STAR/bams/*.out.bam \
   2> ~/unix_lesson/rnaseq/results/counts/Mov10_featurecounts.screen-output.log
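To see what 2> does without running a real job, here is a minimal sketch with a stand-in function. fake_tool is hypothetical and only mimics a program that writes progress messages to standard error and results to standard output.

```shell
# Stand-in for a tool that reports progress on stderr
# (hypothetical function, not part of featureCounts).
fake_tool() {
  echo "progress: processing sample" >&2   # goes to stderr
  echo "result line"                       # goes to stdout
}

log_file=$(mktemp)

# 2> sends only stderr to the log; stdout is captured separately here.
stdout_text=$(fake_tool 2> "$log_file")

echo "stdout: $stdout_text"
echo "log:    $(cat "$log_file")"
```

Only the progress message ends up in the log file; the result line still arrives on standard output.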

featureCounts output

The output of this tool is 2 files: a count matrix and a summary file that tabulates how many reads were “assigned” (counted) and, for those that remained “unassigned”, the reason why. Let’s take a look at the summary file:

 $ less results/counts/Mov10_featurecounts.txt.summary
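A quick number to pull out of the summary file is the percentage of reads that were assigned. The sketch below works on a mock summary for a single sample; the numbers are made up, and the exact set of Unassigned_* rows depends on the featureCounts version, so treat the row names here as an assumption.

```shell
# Mock of a featureCounts .summary file for one sample (made-up numbers).
summary_file=$(mktemp)
printf '%s\t%s\n' \
  Status sample1.bam \
  Assigned 800 \
  Unassigned_NoFeatures 150 \
  Unassigned_Ambiguity 50 > "$summary_file"

# Percentage assigned = Assigned / sum of all rows for that sample.
assigned_pct=$(awk -F'\t' '
  NR == 1 { next }                  # skip the header row
  { total += $2 }
  $1 == "Assigned" { assigned = $2 }
  END { printf "%d", 100 * assigned / total }
' "$summary_file")

echo "Assigned: ${assigned_pct}%"
```

For a real summary with several samples, the same calculation would be repeated for each sample column.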

Now let’s look at the count matrix:

 $ less results/counts/Mov10_featurecounts.txt

Cleaning up the featureCounts matrix

The count matrix includes information about the genomic coordinates and the length of each gene. We don’t need this for the next step, so we are going to extract only the columns that we are interested in.

 $ cut -f1,7,8,9,10,11,12 results/counts/Mov10_featurecounts.txt > results/counts/Mov10_featurecounts.Rmatrix.txt
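The effect of cut is easier to see on a small mock table. The file below imitates the layout of the featureCounts output (one comment line, then a header with six annotation columns followed by one column per sample); gene names, sample names and counts are made up.

```shell
# Mock featureCounts output with two samples (hypothetical values).
counts_file=$(mktemp)
{
  echo '# Program:featureCounts (mock comment line)'
  printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    Geneid Chr Start End Strand Length s1.bam s2.bam \
    geneA chr1 100 200 + 101 12 30 \
    geneB chr1 300 400 - 101 0 4
} > "$counts_file"

# Keep column 1 (gene ID) and columns 7 onward (one per sample).
rmatrix=$(cut -f1,7- "$counts_file")
printf '%s\n' "$rmatrix"
```

The open-ended range 7- is a design choice that avoids listing every sample column. Because the comment line contains no tabs, cut passes it through unchanged, so it still has to be removed afterwards.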

The next step is to clean it up a little further by modifying the header line (we could also do this in R, or in a GUI text editor):

 $ vim results/counts/Mov10_featurecounts.Rmatrix.txt

Vim has nice shortcuts for cleaning up the header of our file using the following steps:

  1. Move the cursor to the beginning of the document by typing: gg (in command mode).
  2. Remove the first line by typing: dd (in command mode).
  3. Remove the file name suffix following the sample name by typing: :%s/_Aligned.sortedByCoord.out.bam//g (in command mode).
  4. Remove the path leading up to the file name by typing: :%s/\/home\/username\/unix_lesson\/rnaseq\/results\/STAR\/bams\///g (in command mode).

    Note that we have a \ preceding each /, which tells vim that we are not using the / as part of our search and replace command, but instead the / is part of the pattern that we are replacing. This is called escaping the /.
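The same header cleanup can also be done non-interactively. The sketch below is an alternative to the vim steps, not the lesson's method: it uses sed on a mock matrix, with a hypothetical sample name and the /home/username/ prefix standing in for your own home directory. Using | as the delimiter of the last substitution means the slashes in the path do not need escaping.

```shell
# Mock Rmatrix file with the comment line and one long header entry.
matrix_file=$(mktemp)
{
  echo '# Program:featureCounts (mock comment line)'
  printf '%s\t%s\n' \
    Geneid /home/username/unix_lesson/rnaseq/results/STAR/bams/sampleA_Aligned.sortedByCoord.out.bam \
    geneA 12
} > "$matrix_file"

bam_dir=/home/username/unix_lesson/rnaseq/results/STAR/bams/

# 1d deletes the first line; the two substitutions strip the file name
# suffix and the leading path from the header.
cleaned=$(sed -e '1d' \
  -e 's/_Aligned\.sortedByCoord\.out\.bam//g' \
  -e "s|$bam_dir||g" \
  "$matrix_file")
printf '%s\n' "$cleaned"
```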

Note on counting PE data

For paired-end (PE) data, the BAM file contains information about whether both read1 and read2 mapped and whether they were at roughly the correct distance from each other, that is to say whether they were “properly” paired. For most counting tools, only properly paired reads are considered by default, and each read pair is counted only once as a single “fragment”.

For counting PE fragments associated with genes, the input BAM files need to be sorted by read name (i.e. alignment information about both reads of a pair in adjoining rows). The alignment tool might sort them for you, but watch out for how the sorting was done. If they are sorted by coordinates (like with STAR), you will need to use samtools sort to re-sort them by read name before using them as input in featureCounts. If you do not sort your BAM file by read name before using it as input, featureCounts assumes that almost all the reads are not properly paired.
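As a rough sketch of the re-sorting step, samtools sort with the -n flag sorts by read name. The file names below are hypothetical, name_sort_cmd is a helper written for this sketch, and the real command only runs if samtools is installed and the input file exists; otherwise the sketch just prints the command.

```shell
# Print the command for name-sorting a coordinate-sorted BAM file.
# $1 = coordinate-sorted input BAM, $2 = name-sorted output BAM
name_sort_cmd() {
  echo "samtools sort -n -@ 4 -o $2 $1"
}

# Hypothetical file names, for illustration only.
in_bam=results/STAR/bams/sample_Aligned.sortedByCoord.out.bam
out_bam=results/STAR/bams/sample_Aligned.sortedByName.bam

name_sort_cmd "$in_bam" "$out_bam"

# Run for real only when samtools is installed and the input exists.
if command -v samtools >/dev/null 2>&1 && [ -f "$in_bam" ]; then
  samtools sort -n -@ 4 -o "$out_bam" "$in_bam"
fi
```

featureCounts would then be given the name-sorted file together with -p to indicate paired-end input. Recent releases may also expect an additional option to count fragments instead of reads, so check the help text of your installed version.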

This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
