Lund University Bioinformatics Infrastructure - Large Scale Genomics Analysis


Research areas and keywords

UKÄ subject classification

  • Bioinformatics (Computational Biology)

Type of infrastructure

  • Services

Name of national/international infrastructure this infrastructure belongs to

Closely tied to National Bioinformatics Infrastructure Sweden (NBIS).


LUBI will provide a scalable, flexible and efficient solution which is cost-effective, and enables state-of-the-art high performance computing for researchers who are new to this field as well as for established power-users. A freely available infrastructure that provides such resources coupled with a systems-application expert and tailored introductory and advanced courses ensures that researchers at Lund University can compete globally in the rapidly moving field of data-intensive biology and medicine. LUBI will cater for all research groups and scientists across Lund University coming from the Faculties of Engineering, Medicine and Science.

Equipment and resources


Blades have recently been purchased and installed at LUNARC to increase the computational and storage capacity needed for large-scale genomics projects. Some specifications of the LUBI nodes:
* 2x Intel Xeon Silver 4114 2.2 GHz, 10 cores/20 threads, 13.75 MB L3 cache
* 192 GB RAM (6x32)
* 2 TB disk
To get access to the LUBI nodes you need to:
1) register in SUPR first and
2) be added to the project (LU 2018/2-44).

Once you have done this, you will get a username and password that give you access to the LUNARC/LUBI-LSGA resources.
To use the LUBI nodes you'll have to specify the following in your batch scripts:
#SBATCH -A lu2018-2-44
#SBATCH -p lu
#SBATCH --reservation=lu2018-2-44
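As a rough illustration, a complete batch script using the three LUBI directives might look like the sketch below. The node count, task count, time limit and job name are assumptions chosen for the example and are not LUBI requirements; adjust them to your own job.

```shell
#!/bin/bash
# LUBI-specific directives (project, partition, reservation)
#SBATCH -A lu2018-2-44
#SBATCH -p lu
#SBATCH --reservation=lu2018-2-44
# The values below are illustrative assumptions, not LUBI defaults
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH -t 00:10:00
#SBATCH -J lubi_example

# Slurm sets SLURM_JOB_ID inside a job; fall back to "local" when run directly
job_id="${SLURM_JOB_ID:-local}"
echo "Job ${job_id} running on $(uname -n)"
```

Saved as, for example, lubi_example.sh (a hypothetical file name), the script would be submitted with 'sbatch lubi_example.sh'.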
For research groups with extensive needs there is an option to buy their own blades, either for their exclusive use or to be shared with others when free computing cycles are available. LUNARC has an agreement with the suppliers and can therefore offer good prices that any group can benefit from. The maintenance of these blades will be taken care of by the LUBI application expert and LUNARC staff.

L-SENS is a secure system within LUNARC where GDPR-compliant data management and analysis will be performed. A flexible system for sharing resources is currently under construction.


This is a list of bioinformatics software available at LUNARC. Please note that this list is not exhaustive. To see whether a specific package is available and which versions are installed, you will have to log in and run 'module spider package-name', e.g. 'module spider BCFtools'.
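The availability check can be wrapped in a small helper, sketched here. The function name check_module is hypothetical, and the sketch assumes the 'module' command is defined in your shell, which is normally the case only after logging in to the cluster.

```shell
#!/bin/bash
# List the installed versions of a package through the module system.
# check_module is a hypothetical helper name used for illustration.
check_module() {
    local pkg="$1"
    if command -v module >/dev/null 2>&1; then
        module spider "$pkg"
    else
        echo "module command not found; run this after logging in to LUNARC" >&2
        return 1
    fi
}

# Example from the text above
check_module BCFtools || true
```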

* Alfred BAM Statistics, Feature Counting and Feature Annotation. Alfred is an efficient and versatile command-line application that computes multi-sample quality control metrics in a read-group aware manner. Alfred supports read counting, feature annotation and haplotype-resolved consensus computation using multiple sequence alignments. Alfred is available as a Bioconda package; you will have to load Anaconda3/2018.12 before you can use it.

* Amber Amber is a package of programs for molecular dynamics simulations of proteins and nucleic acids.

* AmberTools AmberTools consists of several independently developed packages that work well by themselves, and with Amber. The suite can also be used to carry out complete molecular dynamics simulations, with either explicit water or generalized Born solvent models.

* ANNOVAR ANNOVAR is an efficient software tool that uses up-to-date information to functionally annotate genetic variants detected from diverse genomes (including human genome hg18, hg19, hg38, as well as mouse, worm, fly, yeast and many others).

* AutoDock_Vina AutoDock Vina is an open-source program for doing molecular docking.

* BBMap BBMap short read aligner, and other bioinformatic tools.

* BCFtools Reading/writing BCF2/VCF/gVCF files and calling/filtering/summarising SNP and short indel sequence variants.

* bcl2fastq The Illumina sequencing instruments generate per-cycle base call (BCL) files at the end of the sequencing run. A majority of analysis applications use per-read FASTQ files as input for analysis. You can use the bcl2fastq2 Conversion Software v2.19 to convert base call (BCL) files from a sequencing run into FASTQ files.

* beagle-lib beagle-lib is a high-performance library that can perform the core calculations at the heart of most Bayesian and Maximum Likelihood phylogenetics packages.

* BEAST BEAST is a cross-platform program for Bayesian analysis of molecular sequences using MCMC. It is entirely orientated towards rooted, time-measured phylogenies inferred using strict or relaxed molecular clock models. It can be used as a method of reconstructing phylogenies but is also a framework for testing evolutionary hypotheses without conditioning on a single tree topology. BEAST uses MCMC to average over tree space, so that each tree is weighted proportional to its posterior probability.

* BEDTools Bedtools is a fast, flexible toolset for genome arithmetic.

* Biopython Biopython is a set of freely available tools for biological computation written in Python by an international team of developers. It is a distributed collaborative effort to develop Python libraries and applications which address the needs of current and future work in bioinformatics.

* BLAT BLAT on DNA is designed to quickly find sequences of 95% and greater similarity of length 25 bases or more.

* Bowtie Bowtie is an ultra-fast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome.

* Bowtie2 Bowtie2 is an ultra-fast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters, and particularly good at aligning to relatively long (e.g. mammalian) genomes. Bowtie2 indexes the genome with an FM Index to keep its memory footprint small: for the human genome, its memory footprint is typically around 3.2 GB. Bowtie2 supports gapped, local, and paired-end alignment modes.

* BWA Burrows-Wheeler Aligner (BWA) is an efficient program that aligns relatively short nucleotide sequences against a long reference sequence such as the human genome.

* bx-python The bx-python project is a Python library and associated set of scripts to allow for rapid implementation of genome scale analyses.

* Cell Ranger Cell Ranger is a set of analysis pipelines that process Chromium single-cell RNA-seq output to align reads, generate gene-cell matrices and perform clustering and gene expression analysis.

* Cell Ranger ATAC Cell Ranger ATAC is a set of analysis pipelines that process Chromium Single Cell ATAC data.

* Chimera UCSF Chimera is a highly extensible program for interactive visualization and analysis of molecular structures and related data, including density maps, supramolecular assemblies, sequence alignments, docking results, trajectories, and conformational ensembles.

* chimerascan Chimerascan is a software package that detects gene fusions in paired-end RNA sequencing (RNA-Seq) datasets. Recurrent gene fusions (a.k.a. chimeras) are a prevalent class of mutations that can produce functional transcripts that contribute to cancer progression. Recent advances in high-throughput sequencing technologies have enabled reliable gene fusion discovery.

* CNVkit CNVkit is a Python library and command-line software toolkit to infer and visualize copy number from high-throughput DNA sequencing data. It is designed for use with hybrid capture, including both whole-exome and custom target panels, and short-read sequencing platforms such as Illumina and Ion Torrent.

* cnvkit-bundle CNVkit is a Python library and command-line software toolkit to infer and visualize copy number from targeted DNA sequencing data. It is designed for use with hybrid capture, including both whole-exome and custom target panels, and short-read sequencing platforms such as Illumina and Ion Torrent. This is a bundle to provide dependencies for cnvkit that aren't available in the standard EasyBuild Python

* CNVnator CNVnator is a tool for CNV discovery and genotyping from depth-of-coverage by mapped reads.

* Cufflinks Transcript assembly, differential expression, and differential regulation for RNA-Seq.

* cutadapt Cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads.

* deepTools deepTools is a suite of Python tools particularly developed for the efficient analysis of high-throughput sequencing data, such as ChIP-seq, RNA-seq or MNase-seq.

* EMBOSS EMBOSS is 'The European Molecular Biology Open Software Suite'. EMBOSS is a free Open Source software analysis package specially developed for the needs of the molecular biology (e.g. EMBnet) user community.

* EricScript EricScript is a computational framework for the discovery of gene fusions in paired end RNA-seq data.

* FastQC A quality control tool for high throughput sequence data.

* FASTX-Toolkit The FASTX-Toolkit is a collection of command line tools for preprocessing short-read FASTA/FASTQ files.

* fineRADstructure Powerful model-based approach to investigating population structure using genetic data. It offers especially high resolution in inference of recent shared ancestry. The high resolution of this method derives from utilizing haplotype linkage information and from focusing on the most recent coalescence (common ancestry) among the sampled individuals to derive a "co-ancestry matrix" - a summary of nearest neighbor haplotype relationships in the dataset. Further advantages when compared with other model-based methods (e.g. STRUCTURE and ADMIXTURE) include the ability to deal with a very large number of populations, explore relationships between them, and to quantify ancestry sources in each population.

* FLASH FLASH (Fast Length Adjustment of SHort reads) is a very fast and accurate software tool to merge paired-end reads from next-generation sequencing experiments. FLASH is designed to merge pairs of reads when the original DNA fragments are shorter than twice the length of reads. The resulting longer reads can significantly improve genome assemblies. They can also improve transcriptome assembly when FLASH is used to merge RNA-seq data.

* FreeBayes FreeBayes is a Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs (single-nucleotide polymorphisms), indels (insertions and deletions), MNPs (multi-nucleotide polymorphisms), and complex events (composite insertion and substitution events) smaller than the length of a short-read sequencing alignment.

* FusionCatcher FusionCatcher searches for novel/known fusion genes, translocations, and chimeras in RNA-seq data (paired-end reads from Illumina NGS platforms like Solexa/HiSeq/NextSeq/MiSeq) from diseased samples.

* GATK The Genome Analysis Toolkit or GATK is a software package developed at the Broad Institute to analyse next-generation resequencing data. The toolkit offers a wide variety of tools, with a primary focus on variant discovery and genotyping as well as strong emphasis on data quality assurance. Its robust architecture, powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

* GENESIS GENESIS (short for GEneral NEural SImulation System) is a general purpose simulation platform that was developed to support the simulation of neural systems ranging from subcellular components and biochemical reactions to complex models of single neurons, simulations of large networks, and systems-level models.

* GROMACS GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles.

* HISAT HISAT is a fast and sensitive spliced alignment program for mapping RNA-seq reads. It is recommended that HISAT and TopHat2 users switch to HISAT2.

* HISAT2 HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads (both DNA and RNA) against the general human population (as well as against a single reference genome). HISAT2 is a successor to both HISAT and TopHat2.

* HOMER HOMER (Hypergeometric Optimization of Motif EnRichment) is a suite of tools for Motif Discovery and next-gen sequencing analysis. It is a collection of command line programs for Unix-style operating systems written in Perl and C++. HOMER was primarily written as a de novo motif discovery algorithm and is well suited for finding 8-20 bp motifs in large scale genomics data. HOMER contains many useful tools for analyzing ChIP-Seq, GRO-Seq, RNA-Seq, DNase-Seq, Hi-C and numerous other types of functional genomics sequencing data sets.

* HTSeq Analysing high-throughput sequencing data with Python.

* HTSlib A C library for reading/writing high-throughput sequencing data. This package includes the utilities bgzip and tabix.

* IGV The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated genomic datasets. It supports a wide variety of data types, including array-based and next-generation sequence data, and genomic annotations.

* IGVTools This package contains command line utilities for preprocessing, computing feature count density (coverage), sorting, and indexing data files.

* IMPUTE2 IMPUTE2 is a computer program for phasing observed genotypes and imputing missing genotypes.

* Jellyfish Jellyfish is a tool for fast, memory-efficient counting of k-mers in DNA.

* kallisto kallisto is a program for quantifying abundances of transcripts from RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads.

* MACS2 Model Based Analysis for ChIP-Seq data.

* MAGeCK Model-based Analysis of Genome-wide CRISPR-Cas9 Knockout.

* manta Manta calls structural variants (SVs) and indels from mapped paired-end sequencing reads. It is optimized for analysis of germline variation in small sets of individuals and somatic variation in tumor/normal sample pairs. Manta discovers, assembles and scores large-scale SVs, medium-sized indels and large insertions within a single efficient workflow.

* MAVIS MAVIS is a Python (requires >=3) command-line tool for the post-processing of structural variant calls. On Aurora you'll need to load GCC and OpenMPI (module load GCC/7.3.0-2.30 OpenMPI/3.1.1) and Python 3.7.0 (module load Python/3.7.0).

* MEME The MEME Suite allows the biologist to discover novel motifs in collections of unaligned nucleotide or protein sequences, and to perform a wide variety of other motif-based analyses. The MEME Suite supports motif-based analysis of DNA, RNA and protein sequences. It provides motif discovery algorithms using both probabilistic (MEME) and discrete models (DREME), which have complementary strengths. It also allows discovery of motifs with arbitrary insertions and deletions (GLAM2). In addition to motif discovery, the MEME Suite provides tools for scanning sequences for matches to motifs (FIMO, MAST and GLAM2Scan), scanning for clusters of motifs (MCAST), comparing motifs to known motifs (Tomtom), finding preferred spacings between motifs (SpaMo), predicting the biological roles of motifs (GOMo), measuring the positional enrichment of sequences for known motifs (CentriMo), and analyzing ChIP-seq and other large datasets (MEME-ChIP).

* Molden Molden is a package for displaying Molecular Density from the Ab Initio packages GAMESS-UK, GAMESS-US and GAUSSIAN and the Semi-Empirical packages Mopac/Ampac.

* MuTect MuTect is a method developed at the Broad Institute for the reliable and accurate identification of somatic point mutations in next generation sequencing data of cancer genomes.

* MultiQC Aggregate results from bioinformatics analyses across many samples into a single report. MultiQC searches a given directory for analysis logs and compiles a HTML report. It's a general use tool, perfect for summarising the output from numerous bioinformatics tools.

* NAMD NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems.

* ncbi-vdb The SRA Toolkit and SDK from NCBI is a collection of tools and libraries for using data in the INSDC Sequence Read Archives.

* NGS NGS is a new, domain-specific API for accessing reads, alignments and pileups produced from Next Generation Sequencing.

* Picard Picard is a set of command line tools (in Java) for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF.

* Pindel Pindel can detect breakpoints of large deletions, medium-sized insertions, inversions, tandem duplications and other structural variants at single-base resolution from next-gen sequence data. It uses a pattern growth approach to identify the breakpoints of these variants from paired-end short reads.

* PLINK PLINK is a free, open-source whole genome association analysis toolset, designed to perform a range of basic, large-scale analyses in a computationally efficient manner.

* PLUMED PLUMED is an open source library for free energy calculations in molecular systems which works together with some of the most popular molecular dynamics engines. Free energy calculations can be performed as a function of many order parameters with a particular focus on biological problems, using state of the art methods such as metadynamics, umbrella sampling and Jarzynski-equation based steered MD. The software, written in C++, can be easily interfaced with both Fortran and C/C++ codes.

* Protege Ontology editor and framework for building intelligent systems.

* Pysam Pysam is a Python module for reading, manipulating and writing genomic data sets.

* QCTOOL QCTOOL is a command-line utility program for basic quality control of GWAS datasets and other genome-wide data. It supports the same file formats used by the WTCCC studies, as well as the binary file format described in the QCTOOL documentation and the Variant Call Format, and is designed to work seamlessly with SNPTEST and related tools.

* RasMol RasMol is a program for molecular graphics visualisation.

* ROOT ROOT is a modular scientific software toolkit. It provides all the functionalities needed to deal with big data processing, statistical analysis, visualisation and storage.

* RSEM RSEM is a software package for estimating gene and isoform expression levels from RNA-Seq data. The RSEM package provides a user-friendly interface and supports threads for parallel computation of the EM algorithm, single-end and paired-end read data, quality scores, variable-length reads and RSPD estimation. In addition, it provides posterior mean and 95% credibility interval estimates for expression levels.

* RSeQC RSeQC provides a number of useful modules that can comprehensively evaluate high throughput sequence data especially RNA-seq data. Some basic modules quickly inspect sequence quality, nucleotide composition bias, PCR bias and GC bias, while RNA-seq specific modules evaluate sequencing saturation, mapped reads distribution, coverage uniformity, strand specificity, transcript level RNA integrity etc.

* RevBayes RevBayes provides an interactive environment for statistical computation in phylogenetics. It is primarily intended for modeling, simulation, and Bayesian inference in evolutionary biology, particularly phylogenetics.

* Salmon Salmon is a wicked-fast program to produce highly accurate, transcript-level quantification estimates from RNA-seq data.

* samblaster samblaster: a tool to mark duplicates and extract discordant and split reads from SAM files.

* SAMtools SAM Tools provide various utilities for manipulating alignments in the SAM/BAM/CRAM format, including sorting, merging, indexing and generating alignments in a per-position format.

* SeqAn SeqAn is an open source C++ library of efficient algorithms and data structures for the analysis of sequences with the focus on biological data.

* SeqMonk A tool to visualise and analyse high throughput mapped sequence data.

* seqtk Seqtk is a fast and lightweight tool for processing sequences in the FASTA or FASTQ format. It seamlessly parses both FASTA and FASTQ files which can also be optionally compressed by gzip.

* snpEff SnpEff is a variant annotation and effect prediction tool. It annotates and predicts the effects of genetic variants (such as amino acid changes).

* SNPTEST Analysis of single SNP association in genome-wide studies.

* SRA-Toolkit The Sequence Read Archive (SRA) Toolkit, and the source-code SRA System Development Kit (SDK), will allow you to programmatically access data housed within SRA and convert it from the SRA format.

* Stacks Stacks is a software pipeline for building loci from short-read sequences, such as those generated on the Illumina platform. Stacks was developed to work with restriction enzyme-based data, such as RAD-seq, for the purpose of building genetic maps and conducting population genomics and phylogeography.

* STAR STAR aligns RNA-seq reads to a reference genome using uncompressed suffix arrays.

* STAR-Fusion STAR-Fusion uses the STAR aligner to identify candidate fusion transcripts supported by Illumina reads. STAR-Fusion further processes the output generated by the STAR aligner to map junction reads and spanning reads to a reference annotation set.

* Strelka2 Strelka2 is a fast and accurate small variant caller optimized for analysis of germline variation in small cohorts and somatic variation in tumor/normal sample pairs. The germline caller employs an efficient tiered haplotype model to improve accuracy and provide read-backed phasing, adaptively selecting between assembly and a faster alignment-based haplotyping approach at each variant locus. The germline caller also analyzes input sequencing data using a mixture-model indel error estimation method to improve robustness to indel noise.

* StringTie StringTie is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts.

* Subread High performance read alignment, quantification and mutation discovery. The Subread package comprises a suite of software programs for processing next-gen sequencing read data including: Subread: a general-purpose read aligner which can align both genomic DNA-seq and RNA-seq reads. It can also be used to discover genomic mutations including short indels and structural variants. Subjunc: a read aligner developed for aligning RNA-seq reads and for the detection of exon-exon junctions. Gene fusion events can be detected as well. featureCounts: a software program developed for counting reads to genomic features such as genes, exons, promoters and genomic bins. Sublong: a long-read aligner that is designed based on seed-and-vote. exactSNP: a SNP caller that discovers SNPs by testing signals against local background noises. These programs were also implemented in Bioconductor R package Rsubread.

* TelSeq TelSeq is software that estimates telomere length from whole genome sequencing data (BAMs).

* TIDDIT Structural variant calling: identify chromosomal rearrangements using Mate Pair or Paired End sequencing data. TIDDIT identifies intra and inter-chromosomal translocations, deletions, tandem-duplications and inversions, using supplementary alignments as well as discordant pairs.

* TopHat TopHat is a fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons. It is recommended that HISAT and TopHat(2) users switch to HISAT2.

* Trimmomatic Trimmomatic performs a variety of useful trimming tasks for Illumina paired-end and single-end data. The selection of trimming steps and their associated parameters are supplied on the command line.

* ucsc-tools Tools from the UCSC browser.

* VarScan Variant calling and somatic mutation/CNV detection for next-generation sequencing data.

* vcf2maf Convert a VCF into a Mutation Annotation Format (MAF), where each variant is annotated to only one of all possible gene isoforms.

* VCFtools The aim of VCFtools is to provide easily accessible methods for working with complex genetic variation data in the form of VCF files.

* Velvet Sequence assembler for very short reads.

* ZIFA Zero-inflated dimensionality reduction algorithm for single-cell data.
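Several entries in the list above need other modules loaded first, for example MAVIS. A minimal sketch of loading such prerequisites in order is shown below. The helper name load_modules is hypothetical, and the sketch assumes that 'module load' returns a non-zero status on failure; the module names and versions are copied from the MAVIS entry.

```shell
#!/bin/bash
# Load modules in the given order and stop at the first failure.
# load_modules is a hypothetical helper name used for illustration.
load_modules() {
    if ! command -v module >/dev/null 2>&1; then
        echo "module command not available on this machine" >&2
        return 1
    fi
    local m
    for m in "$@"; do
        # Assumption: 'module load' exits non-zero when the module is missing
        module load "$m" || { echo "could not load $m" >&2; return 1; }
    done
}

# Prerequisites named in the MAVIS entry
load_modules GCC/7.3.0-2.30 OpenMPI/3.1.1 Python/3.7.0 || echo "modules not loaded"
```

The order matters here because Python/3.7.0 is listed as depending on the GCC and OpenMPI modules being loaded first.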

Services provided

LUBI-LSGA will organize training on how to use its services. The topics of these courses will be decided together with the user community. We will also organize seminars and workshops on important bioinformatics topics.


The PhD course "Monte Carlo and molecular dynamics tools", 7.5 hp, is now open for registration. The course will run during weeks 14–23. The registration form should be completed and returned to Ross Church by email or by internal mail (astronomi, HS25) by March 22. More details about the course can be found on the COMPUTE website. Note that students must be COMPUTE members to take this course: membership is free, comes without any obligatory courses, and is open to all PhD students at the faculties of science and medicine and at LTH.

Available for loan

Not available for loan

Terms of access:

Freely open for all research groups working at LU on genomic research. Research groups in need of increased capacity can buy and attach their blades to the system.