Since the advent of high-throughput genome sequencing methods in the mid-2000s, molecular biology has rapidly transitioned towards data-intensive science. Recent technological developments have made omics experiments more accessible by lowering their cost, while the concurrent design of new algorithms has improved the computational workflow needed to analyse the large datasets generated. This has made it possible to pursue the long-standing idea of a systems approach to the cell, where molecular phenomena are no longer observed in isolation, but as parts of a tightly regulated cell-wide system. However, large data biology is not without its challenges, many of which are directly related to how ome-wide datasets are stored, handled and analysed.
The present thesis examines large data microbiology from a middle ground between metabolic engineering and in silico data management. The work was performed in the context of applied microbial lignocellulose valorisation, with the end goal of generating improved cell factories for the production of value-added chemicals from renewable plant biomass. Three different challenges related to this feedstock were investigated from a large data point of view: bacterial catabolism of lignin and its derived aromatic compounds; tolerance of baker’s yeast, Saccharomyces cerevisiae, to inhibitory compounds in lignocellulose hydrolysate; and the non-fermentable response to xylose in S. cerevisiae engineered for growth on this pentose sugar.
The bibliome of microbial lignin catabolism is vast and consists of a long-standing body of fundamental microbiology and a more recent body of applied lignin biovalorisation research. Here, an online database was created with the long-term ambition of closing the gap between the two and making new connections that can fuel the generation of new knowledge. Whole-genome sequencing was used to investigate the genetic basis for observed phenotypes in bacterial isolates capable of growing on different kinds of lignin-derived aromatics. A whole-genome approach was also used to identify key sequence variants in the genotype of an industrial S. cerevisiae strain evolved for improved tolerance to inhibitors and high temperature. Finally, assessment of the sugar signalome of S. cerevisiae was enabled by the design and validation of a panel of in vivo fluorescent biosensors for single-cell cytometric analysis. It was found that xylose triggered a signal similar to that of low glucose in yeast cells engineered with xylose utilization pathways, and that introducing deletions previously related to improved xylose utilization shifted the signal towards that of high glucose.
Taken together, the present thesis illustrates how omics approaches can aid the design of laboratory experiments to increase the knowledge and understanding of microorganisms, and demonstrates the need for combined knowledge of molecular and computational biology in large-scale data microbiology.
Technological advancements in society continuously change how we live and work. Over the last five decades, computers have helped us organize and process text and numbers, and the internet has given us round-the-clock access to a wealth of information and global communication. These developments have also changed how science is performed and disseminated. Specialized instruments can now make hundreds of thousands of measurements of a sample in one go, immensely speeding up research. As a result, some fields in contemporary cell biology are now as much about handling and understanding data as they are about the biology itself.
This type of so-called Large Data biology has opened up whole new possibilities for how the microbial cell can be investigated. While traditional molecular microbiology approaches the subject by studying a couple of elements in a cell, such as genes and proteins, on their own, the new technologies make it possible to study whole layers (so-called omes) of the cell at once. For instance, the genome consists of all the genes in a cell, the transcriptome of all the mRNA that has been expressed from the genes at a given time, the proteome of all the proteins translated from said mRNA at a given time, and the metabolome of all the chemical compounds (metabolites) produced by the proteins. The methods used to measure these omes are referred to as omics; for instance, the technique used to identify the genome (all the genes in the cell) is called genomics.
The sheer size and complexity of the data generated by ome-wide studies call for scientists to have simultaneous knowledge of the biology (here: the microbial cell) and of the computational part. The process of handling large biological data is known as bioinformatics and is, together with data management and computer programming, an invaluable tool for the modern molecular microbiologist.
In the present thesis, Large Data biology was applied to improve the knowledge and understanding of microbial cells designed for sustainable production of renewable chemicals. Central to the investigation was the biological conversion of non-edible plant matter (so-called lignocellulose), such as corn stover, wood chips and bagasse, into societally valuable products, e.g. bioethanol. The current work focused on the initial half of the microbial conversion: how lignocellulosic compounds can be better taken up and broken down by the cell.
Three case studies were considered: i) how to better assess the scientific literature; ii) how to determine the genome sequence of complex industrial microorganisms and new isolates (genomics); and iii) how to measure how the cell senses its nutrients (here: different sugars) and controls their breakdown.
In the first case, a web-based database was designed and developed that collects the large and slightly disjointed scientific literature on the microbial breakdown of lignin, one of the major components of lignocellulose. The goal of the database is to collect all current knowledge on lignin biodegradation in a single interactive platform in order to simplify the process of data retrieval for the scientific community.
In the second case, the genomes of lignin-degrading bacteria and a lignocellulose-fermenting yeast were determined by whole-genome sequencing methods. These methods produce millions of small snippets of DNA sequence that have to be assembled back into the full genome, a process not unlike that of building a jigsaw puzzle, except that the final picture is often unknown at the start. The assembled genomes were then used to determine the presence of genes related to the ability to grow on lignin and its related aromatic compounds. Genomics methods were also used to discover mutations in a yeast strain that had acquired increased tolerance to stressful conditions encountered in industrial lignocellulose fermentation, in order to explain why this yeast had become more robust.
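To make the jigsaw analogy concrete, the following is a minimal toy sketch of greedy overlap assembly in Python. It is not the assembly method used in this thesis: real assemblers work on overlap or de Bruijn graphs and must also cope with sequencing errors, repeats and reverse-complement reads. The function names (`overlap`, `greedy_assemble`), the `min_len` parameter and the example reads are hypothetical and chosen only for illustration.

```python
def overlap(a, b, min_len=3):
    """Return the length of the longest suffix of `a` that equals a prefix of `b`.

    Overlaps shorter than `min_len` are ignored and reported as 0.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Toy assembler: repeatedly merge the pair of reads with the longest overlap.

    Assumes error-free reads that all come from the same DNA strand.
    Returns the list of contigs left when no overlaps of at least `min_len` remain.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # no remaining pieces fit together
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]
    return contigs


# Three overlapping "snippets" of a made-up 15-base sequence
reads = ["TACGTTAGC", "ATGGCGTA", "GCGTACGT"]
print(greedy_assemble(reads))  # ['ATGGCGTACGTTAGC']
```

As with a jigsaw puzzle, the pieces are joined where their edges match. The greedy choice of the longest overlap is a simplification that can misassemble repetitive sequence, which is one reason why genome assembly is far harder in practice than this sketch suggests.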
In the third case, the peculiar response of baker’s yeast Saccharomyces cerevisiae to the five-carbon sugar xylose was investigated. This yeast cannot naturally grow on xylose, and has to be genetically modified with genes from other organisms to do so. Still, even after genetic engineering, the yeast grows much more slowly on xylose than on its preferred sugar glucose, and produces ethanol at a lower rate. To investigate this behavior, a set of green fluorescent markers was constructed that, once installed in the yeast genome, allowed the sugar sensing and signaling network to be measured in each cell in real time through fluorescence measurements. It was found that when the cell sensed xylose, the resulting signal was the same as that produced by very low concentrations of glucose (i.e. almost starvation), and that modifying genes previously known to be key for improved use of xylose shifted the signal towards that of regular amounts of glucose.
This thesis illustrates that the use of different forms of Large Data biology allows investigations of the microbial cell in ways that would not be possible, or not feasible within a reasonable time, with traditional microbial methods. It also shows that the sheer volume of data these approaches generate quickly becomes a needle-in-a-haystack challenge, where finding the relevant data in the large ocean that is the cellular omes is only possible when molecular biology is combined with computational approaches.