Contemporary techniques dispel with the assumption of linearity and instead seek highly non-linear relations 1, often by employing neural networks. While these promising results indicate that learned representations can have substantial impact on scientific data analysis, they also beg the question: what is a good representation? This elementary question is the focus of this paper.Ī classic example of representation learning is principal component analysis (PCA) 15, which learns features that are linearly related to the original data. ![]() In the analysis of protein sequences in particular, the last years have produced a number of studies that demonstrate how representations can help extract important biological information automatically from the millions of observations acquired through modern sequencing technologies 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14. Given the importance of representations it is no surprise that we see a rise in biology of representation learning 1, a subfield of machine learning where the representation is estimated alongside the statistical model. through visualization, or task-specific predictions where limited data is available. This can subsequently be used for data exploration, e.g. At its core, a representation is a distillation of raw data into an abstract, high-level and often lower-dimensional space that captures the essential features of the original data. In the first context, we demonstrate that several contemporary practices yield suboptimal performance, and in the latter we demonstrate that taking representation geometry into account significantly improves interpretability and lets the models reveal biological information that is otherwise obscured.ĭata representations play a crucial role in the statistical analysis of biological data. We explore two key contexts in which representations naturally arise: transfer learning and interpretable learning. Here, we approach this question for representations of protein sequences, which have received considerable attention in the recent literature. This begs the question of what even constitutes the most meaningful representation. ![]() However, empirical evidence suggests that seemingly minor changes to these machine learning models yield drastically different data representations that result in different biological interpretations of data. Machine learning promises to automatically determine efficient representations from large unstructured datasets, such as those arising in biology. How we choose to represent our data has a fundamental impact on our ability to subsequently extract information from them.
0 Comments
Leave a Reply. |
AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |