Into the Depths of Cells Using Artificial Intelligence
David Hoksza studies proteins; however, he is not a biologist, as one might assume, but a computer scientist. At the Department of Software Engineering, he develops tools that enable the visualisation of protein structures or the identification of key sites where drugs bind. While scientists worldwide, including major European molecular databases, utilise software created by his research group, his expertise is particularly valued by students at Charles University. Thanks to Hoksza and his colleagues, bioinformatics—a modern, multidisciplinary field combining biology, computer science, and mathematics—has been taught for several years at the Faculty of Mathematics and Physics and the Faculty of Science.
How does a computer scientist become interested in biology?
I have always been interested in biology, but I became more involved in it during my doctoral studies in software systems at the Faculty of Mathematics and Physics. I was working on developing methods for fast searching in protein sequence and structure databases. My research, conducted under the supervision of Tomáš Skopal—now the Vice-Rector for Informatics at Charles University—was, however, heavily computer science-oriented. Tomáš Skopal specialised in metric indexing, a type of specialised indexing structure, and was looking for applications of these methods. Large biological entity databases turned out to be an ideal use case.
At that time, we developed a tool that worked but was not optimal for widespread practical use. The main issue was that we had designed it as computer scientists, without a strong biological background. Biological systems have various constraints, and if you do not understand them, you may end up creating something that works on one type of data but not on another. However, the topic fascinated me, so I continued working on it and gradually incorporated more biology into my expertise.
What exactly does bioinformatics study?
Bioinformatics is a field that combines molecular biology and computer science, specifically applying computational methods to molecular biological data—macromolecules such as proteins, DNA, and RNA. We study their interactions with each other as well as their interactions with small molecules, such as drugs.
So, do you help develop new medicines?
Computational drug development is one of the commonly mentioned applications of bioinformatics research. Another major area where bioinformatics plays a crucial role is DNA sequencing (the process of determining the order of nucleotides in a specific DNA segment). Sequencing machines provide researchers with raw data, which must be processed and stored before it can be interpreted and further analysed—this has largely become a computational task.
Nowadays, most research data are uploaded to publicly accessible databases, which serve as vast repositories storing millions of biological records. These databases are valuable because they allow researchers to compare data across different studies over the years, uncovering connections that might otherwise remain hidden.
Bioinformatics is a relatively new field, but it’s developing rapidly. How has it evolved since you started working in it? What’s the biggest difference today?
Today, we have access to far more data than we did 15 years ago. There’s been a major leap forward in technologies, especially thanks to machine learning methods, which have opened new doors and provided lots of relatively high-quality data.
A recent example is AlphaFold, a tool for predicting the 3D structure of proteins. Last year, its creators from Google DeepMind were awarded the Nobel Prize in Chemistry for it.
What makes this tool so groundbreaking that it earned the highest scientific recognition?
AlphaFold is revolutionary because it enables the generation of reliable protein structures for virtually any sequence, something that was previously impossible. Traditional experimental methods for determining protein structures, such as X-ray crystallography, are extremely time-consuming and costly. The Protein Data Bank (PDB), established in the 1970s, currently contains only around 200,000 protein structures determined through experimental methods. While this may seem like a large number, it can be limiting for fundamental research. Additionally, the database is somewhat biased, as it predominantly includes proteins associated with various human diseases.
In contrast, the UniProt database holds hundreds of millions of protein sequences. Sequencing machines are now standard equipment in many laboratories, making DNA and protein sequence data far more accessible than structural data. For a long time, researchers sought a way to leverage this vast amount of sequence data and develop a computational method to infer protein structures from sequences. This goal was finally achieved with AlphaFold.
Fourteen years ago, in an interview for the university magazine, you mentioned that the “Holy Grail” of protein studies was precisely this: being able to predict protein structure from its sequence. Have scientists now solved this problem?
Only partially. It turns out that the world of proteins is far more dynamic than we once thought. A protein is a sequence of amino acids that folds into a more or less rigid 3D shape. However, a significant portion of the proteome contains so-called intrinsically disordered regions—parts of proteins that only adopt a stable structure in response to an external trigger, such as binding to a partner molecule. Current prediction methods are unable to fully capture this phenomenon.
The fact that proteins are, to some extent, unstructured has only gained broader recognition in research over the past five to ten years. It’s not that this was previously unknown, but it was believed to occur on a much smaller scale. Today, it is estimated that roughly one-third of the human proteome is disordered. As a result, new computational methods are now being developed specifically to predict these unstructured regions.
In the past, you worked on the P2Rank tool, which is also used by the well-known European protein database PDBe (Protein Data Bank in Europe). What are you focusing on now?
P2Rank is a tool for detecting binding sites on the surface of protein structures, the places where proteins interact with small molecules. We continue to maintain and improve this tool. At the same time, we have several related projects; for example, we recently received a grant to develop a machine learning method for identifying cryptic (hidden) binding sites.
Another project is the G2P portal, which, among other functions, allows for the simultaneous visualization of protein sequences and structures while mapping genomic variant data onto these structures. This method integrates data from several publicly available databases and connects information from different parts of DNA and protein sequences, mapping them onto a 3D structure. This project is a collaboration with colleagues from the Broad Institute, a joint biomedical research centre of MIT and Harvard.
Last year, you and your colleagues published an article on the G2P method in the prestigious scientific journal Nature. How did this international collaboration come about?
I began working on the method used in G2P during my three-year postdoctoral fellowship at the Centre for Systems Biomedicine in Luxembourg. This tool caught the attention of colleagues at the Broad Institute, who specialize in researching genetic variants. At Broad, they maintain databases containing information on individual genetic variants and whether they are pathogenic—meaning they cause diseases—or population variants, which occur in the population without having a negative effect. Together, we developed a portal that integrates information across three levels: DNA, protein sequence, and structure.
What does this mean for scientists who use the portal in their research?
It significantly simplifies their work. While all the mentioned components were previously available, they were not interconnected, making large-scale analyses impossible. With this tool, which is accessible via an application programming interface (API), researchers can perform analyses on all human genes, examine individual variants, and track where they appear within protein structure regions. This capability has the greatest potential for drug development, particularly for targeting specific genetic mutations and diseases.
When you started, did you anticipate how much potential bioinformatics would have?
I must admit that when I began my doctoral studies, I had no intention of staying in science. I pursued it purely out of interest. However, by the end of my PhD, more data was becoming available, and new applications were emerging, whether in genomics or proteomics. From the beginning, I was more inclined toward structural bioinformatics, and to some extent, that remains true today.
What drew you to science?
The fact that it offers a great deal of freedom and allows me to do things that no one has done before. That excites and fulfils me. I also enjoy the competitive aspect—trying to develop a method that outperforms others…
You mentioned that you had to learn biology “on the fly.” However, for today’s students who are interested in bioinformatics, it is noticeably easier thanks to the study programme you co-founded years ago at Charles University with Marian Novotný from the Faculty of Science…
I wouldn’t say it is necessarily easier for them. We established bioinformatics as a joint study programme between the Faculty of Mathematics and Physics (Matfyz) and the Faculty of Science. Our students essentially study two fields, computer science and molecular biology, each with its most demanding core subjects. Among other things, they have to take courses like mathematical analysis, programming, and algorithms, just like students who study “only” computer science at Matfyz. However, the advantage is that our graduates gain a solid foundation in both areas, which opens doors to numerous career opportunities.
Where do graduates typically find employment?
Many of them work in biotechnology companies, such as MSD, one of the world's five largest pharmaceutical companies, which has a research and development centre in Prague. A significant number of students remain in academia—this is the case for our very first graduate, who spent several years at ETH Zurich and Harvard and is now a researcher at the Department of Cell Biology at the Faculty of Science. Because our graduates receive a full-fledged education in computer science, they can also easily find employment as developers in software companies.
There are many exciting directions in bioinformatics. What do you plan to focus on in the near future?
Right now, I’m particularly interested in the dynamics of protein structures, which is reshaping how we view and work with data. Another highly promising area is linking structural data to the genomic level. Thanks to machine learning, we can now analyse genomic data in the context of 3D protein structure knowledge, but we can go even further.
For example, in Luxembourg, projects are emerging around so-called disease maps—which are like Google Maps, but instead of cities, they map proteins and small molecules. These maps visualize entire molecular-biological systems, showing different layers, entities, and their interactions. This allows us to study problems in a much broader context—for instance, understanding how the entire system reacts when a specific gene is turned off. The integration of these layers is incredibly exciting, and with technological advancements, we can dive even deeper into these complex systems.
Listening to you, it seems like burnout syndrome isn’t something you have to worry about…
I hope not – and if it ever happens, then I’m probably doing something wrong.
doc. RNDr. David Hoksza, Ph.D.
David Hoksza studied computer science (Software Systems) at the Faculty of Mathematics and Physics, Charles University. From 2017 to 2020, he worked as a postdoctoral researcher at the Luxembourg Centre for Systems Biomedicine, University of Luxembourg. Since 2021, he has been an associate professor at the Department of Software Engineering, Faculty of Mathematics and Physics, Charles University. His research focuses on structural bioinformatics and data visualization. His main current projects include the P2Rank framework, R2DT, and the Genomics 2 Proteins portal. At Charles University, he co-founded the bachelor's and master's degree programs in Bioinformatics and serves as the head of the newly established doctoral program in Bioinformatics and Computational Biology.
OPMK