Unleashing the Power of AI in Drug Design: Paving the Path to Open Chemistry Data

Thursday 27th July 2023

By Sofia Melliou

The field of drug discovery has continuously evolved over time, driven by advancements in scientific understanding and technological innovations. From serendipitous discoveries such as aspirin and penicillin to targeted pharmacology and high-throughput screening, the process of identifying new therapeutic agents has undergone significant transformations.

The integration of artificial intelligence (AI) and machine learning (ML) into drug discovery is poised to transform the field. The immense potential of AI-derived breakthroughs has spurred change across the pharmaceutical industry and academic landscape, leading to a number of collaborations and partnerships. In addition, government funding initiatives underscore the growing recognition of AI's significance in shaping the future of chemistry in drug discovery and beyond. One example is the recent $200 million allocation to the Acceleration Consortium at the University of Toronto to create AI-orchestrated self-driving laboratories that synthesize new molecules and generate high-quality datasets to train AI models and improve AI predictions in real time.

Within this transformative landscape, the Structural Genomics Consortium (SGC) stands at the forefront, propelled by its mission to understand all proteins encoded by the human genome and accelerate the discovery of new medicines through open science collaboration.

Staying true to our values

Over its first 20 years, SGC scientists determined the structures of thousands of human proteins and collaborated closely with industry partners to invent and disseminate hundreds of chemical probes. These small-molecule compounds can modulate individual proteins and are crucial for understanding protein functions. Looking forward, we are focusing our efforts on the mission of Target 2035, a global initiative led by SGC that aims to deliver pharmacological tools for all human proteins by 2035.

To expand hit-finding approaches beyond traditional methods and contribute to Target 2035’s goal, SGC is actively working towards converting hit-finding from an experimental process into a largely computational one. This focus on catalyzing drug discovery driven by ML, and more specifically by deep learning (DL), has the potential to enable the discovery of novel hit molecules for protein targets, especially understudied proteins.

Utilizing our expertise with computational models

The concept of AI in drug discovery is not new. The intersection of computational approaches and small-molecule drug design has been explored since the 1970s, aiming to understand the relationship between molecular structure and biological activity to nominate potential candidates. Since then, AI has been used to analyze vast amounts of data, identify patterns, and predict the properties of compounds, enabling more efficient and targeted drug discovery. It has been applied in areas such as virtual screening, de novo drug design, and drug repurposing.

One notable example is AlphaFold, developed by DeepMind, which has demonstrated impressive capabilities in predicting protein structures, a task that typically requires months or years of extensive experimental lab work. However, the ultimate goal is to computationally invent drugs, and as Matthieu Schapira, Professor at the Department of Pharmacology and Toxicology at the University of Toronto, says, “We are still far from this holy grail”.

In fact, while the impact of AlphaFold on structural biology is compelling, its ability to directly assist in the drug discovery process remains unproven. It lacks the capability to predict how a drug-like molecule exploits a protein's binding site at the atomic level, and scientists still need to rely on experimental methods in the lab or on slow and complex physics-based simulations to understand these interactions.

The progress of AI in drug discovery is also limited by the availability of large experimental datasets, which sometimes lack consistency. Access to diverse datasets, including protein structures bound to small molecules and inhibitor screening data, is crucial for training machine learning algorithms. As a Nature editorial earlier this year put it, “Machine-learning systems in chemistry need accurate and accessible training data. Until they get it, they won’t achieve their potential” [1].

To address the need for more consistency in hit-finding datasets, SGC complements its structural chemistry expertise by profiling hundreds of thousands of compounds against a diverse set of proteins in a hypothesis-free approach, following open science models. This approach expands the scope of hit-finding and enables the selective targeting of proteins for drug discovery.

Enhancing Data Accessibility

One crucial aspect of leveraging AI in drug discovery is enhancing data accessibility. “We have to find the best ways to generate high-quality data that is well formatted for ML”, says Matthieu.

SGC strives to deposit all data in publicly accessible repositories. By utilizing existing repositories like those of the European Bioinformatics Institute (EBI), and by working with data scientists, SGC ensures that all data is made accessible in an ML-friendly format. Legacy protein production data (sequences of purified and crystallized protein constructs) are already publicly available via the EBI. In addition, SGC’s database infrastructure is compatible with the screening data format underlying EBI’s ChEMBL database, allowing seamless data sharing and, ultimately, data dissemination. This approach not only promotes transparency and accessibility for all but also provides valuable data for refining AI tools and models.

The evolving landscape of publishing practices further contributes to increased data accessibility. Open science initiatives encourage the sharing of data and methodologies, promoting collaboration and enabling researchers to build upon each other's work. SGC has established a knowledge translation platform in collaboration with the European EUbOPEN project to manage, integrate, and disseminate the open-science chemical biology data, reagents, and knowledge generated by our collaborative projects. By sharing our findings openly, we are hoping to accelerate the discovery of new treatments for a wide range of diseases.

Moreover, the establishment of a robust infrastructure for data storage and dissemination can foster stronger collaborations with companies utilizing AI methods for drug discovery. These companies can harness open-source protein/compound screening datasets to train their algorithms and improve their services while maintaining proprietary technology. Additionally, as they advance their internal drug discovery pipelines, they can contribute to open science by making their data available in the public domain.

Automation in the Laboratory

Another important aspect of leveraging AI in drug discovery is consistency and reproducibility. SGC aims to utilize AI to automate the synthesis of novel molecules for hit optimization.

The design-make-test cycle currently used to optimize initial hit compounds into drug-like molecules is a slow, iterative trial-and-error process. To tackle this challenge, a drug discovery self-driving lab (SDL) at SGC Toronto was recently funded and will be led by Professor Cheryl Arrowsmith, chief scientist of SGC Toronto. This SDL will accelerate the early stage of drug discovery by taking an initial compound and developing it into a more drug-like molecule that modulates a protein of interest in cells.

The drug discovery SDL will combine artificial intelligence, robotics, and advanced computing to accelerate the design-make-test cycle. This automation ensures consistency and reproducibility, and it ensures that valuable data are properly recorded for use in predicting synthetic routes.

Collaboration and Partnerships

Collaboration between AI companies and SGC plays a pivotal role in our effort to drive innovation in drug discovery. This allows for a diverse range of insights and perspectives, fostering collective intelligence and accelerating progress in the field.

Collaborating with a wide group of companies allows us to skip the expensive experimental step of initial screening and instead test a much smaller number of molecules. To date, SGC has established partnerships with nine computational companies, which have already had a tremendous impact on the field. More specifically, by combining our expertise in making and screening proteins and in determining structures to validate AI predictions with that of partners of choice, such as Cyclica (now part of Recursion) and Atomwise, we have discovered starting points (hits) for chemical probe development against proteins that are mostly uncharacterized.

Furthermore, we have established partnerships with DL-based biotech companies in Canada and the US to conduct a pilot project assessing the potential of DL in drug discovery. During this project, we successfully identified and experimentally validated drug-like 'hits' for 5 out of 11 targets for which DL-based predictions were made. This achievement is highly significant since the 'hit-finding' success rate was achieved with experimental testing of only ~100-200 compounds per target, in contrast to the tens to hundreds of thousands of compounds required in traditional experiment-only hit-finding approaches. Although these results are promising, it is important to acknowledge that there is still a long way to go before computers can predict with high accuracy which small molecules will bind to a protein. The key to expediting the development of improved DL methods lies in rigorous benchmarking. This is precisely what SGC-Toronto, in collaboration with Bayer, Boehringer Ingelheim, AstraZeneca, and other companies, is organizing through the open science CACHE public-private partnership.
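To put the pilot figures quoted above side by side, here is a minimal arithmetic sketch in Python. The hit counts and the ~100-200 compounds per target come from the text; the endpoints chosen for "tens to hundreds of thousands" (10,000 and 100,000) are illustrative assumptions, not reported values, and the function names are hypothetical.

```python
from fractions import Fraction

# Figures quoted in the article for the DL pilot project.
targets_with_hits = 5
targets_attempted = 11
dl_compounds_per_target = (100, 200)  # "~100-200 compounds per target"

# Assumption: illustrative endpoints for "tens to hundreds of thousands"
# of compounds in traditional experiment-only screening.
traditional_compounds_per_target = (10_000, 100_000)


def target_hit_rate(hits: int, attempted: int) -> Fraction:
    """Fraction of targets for which at least one hit was validated."""
    return Fraction(hits, attempted)


def fold_reduction_range(traditional, dl):
    """Smallest and largest fold reduction in compounds tested per target."""
    low = traditional[0] / dl[1]   # fewest traditional vs. most DL compounds
    high = traditional[1] / dl[0]  # most traditional vs. fewest DL compounds
    return low, high


rate = target_hit_rate(targets_with_hits, targets_attempted)
low, high = fold_reduction_range(
    traditional_compounds_per_target, dl_compounds_per_target
)

print(f"Targets with validated hits: {rate} ({float(rate):.0%})")
print(f"Fewer compounds tested per target: roughly {low:.0f}x to {high:.0f}x")
```

Under these assumed endpoints, the pilot tested somewhere between about 50 and 1,000 times fewer compounds per target than a traditional screen, while finding hits for just under half of the targets.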

Benchmarking best computational methods

Led by the SGC, Bayer and Boehringer Ingelheim, Critical Assessment of Computational Hit-finding Experiments (CACHE) is a new international benchmarking competition in which AI experts and computational chemists from around the world predict drug-like small molecules that will bind to proteins relevant to drug discovery. This series of challenges, with its experimental hub at SGC-Toronto, reveals the state of the art in computational hit-finding by systematically testing predictions and making the results publicly available.

“There are a lot of expectations that AI will enable a technological breakthrough in computational hit-finding, as it did with AlphaFold in protein structure prediction. Lots of startup companies say that they have found the solution. CACHE is a good contest to see which methods are the most efficient and where the field is going”, says Matthieu who is leading this initiative.

Because the CACHE initiative experimentally tests the compounds predicted by participants, it benefits AI companies in drug discovery by giving them access to advanced experimental capabilities to validate their predictions, improving their competitiveness on the national and global stages.

A new hit-finding benchmarking exercise is initiated every four months. These public competitions (challenges) have the added benefit of identifying new chemical starting points for biologically interesting targets: participants from all over the world use their computational methods to predict hits, which are then tested experimentally by CACHE. With 4 challenges already under way, this partnership includes 43 scientists from academia and industry in 16 countries, who have predicted more than 6,000 open-source drug starting points for neurological and viral diseases that are experimentally tested at the SGC.

Systematic testing and public data sharing set the collaboration between SGC and CACHE apart as pioneering. It is a powerful approach that enhances the credibility and reliability of the research conducted, not only in Canada but potentially on a global scale. We eagerly anticipate the release of the first results from these efforts later this year.

In conclusion, AI has the potential to revolutionize drug discovery, but its progress is contingent upon access to vast and high-quality experimental data. SGC's strategic efforts to convert hit-finding into a computational process, along with our focus on data accessibility and collaborative initiatives, pave the way for transformative discoveries. By combining SGC’s protein expertise and its network of industry partners and collaborators with computational expertise from CACHE and other projects, and with a diverse range of experimental data, SGC will create an open science drug discovery technology hub that is globally unique. These efforts can accelerate the development of novel therapeutics and move us closer to achieving the goals of Target 2035.