Company Overview
Recursion Pharmaceuticals is a clinical-stage biotechnology company decoding biology by integrating technological innovations across biology, chemistry, automation, data science and engineering to industrialize drug discovery. They use the massive datasets generated in their automated wet labs to train machine learning models that identify novel drug candidates. Their strong AI focus and proprietary data position them as a key player in AI-driven drug discovery.
Core AI/ML Stack
Recursion's AI/ML stack is built around a hybrid approach, leveraging both open-source frameworks and custom-built tools optimized for their unique biological data. They utilize:
- Deep Learning Frameworks: Primarily PyTorch, chosen for its flexibility and strong support for research-oriented model development. TensorFlow is also used for some deployment-focused pipelines. (Neither framework has a 3.0 release; current releases of both are in the 2.x series.)
- Model Architectures: Heavy focus on Graph Neural Networks (GNNs) for representing complex biological interactions and Convolutional Neural Networks (CNNs) for analyzing high-content image data. They have pioneered novel GNN architectures specifically tailored for multi-omics data integration. They also use Transformers for sequence-based analysis.
- Training Infrastructure: A combination of on-premise NVIDIA H200 GPU clusters interconnected with fourth-generation NVLink (the NVLink generation used by Hopper-class GPUs) and cloud-based (AWS and Azure) A100 GPU instances for larger-scale training and experimentation. They are piloting custom ASICs designed in collaboration with Cerebras for ultra-fast training of their core GNN models.
- MLOps: MLflow for experiment tracking, model versioning, and deployment. Custom-built pipelines for automated hyperparameter tuning and model selection. They use Kubeflow for orchestrating their end-to-end ML workflows.
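
To make the GNN bullet above more concrete, here is a minimal sketch of one round of message passing, the basic operation behind most GNN layers. It is a toy written in plain Python with no learned weights, not Recursion's architecture; the node names (`geneA`, `geneB`, `compoundX`) and the 50/50 mixing rule are assumptions chosen purely for illustration.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def message_passing_step(
    features: Dict[str, Vector],
    edges: List[Tuple[str, str]],
) -> Dict[str, Vector]:
    """One round of mean-aggregation message passing on an undirected graph."""
    # Build adjacency lists from the undirected edge list.
    neighbors: Dict[str, List[str]] = {node: [] for node in features}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)

    updated: Dict[str, Vector] = {}
    for node, own in features.items():
        nbrs = neighbors[node]
        if nbrs:
            # Mean of the neighbors' feature vectors, dimension by dimension.
            agg = [
                sum(features[n][i] for n in nbrs) / len(nbrs)
                for i in range(len(own))
            ]
        else:
            # An isolated node has nothing to aggregate, so it keeps its features.
            agg = list(own)
        # Mix the node's own features with the neighborhood mean.
        # A real GNN layer would apply learned weights and a nonlinearity here.
        updated[node] = [0.5 * o + 0.5 * a for o, a in zip(own, agg)]
    return updated


# Hypothetical three-node interaction graph.
features = {
    "geneA": [1.0, 0.0],
    "geneB": [0.0, 1.0],
    "compoundX": [1.0, 1.0],
}
edges = [("geneA", "geneB"), ("geneB", "compoundX")]
result = message_passing_step(features, edges)
print(result)
```

Stacking several such rounds lets information travel across multiple hops of the graph, which is why GNNs suit data where the relationships between entities matter as much as the entities themselves.
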
Hardware & Compute Infrastructure
Recursion has strategically invested in both on-premise and cloud infrastructure to balance cost, control, and scalability:
- Data Centers: Two private data centers housing their core GPU clusters, leveraging liquid cooling for high-density compute. They are transitioning to PCIe Gen6 interconnects to improve inter-GPU communication speeds.
- Chip Architecture: Predominantly NVIDIA H200 GPUs (Hopper architecture) with 141 GB of HBM3e memory per GPU. Piloting Cerebras Wafer Scale Engine 3 (WSE-3) for training exceptionally large biological models.
- Cloud vs On-Prem: Hybrid cloud model. On-prem for core model development and data processing requiring low latency and high bandwidth. Cloud (Amazon SageMaker, Azure Machine Learning) for scaling compute on demand, deployment, and disaster recovery and business continuity planning (DR/BCP).
- Networking Fabric: High-bandwidth, low-latency network fabric based on InfiniBand HDR (200 Gbps) within their data centers. 400 Gbps interconnects for connecting data centers to cloud providers.
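
As a rough illustration of what the 200 Gbps and 400 Gbps figures above mean in practice, the following sketch computes the ideal time to move a dataset over a link. The 10 TB dataset size and the `efficiency` parameter are assumptions for the example, not figures from Recursion; real transfers are slower because of protocol overhead and contention.

```python
def transfer_seconds(
    size_terabytes: float,
    link_gbps: float,
    efficiency: float = 1.0,
) -> float:
    """Ideal time in seconds to move a dataset over a network link.

    Uses decimal units: 1 TB = 1e12 bytes, 1 Gbps = 1e9 bits per second.
    `efficiency` (0 < efficiency <= 1) scales the usable bandwidth.
    """
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    bits = size_terabytes * 1e12 * 8
    return bits / (link_gbps * 1e9 * efficiency)


# Assumed 10 TB imaging batch over the two link speeds named above.
hdr_seconds = transfer_seconds(10, 200)    # InfiniBand HDR, 200 Gbps
cloud_seconds = transfer_seconds(10, 400)  # 400 Gbps data-center-to-cloud link
print(f"200 Gbps: {hdr_seconds:.0f} s, 400 Gbps: {cloud_seconds:.0f} s")
```

At line rate, doubling the link speed halves the transfer time, so the assumed 10 TB batch takes 400 seconds at 200 Gbps and 200 seconds at 400 Gbps.
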
Software Platform & Developer Tools
Recursion is developing a robust internal platform to empower their data scientists and engineers:
- APIs & SDKs: Internal APIs for accessing their proprietary datasets, pre-trained models, and automation systems. Python SDKs for interacting with these APIs.
- Developer Platform: Custom-built IDE based on VS Code, pre-configured with necessary libraries and tools for AI-driven drug discovery.
- Open-Source Contributions: Actively contributing to the open-source community with libraries for handling biological data formats and accelerating GNN training.
- Key Internal Tools: Tools for data visualization, model explainability, and simulation of biological systems. They use a custom language (RecLang) for scripting complex biological workflows.
Data Pipeline & Storage
Data is the lifeblood of Recursion's AI engine. Their data pipeline is designed to handle massive volumes of heterogeneous biological data:
- Data Ingestion: Automated pipelines for ingesting data from their robotic wet labs, including high-content imaging, genomics, proteomics, and metabolomics data.
- Data Lakes: A central data lake built on Apache Iceberg and stored on AWS S3 and Azure Blob Storage. Data is partitioned and indexed for efficient querying.
- Streaming: Apache Kafka for real-time data ingestion from their experimental facilities.
- ETL Pipelines: Apache Spark and Apache Beam for transforming and cleaning data at scale. Data lineage is tracked using Apache Atlas.
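
The clean-then-partition flow described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Spark, Beam, or Iceberg code; the record schema (`plate_id`, `modality`, `acquired_on`) and the choice of partition key are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical schema: every record must carry these fields.
REQUIRED_FIELDS = ("plate_id", "modality", "acquired_on")

Record = Dict[str, object]


def clean(records: Iterable[Record]) -> Iterator[Record]:
    """Drop records with missing required fields and normalize the modality."""
    for record in records:
        values = [str(record.get(field) or "").strip() for field in REQUIRED_FIELDS]
        if all(values):
            yield {**record, "modality": str(record["modality"]).strip().lower()}


def partition(records: Iterable[Record]) -> Dict[Tuple[str, str], List[Record]]:
    """Group records by (modality, acquisition date) so queries can skip partitions."""
    parts: Dict[Tuple[str, str], List[Record]] = defaultdict(list)
    for record in records:
        key = (str(record["modality"]), str(record["acquired_on"]))
        parts[key].append(record)
    return dict(parts)


raw = [
    {"plate_id": "P1", "modality": " Imaging ", "acquired_on": "2024-01-15"},
    {"plate_id": "P2", "modality": "imaging", "acquired_on": "2024-01-15"},
    {"plate_id": "P3", "modality": "proteomics", "acquired_on": "2024-01-16"},
    {"plate_id": None, "modality": "genomics", "acquired_on": "2024-01-16"},
]
partitions = partition(clean(raw))
for key, rows in sorted(partitions.items()):
    print(key, [row["plate_id"] for row in rows])
```

A query that filters on modality and date then only needs to read the matching partition, which is the practical benefit of partitioning heterogeneous data before it lands in the lake.
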
Key Products & How They're Built
Recursion's AI stack powers their drug discovery pipeline, leading to multiple clinical candidates:
- REC-994: A drug candidate for treating cerebral cavernous malformation (CCM). It was discovered by training GNNs on their phenomic data to identify compounds that reverse disease-related cellular phenotypes. The models were trained using their on-premise GPU clusters and validated using their automated high-throughput screening platform.
- REC-4881: A drug candidate for treating familial adenomatous polyposis (FAP). It was discovered using transformer models applied to genomic and proteomic data. Custom model architectures optimized for identifying potential therapeutic targets were deployed on Amazon SageMaker for inference.
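
The idea of finding compounds that "reverse disease-related cellular phenotypes" can be illustrated with a simple geometric score. The sketch below assumes each cell state has already been summarized as a feature vector (for example, an embedding produced by an image model) and measures how far a treatment moves diseased cells back toward the healthy state. The scoring rule, the compound names, and the vectors are all assumptions for illustration; this is not Recursion's method.

```python
from typing import Dict, List, Sequence


def reversal_score(
    healthy: Sequence[float],
    diseased: Sequence[float],
    treated: Sequence[float],
) -> float:
    """Fraction of the disease shift undone by a treatment.

    Projects the (treated - diseased) shift onto the (healthy - diseased) axis.
    0 means no movement toward healthy, 1 means fully restored along that axis,
    and negative values mean the phenotype moved further from healthy.
    """
    if not (len(healthy) == len(diseased) == len(treated)):
        raise ValueError("all vectors must have the same length")
    axis = [h - d for h, d in zip(healthy, diseased)]
    shift = [t - d for t, d in zip(treated, diseased)]
    norm_sq = sum(a * a for a in axis)
    if norm_sq == 0:
        raise ValueError("healthy and diseased vectors are identical")
    return sum(a * s for a, s in zip(axis, shift)) / norm_sq


def rank_compounds(
    healthy: Sequence[float],
    diseased: Sequence[float],
    treated_by_compound: Dict[str, Sequence[float]],
) -> List[str]:
    """Return compound names sorted from strongest to weakest reversal."""
    return sorted(
        treated_by_compound,
        key=lambda name: reversal_score(healthy, diseased, treated_by_compound[name]),
        reverse=True,
    )


# Hypothetical two-dimensional phenotype embeddings.
healthy = [1.0, 0.0]
diseased = [0.0, 0.0]
treated = {
    "cmpd_a": [0.75, 0.25],  # moves most of the way back toward healthy
    "cmpd_b": [0.0, 0.5],    # changes the phenotype, but not toward healthy
    "cmpd_c": [-0.5, 0.0],   # moves further away from healthy
}
ranking = rank_compounds(healthy, diseased, treated)
print(ranking)
```

A projection score like this rewards movement along the disease axis only, so a compound that alters cells in an unrelated direction (such as `cmpd_b` here) is not mistaken for a rescue.
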
Competitive Moat
Recursion's competitive advantage stems from a combination of factors:
- Proprietary Data: Their massive, high-quality biological dataset, generated through their automated wet labs, is difficult and expensive to replicate.
- Custom Hardware: Investment in custom ASICs and optimized GPU clusters provides a performance advantage in model training.
- Network Effects: As they generate more data, their models become more accurate, attracting more partnerships and driving further data generation.
- Talent: A world-class team of AI/ML researchers and biologists with deep expertise in drug discovery.
Stack Scorecard
| Dimension | Score (1-10) | Rationale |
|---|---|---|
| Compute Power | 9 | Significant investment in high-end GPUs and custom ASICs provides substantial compute capacity. |
| AI/ML Maturity | 8 | Advanced AI/ML capabilities with demonstrated success in identifying drug candidates. |
| Developer Ecosystem | 7 | Growing developer ecosystem with strong internal tools and open-source contributions. |
| Data Advantage | 9 | Massive, high-quality, and proprietary biological dataset is a significant competitive asset. |
| Innovation Pipeline | 8 | Active research and development of new AI/ML models and algorithms for drug discovery. |