TypedBio: A Functional Type Theory for Biology and Drug Discovery
Published:
"Well-typed programs don't go wrong."
— Robin Milner
Biological systems are remarkably complex, exhibiting intricate interactions across multiple scales, from genes and proteins up to cells, tissues, and organs. The ways we usually describe and manipulate these systems often lack the precision and structured reasoning you find in programming or mathematics. The descriptive nature of the knowledge in biology could be one of the reasons that hinders reasoning in biological domains such as drug discovery where the goal is to find optimal interventions to steer a biological system towards a desired state.
Lately, I’ve been thinking about the limits of reasoning in biology, how it’s currently done, and what the ideal state might be. If you take mathematics as an example, Gottlob Frege’s introduction of a formal language and notation in 1879 was a monumental step, accelerating discoveries and paving the way for automatic proof assistants.
Having a formal language to study a subject provides multiple benefits: it forces authors to state their axioms up front, makes every logical step explicit and verifiable, and makes it easier to generalize patterns and reuse laws. But my favorite benefit, and indeed the main motivation behind this article, is that formal syntax allows a field to study itself. This meta-level reasoning led to discoveries like Gödel’s incompleteness theorem, which relies on treating mathematical proofs as manipulable strings.
I’ve been exploring an idea I call TypedBio with the motivation to turn the essential part of a biological system into a language that is minimal but powerful in capturing the relevant biological processes: What if we treated biological entities and processes as explicitly typed objects and composable functions, much like in typed programming languages? I mainly draw inspiration from functional programming and dependent type theory to suggest a perspective that might offer a new way of thinking rigorously about biology and medicine.
Before diving into the biological side, let me set the stage with a bit of background. Type theory, at its heart, is about classifying things. Every mathematical object belongs to a type, and types act as a kind of guarantee that things behave as expected. If you’ve ever written code in a language like Haskell, or you’re one of organized programmers who care about type annotation in Python, you’ve seen how types can catch mistakes before they happen. For example, you might have an Int
for integers, a Bool
for true/false values, or a String
for text. Functions themselves have types too, like Int -> Bool
, which means “a function that takes an integer and returns a boolean.”
Dependent Type Theory offers more expressiveness by letting types depend on values. Imagine a vector whose type actually encodes its length, so you can only add vectors of the same size. Or a function whose output type changes depending on the input. Such cases happen so frequently in programming languages and also mathematical proof assistants that DTT is now a norm in many of these systems. If you’re interested in knowing more about type theory, LessWrong has a few interesting blogposts on the topic such as this one explaining its building blocks.
A close concept to type theory is functional programming, whose main idea is treating computation as the evaluation of mathematical functions. You build up complex operations by composing simple, pure functions, no side effects e.g. print(“something”), and no change of environment. It’s a style that encourages clarity, safety, and composability. Basically it’s a clean modular way of describing processes (aka computations).
"In a pure functional language, you can look at a function and know that it does nothing other than return a result. That is a tremendously valuable property when you're trying to understand or reason about a program."
— Simon Peyton Jones
Such clean modular presentation of computation allows complex reasoning to become just a mechanical manipulation of functions. What I envision is to compress biological knowledge into a similar structure and make it amenable to general reasoning procedures similar to those that are carried out in mathematics and programming languages. This compression is of course lossy, i.e., a lot of biological and chemical details are dropped, but this is the same as with any processing engine including human brain. When physicians look at patient’s data and make a decision, they make decisions based on a sparse subset of data not the entire biological state of the patient. The question is then how to compress data in a way which is most useful for the downstream task. A structured language such as TypedBio can be thought of as a compression of processes that are most useful for reasoning tasks in biology. In this framework, each biological component, whether it’s a gene, a protein, or a cell, has its own formal “type,” just like in mathematics or programming. This means you can talk about a Gene, a Protein, a Cell, a Tissue, or an Organ as distinct, well-defined entities in your model.
One of the main advantages of typed systems is their convenience in describing the relationship among different types, which correspond to different biological entities in TypedBio. For example, we can have a function that takes a Gene and tells you how it affects a Protein:
perturbGene :: Gene -> Effect Protein
Or a function that models what happens when a drug targets a specific protein in a cell:
targetDrug :: Drug -> Protein -> Effect Cell
Or a function to model the outcome of clinical interventions:
intervene :: PatientProfile -> Intervention -> ClinicalOutcome
Or functions that measure the outcome of interventions:
measureBiomarker :: ClinicalOutcome -> BiomarkerMeasurement
As you see, functions are first-class citizens in TypedBio. These functions are more than just a line of code, rather they serve as a way of capturing the logic of biology in a precise, checkable form. For instance, perturbGene
takes a gene as input and returns an effect on a protein, letting you reason about gene expression in a structured way. targetDrug
is a function with multiple arguments: you give it a drug and a protein, and it tells you what happens at the cellular level. intervene
models what happens when you apply a treatment to a patient, and measureBiomarker
lets you turn clinical outcomes into quantitative data.
There is though a natural limit to the expressiveness of this approach which is in particular visible in living systems. In biology, context is extremely important, e.g., a drug might work for one patient and do nothing for another, depending on genetics, dosage, and other factors which we may or may not be aware of. To cover such cases, we need types that change based on the value of other types. For example, the outcome type of an intervention may be different in patients with different genetics. Such expressiveness is offered in Dependent Type Theory that I introduced earlier in this article. With dependent types, you can write a function like this:
administer :: (dose : Dosage)
-> (drug : Drug)
-> (patient : PatientProfile)
-> TreatmentOutcome
Here, the type of the outcome can depend on the actual dose, the specific drug, and the patient’s profile. This lets us encode real-world constraints and context directly into our model which reduces the reasoning space to the set of feasible processes.
Let’s make this more concrete with a clinical scenario. Imagine you’re designing a treatment plan for a patient with lung cancer. You start by defining the patient’s profile, including age, diagnosis, and genetic markers. Then you lay out a sequence of interventions: maybe gene therapy targeting P53, surgery to remove the tumor, and a round of chemotherapy with a specific drug and dosage. In TypedBio syntax, this process might look like this:
-- Patient profile
patientB :: PatientProfile
patientB = { age=60, diagnosis=LungCancer, geneticProfile=[GeneP53, GeneEGFR] }
-- Interventions
geneTherapy = GeneTherapy { targetGene = GeneP53 }
surgery = Surgery { organ = Lung, surgeryType = TumorResection }
chemo = Chemotherapy { drug = "Cisplatin", dosage = "75mg/m2" }
treatments = [geneTherapy, surgery, chemo]
outcomes = map (intervene patientB) treatments
biomarkers = map measureBiomarker outcomes
What’s happening here is a kind of pipeline: you have a list of treatments, you apply each one to the patient to get a list of outcomes, and then you measure biomarkers for each outcome. It’s a precise, type-safe way to model the clinical workflow, and it ensures that every step is compatible with the patient’s profile and the logic of biology. This is obviously a simple workflow and can easily be described with natural language. However, the structured language is scalable to procedures with arbitrary number of steps, parallel processes and with many entities, where the description in natural language can quickly become intricate and as a result less actionable.
Currently most biological data are stored as spreadsheets, ontologies, or ad hoc scripts. A standard language like TypedBio offers a few big advantages. First, the type system acts as a safety net, catching errors and enforcing constraints automatically. Second, the functional approach makes it easy to compose and reason about complex interactions. And third, because everything is formalized, you can actually prove that certain treatment protocols are safe or effective using general reasoning engines such as Lean.
Although TypedBio can disambiguate biological data and inter-personal communication, my main motivation is making biology a playground for AI-augmented reasoning agents. Most biological data is messy and unstructured, scattered across pre-clinical experiment assays, EHRs, clinical trials, and research papers. TypedBio gives a way to translate all that information into a formal language that AI can reason with. Imagine using large language models to map legacy datasets into TypedBio syntax, then running AlphaZero-style algorithms to search for optimal treatment strategies. RL-augmented search methods e.g. AlphaZero needs a clearly defined action space. Such structured action space is a natural product of distilling biological knowledge in TypedBio to produce a well-defined and mostly finite action space, turning reasoning in Biology similar to traversal of a game tree. Notice that this TypedBio concerns with the representation of biological states, their relationship and actions (interventions). A full analogy to game-like scenarios requires a well-defined reward function which is beyond the representation method.
To make this practical, we need a way to bridge natural language and formal types. There are currently tools that facilitate such translation e.g. OpenAI’s function-calling API or Pydantic AI framework that ensures agents respect their output type definition when producing tokens. We must then define schemas for things like patient profiles and interventions, and then use LLMs to generate outputs that are automatically checked against those schemas.
It is obvious that TypedBio, for all its strengths, doesn’t capture everything about biology. Real systems are dynamic, stochastic, and full of feedback loops and spatial organization. But I think one beauty of this approach is that it’s extensible, you can gradually add features like time, probability, and spatial relationships to the type system to make it richer and more expressive as you go.
I want to emphasize that TypedBio isn’t just a technical framework (like a typed-annotated python package). I see it as a way of thinking about biology that’s both formal and flexible. It’s a way to bring fundamental ideas from mathematics and computer science to make biology a playground for AI reasoning systems in a hope to discover super-human interventions for a complex system which is otherwise hard to find in natural language.
"Biology is the study of complicated things that give the appearance of having been designed for a purpose."
— Richard Dawkins