My time at DeepLearn 2025

In July 2025 I went to the DeepLearn summer school at the University of Maia, Portugal. This is a summary of my experience.
Published 2025-08-13

I had a pretty intense week in Portugal. The lectures ran all day, and there was a hackathon running most evenings, which I took part in. I flew into Porto and had a couple of hours to walk around the city before getting the metro out to Maia, a more industrial town about an hour away. The metro runs nicely, but the payment system is pretty confusing.

I learned a lot from the lectures, despite some of the intimidating maths involved. Each lecturer gave a series of three talks, and because three series ran in parallel you couldn’t attend everything. I had to miss some I was looking forward to, but attendees can view them online for a while. Hopefully I can find the time!

Lectures

I came along to Mark Derdzinski’s lectures because I’ve been thinking about (and working on in dribs and drabs) an agentic version of my work project, Lettuce, and they were very helpful. The material is more broadly applicable than agentic applications, though: its real use is for contexts where you don’t have a hard ground truth for an application’s outputs. This is relevant to us because the job of Lettuce is to suggest the correct term from a set vocabulary for a user’s input, which is called “mapping” a source term to OMOP. Almost the first thing I did when I started the project was to try and find some way of evaluating its outputs, and it has been tricky.

With human mapping, there’s no one “correct” term, so the best we can do in development is to map the terms ourselves and evaluate pipelines by their agreement with our choices, but that’s just like, uh, your opinion, man. I have a fantasy about getting a list of source terms, asking as many people who do this work as possible to map them, and using those multiple mappings to score outputs, but getting that many busy people to do more of their job is unrealistic. This is why I found these lectures so useful. Mark had plenty of advice on how to design sets of metrics that can serve as proxies for your desired behaviour. He framed it all in a paradigm of evaluations lying on a continuum of difficulty to collect, with simple assertions at one end and the collection of user feedback at the other, which I found really helpful. There are plenty of examples in my notes of me being excited because Mark articulated something I had incoherently been thinking for a while, which is always nice.
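To make that fantasy a bit more concrete, here’s a hypothetical sketch of what scoring a pipeline against several annotators could look like. Everything in it is made up: the source terms, the placeholder concept IDs, and the `agreement_score` helper are mine, not Lettuce’s, and the scoring rule (fraction of annotator votes the pipeline agrees with) is just one obvious choice among many.

```python
from collections import Counter

# Hypothetical annotations: each source term mapped to a (placeholder) concept ID
# by three different people. Neither the terms nor the IDs are real OMOP data.
annotator_mappings = {
    "paracetamol 500mg tablets": [1001, 1001, 1001],
    "high blood pressure": [2001, 2001, 2002],   # the annotators disagree here
}

def agreement_score(pipeline_output: dict[str, int]) -> float:
    """Fraction of annotator votes that the pipeline's chosen concept matches."""
    agreeing = total = 0
    for term, votes in annotator_mappings.items():
        counts = Counter(votes)
        agreeing += counts.get(pipeline_output.get(term), 0)
        total += len(votes)
    return agreeing / total

# A pipeline that picks the majority concept for both terms agrees with 5 of 6 votes.
print(agreement_score({"paracetamol 500mg tablets": 1001, "high blood pressure": 2001}))  # ≈0.83
```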

There was also a good amount of material on how to handle evaluations within the context of an organisation. Partly this was about getting stakeholders to agree on what useful metrics are, which is a soft skill I do not possess. Part of it was getting metrics to work as part of development (I think he even used the phrase “Eval-driven development” at one point), which I liked. This included ways of running metrics in production and handling changes in the distribution of inputs once you’re there. This part made me feel guilty for not having proper logging and metrics set up, which we will need as people start using Lettuce.
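As an example of the cheap end of that continuum, here’s what a couple of assertion-style evals might look like written as pytest tests. `map_term` and `VOCABULARY` are stand-ins I’ve invented for illustration, not Lettuce’s actual API.

```python
# VOCABULARY and map_term are invented stand-ins for illustration only.
VOCABULARY = {"Acetaminophen", "Hypertensive disorder"}

def map_term(source_term: str) -> str:
    """Stand-in for the real pipeline call."""
    return "Acetaminophen"

def test_output_is_in_vocabulary():
    # The cheapest possible eval: whatever the pipeline suggests must come from
    # the controlled vocabulary, regardless of whether it's the best term.
    assert map_term("paracetamol 500mg tablets") in VOCABULARY

def test_known_easy_case():
    # A slightly stronger assertion for an input with an uncontroversial answer.
    assert map_term("paracetamol 500mg tablets") == "Acetaminophen"
```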

As I’m working on AI for healthcare, I thought I should probably go to the lectures on that topic, though I really wanted to go to the “Machine learning at the frontier of Astrophysics” series running in parallel. Jayashree Kalpathy-Cramer gave the lectures, which were an overview of and introduction to the field. Having been in it for a while, I could have done with the lectures getting into more of the details, but it was interesting to hear talks from the perspective of a group leader. Part of that perspective was being able to acknowledge that large language models can perform well on particular tasks while having significant potential downsides, like loss of skills and users relying on them too much, particularly in critical situations.

Another nice thing was understanding a lot more of it than I would have done coming to these talks a year ago. I remember bouncing off explanations of multimodal models not that long ago, and now it all seems understandable.

This wasn’t one of the lecture series but the keynote speech, and what a keynote! Sergei Gleyzer gave the lecture, covering the ways physics, particularly particle physics, uses machine learning. He works with data from the Compact Muon Solenoid at the Large Hadron Collider (LHC), so had a lot of expertise to share. It honestly made me wish I had more of a maths and physics brain; it was very inspiring. I was surprised to learn how much the LHC uses machine learning. One use that was particularly interesting was identifying what data to keep and what to throw away. The LHC generates so much data that they can’t keep it all, so they apply machine learning to identify when they should be recording. Because this all has to happen quickly and online, running models on stored data with conventional computing isn’t possible, so the models are implemented in electronics, particularly field-programmable gate arrays, which is very cool.

He also talked about ideas at the edge of AI research that have applications in physics, like using neurosymbolic AI for high-energy physics, and Kolmogorov-Arnold networks. The last of these topics was quantum machine learning, which sounded very compelling at the time, but my knowledge of quantum computing is so poor that my notes are now indecipherable. Never mind.

This might have been my favourite lecture series, partly for the subject matter and partly because Zhangyang “Atlas” Wang, who delivered it, was very funny. My mathematical chops were fine when I was a biochemist, but they’re something I’ve had to work on since moving to my current job. It was nice to know this has been paying off, because these lectures were mathematically pretty intense! As I understood it, their core idea was that real, useful data are points in a lower-dimensional space than the number of dimensions they’re represented in, and that deep neural networks are biased towards finding these low-dimensional structures. A lot of the lectures described techniques that exploit this idea to improve the optimisation of neural networks. Prof. Wang has developed training methods that reduce memory usage when training LLMs by working with a projection of the gradients onto a low-rank subspace, and… another technique I couldn’t follow (it was the end of a long day, OK?)
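My notes on the low-rank part amount to something like the following toy sketch. To be clear, this is my own illustration of the general idea of doing the optimiser update in a projected, low-rank space, not Prof. Wang’s actual method; real methods are much more careful, for instance about how often the projection is refreshed.

```python
# My own toy illustration of doing an optimiser update in a low-rank projection of
# the gradient, not the method from the lectures. Real methods are more careful,
# e.g. about how often the projection is recomputed; here it is redone every step.
import torch

def low_rank_projection(grad: torch.Tensor, rank: int):
    # The top-`rank` left singular vectors of the gradient span the subspace we keep.
    U, _, _ = torch.linalg.svd(grad, full_matrices=False)
    P = U[:, :rank]                       # (d_out, rank)
    return P, P.T @ grad                  # projected gradient: (rank, d_in)

torch.manual_seed(0)
W = torch.randn(256, 128, requires_grad=True)  # one weight matrix standing in for a model
momentum, lr, beta, rank = None, 1e-2, 0.9, 8

for step in range(10):
    loss = (W @ torch.randn(128, 16)).pow(2).mean()   # stand-in for a real training loss
    loss.backward()
    with torch.no_grad():
        P, g_low = low_rank_projection(W.grad, rank)
        # Optimiser state (here, momentum) lives in the small space: rank x d_in
        # instead of d_out x d_in, which is where the memory saving comes from.
        momentum = g_low if momentum is None else beta * momentum + g_low
        W -= lr * (P @ momentum)          # project the update back to full size
        W.grad = None
```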

The final lecture went somewhere a bit unexpected, but very interesting: AI safety. Fine-tuning an LLM can inadvertently remove guard rails, even if you don’t think you’re fine-tuning anywhere near an unsafe domain. It turns out that even if you identify neurons that appear to do nothing but enforce safety, freezing those neurons doesn’t protect you. His group developed a technique that instead looks at subspaces of the model weights that define safety behaviours and sort of pushes the weights so that fine-tuning happens in a safer part of the training landscape. If that sounds as much like magic to you as it did to me, then you’ll be glad I didn’t follow the last part of his lectures at all. It was about getting neural networks to learn algebra, but my maths skills are not strong enough to describe it.

Xia “Ben” Hu gave these lectures on how to train and serve LLMs more efficiently. Interestingly, a lot of what he has done on efficient serving is already applied in llama.cpp, which we use at work, so there’s less for me to apply here than I thought going in! In the first lecture he very nicely described both some LLM basics and how and why you might want to serve an LLM more efficiently. A useful hierarchy of ideas he used was:

  • Weight compression
    • Sparsification
    • Quantisation
  • KV cache compression

The main bottleneck in serving LLMs is memory, and you can employ these techniques to reduce your memory requirements. Weight compression means making your model weights take up less memory, which you can do in one of two ways: reducing their number (sparsification) or reducing the size of each parameter (quantisation). Once you have shrunk your model, the KV cache starts to take up more memory than the weights. If you’re unfamiliar with a KV cache, it stores intermediate attention calculations (the keys and values) so they don’t have to be recomputed for every generated token. There are lots of clever tricks you can employ to approximate the full KV cache in less space. This, and the subsequent lectures, tied in well with the low-rank lectures, which was nice.
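As a concrete, heavily simplified example of the quantisation idea, here’s a toy symmetric int8 scheme: store int8 weights plus one float scale per row, which cuts the memory for a weight matrix by roughly 4x. This is just my own illustration, not anything specific from the lectures.

```python
# A toy symmetric int8 weight-quantisation scheme (my illustration, not anything
# from the lectures): int8 weights plus one float32 scale per row.
import torch

def quantise_int8(w: torch.Tensor):
    scale = (w.abs().amax(dim=1, keepdim=True) / 127.0).clamp_min(1e-8)
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantise(q: torch.Tensor, scale: torch.Tensor):
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)
q, scale = quantise_int8(w)
print("fp32 bytes:", w.numel() * 4)                   # ~67 MB
print("int8 bytes:", q.numel() + scale.numel() * 4)   # ~17 MB, roughly a 4x saving
print("max abs error:", (w - dequantise(q, scale)).abs().max().item())
```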

The lectures were rounded off with a discussion of interpretability measures, which doesn’t sound like it would tie in, but Prof. Hu had emphasised that serving efficiencies help developers build with AI, and framed efficient interpretability as part of the development process, which made it all flow better than you might expect.

This was probably the most maths I had to confront here. I came along because I thought it might give me some pointers in dealing with the relationships between OMOP concepts for work. I was very wrong, but got something much better, as I realised when the title of the first lecture was “Hyperbolic space Deep Learning for Foundation model: A Tutorial”. Zhitao “Rex” Ying gave these lectures.

The main argument of the lectures was that the structure of language means token embeddings naturally follow a graph structure, and that the Euclidean spaces current foundation models work in aren’t the best choice for graph-structured data. A few words are used very often (e.g. “to”, “have”, “in”, “that”), and these often determine the subsequent, less common words. The representations that LLMs learn place the common words near the origin and the less common words further out, where there is more space. This is the model trying to arrange the embeddings into an approximation of language’s natural, hierarchical structure, but it leads to distortion. If you instead replace the model’s operations with equivalents in hyperbolic space, you get the structure without the distortion, because hyperbolic space is curved such that the further out you go, the more room there is. If this doesn’t make sense, it’s because I’m describing it badly; just read the paper if you want something sensible. What I was amazed at was that I came out of the lectures with a decent understanding of what a hyperbolic space is and why it would be useful. It’s been a long time since I went to a series of lectures rather than one-off seminars, and it was great to get that extra depth.
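The lectures had proper theory; this is just my own toy check of the “more room the further out you go” claim, using the standard distance formula for the Poincaré ball model of hyperbolic space. The numbers are arbitrary; the point is that the same Euclidean gap corresponds to a much larger hyperbolic distance near the boundary of the ball.

```python
# My own toy check of the "more room the further out you go" claim, using the
# standard distance formula for the Poincaré ball model of hyperbolic space.
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    # Points live strictly inside the unit ball; distances blow up near its boundary.
    sq_dist = np.sum((u - v) ** 2)
    denom = (1 - np.sum(u ** 2)) * (1 - np.sum(v ** 2))
    return np.arccosh(1 + 2 * sq_dist / (denom + eps))

# Two pairs of points with the same Euclidean gap (0.05), one near the origin
# and one near the boundary.
near_a, near_b = np.array([0.0, 0.0]), np.array([0.05, 0.0])
far_a, far_b = np.array([0.90, 0.0]), np.array([0.95, 0.0])

print(poincare_distance(near_a, near_b))  # ≈0.10
print(poincare_distance(far_a, far_b))    # ≈0.7: far more room out here for a hierarchy's many leaves
```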

I’m really interested in how machine learning can be made explainable, and this set of lectures, delivered by Samira Ebrahimi Kahou, combined a good introduction to the subject with a presentation of recent advances. The first lecture in the series introduced the general field of explainability, following Christoph Molnar’s excellent book, which I have read enough of that I was a bit concerned at first I wouldn’t get much from these lectures; but there was enough extra content to persuade me to come to the rest, which I’m glad of.

The second lecture covered explainability in LLMs, which is a very hard problem because they’re so big. It covered ideas in probe-based and concept-based explanations, and mechanistic interpretability, like the excellent work on sparse autoencoders from Anthropic. I had sinusitis on the last day of lectures, which is a shame because I couldn’t follow the final lecture, on interpretability in reinforcement learning, as well as I wish I had.
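In case the sparse autoencoder idea is unfamiliar, here’s a minimal sketch of the concept: learn an overcomplete set of features from model activations, with an L1 penalty so that only a few features are active at once. This is my own toy version, not Anthropic’s implementation, and the dimensions and data are arbitrary.

```python
# A minimal toy sparse autoencoder for interpretability (mine, not Anthropic's):
# learn an overcomplete dictionary of features from activations, with an L1
# penalty pushing most feature activations to zero.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # mostly-zero feature activations
        return self.decoder(features), features

sae = SparseAutoencoder(d_model=512, d_features=4096)     # arbitrary sizes
optimiser = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_weight = 1e-3

activations = torch.randn(64, 512)   # stand-in for activations captured from an LLM
reconstruction, features = sae(activations)
loss = (reconstruction - activations).pow(2).mean() + l1_weight * features.abs().mean()
loss.backward()
optimiser.step()
```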

Round table

One evening, some of the senior academics held a round-table discussion. This was interesting because there was a mix of industry-partnered academics and some more industry-sceptical ones, and they delivered their strong opinions thoughtfully and respectfully. Something I thought was really interesting was that some of the academics without an industrial partner felt limited in what they could do at the cutting edge without the resources of industry. This came up a couple of times in lectures too: a technique would look promising on the ~7 billion parameter LLMs they could work with, but they were unable to tell whether it would still work at scale. It’s kind of sad that there are a lot of clever things you can do, but unless you have a data centre to work with, it’s all a bit… academic.

Hackathon

The final part of the experience was taking part in a hackathon. I’ve not done a hackathon before¹ and I really enjoyed it!

We were given 7 challenges:

  1. Identifying the Higgs Boson
  2. Classification of particle images
  3. Gravitational lensing
  4. NMR prediction
  5. RenAIssance NLP challenge
  6. Exoplanet discovery
  7. Quantum machine learning

These were really nice problems. I won’t describe all of them here, but you can read the repo for more details.

I met Anton and Matt and we formed a team to do the exoplanet discovery challenge. We were given simulated images² of protoplanetary disks that may or may not contain up to four planets, and had to train a binary classifier to predict whether a planet was present. An example notebook came with a classifier that we were told wouldn’t perform well; when I ran it, it had an ROC AUC of about 0.9, which, coming from biology, seemed pretty good. After just increasing the training epochs from the example and changing the optimiser, I got an AUC of 0.97, which to my naive biologist brain seemed unbeatable. I tried augmenting the data and using a larger model, but couldn’t meaningfully beat this. Luckily, Matt has a lot more image-analysis experience; he did some clever feature engineering and used XGBoost, getting an AUC of 0.997. I have a hunch for why this worked, but it’s very speculative and based on my GCSE-level understanding of physics. Our entry will be judged on a holdout set of simulations and some actual recordings of protoplanetary disks. Wish us luck!
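For anyone who hasn’t worked with these tools, this is roughly what the training-and-scoring loop looks like. It’s a generic illustration with synthetic data, not our actual hackathon code or Matt’s features.

```python
# A generic illustration (synthetic data, not our hackathon code or features) of
# training a gradient-boosted classifier and scoring it with ROC AUC.
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = XGBClassifier(n_estimators=200).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]   # AUC needs scores, not hard labels
print("ROC AUC:", roc_auc_score(y_test, scores))
```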

Footnotes

  1. I was meant to host one, but managed to catch COVID just before and had to remotely help with parts of it, which wasn’t the same at all!

  2. Not really images; it’s much more complicated, but you don’t need to know that.