Richard Decal

BIO

Richard is a senior ML scientist at Dendra. There, he trains large computer vision models and oversees curation of one of the largest (if not the largest) labeled ecology datasets.

TITLE

Data Curation: Transforming Bytes into AI Gold

ABSTRACT

Everyone obsesses over big data and big models. However, AI systems are garbage in, garbage out. Data is the new oil, but that digital crude needs to be refined before it is useful. Behind every successful AI product, there’s a meticulously curated dataset.

In this talk, you’ll learn about the often-overlooked science of transforming data into AI-ready gold. We’ll explore the data-centric perspective on production computer vision systems: how models reflect the biases in their training data, how to deal with data drift, and how to detect and rectify the root causes of model confusion. These practices are essential for taking an AI system from good to great.

INTERVIEW

What inspired you to delve into AI and pursue a career in this field?

It all started when I took a MOOC on computational neuroscience. It blew my mind, and long story short, I ended up working as a researcher at a comp neuro lab using ML for biophysics research. Some of my colleagues were working on translational research (creating artificial neural network architectures inspired by biological neural networks), as well as using artificial neural networks to study biological neural networks. I started gravitating towards that work, and at the time I thought it was so space-age.

However, I then started talking to some folks from Google, and it made me realize that it was still early days, and that we were actually at the lip of an oncoming tidal wave of disruption.

At the same time, there was a lot of activism in Seattle, and it made me feel like I should get out of the ivory tower and make a difference in the world. But I felt that most of the activism I was seeing was ineffective, and it was important to me not just to work on a good cause, but also to maximize my impact. So I started thinking about how I could leverage my talent and interests to "punch above my weight". I figured that ML could be that force multiplier.

I then started thinking about the causes I thought were the most important to be working on, and started Googling “machine learning for climate change”. That’s how I discovered a start-up using ML for ecosystem restoration. I filed that away as a potential dream job, not knowing that a few years later I would be hired at Dendra as their first ML engineer.

Can you share a real-world example of how your AI solution has made a significant impact in your industry?

Yes, there are many. Probably the most high-profile example is Tesla. They have a world-class data engine which they use to systematically improve their models by curating their data. But pretty much all companies that have ML models deployed in open-world scenarios need to invest heavily in data curation, for example teams training language models or general-purpose robots.

What do you see as the most significant challenges in deploying AI in production, and how do you suggest overcoming them?

As usual, the answer is “it depends”. At the risk of being repetitive, if you are deploying a model into the open world, where it has to be robust to many different scenarios, I’d say one of the top challenges is data curation. I think this is something that is often overlooked in academia, where the datasets are usually static and the focus is on improving the model architectures. In the real world, it’s usually the opposite: you spend most of your time improving your data and only ~5% improving models.

Unfortunately, there isn't a single recipe for data curation that fits all domains, but in my conference talk I will try to describe a few general principles.
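To make one of those principles concrete, here is a minimal sketch, with a purely hypothetical model output and file names, of a common curation step: surfacing the examples the current model is least confident about so they can be prioritized for review and re-labeling.

import numpy as np

def mine_uncertain_examples(probs, ids, k=100):
    """Return the k examples whose top-class probability is lowest."""
    confidence = probs.max(axis=1)   # the model's confidence in its own prediction
    order = np.argsort(confidence)   # least confident first
    return [(ids[i], float(confidence[i])) for i in order[:k]]

# Toy usage: five unlabeled images scored by a 3-class model (fake softmax outputs).
rng = np.random.default_rng(0)
scores = rng.dirichlet(alpha=[1.0, 1.0, 1.0], size=5)
frames = [f"frame_{i:04d}.jpg" for i in range(5)]
for frame, conf in mine_uncertain_examples(scores, frames, k=3):
    print(f"queue {frame} for labeling (confidence={conf:.2f})")

In practice a loop like this is typically rerun after every retraining round, which is what turns curation into an ongoing process rather than a one-off cleanup.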

How do you stay updated with the rapid advancements in AI technology, and what resources would you recommend to others in the field?

I like Semantic Scholar. It lets you group research papers into topics, and then it’ll give you digests for each of your topics when similar papers get published. It’s great for keeping up with specific niches.

I know it’s fashionable to dunk on Twitter these days, but ML researcher Twitter is great and keeps me updated on wider developments in the field.

But if I am working on specific problems, nothing beats opening up dozens of research papers, scouring their references, and diving into the rabbit holes. It’s often the case that there is specialized jargon for the problem you're trying to solve, and if you don't use those exact keywords in your initial search, you won’t ever find the relevant papers.

What future trends in AI are you most excited about, and how do you believe they will shape the industry?

Some systems are so complex that no human mind is able to fully comprehend them. ML models can often achieve superhuman pattern recognition when there are many variables or where the solution space is astronomical.

One area where this is being exploited is in accelerating research. For example, nuclear fusion reactors are extremely complicated instruments and plasma dynamics are hard to predict, so nuclear fusion start-ups are using ML to propose experiments. In a similar vein, there are start-ups working on reversing aging using genetic therapy. They are making breakthroughs using ML models to propose gene targets.

Another trend is people applying ML at the intersection with a new domain. There are so many use cases that are ripe for disruption, and it often takes an outsider to make the connections. I think there is still plenty of low-hanging fruit here.

What advice would you give to product teams and developers just starting with AI projects?

1. Build in vertical slices. Start with a simple baseline model and create a scrappy MVP first. Often, this can capture most of the addressable value of your product and you can start validating it and gathering feedback before you invest in it too heavily.

2. Often, people think that the end goal is to ship an ML product and then move on. In reality, deploying is often just the beginning. ML systems have an ongoing lifecycle, and teams get very disappointed when a system that works well at first starts degrading in performance after a few months.

3. Most of the time, you don’t need AI. In fact, you should avoid it if possible. Often, people run straight to things like RAG when they would get better results using a classical approach instead (see the sketch below).
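As an illustration of point 3, here is a hedged sketch of one classical baseline worth trying before reaching for RAG: plain TF-IDF retrieval with scikit-learn. The documents and query are made up for the example.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A handful of made-up support articles standing in for a real knowledge base.
docs = [
    "How to reset your password from the account settings page.",
    "Troubleshooting steps for devices that will not connect to Wi-Fi.",
    "Our refund policy and how to request a refund within 30 days.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)

def search(query, top_k=1):
    """Rank documents by cosine similarity to the query in TF-IDF space."""
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, doc_vectors)[0]
    ranked = scores.argsort()[::-1][:top_k]
    return [(docs[i], float(scores[i])) for i in ranked]

print(search("how do I reset my password"))

If a simple baseline like this already answers most queries, you have a yardstick for deciding whether an LLM-based pipeline actually earns its extra cost and complexity.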