Jan 8, 2024

Day.ai's head of data Gwen Reynolds embraces the chaos of AI

"You don’t really know the output you’re getting or if you’re going to be able to reproduce it."

Tell us about yourself. What are you working on right now?

I’m Gwen Reynolds, and I’m head of data at Day.ai, a startup providing a twist on the idea of a meeting assistant. We’ve got basic meeting assistant functions, but we can also use your meeting information to give you a lot of other fun things that improve productivity.

We’ve been around since June 2023—we’re only just getting started. I’m primarily responsible for identifying the large language models (LLMs) we use, determining what data to put into them, writing prompts, then making sure the outputs make sense and are useful to our customers.

How do you use AI for work?

We do so much. The classic applications, like picking out and summarizing different points from a meeting and listing follow-up action items, work very well. There are other things you have to force the LLM into doing, like classification—looking at a chunk of text, asking, “Is this a sales call or an interview?” and then spitting out an answer.

It’s a change for me because, before Day.ai, I was a data scientist doing statistics. I’d never worked with an LLM, and had really only touched on natural language processing, so this is my first time dealing with lots of text instead of more traditional AI predictions based on statistical processes. While classification isn’t a traditional use case for an LLM, it’s a very good one for those of us from data analytics who need keywords and other information picked out for feeding into a secondary model.
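To make the classification idea concrete, here is a minimal Python sketch of forcing a free-text model reply into a fixed label that a secondary model could consume. It is an illustration only, not Day.ai's code: the label set, the prompt wording, and the helper names `build_classification_prompt` and `parse_label` are all assumptions, and the call to an actual LLM is left out.

```python
from typing import Optional, Sequence

# Hypothetical label set, taken from the example question in the interview.
LABELS = ("sales call", "interview")


def build_classification_prompt(text: str, labels: Sequence[str] = LABELS) -> str:
    """Ask for exactly one label so the reply is easy to parse."""
    options = ", ".join(labels)
    return (
        f"Classify the text below as exactly one of: {options}.\n"
        "Reply with the label only.\n\n"
        f"Text:\n{text}"
    )


def parse_label(reply: str, labels: Sequence[str] = LABELS) -> Optional[str]:
    """Normalize the model's reply; return None if it is not a known label."""
    cleaned = reply.strip().strip(".\"'").lower()
    for label in labels:
        if cleaned == label:
            return label
    return None  # Off-script reply: retry or flag for review.
```

Returning `None` for anything outside the label set is a deliberate choice here: it keeps unexpected replies from silently flowing into a downstream model.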

What was your introduction to using an LLM?

I have a big group of friends from grad school, and since the pandemic we’ve had a long, ongoing group chat. One day I was talking to Christopher O’Donnell [formerly of HubSpot, and Day.ai’s founder], who said he wanted to do something with generative AI. I was curious too, so I went online and found a bunch of Python tutorials, downloaded two years of my friends’ messages, indexed them, and created a chatbot.

We asked it about ourselves—like, “What does Gwen think of Christmas?” or “What is Gwen like talking with her friends?”—and the answers were wild. They were so spot-on, reminding us of things we hadn’t talked about for years. That was the moment I knew LLMs had grabbed my attention and weren’t letting go.

I showed Christopher my chatbot and he said, “We’re in business!” [I joined Day.ai and] we dove in together. What’s cool about this technology is it’s moving so fast that there isn’t a lot of gatekeeping. You need to follow LlamaIndex and LangChain, and find posts on Medium or LinkedIn; I’ve tried Stack Exchange, though, and [the community knowledge] isn’t there. Play and explore! As part of my learning journey I’ve probably written 10,000 lines of code that ended up never being used.

What surprised you most about LLMs?

The craziest thing I had to get used to was that they’re nondeterministic. I’m obsessed with patterns, and I love data and numbers more than words—a realization I had in grad school that made me switch my degree from divinity to statistics—and LLMs take me back into the world of words in a fun way that’s still science- and research-oriented. But it’s still nondeterministic, and you don’t really know the output you’re getting or if you’re going to be able to reproduce it.

It drove me crazy at first. Christopher told me, “There’s a playfulness in this that you have to learn to enjoy; you have to embrace the chaos,” and accepting that helped me let go of the rigidity of statistics. It’s a little chaotic, and that’s the fun part.

How do you choose the right LLM for your use case?

When I first started summarizing calls, I’d give a transcript to ChatGPT-3.5 or Claude 2 and they would only summarize certain portions, focusing on just the beginning, for example. LLMs have a limited ability to utilize all of their local memory [the context window], so I switched to recursive summarization, where you break things up into smaller chunks. That was a big learning curve. Even if a transcript technically fits within the context window, an LLM can still forget parts of it.
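As a rough illustration of recursive summarization, the sketch below splits a transcript into word-based chunks, summarizes each chunk, and then summarizes the combined summaries until the text fits in one pass. Everything here is an assumption made for the example: the chunk size, the word-based splitting, and the `summarize` callable, which stands in for a real LLM call. The `keep_first_words` stub exists only so the sketch runs without a model.

```python
from typing import Callable, List


def chunk_text(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def recursive_summarize(
    text: str, summarize: Callable[[str], str], max_words: int = 200
) -> str:
    """Summarize chunk by chunk, then summarize the summaries."""
    words = text.split()
    if len(words) <= max_words:
        return summarize(text)  # Small enough for a single pass.
    partial = [summarize(chunk) for chunk in chunk_text(text, max_words)]
    combined = " ".join(partial)
    if len(combined.split()) >= len(words):
        return combined  # Summaries are not shrinking; stop instead of looping.
    return recursive_summarize(combined, summarize, max_words)


def keep_first_words(text: str, n: int = 5) -> str:
    """Stand-in for an LLM summarizer: keeps only the first n words."""
    return " ".join(text.split()[:n])
```

In practice `summarize` would wrap a prompt and a model call; the non-shrinking check is there because a real model is not guaranteed to return something shorter than its input.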

I like to play with different models, doing what I call a bake-off—get an output from both, then decide which is better. What defines “better”? You have to look at their output side-by-side, in different use cases, and then grade them from best to worst.

These are unlike many of the use cases I’ve dealt with in the past. I was on the Trust and Safety team at HubSpot, and we had to identify and block phishing attempts. Humans can, for sure, spot the difference between a newsletter and a phishing email, and so can many models, but with an LLM there’s more variation in its assessments. I can give very clear-cut, objective criteria, but there are [also] many subjective ways of determining if something is good or not, and that subjectivity is where a lot of people get caught up when evaluating LLMs. They have a hard time saying, subjectively, that something is “better.”

The key is that you have to get a lot of different people to grade your LLM, and especially the person closest to the problem. I might say, “These look the same to me,” but maybe my user or customer would say, “Actually, model two is much better for my use case,” and they might be the only person who can articulate why.
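One simple way to combine several people's grades in a bake-off is to average each model's rank across graders. The Python sketch below assumes every grader submits a full best-to-worst ordering; the grader and model names, and the choice of mean rank as the score, are illustrative assumptions rather than the method described in the interview.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def mean_ranks(rankings: Dict[str, List[str]]) -> Dict[str, float]:
    """rankings maps each grader to model names ordered best to worst."""
    positions = defaultdict(list)
    for order in rankings.values():
        for rank, model in enumerate(order, start=1):
            positions[model].append(rank)
    return {model: mean(ranks) for model, ranks in positions.items()}


def bake_off_winner(rankings: Dict[str, List[str]]) -> str:
    """Return the model with the lowest (best) average rank."""
    averages = mean_ranks(rankings)
    # On an exact tie, min() keeps whichever model was seen first.
    return min(averages, key=averages.get)


# Hypothetical grades: the customer and the end user disagree with the data lead.
example_rankings = {
    "data_lead": ["model_a", "model_b"],
    "customer": ["model_b", "model_a"],
    "sales_user": ["model_b", "model_a"],
}
```

A plain average treats every grader equally. If the person closest to the problem should count for more, a weighted average would be a natural variation.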

Do you also have a framework for determining if an output is missing information?

As much as you can make an objective framework and grade it yourself like an early-stage data analyst or engineer, as soon as you move into subjectivity you need to ensure that you have the right people on deck. That said, there’s still a lot of objective stuff you can check, like completeness of information, complexity of sentence structure, and so on.
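The objective checks mentioned here can be approximated with very simple code. The Python sketch below measures completeness as the share of required terms that appear in an output, and uses average sentence length as a crude proxy for sentence complexity. Both metrics, the term list, and the function names are assumptions chosen for illustration; a real evaluation would likely need something richer.

```python
import re
from typing import Sequence


def completeness(output: str, required_terms: Sequence[str]) -> float:
    """Fraction of required terms found in the output (case-insensitive)."""
    if not required_terms:
        return 1.0
    lowered = output.lower()
    found = sum(1 for term in required_terms if term.lower() in lowered)
    return found / len(required_terms)


def avg_sentence_length(text: str) -> float:
    """Average words per sentence, splitting on ., ! and ?."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)
```

For example, `completeness("Alice will send the pricing deck by Friday.", ["pricing", "Friday", "contract"])` scores two of three terms, which flags the missing mention of the contract.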

Once you’ve got the LLM for your use case, how do you craft your prompts?

Like I said, I’m not the best person with words, so learning to write prompts was daunting at first. I just wanted an equation or code, something that I could really sink my teeth into as being “right.” But a prompt is more like a LinkedIn post or research paper. The more objective your prompt, or the more specific your definitions, the better off you’ll be.

I used to start by writing down every single thing I was thinking of telling the LLM, and I still do that to an extent, since you always have to start somewhere—but you have to refine. Your actual prompt should be as succinct as possible. The most important things, that the LLM can’t be allowed to forget, go right at the end. But the prompt is still likely to be longer than you expect.

If it forgets something, maybe you could use two prompts instead. Don’t ask it to do ten different things at once: be discrete, and ask for something describable. In many ways LLMs work like human brains, and if you give them a lot of leeway they’ll go off and imagine something different to what you want.

That’s why the very last thing I write in any prompt is: “Do not make anything up. Only use the context given to you.”
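The prompt-writing advice above (one describable task, the must-not-forget items at the end, and the guardrail sentence last) could be captured in a small template function. This Python sketch is one possible layout, not Day.ai's actual prompt format: the section labels and the `build_prompt` helper are assumptions, and only the final guardrail sentence is quoted from the interview.

```python
from typing import Sequence

# Quoted from the interview; always placed last.
GUARDRAIL = "Do not make anything up. Only use the context given to you."


def build_prompt(task: str, context: str, must_not_forget: Sequence[str] = ()) -> str:
    """Assemble a prompt with one task and the critical items at the end."""
    parts = [f"Context:\n{context.strip()}", f"Task: {task.strip()}"]
    if must_not_forget:
        reminders = "\n".join(f"- {item}" for item in must_not_forget)
        parts.append(f"Important:\n{reminders}")
    parts.append(GUARDRAIL)
    return "\n\n".join(parts)


# Hypothetical usage with a tiny transcript snippet.
example_prompt = build_prompt(
    task="List the follow-up action items from this meeting.",
    context="Alice: I will send the pricing deck by Friday.",
    must_not_forget=["Include the owner of each action item."],
)
```

Keeping each prompt to a single task makes it easy to split work across two prompts when one of them starts dropping instructions.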

During this process, what tools do you use when refining your prompts?

Right now, I do most of my first-level research in Python notebooks, where I keep prompts for many different use case contexts that I run through and update as I go.

We’re looking to bring on more robust tools. We’ve played with Braintrust, and we’re also working with another company that is building some evaluation tools, but right now we’re in early stages. It’s still Python notebook-based research where we can compare our outputs.

What was your most important insight when building your framework for prompts?

Honestly, it was realizing that an LLM works a lot like a human, but without context. Most of the LLMs that exist today, used for products, were trained on an internet where human beings talk to each other. But if I were to give a random person on the internet a task, what would happen?

It’s a computer that’s like a human, specifically a very intelligent—yet forgetful—toddler. When you think of LLMs like that, you become a lot more forgiving of their quirks and figure out how to work around them.

Where are you most excited for this technology to go?

It’s evolving in ways that even researchers are surprised by. I think agent-trained models, given agency, are the most exciting. That’s getting closer and closer to human intelligence.

It’s going to make LLMs so much more useful, and may crack the barrier for doing a lot more complex things.

Do you have a hot take on generative AI?

That LLMs are not conscious intelligence—yet—but they are going in that direction, and it’s not something to fear. It’s exciting!

I was deep in a Reddit AI engineering subthread and a woman told me that she enjoys “playing mermaids” with ChatGPT. I couldn’t get that out of my head. This woman was using ChatGPT to play an imaginative game, and it gave her such joy. That’s part of the future—finding unique human ways to enjoy the interaction. I’m excited about that too.

Sign up for more AI at Work

An occasional newsletter showcasing the latest conversations with leaders, builders, and operators who use generative AI to power their work.
