Transcript

This transcript was generated automatically and may contain errors.

Welcome to The Test Set. Here we talk with some of the brightest thinkers and tinkerers in statistical analysis, scientific computing, and machine learning. Dig into what makes them tick, plus the insights, experiments, and OMG moments that shape the field.

In this episode, we talk with Mike Bostock, creator of the visualization library D3, one of the top three starred GitHub repos through much of the 2010s. He was graphics editor at the New York Times, and he founded Observable, whose Reactive Notebooks handle the whole of his process from end to end. We talk about his journey into visualization, which was heavily influenced by his early time on the Google search quality team, where feature decisions often came down to a single number that was heavily debated, and how notebooks are sort of an attempt to crack open all the computation that goes into visualization.

And of course, we end talking about AI agents and the future of notebooks. I'm so excited to bring this interview to folks. And so with that, Mike Bostock.

All right, Mike, welcome to The Test Set. We're so excited to have you on. So you're Mike Bostock, creator of D3, and you were the graphics editor at the New York Times, and now a founder at Observable, where you build powerful tools for visualizing data through code, UI, and AI. Yeah, thank you so much for coming on.

It's my pleasure. It's great to be here.

And I'm joined by co-hosts, Hadley Wickham, who's chief scientist at Posit, and Isabel, who's just an incredible software engineer at Posit, and graciously agreed to come.

As people can see, especially the people listening via podcast, we're in beautiful San Francisco. If you can't see the video, we're actually on the Golden Gate Bridge, and we're surrounded by birds. It's a beautiful day. So we're so happy to have you.

Origins of D3

Mike, I feel like you've done so much. And as I talk to colleagues about D3, I feel like there's this incredible history of D3. So I'd love to talk a bit about how you got there and built D3. And then I know there's a lot you've talked about on open source and AI and your work with Observable. I was curious if you could just catch us up a bit on some of the history of D3 and your work there.

Sure. I do like to build tools and have been doing it for a while. And I think a lot of my ideas for tools come out of frustrations using existing tools, or in some ways a desire to understand how those existing tools are built or why they were built the way they were built. For D3 specifically, I was doing a lot of work in browsers, in SVG, using JavaScript, using the DOM API. And in particular, the DOM API is very verbose. And if you're doing stuff with SVG, there's this namespace URL that you have to remember to create an SVG element or to set xlink href attribute and stuff like that. And so it was just difficult to remember this specific URL.

So the goal with D3, I mean, and there are other libraries that I worked on that kind of predate that. But I think D3 specifically was focused on kind of interaction and animation and transitions and performance. And I think in many ways, like a lot of its success came out of being at the right place in the right time kind of thing. So it really builds on web standards. So these standards exist already or existed already, SVG and Canvas, and of course, the DOM API, as I mentioned. But in many ways, they were very tedious to use. And so the goal for D3 was to build on those technologies, like let you leverage all of those capabilities, but make it much easier for you to get started, make it much more, you know, performant, I guess, like maintain that performance capabilities, but make it easier for you to use it.

There's also an element to D3 that's about kind of all of the visualization techniques and just kind of packaging those up in a reusable way. So like the tree map, squarified tree map algorithm, for example, like that's fairly tedious to write yourself, like to read the paper and to implement it yourself. And there were some existing implementations of that that predate D3. But what I wanted to do with D3 is to try to think about like, what is the kind of purest encapsulation of that algorithm in a way that's independent of how you display it. So like the layout algorithms in D3 are all like data space. They're just like data in data out. They don't dictate how you display it, like whether you're using SVG or Canvas or WebGL or anything else, or even like React, whatever you want to do. So I want to try to like decompose it into these composable pieces. So that, and I think that helps contribute to its longevity.
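As a rough illustration of the "data in, data out" idea described above, here is a minimal TypeScript sketch. It is not D3's implementation and not the squarified treemap algorithm; `sliceLayout`, `Item`, and `Rect` are hypothetical names made up for this example. The layout only computes rectangles, and a separate step decides how to draw them.

```typescript
// Input and output are plain data; nothing here knows about SVG, Canvas, or WebGL.
interface Item { name: string; value: number; }
interface Rect { name: string; x: number; y: number; width: number; height: number; }

// Hypothetical one-level "slice" layout: each item gets a horizontal slice
// proportional to its value. Far simpler than a squarified treemap.
function sliceLayout(items: Item[], width: number, height: number): Rect[] {
  const total = items.reduce((sum, d) => sum + d.value, 0);
  let x = 0;
  return items.map((d) => {
    const w = total > 0 ? (d.value / total) * width : 0;
    const rect: Rect = { name: d.name, x, y: 0, width: w, height };
    x += w;
    return rect;
  });
}

const rects = sliceLayout(
  [{ name: "a", value: 1 }, { name: "b", value: 3 }],
  400,
  100
);

// One possible renderer: turn the rectangles into SVG markup.
// The same rects could just as well be drawn to a Canvas context instead.
const svgMarkup = rects
  .map((r) => `<rect x="${r.x}" y="${r.y}" width="${r.width}" height="${r.height}"/>`)
  .join("");
```

Because the layout returns plain objects, swapping the renderer does not require touching the layout code, which is the separation the passage above describes.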

I mean, maybe you had the same experience as me. It's like reading these viz papers would be like a cool visualization. And they provide software, but the software does that visualization and like nothing else. And then you're like, well, I'd like to combine it with this other visualization. And there's just like no like connection.

And they're really fun to work on as well. Like the circle packing algorithm, I think has been the most fun one that I worked on. And they have these like fun diagrams that show how they work, how they kind of build out these layouts progressively. And you can kind of work on them and you get these really satisfying animations as it's iterating over the layout. And yeah, like packaging them up so that it's easy for people to reuse that's not tied to like any implementation artifact that's in a paper has all of these kind of somewhat arbitrary choices of what language it's in, or what other kind of parameters or UI that's around it. And so it is fun as a kind of a software engineering puzzle to think about like, what is the most reusable version of this implementation? And in a way like that is kind of the art of open source is like, how do you take a complex problem and kind of pare it down to its essence so that people can then use your solution or use your tool in as many cases as possible.

Intrinsic and extrinsic motivation in open source

You talked about this a little bit in your lessons from like 10 years of open source too, but there's also that like, this is like a hard problem and I've solved it and it feels very satisfying. And then sometimes you're like, well, like, was this actually a problem that needed to be solved? Or like, does this actually like move the needle? Like, how do you like some of the, there's definitely the aspect of like, this is a fun challenge and I've like nailed it and like, oh, I've done something useful. And those are like, sometimes totally.

Sure. Yeah. I mean, I think you want to have both your intrinsic motivation and your extrinsic motivation. So I think it's definitely fun to work on stuff that you just enjoy. Like if there's an interesting puzzle for you, like implementing this algorithm, then great, just do that. And I think a lot of the beauty of open source is that you can work on a lot of these things that you're just intrinsically motivated to work on. And then if other people find it useful, now you have an extrinsic motivation to continue to develop it, right? Because you've got this community, you've got this user base, they're excited about it as well. And they're giving you ideas of what it can do next or ways that you can make it easier to use. So that's good, you know, validation.

Getting into visualization: Google and the search quality eval

And you, I'm really curious in terms of the open source, would you sort of like set the stage? Because I know you did like Protovis, I think before D3, like, what was it like basically? How did you get into viz? What was it like sort of starting with Protovis and open source? What was that experience like?

Sure. Well, getting into viz happened much earlier. So I think the first time I really got interested in visualization in a professional capacity, I was working at Google and I was in the search quality evaluation team. And the evals team is, their job is to kind of take all of these experiments that the search quality team is doing, like these potential changes to ranking or like new signals that can be incorporated and try to empirically assess, you know, what the impact will be, like, is this improving quality or not? And at the time, you know, the main outcome of these evals was a number, you know, a number between one and five. And the idea is like, if your experiment scored like 4.1 or higher, you know, you got to launch. And if it was lower than that, then you had to go back and kind of make some improvements.

And the problem was, you know, as is always the case, like there's human opinions and personalities involved. And every time you're working on some change to ranking or whatever, like there are certain things that you're trying to do and maybe other things that you're not thinking about so much. And I found that a lot of the discussions around whether we should launch something or not could devolve into like debating whether the eval metric was good, like, or whether it didn't kind of sufficiently capture the unique nature of this experiment.

Yeah, yeah. You're just arguing over a number. And it's like, well, you know, the argument was more about the evals being flawed than the nature of the experiment. So what I was interested in is like, how can we surface all of this information that we're collecting about these experiments? So some of it was, you know, kind of A/B, whatever, you can do experiments where you actually launch it and you see what the effect is on real human users.

But we would also do stuff with human raters where we have sort of test sets or query test sets. And we would kind of ask them, like, is this improving the quality? Like, is this a better result for this query? Is this a better result set for this query? And so we had all of this information, and by surfacing that information, partly through visualization, I think we helped the engineers better understand what the impact of their changes was, and then they could spend more time actually debating that, right, and highlight maybe some of the unexpected consequences of their experiments.

The 1998 bar chart made of three GIFs

In 1999, I did a summer internship at Netscape, and I wrote for their developer website. Actually, not even 1999, I think it was like 1998 or something like that. I wrote my first JavaScript visualization library, like pre-Canvas, pre-SVG, pre-everything. It could only render a table element, and it had to use an invisible one-by-one GIF. It actually had three one-by-one GIFs: a transparent one, a blue one, and a red one. And you just did horrible things using just the HTML table element, and it could produce, I think, only a bar chart.

Are you saying it's like a grid? It just made a grid and now it's filling it in with red and blue and transparent?

It was very difficult to do graphics in the browser at that time. But you could kind of approximate exact pixel positioning by using the table element. And so I did terrible things in order to produce these very rudimentary bar charts. But yeah, I think that, you know, I've always had an interest in helping people understand data through visualization, and the technology was not really there at the time, but, you know, 20 years later or something, it's really quite impressive.

I mean, you even described SVG, a little time later, being tricky to use too, but probably less tricky than a table with like three one-by-one images in it.

For sure. And I think, you know, it's not in a sense a flaw in the design. I think it's really helpful to have these kind of formal and very precise APIs that you can then build whatever you want on top of. So, you know, D3 is obviously a very opinionated and specialized way of constructing the DOM. And so it's not in a sense better than the DOM API. It's just much more specialized, focused on how do you transform the DOM to conform to a particular data set? How do you describe transformations to the DOM? And so I think my interest, I guess, is more in design, right? It's like, how do you create software interfaces that are more accessible to humans, that they can understand how to use effectively and efficiently?

The D3 paper and betting on web standards

I mean, maybe one of the biggest lessons from D3 is kind of the value of the ecosystem, like the web standards, like I mentioned, you know, this ability to interoperate and be compatible with and use the tool in conjunction with all of these other tools. So when we wrote the D3 paper, I think in some ways it was controversial, because it was coming after the Protovis paper. And in a way it was rejecting a bunch of the ideas from Protovis. And I think one of the things that was heretical about it was this idea that, you know, you don't need a specialized representation for information visualization. And instead it's better to pick kind of a bog standard, you know, the DOM API, SVG, something that's not specialized to visualization, and just use that even though it's not designed with visualization in mind, because you get so many advantages from interoperating with browser technology, like being able to use your style sheets or being able to use the browser's dev tools, or being able to integrate with React or other DOM APIs.

And so I think, you know, from the research community's perspective, those practical benefits of the technology weren't as obvious. And so it was somewhat heretical to come in there and say, let's not try to build a specialized representation, but just take this thing that exists already and focus on the more practical benefits of how it gets put into practice. Because people never use these tools in isolation. They're always using them with all sorts of other things.

I mean, it's hard to remember now, but JavaScript was kind of like a joke language. It was still a joke language at the time, like it was slow.

You're saying compared to R.

But it's just like, people didn't consider it a real programming language, and it had all these performance issues, but you clearly could sort of see, like, hey, the browser is going to be really important for visualization.

And obviously JavaScript has made tremendous strides in terms of its performance and its capabilities. And it's kind of nothing like it was, you know, 20 years ago, but I think from my side, it was always inevitable that that would happen because, you know, it really is the interface through which all human beings are interacting with their computers, with their displays of data. Like it's really hard to compete with the convenience of delivery through the web, through browsers.

I was just kind of thinking, like, imagine a world where you were like, I love Java applets, I'm going to invest in that. But in some ways that's a little bit like Processing, which was around a similar time, not for visualization, more like a system for programming art that was really all in on Java applets, I think. And, you know, I think that's still thriving today, but it had to make a lot of technological leaps to kind of stay.

For D3, I think there is something inherently beautiful about HTML and the DOM, this idea that it is not an opaque representation of a graphical user interface, but in fact is something that you can inspect and then even manipulate.

So there's all sorts of benefits to the user from that. I mean, like ad blockers is probably the most obvious one, but like screen readers or other accessibility improvements. And like, it just, it is dramatically different than an application that, for example, just draws to like a pixel buffer or whatever, that just is a array of RGB values effectively. And I think that is key to its success.

And do you think it helps that you can go in and inspect, like you can go into a D3, like inspect this DOM and be like, oh, okay, now I understand what's going on?

You know, that's how I got into programming in the first place, like growing up and seeing web pages. I mean, you can't really do it these days, because there's like 35 megabytes, minified.

But in the old days, you know, you could go to a website and you could just like click on the menu and view source and they'd be like, how did you do this? You know? And it would be right there, kind of obvious. You could inspect it. You could learn from it. And so there are some challenges now with the complexity of modern development, but even so like, yeah, the ability to kind of inspect the SVG and kind of see how it's structured, like definitely gives you some, some clues. And then with D3 in particular, I've enjoyed sometimes like it, you bind the data to the elements. So using the dev tools, you can actually see the data structures that people are using in their charts. And that's kind of fun to understand how they work also.

One thing I'm curious to hit on is the popularity of D3. As I looked back, someone mentioned, I think, that D3 was one of the top starred GitHub repos. It was number three at one point, after Bootstrap, and React, I think, was the other one. But yeah, it was very popular.

And I think at one point I've heard that just a tremendous amount of traffic was going to the D3 wiki. So I think one of the things that helped D3 grow as a community was the fact that all the documentation was just a publicly editable wiki on GitHub. So anybody could go in there, and people contributed translations to all different languages, as well as kind of fixing typos and stuff like that. But I think the thing that had the most momentum around it was the gallery. So it started with just some thumbnails of examples that I had made, often ones that I imported from Protovis, but there was this kind of rite of passage in the D3 community where people would learn D3 and maybe the first or second or whatever visualization that they made, they would take a screenshot of it and they would add it to the gallery.

And so very quickly there were like a thousand examples that various people had contributed and you just went there and it was this overwhelming display of all the different cool things that people were building. A lot of them were animated GIFs as well. So you would see things moving around. Unfortunately, once it became popular, you know, spammers started putting links in there, like malware and stuff, and it just became unmanageable. So I had to take that down.

bl.ocks.org

Was that a precursor? Because I know you had Blocks, is that right? Could you explain a little bit about what Blocks is?

Okay, so I was making all these examples to help people learn D3. And very often, when people would ask questions, so we had the Google Groups mailing list, which was the primary way that people would ask questions. I think now it's more on GitHub Discussions and that sort of thing. But people would ask a question, and I'd be like, okay, I'm going to make an example for you that shows you how to do this. And it just kind of became difficult to manage all of those examples. And so I started to build some machinery around making it easier for me to manage those examples. Because I didn't want to keep uploading them to a website. And even just naming, like the folder name, it started to become difficult to come up with a good name for these things when you have like a thousand examples.

So I started using GitHub Gist, which is just a very lightweight Git repo that has a randomly generated giant hex string as its name. And you could put an index.html file in there. But what I needed was just a way to actually render that HTML page. And so Blocks, which is bl.ocks.org, is just a viewer for GitHub Gists. And so you could then have a URL, you give it a Gist ID, and people could go look at that example, and it would show you the source code below. And so that kind of became the primary way that I started sharing examples. And of course, since it was on GitHub Gist, anybody else could start sharing their examples as well.

Documentation, examples, and the feedback loop

Well, that feedback loop is, I think, the most exciting, most valuable, most powerful thing about open source, where you're putting something out there, and then you get to see how people use it. And they're asking you, like, how do I do this? Or how do I do that? Or they're complaining about something or whatever, or they're giving you ideas. And being in that feedback loop, getting all of those new ideas is what really helps you advance the tool, make the tool better, because you understand the problems that people are running into, and the ideas that they have of other stuff that you can add, and that other people can then benefit from once you've incorporated it into the tool. And so documentation is part of that. But I think examples has been another huge part of that. And in many ways, I think people view the D3 examples as kind of more of what D3 is than D3 itself, if that makes sense. I think there was something that I read once that talked about all the different chart types that were supported in D3. And I was like, no, you've got it all wrong. Like there aren't chart types in D3. But what they're thinking of is the examples that you provide. And they can just be copied and adapted with new data that's being put into it or whatever.

And so I think the examples, as I've written before, serve multiple purposes. Sometimes it's kind of giving you example code, like helping you get started with something, showing you how to use something. Sometimes it's inspirational, where you're just showing what's possible, kind of the breadth of possibilities in a particular tool, and kind of getting people excited about what they could be doing with it.

Yeah, you gave a talk about examples. I found that really influential for me, and it guided a lot, I think, of the R package docs. Like, people don't really want to read the docs, right? They just want to scroll, find something that looks like what they want, and modify it.

Yeah. I mean, especially if you're making a visualization tool, you know, you have to show people the visualizations to get them excited about its capabilities. So that's always been front and center in what I've done. I think the interesting question now with AI is whether that demand for examples will still be there in a sense, or whether people will ask the agents essentially to construct the example that they want at the time that they want it. So one of the things that people struggle with, with the examples, is that if there's an example that's exactly what you want, then great. But, you know, the whole idea of D3 is that it can be really expressive and do all sorts of things. So you typically don't want to just take that example and use it off the shelf. Like maybe you like this example, and then there's another technique from another example and another and so on. And you kind of want to stitch them together to build the thing that's unique to you. It can be hard to do that, particularly if you're new to D3 or if you're new to programming, but models and agents are really good at kind of fitting those things together. So it's good in empowering people, but it also, I think, is a little bit worrisome from the open source community perspective, where there's not as much of an incentive for users to come read the documentation, to come look at the examples, or to share their own examples or to ask for help, because they can just kind of ask an agent for those things.

Do you feel like your strategy for documentation has changed with the age of AI?

No, because I'm stubborn maybe, or slow. I mean, I know that there are people that are doing like llms.txt or whatever and trying to make it even more easily consumable by AI. But I think, you know, my attitude is I don't really want to write documentation specifically for agents or for existing models. I mean, for one reason, they're going to change all the time. And so if you overfit to whatever the current models are, when the next model comes out, it may not work as well. And so in that sense, I would want to build something more durable. And so if it works for the models, then great. But I also want it to be understandable by humans and to be trying to teach humans. And it can be a really good forcing function, or at least another kind of feedback mechanism, to figure out how to explain something to somebody. Like how do I articulate when you should use a certain argument or option in Plot or not? You know, you can very quickly evaluate that with the agent, whereas actually doing that with humans is much more expensive or slower, because you have to put it out there and see what they do or do user research and stuff like that.

From D3 to Observable notebooks

So, you know, my interest in building tools in general is like, how do you take these valuable skills, expertise, practices, and make them more broadly accessible. And a big part of that for me has been programming, like most of the tools that I build are programming tools, programming libraries. I love the power of code, the generality of code, kind of these compositional primitives that you can deploy in all sorts of interesting and creative ways. It's not like there are six chart types with whatever 10 options each or something like that. You can really have a lot of flexibility in what you create and not just visualization too. I mean, again, visualization is not just kind of a thing that you do in isolation. It's like, what data are you using? And how are you kind of modeling or transforming that data into kind of an interesting representation that you can then visualize? And then how are you sharing that visualization or distributing it? There's all sorts of other tasks that go alongside visualization.

So I love the power of code, but of course, it is difficult to use. Let's be honest, there's a lot of things that you need to learn. And so I've always been interested in how can I make code more accessible and then how can I make visualization more accessible? So, you know, we were talking about bl.ocks.org earlier as a way for me to share examples, and then other people would share examples as well. But, you know, one of the issues is that it's still very much a local development environment, right? You typically would git clone a GitHub Gist. You'd have a web server running locally. You would use your code editor. You know, those are some barriers to entry in order to just get started.

And so, what I wanted was something that was web-based, where you wouldn't have to install anything. You would just go to the webpage, and not only would you see the example, but you could edit it. You know, you could change the code, you could replace the data, you could do kind of any number of things, because it's an entire development environment that runs in your browser. And furthermore, like a social, collaborative development environment, where you could share what you've built, you could import code from other examples, and stuff like that. And that comes from the open-source mindset, where working out in the public lets people kind of inspire each other, and share techniques, and in general, kind of advances the state of the art much faster than anything that doesn't have that same level of collaboration.

So Observable started as this way of like, how can I make programming more accessible? Like moving it to the web, so you don't need any development environment. But it was also, in a way, like rethinking some of the aspects of programming. So yes, it's a computational notebook, you know, like Jupyter, you have cells, you can run cells, and they can display things, and they can compute values. But I think one of the key differences, one of the key innovations of the Observable notebook was reactivity. This idea, like in a spreadsheet, where you don't have to manually run the cells, that instead, it understands the topological relationships. So if you declare a variable in one cell, and then you reference it in another cell, when you redefine that variable, any referencing cell runs automatically. And it's just like kind of a bookkeeping thing, in a sense, where like, you as the author of your program, don't have to remember to keep rerunning things, or worry about being in this kind of inconsistent state where you've changed the code, but you've forgotten to rerun certain downstream cells, and so like, it's not matching up what you expect.
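To make the spreadsheet-style reactivity described above concrete, here is a minimal TypeScript sketch. It is not Observable's runtime; `TinyRuntime`, `define`, and `value` are hypothetical names invented for this illustration, and it assumes the cell graph has no cycles. Redefining a cell reruns that cell and every cell downstream of it, in topological order, without the author rerunning anything by hand.

```typescript
type Compute = (...inputs: unknown[]) => unknown;
interface Cell { inputs: string[]; compute: Compute; }

// Hypothetical toy runtime: cells are named, and declare which other cells they reference.
class TinyRuntime {
  private cells = new Map<string, Cell>();
  private values = new Map<string, unknown>();

  // Define or redefine a cell, then recompute it and everything that depends on it.
  define(name: string, inputs: string[], compute: Compute): void {
    this.cells.set(name, { inputs, compute });
    this.downstream(name).forEach((n) => this.run(n));
  }

  value(name: string): unknown {
    return this.values.get(name);
  }

  private run(name: string): void {
    const cell = this.cells.get(name)!;
    // A cell whose inputs are not all available yet has no value.
    if (!cell.inputs.every((i) => this.values.has(i))) {
      this.values.delete(name);
      return;
    }
    const args = cell.inputs.map((i) => this.values.get(i));
    this.values.set(name, cell.compute(...args));
  }

  // The starting cell plus all transitive dependents, dependencies first.
  // Reverse depth-first post-order is a topological order for an acyclic graph.
  private downstream(start: string): string[] {
    const order: string[] = [];
    const visited = new Set<string>();
    const visit = (name: string): void => {
      if (visited.has(name)) return;
      visited.add(name);
      this.cells.forEach((cell, other) => {
        if (cell.inputs.indexOf(name) !== -1) visit(other);
      });
      order.push(name);
    };
    visit(start);
    return order.reverse();
  }
}

const runtime = new TinyRuntime();
runtime.define("price", [], () => 10);
runtime.define("quantity", [], () => 3);
runtime.define("total", ["price", "quantity"], (p, q) => (p as number) * (q as number));
runtime.define("label", ["total"], (t) => "Total: " + String(t));

// Redefining "price" reruns "total" and then "label" automatically.
runtime.define("price", [], () => 20);
```

A real notebook runtime also has to deal with asynchronous values, errors, and circular definitions, none of which this sketch attempts.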

So like, spreadsheets, I think, have made that form of programming much more accessible. But you only have these tiny little cells that can produce numbers, and in a sense, I just wanted bigger cells that could produce graphical outputs as well. And the other thing that comes along with this reactivity model is that basically anything can become user controlled, anything can become interactive. And so you can take, you know, a number, like, let's say you're building an inflation calculator or something, and given a certain amount in a certain year, you want to say, like, what is that value in today's dollars? Rather than having to edit the code to change what the number is, just replace that with a slider. And because it's reactive, you don't have to change anything downstream. Any constant variable can be turned into something that's interactive, or turned into something that's animated and stuff like that.

And my experience from D3 was that a lot of that kind of interactivity, like the asynchronous nature of loading data, and kind of handling all of these different states, like that was the hardest part, in a way, about making interactive visualizations. And so if you could handle that at the language layer, or with the runtime, then it's kind of a simpler way of thinking about it.

Are you saying a lot of the problem was even just getting data in, maybe doing some stuff with it before you hit the plot?

I mean, so just as a small example, on the Blocks examples, it's pre-promises, so you would use these callback functions. So you'd do d3.json, or d3.csv, or something, and then there'd be a callback function. And there was always this question of, what goes in the callback function versus what goes outside the callback function? What do you do if you want to load multiple data sets and join them together? Do you nest them together, which is slow? Do you use another library? I wrote another library for that called d3-queue, which is totally obsolete now; with promises, you just say Promise.all. But there are all these kind of mechanical concerns about asynchronous state or handling interaction that I think, you know, are just another barrier for people to produce a good visualization. So I wanted to think about that problem, and make it easier for people to express these reactive or interactive programs.
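The "load multiple data sets and join them" case mentioned above can be sketched with promises roughly as follows. This is an illustration only: `loadPopulation`, `loadArea`, and `loadDensity` are hypothetical stand-ins for network loaders, and they resolve with made-up rows so the sketch runs without a server. The only real API relied on here is the standard `Promise.all`.

```typescript
interface PopulationRow { state: string; population: number; }
interface AreaRow { state: string; squareMiles: number; }

// Hypothetical loaders with invented data, standing in for real network requests.
function loadPopulation(): Promise<PopulationRow[]> {
  return Promise.resolve([
    { state: "A", population: 1000 },
    { state: "B", population: 500 },
  ]);
}

function loadArea(): Promise<AreaRow[]> {
  return Promise.resolve([
    { state: "A", squareMiles: 10 },
    { state: "B", squareMiles: 25 },
  ]);
}

async function loadDensity(): Promise<Map<string, number>> {
  // Both loads start right away and run concurrently, instead of nesting
  // one callback inside the other and waiting for them one after another.
  const [population, area] = await Promise.all([loadPopulation(), loadArea()]);

  // Join the two data sets on the shared "state" key.
  const areaByState = new Map(
    area.map((d): [string, number] => [d.state, d.squareMiles])
  );
  const density = new Map<string, number>();
  population.forEach((d) => {
    const squareMiles = areaByState.get(d.state);
    if (squareMiles !== undefined) density.set(d.state, d.population / squareMiles);
  });
  return density;
}

const densityPromise = loadDensity();
```

Everything that needs both data sets lives after the single `await`, which sidesteps the question of what goes inside versus outside a callback.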

Observable canvases and the return to notebooks

I'm really curious about the new idea of the Observable whiteboards, and how that fits in. Is this a next-generation notebook? Is this replacing dashboards? Where do these whiteboards, which are almost like sticky notes as a Jupyter notebook-y thing, fit in?

So I think, you know, we have Observable canvases, which have kind of been an alternative to notebooks. And the idea is, you know, what happens if instead of essentially a linear layout in a document, you have an infinite canvas, like a 2D layout? I think it's been an interesting experiment for us to think about how we can build more UI versus building more code. Like, there's the substrate difference, I guess, between a canvas and a notebook. But there's also the different components and stuff like that. I think after, you know, we've spent about a year on canvases, we are basically heading back to notebooks at this point, having learned from some of the innovation that we were able to get through canvases.

I think canvases was also an opportunity for us to revisit how the AI works. And having AI being more integrated into kind of your data exploration and your visualization. And so we have some exciting stuff that we've been working on that pulls that back into notebooks as well. But the challenge has been, like, having kind of fractured or disparate tools. And so one of our lessons is, like, we want to build on, like, notebooks as the foundation. So that you can easily move between kind of different ways of working without kind of having disparate applications that don't work well with each other.

It sounds like you learned a lot doing canvas. It sounds like through canvas, you learned a lot about what notebooks can give people, maybe?

Yeah. I mean, I think we're always learning in terms of what's working and what's not working and trying to kind of push that envelope in terms of how do I make this more accessible? So I think with the reactivity, with the web-based development, I think absolutely we've made progress in terms of making kind of interactive visualization more accessible or making web development, let's call it, more accessible as well. But I think, you know, there still clearly is a barrier to entry there. And one of the questions is, like, can you solve that through UI or can you solve it through AI?

And I think a year ago, I mean, these things are changing so quickly, but I think over the last few years, I think we've been doing a lot of work on the UI. And I think we've made progress, but it is very challenging to do. No matter what the UI looks like, it feels like you're sacrificing a lot of the expressiveness of it. And it's very expensive to build UI. And so even if we build a great UI for visualization, what happens if you don't know where your data is, right? Or you want to cross-reference data from different sources? Often there are public datasets that you want to pull in to correlate with whatever proprietary data that you're using. And that's a whole nother technical challenge. If you want to do self-serve analytics or whatever, you want to kind of bring in a broader audience to be able to use these tools. You can't just do the visualization part of it. Like, it's all of the other parts that you have to support.

And so I think we've made progress in the UI, but it's been very expensive to kind of build out enough to really make a difference in terms of the accessibility of it. And I think now with the advancement in AI, it feels like there's a whole new opportunity to kind of go back to a code-based approach. And so in a way, we're very well situated in that these computational notebooks, and the reactivity in particular, are, you know, a more human-accessible form of programming, but it turns out they work exceptionally well for the agents as well, because you can take really complex programming problems and break them down into smaller steps, kind of run the code incrementally, see what happens, inspect the results, and kind of iterate and learn from that. And ultimately the goal is that that code, even if it is produced by an agent, should be understandable, interpretable by a human. Like, we don't want to create these black box solutions where we don't understand what it's doing and therefore can't trust the output of it.

AI agents and runtime inspection

So that's actually the centerpiece of the new AI that we're working on, which we haven't launched kind of to the web yet, but we did soft launch it yesterday to Observable Desktop, which is our like desktop app. But yeah, the key thing is that it does runtime inspection. So when you ask the agent to do something and, you know, it generates a bunch of code or whatever, it doesn't just assume that that code did exactly what it expected. Like it can actually inspect any of the declared top-level variables, as well as anything that was displayed, and it really changes the behavior of the agent.

So, like, one of the examples that I like to do, because it has all this innate knowledge, right? Like, it knows the penguins dataset. But the funny thing is, in the penguins dataset, sometimes the column is called bill_length_mm, and sometimes it's culmen_length_mm, culmen being a more technical name for the ridge of the bill. And the one in Observable uses culmen length. So I like to deliberately mislead the agent, and I ask it to make a chart of penguin bill_length_mm. And so it's like, sure, here you go. And prior to the inspection, it would make an empty chart for you, because that column doesn't exist in the dataset. But it would be like, perfect, you know, here's your chart, and the Adelies are like this or whatever, because it can't see that it screwed up. But now that it can inspect it, it notices that, hey, there's something missing here from this chart. And it then goes and inspects the dataset and says, oh, wait a second, this has different columns than I expected. And then it goes ahead and does the correct chart.

And there's so many other examples like this where, you know, it's working with some dataset and it's just slightly different than it expected. And the fact that it can actually verify that and then correct for it makes it much more robust.
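The kind of check that catches the penguins mix-up can be sketched in a few lines. This is only an illustration of the idea, not Observable's inspection code: `missingFields` is a hypothetical helper, and the two rows are a tiny illustrative sample with column names written in snake_case as an assumption.

```typescript
type Row = Record<string, unknown>;

// Hypothetical check: which of the requested fields do not exist in the data?
function missingFields(rows: Row[], fields: string[]): string[] {
  const columns = new Set<string>();
  for (const row of rows) {
    for (const key of Object.keys(row)) columns.add(key);
  }
  return fields.filter((field) => !columns.has(field));
}

// Tiny illustrative sample; a real dataset would have many more rows and columns.
const penguins: Row[] = [
  { species: "Adelie", culmen_length_mm: 39.1 },
  { species: "Gentoo", culmen_length_mm: 46.1 },
];

// The chart was requested with a column name the data does not contain.
const missing = missingFields(penguins, ["species", "bill_length_mm"]);
console.log(missing);
```

A non-empty result is the signal that the chart is probably empty and that the dataset's actual columns are worth a look before answering.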

Have you seen our BluffBench? Simon Couch and Sara Altman did this for us, where we kind of noticed that the LLMs are pretty lazy at reading the plots. And often they'll just effectively read the axis labels. Like, if you do a plot of fuel economy and then ask it what's going on, it just tells you the expected relationship between engine size and fuel economy without actually looking at the plot. It can be very dangerous because it has that kind of innate understanding. And also I think they tend to be too optimistic, in the sense that either you wrote the code and it assumes you know what you're doing, or it wrote the code and it assumes it knows what it's doing. And so yeah, you do have to kind of nudge it to be a little bit more discerning in looking at that output, because we found that even when it could see the output, if we didn't tell it that it really needed to review it and make sure that it did what it expected, it wouldn't do well.

And then similarly, one of the big challenges with runtime inspection, particularly working with data, is you have large datasets, right? So you have, you know, hundreds of megabytes or even more of datasets you might have loaded into memory. Obviously, you can't fit that entire thing into the context. And even for an SVG, if you have lots and lots of circles or lines or whatever, that can be many kilobytes as well. And so we've written some kind of interesting code to inspect those arbitrary values that hopefully does a good job of giving it a broad overview, deep enough that it can see some examples but not see everything. But you also have to tell it whether or not it's truncated. Because, you know, if it thinks that it might be truncated, it'll just assume everything's fine. It's like, oh, everything's fine, I just can't see anything. So you have to tell it what it's looking at and tell it to be more discerning.
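A minimal sketch of "show a sample, and say explicitly whether it is truncated" might look like the following. The `Summary` shape and the `summarize` function are hypothetical and much simpler than whatever Observable actually does for arbitrary values.

```typescript
// Hypothetical summary shape: a sample plus an explicit truncation flag.
type Summary = {
  total: number;      // how many items the full value has
  shown: unknown[];   // the sample actually placed in the context
  truncated: boolean; // true when `shown` is not the whole value
};

// Keep only the first `limit` items and record whether anything was dropped.
function summarize(values: unknown[], limit: number): Summary {
  return {
    total: values.length,
    shown: values.slice(0, limit),
    truncated: values.length > limit,
  };
}

const big = Array.from({ length: 1000 }, (_, i) => i);
const summary = summarize(big, 3);
console.log(JSON.stringify(summary));
```

Stating `truncated` and `total` outright removes the guesswork: the agent no longer has to decide for itself whether an apparently short value is complete.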

Evals for agents

And have you all had to dig a lot into evals, whether with a capital E or lowercase e? Like, if you change the software or the prompts, how does the agent's behavior change?

Yeah, very much. In a way, that's like a flashback to what I was talking about earlier, working on search quality.

The Google eval.

Yeah. No, for sure, we have these eval suites that we run. I think, you know, they are helpful, but we need a lot more, in a sense. As I said, we just soft launched it yesterday. So with more people using it, I think we'll hopefully get feedback from real users who share their work with us. But I think you're right. You need some way of knowing, when you make a change to your harness or to the prompts or whatever, what the actual effect of that is. And if you don't have the evals, then you're basically going blind. Like, whatever you're testing manually, maybe that works great. But all these other things that people will encounter once you make that change, you're not aware of those.

That's the most frustrating thing, I think, about working on agents: nobody really understands how they work or what you should be doing. You change one thing, and who knows what the effect is going to be. So if you don't have evals, you're really operating blind.

It's really hard to write good evals as well, we've discovered. So, you know, you have these different, I mean, first of all, even if you set the temperature to zero, like it still does change, I mean, you know, like your system prompt includes the current date in it. So that's like one way that things can be different. But who knows, like what the other underlying instabilities are in there. But the challenges with writing the evals are basically like how specific are your assertions and how specific are your prompts? Because when you have failures, you know, your choices are like, do I make my assertions more general? Like basically, do I enumerate more and more of the various things that I've seen that are acceptable? Or do I somehow like try to make them more general? Or do I kind of change the prompt and make the prompt much more specific?
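The trade-off between specific and general assertions can be illustrated with a toy example. The `EvalCase` type, the prompts, and the sample replies below are all hypothetical; they are not taken from Observable's eval suite.

```typescript
// Hypothetical eval case: a prompt plus a check on the agent's reply.
type EvalCase = {
  prompt: string;
  passes: (reply: string) => boolean;
};

// Narrow assertion paired with a very specific prompt: only "3" passes.
const narrow: EvalCase = {
  prompt: "How many penguin species are in the dataset? Answer with a number only.",
  passes: (reply) => reply.trim() === "3",
};

// Broader assertion paired with a more natural prompt:
// any reply that mentions 3 or "three" passes.
const broad: EvalCase = {
  prompt: "How many penguin species are in the dataset?",
  passes: (reply) => /\b(3|three)\b/i.test(reply),
};

// Two made-up replies that are both correct but phrased differently.
const replies = [
  "3",
  "There are three species: Adelie, Chinstrap, and Gentoo.",
];

const narrowResults = replies.map((reply) => narrow.passes(reply));
const broadResults = replies.map((reply) => broad.passes(reply));
console.log(narrowResults, broadResults);
```

The narrow case rejects the second, perfectly good reply unless the prompt forces the format. The broad case accepts both, but it would also accept a wrong reply that merely contains the number 3.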

So one of the examples that we run into is that the agent just happily keeps going, you know, it just keeps showing you more and more things. If you ask it too open-ended a question, it can easily run for, you know, a few minutes or whatever, generating different views and stuff like that. And so if you want to have an eval that runs efficiently and has, like, a 30-second timeout or something like that, you do have to be a little bit more specific to tell the agent to kind of wrap it up and be concise in its response. But of course, the danger is that if you are too specific in your prompts, then they're no longer representative of what users will do. And so again, having that external feedback, having real users using it, is essential to know whether your evals are really representative of actual usage or not.

Playing safely with fire: code as a source of truth

Well, the main thing is really about interpretability, verifiability, and trust. You know, I think certainly, the models, the coding agents, are somewhat miraculous or magic in what they can do, but are also clearly capable of misleading you if not lying to you. They can be sycophantic. They can assume things are working when they're not or just make stuff up and all that sort of stuff. So I think while they are very exciting as a technology, I think it's easy to be misled by them as well. And certainly there's a lot of hype around it as well.

And so I think our take is coming back to that, you know, how do we make it a more human medium for computation, for programming? And so if we're incorporating AI into that, it has to be about how does a human understand what the agent is doing and how do we make it more verifiable? And in a way, a lot of what we're doing now, I think bringing the agents into notebooks and kind of having this code-first way of working, is trying to, in a sense, shift some of the agent thinking into code, because that is a more formal specification, a more verifiable specification. So I have this kind of saying where the agent can lie to you with text, but it can't lie to you with code. In the sense of, you know, if you ask it something, you're like, what are my top customers or something like that? And it just gives you a markdown bullet list. You know, who knows if that's right? You need to cross-reference it with an actual query in order to know whether that's correct. If it gives you code, you have the problem of whether or not the code is relevant. But assuming that the code runs and is relevant to the question, you can generally trust that the query will behave deterministically, kind of give you the results that you expect.

And so in that sense, we're trying to get the agent to shift more of its thinking to code. Hopefully that code is still interpretable, but that way it's more verifiable and trustworthy and reproducible.

It's interesting, you know, because there was this whole no-code, low-code movement before, and I think we were on the periphery of that, working on kind of UI cells and notebooks and stuff like that. And I think in a way it's liberating to come back to code as this sure foundation, this reliable thing, this powerful, expressive thing. And now we're much more specifically focused on, okay, we're not hiding the code or taking away the code or abstracting the code, but how can we help people learn how to do that code? And I think there is a difference between writing code and reading code, in terms of the challenge, the barrier to entry there. And so I think, you know, even if you can't write code, if we can teach you to read code and review code, that broadens the audience of who can do this.

I think that's the most exciting aspect of it, because my hope is that people don't just purely rely on the agent to do all of the work. Like, if it helps them get over that initial hurdle, then they're like, oh, if I learn a little bit more about how this is working, then I can get more hands-on here. And it really, you know, helps you kind of scale up your expertise. I do think the other thing that I just find really cool is that now you can interact with an agent in any human language, and it speaks back to you in your language. Like, we never would have been able to translate all of the documentation and all of the UI into all of these different languages. But now, you know, the translation's not perfect, but again, it's just tremendously enabling for people.

Designing libraries for agents: Plot vs D3

Yeah, I think that's a really, really interesting question about how we design the agents and how we kind of guide them towards certain types of code or certain libraries. So most top of mind for us is, when you're asking the agent to do a visualization, you know, is it going to use D3 or is it going to use Plot? In general, we very much try to bias it towards Plot because Plot is a higher-level abstraction, like a grammar of graphics. You know, there are certain things that you can't really do in Plot, at least not yet. Like, there are no treemaps in Plot, for example. And so there are certain things where you do need to use D3. But in general, we try to bias it towards Plot because if it is able to use Plot, it's much more likely to produce a good output: a correct, non-buggy, but also just better-designed output. Like, the charts look better, they have better tooltips that are built in, they kind of make better choices by default, and all that sort of thing.

So I think the design of these abstractions, the design of those libraries, is still important. It's not like it can't use D3 to do all of the things that you can do in Plot. It's just much harder to get it to produce as good an output with D3 as it is with Plot. There's also a really good feedback loop in place where you ask it to do something and it makes a mistake. And then you're like, my choice is I can either change the prompt to teach the agent to do something differently, or I can go back to Plot and add a new feature to Plot, or add some better defaults or better warnings or something like that. And it's fun to have that instantaneous feedback where you change the feature, you change the library, and you can immediately test it against the agent with all these different evals. Yeah, it's another feedback loop to stress test the design of your interfaces.

Are the agents pretty good at sticking with Plot, or where does the path of desire go for agents? Do they want Plot or D3?

I mean, I don't know how much of it is the prompt that we've given it. It definitely sticks to Plot pretty reliably. And I think a big part of that is just the whole Observable community and all the public notebooks that have kind of fed into these models. Like, it really does have a pretty good understanding of both how D3 works and how Plot works.