
@simonw
Created June 17, 2025 21:18

2025-06-17T18:21:00 conversation: 01jxzj16mt96mam6gegjzg7pqk id: 01jxzhyzqxfvqwbxjn9xfya01k

Model: gemini/gemini-2.5-flash

Prompt

Full transcript with timestamps

Schema

{
  "type": "object",
  "properties": {
    "items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "timestamp": {
            "type": "string",
            "description": "mm:ss"
          },
          "speaker": {
            "type": "string",
            "description": "best guess at name"
          },
          "text": {
            "type": "string"
          }
        },
        "required": [
          "timestamp",
          "speaker",
          "text"
        ]
      }
    }
  },
  "required": [
    "items"
  ]
}

Attachments

  1. audio/mpeg: /private/tmp/gemini-2.5_smaller.m4a
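A run like this can most likely be reproduced with the llm Python API (the log above was produced by the llm CLI). The snippet below is a minimal sketch, not the exact command used: it assumes the llm-gemini plugin is installed and that the schema shown above has been saved locally as schema.json; the model id, prompt text, and audio path are taken verbatim from the log.

import json
import llm

# Load the JSON schema shown above (assumed saved as schema.json)
with open("schema.json") as f:
    schema = json.load(f)

# Model id as recorded in the log
model = llm.get_model("gemini/gemini-2.5-flash")

# Prompt text and audio attachment from the log above
response = model.prompt(
    "Full transcript with timestamps",
    attachments=[llm.Attachment(path="/private/tmp/gemini-2.5_smaller.m4a")],
    schema=schema,
)

# Prints JSON conforming to the schema, like the Response section below
print(response.text())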

Response

{
  "items": [
    {
      "speaker": "Logan Kilpatrick",
      "text": "Hey everyone, how's it going?",
      "timestamp": "01:01"
    },
    {
      "speaker": "Logan Kilpatrick",
      "text": "We will get started in a couple of minutes after all of the awkwardness of starting a X space. So, hang in there for one or two more minutes.",
      "timestamp": "01:07"
    },
    {
      "speaker": "Logan Kilpatrick",
      "text": "Hey Zack, hey Tulsi, hey Melvin, hey Anka.",
      "timestamp": "01:21"
    },
    {
      "speaker": "Speaker",
      "text": "Hello. Hello.",
      "timestamp": "01:26"
    },
    {
      "speaker": "Speaker",
      "text": "Yay, hello, hello.",
      "timestamp": "01:29"
    },
    {
      "speaker": "Logan Kilpatrick",
      "text": "Awesome. I think we have everyone. Are we Google AI host account, are we good to get started?",
      "timestamp": "01:38"
    },
    {
      "speaker": "Logan Kilpatrick",
      "text": "I see there's three more minutes. Cool. We will hang tight for just another minute or two.",
      "timestamp": "01:50"
    },
    {
      "speaker": "Logan Kilpatrick",
      "text": "We need lobby music in this, uh, in this space.",
      "timestamp": "02:03"
    },
    {
      "speaker": "Logan Kilpatrick",
      "text": "Okay, I think we are going to get started. Um, we have lots of awesome announcements, uh, that have just been rolling out. We've got lots of awesome questions from folks who, who sent in a bunch of questions, uh, so excited to dive in all of this. Uh, my name is Logan Copatrick, uh, excited to have this conversation. I do developer product stuff. Um, we're joined by an amazing set of folks, um, who I'm excited to sort of get them to share their perspective. Um, Tulsi Doshi, who's our head of product for Gemini models, um, and the co-conspirator of lots of these launches. Uh, hopefully you've seen a bunch of, uh, her incredible tweets. Um, so we'll, we'll make that happen. Um, Anka is our senior director for AI Safety and Alignment, um, and actually also one of the post-training co-leads. Um, so Anka, I'm excited to, to get your perspective today. Melvin Johnson, um, who is a distinguished software engineer, um, another post-training person on our team, who's driving, uh, some of the cross-Google initiatives there. So, excited to have Melvin. Um, and then Zach is our product lead for Gemini pre-training and embeddings, um, and has done a bunch of the small model launches, um, and obviously one of the new models for today's announcement was around flashlight. So, excited to talk about that. Our smallest, uh, our smallest model and smallest reasoning model. Um, so, this is exciting to have the conversation. Maybe, Tulsi, you can actually kick us off with just a high level of some of the model announcements. We've obviously had a bunch of stuff with 2.5 over the last few months, but today feels like the culmination of, of a lot of that stuff coming together.",
      "timestamp": "03:12"
    },
    {
      "speaker": "Tulsi Doshi",
      "text": "Yeah, first of all, hey everyone. Uh, we're super excited to be here. Uh, it's a, it's a cool day for us because I think this is really today's set of launches is 2.5 as a family kind of taking the next step. It's us actually having uh, 2.5 Pro and Flash be stable production ready models that we will be supporting for a long time. Um, so this is really us taking the set of previews that we've shipped and gotten your feedback on and iterated on and bringing them now into these stable 2.5 Pro and 2.5 Flash launches. And both models are awesome. We can talk more about what makes them uh, so great, but we're really excited for you to keep building building on the 2.5 family. And then as Logan said, we've released also Flashlight. Uh, so now the 2.5 family has three model sizes. There's Pro, Flash, and Flashlight. And the way you can think about them is Pro is really the model that is um, just amazing performance, right? So if you're trying to get the best quality, especially for code or complex prompts, uh, Pro is your model. Flash is this kind of good workhorse model. It is great cost for quality, um, but is also just an extremely strong reasoning model. Uh, and is also has good latency to be able to support real-time use cases. So especially if you care about that kind of real-time latency, live performance, things like that. And then Flashlight, which we're introducing today, the 2.5 Flashlight model is really optimized to be our fastest and cheapest model. Um, and so it really gives that kind of opportunity, especially if you're caring about, um, high latency, uh, or tasks that require, sorry, very low latency, so like high bandwidth, um, but also tasks where you need a lot of throughput and need to actually worry about the cost kind of at scale. Uh, Flashlight is a great model for a lot of those tasks. Uh, and we've seen and been having a lot of fun with customers over the last few weeks testing this model and trying to figure out what are the use cases it can, it can do really well for. So yeah, we're really excited. I think this is gonna be frankly like the best family of models we've had and maybe the world.",
      "timestamp": "04:49"
    },
    {
      "speaker": "Logan Kilpatrick",
      "text": "Yeah, I feel like the best part of this Tulsi is there's no more preview and at least two of our model names. Uh, so hopefully it'll make the lives of people easier as they think about which model to use. Um, and I I think that that actually takes me naturally to this question. I saw a bunch of the replies to this thread and others were just around like, how should people think about, um, what model they should be using? Especially as it feels like there's this capability shift happening, um, where like Flash has historically, Flash was this small workhorse model and I think now with reasoning capabilities on the 2.5 Flash, it's actually doing, it feels like a lot more historically, um, than it was in the past. So I'm I'm curious how you how you think about that and if others have thoughts, uh, please feel free to jump in as well.",
      "timestamp": "06:55"
    },
    {
      "speaker": "Tulsi Doshi",
      "text": "Yeah, I mean actually maybe I'll, I'll push this to Melvin. Curious your take on it since you've been thinking about kind of this full series of models and continuing to push the quality of, of Flash, uh, and continuing to make it better. I'm curious like how you think about the continuous improvements of this model and the fact that they're kind of jumping in performance every time we do one of these, you know, revs on our side.",
      "timestamp": "07:40"
    },
    {
      "speaker": "Melvin Johnson",
      "text": "Yeah, I think uh this is something we've like the 2.5 family we've had a lot of fun with it uh from the post training side. So it was a big jump from pre-training coming in and we generally like to do this sort of cycle where we first put out the pro, then we use the pro to further improve the flash and now that the flash gets really good, the gap between the flash and the pro reduces. So we want to push the pro further. So we've done that cycle. Over time, uh, we've leaned in on this Pareto frontier graph that you will see that we're publishing a lot. We really want to be at the frontier of the Pareto when it comes to cost and uh performance. And for flash, we want, since this is the workhorse model, we want this to be most optimal in terms of that trade-off. And for pro, we want to optimize for higher quality at the expense of cost and latency. So I'm quite excited at what we've landed on the frontier and, you know, if you look at the graphs, we are the frontier with the family of the family of models that we have in 2.5. Uh, and we'll continue to push, uh, push upward, uh, with, um, the next set of, uh, releases. But, you know, overall happy with these sets of models becoming stable, becoming GA. Um, and overall happy with where we landed on the graph with Flash and Pro.",
      "timestamp": "08:02"
    },
    {
      "speaker": "Tulsi Doshi",
      "text": "And one thing I'll say actually on that point too, is one thing that's kind of cool about Flash is it even with all these performance gains, it continues to stay a very fast model. Right? So it's actually like if you look at its speed compared to other competitors, its decode speed, uh, tokens per second is actually still extremely competitive. It's actually probably one of the fastest models, um, in the market. And so I think what's kind of cool about how we're thinking about Flash is we continue to want it to have, uh, competitive speed and competitive cost, but we continue to improve that performance and so hopefully just becomes an even better value proposition over time.",
      "timestamp": "09:30"
    },
    {
      "speaker": "Logan Kilpatrick",
      "text": "Yeah, I love that. Anka, I'm I'm curious this, um, we we now have the sort of full complete set up of, of reasoning models across the board from Pro to Flash to Flashlight. Um, I I know you wear many different hats, but I'm I'm curious to like sort of get your reaction from whichever the dimensions, whichever the hats you you're wearing that you want to give the reaction to about just like how that progress has happened, how it's helped with, you know, our ability to align the model, the actually making like higher quality models, even at small sizes. I'm curious, uh, to get your perspective.",
      "timestamp": "10:07"
    },
    {
      "speaker": "Anka",
      "text": "Yeah, a lot to unpack there, huh? Um, so one thing that I will say, maybe I'll start with Pro. I mean, Flash is very flash. It's a, it's we aptly named it. A Pro, uh, 2.5 Pro, um, has, we've been iterating on it, but, um, one of the things that that stands out is that it's amazing at code and I'll get into that from the safety perspective too a little later. It's amazing at code, it's amazing at all the benchmarks. But something that we don't quite have a public benchmark for is maybe the way it behaves. The way it actually partners with you, the way it kind of shows up as a collaborator for you as a developer or you as an end user, um, which is something, um, you know, we've worked hard on, we keep working hard on, and also to be honest, we're we're starting to get a little bit for free. We're seeing as the models become more and more capable across the board, they're kind of intelligence, their broad intelligence is becoming, um, really good too. So, I'll give you a couple anecdotes on behavior, and I can also talk about what I don't like. But, um, one of, one of the things that struck me about the latest Pro was first of all, it's very witty and humorous. So, we sometimes like to challenge the model with these like fake trolley problems, right? Cuz cuz everyone wants to kind of trick the model and figure out how it responds. And sometimes we throw these kind of ridiculous trolley problems. I threw one at it about like sort of like, oh no, I saved the toaster instead of like two cows today. Um, and I feel, you know, so it's been it was such a difficult decision. And the model, you know, mo mostly our previous models would, um, either say, okay, this is kind of a ridiculous thing, or, you know, maybe show some empathy if they kind of go along with it. But this model is so cool because it just sort of goes on this rant, this hilarious rant about how these, you know, the the toaster is really the pinnacle of the breakfast civilization, you know. And then, and then it makes all these jokes about the toaster. Uh, what did the cows bring to you? Nothing, like, and then, um, and then at the end it says, you know, go and you've had a really rough day, go and enjoy a slice of toast. I mean, it's just it's it's such a cool. I, you know, it's the first time we've kind of seen that with a Pro family, uh, with a Pro line. And then, um, uh, so it's got humor, it's got wit. But maybe more practically, it shows up and helps you strategize, it shows up and helps you business plan, it shows up and and, um, you know, um, if if you're trying to write a, I was trying to write a letter with it, um, to my department chairs at Berkeley. And it came back at me and saying, hm, this is really good start, but here's a very different strategy you could be taking to, you know, take this thing that you're presenting as an ask into an offer and show them what they have to benefit. And I was, uh, I was a little bit blown away, right? So it's just the way that it actually collaborates with you to help you along, um, I think is is new and, uh, we're working to now, you know, use that to make the Flash model better, um, in these areas as well. Um, so, yeah, model behavior, very, very, very transformative, um, in the 2.5 family.",
      "timestamp": "10:42"
    },
    {
      "speaker": "Logan Kilpatrick",
      "text": "Yeah, I mean quick follow up, which is like what, um, what what do you think the thing is that's driven some of this? Is this like a, is this just like some intentional sort of design decisions that we've made or is this just like sort of emergent from like a capability improvement or like, yeah, how much of it is like us intentionally wanting some sort of model behavior changes versus like you just make the model better at, you know, code for 2.5 Pro and then it like also becomes, you know, you know, I don't want to say wittier, but like it's better along some dimensions that that you've just described.",
      "timestamp": "14:23"
    },
    {
      "speaker": "Anka",
      "text": "I would say it's both, but that I was really surprised at how much we get by improving kind of core capabilities across the board and improving reasoning. Uh, you know, I've seen this in safety as well. Um, it's always really hard, uh, to kind of draw some red lines, maybe around self-harm, something like that. And then make sure that the model doesn't over infer from that, doesn't over generalize and start saying no for stuff that we actually want it to help. Um, and, um, almost for free, I think reasoning kind of helps the model navigate these nuanced situations. And so you'll see that that uh, 2.5 Pro refuses, we have this metric on over refusals, it refuses a lot, a lot less in places where it was kind of accidentally refusing before, so we've seen it there. And maybe while I'm safety, I will note that this model is so good at coding that it triggered our frontier safety early warning trigger for cyber uplift. So cyber uplift is something where where, um, um, we we're measuring to what extent the model can help cyber attacks. And so far, you know, models that we were testing were sort of like, ah, you know, they can help a little bit, but not in any way that we see like as capabilities we'll keep improving, this will become an actual serious problem. Uh, the, the code and and broad improvements have made 2.5 Pro kind of come to this level where in our frontier safety report, we sort of said, look, this is for us, it's not the critical capability level yet. Uh, but it's early warning and we're starting uh, our, um, the response plan for this.",
      "timestamp": "14:56"
    },
    {
      "speaker": "Logan Kilpatrick",
      "text": "Yeah, and just a quick plug for the Gemini 2.5 uh, family technical report, which I think also came out today, which sort of goes into lots of detail across, um, across different, uh, axes of the new, the new model launch. Um, so super helpful perspective Anka. Um, Zach, I think we we sort of heard the the 2.5 Pro story and and how much, um, obviously that's been exciting for folks over the last few months, but I think sort of, uh, we've also always had this sort of small model story and I'm very excited for 2.5 Flashlight, um, also like natively with reasoning capabilities. Um, I'm curious if you can just give us the quick rundown, but also maybe for folks who don't have some of the historical context on small model stuff like Flash 8B and others, if you can sort of give that historical perspective as well.",
      "timestamp": "16:39"
    },
    {
      "speaker": "Zach",
      "text": "Yeah, of course. Uh, I'll start a little bit with the history. Like when we when we kicked off Gemini, uh, we really started with the Pro model. Um, and then when it came to, uh, the 1.5 series, um, we released Flash and we also had this Flash 8B model. Um, because we were seeing a lot of customers who were asking for like lower cost, lower latency models, and we were, uh, really trying to meet that demand. Um, and then as we came out with the like next series, um, you know, 2.0, uh, we came out with the Flashlight model and made that a part of like the full family and, uh, we're continuing to iterate on that and now we are offering this 2.5 Flashlight model. I think we've gotten to a really good place where now I think we have a much better understanding in terms of like cost, latency, quality trade-offs where we're meeting customers, uh, in terms of what they're asking for. Um, and making sure that we could support like the best model for their use cases. So for example, for Flashlight, and these cheaper models, we're seeing a lot of people who really love using this for Rag use cases, um, people, customers using it for classification, like content moderation, translation, so very like high volume use cases, um, that people are doing offline, but we also saw people asking for more latency improvements. Um, I think like some of our learnings from like the Gemini, uh, diffusion model, uh, people were really excited about the low latency there. So we wanted to continue to push for that, uh, in the Flashlight model. So now we have 2.5 Light, which is not only our, uh, most cost-effective model, um, but it is our lowest latency model. Um, so we see significant latency gains in terms of what was available in 2.0 Flashlight and also 2.0 Flash, um, that we're really excited about. And then beyond just, uh, latency improvements, as I mentioned, the quality is significantly better than the 2.0 Flashlight model. Um, and we see some new capabilities. So the model's now, um, has the thinking capabilities, uh, and it also has, uh, like the ability to use tools, um, which was missing, uh, like, uh, um, code execution and search. Uh, so we're we're glad that this model is like a more complete, uh, uh, part of the family and, uh, we're excited to see how people use it.",
      "timestamp": "17:25"
    },
    {
      "speaker": "Logan Kilpatrick",
      "text": "Yeah, I feel like this like underlying we need a t-shirt that says this, I think in t-shirts these days, which is unfortunate. Um, if anyone who who hears me talk about this too often.",
      "timestamp": "19:53"
    },
    {
      "speaker": "Tulsi Doshi",
      "text": "Well, Logan, I need more t-shirts than. I feel like Well, I know. Well, this is this is on the the we've got people. So hopefully they'll Harrison and others can help us do this. But, um, yeah, I I think the subtext is like the 2.0 Flashlight is the best price per intelligence of any model. And I think the reasoning capability in there sort of pushes that frontier even further. Um, so it's been awesome to see that, you know, hopefully continuing to build on the success of, of, you know, Flash 8B and others, which I think Flash 8B for a while was like one of the most used models on on it was like was the most used model on Open Router. So I'm I'm excited for people to see this model, um, and at 10 cents input and 40 cents output. I feel like it's a, it's a great deal.",
      "timestamp": "20:04"
    },
    {
      "speaker": "Zach",
      "text": "Yeah, no, totally. I think like, you know, in the original days it was all about pushing for quality and, uh, that pro was the high high priority model and we just saw more and more demand for these like low cost, low latency models. So we're we're really happy that we've gotten to this point and I think looking at the competition and competitors, this is like such a like a good price value for the level of quality that you're getting.",
      "timestamp": "20:48"
    },
    {
      "speaker": "Logan Kilpatrick",
      "text": "Yeah, I'm super excited. Hopefully we'll, we'll see lots of usage. Um, Tulsi, I've got a bunch of hard questions for you, uh, and a bunch of these are around developer feedback, um, that we've gotten. I think one of them in and you and I have obviously spent, uh, nights and weekends talking about this, but we obviously we shipped a lot of preview models. Um, and we sort of finally taken the the last step of this in, um, releasing sort of the stable version of the model. Can you sort of walk us through that just for folks who don't have context on what that iteration process looks like and and why we actually do this? Cuz again, you and I have talked about it, there is a cost to doing this. There's a little bit of confusion from developers sometimes, but, um, I think in the most optimistic sense, like, it gets us to the best models, um, is my take, but I'm I'm curious what your, yeah, what you're happy about.",
      "timestamp": "21:12"
    },
    {
      "speaker": "Tulsi Doshi",
      "text": "Yeah, I mean that's what we what we like to think. I think so, you know, what just for context for folks, you know, we launched the original 2.5 preview now three months ago, two and a half months ago. Um, and we wanted to release these previews because we want to actually work with developers to get feedback. And I think one thing we were feeling is that we have our own internal evals and we have our own internal ways of measuring and hill climbing the models, which are great. Um, and we do leverage those to make sure that we kind of can tap into key capabilities of the model. But ultimately, where you really see the magic of the model is when you see real people using the model to do amazing things. Um, and that's when you actually start to find where the model shines, maybe in ways that we didn't even realize in our in our benchmarks. This goes to Anka's toaster example earlier, you know, sometimes you see sparks when you actually start trying new things with the model that weren't possible before that maybe weren't even captured in your evals in the way that you best wanted them to. And so when we put these models out in preview, it gives us a chance to have developers across the world, across IDEs, across surfaces, try the model, and we find both things that are working really, really well. Some that we expected, some that are surprising. We also find bugs and issues that maybe we didn't originally think through. And that actually then allows us to iterate and continue to improve the model, right? So, for example, something that has gotten better preview to preview is the model's use of tools. And that's because we've kind of seen and learned from how the model is using tools in code IDEs and surfaces and applications and said, okay, wait, how do we actually make this better. And we're still learning from that feedback and still iterating on it. Um, I think another example is as Anka talked through like behavior, but also things like creative writing. We've seen kind of our ability to improve model to model. Code performance has improved model to model because of the feedback. Um, and so that actually preview process allows us to get to a model that we're really excited to put in your hands for kind of a a longer period of time that we feel more confident about. Um, and so I think we're still figuring out what the right process is here. You know, I think we learn also along the way that there are things we really need to make sure we're careful about to not hurt the developer experience, right? So, um, we I think are trying to be more cautious about making sure that we don't just hot swap the model, but that we keep models around and give a sufficient notice so that we actually are not kind of pulling out a rug from underneath, you know, someone who's building something amazing. Um, so we're trying to find the right balance between putting out previews that can allow us to get feedback and iterate with all of you and make it an even better model, um, while also making sure that we provide stability and and reliability in the experience and that's something we're going to continue to work on.",
      "timestamp": "21:59"
    },
    {
      "speaker": "Logan Kilpatrick",
      "text": "Yeah, 100%. I I think my take is we're getting closer to the the right balance place, um, and yeah, I think lots of hard work by the team and and you all to make that happen over the last three months. So, uh, hopefully all of the previews were worth it and folks are very excited about sort of the stable version of the model. Um, which takes me to one sort of last big question and then we'll we'll go to a bunch of specific questions that folks asked, uh, in the replies and comments or threads wherever they are. Um, just around like what comes next. Like we we obviously we've seen a bunch of um, a bunch of usage of these models already and folks scaling up, but, um, Anka and Melvin, um, we've also like, uh, part of the challenge is we continue to get feedback about things that folks want the models to be better at. Um, so I'm curious if there's things that we've gotten feedback on that are top of mind for you all, um, and also I'll I'll sort of, I'll give the direct quote just so that folks don't, uh, don't sort of take it the wrong way. But at the bottom of the 2.5, uh, blog post, it says, we can't wait to see even more domains benefit from the intelligence of 2.5 Pro and look forward to sharing more about scaling beyond Pro in the near future. Um, so a little bit of, uh, a little bit of excitement. Those were Tulsi's words for the record. Um, just I'm just kidding. Um, so but but Melvin and Anka, um, I'm curious, uh, what what your perspective is on on where we go next from here and what else we're trying to hill climb on.",
      "timestamp": "24:34"
    },
    {
      "speaker": "Melvin Johnson",
      "text": "Yeah, I mean, I think uh a lot of the evals that we've been tracking actually have uh saturated and that's a good thing. So we're always looking for harder and harder evals and as Tulsi was saying, we're leaning in more into real world use cases, right? Especially with the pro for sweet coding and uh tool use in real environments like cursor and Replit and Klein and so on. So we'll learn a lot from that kind of usage and there actually, uh, it's about, it's not just about performance, it's about reliability. Like you want to reliably 100% of the time do the thing consistently, you don't want to change behavior overnight. You don't want to start making large edits to a code base. Uh, you want to think about, uh, maintenance, you want to think from the user's perspective of, you know, how do you want to help them. So some of these elements we want to bake into the models. And as to, you know, what's what's next for us, like the team is constantly iterating, like we never stop. Uh, I think we're excited about what can we do even beyond pro and how we can push, uh, push further. So, you know, stay tuned, uh, for something more to come from us and, uh, we'll do another preview as, you know, uh, yes, Anka.",
      "timestamp": "25:59"
    },
    {
      "speaker": "Anka",
      "text": "Yeah, I can say, can I say, uh, one cool aspirational thing, which is even beyond reliable code and tools, um, and use cases like cursor, which we have to improve on, I'll say aspirationally, more and more leaning into agentic behavior. So, I think that's that's a big one. And then I'll say something really, can I say something really boring? Uh, Melvin and I are really keen to improve our system instruction following ability. We want you all to be able to put in a whole like product spec or whatever you want and then have the model beautifully follow according to that and not miss any detail and not have any kind of collateral damage. So, this is the very simple, boring thing that we think Gemini should do even better on, um, and, uh, we're going to be, we're going to be cranking on that.",
      "timestamp": "27:22"
    },
    {
      "speaker": "Tulsi Doshi",
      "text": "I feel like as a PM, that's the dream. You know, like put in the spec, model magically achieves, it's amazing.",
      "timestamp": "28:18"
    },
    {
      "speaker": "Logan Kilpatrick",
      "text": "I love that. Um, those were both great answers. I've got a bunch of rapid fire questions now for I'll try to direct it, but if someone else wants to jump in and take it, uh, we'll do it. Lots of people asking, uh, maybe Tulsi for you on this one. When are we getting DeepThink? We we sort of previewed some of the research results. Um, I'm not sure that we're ready to commit to a date today, but obviously lots of excitement about, uh, continuing to scale the the reasoning access, uh, axes.",
      "timestamp": "28:27"
    },
    {
      "speaker": "Tulsi Doshi",
      "text": "Yeah, no, we're we're excited about it. I think, um, right now, DeepThink, we actually just, um, did an iteration to it in the last week. The team has been working really hard and actually has, um, even better numbers than than we originally had. And so we're actually doing another round of trusted testing right now. So we kicked that off last week, put the model in the hands of a set of trusted testers. And what's been really fun is actually to see some of the feedback. So for example, uh, we had someone who is an expert in math, come back and say, hey, wow, this can do a problem that I actually haven't been able to see a model do before. Um, and it's that kind of feedback we're trying to get to really understand where is the model standing out, where is it differentiated? Um, we're also working with safety testers. So that's currently, uh, the process we're in right now is going through that trusted testing and then we'll figure out the release plan up to that.",
      "timestamp": "28:52"
    },
    {
      "speaker": "Logan Kilpatrick",
      "text": "Awesome. Um, another question was someone was asking, so is this the final update for the Gemini 2.5 Pro, uh, model family?",
      "timestamp": "29:41"
    },
    {
      "speaker": "Tulsi Doshi",
      "text": "What does final mean? I know.",
      "timestamp": "29:53"
    },
    {
      "speaker": "Logan Kilpatrick",
      "text": "Last 2.5 is this the final final version?",
      "timestamp": "29:55"
    },
    {
      "speaker": "Tulsi Doshi",
      "text": "I think what I can say for sure is that this is the like stable version of the 2.5 model. What I mean by that is like this model, we plan to have for for a year. We will, you know, have very clear sort of, uh, stable support for it. We will have a very clear deprecation plan for it, you know, in 2026. Uh, so this is a model, this you can consider this the Gemini 2.5 Pro model. Um, I think what you can also expect though is that we're going to continue to drive improvements, right? So as you identify things with this 2.5, uh, model, as you want to, uh, try new use cases, as you find feedback, please continue to share that with us because we're going to just keep iterating from a pre-training and post-training perspective to make the next models we release even better and and that's a process we want to continue to do.",
      "timestamp": "29:58"
    },
    {
      "speaker": "Logan Kilpatrick",
      "text": "Awesome. Um, And DeepThink, we'll add, you know, DeepThink has been kind of previewed, but, um, we're iterating on that and and trying to broaden it. Um, so I feel like that's another place where Pro will evolve. Yeah, I love it. Um, another comment from someone was, love the Flashlight model, are we getting it in the Gemini app? Um, and Melvin, uh, you and Zach, we were sort of talking off camera about like what the or off space, uh, off audio, um, about what some of the, the use cases for that might look like and why people would be excited about that. None of us are on the Gemini app team, so we'll have to go pester Josh after this. Um, but like your, your take on why this would be useful for folks, um, in the Gemini app.",
      "timestamp": "30:45"
    },
    {
      "speaker": "Zach",
      "text": "Yeah, so, um, happy to take this one. In terms of, uh, like one of the key value props, uh, we really pushed for a lower latency model, um, for Flashlight. Um, so we're excited about that and I think that could be valuable for users in the Gemini app if they're looking for lower latency like vibe coding, um, that's something that we've heard a lot of people excited about, um, because just when they're when, you know, now just thinking is the new compiling and having to wait for the model. Uh, so I think like one of the things like especially as we put that early preview or we put out the, uh, the demo for Gemini Diffusion, we're just like really trying to understand and where people want low latency, um, and what will that be most useful for. Um, so we're excited that that helped influence the decision to make Flashlight lower latency. We want to keep pushing that and we want to collect more feedback and understand like what is that sweet spot? How, you know, what does if if we can continue to push the frontier in terms of how fast these models can generate, uh, what new use cases, um, will that unlock. So I think definitely vibe coding is one of those areas. I've heard some feedback, um, and we're excited to hear more.",
      "timestamp": "31:31"
    },
    {
      "speaker": "Melvin Johnson",
      "text": "I mean, I think the other thing is from the usage, we want to understand for these different models, Flashlight, Flash and Pro, what the dynamic thinking budget needs to be, like how how much it needs to think, like are people latency sensitive for certain kinds of use cases versus not. Uh, and, uh, you know, we just don't know. Uh, we're actually at the frontier here because we're trying to build both the reasoning and the chat and the API model into one. Uh, so we really want to figure out, you know, how to do it right instead of shipping like six different models.",
      "timestamp": "32:46"
    },
    {
      "speaker": "Logan Kilpatrick",
      "text": "Yeah. No, 100%. Um, we've got two minutes left and I'll, I'll take two more questions really quickly. One of them was, um, why did 2.5 Flash increase its price by 100% input and 400% output? Seems extreme. Um, just for folks who who don't have context on the API pricing, um, 2.5 Flash previously had a different input price and, uh, it had the same input price but different output prices, uh, depending on whether it was using thinking versus not thinking. Um, and we got a bunch of feedback on this. So the sort of GA version of the model, we consolidated to a single price point. Um, the input price went up a little bit, um, but the output price, uh, relative to the reasoning version, um, actually went down slightly. Um, so we think of 2.5 Flash and anyone feel free to jump in and add extra context as like a super performant reasoning model. Like the main use cases for 2.5 Flash are the reasoning use cases. Um, you can turn reasoning off if you want to, but like the model was really built to be, um, a reasoning model from the ground up. Uh, that takes me to the question of like, um, someone was asking, we were using 2.5 Flash non-thinking without the reasoning capability. Um, is it safe to say that we can migrate to 2.5 Flashlight, uh, with no performance differences? So we did, um, I'm curious Zach, Melvin, Tulsi, Anka, if someone wants to jump in here, but we did release a a bunch of like graphics which show some of the the model metrics. Um, I think this depends like very much on the use case, like what, what the difference will look like, but I don't know if anyone sort of broad strokes wants to talk about, um, the differences. I'll make a quick note, which is one of the things I'm happy about, which is, if you were using, um, the non-thinking version of 2.5 Flash, if you migrate to Flashlight, the like base prices are actually less expensive. Um, so hopefully for some use cases it actually becomes more cost efficient, um, and, you know, relatively in the same ballpark for performance, but I don't know if who wants to jump in here and add more context.",
      "timestamp": "33:21"
    },
    {
      "speaker": "Zach",
      "text": "Yeah, I can, I can, uh, talk a little bit about here. Um, yeah, so we're, we're, you know, want to make sure that migration path is as smooth as possible. Um, at least for the 2.5 Flash, uh, Flashlight model. Um, the model is, uh, you know, definitely significantly higher quality, um, than 2.0, uh, Flashlight. Um, we also with thinking turned on, it is higher quality than the 2.0 Flash. Um, without the thinking model, without the thinking turned on, I think, as you were mentioning, Logan, it's pretty use case specific where I think for some of the easier, uh, simpler use cases, you're definitely going to see the model, uh, on par with quality. And in other cases, uh, it might not be on par. I think the model is in preview, so this is, you know, key area that we're trying to learn more so that the migration can be as smooth as possible, um, and make sure that we heal, uh, continue to hill climb on the most important use cases that people are prioritizing for the Flashlight model.",
      "timestamp": "35:20"
    },
    {
      "speaker": "Logan Kilpatrick",
      "text": "Yeah, 100%. I love that. Tulsi, you wanted to add something else?",
      "timestamp": "36:22"
    },
    {
      "speaker": "Tulsi Doshi",
      "text": "I was just going to say, I think, um, this is also going back to the kind of what is our launch approach and what are we trying to do here. I think we really want to make sure that Flashlight is a model that ideally developers can rely on as a as a solid migration path from from some of the 2.4 models. And so, um, you know, this is a call to action to all of you who are listening, like as you're trying Flashlight and you're finding areas where it's working really well, share those. If you're finding areas where it's not, uh, and you're seeing deltas in performance, that's also really helpful for us. Um, to something Melvin said earlier, you know, we're continuing to push the performance of these models. We're going to do that for Flashlight too. Uh, and we want to make sure that we, you know, know where we should really be investing, where it's most valuable for all of you.",
      "timestamp": "36:27"
    },
    {
      "speaker": "Logan Kilpatrick",
      "text": "Yeah, I love that. Awesome. Well, this was, um, this was wonderful. I feel like it's always a pleasure to get you all together and and have these conversations. Um, I'll echo Tulsi's comment, which is I think part of what makes the model improvement flywheel spin is feedback from developers. So if you have stuff, we're all here, uh, please ping any of us. Uh, we'll continue to make, uh, make progress on on hill climbing. Um, thank you for all the questions. Thank you for all of the speakers, Tulsi, Melvin, Zach and Anka for taking the time to chat and excited to hopefully do more of these, uh, sometime soon.",
      "timestamp": "37:09"
    },
    {
      "speaker": "Speaker",
      "text": "Thanks. Bye bye.",
      "timestamp": "37:47"
    },
    {
      "speaker": "Speaker",
      "text": "Bye folks. See ya. Take care.",
      "timestamp": "37:48"
    },
    {
      "speaker": "Speaker",
      "text": "Thanks everybody.",
      "timestamp": "37:50"
    }
  ]
}

Token usage

74,073 input, 10,477 output, {"candidatesTokenCount": 10305, "promptTokensDetails": [{"modality": "TEXT", "tokenCount": 5}, {"modality": "AUDIO", "tokenCount": 74068}], "thoughtsTokenCount": 172}


simonw commented Jun 17, 2025

Audio input tokens are priced differently, at $1/million compared to $0.30/million for text. I'll ignore the 5 text input tokens.

https://www.llm-prices.com/#it=74068&ot=10477&ic=1&oc=2.50 = 10.026 cents.
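Worked through as a quick sketch, using the prices stated above ($1.00/million audio input tokens, $2.50/million output tokens) and ignoring the 5 text input tokens:

audio_input_tokens = 74068
output_tokens = 10477

# 0.074068 + 0.0261925 = 0.1002605 dollars
cost_usd = audio_input_tokens * 1.00 / 1_000_000 + output_tokens * 2.50 / 1_000_000

print(round(cost_usd * 100, 3))  # 10.026 cents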


simonw commented Jun 17, 2025

72603ms = 72.6 seconds
