Predicted Outputs

Reduce latency for model responses where much of the response is known ahead of time.

Predicted Outputs enable you to speed up API responses from Chat Completions when many of the output tokens are known ahead of time. This is most common when you are regenerating a text or code file with minor modifications. You provide your prediction using the prediction request parameter in Chat Completions.

Predicted Outputs are available with the latest gpt-4o and gpt-4o-mini models. Read on to learn how to use Predicted Outputs to reduce latency in your applications.

Code refactoring example

Predicted Outputs are particularly useful for regenerating text documents and code files with only minor modifications. Say you want the GPT-4o model to refactor a piece of TypeScript code, converting the username property of the User class below to be email instead:

class User {
  firstName: string = "";
  lastName: string = "";
  username: string = "";
}

export default User;

Most of the file will be unchanged, except for line 4 above. If you use the current text of the code file as your prediction, you can regenerate the entire file with lower latency. For larger files, these time savings add up quickly.

Below is an example of using the prediction parameter to predict that the final output of the model will be very similar to the original code file, which we use as the prediction text.

Refactor a TypeScript class with a Predicted Output
import OpenAI from "openai";

const code = `
class User {
  firstName: string = "";
  lastName: string = "";
  username: string = "";
}

export default User;
`.trim();

const openai = new OpenAI();

const refactorPrompt = `
Replace the "username" property with an "email" property. Respond only 
with code, and with no markdown formatting.
`;

const completion = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    {
      role: "user",
      content: refactorPrompt
    },
    {
      role: "user",
      content: code
    }
  ],
  prediction: {
    type: "content",
    content: code
  }
});

// Inspect returned data
console.log(completion);
console.log(completion.choices[0].message.content);

In addition to the refactored code, the model response will contain data that looks something like this:

{
  id: 'chatcmpl-xxx',
  object: 'chat.completion',
  created: 1730918466,
  model: 'gpt-4o-2024-08-06',
  choices: [ /* ...actual text response here... */],
  usage: {
    prompt_tokens: 81,
    completion_tokens: 39,
    total_tokens: 120,
    prompt_tokens_details: { cached_tokens: 0, audio_tokens: 0 },
    completion_tokens_details: {
      reasoning_tokens: 0,
      audio_tokens: 0,
      accepted_prediction_tokens: 18,
      rejected_prediction_tokens: 10
    }
  },
  system_fingerprint: 'fp_159d8341cc'
}

Note both the accepted_prediction_tokens and rejected_prediction_tokens in the usage object. In this example, 18 tokens from the prediction were used to speed up the response, while 10 tokens were rejected.

Note that any rejected tokens are still billed like other completion tokens generated by the API, so Predicted Outputs can introduce higher costs for your requests.
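It can be useful to track how much of your prediction is actually being accepted. The helper below is a hypothetical sketch (not part of the SDK) that summarizes the prediction fields from a completion's usage object, using the numbers from the example response above:

```javascript
// Hypothetical helper (not part of the SDK): summarize how much of the
// prediction was used, given the usage object from a completion response.
function summarizePrediction(usage) {
  const details = usage.completion_tokens_details ?? {};
  const accepted = details.accepted_prediction_tokens ?? 0;
  const rejected = details.rejected_prediction_tokens ?? 0;
  const predicted = accepted + rejected;
  return {
    accepted,
    rejected,
    // Fraction of the prediction tokens that made it into the response.
    acceptanceRate: predicted > 0 ? accepted / predicted : 0,
  };
}

// Usage object from the example response above.
const usage = {
  prompt_tokens: 81,
  completion_tokens: 39,
  completion_tokens_details: {
    accepted_prediction_tokens: 18,
    rejected_prediction_tokens: 10,
  },
};

console.log(summarizePrediction(usage));
```

A consistently low acceptance rate suggests your prediction text diverges too much from the model's output, and you may be paying for rejected tokens without much latency benefit.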

Streaming example

The latency gains of Predicted Outputs are even greater when you use streaming for API responses. Here is an example of the same code refactoring use case, but using streaming in the OpenAI SDK instead.

Predicted Outputs with streaming
import OpenAI from "openai";

const code = `
class User {
  firstName: string = "";
  lastName: string = "";
  username: string = "";
}

export default User;
`.trim();

const openai = new OpenAI();

const refactorPrompt = `
Replace the "username" property with an "email" property. Respond only 
with code, and with no markdown formatting.
`;

const stream = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    {
      role: "user",
      content: refactorPrompt
    },
    {
      role: "user",
      content: code
    }
  ],
  prediction: {
    type: "content",
    content: code
  },
  stream: true
});

// Inspect returned data
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}
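The loop above assembles the response one delta at a time. The sketch below shows the same accumulation pattern in isolation, standing in for the API with a mock async iterable of chunks (the chunk shape, choices[0].delta.content, is the same either way):

```javascript
// Collect streamed deltas into the full response text.
async function collectStream(stream) {
  let text = "";
  for await (const chunk of stream) {
    text += chunk.choices[0]?.delta?.content || "";
  }
  return text;
}

// Mock chunks shaped like Chat Completions streaming deltas,
// standing in for a real API response stream.
async function* mockChunks() {
  for (const piece of ["class User {", " /* ... */ ", "}"]) {
    yield { choices: [{ delta: { content: piece } }] };
  }
}

collectStream(mockChunks()).then((text) => console.log(text));
```

Because chunks arrive as soon as prediction tokens are accepted, the first deltas of a largely-predicted response appear almost immediately.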

Position of predicted text in response

When providing prediction text, your prediction can appear anywhere within the generated response and still provide latency reduction for the response. Let's say your predicted text is the simple Hono server shown below:

import { serveStatic } from "@hono/node-server/serve-static";
import { serve } from "@hono/node-server";
import { Hono } from "hono";

const app = new Hono();

app.get("/api", (c) => {
  return c.text("Hello Hono!");
});

// You will need to build the client code first `pnpm run ui:build`
app.use(
  "/*",
  serveStatic({
    rewriteRequestPath: (path) => `./dist${path}`,
  })
);

const port = 3000;
console.log(`Server is running on port ${port}`);

serve({
  fetch: app.fetch,
  port,
});

You could prompt the model to regenerate the file with a prompt like this:

Add a get route to this application that responds with 
the text "hello world". Generate the entire application 
file again with this route added, and with no other 
markdown formatting.

The response to the prompt might look like this:

import { serveStatic } from "@hono/node-server/serve-static";
import { serve } from "@hono/node-server";
import { Hono } from "hono";

const app = new Hono();

app.get("/api", (c) => {
  return c.text("Hello Hono!");
});

app.get("/hello", (c) => {
  return c.text("hello world");
});

// You will need to build the client code first `pnpm run ui:build`
app.use(
  "/*",
  serveStatic({
    rewriteRequestPath: (path) => `./dist${path}`,
  })
);

const port = 3000;
console.log(`Server is running on port ${port}`);

serve({
  fetch: app.fetch,
  port,
});

You would still see accepted prediction tokens in the response, even though the predicted text appeared both before and after the new content added to the response:

{
  id: 'chatcmpl-xxx',
  object: 'chat.completion',
  created: 1731014771,
  model: 'gpt-4o-2024-08-06',
  choices: [ /* completion here... */],
  usage: {
    prompt_tokens: 203,
    completion_tokens: 159,
    total_tokens: 362,
    prompt_tokens_details: { cached_tokens: 0, audio_tokens: 0 },
    completion_tokens_details: {
      reasoning_tokens: 0,
      audio_tokens: 0,
      accepted_prediction_tokens: 60,
      rejected_prediction_tokens: 0
    }
  },
  system_fingerprint: 'fp_9ee9e968ea'
}

This time, there were no rejected prediction tokens, because the entire content of the file we predicted was used in the final response. Nice! 🔥

Limitations

Consider the following factors and limitations when using Predicted Outputs.

  • Predicted Outputs are only supported with the GPT-4o and GPT-4o-mini series of models.
  • When providing a prediction, any tokens provided that are not part of the final completion are still charged at completion token rates. See the rejected_prediction_tokens property of the usage object to see how many tokens were not used in the final response.
  • The following API parameters are not supported when using Predicted Outputs:
    • n: values higher than 1 are not supported
    • logprobs: not supported
    • presence_penalty: values greater than 0 are not supported
    • frequency_penalty: values greater than 0 are not supported
    • audio: Predicted Outputs are not compatible with audio inputs and outputs
    • modalities: only text modalities are supported
    • max_completion_tokens: not supported
    • tools: function calling is not currently supported with Predicted Outputs
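Because an incompatible parameter only surfaces as an API error at request time, it can help to check request options up front. The helper below is a hypothetical pre-flight check (not part of the SDK) that flags options incompatible with Predicted Outputs, based on the list above:

```javascript
// Hypothetical pre-flight check (not part of the SDK): return a list of
// request options that are incompatible with Predicted Outputs.
function checkPredictedOutputParams(options) {
  const problems = [];
  if (options.n !== undefined && options.n > 1) problems.push("n > 1");
  if (options.logprobs) problems.push("logprobs");
  if (options.presence_penalty > 0) problems.push("presence_penalty > 0");
  if (options.frequency_penalty > 0) problems.push("frequency_penalty > 0");
  if (options.audio) problems.push("audio");
  if (options.modalities && options.modalities.some((m) => m !== "text"))
    problems.push("non-text modalities");
  if (options.max_completion_tokens !== undefined)
    problems.push("max_completion_tokens");
  if (options.tools) problems.push("tools (function calling)");
  return problems;
}

console.log(checkPredictedOutputParams({ n: 2, logprobs: true }));
```

Running such a check before adding the prediction parameter lets you fail fast locally instead of waiting for the API to reject the request.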