Predicted Outputs
Predicted Outputs enable you to speed up API responses from Chat Completions when many of the output tokens are known ahead of time. This is most common when you are regenerating a text or code file with only minor modifications. You can provide your prediction using the prediction request parameter in Chat Completions.
Predicted Outputs are available today using the latest gpt-4o and gpt-4o-mini models. Read on to learn how to use Predicted Outputs to reduce latency in your applications.
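At its core, a prediction is just the text you expect the model's output to closely match, passed alongside your normal request parameters. Here is an abbreviated sketch; expectedOutputText is a placeholder for your own predicted text, and a full walkthrough follows below:
// Abbreviated sketch: pass the text you expect the model to produce
// as a prediction. `expectedOutputText` is a placeholder, not an SDK value.
const completion = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Regenerate this file..." }],
  prediction: {
    type: "content",
    content: expectedOutputText,
  },
});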
Code refactoring example
Predicted Outputs are particularly useful for regenerating text documents and code files with only minor modifications. Say you want the GPT-4o model to refactor a piece of TypeScript code, converting the username property of the User class to an email property instead:
class User {
  firstName: string = "";
  lastName: string = "";
  username: string = "";
}

export default User;
Most of the file will be identical except for line 4 above. If you use the current text of the code file as your prediction, you can regenerate the entire file with lower latency. These time savings add up quickly for larger files.
Here's how you could generate the new code file using the prediction parameter, predicting that the model's final output will be very similar to the original code file, which we use as the prediction text:
import OpenAI from "openai";

const code = `
class User {
  firstName: string = "";
  lastName: string = "";
  username: string = "";
}

export default User;
`.trim();

const openai = new OpenAI();

const refactorPrompt = `
Replace the "username" property with an "email" property. Respond only
with code, and with no markdown formatting.
`;

const completion = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    {
      role: "user",
      content: refactorPrompt
    },
    {
      role: "user",
      content: code
    }
  ],
  prediction: {
    type: "content",
    content: code
  }
});

// Inspect returned data
console.log(completion);
console.log(completion.choices[0].message.content);
In addition to the refactored code, the model response will contain data that looks like this:
{
  id: 'chatcmpl-xxx',
  object: 'chat.completion',
  created: 1730918466,
  model: 'gpt-4o-2024-08-06',
  choices: [ /* ...actual text response here... */ ],
  usage: {
    prompt_tokens: 81,
    completion_tokens: 39,
    total_tokens: 120,
    prompt_tokens_details: { cached_tokens: 0, audio_tokens: 0 },
    completion_tokens_details: {
      reasoning_tokens: 0,
      audio_tokens: 0,
      accepted_prediction_tokens: 18,
      rejected_prediction_tokens: 10
    }
  },
  system_fingerprint: 'fp_159d8341cc'
}
Note both the accepted_prediction_tokens and rejected_prediction_tokens in the usage object. In this example, 18 tokens from the prediction were used to speed up the response, while 10 tokens were rejected.
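If you want to monitor how effective your predictions are over time, you can read these fields off the response at runtime. Below is a minimal sketch, assuming completion is the response object from the example above, that logs the share of predicted tokens the model accepted:
// Minimal sketch: summarize prediction efficiency from a response object.
// Assumes `completion` is the Chat Completions response from above.
const details = completion.usage?.completion_tokens_details;
const accepted = details?.accepted_prediction_tokens ?? 0;
const rejected = details?.rejected_prediction_tokens ?? 0;
const total = accepted + rejected;

if (total > 0) {
  // e.g. "Prediction acceptance: 18/28 tokens (64.3%)"
  console.log(
    `Prediction acceptance: ${accepted}/${total} tokens ` +
    `(${((accepted / total) * 100).toFixed(1)}%)`
  );
}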
Streaming example
The latency gains of Predicted Outputs are even greater when you use streaming for API responses. Here's an example of the same code refactoring use case, but using streaming with the OpenAI SDK instead:
import OpenAI from "openai";

const code = `
class User {
  firstName: string = "";
  lastName: string = "";
  username: string = "";
}

export default User;
`.trim();

const openai = new OpenAI();

const refactorPrompt = `
Replace the "username" property with an "email" property. Respond only
with code, and with no markdown formatting.
`;

const stream = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    {
      role: "user",
      content: refactorPrompt
    },
    {
      role: "user",
      content: code
    }
  ],
  prediction: {
    type: "content",
    content: code
  },
  stream: true
});

// Write the streamed response to stdout as it arrives
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}
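By default, streamed chunks do not include token usage, so you won't see accepted_prediction_tokens or rejected_prediction_tokens in the output above. If you want those stats for a streamed request, one option is the stream_options parameter, sketched below; when include_usage is set, the final chunk carries a usage object and an empty choices array:
// Minimal sketch: request usage stats alongside a streamed response.
const stream = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    { role: "user", content: refactorPrompt },
    { role: "user", content: code }
  ],
  prediction: { type: "content", content: code },
  stream: true,
  // Ask the API to append a final chunk containing token usage
  stream_options: { include_usage: true }
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
  // All chunks except the last have usage: null
  if (chunk.usage) {
    console.log("\n", chunk.usage.completion_tokens_details);
  }
}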
Position of predicted text in a response
When providing prediction text, your prediction can appear anywhere within the generated response and still provide latency reduction for the response. Let's say your predicted text is the simple Hono server shown below:
import { serveStatic } from "@hono/node-server/serve-static";
import { serve } from "@hono/node-server";
import { Hono } from "hono";

const app = new Hono();

app.get("/api", (c) => {
  return c.text("Hello Hono!");
});

// You will need to build the client code first `pnpm run ui:build`
app.use(
  "/*",
  serveStatic({
    rewriteRequestPath: (path) => `./dist${path}`,
  })
);

const port = 3000;
console.log(`Server is running on port ${port}`);

serve({
  fetch: app.fetch,
  port,
});
You could prompt the model to regenerate the file with a prompt like this:
Add a get route to this application that responds with
the text "hello world". Generate the entire application
file again with this route added, and with no other
markdown formatting.
The response to the prompt might look like this:
import { serveStatic } from "@hono/node-server/serve-static";
import { serve } from "@hono/node-server";
import { Hono } from "hono";

const app = new Hono();

app.get("/api", (c) => {
  return c.text("Hello Hono!");
});

app.get("/hello", (c) => {
  return c.text("hello world");
});

// You will need to build the client code first `pnpm run ui:build`
app.use(
  "/*",
  serveStatic({
    rewriteRequestPath: (path) => `./dist${path}`,
  })
);

const port = 3000;
console.log(`Server is running on port ${port}`);

serve({
  fetch: app.fetch,
  port,
});
You would still see accepted prediction tokens in the response, even though the prediction text appeared both before and after the new content added to the response:
{
  id: 'chatcmpl-xxx',
  object: 'chat.completion',
  created: 1731014771,
  model: 'gpt-4o-2024-08-06',
  choices: [ /* completion here... */ ],
  usage: {
    prompt_tokens: 203,
    completion_tokens: 159,
    total_tokens: 362,
    prompt_tokens_details: { cached_tokens: 0, audio_tokens: 0 },
    completion_tokens_details: {
      reasoning_tokens: 0,
      audio_tokens: 0,
      accepted_prediction_tokens: 60,
      rejected_prediction_tokens: 0
    }
  },
  system_fingerprint: 'fp_9ee9e968ea'
}
This time, there were no rejected prediction tokens, because the entire content of the file we predicted was used in the final response. Nice! 🔥
Limitations
When using Predicted Outputs, you should consider the following factors and limitations.
- Predicted Outputs are only supported with the GPT-4o and GPT-4o-mini series of models.
- When providing a prediction, any tokens provided that are not part of the final completion are still charged at completion token rates. See the rejected_prediction_tokens property of the usage object to find out how many tokens were not used in the final response.
- The following API parameters are not supported when using Predicted Outputs (a sketch of working around this follows the list):
  - n: values higher than 1 are not supported
  - logprobs: not supported
  - presence_penalty: values greater than 0 are not supported
  - frequency_penalty: values greater than 0 are not supported
  - audio: Predicted Outputs are not compatible with audio inputs and outputs
  - modalities: only text modalities are supported
  - max_completion_tokens: not supported
  - tools: function calling is not currently supported with Predicted Outputs