Leaked Technical Details of GPT-4

Last updated: the morning of July 11, 2023

From Yam Peleg's Twitter.

GPT-4's details are leaked.

It is over. Everything is here:

Parameter count

GPT-4 is more than 10x the size of GPT-3. We believe it has a total of ~1.8 trillion parameters across 120 layers.

Mixture Of Experts - Confirmed

OpenAI was able to keep costs reasonable by utilizing a mixture-of-experts (MoE) model.

They utilize 16 experts within their model, each of about ~111B parameters for the MLP. Two of these experts are routed to per forward pass.

MoE Routing

While the literature talks a lot about advanced routing algorithms for choosing which experts to route each token to, OpenAI's is allegedly quite simple for the current GPT-4 model.

There are roughly ~55B shared parameters for attention.

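
A minimal sketch of what such simple top-2 routing might look like: a linear gate scores all 16 experts for each token, and the two highest-scoring experts process it, weighted by a softmax over their scores. The expert count and top-2 selection are from the leak; the gating function itself is an illustrative assumption, not OpenAI's actual implementation.

```python
import math

N_EXPERTS = 16
TOP_K = 2

def route_token(gate_scores: list[float]) -> list[tuple[int, float]]:
    """Return the top-k expert indices with normalized mixing weights."""
    top = sorted(range(len(gate_scores)),
                 key=lambda i: gate_scores[i], reverse=True)[:TOP_K]
    exps = [math.exp(gate_scores[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Example: expert 1 scores highest, expert 3 second; both get routed to.
scores = [0.0, 5.0, 1.0, 3.0] + [0.0] * 12
print(route_token(scores))
```
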
Inference

Each forward pass inference (generation of 1 token) only utilizes ~280B parameters and ~560 TFLOPs. This contrasts with the ~1.8 trillion parameters and ~3,700 TFLOPs that would be required per forward pass of a purely dense model.

Dataset

GPT-4 is trained on ~13T tokens. These are not unique tokens; repeated epochs are counted as additional tokens.

Epoch number: 2 epochs for text-based data and 4 for code-based data.

There are millions of rows of instruction fine-tuning data from ScaleAI and internal sources.

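
The ~13T figure is the epoch-weighted total. The split of unique tokens below is a purely hypothetical example (the leak gives no split); it only illustrates how 2 text epochs and 4 code epochs combine into the total:

```python
# Hypothetical split of unique tokens -- illustrative only, not leaked data.
text_unique = 5.0e12   # hypothetical unique text tokens (2 epochs)
code_unique = 0.75e12  # hypothetical unique code tokens (4 epochs)

total_trained = 2 * text_unique + 4 * code_unique
print(f"{total_trained / 1e12:.0f}T tokens seen during training")  # 13T
```
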
GPT-4 32K

There was an 8k context length (seq len) for the pre-training phase. The 32k seq len version of GPT-4 is based on fine-tuning of the 8k version after pre-training.

Batch Size

The batch size was gradually ramped up over a number of days on the cluster, but by the end, OpenAI was using a batch size of 60 million tokens! This, of course, is "only" a batch size of 7.5 million tokens per expert, since not every expert sees all tokens.

For the real batch size

Divide this number by the seq len to get the real batch size. Just stop with these misleading numbers already.

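
For example, at the 8k pre-training sequence length (assuming 8k means 8192 tokens):

```python
# "Batch size of 60 million" counts tokens, not sequences; divide by the
# sequence length to get the batch size in sequences.
batch_tokens = 60_000_000
seq_len = 8192  # assuming the "8k" context length means 8192 tokens

batch_sequences = batch_tokens // seq_len
print(batch_sequences)  # ~7324 sequences per batch
```
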
Parallelism Strategies

To parallelize across all their A100 GPUs

They utilized 8-way tensor parallelism, as that is the limit for NVLink. Beyond that, they are using 15-way pipeline parallelism.

(They likely used ZeRO Stage 1. It is possible they used block-level FSDP.)

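
Putting the two figures together, a single model replica would span 8 × 15 = 120 GPUs, with the rest of the cluster covered by data parallelism (e.g. ZeRO Stage 1). The replica count below is our inference from the leaked numbers, not a stated figure:

```python
# One replica spans tensor-parallel x pipeline-parallel GPUs; the remaining
# GPUs of the ~25,000-GPU training cluster form data-parallel replicas.
tensor_parallel = 8
pipeline_parallel = 15
total_gpus = 25_000

gpus_per_replica = tensor_parallel * pipeline_parallel   # 120 GPUs
data_parallel_replicas = total_gpus // gpus_per_replica  # ~208 replicas
print(gpus_per_replica, data_parallel_replicas)
```
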
Training Cost

OpenAI's training FLOPs for GPT-4 is ~2.15e25,

on ~25,000 A100s for 90 to 100 days at about 32% to 36% MFU.

Part of this extremely low utilization is due to an absurd number of failures requiring restarts from checkpoints.

If their cost in the cloud was about $1 per A100 hour, the training costs for this run alone would be about $63 million.

(Today, the pre-training could be done with ~8,192 H100s in ~55 days for $21.5 million at $2 per H100 hour.)

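
These figures can be cross-checked. Assuming the A100's ~312 TFLOPs BF16 tensor-core peak and taking the midpoint of the quoted 90–100 days:

```python
# MFU = achieved FLOPs / (GPUs x wall-clock seconds x per-GPU peak).
# FLOP total, GPU count, and $1/hour are from the leak; the A100 peak
# and the 95-day midpoint are our assumptions for the check.
total_flops = 2.15e25
gpus = 25_000
days = 95                 # midpoint of the quoted 90-100 days
a100_peak = 312e12        # BF16 tensor-core peak FLOPs/s

mfu = total_flops / (gpus * days * 86_400 * a100_peak)
cost = gpus * days * 24 * 1.0   # $1 per A100-hour

print(f"MFU ~{mfu:.0%}")            # lands inside the quoted 32-36% range
print(f"cost ~${cost / 1e6:.0f}M")  # ~$57M here; ~$63M quoted at ~100 days
```
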
Mixture of Expert Tradeoffs

There are multiple MoE tradeoffs taken:

For example, MoE is incredibly difficult to deal with on inference because not every part of the model is utilized on every token generation.

This means parts may sit dormant when other parts are being used. When serving users, this really hurts utilization rates.

Researchers have shown that using 64 to 128 experts achieves better loss than 16 experts, but that's purely research.

There are multiple reasons to go with fewer experts. One reason OpenAI chose 16 experts is that more experts are difficult to generalize across many tasks. More experts can also make convergence more difficult to achieve.

With such a large training run, OpenAI instead chose to be more conservative on the number of experts.

GPT-4 Inference Cost

GPT-4 costs 3x that of the 175B parameter Davinci.

This is largely due to the larger clusters required for GPT-4 and much lower utilization achieved.

An estimate of its costs is $0.0049 per 1k tokens for 128 A100s to inference GPT-4 with 8k seq len, and $0.0021 per 1k tokens for 128 H100s to inference GPT-4 with 8k seq len. It should be noted that we assume decently high utilization and high batch sizes.

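
At an assumed $1 per A100-hour (the same rate used in the training estimate; not stated for inference), the quoted per-token price implies a cluster throughput of roughly:

```python
# Implied throughput = cluster cost rate / cost per token.
cluster_cost_per_hour = 128 * 1.0   # 128 A100s at an assumed $1/hour
cost_per_token = 0.0049 / 1000      # $0.0049 per 1k tokens (quoted)

tokens_per_hour = cluster_cost_per_hour / cost_per_token
print(f"~{tokens_per_hour / 3600:,.0f} tokens/sec implied")
```
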
Multi-Query Attention

OpenAI are using MQA (multi-query attention) just like everybody else.

Because of that, only 1 KV head is needed, and memory capacity can be significantly reduced for the KV cache. Even then, the 32k seq len GPT-4 definitely cannot run on 40GB A100s, and the 8k is capped on max batch size.

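
A rough illustration of why MQA matters here: KV-cache size scales with the number of K/V heads. The layer count and 32k sequence length are from the leak; the head count and head dimension below are illustrative guesses, not leaked values:

```python
# KV cache bytes = 2 (K and V) x layers x kv_heads x head_dim x seq x bytes.
layers = 120
n_heads = 96        # hypothetical query-head count
head_dim = 128      # hypothetical head dimension
seq_len = 32_768
bytes_per_val = 2   # FP16

def kv_cache_bytes(kv_heads: int) -> int:
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_val

mha = kv_cache_bytes(n_heads)  # one K/V head per query head
mqa = kv_cache_bytes(1)        # a single shared K/V head
print(f"MHA: {mha / 1e9:.0f} GB, MQA: {mqa / 1e9:.1f} GB per 32k sequence")
```

Under these assumed sizes the cache shrinks by the head count (96x), which is what makes long sequences feasible at all, while batch size remains memory-capped.
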
Continuous batching

OpenAI implements both variable batch sizes and continuous batching. This is done to allow some level of maximum latency as well as to optimize the inference costs.

Vision Multi-Modal

It is a separate vision encoder from the text encoder, with cross-attention. The architecture is similar to Flamingo. This adds more parameters on top of the 1.8T of GPT-4. It is fine-tuned with another ~2 trillion tokens after the text-only pre-training.

On the vision model, OpenAI wanted to train it from scratch, but it wasn't mature enough, so they wanted to derisk it by starting with text.

One of the primary purposes of this vision capability is for autonomous agents able to read web pages and transcribe what's in images and video.

Some of the data they train on is joint data (rendered LaTeX/text), screenshots of web pages, and YouTube videos: sampling frames and running Whisper on the audio to get transcripts.

Speculative Decoding

OpenAI might be using speculative decoding for GPT-4's inference. (Not 100% sure.)

The idea is to use a smaller, faster model to decode several tokens in advance, and then feed them into a large oracle model as a single batch.

If the small model was right about its predictions, the larger model agrees and we can decode several tokens in a single batch.

But if the larger model rejects the tokens predicted by the draft model, then the rest of the batch is discarded, and we continue with the larger model.

The conspiracy theory that the new GPT-4's quality has deteriorated might simply be because they are letting the oracle model accept lower-probability sequences from the speculative decoding model.

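
A minimal sketch of the accept/reject loop described above. Real speculative decoding accepts or rejects probabilistically against the oracle's output distribution; this toy version uses exact greedy agreement for simplicity, and the draft/oracle models are trivial stand-ins:

```python
from typing import Callable

def speculative_step(prefix: list[int],
                     draft: Callable[[list[int]], int],
                     oracle: Callable[[list[int]], int],
                     k: int = 4) -> list[int]:
    """One round: the draft proposes k tokens, the oracle keeps the agreeing prefix."""
    # 1) the small draft model proposes k tokens autoregressively (cheap)
    ctx = list(prefix)
    proposed = []
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)
    # 2) the oracle verifies all k positions "as a single batch"
    accepted = []
    ctx = list(prefix)
    for t in proposed:
        o = oracle(ctx)
        if o != t:              # disagreement: discard the rest of the batch
            accepted.append(o)  # ...but keep the oracle's own next token
            break
        accepted.append(t)
        ctx.append(t)
    return prefix + accepted

# Toy stand-ins: the draft continues "previous + 1"; the oracle agrees
# until it sees the value 3, then emits 100 -- so two drafted tokens are
# accepted and the third position is replaced by the oracle's token.
draft = lambda ctx: ctx[-1] + 1
oracle = lambda ctx: ctx[-1] + 1 if ctx[-1] < 3 else 100
print(speculative_step([1], draft, oracle))  # [1, 2, 3, 100]
```
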
Inference Architecture

The inference runs on a cluster of 128 GPUs.

There are multiple of these clusters in multiple datacenters in different locations.

It is done in 8-way tensor parallelism and 16-way pipeline parallelism.

Each node of 8 GPUs has only ~130B parameters, or less than 30GB per GPU at FP16 and less than 15GB at FP8/int8.

The model has 120 layers, so it fits in 15 different nodes. [Possibly there are fewer layers on the first node, since it also needs to compute the embeddings.]

According to these numbers: OpenAI should have trained on 2x the tokens if they were trying to go by Chinchilla-optimal.

[Let alone surpass it like we do.]

This goes to show that they are struggling to get high-quality data.

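
The per-GPU figures follow from spreading the ~1.8T parameters over the 15 × 8 GPU grid (the leak's ~130B per node is slightly higher, presumably reflecting how shared weights are counted per stage):

```python
# Evenly dividing total parameters across pipeline stages and the
# tensor-parallel GPUs within each stage, at 2 bytes (FP16) / 1 byte (FP8).
total_params = 1.8e12
nodes = 15          # pipeline stages
gpus_per_node = 8   # tensor-parallel width

params_per_gpu = total_params / (nodes * gpus_per_node)  # ~15B per GPU
fp16_gb = params_per_gpu * 2 / 1e9
fp8_gb = params_per_gpu * 1 / 1e9

print(f"{fp16_gb:.0f} GB at FP16, {fp8_gb:.0f} GB at FP8/int8 per GPU")
```
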
Why no FSDP?

A possible reason for this could be that some of the hardware infra they secured is of an older generation.

This is pretty common at local compute clusters, as the organisation usually upgrades the infra in several "waves" to avoid a complete pause of operation.

With such a high amount of pipeline parallelism, it is very likely that, just like the rest of us, they suffer from the "batch bubble": slight idle time between batches.

Again: There is no magic.

They know what they are doing but it is not magic.

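
The "batch bubble" can be estimated with the standard pipeline-parallelism formula: with p stages and m microbatches per batch, the idle fraction is (p − 1) / (m + p − 1). The microbatch count below is a hypothetical example, not a leaked figure:

```python
# Pipeline bubble: stages sit idle while the pipeline fills and drains.
def bubble_fraction(stages: int, microbatches: int) -> float:
    return (stages - 1) / (microbatches + stages - 1)

# 15 pipeline stages (from the leak), with e.g. 60 microbatches per batch:
print(f"{bubble_fraction(15, 60):.1%} idle")
```

More microbatches per batch shrink the bubble, which is one reason very large token batch sizes are attractive with deep pipelines.
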

OpenAI is keeping GPT-4's architecture closed not because of some existential risk to humanity, but because what they have built is replicable. In fact, we expect Google, Meta, Anthropic, Inflection, Character, Tencent, ByteDance, Baidu, and others all to have models as capable as GPT-4, if not more capable, in the near future.

Make no mistake: OpenAI's engineering is excellent, and what they have built is incredible, but the solution they arrived at is not magic. It is an elegant solution with many complex tradeoffs. Scaling up is only part of the battle. OpenAI's most durable moat is that they have the most real-world usage, leading engineering talent, and can continue to stay ahead of the competition with future models.


http://enderfga.cn/2023/07/11/gpt4/
Author: Enderfga
Published: July 11, 2023