MegaTrain: Full Precision Training of 100B+ Parameter LLMs on a Single GPU

(arxiv.org)

78 points | by chrsw 1 hour ago

7 comments

internetguy 1 hour ago
> MegaTrain stores parameters and optimizer states in host memory (CPU memory) and treats GPUs as transient compute engines. For each layer, we stream parameters in and compute gradients out, minimizing persistent device state
This is pretty awesome. The only compute I have at home is an RTX 3080 with 10 GB of VRAM, so I struggle with training larger models (>40M, 50M params). I get OOM errors and have to optimize a lot.
I have a lot more CPU RAM in my PC, and this would likely increase the size of models I can train locally.
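A rough back-of-the-envelope sketch of why parking state in host memory helps. It assumes fp32 weights, fp32 gradients, and an Adam-style optimizer keeping two extra fp32 values per parameter; these are assumptions for illustration, not figures from the paper. Activations are ignored, and the function name and defaults are hypothetical.

```python
def training_state_bytes(n_params, bytes_per_value=4, optimizer_states=2):
    """Bytes of persistent state: weights + gradients + optimizer states.

    Assumes every value is stored at the same width (fp32 by default) and an
    Adam-style optimizer keeping two extra values per parameter. Activations
    and framework overhead are deliberately ignored.
    """
    values_per_param = 1 + 1 + optimizer_states  # weight, gradient, optimizer
    return n_params * values_per_param * bytes_per_value


GB = 10**9  # decimal gigabytes

for label, n in [("50M", 50 * 10**6), ("1B", 10**9), ("100B", 100 * 10**9)]:
    print(f"{label:>5} params: {training_state_bytes(n) / GB:,.1f} GB")
```

Under these assumptions a 1B-parameter model already needs about 16 GB of state, more than a 10 GB card holds, while 100B parameters need roughly 1.6 TB.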
- weitendorf 12 minutes ago
  To make the most of these architectures I think the key is essentially moving more of the knowledge/capabilities out of the "weights" and into the complementary parts of the system in a way that's proportionate to the capabilities of the hardware.
  In the past couple of months there's been a kind of explosion in small models that are occupying a niche in this kind of AI-transcoding space. What I'm hoping we're right on the cusp of achieving is a similar explosion in what I'd call tool-adaptation, where an LLM paired with some mostly-fixed suite of tools and problem cases can trade off some generality for a specialized (potentially hyper-specialized to the company or user) role.
  The thing about more transcoding-related tasks is that they in general stay in sync with what the user of the device is actively doing, which will also typically be closely aligned with the capabilities of the user's hardware and what they want to do with their computer. Most people aren't being intentional about this kind of stuff right now, partly out of habit I think, because only just now does it make sense to think of personal computers as "stranded hardware", now that they can be steered/programmed somewhat autonomously.
  I'm wondering if with the right approach to MoE on local devices (which local LLMs are heading towards) we could basically amortize the expensive hit from loading weights in and out of VRAM through some kind of extreme batch use case that users still find useful enough to be worth the latency. LoRA is already really useful for this, but obviously sometimes you need more expertise/specialization than just a few layers' difference. Experimenting with this right now. It's the same basic principle as in the paper, except less of a technical optimization and more of a workload optimization. Also it's literally the beginning of machine culture, so that's kind of cool.
- giancarlostoro 6 minutes ago
  > This is pretty awesome. The only compute I have at home is an RTX 3080 with 10 GB of VRAM, so I struggle with training larger models (>40M, 50M params). I get OOM errors and have to optimize a lot.
  I'm in the same boat. It's intimidating enough that I'm not sure I even want to bother training anything at all. Do you mind sharing what kind of training you've done with that GPU? :)
1aurent29 20 minutes ago
Sounds very similar to https://docs.pytorch.org/docs/stable/distributed.fsdp.fully_... I wonder how much of this could be replicated using only this PyTorch primitive.
WithinReason 47 minutes ago
I was wondering how well this would work :) You can definitely push this further, the question is: how well can the gradients and updates compress?
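One common way to probe that question is top-k sparsification: transfer only the k largest-magnitude gradient entries and carry the rest forward as a residual that gets added to the next step's gradient. The minimal pure-Python sketch below shows the idea. It is a generic technique, not something the paper is stated to use, and `topk_sparsify` is a hypothetical helper name.

```python
def topk_sparsify(grad, k):
    """Keep the k largest-magnitude entries of grad.

    Returns (kept, residual): kept maps index -> value for the entries that
    would be transferred; residual holds what was dropped, so it can be added
    to the next step's gradient (error feedback).
    """
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    kept = {i: grad[i] for i in order[:k]}
    residual = [0.0 if i in kept else g for i, g in enumerate(grad)]
    return kept, residual


grad = [0.1, -2.0, 0.03, 1.5]
kept, residual = topk_sparsify(grad, k=2)
print(kept)      # entries that would cross the bus
print(residual)  # dropped mass, carried to the next step
```

Whether training tolerates this depends on how much of the gradient's mass sits in a few entries, which is the empirical part of the question.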
olliepro 1 hour ago
This would likely only get used for small finetuning jobs. It’s too slow for the scale of pretraining.
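A rough way to put numbers on this: assume the weights cross the host-to-GPU link once for the forward pass and once for the backward pass of every step, over a sustained 25 GB/s link. Both are illustrative assumptions, not measurements from the paper. The function name is hypothetical, and any overlap of transfer with compute is ignored.

```python
def weight_stream_seconds(n_params, bytes_per_param=4, link_gb_per_s=25.0, passes=2):
    """Lower bound on per-step time spent just moving weights to the GPU.

    passes=2 assumes each layer's weights are streamed once for forward and
    once for backward. Gradient traffic back to the host is not counted.
    """
    total_bytes = n_params * bytes_per_param * passes
    return total_bytes / (link_gb_per_s * 10**9)


for label, n in [("7B", 7 * 10**9), ("100B", 100 * 10**9)]:
    print(f"{label:>5}: {weight_stream_seconds(n):.1f} s per step in weight transfer")
```

Under these assumptions a 100B model spends about 32 s per step on weight traffic alone, so throughput hinges on how many tokens each step processes.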
- onion2k 41 minutes ago
  > It’s too slow for the scale of pretraining.
  There isn't really such a thing as 'too slow' as an objective fact though. It depends on how much patience and money for electricity you have. In AI image gen circles I see people complaining if a model takes more than 5s to generate an image, and other people on very limited hardware who happily wait half an hour per image. It's hard to make a judgement call about what 'too slow' means. It's quite subjective.
  - jandrese 35 minutes ago
    If training would take so long that the model is obsolete before it finishes, that might be considered too slow. With ML you can definitely hit a point where it is too slow for any practical purpose.
    - ismailmaj 23 minutes ago
      Obsolete because of what? Because with limited hardware you’re never aiming for state of the art, and for fine-tuning, you don’t steer for too long anyway.
      - jandrese 12 minutes ago
        Because there is a new model that is better, faster, more refined, etc.
        If your training time is measured in years or decades it probably won't be practical.
- greenavocado 32 minutes ago
  So distribute copies of the model in RAM to multiple machines, have each machine update different parts of the model weights, and sync updates over the network
l1n 54 minutes ago
Seems similar to Microsoft DeepSpeed.