Don't rent the cloud, own instead

(blog.comma.ai)

162 points | by Torq_boi 2 hours ago

22 comments

  • speedgoose 23 minutes ago
    I would suggest using both on-premise hardware and cloud computing, which is probably what comma is doing.

    For critical infrastructure, I would rather pay a competent cloud provider than be responsible for reliability issues myself. Maintaining one server room in the headquarters is something, but two server rooms in different locations, with resilient power and network, is a bit too much effort IMHO.

    For running many slurm jobs on good servers, cloud computing is very expensive and owning the hardware can pay for itself in a matter of months. And who cares if the server room is a total loss after a while; worst case, you write some more YAML and Terraform and deploy a temporary replacement in the cloud.

    An in-between option is colocation, where you put hardware you own in a managed data center. It's a bit old-fashioned, but it may make sense in some cases.

    I can also mention that research HPCs may be worth considering. In research, we have some of the world's fastest computers at a fraction of the cost of cloud computing. It's great as long as you don't mind not being root and having to use slurm.

    I don't know about the USA, but in Norway you can run your private company's slurm AI workloads on research HPCs, though you will pay quite a bit more than universities and research institutions do. You can also set up research projects together with universities or research institutions, and everyone will be happy if your business benefits a lot from the collaboration.

  • pu_pe 9 minutes ago
    > Self-reliance is great, but there are other benefits to running your own compute. It inspires good engineering.

    It's easy to inspire people when you have great engineers in the first place. That's a given at a place like comma.ai, but there are many companies out there where administering a datacenter is far beyond their core competencies.

    I feel like skilled engineers have a hard time understanding the trade-offs that cloud companies offer. The same way comma.ai likely doesn't run an in-house canteen, it can make sense to focus on what you are good at and outsource the rest.

  • jillesvangurp 52 minutes ago
    At scale (like comma.ai), it's probably cheaper. But until then it's a long-term cost optimization with really high upfront capital expenditure and risk, which means it doesn't make much sense for the majority of startups until they reach late stage and hosting actually becomes a big cost burden.

    There are in-between solutions. Renting bare metal instead of renting virtual machines can be quite nice; I've done that via Hetzner some years ago. You pay about the same, but you get a lot more performance for the money. This is great if you actually need that performance.

    People obsess about hardware, but there's also the software side to consider. For smaller companies, operations/devops people are usually more expensive than the resources they manage; the staffing cost is the one to optimize, and the hosting cost is usually a rounding error next to it. On top of that, your responsibilities grow as soon as you own the hardware: you need to service it, monitor it, replace it when it fails, make sure those fans don't get jammed by dust puppies, deal with outages when they happen, etc. All the stuff that you pay cloud providers to do for you now becomes your problem, and it has a non-zero cost.

    The right mindset for hosting cost is to think of it in FTEs (the cost of one full-time employee for a year). If it's below 1 (most startups until they are well into scale-up territory), you are doing great. Most of the optimizations you could make will cost you actual FTEs of work, and 1 FTE pays for quite a bit of hosting: think 10K per month in AWS cost. A good ops person/developer is more expensive than that. My company runs at about 1K per month (GCP and misc managed services). It would be the wrong thing to optimize for us; it's not worth spending any amount of time on for me. I literally have more valuable things to do.

    This flips when the hosting alone starts costing you multiple FTEs. At that point you probably have an additional 5-10 FTEs in staffing anyway to babysit all of that, so now you can talk about trading off some hosting FTEs for a modest amount of extra staffing FTEs and make net gains.
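
    A rough back-of-the-envelope version of that framing (all numbers are illustrative assumptions, not anyone's real costs):

        # Back-of-the-envelope: is hosting worth optimizing yet?
        # All numbers are illustrative assumptions.
        FTE_COST_PER_YEAR = 150_000       # loaded cost of one ops/dev FTE (assumed)
        CLOUD_COST_PER_MONTH = 10_000     # current managed-cloud bill (assumed)

        hosting_in_ftes = CLOUD_COST_PER_MONTH * 12 / FTE_COST_PER_YEAR
        print(f"Hosting costs {hosting_in_ftes:.2f} FTEs per year")

        # Rule of thumb from above: below ~1 FTE, optimizing hosting costs
        # more in staff time than it saves; well above it, trading hosting
        # spend for a modest amount of extra staffing starts to pay off.
        if hosting_in_ftes < 1:
            print("Probably not worth an engineer's time to optimize")
        else:
            print("Worth pricing out bare metal / on-prem options")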

    • lelanthran 18 minutes ago
      Your calculation assumes that an FTE is needed to maintain a few beefy servers.

      Once they are up and running, that employee is spending at most a few hours a month on them. Maybe even a few hours every six months.

      OTOH you are specifically ignoring that you'll need roughly the same amount of time from a cloud-trained person if you're all-in on AWS.

      I expect the marginal staffing cost of one approach over the other is zero.

    • ashu1461 24 minutes ago
      And not just any FTEs: probably a few senior/staff-level engineers, who would cost a lot more.
    • g-b-r 28 minutes ago
      You should keep in mind that for a lot of things you can use a servicing contract, rather than hiring full-time employees.

      It's typically going to cost significantly less; it can make a lot of sense for small companies, especially.

  • simianwords 1 hour ago
    The reason companies don't go with on-premises even if cloud is way more expensive is the risk involved in on-premises.

    You can see quite clearly here how many steps are involved. Now a good company would concentrate risk on their differentiating factor or the specific part they have competitive advantage in.

    It's never about "is the expected cost of on-premises less than cloud"; it's about the risk-adjusted costs.

    Once you've spread risk across not only your main product but also your infrastructure, things get hard.

    I would be wary of a smallish company building their own Jira in-house in a similar way.

    • fauigerzigerk 28 minutes ago
      I'm starting to wonder though whether companies even have the in-house competence to compare the options and price this risk correctly.

      >Now a good company would concentrate risk on their differentiating factor or the specific part they have competitive advantage in.

      Yes, but one differentiating factor is always price and you don't want to lose all your margins to some infrastructure provider.

      • simianwords 13 minutes ago
        Software companies have higher margins, so these decisions are lower stakes. Unless on-premises helps the bottom line of the main product the company provides, these decisions don't really matter in my opinion.

        Think of a ~5000 employee startup. Two scenarios:

        1. if they win the market, they capture something like ~60% margin

        2. if that doesn't happen, they just lose; the VC funding runs out and the company folds

        In this dynamic, costs associated with infrastructure don't change the bottom line of profitability, but the risk involved in rolling out their own infrastructure can threaten the main product's very existence.

    • d1sxeyes 1 hour ago
      It’s also opex vs capex, which is a battle opex wins most of the time.
      • bayindirh 41 minutes ago
        Opex is faster: log in, click, SSH, get a tea.

        Capex needs work. A couple of years, at least.

        If you are willing to put in the work, your mundane computer is always better than the shiny one you don't own.

      • simianwords 1 hour ago
        I think it wins because opex is seen as a stable recurring cost, while capex is seen as money you put into your primary differentiation for long-term gains.
        • d1sxeyes 1 hour ago
          True, but for a lot of companies “our servers are on-prem” is not a primary differentiator.
          • simianwords 37 minutes ago
            I think we are saying the same thing?
        • TonyStr 1 hour ago
          Capex may also require you to take out loans
  • satvikpendem 24 minutes ago
    I just read about Railway doing something similar. Sadly their prices are still high compared to other bare-metal providers and even a VPS such as Hetzner with Dokploy: a very similar feature set, yet for the same 5 dollars you get way more CPU, storage and RAM.

    https://blog.railway.com/p/launch-week-02-welcome

  • durakot 39 minutes ago
    There's the HN I know and love
  • sys42590 1 hour ago
    It would be interesting to hear their contingency plan for any kind of disaster (most commonly a fire) that hits their data center.
    • sschueller 1 hour ago
      • AndroTux 22 minutes ago
        Contingency plan: don't build your data center out of wood.
    • fpoling 1 hour ago
      They use the datacenter for model training, not to serve online users. Presumably even if it is offline for a week or even a month, it will not be a total disaster, as long as they have, for example, offsite tape backups.
    • instagib 1 hour ago
      Flooding due to a burst frozen pipe, a false sprinkler trigger, or many other causes.

      Something very similar happened at work. Water valve monitoring wasn't up yet. Fire didn't respond, because reasons. A huge amount of water flooded in over a 3-day weekend. Total loss.

    • twelvechairs 1 hour ago
      There's only one solution to this problem, and it's two data centres in some shape or form.
      • mbreese 1 hour ago
        What's the line from Contact?

        "Why build one when you can have two at twice the price?"

        But if you're building a datacenter for $5M, spending $10-15M for redundant datacenters (even with extra networking costs) would still be cheaper than their estimated $25M cloud costs.

        • golem14 46 minutes ago
          Or build two 2.5MM DCs (if you can parallelize your workload well enough), and in case of disaster you only lose capacity.

          You do, however, need to plan for 1MM+ p.a. in OPEX, because good SREs ain't cheap (nor are the hardware folks building and maintaining the machines).

  • danpalmer 1 hour ago
    > Cloud companies generally make onboarding very easy, and offboarding very difficult.

    I reckon most on-prem deployments have significantly worse offboarding than the cloud providers. As a cloud provider you can win business by offering something for offboarding, but internally you'd never get buy-in to spend on an exit plan in case you later decide to move to the cloud.

    • lelanthran 15 minutes ago
      > As a cloud provider you can win business by having something for offboarding, but internally you'd never get buy-in to spend on a backup plan if you decide to move to the cloud.

      It's the other way around. How do you think all those businesses moved to the cloud in the first place?

  • intalentive 56 minutes ago
    I like Hotz’s style: simply and straightforwardly attempting the difficult and complex. I always get the impression: “You don’t need to be too fancy or clever. You don’t need permission or credentials. You just need to go out and do the thing. What are you waiting for?”
    • tirant 39 minutes ago
      This was written by Harald Schäfer, the CTO of comma.ai. I'm not so sure if G. Hotz is still involved in comma.ai.
  • cgsmith 1 hour ago
    I used to colocate a 2U server that I purchased with a local data center. It was a great learning experience for me. I'm curious why a company wouldn't colocate their own hardware; proximity isn't an issue when you can have the datacenter perform physical tasks. Bravo to the comma team regardless. It'll be a great learning experience and make each person on their team better.

    PS: BX cable instead of conduit for the electrical looks cringe.

    • vidarh 30 minutes ago
      The main reason not to colocate is if you're somewhere with high real estate costs. E.g. Hetzner's managed servers compete on price with colocation for me, because I'm in London.
  • kaon_2 12 minutes ago
    Am I the only one who is simply scared of running their own cloud? What happens if your administrator credentials get leaked? At least with Azure I can phone Microsoft and initiate a recovery; because of backups and soft-deletion policies, quite a lot is possible. I guess you can build in these failsafe scenarios locally too? But what if a fire happens, like the one in South Korea? Sure, most companies face more immediate risks such as going bankrupt, but at least the cloud relieves me of the stuff of nightmares.

    Except now I have nightmares that the USA will enforce the Patriot Act and force Microsoft to hand over all the data in its European data centers, and then we have to migrate everything to a local cloud provider. Argh...

    • vachina 10 minutes ago
      Then literally own the cloud, like run the hardware on-prem yourself.
  • pja 26 minutes ago
    I’m impressed that San Diego electrical power manages to be even more expensive than in the UK. That takes some doing.
  • tirant 38 minutes ago
    Well, their comment section is for sure not running on-premises, but in the cloud:

    "An error occurred: API rate limit already exceeded for installation ID 73591946."

  • hbogert 1 hour ago
    Datacenters need cool, dry air at <45% humidity?

    No, low isn't good per se. I worked in a datacenter which in winter had less than 40% humidity, and RAM was failing all over the place. Low humidity causes static electricity.

    • mbreese 1 hour ago
      Low is good if you are also adding humidity back in. If you want to maintain 45-50% (guessing), then you would want <45% environmental humidity so that you can raise it to the level you want. You're right about avoiding static, but you'd still want to keep it somewhat consistent.

      It is much cheaper to use external air for cooling if you can.
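
      A tiny sketch of that logic (the exact 45-50% band is my assumption from the guess above, not a real spec):

          # Naive free-air cooling decision; the 45-50% RH target band is assumed.
          TARGET_LOW, TARGET_HIGH = 45.0, 50.0

          def plan(outside_rh: float) -> str:
              """What to do with outside air at a given relative humidity."""
              if outside_rh < TARGET_LOW:
                  return "use outside air, humidify up to target (cheap)"
              if outside_rh <= TARGET_HIGH:
                  return "use outside air as-is"
              return "dehumidify or fall back to closed-loop cooling (expensive)"

          print(plan(30.0))  # dry winter air: free cooling plus a humidifier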

      • hbogert 46 minutes ago
        Yeah, but the article makes it sound as if lower is better, which it definitely isn't. And yeah, you need to control humidity; that might mean sometimes lowering it and sometimes increasing it, with whatever solution you have.

        Also, this is where cutting corners does result in lower cost, which was the point of the OP to begin with. It just means you won't get as good a datacenter as people who tune this all day and have decades of experience.

  • kavalg 38 minutes ago
    This was one of the coolest job ads that I've ever read :). Congrats for what you have done with your infrastructure, team and product!
  • comrade1234 1 hour ago
    15 years ago or so, a spreadsheet was floating around where you could enter server costs, compute power, etc. and it would tell you when you would break even by buying instead of going with AWS. I think it was leaked from Amazon, because it was always three years to break-even, even as hardware changed over time.
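
    Something like this, presumably; a minimal reconstruction with made-up numbers, since the original spreadsheet's inputs are unknown:

        # Minimal break-even sketch: months until buying beats renting.
        # All inputs are made-up placeholders, not the original spreadsheet's.
        server_purchase = 40_000            # upfront hardware cost
        colo_cost_per_month = 600           # power, space, bandwidth, remote hands
        aws_equivalent_per_month = 2_000    # on-demand cost of comparable capacity

        monthly_saving = aws_equivalent_per_month - colo_cost_per_month
        months_to_break_even = server_purchase / monthly_saving
        print(f"Break-even after {months_to_break_even:.1f} months")
        # With these placeholders: 40000 / 1400 is roughly 28.6 months, i.e.
        # in the ballpark of the "always about three years" the spreadsheet
        # reportedly produced.
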
    • TonyStr 1 hour ago
      Azure provides their own "Total Cost of Ownership" calculator for this purpose [0]. Notably, it makes you estimate peripheral costs such as the cost of a server administrator, electricity, etc.

      [0] - https://azure-int.microsoft.com/en-us/pricing/tco/calculator...

      • Symbiote 31 minutes ago
        I plugged in our own numbers (60 servers we own in a data centre we rent) and Microsoft thinks this costs us an order of magnitude more than it does.

        Their "assumption" for hardware purchase prices seems way off compared to what we buy from Dell or HP.

        It's interesting that the "IT labour" cost they estimate is $140k for DIY, and $120k for Azure.

        Their claimed saving is 5 times more than what we actually spend...

    • vidarh 27 minutes ago
      If you buy, maybe. Leasing or renting tends to be cheaper from day one. Tack on migration costs and ca. 6 months is a more realistic target. If the spreadsheet always said 3 years, it sounds like an intentional "leak".
    • g-b-r 43 minutes ago
      Did the AWS part include the egress costs to extract your data from AWS, if you ever want to leave them?
    • Onavo 1 hour ago
      Well, somebody should recreate it. I smell a potential startup idea somewhere. There's a ton of "cloud cost optimizer" software, but most of it involves tweaking AWS knobs and taking a cut of the savings. A startup that could offload non-critical services from AWS to colo and traditional bare-metal hosting like Hetzner has a strong future.

      One thing to keep in mind is that the GPU depreciation curve (in the last 5 years at least) is a little steeper than 3 years. Current estimates are that the capital value plunges dramatically around the third year. For a top-tier H100, depreciation kicks in around the third year, but for less capable cards like the A100 the depreciation is even worse.

      https://www.silicondata.com/use-cases/h100-gpu-depreciation/
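
      To illustrate what a steeper-than-linear curve does to the math (the residual values below are my own illustrative assumptions, not numbers from the link):

          # Illustrative GPU depreciation: value drops sharply around year 3.
          PURCHASE_PRICE = 30_000  # hypothetical high-end GPU
          # Assumed residual value as a fraction of purchase price, by year:
          residual = {0: 1.00, 1: 0.75, 2: 0.55, 3: 0.30, 4: 0.15, 5: 0.10}

          for year in range(1, 6):
              value = residual[year] * PURCHASE_PRICE
              lost = (residual[year - 1] - residual[year]) * PURCHASE_PRICE
              print(f"Year {year}: value {value:,.0f}, lost that year {lost:,.0f}")
          # The steep drop around year 3 is what matters for break-even: if the
          # card has already paid for itself by then, the plunge in resale value
          # is mostly irrelevant.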

      Now, this is not factoring in the cost of labour. Labor at SF wages is dreadfully expensive; if your data center is right across the border in Tijuana, on the other hand...

  • langarus 1 hour ago
    This is a great solution for a very specific type of team but I think most companies with consistent GPU workloads will still just rent dedicated servers and call it a day.
    • hyperbovine 1 hour ago
      I agree, and cloud compute is poised to become even more commoditized in the coming years (a gazillion new data centers + AI plateauing + efficiency gains; the writing is on the wall). There's no way this makes sense for most companies.
      • NitpickLawyer 1 hour ago
        > AI plateauing

        Ummm is that plateauing with us in the room?

        The advantage of renting vs. owning is that you can always get the latest gen, which brings you newer capabilities (i.e. fp8, fp4, etc.) and cheaper prices for current_gen-1. But betting on something plateauing when all the signs point towards the exact opposite is not one of the bets I'd make.

    • ocdtrekkie 1 hour ago
      It's the opposite. The more consistent your workload the more practical and cost-effective it is to go on-prem.

      Cloud excels for bursty or unpredictable workloads where quickly scaling up and down can save you money.

  • Semaphor 1 hour ago
    In case anyone from comma.ai reads this: the "CTO @ comma.ai" link at the end is broken; it's relative instead of absolute.
    • croisillon 1 hour ago
      no because it's on premise you see? you don't need to access the world wide web, just their server

      /s

  • rvz 18 minutes ago
    Not long ago Railway moved from GCP to their own infrastructure, since GCP was very expensive for them. [0] Some go for an Oxide rack [1], a full-stack solution (both hardware and software) for intense GPU workloads, instead of building it themselves.

    It's very expensive and only makes sense if you really need infrastructure sovereignty. It makes more sense if you're profitable in the tens of millions after raising hundreds of millions.

    It also makes sense for governments (including those in the EU), which should think about this and keep compute in-house and disconnected from the internet if they are serious about infrastructure sovereignty, rather than depending on US-based providers such as AWS.

    [0] https://blog.railway.com/p/data-center-build-part-one

    [1] https://oxide.computer/
