The model weights (the thing being updated by the training process) stay loaded in GPU memory during training (the slow part). It can be useful to serialize the model weights to disk when checkpointing or when training completes, but that's a drop in the bucket compared to the rest of the time spent training.
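The checkpoint step itself is just a serialize-to-disk call. A toy stdlib sketch (real training code would use the framework's own serializer, e.g. `torch.save`; the dict of lists here is a stand-in for actual tensors):

```python
import os
import pickle
import tempfile

def save_checkpoint(weights, path):
    """Serialize model weights to disk; a one-off cost next to training."""
    with open(path, "wb") as f:
        pickle.dump(weights, f)

def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# Toy "weights": in practice these are large tensors held in GPU memory.
weights = {"layer1": [0.1, 0.2], "layer2": [0.3]}
path = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
save_checkpoint(weights, path)
assert load_checkpoint(path) == weights
```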
Today’s managed languages are very fast. For example, if Java is not fast enough for your HFT algorithm, then neither is C++, nor even generic CPUs! You have to go the custom-chip route then. Where there is a significant difference between these categories is memory usage and predictability of performance. (In other applications, e.g. video codecs, you have to write assembly by hand in the hot loops, since there even low-level languages are not low level enough.) Since these concerns do not apply to compilers, I don’t think a significant performance difference would be observable between, say, a Java and a Zig implementation of a given compiler.
We've also been running airflow for the past 2-3 years at a similar scale (~5000 dags, 100k+ task executions daily) for our data platform. We weren't aware of a great alternative when we started. Our DAGs are all config-driven which populate a few different templates (e.g. ingestion = ingest > validate > publish > scrub PII > publish) so we really don't need all the flexibility that airflow provides. We have had SO many headaches operating airflow over the years, and each time we invest in fixing the issue I feel more and more entrenched. We've hit scaling issues at the k8s level, scheduling overhead in airflow, random race conditions deep in the airflow code, etc. Considering we have a pretty simplified DAG structure, I wish we had gone with a simpler, more robust/scalable solution (even if just rolling our own scheduler) for our specific needs.
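A config-driven setup like the one described can be very small. This is an illustrative sketch (the template names and the `build_dag` helper are made up, not the commenter's actual code):

```python
# Each workflow picks a template; the template fixes the task sequence,
# so a new pipeline is just a new config entry, not new DAG code.
TEMPLATES = {
    "ingestion": ["ingest", "validate", "publish", "scrub_pii", "publish_clean"],
}

def build_dag(config):
    """Expand a small config dict into a concrete, ordered task list."""
    steps = TEMPLATES[config["template"]]
    return [f"{config['name']}.{step}" for step in steps]

dag = build_dag({"name": "orders", "template": "ingestion"})
# → ['orders.ingest', 'orders.validate', 'orders.publish',
#    'orders.scrub_pii', 'orders.publish_clean']
```

When every DAG is an instance of a handful of fixed shapes like this, most of a general-purpose orchestrator's flexibility goes unused.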
Upgrades have been an absolute nightmare and so disruptive. The scalability improvements in airflow 2 were a boon for our runtimes since before we would often have 5-15 minutes of overhead between task scheduling, but man it was a bear of an upgrade. We've since tried multiple times to upgrade past the 2.0 release and hit issues every time, so we are just done with it. We'll stay at 2.0 until we eventually move off airflow altogether.
I stood up a prefect deployment for a hackathon and I found that it solved a ton of the issues with airflow (sane deployment options, not the insane file-based polling that airflow does). We looked into it about a year ago; I haven't heard a lot about it lately, so I wonder if anyone has had success with it at scale.
If your team is comfortable writing pure python and you're familiar with the concept of a makefile, you might find Luigi a much lighter and less opinionated alternative for your workflows.
Luigi doesn't force you into using a central orchestrator for executing and tracking the workflows. Tracking and updating task state are open functions left for the programmer to fill in.
It's probably geared toward more expert programmers who work close to the metal and who care less about GUIs than about a high degree of control and flexibility.
It's one of those frameworks where the code that is not written is sort of a killer feature in itself. But definitely not for everyone.
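The make-like model Luigi uses (a task declares its requirements and its output, and is skipped when the output already exists) can be sketched in plain Python. This illustrates the idea only; it is not Luigi's actual API:

```python
import os
import tempfile

class Task:
    """Minimal make-style task: run only if the output file is missing."""
    def requires(self):
        return []
    def output(self):
        raise NotImplementedError
    def run(self):
        raise NotImplementedError
    def complete(self):
        return os.path.exists(self.output())

def build(task):
    """Depth-first: satisfy requirements, then run the task if needed."""
    for dep in task.requires():
        build(dep)
    if not task.complete():
        task.run()

workdir = tempfile.mkdtemp()

class Extract(Task):
    def output(self):
        return os.path.join(workdir, "raw.csv")
    def run(self):
        with open(self.output(), "w") as f:
            f.write("id,value\n1,42\n")

class Transform(Task):
    def requires(self):
        return [Extract()]
    def output(self):
        return os.path.join(workdir, "clean.csv")
    def run(self):
        with open(Extract().output()) as src, open(self.output(), "w") as dst:
            dst.write(src.read().upper())

build(Transform())  # runs Extract first, then Transform; re-running is a no-op
```

In real Luigi, `output()` returns a `Target` and a scheduler performs this walk, but the core state-tracking question is the same "does the output already exist?" check, which is exactly what makes it feel like a makefile.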
Really interesting to see a bioinformatics tool be proposed. I've worked in bioinformatics for over 20 years, written several workflow systems for execution on compute clusters, used several other people's, and been underwhelmed by most. I was hoping that AirFlow might be better, since it was written by real software engineers rather than people who do systems design as a means to their ends, but AirFlow was completely underwhelming.
The other orchestrator besides Toil to check out is Cromwell, but that uses WDL instead of Python for defining the DAG, and it's not a super powerful language, even if it hits exactly the needs for 99% of uses and does exactly the right sort of environment containment.
I'm also hugely underwhelmed by k8s and Mesos and all those "cloud" allocation schemes. I think that a big, dynamically sized Slurm cluster would probably serve a lot of people far better.
I did a proof of concept in luigi pretty early on and really liked it. Our main concerns were that we would have needed to bolt on a lot of extra functionality to make it easy to re-run workflows or specific steps in the workflows when necessary (manual intervention is unavoidable IME). The fact that airflow also had a functional UI out of the box made it hard to justify luigi when we were just getting off the ground.
Very similar experience to yours. Adopted Airflow about 3 years ago. Was aware of Prefect but it seemed a bit immature at the time. Checked back in on it recently and they were approaching alpha for what looked like a pretty substantial rewrite (now in beta). Maybe once the dust has settled from that I'll give it another look.
creator of prefect was an early major airflow committer. anyone know what motivated the substantial rewrite of prefect? i had assumed original version of prefect was already supposed to fix some design issues in airflow?
I'm a heavy Prefect user and was also very confused about the initial rewrite, even after reading several summaries. My best advice is to just try using 2.0 (Orion). Here's how I'd summarize the difference:
Prefect 1.0 feels like second-gen Airflow--less boilerplate, easy dynamic DAGs, better execution defaults, great local dev, etc etc. It's more sane but you still feel the impedance mismatch from working with an orchestrator.
Prefect 2.0 is a first-principles rewrite that removes most of the friction from interacting with an orchestrator in the first place. Finally, your code can breathe.
Yes, the original stack 'Prefect' was written to address issues in airflow.
In Prefect 1.0 the DAG was built using decorators inside a context manager, which was pretty cool and worked well, but they moved to DAG-generation-as-code in Orion.
Prefect is very cleanly written, well designed and flexible. IMHO it is a platform that will be the next big thing in the area.
How do I know? I deployed prefect as a static config gathering system across 4000 servers, both Linux and Windows. No other software stack came close, since one of the core concepts of prefect is 'expect to fail'. Things like Ansible Tower die really quickly with large clusters, due to the normal number of failures and the incorrect assumption that most things will work (an assumption you can get away with on a small cluster).
I wish I got to use it in my current work but there is no use case. Yet.
I had many thousands of machines. I needed to collect disk size, ram, software inventory, some custom config, if present. Some machines are Linux, some windows.
With prefect I created a task 'collect machine details for windows', another 'collect machine details for Linux', another 'collect software inventory'.
I have a list of machines in a database so I create a task to get them. That task is an sqlalchemy query so I can pass the task a filter.
I get a list of linux machines and pass that to a task to run.
I get a list of windows machines and pass that to a task.
Note that the above don't depend on each other.
I have a task that filters good results from bad.
I have another task that writes a list to a database.
Other tasks have credentials.
Another task puts errors into an error table; the machines that failed get filtered from the results and fed into this task.
I plumb the above up with a Prefect flow and it builds a DAG that runs the flow.
Everything that can be run in parallel does so, everything that has some other input waits for the input.
Tasks that fail can be retried by Prefect automatically. Intermediate results are cached. And I get a nice GUI for everything; I can even schedule the flow in the GUI.
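The shape of that flow, fan out the per-OS collection in parallel and then filter and write, can be sketched with stdlib concurrency. The machine names, collectors, and failure case below are invented stand-ins; the real version uses Prefect tasks and a database:

```python
from concurrent.futures import ThreadPoolExecutor

def get_machines(os_filter):
    # Stand-in for the SQLAlchemy query task; takes a filter, returns hosts.
    inventory = {"linux": ["lx01", "lx02"], "windows": ["win01"]}
    return inventory[os_filter]

def collect_details(host):
    # Stand-in for the per-OS collection tasks; some hosts will fail,
    # and the flow is built expecting that.
    if host == "lx02":
        return {"host": host, "error": "unreachable"}
    return {"host": host, "ram_gb": 16}

def split_results(results):
    # The "filter good results from bad" task.
    good = [r for r in results if "error" not in r]
    bad = [r for r in results if "error" in r]
    return good, bad

with ThreadPoolExecutor() as pool:
    # Linux and Windows collection don't depend on each other, so they
    # run concurrently; downstream tasks wait on their inputs.
    linux = pool.map(collect_details, get_machines("linux"))
    windows = pool.map(collect_details, get_machines("windows"))
    good, bad = split_results(list(linux) + list(windows))

# good → rows for the results table; bad → rows for the error table
```

In Prefect the same structure falls out of wiring task outputs to task inputs: anything without a data dependency parallelizes for free, and failed hosts flow into the error-table task instead of aborting the run.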
It's a good question. I believe airflow was probably the right choice at the time we started. We were a small team, and deploying airflow was a major shortcut that more or less handled orchestration so we could focus on other problems. With the aid of hindsight, we would have been better off spinning off our own scheduler some time in the first year of the project. Like I mentioned in my OP, we have a set of well-defined workflows that are just templatized for different jobs. A custom-built orchestration system that could perform those steps in sequence and trigger downstream workflows would not be that complicated. But this is how software engineering goes, sometimes you take on tech debt and it can be hard to know when it's time to pay it off. We did eventually get to a stable steady state, but with lots of hair pulling along the way.
Can dbt run arbitrary code? If it can, it's not well advertised in the documentation. Every time I've looked into dbt, I found that it's mostly a scheduled SQL runner.
The primary reason we run Airflow is because it can execute Python code natively, or other programs via Bash. It's very rare that a DAG I write is entirely SQL-based.
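For example, a typical task body mixes native Python with a shelled-out step. This is a generic sketch of that pattern, not a specific DAG:

```python
import subprocess

def transform(rows):
    # Native Python step: anything the language can do.
    return [r * 2 for r in rows]

def run_bash(cmd):
    # Shell step, the kind of thing an Airflow BashOperator wraps.
    return subprocess.run(
        ["bash", "-c", cmd], capture_output=True, text=True, check=True
    ).stdout.strip()

print(transform([1, 2, 3]))   # [2, 4, 6]
print(run_bash("echo done"))  # done
```

SQL-only tools can't express either half of this, which is the gap being described.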
You’re right. I think the strength of dbt is in the T part of ELT. I wrote ELT to make a distinction in principle from the traditional ETL. (E)xtract and (L)oad is the data ingestion phase that would probably be better served by Dagster, where you could use Python.
(T)ransform is decoupled and would be served by set-based operations managed by dbt.
parquet is great but it's not particularly easy to read or write. the libraries that exist to work with it are few and far between, and those that do exist either have a hundred dependencies or depend on native code (e.g. libarrow). certainly an important dimension of an ideal file format should be the ease of parsing/writing it, and parquet gets an extremely low score on that front imo
Parquet is also column-major which is great for many use cases, but bad for others, where row-major is more useful. For example, if you want to get just the first x rows.
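The difference is easy to see in plain Python: with a columnar layout, "the first x rows" means touching every column and reassembling the rows. A toy illustration of the two layouts, not Parquet itself:

```python
# Row-major: rows are contiguous, so the first 2 rows are one slice.
rows = [(1, "a", 0.5), (2, "b", 0.7), (3, "c", 0.9)]
first_two_row_major = rows[:2]

# Column-major: each column is stored separately, so getting the first
# 2 rows means slicing every column and zipping them back together.
columns = {"id": [1, 2, 3], "name": ["a", "b", "c"], "score": [0.5, 0.7, 0.9]}
first_two_col_major = list(zip(*(col[:2] for col in columns.values())))

assert first_two_row_major == first_two_col_major
```

On disk the cost is worse than this sketch suggests: a columnar file scatters one logical row across many column chunks, so a "head" touches many regions of the file, while analytic scans over a single column benefit for exactly the same reason.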
Sure, but any new format is going to have the same problems. I think you're right that implementation complexity needs to be considered, but it's not like Word or Excel files or something where you need to replicate bug for bug a format accreted over decades.
Parquet isn't trivial to parse / write but that's probably good imo. CSV is really easy to write, and... that just means everybody does it slightly differently. Being somewhat difficult to interact with encourages people to use a library to centralize a bit, but it's not so complex that someone motivated couldn't write a new implementation in a reasonable amount of time.
core to tesla's strategy is to do massive data collection from consumer-owned cars using beta software (and hardware, that the consumer pays for). that model is not compatible with expensive lidars, which contrary to some other comments in this thread, are still very expensive (just because the entry-level pucks are cheap, does not mean full lidar coverage is cheap). there is no way they could push $100k of sensors on consumers to build out their data collection pipeline. when tesla was first starting out, affordable lidar did not even exist so it's hard to call that a lame excuse.
all that said, I'm still pessimistic about tesla's chances at making camera-only L4 work in any short time horizon. we will see if they pull it off, but it's such a severe disadvantage compared to fully-kitted competitors.
vi is specified by POSIX, but (at least currently) it's marked as a User Portability Utilities (UP) extension, which is optional. It is required for XSI conformance, which mandates other extensions like C-Language Development Utilities (CD). But XSI itself is mostly legacy SysV stuff; the more widely supported parts were rolled into POSIX proper and either made a requirement or grouped into their own optional extension.
In general, most utilities and functions primarily used interactively are optional in POSIX.
But your containers may be a bit weird if their environment isn’t POSIX.
Like if you removed the “cd” command or the ability to read environment variables. “Containers” is any definition you want, but surely they’re built to some standard.
PS: I do make “from scratch” images a lot, I know you don’t “need” to have any utilities at all, but I’m fairly certain that a lot of software expects the “OS” to be POSIX.
I don't know what you mean. Linux containers don't contain operating systems. They contain processes. POSIX describes operating systems.
Every Linux process can read environment variables. They are contained in its address space. "cd" is a shell built-in. When there is no shell, there is no "cd". Not providing access to a shell sounds like great security practice tbh. Your applications shouldn't be using it anyway (they should create new processes directly).
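Concretely, a process reads its environment and changes directory through stdlib calls backed by syscalls; no shell is involved:

```python
import os

# Environment variables live in the process's own address space;
# reading one is just a lookup, no shell involved.
os.environ["APP_MODE"] = "prod"
assert os.environ.get("APP_MODE") == "prod"

# "cd" is only a shell built-in; the underlying operation is chdir(2),
# which any process can call directly.
os.chdir("/")
assert os.getcwd() == "/"
```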
This is just being obtuse for the sake of it. Yes, technically nothing in the notion of a container necessitates that; containers are just a form of namespacing. Have a gold star for understanding that. But the fact is: the vast majority of containers in use are going to be based on some minimal OS image, which is what the commenter was referring to.
People use minimal OS images when they absolutely have to, but the ideal case is just your binary sitting there all alone. (Sometimes you need the tzdata files, SSL certificates, and other support files. But rarely a shell.)
I agree that the ideal case is just a single binary with minimal supporting data, but people using OS images is certainly the norm not the exception. To suggest otherwise is ludicrous.
# on_update="nothing": does nothing when an update is tried
frozen_shared_state = freeze(shared_state, on_update="nothing")
frozen_shared_state.count = 4 # Silently ignored; the attribute is not updated
yikes. Thoughts on when this feature would ever be useful? Just the thought of working in a codebase with this subtle inconsistency makes me cringe.
I'm guessing this would be most useful when interfacing with naughty code that you can't rewrite. E.g. you need to call a function from another library that does something useful and also modifies its argument, and you only want it to do the useful thing.
I added this feature yesterday. Note the default value is "exception". With "nothing" I was thinking of passing some frozen object through a pipeline of unsafe/inherited/bad code, but without polluting the console with warnings or stopping the execution with exceptions.
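A stripped-down version of that behavior (a sketch of the idea, not the library's actual implementation) is just an attribute-interception wrapper whose `on_update` mode decides what a blocked write does:

```python
class Frozen:
    """Wrap an object; on_update decides what happens on a blocked write."""
    def __init__(self, obj, on_update="exception"):
        object.__setattr__(self, "_obj", obj)
        object.__setattr__(self, "_on_update", on_update)

    def __getattr__(self, name):
        # Reads pass through to the wrapped object.
        return getattr(object.__getattribute__(self, "_obj"), name)

    def __setattr__(self, name, value):
        mode = object.__getattribute__(self, "_on_update")
        if mode == "exception":
            raise AttributeError(f"cannot set {name!r} on frozen object")
        # "nothing": silently drop the write, so bad downstream code that
        # insists on mutating its arguments runs without errors or warnings.

class State:
    count = 3

frozen = Frozen(State(), on_update="nothing")
frozen.count = 4          # silently ignored
assert frozen.count == 3  # the wrapped value is untouched
```

With the default `"exception"` mode the same assignment raises instead, which is the safer behavior when you control the code doing the writing.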