I'm not quite sure how n8n works even after reading the docs, but our company Reflect provides an AI-driven approach to automation that may be similar, though for a narrower use-case: automated end-to-end testing.
You can see a video example of how it works in our docs here (https://reflect.run/docs/recording-tests/testing-with-ai/), but the idea is that you describe the actions and assertions you want to take in plain-text prompts, and the AI interprets those prompts in real time and executes them against a running browser session. In practice, it's a lot like writing a manual test script and having it automatically execute. We use both GPT 3.5 and 4 and will be releasing Vision support once OpenAI has deemed the gpt-4-turbo-with-vision model ready for production use.
Actually, that's exactly the point: it's not a narrower use-case, and you can extend and change the workflows however you like. But it's an interesting way to present your own company. Seems like you're doing something similar to what https://www.octomind.dev/ is doing, right?
Hey HN - excited to share this with you and hopefully get some feedback. ZeroStep is a JavaScript library that adds the power of AI to Playwright tests. ZeroStep’s ai() function lets developers test a web application using simple plain-text instructions, embedded directly in their Playwright tests.
The goals of this library are:
- Make tests easier to write. Writing an ai() step is equivalent to describing the action (“Fill out the form with realistic values”), assertion (“Verify there are no errors displayed on the page”), or extraction (“What is the shopping cart total?”) in plain English. The AI does the rest (see the extraction sketch just after this list). There’s no need to write selectors, or add data-test-id attributes all over your app just so your tests have stable locators. Our website and GitHub repo show several examples of tricky testing scenarios that are made easy with ZeroStep.
- Make tests less frustrating to maintain. Unlike selectors, ZeroStep ai() steps are not tightly coupled to your app’s markup. The ZeroStep AI interprets your ai() steps at runtime, which means even large-scale changes in the app won’t break your tests, so long as your functional requirements remain unchanged. Selectors, in my opinion, are one of the worst leaky abstractions in software development. I can’t imagine how many dev hours have been lost to them. We’re happy to have no concept of selectors at all with this approach.
- Keep it simple - ai() steps have no predefined syntax (unlike Cucumber). You just need to be able to clearly describe what action, assertion, or extraction you want the AI to perform.
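To make the extraction case concrete, here's a minimal sketch. The URL, product, and assertion are hypothetical; the shape of the ai() call matches the Calendly example further down the thread, and it assumes ai() returns the extracted value for question-style prompts:

import { test, expect } from '@playwright/test'
import { ai } from '@zerostep/playwright'

test('reads the cart total', async ({ page }) => {
  // Hypothetical storefront URL, purely for illustration
  await page.goto('https://example.com/store')

  // Action step: no selectors, just a plain-English instruction
  await ai('Add the first product on the page to the cart and open the cart', { page, test })

  // Extraction step: the return value is whatever the AI reads off the page
  const total = await ai('What is the shopping cart total?', { page, test })
  expect(total).toBeTruthy()
})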
Fascinating. This is certainly a hot space right now and this seems to strike a good balance between being low level enough and saving time. I do wonder though, are there issues with getting it to do the same test every single time without fiddling a lot with the prompt? I wonder if at some point you’re trading one headache for another, seeing as codegen is a command away.
For anyone looking to try this in an E2E testing context, we just released a library for Playwright called ZeroStep (https://zerostep.com/) that lets you script AI based actions, assertions, and extractions.
This is a working example that tests the core "book a meeting" workflow in Calendly:
import { test, expect } from '@playwright/test'
import { ai } from '@zerostep/playwright'

test.describe('Calendly', () => {
  test('book the next available timeslot', async ({ page }) => {
    await page.goto('https://calendly.com/zerostep-test/test-calendly')

    await ai('Verify that a calendar is displayed', { page, test })
    await ai('Dismiss the privacy modal', { page, test })
    await ai('Click on the first available day of the month', { page, test })
    await ai('Click on the first available time in the sidebar', { page, test })
    await ai('Click the Next button', { page, test })
    await ai('Fill out the form with realistic values', { page, test })
    await ai('Submit the form', { page, test })

    // getByText() returns a Locator, so assert visibility rather than definedness
    await expect(page.getByText('You are scheduled')).toBeVisible()
  })
})
It would be much easier to consider this as a solution if it would _output_ the generated test steps, and/or cache them and only modify them when needed.
Your example above has 7 ai() calls in one test - let's say it's usually closer to 5 - and we have hundreds of tests. Every single PR runs E2E tests, and we open a handful of PRs a day; let's call it 5. We're already looking at thousands of invocations a day. Based on your pricing, that would be incredibly expensive.
Pricing is listed on https://zerostep.com - you get 1,000 ai() calls per month for free, and then the cheapest paid plan is 2,000 ai() calls per month for $20, 4,000 for $40, etc. So basically you pay a penny per ai() call.
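Plugging the parent comment's rough numbers into that pricing, purely back-of-the-envelope (every figure below is an assumption pulled from the two comments above):

// Rough cost estimate from the numbers in this thread
const tests = 200            // "hundreds of tests"
const aiCallsPerTest = 5     // "usually closer to 5"
const prRunsPerDay = 5       // "a handful of PRs a day, let's call it 5"
const costPerCall = 0.01     // roughly a penny per ai() call on the paid tiers

const callsPerDay = tests * aiCallsPerTest * prRunsPerDay   // 5,000 calls/day
const costPerMonth = callsPerDay * costPerCall * 30         // ~$1,500/month
console.log({ callsPerDay, costPerMonth })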
In terms of reliability - we have a hard dependency on the OpenAI API, so that's what will affect reliability the most. We're using GPT-3.5 and GPT-4 models, which have been fairly reliable, but we'll bump to GPT-4-Turbo eventually. Right now GPT-4-Turbo is listed as "not suited for production use" in OpenAI's docs: https://platform.openai.com/docs/models
That's one aspect of reliability, but the one I was more curious about was determinism. If I repeatedly run the same test suite on the same code base and the same data and configuration, am I guaranteed to get the same test results every time, or is it possible for ai() to change its mind about what actions to take?
Ah got it. So GPT is non-deterministic, but we somewhat handle that by having a caching layer in our AI. Basically, if you make an ai() call and we see that the page state is identical to a previous invocation of that exact AI prompt, then we don't consult the AI at all and instead return the cached result. We did this mainly to reduce costs and speed up execution of the 2nd-to-nth run of the same test, but it does make the AI a bit more deterministic.
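In case it helps to picture it, the core of a cache like that fits in a few lines; this is just a sketch of the idea (hash the prompt together with a serialized page snapshot), not ZeroStep's actual implementation:

import { createHash } from 'node:crypto'

type ResolvedActions = { actions: unknown[] }

const cache = new Map<string, ResolvedActions>()

// Identical prompt + identical page state => identical key
function cacheKey(prompt: string, pageSnapshot: string): string {
  return createHash('sha256').update(prompt).update(pageSnapshot).digest('hex')
}

async function resolvePrompt(
  prompt: string,
  pageSnapshot: string,
  askModel: (prompt: string, snapshot: string) => Promise<ResolvedActions>
): Promise<ResolvedActions> {
  const key = cacheKey(prompt, pageSnapshot)
  const hit = cache.get(key)
  if (hit) return hit                         // cache hit: skip the model entirely
  const result = await askModel(prompt, pageSnapshot)
  cache.set(key, result)
  return result
}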
There are some new features in GPT-4-Turbo that will let us handle determinism better, and we will be exploring that once GPT-4-Turbo is stable.
That makes a lot of sense, thank you for the explanation. I will have to explore this the next time I am building page tests. I've considered doing it myself, but I'm much happier using a relatively inexpensive product than maintaining a creaky homebuilt version.
Three years ago we launched Reflect on HN (https://news.ycombinator.com/item?id=23897626). We're back to show you some new AI-powered features that we believe are a big step forward in the evolution of automated end-to-end testing. Specifically, these features raise the level of abstraction for test creation and maintenance.
One of our new AI-powered features is something we call Prompt Steps. Normally in Reflect you create a test by recording your actions as you use your application, but with Prompt steps you define what you want tested by describing it in plain text, and Reflect executes those actions on your behalf. We're making this feature publicly available so that you can sign up for a free account and try it for yourself.
Our goal with Reflect is to make end-to-end tests fast to create and easy to maintain. A lot of teams face issues with end-to-end tests being flaky and just generally not providing a lot of value. We faced that ourselves at our last startup, and it was the impetus for us to create this product. Since our launch, we've improved the product by making tests execute much faster, reducing VM startup times, adding support for API testing and cross-browser testing, and doing a lot of things to reduce flakiness, including some novel stuff like automatically detecting and waiting on asynchronous actions like XHRs and fetches.
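As an aside, one common way to implement that kind of waiting (not necessarily how Reflect does it internally) is to instrument fetch in the page under test and poll for a quiet period; XMLHttpRequest can be wrapped the same way:

// Injected into the page under test: track in-flight fetch() calls so the
// test runner can wait for the app to settle before taking the next action.
let inflight = 0

const originalFetch = window.fetch.bind(window)
window.fetch = async (...args: Parameters<typeof fetch>) => {
  inflight++
  try {
    return await originalFetch(...args)
  } finally {
    inflight--
  }
}

// Resolve once no requests have been in flight for a short quiet period.
async function waitForNetworkIdle(quietMs = 500, timeoutMs = 10_000): Promise<void> {
  const deadline = Date.now() + timeoutMs
  let quietSince = Date.now()
  while (Date.now() < deadline) {
    if (inflight > 0) quietSince = Date.now()
    else if (Date.now() - quietSince >= quietMs) return
    await new Promise((resolve) => setTimeout(resolve, 50))
  }
}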
Although Reflect is used by developers, our primary user is non-technical - someone like a manual tester, or a business analyst at a large company. This means it's important for us to provide ways for these users to express what they want tested without requiring them to write code. We think LLMs can be used to solve some foundational problems these users experience when trying to do automated testing. By letting users express what they want tested in plain English, and having the automation automatically perform those actions, we can provide non-technical users with something very close to the expressivity of code in a workflow that feels very familiar to them.
In the testing world there's something called BDD, which stands for Behavior-Driven Development. It's an existing way to express automated tests in plain English. With BDD, a tester or business analyst typically defines how the system should function using an English-language DSL called "Gherkin", and then that specification is turned into an automated test later using a framework called Cucumber. There are two main issues that we've heard a lot when talking to users practicing BDD:
1. They find the Gherkin syntax to be overly restrictive.
2. Because you have to write a whole bunch of glue code in the DSL translation layer to get the automation to work, the non-technical users writing the specs end up relying heavily on the developers who maintain that layer. And those developers would usually rather just write Selenium or Playwright code directly than use English as a go-between; the sketch below shows what that glue code looks like.
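For anyone who hasn't used Cucumber: every Gherkin step has to be backed by a hand-written step definition like the ones below (a generic sketch, not from any particular codebase; the selectors and page setup are assumed), and that glue code is exactly the layer described in point 2:

import { When, Then } from '@cucumber/cucumber'

// Gherkin: When I log in as "alice"
When('I log in as {string}', async function (this: any, username: string) {
  // this.page is assumed to be a Playwright Page created in a World/hooks file
  await this.page.fill('#username', username)                 // hypothetical selectors
  await this.page.fill('#password', 'correct horse battery staple')
  await this.page.click('button[type="submit"]')
})

// Gherkin: Then I should see the dashboard
Then('I should see the dashboard', async function (this: any) {
  await this.page.getByRole('heading', { name: 'Dashboard' }).waitFor({ state: 'visible' })
})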
We think our approach solves for these two main issues. Reflect's prompt steps have no predefined DSL. You can write whatever you want, including something that could result in multiple actions (e.g. "Fill out all the form fields with realistic values"). Reflect takes this prompt, analyzes the current state of the DOM, and queries OpenAI to determine what action or set of actions to take to fulfill that instruction. This means that non-technical users who practice BDD can create automated tests without developers having to build any sort of framework under the covers.
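In very rough terms, the flow for a prompt step is something like the sketch below. To be clear, this is a simplified illustration of the description above, not Reflect's actual code; the function shapes and prompt format are invented:

// Simplified: plain-English prompt + current DOM -> model -> concrete browser actions
type BrowserAction =
  | { kind: 'click'; targetDescription: string }
  | { kind: 'type'; targetDescription: string; value: string }

async function executePromptStep(
  prompt: string,                                     // e.g. "Fill out all the form fields with realistic values"
  getDomSnapshot: () => Promise<string>,              // serialized, trimmed DOM of the current page
  askModel: (p: string) => Promise<BrowserAction[]>,  // wraps the OpenAI call
  perform: (a: BrowserAction) => Promise<void>        // maps an action back onto real elements
): Promise<void> {
  const dom = await getDomSnapshot()
  const actions = await askModel(
    `Given this page:\n${dom}\nReturn the actions needed to: ${prompt}`
  )
  for (const action of actions) {
    await perform(action)                             // one prompt can yield several actions
  }
}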
Our other AI feature is something we call the 'AI Assistant'. This is meant to address shortcomings with the Selectors (also called Locators) that we generate automatically when you're using the record-and-playback features in Reflect. Selectors use the structure and styling of the page to target an element, and we generate multiple selectors for each action you take in Reflect. This approach works most of the time, but sometimes there's just not enough information on the page to generate good selectors, or the underlying web application has changed significantly at the DOM layer while remaining semantically equivalent to the user. Our "AI Assistant" feature works by falling back to querying the AI to determine what action to take when none of the selectors on hand are valid anymore.
This uses the same approach as prompt steps, except that the "prompt" in this case is an auto-generated description of the action that we recorded (e.g. something like "Click on Login button", or "Input x into username field"). We're usually able to generate a good English-language description based on the data in the DOM, like the text associated with the element, but on the occasions that we can't, we'll also query OpenAI to have it generate a test step description for us. This means that Selectors effectively become a sort of caching layer for retrieving what element to operate on for a given test step. They'll work most of the time, and element retrieval is fast. We believe that this approach will be resilient to even large changes to the page structure and styling, such as a major redesign of an application.
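The fallback itself is conceptually simple; roughly (again, a sketch of the idea rather than Reflect's implementation):

// Try the recorded selectors first; only consult the AI when none of them
// resolve to an element anymore.
async function findTarget(
  page: { $: (selector: string) => Promise<unknown | null> },  // Playwright-like page handle
  selectors: string[],                        // recorded at capture time, cheapest first
  stepDescription: string,                    // e.g. "Click on Login button"
  askAi: (description: string) => Promise<unknown>
): Promise<unknown> {
  for (const selector of selectors) {
    const element = await page.$(selector)
    if (element) return element               // selectors act as the fast "cache"
  }
  // All selectors are stale (e.g. after a redesign): fall back to the AI,
  // which re-locates the element from the plain-English description.
  return askAi(stepDescription)
}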
It's still early days for this technology. Right now our AI works by analyzing the state of the DOM, but we eventually want to move to a multi-modal approach so that we can capture visual signals that are not present in the DOM. It also has some limitations - for example right now it doesn't see inside iframes or Shadow DOM. We're working on addressing these limitations, but we think our coverage of use cases is wide enough that this is now ready for real-world use.
We're excited to launch this publicly, and would love to hear any feedback. Thanks for reading!
If you're a fan of Tolkien, check out MUME (Multi-user Middle Earth), a MUD that's been online since 1990: https://mume.org/
Its unofficial community site is http://elvenrunes.com/ which has hosted forums and player-submitted "logs" (text logs of PvP fights) for over 20 years.
I feel compelled to provide a counter-point to all of the "you can't" comments here.
It's possible. I started a company with my co-founder three years ago. We were both in our mid-thirties with kids when we started. We've since gone through YC, raised a seed round, and the company has been growing ever since.
What worked for me:
- I saved up money and drew from that while we were getting the company off the ground. My co-founder and I went close to a year without a salary. Having a spouse that works also helps.
- Having a relatively low cost-of-living helped. It never felt like we were reducing our quality of life, although this was in the middle of the pandemic so we weren't able to go on vacations etc even if we wanted to.
Building a company is a lot of work, but that's true regardless of whether you're married with children or not. I think parents have the advantage of being forced to be efficient with their time. It did require a lot of after-hours coding to get things off the ground, but I probably would've been coding on something anyway.
And is about as faulty. The Witch-king would definitely have been affected after putting on the ring of power (and would have immediately killed them before returning it to Sauron). It is also worth noting that Frodo didn't have his normal nervousness about handing the ring to Bombadil.
Is it possible to personalize your pitches to individual users? At our startup [1] we try to get straight to the point when pitching the product and demo something that is as close as possible to how the person we're talking to would actually use the product.
One advantage we have is that it only takes a few minutes to show the product, and it works on any publicly available site so with a little research it's pretty easy to show something that's pretty close to how they'd use the product themselves.
I think that absolutely qualifies. Also, it's pretty neat that the Loom video just unfurls inside of HN because I have the Loom extension installed! Nice hack, Loom.
tdm is an open-source library for generating seed data for your QA and staging environments.
I'm the co-founder of Reflect (https://reflect.run) which is a no-code tool for creating automated regression tests. We've realized that an accurate tool for building and running tests is necessary, but not always sufficient when it comes to being successful with automated regression testing. If your tests can't make assumptions about the state of your application, it doesn't matter how accurate your testing tool is; your tests are going to be flaky.
Every end-to-end test makes some implicit assumptions about the state of the application. For example, if you were testing an e-commerce store, you'd create a test that clicks through the site, adds a product to the cart, enters a dummy payment method, and validates that an order has been placed. There's going to be lots of baked-in assumptions in this test about the state of the application. If that product goes out of stock next month, or its product name changes, or its size / color attributes are different, then the test will probably fail. We built tdm to help software teams manage the underlying data that their end-to-end tests depend on. It's meant to run in your test environment just before your test suite executes, and it gets your application in a consistent state so that the implicit assumptions in your tests are correct.
tdm operates like a Terraform for test data; you describe the state that your data should be in, and tdm takes care of putting your data into that state. Rather than accessing your database directly, tdm interfaces with your APIs. This means that the same approach to managing your first-party data can also be used to manage test data in third-party APIs. If the API has an OpenAPI spec, we can auto-gen Typescript bindings to make integration easier.
Test data is defined as fixtures that are checked into source code. These fixtures look like JSON but are actually Typescript objects, which means your data gets compile-time checks, and you get the structural-typing goodness of TS to wrangle your data as you see fit.
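I haven't dug into tdm's exact fixture format, but from that description a fixture presumably looks something like the sketch below (the types and field names are invented for illustration, not tdm's actual API):

// products.fixture.ts - hypothetical shape
interface ProductFixture {
  sku: string
  name: string
  priceCents: number
  inStock: boolean
  attributes: { size: string; color: string }
}

export const products: ProductFixture[] = [
  {
    sku: 'TEE-001',
    name: 'Basic Tee',
    priceCents: 1999,
    inStock: true,                            // the "add to cart" e2e test assumes this
    attributes: { size: 'M', color: 'black' },
  },
]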
Similar to Terraform, you can run tdm in a dry-run mode to first check what changes will be applied, and then run a secondary command to apply those changes. With this "diffing" approach, any data that's generated by the tests themselves gets cleared out for the next run.
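And the dry-run/apply split presumably boils down to a diff between desired and actual state, conceptually something like this (again a sketch of the general technique, not tdm's real interface):

// Conceptual plan/apply over API-backed resources
interface ResourceClient<T> {
  list(): Promise<T[]>                        // current state, read via the app's API
  upsert(item: T): Promise<void>
  remove(item: T): Promise<void>
}

async function plan<T>(desired: T[], client: ResourceClient<T>, key: (t: T) => string) {
  const actual = await client.list()
  const actualByKey = new Map(actual.map((a) => [key(a), a] as [string, T]))
  const desiredKeys = new Set(desired.map(key))
  return {
    // crude deep-equality check via JSON, good enough for a sketch
    toUpsert: desired.filter((d) => JSON.stringify(actualByKey.get(key(d))) !== JSON.stringify(d)),
    toDelete: actual.filter((a) => !desiredKeys.has(key(a))),  // e.g. leftovers from the last test run
  }
}

async function apply<T>(desired: T[], client: ResourceClient<T>, key: (t: T) => string) {
  const { toUpsert, toDelete } = await plan(desired, client, key)   // "dry run" = just report this
  for (const item of toUpsert) await client.upsert(item)
  for (const item of toDelete) await client.remove(item)
}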
I'd love to get any feedback on this approach! Hopefully this is something that can be useful regardless of what you're using to build E2E tests.
Yes definitely, there are lots of products in the QA space trying to tackle the problem you're describing. I'm a co-founder of a no-code product in the space (https://reflect.run). Being no-code has the advantage of enabling all QA testers to build test automation, regardless of coding experience.