I'm not quite sure how n8n works even after reading the docs, but our company Reflect provides an AI-driven approach to automation that may be similar, though for a narrower use-case: automated end-to-end testing.
You can see a video example of how it works in our docs here (https://reflect.run/docs/recording-tests/testing-with-ai/), but the idea is that you describe the actions and assertions you want to take in plain-text prompts, and the AI interprets those prompts in real time and executes them against a running browser session. In practice, it's a lot like writing a manual test script and having it automatically execute. We use both GPT 3.5 and 4 and will be releasing Vision support once OpenAI has deemed the gpt-4-turbo-with-vision model ready for production use.
Actually, that's exactly the point: it's not a narrower use-case, and you can extend and change the workflows however you like. But it's an interesting way to present your own company. Seems like you're doing something similar to what https://www.octomind.dev/ is doing, right?
Hey HN - excited to share this with you and hopefully get some feedback. ZeroStep is a JavaScript library that adds the power of AI to Playwright tests. ZeroStep’s ai() function lets developers test a web application using simple plain-text instructions, embedded directly in their Playwright tests.
The goals of this library are:
- Make tests easier to write. Writing an ai() step is equivalent to describing the action (“Fill out the form with realistic values”), assertion (“Verify there are no errors displayed on the page”), or extraction (“What is the shopping cart total?”) in plain English. The AI does the rest (see the extraction sketch just after this list). There’s no need to write selectors, or add data-test-id attributes all over your app just so your tests have stable locators. Our website and GitHub repo show several examples of tricky testing scenarios that are made easy with ZeroStep.
- Make tests less frustrating to maintain. Unlike selectors, ZeroStep ai() steps are not tightly coupled to your app’s markup. The ZeroStep AI interprets your ai() steps at runtime, which means even large-scale changes in the app won’t break your tests, so long as your functional requirements remain unchanged. Selectors, in my opinion, are one of the worst leaky abstractions in software development. I can’t imagine how many dev hours have been lost to them. We’re happy to have no concept of selectors at all with this approach.
- Keep it simple - ai() steps have no predefined syntax (unlike Cucumber). You just need to be able to clearly describe what action, assertion, or extraction you want the AI to perform.
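To make the extraction case concrete, here's a minimal sketch. The URL, product, and assertion are hypothetical; the shape of the ai() call matches the Calendly example further down the thread, and it assumes ai() returns the extracted value for question-style prompts:

import { test, expect } from '@playwright/test'
import { ai } from '@zerostep/playwright'

test('reads the cart total', async ({ page }) => {
  // Hypothetical storefront URL, purely for illustration
  await page.goto('https://example.com/store')

  // Action step: no selectors, just a plain-English instruction
  await ai('Add the first product on the page to the cart and open the cart', { page, test })

  // Extraction step: the return value is whatever the AI reads off the page
  const total = await ai('What is the shopping cart total?', { page, test })
  expect(total).toBeTruthy()
})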
Fascinating. This is certainly a hot space right now and this seems to strike a good balance between being low level enough and saving time. I do wonder though, are there issues with getting it to do the same test every single time without fiddling a lot with the prompt? I wonder if at some point you’re trading one headache for another, seeing as codegen is a command away.
For anyone looking to try this in an E2E testing context, we just released a library for Playwright called ZeroStep (https://zerostep.com/) that lets you script AI based actions, assertions, and extractions.
This is a working example that tests the core "book a meeting" workflow in Calendly:
import { test, expect } from '@playwright/test'
import { ai } from '@zerostep/playwright'

test.describe('Calendly', () => {
  test('book the next available timeslot', async ({ page }) => {
    await page.goto('https://calendly.com/zerostep-test/test-calendly')

    await ai('Verify that a calendar is displayed', { page, test })
    await ai('Dismiss the privacy modal', { page, test })
    await ai('Click on the first available day of the month', { page, test })
    await ai('Click on the first available time in the sidebar', { page, test })
    await ai('Click the Next button', { page, test })
    await ai('Fill out the form with realistic values', { page, test })
    await ai('Submit the form', { page, test })

    // getByText() returns a Locator, so assert visibility rather than definedness
    await expect(page.getByText('You are scheduled')).toBeVisible()
  })
})
It would be much easier to consider this as a solution if it would _output_ the generated test steps, and/or cache them and only modify them when needed.
Your example above has 7 ai() calls in one test - let's say it's usually closer to 5 - and we have hundreds of tests. Every single PR runs E2E tests, and we open a handful of PRs a day; let's call it 5. We're already looking at thousands of invocations a day. Based on your pricing, that would be incredibly expensive.
Pricing is listed on https://zerostep.com - you get 1,000 ai() calls per month for free, and then the cheapest paid plan is 2,000 ai() calls per month for $20, 4,000 for $40, etc. So basically you pay a penny per ai() call.
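Plugging the parent comment's rough numbers into that pricing, purely back-of-the-envelope (every figure below is an assumption pulled from the two comments above):

// Rough cost estimate from the numbers in this thread
const tests = 200            // "hundreds of tests"
const aiCallsPerTest = 5     // "usually closer to 5"
const prRunsPerDay = 5       // "a handful of PRs a day, let's call it 5"
const costPerCall = 0.01     // roughly a penny per ai() call on the paid tiers

const callsPerDay = tests * aiCallsPerTest * prRunsPerDay   // 5,000 calls/day
const costPerMonth = callsPerDay * costPerCall * 30         // ~$1,500/month
console.log({ callsPerDay, costPerMonth })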
In terms of reliability - we have a hard dependency on the OpenAI API, so that's what will affect reliability the most. We're using GPT-3.5 and GPT-4 models, which have been fairly reliable, but we'll bump to GPT-4-Turbo eventually. Right now GPT-4-Turbo is listed as "not suited for production use" in OpenAI's docs: https://platform.openai.com/docs/models
That's one aspect of reliability, but the one I was more curious about was determinism. If I repeatedly run the same test suite on the same code base and the same data and configuration, am I guaranteed to get the same test results every time, or is it possible for ai() to change its mind about what actions to take?
Ah got it. So GPT is non-deterministic, but we somewhat handle that by having a caching layer in our AI. Basically, if you make an ai() call and we see that the page state is identical to a previous invocation of that exact AI prompt, then we don't consult the AI at all and instead return the cached result. We did this mainly to reduce costs and speed up execution of the 2nd-to-nth run of the same test, but it does make the AI a bit more deterministic.
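In case it helps to picture it, the core of a cache like that fits in a few lines; this is just a sketch of the idea (hash the prompt together with a serialized page snapshot), not ZeroStep's actual implementation:

import { createHash } from 'node:crypto'

type ResolvedActions = { actions: unknown[] }

const cache = new Map<string, ResolvedActions>()

// Identical prompt + identical page state => identical key
function cacheKey(prompt: string, pageSnapshot: string): string {
  return createHash('sha256').update(prompt).update(pageSnapshot).digest('hex')
}

async function resolvePrompt(
  prompt: string,
  pageSnapshot: string,
  askModel: (prompt: string, snapshot: string) => Promise<ResolvedActions>
): Promise<ResolvedActions> {
  const key = cacheKey(prompt, pageSnapshot)
  const hit = cache.get(key)
  if (hit) return hit                         // cache hit: skip the model entirely
  const result = await askModel(prompt, pageSnapshot)
  cache.set(key, result)
  return result
}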
There are some new features in GPT-4-Turbo that will let us handle determinism better, and we will be exploring that once GPT-4-Turbo is stable.
That makes a lot of sense, thank you for the explanation. I will have to explore this the next time I am building page tests. I've considered doing it myself, but I'm much happier using a relatively inexpensive product than maintaining a creaky homebuilt version.
Three years ago we launched Reflect on HN (https://news.ycombinator.com/item?id=23897626). We're back to show you some new AI-powered features that we believe are a big step forward in the evolution of automated end-to-end testing. Specifically, these features raise the level of abstraction for test creation and maintenance.
One of our new AI-powered features is something we call Prompt Steps. Normally in Reflect you create a test by recording your actions as you use your application, but with Prompt steps you define what you want tested by describing it in plain text, and Reflect executes those actions on your behalf. We're making this feature publicly available so that you can sign up for a free account and try it for yourself.
Our goal with Reflect is to make end-to-end tests fast to create and easy to maintain. A lot of teams face issues with end-to-end tests being flaky and just generally not providing a lot of value. We faced that ourselves at our last startup, and it was the impetus for us to create this product. Since our launch, we've improved the product by making tests execute much faster, reducing VM startup times, adding support for API testing and cross-browser testing, and doing a lot of things to reduce flakiness, including some novel stuff like automatically detecting and waiting on asynchronous actions like XHRs and fetches.
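As an aside, one common way to implement that kind of waiting (not necessarily how Reflect does it internally) is to instrument fetch in the page under test and poll for a quiet period; XMLHttpRequest can be wrapped the same way:

// Injected into the page under test: track in-flight fetch() calls so the
// test runner can wait for the app to settle before taking the next action.
let inflight = 0

const originalFetch = window.fetch.bind(window)
window.fetch = async (...args: Parameters<typeof fetch>) => {
  inflight++
  try {
    return await originalFetch(...args)
  } finally {
    inflight--
  }
}

// Resolve once no requests have been in flight for a short quiet period.
async function waitForNetworkIdle(quietMs = 500, timeoutMs = 10_000): Promise<void> {
  const deadline = Date.now() + timeoutMs
  let quietSince = Date.now()
  while (Date.now() < deadline) {
    if (inflight > 0) quietSince = Date.now()
    else if (Date.now() - quietSince >= quietMs) return
    await new Promise((resolve) => setTimeout(resolve, 50))
  }
}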
Although Reflect is used by developers, our primary user is non-technical - someone like a manual tester, or a business analyst at a large company. This means it's important for us to provide ways for these users to express what they want tested without requiring them to write code. We think LLMs can be used to solve some foundational problems these users experience when trying to do automated testing. By letting users express what they want tested in plain English, and having the automation automatically perform those actions, we can provide non-technical users with something very close to the expressivity of code in a workflow that feels very familiar to them.
In the testing world there's something called BDD, which stands for Behavior-Driven Development. It's an existing way to express automated tests in plain English. With BDD, a tester or business analyst typically defines how the system should function using an English-language DSL called "Gherkin", and then that specification is turned into an automated test later using a framework called Cucumber. There are two main issues that we've heard a lot when talking to users practicing BDD:
1. They find the Gherkin syntax to be overly restrictive.
2. Because you have to write a whole bunch of glue code in the DSL translation layer to get the automation to work, the non-technical users writing the specs end up relying heavily on the developers who maintain that layer. And those developers would usually rather just write Selenium or Playwright code directly than use English as a go-between; the sketch below shows what that glue code looks like.
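For anyone who hasn't used Cucumber: every Gherkin step has to be backed by a hand-written step definition like the ones below (a generic sketch, not from any particular codebase; the selectors and page setup are assumed), and that glue code is exactly the layer described in point 2:

import { When, Then } from '@cucumber/cucumber'

// Gherkin: When I log in as "alice"
When('I log in as {string}', async function (this: any, username: string) {
  // this.page is assumed to be a Playwright Page created in a World/hooks file
  await this.page.fill('#username', username)                 // hypothetical selectors
  await this.page.fill('#password', 'correct horse battery staple')
  await this.page.click('button[type="submit"]')
})

// Gherkin: Then I should see the dashboard
Then('I should see the dashboard', async function (this: any) {
  await this.page.getByRole('heading', { name: 'Dashboard' }).waitFor({ state: 'visible' })
})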
We think our approach solves for these two main issues. Reflect's prompt steps have no predefined DSL. You can write whatever you want, including something that could result in multiple actions (e.g. "Fill out all the form fields with realistic values"). Reflect takes this prompt, analyzes the current state of the DOM, and queries OpenAI to determine what action or set of actions to take to fulfill that instruction. This means that non-technical users who practice BDD can create automated tests without developers having to build any sort of framework under the covers.
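In very rough terms, the flow for a prompt step is something like the sketch below. To be clear, this is a simplified illustration of the description above, not Reflect's actual code; the function shapes and prompt format are invented:

// Simplified: plain-English prompt + current DOM -> model -> concrete browser actions
type BrowserAction =
  | { kind: 'click'; targetDescription: string }
  | { kind: 'type'; targetDescription: string; value: string }

async function executePromptStep(
  prompt: string,                                     // e.g. "Fill out all the form fields with realistic values"
  getDomSnapshot: () => Promise<string>,              // serialized, trimmed DOM of the current page
  askModel: (p: string) => Promise<BrowserAction[]>,  // wraps the OpenAI call
  perform: (a: BrowserAction) => Promise<void>        // maps an action back onto real elements
): Promise<void> {
  const dom = await getDomSnapshot()
  const actions = await askModel(
    `Given this page:\n${dom}\nReturn the actions needed to: ${prompt}`
  )
  for (const action of actions) {
    await perform(action)                             // one prompt can yield several actions
  }
}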
Our other AI feature is something we call the 'AI Assistant'. This is meant to address shortcomings with the Selectors (also called Locators) that we generate automatically when you're using the record-and-playback features in Reflect. Selectors use the structure and styling of the page to target an element, and we generate multiple selectors for each action you take in Reflect. This approach works most of the time, but sometimes there's just not enough information on the page to generate good selectors, or the underlying web application has changed significantly at the DOM layer while remaining semantically equivalent to the user. Our "AI Assistant" feature works by falling back to querying the AI to determine what action to take when none of the selectors on hand are valid anymore.
This uses the same approach as prompt steps, except that the "prompt" in this case is an auto-generated description of the action that we recorded (e.g. something like "Click on Login button", or "Input x into username field"). We're usually able to generate a good English-language description based on the data in the DOM, like the text associated with the element, but on the occasions that we can't, we'll also query OpenAI to have it generate a test step description for us. This means that Selectors effectively become a sort of caching layer for retrieving what element to operate on for a given test step. They'll work most of the time, and element retrieval is fast. We believe that this approach will be resilient to even large changes to the page structure and styling, such as a major redesign of an application.
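The fallback itself is conceptually simple; roughly (again, a sketch of the idea rather than Reflect's implementation):

// Try the recorded selectors first; only consult the AI when none of them
// resolve to an element anymore.
async function findTarget(
  page: { $: (selector: string) => Promise<unknown | null> },  // Playwright-like page handle
  selectors: string[],                        // recorded at capture time, cheapest first
  stepDescription: string,                    // e.g. "Click on Login button"
  askAi: (description: string) => Promise<unknown>
): Promise<unknown> {
  for (const selector of selectors) {
    const element = await page.$(selector)
    if (element) return element               // selectors act as the fast "cache"
  }
  // All selectors are stale (e.g. after a redesign): fall back to the AI,
  // which re-locates the element from the plain-English description.
  return askAi(stepDescription)
}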
It's still early days for this technology. Right now our AI works by analyzing the state of the DOM, but we eventually want to move to a multi-modal approach so that we can capture visual signals that are not present in the DOM. It also has some limitations - for example right now it doesn't see inside iframes or Shadow DOM. We're working on addressing these limitations, but we think our coverage of use cases is wide enough that this is now ready for real-world use.
We're excited to launch this publicly, and would love to hear any feedback. Thanks for reading!
If you're a fan of Tolkien, check out MUME (Multi-user Middle Earth), a MUD that's been online since 1990: https://mume.org/
Its unofficial community site is http://elvenrunes.com/ which has hosted forums and player-submitted "logs" (text logs of PvP fights) for over 20 years.
I feel compelled to provide a counter-point to all of the "you can't" comments here.
It's possible. I started a company with my co-founder three years ago. We were both in our mid-thirties with kids when we started. We've since gone through YC, raised a seed round, and the company has been growing ever since.
What worked for me:
- I saved up money and drew from that while we were getting the company off the ground. My co-founder and I went close to a year without a salary. Having a spouse that works also helps.
- Having a relatively low cost-of-living helped. It never felt like we were reducing our quality of life, although this was in the middle of the pandemic so we weren't able to go on vacations etc even if we wanted to.
Building a company is a lot of work, but that's true regardless of whether you're married with children or not. I think parents have the advantage of being forced to be efficient with their time. It did require a lot of after-hours coding to get things off the ground, but I probably would've been coding on something anyway.
And is about as faulty. The Witch-king would definitely have been affected after putting on the ring of power (and would have immediately killed them before returning it to Sauron). It is also worth noting that Frodo didn't have his normal nervousness about handing the ring to Bombadil.
Is it possible to personalize your pitches to individual users? At our startup [1] we try to get straight to the point when pitching the product and demo something that is as close as possible to how the person we're talking to would actually use the product.
One advantage we have is that it only takes a few minutes to show the product, and it works on any publicly available site so with a little research it's pretty easy to show something that's pretty close to how they'd use the product themselves.
I think that absolutely qualifies. Also, it's pretty neat that the Loom video just unfurls inside of HN because I have the Loom extension installed! Nice hack, Loom.
tdm is an open-source library for generating seed data for your QA and staging environments.
I'm the co-founder of Reflect (https://reflect.run) which is a no-code tool for creating automated regression tests. We've realized that an accurate tool for building and running tests is necessary, but not always sufficient when it comes to being successful with automated regression testing. If your tests can't make assumptions about the state of your application, it doesn't matter how accurate your testing tool is; your tests are going to be flaky.
Every end-to-end test makes some implicit assumptions about the state of the application. For example, if you were testing an e-commerce store, you'd create a test that clicks through the site, adds a product to the cart, enters a dummy payment method, and validates that an order has been placed. There's going to be lots of baked-in assumptions in this test about the state of the application. If that product goes out of stock next month, or its product name changes, or its size / color attributes are different, then the test will probably fail. We built tdm to help software teams manage the underlying data that their end-to-end tests depend on. It's meant to run in your test environment just before your test suite executes, and it gets your application in a consistent state so that the implicit assumptions in your tests are correct.
tdm operates like a Terraform for test data; you describe the state that your data should be in, and tdm takes care of putting your data into that state. Rather than accessing your database directly, tdm interfaces with your APIs. This means that the same approach to managing your first-party data can also be used to manage test data in third-party APIs. If the API has an OpenAPI spec, we can auto-gen Typescript bindings to make integration easier.
Test data is defined as fixtures that are checked into source code. These fixtures look like JSON but are actually Typescript objects, which means your data gets compile-time checks, and you get the structural-typing goodness of TS to wrangle your data as you see fit.
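I haven't dug into tdm's exact fixture format, but from that description a fixture presumably looks something like the sketch below (the types and field names are invented for illustration, not tdm's actual API):

// products.fixture.ts - hypothetical shape
interface ProductFixture {
  sku: string
  name: string
  priceCents: number
  inStock: boolean
  attributes: { size: string; color: string }
}

export const products: ProductFixture[] = [
  {
    sku: 'TEE-001',
    name: 'Basic Tee',
    priceCents: 1999,
    inStock: true,                            // the "add to cart" e2e test assumes this
    attributes: { size: 'M', color: 'black' },
  },
]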
Similar to Terraform, you can run tdm in a dry-run mode to first check what changes will be applied, and then run a secondary command to apply those changes. With this "diffing" approach, any data that's generated by the tests themselves gets cleared out for the next run.
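And the dry-run/apply split presumably boils down to a diff between desired and actual state, conceptually something like this (again a sketch of the general technique, not tdm's real interface):

// Conceptual plan/apply over API-backed resources
interface ResourceClient<T> {
  list(): Promise<T[]>                        // current state, read via the app's API
  upsert(item: T): Promise<void>
  remove(item: T): Promise<void>
}

async function plan<T>(desired: T[], client: ResourceClient<T>, key: (t: T) => string) {
  const actual = await client.list()
  const actualByKey = new Map(actual.map((a) => [key(a), a] as [string, T]))
  const desiredKeys = new Set(desired.map(key))
  return {
    // crude deep-equality check via JSON, good enough for a sketch
    toUpsert: desired.filter((d) => JSON.stringify(actualByKey.get(key(d))) !== JSON.stringify(d)),
    toDelete: actual.filter((a) => !desiredKeys.has(key(a))),  // e.g. leftovers from the last test run
  }
}

async function apply<T>(desired: T[], client: ResourceClient<T>, key: (t: T) => string) {
  const { toUpsert, toDelete } = await plan(desired, client, key)   // "dry run" = just report this
  for (const item of toUpsert) await client.upsert(item)
  for (const item of toDelete) await client.remove(item)
}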
I'd love to get any feedback on this approach! Hopefully this is something that can be useful regardless of what you're using to build E2E tests.
Yes definitely, there are lots of products in the QA space trying to tackle the problem you're describing. I'm a co-founder of a no-code product in the space (https://reflect.run). Being no-code has the advantage of enabling all QA testers to build test automation, regardless of coding experience.