Exploring generative AI products on the market, as well as those under development, reveals that every query yields a unique outcome. This generative characteristic is to be expected. On top of that, the models and data behind AI products are in constant flux, so variability in outputs is both noticeable and frequent. New ideas and features also ship at a rapid cadence, and users experience all of these changes combined. Evaluations and feedback loops ensure quality at scale, but it is equally vital for us as developers and designers to manually test our AI products and build intuition for them, early and routinely.
Test early builds. Get the baseline.
Early product builds that leave the local dev environment will likely crash at first. Even without outright crashes, many components are often broken in ways that prevent effective testing, e.g., the AI not generating content, or AI calls slowing down screen transitions. This is the baseline intuition. Things only get better from here. Report crashes.
Test every build. Spot themes.
Stepping through the key use case scenarios of our product (install, first run, asking a question, landing on a page via a shared link, etc.), we directly experience the core product value while noticing shortcomings in the original design. I find it helpful to complete full passes of the major flows rather than stopping at every glitch; jot down issues briefly and revisit them on a second pass. What is working well? What are the priorities to unblock at this point? What requires ongoing monitoring? Log tickets capturing themes and priorities. Draw the line on what requires fixes. Add in missing designs. We also come to understand the constraints we are working with, e.g., LLM text streaming can arrive very slowly, and with that understanding we can find better resolutions.
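One way to internalize a constraint like slow streaming is to simulate it locally instead of waiting on the live model. The following is a minimal sketch in TypeScript, with hypothetical names such as slowTokenStream; the real streaming API and timing of any given product will differ.

```typescript
// A minimal sketch (hypothetical names): replay a canned answer as a slow
// token stream so the UI can be exercised under streaming latency.
async function* slowTokenStream(
  text: string,
  msPerToken = 150, // assumption: tune to roughly match observed model speed
): AsyncGenerator<string> {
  for (const token of text.split(" ")) {
    await new Promise((resolve) => setTimeout(resolve, msPerToken));
    yield token + " ";
  }
}

// Walk the stream and print the partial answer, the way a message bubble
// would grow in the real UI.
async function demo(): Promise<void> {
  let rendered = "";
  for await (const token of slowTokenStream("Streaming answers arrive one token at a time.")) {
    rendered += token;
    console.log(rendered.trimEnd());
  }
}

demo();
```

Watching the words trickle in at different delays quickly shows whether the layout, scrolling, and loading states hold up.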
Test towards the upcoming milestone. Stay up to date.
It might be a user study, a round with internal or external friendlies, or a big demo. Getting hands-on with the latest build lets us fully grasp the user feedback it will generate. Also, report issues to help minimize bugs and maximize the chances of getting meaningful feedback. I have had a handful of testers participate in user studies, only to find that the intended AI features did not show up at all.
Test routinely. Sense variability.
Many times, when I test the products I design, I find myself paying attention only to whether the implementation matches the design mocks, whether the UI operates correctly, and so on. I find I need to get that UI correctness check done first. Then, in the routine tests that follow, I can be more mindful of the user, see the product holistically, and notice what is changing. A UI that worked yesterday might not work tomorrow; for example, a UI that accommodates AI latency may break down when the model becomes extremely fast.
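One way to routinely sense this kind of variability is to put an adjustable latency profile in front of a mocked model call, so the same flow can be exercised when responses are slow, fast, or effectively instant. The sketch below is an illustration with hypothetical names (LatencyProfile, mockModelReply), not any product's real API.

```typescript
// A minimal sketch (hypothetical names): run the same flow under several
// latency profiles to see which UI assumptions only hold at one speed.
type LatencyProfile = "instant" | "fast" | "slow";

const DELAY_MS: Record<LatencyProfile, number> = {
  instant: 0, // the case a latency-masking UI often forgets
  fast: 200,
  slow: 4000,
};

// Stand-in for a real model call; only the timing matters for this test.
async function mockModelReply(prompt: string, profile: LatencyProfile): Promise<string> {
  await new Promise((resolve) => setTimeout(resolve, DELAY_MS[profile]));
  return `Echo: ${prompt}`;
}

async function sweepLatency(): Promise<void> {
  for (const profile of ["instant", "fast", "slow"] as LatencyProfile[]) {
    const start = Date.now();
    await mockModelReply("What changed since yesterday?", profile);
    console.log(`${profile}: replied in ${Date.now() - start}ms`);
  }
}

sweepLatency();
```

Sweeping across the profiles tends to surface assumptions baked into spinners, skeletons, and transitions that only hold at one speed.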
It requires a time investment. Over time, we become fluent at it!