Artificial Intelligence (AI) is something we are starting to hear a lot about, and as a tester, you may soon be thrown into a project with ‘AI’ in its title. The following is my experience of testing an ‘artificial human’ for the first time.
What is AI?
When you think of AI, you might immediately jump to super-smart “sentient” robots or computers that want to take over the world. Wikipedia describes sentience as “the capacity to feel, perceive or experience subjectively”.
Right now, no one has created sentience. The closest we have to it is smart software that can imitate intelligence.
A more accurate term for this type of software is machine intelligence or intelligent machines. They are driven by business rules coded by humans.
There have been some attempts with powerful machine learning, but this paper will be talking about the type of AI that uses low-level machine learning and can run on the everyday devices that people already have.
Where to start?
So, you’re at your morning meeting at the bank and you are told that you will be testing a new, experimental, customer service chat-bot, also commonly known as a virtual assistant, artificial human or avatar. It will have a real-life human face and listen to customers’ spoken questions, seek clarification and provide useful answers.
You find out that there will only be three people on the project: a content creator, tester and developer. If this sounds overwhelming, it is and was, but at its heart it’s still software.
So as with all software testing, planning is important. This includes identifying its functional and non-functional parts, and defining and understanding the scope.
A great place to start planning is to understand, at a high level, what the project is trying to achieve.
This could be:
- The 30 most asked questions from the company’s help website to use as the input questions.
- No customer verification, as no customer information will be accessed.
- Output responses based on information already publicly available on the company’s website.
- Pop-up boxes, along with spoken responses, with links to further information.
Sounds straightforward, right?
As a tester, you have now received some process flows, showing conversations from beginning to end, so you jump into the design and implementation stages. Decision tree testing seems to fit well here.
The only problem is that the flows can change on an almost hourly basis. This means that every test case written essentially becomes obsolete the moment it’s finished.
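One way to cope with flows that change hourly is to stop hand-writing test cases and instead regenerate them from a model of the flow. The sketch below is a minimal, hypothetical illustration: the flow structure, questions and answer labels are all made up, not taken from the real project.

```python
# Hypothetical sketch: model a conversation flow as a tree so that every
# root-to-leaf test path can be regenerated automatically whenever the
# flow changes, instead of rewriting test cases by hand.

def enumerate_paths(node, path=()):
    """Yield every root-to-leaf conversation path as a tuple of steps."""
    path = path + (node["say"],)
    children = node.get("answers", {})
    if not children:
        yield path
        return
    for answer, child in children.items():
        yield from enumerate_paths(child, path + (f"customer: {answer}",))

# A tiny made-up flow with two decision points.
flow = {
    "say": "Would you like help with your account?",
    "answers": {
        "yes": {
            "say": "Is it a savings or everyday account?",
            "answers": {
                "savings": {"say": "Here is the savings info."},
                "everyday": {"say": "Here is the everyday info."},
            },
        },
        "no": {"say": "Is there anything else I can help with?"},
    },
}

for p in enumerate_paths(flow):
    print(" -> ".join(p))
```

When the content manager changes a branch, re-running the generator gives the fresh set of paths to verify, so the test cases never lag behind the flow.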
It becomes apparent that testing in such a dynamic environment requires a different, flexible approach. What else becomes apparent is you’re not just testing the conversation flows, but a variety of factors, both concrete and high level.
Let’s split them up into more defined categories:
- Voice input recognition accuracy
- Text input recognition accuracy
- Response accuracy
- Flow accuracy
- Look and Feel
Some questions to ask:
- Are all possible voice and text inputs to that conversation flow recognised?
- When the artificial human seeks clarification, is the customer’s answer recognised?
- Are the inappropriate inputs being handled appropriately?
- Is the artificial human’s speech response technically correct?
- Does it sound right?
- Does the conversation go to the correct next step?
- Is the pop-up box information technically correct?
- Is the formatting to the company brand standard?
- Do the links all work correctly?
Look and Feel
- Is the avatar’s mouth synced to the spoken words?
- Are the expressions and eye movements appropriate for the subject matter? (You don’t want a happy face avatar when talking about financial hardship!)
- What about image and sound quality?
- Start up and capture?
- Multiple users?
- Time out functions?
The list is long and can be quite complex.
So why does it change so fast?
The avatar I tested is from Soul Machines, while Google’s Dialogflow was used for the natural language engine. A regional pronunciation plugin was also required to make it work down under.
The Content Manager then started creating the conversation flows within Dialogflow. Each conversation has a trigger, called an input, and can have many outputs, depending on the customer’s response and the path the conversation flows down.
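To make the input/output structure concrete, here is a simplified, illustrative model of how such an intent hangs together: one trigger with several training phrases, and different outputs depending on the customer’s follow-up. The field names, intent name and naive matcher are assumptions for illustration, not the real Dialogflow API.

```python
# Simplified stand-in for a Dialogflow-style intent: the trigger (input)
# is recognised from training phrases; the outputs vary with the
# customer's response. Names and structure are made up for illustration.

intent = {
    "name": "open_savings_account",
    "training_phrases": [
        "I want to open a savings account",
        "how do I open a savings account",
        "open savings account",
    ],
    "response": "Sure, I can help you open a savings account.",
    "follow_ups": {
        "yes": "Great, let's get started.",
        "no": "No problem. Is there anything else I can help with?",
    },
}

def matches(intent, utterance):
    """Very naive exact matcher -- real NLU is far more forgiving."""
    return utterance.lower().strip() in (p.lower() for p in intent["training_phrases"])

print(matches(intent, "I want to open a savings account"))
print(matches(intent, "what's the weather"))
```

In the real tool the matching is statistical rather than exact, which is precisely what makes the recognition testing described later so open-ended.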
Originally, it was assumed that basing the conversations on the (already approved) website content would be adequate and appropriate, but the SME owners of the individual website content wanted to approve everything. Then the legal department decided it should approve everything too.
This meant that every conversation now had to go through one or more SMEs specific to its content, and then on to legal, before it could be published into the test environment. At times, the content manager pushed back on the SMEs and legal, as their changes turned responses that sounded natural into ones that sounded like the avatar was reading a brochure.
This was moving away from the expectations around the behaviour of our avatar, Jaime, whose unique personality we wanted to preserve. Another hurdle to overcome was that testing voice-controlled software requires a quiet environment; otherwise it picks up all the nearby conversations.
Consideration also has to be paid to nearby colleagues. They don’t want to listen to a repetitive conversation all day!
Two attempts were made to automate parts of the testing.
Automation would have worked well for the conversation path accuracy. If decision tree or pairwise testing could have been automated, it would have proved the accuracy of each decision path.
Due to the constant content and structure changes, the decision tree was soon abandoned. It was thought that pairwise might be easier to maintain, as then only the parts of the paths that changed would need re-writing.
Every decision was painstakingly recorded. Note that only ‘yes’ and ‘no’ inputs were used; the ‘input unrecognised’ path was left out, as the code handled it universally.
This automated test suite was loaded into the tool and started. It turned out it would have needed three weeks to complete the suite.
This was three weeks we didn’t have, so it was quickly abandoned.
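The back-of-envelope arithmetic behind that kind of blow-out is easy to sketch. All of the numbers below are illustrative assumptions, not figures from the project; the point is only that yes/no decision paths multiply and each spoken exchange takes real clock time.

```python
# Illustrative arithmetic (assumed numbers, not project data): why a
# fully-enumerated yes/no suite over spoken conversations takes so long.

flows = 30                  # entry-point conversations (assumed)
depth = 8                   # yes/no decisions per flow (assumed)
seconds_per_exchange = 20   # speak, listen, verify one step (assumed)

paths_per_flow = 2 ** depth              # every yes/no combination
total_paths = flows * paths_per_flow
exchanges = total_paths * depth          # each path replays every step
hours = exchanges * seconds_per_exchange / 3600

print(f"{total_paths} paths, ~{hours:.0f} hours of runtime")
```

Even with these modest assumptions the runtime lands in the hundreds of hours, so weeks of execution time is entirely plausible.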
What would automated testing cover?
Automated testing would have covered the decision points being robust if a customer said nothing more than yes or no at each decision point. It would have been an excellent functional regression test that could have been run every time there was a deployment.
What would automated testing not cover?
When a customer talks to an avatar, they tend to say unpredictable things.
Some of them are bad use of English and some of them are incomprehensible. As humans, we make allowances for this and can often glean the intention right away or know how to quickly seek clarification.
Once Dialogflow gets past the accent, it listens for key words. This, it was claimed, is all that would need to be done.
In practice, it required pages and pages of training for every input. There were 320 variations on the word ‘yes’ alone. Who would have thought there were so many ways to answer in the affirmative!
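Keeping those training lists complete becomes a chore of its own. A sketch of one way to stay on top of it, using made-up phrases and a made-up ‘trained’ set, is a simple gap check between what customers actually said and what the bot has been trained on:

```python
# Sketch of a gap check between variations heard in the logs and the
# phrases the bot has been trained on. Both lists are invented examples.

recorded_variations = ["yes", "yeah", "yep", "yes please", "sure", "ok"]
trained_phrases = {"yes", "yeah", "yep", "sure"}  # what the bot knows today

missing = [v for v in recorded_variations if v.lower() not in trained_phrases]
print("needs training:", missing)  # ['yes please', 'ok']
```

Run over real logs, a check like this turns “pages and pages of training” into a prioritised to-do list rather than guesswork.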
Now imagine the automated testing suite re-testing with every variation of every input, then every combination of every input.
Automated testing can also not cover:
- The voice package delivered by Soul Machines - The voice package was separately recorded for every interaction to avoid the unnatural robotic sound you get with speech synthesis. In practice, this was always very accurate and only required a quick one-time listen, and even then only a few odd-ball problems were ever uncovered.
- Spoken and text inputs - Is every reasonable input recognised? If it’s not, it’s added where it should be recognised. Conversely, is an input being recognised as something it’s not? This input can be captured and sent to the right place.
- Pop-up boxes – These need to be checked for informational accuracy and formatting. The included links need to be checked to ensure they open the correct webpage.
- Page fit optimisation on various devices and browsers.
- Does it handle low data speeds? (It sacrifices video quality to keep voice quality as high as possible).
- Does it time out after one minute of inactivity?
- Is it timing out when it shouldn’t?
- Are the API points behaving correctly?
- Did you know you can ask about the weather?
Look and feel
- Does it feel as natural as possible? We are trying to avoid ‘uncanny valley’.
- Voice / mouth sync, avatar smoothness, eye position and blinking.
- Does it respond appropriately to inappropriate questions?
- Is it behaving in a way that could impact on the company’s reputation?
We made it to production
Now the avatar is in production and viewable to the world. It’s time to celebrate, right?
Well, yes and no. It turns out the top 30 questions asked of the online FAQ section are quite different to what’s being asked of the avatar.
They are also being asked in ways that hadn’t been thought of. Not wrong specifically, just different, so they are not recognised.
This requires some early release maintenance. Dialogflow is flexible, as it can be continuously released to production with no outage to the front-end customer.
Every day the text logs of the conversations were sorted through, and anything that was not understood, but should have been, was added in. Questions that appeared multiple times and weren’t originally anticipated were captured, so new conversation flows could be created, approved and added.
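The daily triage can be sketched as a small script over the conversation logs: pick out everything that fell through to the fallback intent and count the most frequent offenders. The log format and intent names here are assumptions for illustration.

```python
# Sketch of the daily log triage: find utterances that hit the fallback
# intent and rank them by frequency, so the most common unrecognised
# questions get trained in (or given new flows) first.
# The log structure below is an invented example.

from collections import Counter

log = [
    {"utterance": "swift code for usa", "intent": "fallback"},
    {"utterance": "open savings account", "intent": "open_savings_account"},
    {"utterance": "swift code for usa", "intent": "fallback"},
    {"utterance": "card not working overseas", "intent": "fallback"},
]

unrecognised = Counter(e["utterance"] for e in log if e["intent"] == "fallback")
for utterance, count in unrecognised.most_common():
    print(count, utterance)
```

Sorting by frequency matters: fixing the top few misses each day gives the biggest recognition improvement for the least approval overhead.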
Through this daily process, another 120 entry points with conversation flows were added. Every time a conversation was added, extra care had to be taken not to weaken one of the other ones.
In the original design, a sentence with the word ‘account’ went to a single entry point. But it was found that customers were asking very specific questions about very different subjects that included the word account in them.
So, the original conversation that tried to narrow down what the customer wanted about their account, with a series of follow up questions, was diversified. Instead of the customer going on a long journey to get what they wanted, they went straight to what they asked for.
This type of customisation was adopted across the board. An example of this is if a customer just said ‘account’, then they went on the original flow to find out what they wanted.
But if they said ‘I want to open a savings account’, they immediately went to that subject. Of course, it meant that we had to remove all combinations of ‘I want to open a savings account’ from the original flow.
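The routing change can be sketched as a two-tier lookup: specific requests jump straight to their subject, while a bare ‘account’ still enters the original narrowing-down flow. The phrases and flow names below are invented for illustration; the real matching was done by Dialogflow, not simple substring checks.

```python
# Sketch of two-tier routing (invented phrases and flow names): specific
# requests skip the clarifying questions; a generic mention of 'account'
# still goes on the original narrowing-down journey.

SPECIFIC_ROUTES = {
    "open a savings account": "savings_account_flow",
    "close my account": "close_account_flow",
    "account fees": "fees_flow",
}

def route(utterance):
    text = utterance.lower()
    for phrase, flow in SPECIFIC_ROUTES.items():
        if phrase in text:
            return flow                    # straight to the subject
    if "account" in text:
        return "account_clarify_flow"      # original narrowing-down flow
    return "fallback"

print(route("I want to open a savings account"))  # savings_account_flow
print(route("account"))                           # account_clarify_flow
```

Note the ordering: the specific routes must be checked before the generic keyword, which is exactly why the specific phrases had to be removed from the original flow’s inputs.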
Another reason we updated our methodology is that we could detect frustration in customers who stated exactly what they wanted, only to be asked again if they wanted it. For example, the customer says ‘I want a SWIFT code for the USA’ and then the AI says ‘please tell me the country you want the SWIFT code for’.
Being thrown into a cutting-edge project is always exciting and daunting at the same time. Working in a small group and having direct control over the quality of the end product is also very rewarding. It forces you to think outside the square and create things from scratch, based on the knowledge you have and what you make up along the way.
I found my background in testing IVR (Interactive Voice Response) systems helpful, but not an absolute requirement. Anyone with testing skills and a desire to provide an excellent customer experience can handle it.
So is this type of AI the future of customer service? In my opinion, yes, but only for some people.
An avatar provides another option for a company to answer customer queries. It won’t be suited to all customers, but for those who use it, it can be fun and informative. For the company, it has the possibility of freeing up customer service representatives from answering commonly asked and tedious questions.
As a tester, if you are ever offered (or volunteer) to test an AI, don’t be put off by the scary sound of it and jump in. The results will be their own reward!