Your product could be secretly misbehaving 😳

Or, in other words: how do we know that a product works the way we need it to, from both a user experience and a business perspective, rather than silently doing something completely wrong?

All too often, products seem to work because they don’t report any errors or raise any exceptions, yet they are still broken from a user flow or business perspective.

Imagine we are a fintech startup, providing loans for individuals.
Our product might be fully functional, yet it might be rejecting loans that it should approve, or, even worse, approving loans that it should reject.
Yet we wouldn’t even know that this is happening until it’s way too late, and we suddenly start seeing too many users defaulting on their repayments.


How can we catch that early on, prevent it from happening, and, in the worst case, be notified if it does happen in production?


By tight flow & behaviour monitoring.

By connecting the actual plan we rely on when building our product and writing our code - to the actual code.

By looking at our user tracking as a live mini-map in GTA, The Elder Scrolls Skyrim (or Morrowind, if you’re old like us), or any other RPG - which shows us exactly where the user is and what they’re doing, in real-time.

Let’s put these ideas into practice.

Say we’ve got this beautiful user journey, which describes a loan application, start to finish:

A loan application journey

It’s a long process, so let’s focus on the main bit that interests us in our example - the approval or rejection of a loan based on the credit score of an individual.

In a very simple case, it might look something like this:

The logic for approving a loan

If a credit rating of a person applying for a loan is 600 or more - we would like to approve it.
If, however, it’s less than 600 - we would like to reject the loan.
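As a minimal sketch, the rule above could be written like this in Python. The names `APPROVAL_THRESHOLD` and `decide_loan` are made up for illustration and are not taken from any real codebase:

```python
# Assumed constant mirroring the rule in the plan: 600 or more is approved.
APPROVAL_THRESHOLD = 600


def decide_loan(credit_score: int) -> str:
    """Return "approve" for a score of 600 or more, otherwise "reject"."""
    if credit_score >= APPROVAL_THRESHOLD:
        return "approve"
    return "reject"


print(decide_loan(600))  # approve
print(decide_loan(599))  # reject
```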

In the real world, you would most likely NOT be notified about loans that are rejected. You would most likely ONLY be notified if your code throws an unexpected error or a timeout occurs somewhere in the chain, but not when a loan is rejected for valid reasons.

Now, what if, in one of the releases, we introduced a bug that suddenly starts approving loans for applicants with a credit rating of 500?

We would not be notified about that, which would result in ineligible people receiving credit.

And we wouldn’t even know until it’s too late.
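To see why nothing would alert us, here is a small sketch of how such a bug could look, assuming the threshold is read from configuration. The function name and the `threshold` parameter are hypothetical. Both calls finish without raising an exception; the only difference is the business outcome:

```python
def decide_loan(credit_score: int, threshold: int) -> str:
    # Hypothetical decision function; the threshold comes from configuration.
    return "approve" if credit_score >= threshold else "reject"


# Intended configuration: 600 or more is approved.
intended = decide_loan(500, threshold=600)

# A release that accidentally ships a threshold of 500 (assumed bug scenario).
buggy = decide_loan(500, threshold=500)

print(intended)  # reject
print(buggy)     # approve, and no error is raised anywhere
```

From the point of view of logs and error reporting, the two runs are indistinguishable.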

So, what can we do about it?

Let’s split our solution into two parts: prevention and notification.

Prevention

Prevention can be achieved by thorough testing. However, testing is usually mostly visual, and it is very easy to let such a bug slip through when relying purely on logs or visual QA of the frontend.

So it helps a lot to use a tracking tool that shows you every single action that every end-user performs.


For example, if we were to monitor an end-user whose ID (for testing purposes) is “your-user-id-1234”, we could see something like this while the user goes through the loan application:

Tracking how a specific user experienced the loan application

We can see that this user’s loan has been successfully approved following the credit check.

Let’s focus on the fact that the loan has been approved - which is the green circle on the right (which, by the way, looks exactly the same as the circle we drew when we planned this journey in the first place).

When looking inside the “Approve” step (by clicking on the Attachments icon with the red dot) - we will see exactly what happened to `your-user-id-1234` when their loan was approved:

Diving deeper into the "Approve" event for a particular user

Looking closely at the JSON object on the right (which represents some data from the API, including the credit score) - we will notice that the last field in this object states that the user’s credit score was 400.

Yet, the event we’ve sent to STATEWIZE notified STATEWIZE that the loan has been approved (see the green label on the right, just above the JSON object).

As you can see, it’s obvious that there’s been a mistake somewhere. But instead of digging through logs, or connecting to the database and writing SQL queries, we can see the reason right here, immediately and with no extra effort.

Performing this simple check during the QA process will dramatically decrease the possibility of such an issue arising in production.

Still, we need to prepare for the worst, so let’s imagine the bug somehow slipped through anyway. We would like to be notified about that, right?


Let’s configure a Slack notification.

Notifications

You can add Slack notifications (as well as custom webhooks) for every event received by STATEWIZE. Now, to avoid being overloaded by notifications, you may add notifications only for events of high importance, such as failures during sanity checks.

For our specific purposes, we will add a step called “Sanity check”, which checks whether an approved loan had a credit score below a hard-limit threshold, for example 500.

So let’s start by going back to our planning board, and adding this step to the board:

Adding a sanity check to the plan

Now, let’s implement the change in our actual product’s codebase, for example with something like this:

An example sanity check implementation (can also be done as a query in tools like Datadog)
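A minimal sketch of what such a check might look like in Python. The names here are assumptions made for illustration: `SANITY_MIN_CREDIT_SCORE`, `sanity_check_approval`, and the event names are invented, and `track_event` is a hypothetical stand-in for whatever tracking client your product uses, not the actual STATEWIZE API:

```python
from typing import Callable

# Hard-coded on purpose, independent of the configurable approval threshold.
SANITY_MIN_CREDIT_SCORE = 500


def sanity_check_approval(
    user_id: str,
    credit_score: int,
    track_event: Callable[[str, str, dict], None],
) -> bool:
    """Report whether an approved loan passes the hard-limit check.

    `track_event` is a hypothetical callable that receives the user ID,
    an event name, and a payload.
    """
    passed = credit_score >= SANITY_MIN_CREDIT_SCORE
    event_name = "sanity_check_passed" if passed else "sanity_check_failed"
    track_event(user_id, event_name, {"credit_score": credit_score})
    return passed


# Usage with a stand-in tracker that only records events in memory.
recorded = []
sanity_check_approval(
    "your-user-id-1234",
    400,
    lambda user_id, name, payload: recorded.append((user_id, name, payload)),
)
print(recorded)
# [('your-user-id-1234', 'sanity_check_failed', {'credit_score': 400})]
```

Passing the tracker in as a parameter keeps the check easy to test without any network calls.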

Now you might be thinking “Well, if we add this code to the codebase, couldn’t we have prevented the loan from being approved in the first place?”

We could, but usually the thresholds are configured dynamically elsewhere, e.g. in the database. Here we are adding a completely separate sanity check, purposefully hard-coded and with a low risk tolerance, whose sole purpose is to act as an independent whistleblower in case we made a mistake with our configuration.

To add a Slack notification for when the sanity check fails, we just need to perform a few clicks, with no additional coding.


(If you don’t see the Slack option, make sure you’ve signed in to your Slack workspace from your settings.)

Now once that’s done, let’s test our user again and see if we receive a Slack notification:

Slack notification for a failed sanity check




And that’s it!

Now we can be certain that not only have we done everything we could to prevent our product from silently misbehaving, but we have also added a safeguard and a notification that will alert us if our product misbehaves or our tech implementation doesn’t match our business requirements.