In this post, we’ll round up our work on BingoBuzz, which you can find right here.

GOOD/BAD points:

We think we used a methodical, systematic approach: every iteration tested a specific part, had clear goals and was meaningful to us. Based on those iterations, we were able to make changes that improved our application remarkably. We had a similar systematic approach for writing our blog posts. So, all in all, we’re quite happy with how we arrived at our result.

Furthermore, we’re content with the end result. Our app has limited functionality (which is an obvious flaw), but the interface for that functionality seems quite good to us. On the other hand, the design of our interface also proves that we’re not really artists 😉 but as far as usability goes, we think it’s okay, or maybe even good.

However, we’re not so happy about the fact that we haven’t had an opportunity to release in the wild. Although some of our relatives/friends have been playing our app outside of the test iterations, we always had things to test that weren’t really doable on a large scale. Last week, we could have released something, but classes were over by that time, so there was no public to test with… The reason for our late release is our busy schedule, which we’ll talk about in one of the next sections.

What we would do differently:

First of all, we would never again take this course in the second year of our master’s. We think the course is super-interesting, we really do, and we would have loved to allocate more time to it, but money-time for UI was also money-time for our thesis. Since we kinda saw that coming, we chose to push our deadline to, well, today. As stated, everybody was preparing for exams instead of going to class when we had some spare time to catch up… This was illustrated again in the last couple of days, when we tried to do our final iteration on the campuses of both sciences and medicine, and got rejected several times because “sorry, I’m on my way to my exam, I’m too nervous” (which is understandable). We can’t really blame anyone but ourselves. In hindsight, we would pick this course in our first year.

A second, less important remark: we should have gone straight for the webapp idea, instead of losing time on steroids…

What we would do with more time:

Mainly add some functionality and then go play it in class 🙂 Obviously, we would love to have an evaluation iteration with a lot of users, aka in the wild, and study whether they lose focus or how long they can play before getting bored. We might revive our moderation idea, but that’s a long shot.

We’d also contact someone with some aesthetic ability to pick a decent color scheme and clean it up a bit…



Iteration Overview

During the course we went from an initial app idea to a concrete implementation. The first prototypes were paper prototypes, which were easy to create and modify. These prototypes allowed us to quickly find major UI design issues before starting with the real implementation. There were four different paper prototype iterations, each testing the changes made to improve upon our design.

We chose to create a simple web app modelled after the paper prototypes. Our first version only featured part of the app. The test we conducted showed the app was working within the goals we defined. This gave us the green light to implement the missing features of the app. With all features in place, the next three iterations tested the efficiency of the different parts of the application.
For each of the steps within the building process of the application there’s at least one blog post. A summary of the steps can be found in the following spreadsheet.

Digital iteration 4

In one of our previous digital iterations, we noticed that it wasn’t obvious to people how to remove a word while starting their own game: they often tapped the wrong button and accidentally removed the wrong word.

One of the changes we made after that specific iteration was making the font size bigger, mainly for readability reasons. Consequently, the button to remove a word grew a bit in size (as it was dependent on the font size). However, we wanted to be sure that the buttons are now functioning properly.


Goal: We want to find out how good our buttons to remove a word from the list are. Simple as that.

Method: We devised a pretty specific test for this one: we made a variation of our addWords screen, showing a list of ten words and asking the test user to remove a specific word. When any word is removed, a new list is presented with another assignment. This is repeated ten times, so test users had to remove ten words in total. These were the same words in the same lists for every test user. In every list, the “right” word was in another position, so each position was tested exactly once in one test. Furthermore, to get some information on the influence of screen size and quality, we asked every test user to do the test twice: once on a Samsung Galaxy S3 (4.8 inches, Android smartphone) and once on a Mobistel Cynus T1 (4 inches, low-end Android smartphone). You can try it out yourself right here; we highly recommend doing so before reading on, since everything will be way clearer 😉
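To make the setup concrete, here’s a minimal sketch of how one such trial could be recorded (this is our own illustration; the list contents, function and field names are hypothetical, not the actual test code):

```javascript
// One trial: a fixed list of ten words plus the word the user is asked to remove.
// (Hypothetical data; the real test used ten such lists, each with the target
// word at a different position.)
const trials = [
  {
    words: ["pizza", "sushi", "pasta", "taco", "bagel",
            "curry", "salad", "sardine", "crepe", "donut"],
    target: "pizza",
  },
  // …nine more lists would follow here…
];

// Record one tap: did the user remove the word we asked for?
function recordTap(trial, tappedWord) {
  return {
    target: trial.target,
    tapped: tappedWord,
    correct: tappedWord === trial.target,
    targetIndex: trial.words.indexOf(trial.target), // position being tested
  };
}
```

Running all ten trials on both devices then yields twenty samples per test user.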

Rationale: With this accuracy test, we can see how well our buttons are performing, expressed in objective numbers: no opinions/think-aloud/questions, just right or wrong. We refrained from letting test users use their own smartphone, and chose to use the same devices for every test. Because accidentally removing the wrong word is super-extra-annoying, we were hoping to get an accuracy rate of at least 95 percent on each device.

Test subjects: We performed 30 tests: 15 test subjects, two tests each, on different smartphones. The test user group consisted of 9 males and 6 females, ranging in age from 18 to 24. All of them are students, which we think isn’t really a problem, because our application is aimed at students. Five out of the fifteen testers are doing a master’s in computer science, one is doing his first year in engineering. There were two medicine students, a nursing student, three students doing an extra bachelor in ER & intensive care (which is a follow-up program of nursing) and a pharmacology student. Yes, we went to another campus 🙂 . Finally, we also tested with a nanotechnology student and an industrial sciences student.

Results: Letting 15 users remove ten words twice results in 300 test samples. However, two people were a bit fast and accidentally tapped the “ok” button of the assignment twice, resulting in a failed test sample. Lucky for us, those occurrences weren’t on the same device. We discarded those two samples, leaving us with 298 samples, 149 per device. Obviously we’re not going to talk through all of them one by one, so we’ll give a thorough summary:

  • Overall score: in 267 out of the 298 cases, the right word was removed. This comes down to 89.6 %, falling way short of our goal of 95 %.
  • However, for the bigger device, the score was way better: 146/149 (98 %), while the smaller device led to more problems: 121/149 (81 %).
  • ALL of the 31 errors were of a similar nature: instead of the right word being removed, the test subject accidentally removed the word below the right one.
  • On the smaller device, we noticed a strong correlation between error rate and position in the list: positions higher in the list led to more errors. To give you an idea of the distribution of errors, we put the errors per position in the list in a nice diagram, and combined it with one of the lists (keep in mind that there were ten different lists, so test subjects were never instructed to remove “sardine”. The word for this list was “pizza”):
  • Also remarkable: guys made 2.33 errors on average, while girls made only 1.67. Oops.
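The numbers above are straightforward to compute from the raw samples; here’s a sketch of the kind of summary we ran (our own illustration, with hypothetical field names):

```javascript
// Summarise tap samples into overall and per-device accuracy.
// Each sample looks like { device: "...", correct: true/false }.
function summarize(samples) {
  const byDevice = {};
  for (const s of samples) {
    if (!byDevice[s.device]) byDevice[s.device] = { total: 0, correct: 0 };
    byDevice[s.device].total += 1;
    if (s.correct) byDevice[s.device].correct += 1;
  }
  const correct = samples.filter((s) => s.correct).length;
  return {
    total: samples.length,
    correct,
    rate: correct / samples.length, // overall accuracy, e.g. 267/298
    byDevice,
  };
}
```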

Interpretation: We think these results are quite remarkable. First of all, there is a very obvious difference between devices. While this possibly has to do with screen size, it might also be related to touchscreen quality: Samsung (and, according to some of our test users, Apple) apparently know that people aim with the top of their thumb, and that the actual contact is a bit lower, which explains things like touchscreen calibration on the devices of those brands. The low-end phone we used doesn’t do anything like that, explaining the bad results on that device. This also explains why all of our errors were of the same nature: accidentally tapping the button below the one you intended to tap. Conclusion: people aim badly; some devices make up for that and some don’t. In either case, our app should work smoothly. A pretty obvious solution is upsizing the buttons again, or leaving a bit more margin between buttons, without exaggerating.
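For what it’s worth, that kind of compensation isn’t hard to sketch in a web app. Below is our own guess at what it could look like (the offset value and names are hypothetical; we don’t know what Samsung or Apple actually do under the hood):

```javascript
// Shift the registered touch point upward a few pixels, so the hit test uses
// the point the user aimed at (top of the thumb) instead of the actual,
// slightly lower contact point.
const TOUCH_OFFSET_PX = 8; // hypothetical value; would need tuning per device

function compensatedY(rawY) {
  return Math.max(0, rawY - TOUCH_OFFSET_PX); // never go above the screen edge
}

// In a touch handler, you would then hit-test with the corrected coordinate:
// const el = document.elementFromPoint(touch.clientX, compensatedY(touch.clientY));
```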

At first, we were a bit puzzled by the difference caused by position in the list. However, we think it has something to do with the fact that the contact surface between touch screen and thumb is smaller when touching the lower part of the screen. We made two pictures to explain the previous sentence:

Lower part of the screen

Upper part of the screen

As you can see, the contact area is bigger when touching higher parts of the screen, which is also where the “higher” parts of the list were located. It’s kinda hard to incorporate this in our app. Upsizing the buttons will obviously help anyway, and most of the newer smartphones already accommodate the errors made by human hands, so we don’t think we need to do a lot to fix this.

Finally, we don’t want to change anything about the difference between males and females. We’re fine with the fact that women are the more elegant creatures of our species. 😉 On the serious side, this was probably due to some random variation, and we don’t actually believe there is a significant difference in male/female usage of our app 🙂

Third Iteration: Joining and playing a game

Iteration 3 testing

Goal: In our previous iteration, we got the most obvious flaws out of the menus to create a game, using a small group of test subjects in a task-based think-aloud format. With this iteration, we’re doing the same thing, but for all the other screens of our app: the game-list (the screen you see after pressing join) and the game-screen itself. However, we already did an iteration on the game-screen, so instead of just looking for the major flaws, we’re hoping to investigate whether our clue helps people figure out how to win. As you might recall, we’re hoping to avoid a tutorial. In our first iteration people seemed to know what to do, but not how to win the game. We’re hoping that has changed.

Method: We are going to use the think-aloud principle again. The test will consist of two phases. In the first phase, people have to join a specific game. In order to make that a bit of a challenge, we’ll add some dummy games to the game-list. In the second phase, after they have joined the right game, we’ll pretend that they’re actually playing: “so now the prof says word x”… After each word, and at specific points of gameplay, we’ll ask them how they think they can win. We hope that 75% will know after seeing our “clue” (green borders around a word that would make them win), and 90% will know after completing the game.

Rationale: We’d like to know whether the search functionality in our game-list works as people expect it to, and what they think should be changed or added. For the game-screen part, we’re hoping to find evidence that people can figure out how to play and how to win by just playing the actual game.

Test subjects:  Because of the task-based think-aloud format, we’re again aiming at 10 to 15 people.


In a previous post, we described how we were going to test our menus to make a game. This post describes the results of that test iteration. As a reminder, here’s what we said we were going to do:

Goal: We want to test the ‘Start your own game’ functionality we created over the last weeks. The goal is to find any stupid issues that we missed ourselves or didn’t think of. As most of you will know, such issues always arise 😉 As we said before, we believe the ability to start a game efficiently is a crucial element of our game.

Method: We are going to use the think-aloud principle and have the users tell us all their frustrations or struggles. Before the test, we will explain to the test subject what the purpose of our ‘game’ is and what we will be testing. The task they have to perform is quite simple: “start a game with nine words about your favorite television program”.

Rationale: We thought about timing the process of starting a game, but that would be pointless because we wouldn’t have much of a reference. Furthermore, the time necessary to start a game depends on the ability of the user to come up with words 🙂 However, having users say what they think will automatically surface what they find annoying. We chose not to do a “full-scale” release, but to test with a smaller group of test subjects, which will be sufficient to find the main issues. That way, the larger public will get an improved version right from the start.

Test subjects: As pointed out, we’re not going to do a full-blown release. We want to make sure we solve the main issues before we try to reach a bigger public. A group of 10 people, preferably of different ages, will probably suffice to find the biggest/most obvious issues.

So, what are the results?

Test Subjects: we tested with 10 people, as planned. They were equally divided between female and male, with a variety of occupations and ages. The oldest person was 45, the youngest was 18. The other 8 were all in their twenties. The group also included four computer science students, a medicine student, a nursing student, an entrepreneur, a student of the master in management, a biochemistry student and a waiter.

Results: We’re going to list the most common observations/remarks that were made:

  • Six out of ten said that the input text field to add a word should clear after you add the word. This is pretty obvious, we know. We don’t know how that one slipped by us… Probably we didn’t really notice/mind because we mainly tested it on our computers (which is pretty dumb for a smartphone app).
  • There was a problem with the number of words people had to add. We asked them to add nine, and they had to count manually. Seven out of ten were obviously slowed down by this, but didn’t mention it themselves. Three of these seven actually made a mistake and submitted a game with either 8 or 10 words. The three other people had the same problem, but did mention it.
  • Four out of ten people said some labels or buttons were too small to read. Related to that, four out of ten (not the same four, btw) subjects accidentally removed the wrong word when they were asked to remove a specific word.
  • Two people said they were surprised they could choose the number of words: we asked them to add nine, but there was no such restriction in the actual app. Those people remarked that a game of seven words can be won with only one word…
  • One person was confused by the label “lecture subject”.
  • One person tapped the word itself when asked to remove it, instead of the red x at the right side.

Apart from these issues, everything appeared to be easy to comprehend 🙂

Interpretation: The overall flow of the game-making process was certainly okay. The first issue is easily solvable. We also need to add a counter to the add words screen, and maybe we have to place some restrictions on the number of words. We’re inclined to let users choose between 4, 9 or 16. We’ll have to size up the font and perhaps some buttons (especially those to remove a word), and we noticed that Apple’s Safari does really weird things with our font family. Since the last two issues were raised by a minority, we’re assuming they won’t be a problem in the future.
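The counter and the word-count restriction boil down to very little code; here’s a sketch of what we have in mind (our own draft, not yet in the app; the names are ours):

```javascript
// Only square boards make sense for bingo, so restrict the word count.
const ALLOWED_SIZES = [4, 9, 16]; // 2×2, 3×3 and 4×4 boards

// Live counter shown on the add words screen.
function counterLabel(words) {
  return `${words.length} word${words.length === 1 ? "" : "s"} added`;
}

// The "start game" button would only be enabled for an allowed size.
function canStartGame(words) {
  return ALLOWED_SIZES.includes(words.length);
}
```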

What’s next: The minor tweaks described above should finish most of the game-making process. The only thing we think we might have to test again is removing words: the buttons to do so should be big enough, without taking up more screen real estate than necessary. We’re thinking of doing this with an accuracy test of some kind…