03. User Interface Evaluation

Designers have complete knowledge of their interface and hence are uniquely unqualified to assess its usability.

This makes them blind to the mismatch between the user's model and the designer's model. To find these mismatches, it is important to record realistic interactions; simple observation is insufficient.

Designers must mistrust their interfaces; what is difficult for a user may be obvious to them.

“Think Aloud” Evaluation

Prompt subjects to verbalize their thoughts as they work through the system:

It is hard to talk and concentrate on the task at the same time, so you may get a lot of incomprehensible mumbling; the facilitator must give good and continual prompts to the subject.

Apart from the prompts, communication should be one-way, from the subject; otherwise you will pollute the user's model.

It is also likely to be very uncomfortable, unpleasant and difficult for the subjects - do your best to make them comfortable.

Cooperative Evaluation

A variation of “think aloud”. In “think aloud” it feels as if the user is being studied; in cooperative evaluation, two subjects study the system together, with natural two-way communication.

Sometimes, one of the subjects is a confederate - someone involved with the system.

The two subjects work together to solve the problem. It is more comfortable for the subjects, and comments about failures of the system emerge much more naturally.

Interviews

The more obvious the technique appears, the less preparation designers intuitively think they need: designing good interviews (and questionnaires) is difficult and expensive in terms of time for both designers and users.

Interviews are flexible: plan a central set of questions for consistency between interviewees and to focus the interview, but still be willing to explore interesting leads.

Questionnaires

Expensive to prepare but cheap to administer - evaluator not required.

NB: ~20% response rate.

Questionnaires can give quantitative data (e.g. 30% of users did xyz) and qualitative data (e.g. why did you like x?). Question types:

Questionnaires are over-determined user interfaces - a badly-designed question may ‘box in’ the user. Hence, when designing questions:

Continuous Evaluation

Monitoring actual system use:

Crowd-Sourced Experiments

Mechanical Turk et al.:

Formal Empirical Evaluation

When you want to see how a small number of competing solutions perform.

This requires strict, statistically testable hypotheses: one solution is better/worse, or there is no evidence of a difference.

Measure the participants’ response to manipulation of experimental conditions.

The results should be repeatable, so the experimental methods must be defined rigorously; this makes such experiments time-consuming and expensive.

Ethics

Testing can be distressing.

As an experimenter you care about overall results, not individual ones, but if a subject makes a mistake, it can make them feel embarrassed and inadequate, especially if other subjects can see what they are doing.

Treat subjects with respect; at the very least, ensure the experience is not negative.

Before the test:

During the test:

Controlled Experiments

Characteristics:

Research Questions

Congratulations! You have invented ABC. Now you need a research question/hypothesis:

Most research questions are comparative:

Null Hypothesis Significance Testing (NHST):

$t = \frac{\hat{\mu}}{\hat{\sigma}/\sqrt{n}}$

Where:

We want to increase the signal-to-noise ratio, so we need to reduce the denominator:
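As an illustration (not part of the original notes), the t-ratio $\hat{\mu}/(\hat{\sigma}/\sqrt{n})$ can be computed directly from a sample of paired differences; the data and function name below are hypothetical:

```python
import math
import statistics

def t_ratio(diffs):
    """t = mean / (sd / sqrt(n)): signal (mean effect) over noise (std error)."""
    n = len(diffs)
    mu_hat = statistics.mean(diffs)
    sigma_hat = statistics.stdev(diffs)  # sample standard deviation
    return mu_hat / (sigma_hat / math.sqrt(n))

# Hypothetical paired differences (condition B minus condition A, in seconds)
diffs = [0.4, 0.1, 0.3, 0.5, 0.2, 0.4]
print(round(t_ratio(diffs), 2))
```

Note how the denominator shrinks as $n$ grows or as the measurements become less noisy, which is exactly the signal-to-noise improvement described above.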

Aside - the ‘file drawer’ effect:

Internal vs external validity:

Multiple experiments, some with high internal validity and others with high external validity, can be combined to overcome the shortcomings of both.

Be careful in generalizing conclusions:

Point analysis versus depth/theory/model:

Experimental Terminology

Independent variables:

Dependent variables:

Within vs. between subjects:

Counterbalancing:
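Counterbalancing varies the order in which conditions are presented so practice and fatigue effects cancel out across participants. A minimal sketch using a simple cyclic Latin square (the function name and example conditions are my own):

```python
def latin_square_orders(conditions):
    """Each participant's row is a cyclic rotation of the condition list,
    so every condition appears once in every serial position."""
    k = len(conditions)
    return [[conditions[(i + j) % k] for j in range(k)] for i in range(k)]

for row in latin_square_orders(["A", "B", "C"]):
    print(row)
```

Assigning participants to rows in rotation means no condition is systematically advantaged by always coming first or last.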

                      Tiny
Population --------> Sample
                    + noise
  ^                     |
  | Inference           |
  | about the           |
  | population          |
  |                     |
   ---- Statistics <-----

Data Analysis

T-Test

Determines whether two samples are likely to be from different populations.

Paired T-Test (within subjects): each participant is tested under both conditions.

Unpaired T-Test (between subjects): independent samples; each participant is only tested under one condition.

data <- read.table('filename', header=TRUE)

t.test(data$conditionA, data$conditionB, paired=TRUE)
# Use paired=TRUE for within-subjects designs (values on each row must
# belong to the same participant) and paired=FALSE for between-subjects

# t-ratio: signal to noise; the bigger the absolute value, the better
# p-value: reject the null hypothesis if p is less than $\alpha = 0.05$

Pairing makes a lot of additional information available, dramatically increasing sensitivity: the t-ratio will usually be much larger and the p-value smaller.
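The sensitivity gain from pairing can be shown numerically. The stdlib-only Python sketch below (hypothetical data; helper names are mine) compares the paired t-ratio against a Welch-style unpaired t-ratio on the same numbers:

```python
import math
import statistics

def paired_t(a, b):
    """Paired t-ratio: computed on per-participant differences."""
    d = [x - y for x, y in zip(a, b)]
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))

def unpaired_t(a, b):
    """Welch-style t-ratio for two independent samples."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va / na + vb / nb)

# Hypothetical task times (seconds): participants vary a lot between
# themselves, but each one is consistently ~1s slower under condition B.
condA = [10.1, 14.8, 8.9, 20.2, 12.4]
condB = [11.0, 15.9, 10.1, 21.1, 13.5]
print(round(paired_t(condA, condB), 2), round(unpaired_t(condA, condB), 2))
```

Because pairing subtracts out each participant's baseline speed, the between-participant variability disappears from the denominator and the t-ratio is far larger in magnitude.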

Correlation: Relating Datasets

Determining the strength of the relationship between variables (e.g. are typing and pointing speed correlated?).

Many different models available (e.g. linear, power, exponential), but always look at the graph to see if the model fits.

Common models:

Remember that correlation does not mean causation.
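For illustration, a Pearson correlation coefficient can be computed by hand in a few lines; the data and helper name below are hypothetical:

```python
import math
import statistics

def pearson_r(x, y):
    """Pearson correlation: covariance scaled into [-1, 1]."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x)
                           * sum((b - my) ** 2 for b in y))

# Hypothetical: typing speed (wpm) vs pointing speed (targets/min)
typing = [30, 45, 50, 62, 75]
pointing = [18, 22, 24, 27, 31]
print(round(pearson_r(typing, pointing), 3))
```

A single $r$ value can hide very different shapes (the classic Anscombe's quartet example), which is why the notes insist on looking at the graph before trusting the model.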

Regression: Relating Datasets

Predicting one value from another.

Line of best fit:
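A least-squares line of best fit can be derived from the same covariance and variance quantities used for correlation. A stdlib-only sketch with hypothetical data (helper name mine):

```python
import statistics

def fit_line(x, y):
    """Least-squares line of best fit: returns (slope, intercept)
    for y = slope * x + intercept."""
    mx, my = statistics.mean(x), statistics.mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

# Hypothetical: predict pointing speed (targets/min) from typing speed (wpm)
x = [30, 45, 50, 62, 75]
y = [18, 22, 24, 27, 31]
slope, intercept = fit_line(x, y)
print(round(slope, 3), round(intercept, 2))
```

Prediction is then just `slope * new_x + intercept`; as with correlation, plot the data first to check a straight line is actually the right model.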

Analysis of Variance (ANOVA)

T-tests allow us to compare two samples with different values for an independent variable. But what if the independent variable (factor) can take on more than two values?

We could simply exhaustively compare all pairs, but if the IV can take on $n$ values, there will be $\frac{n(n - 1)}{2}$ comparisons. Each comparison may find a statistically significant difference by chance (a Type I error), so as $n$ increases, the chance of falsely finding at least one statistically significant difference between pairs increases quadratically.
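This blow-up can be made concrete: assuming the pairwise comparisons are independent, each run at $\alpha = 0.05$, the chance of at least one Type I error is $1 - (1 - \alpha)^{n(n-1)/2}$. The function below is illustrative:

```python
def familywise_error(n_levels, alpha=0.05):
    """Chance of at least one false positive across all pairwise t-tests,
    assuming independent comparisons at significance level alpha."""
    m = n_levels * (n_levels - 1) // 2  # number of pairwise comparisons
    return 1 - (1 - alpha) ** m

for n in (2, 3, 5, 8):
    print(n, round(familywise_error(n), 2))
```

With 5 levels the familywise error rate is already around 40%, and with 8 levels it exceeds 75%, which is why ANOVA tests all conditions in a single omnibus hypothesis instead.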

ANOVA supports factors with more than two levels and handles multiple factors, while reducing the risk of incorrectly rejecting the null hypothesis by asking if all conditions come from the same population: $H_0: \mu_1 = \mu_2 = \dots = \mu_n$. Invert this to see if at least one condition is different.

If there is only one factor (independent variable), it is called one-way ANOVA. Factors can be either within or between subjects (although a single factor cannot be both).
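As a sketch of what one-way ANOVA computes, its F ratio is the between-group mean square divided by the within-group mean square; large F means the condition means differ by more than the within-condition noise predicts. A stdlib-only Python illustration with hypothetical data:

```python
import statistics

def f_statistic(*groups):
    """One-way ANOVA F ratio: between-group variance over within-group variance."""
    k = len(groups)                       # number of factor levels
    n = sum(len(g) for g in groups)       # total observations
    grand = statistics.mean([x for g in groups for x in g])
    ss_between = sum(len(g) * (statistics.mean(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((x - statistics.mean(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical task times under three interface variants
a = [12.0, 11.5, 13.0, 12.5]
b = [14.0, 13.5, 15.0, 14.5]
c = [12.2, 11.8, 13.1, 12.9]
print(round(f_statistic(a, b, c), 1))
```

In practice you would compare F against the F-distribution (e.g. R's `aov` or SciPy's `f_oneway`) to get a p-value; this sketch only shows where the ratio comes from.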