Designers have complete knowledge of their interface and hence are uniquely unqualified to assess its usability.
This makes them blind to mismatches between the user’s model and the designer’s model. To find these mismatches, it is important to record realistic interactions; simple observation is insufficient.
Designers must mistrust their interfaces; what is difficult for a user may be obvious to them.
“Think Aloud” Evaluation
Prompt subjects to verbalize their thoughts as they work through the system:
- What they are trying to do
- Why they did the action they did
- How they interpret feedback from the systems
It is hard to talk and concentrate on the task at the same time - you may get a lot of incomprehensible mumbling so the facilitator must ensure they give good and continual prompts to the user.
Apart from the prompts, it should be one-way communication from the subject - otherwise, you will pollute the user’s model.
It is also likely to be very uncomfortable, unpleasant and difficult for the subjects - do your best to make them comfortable.
Cooperative Evaluation
A variation of “think aloud”. In “think aloud”, it feels as if the user is being studied; in cooperative evaluation, two subjects study the system together, with natural two-way communication.
Sometimes, one of the subjects is a confederate - someone involved with the system.
The two subjects work together to solve the problem. It is more comfortable for the subjects, and comments about failures of the system emerge much more naturally.
Interviews
The more obvious the technique appears, the less preparation designers intuitively think they need to put into it: designing good interviews (and questionnaires) is difficult, and they are expensive in terms of time for both the designers and the users.
Interviews are:
- Good for probing particular issues
- Can lead to constructive suggestions
- Prone to post-hoc rationalization
Plan a central set of questions in advance, both for consistency between interviewees and to focus the interview, but still be willing to explore interesting leads.
Questionnaires
Expensive to prepare but cheap to administer - evaluator not required.
NB: ~20% response rate.
Questionnaires can give quantitative data (e.g. 30% of users did xyz) and qualitative data (e.g. why did you like x?). Question types:
- Open-ended comments give important insights
- Closed questions restrict responses and give quantitative data - make sure there is no ambiguity in the options
- Likert items: level of agreement with a statement
- Ranked choice questions are good for forcing comparisons
- e.g. ‘Was A better than B?’ is preferred over asking ‘How much did you like A?’ and ‘How much did you like B?’ together; the comparison derived from the latter often contains a lot of noise
Questionnaires are over-determined user interfaces - a badly-designed question may ‘box in’ the user. Hence, when designing questions:
- What purpose does the question serve? What information are you hoping to get?
- Know how you will analyze the results
- For each quantitative question, consider adding a qualitative one asking why they picked the result
- Iterate
- Know the dissemination method
Continuous Evaluation
Monitoring actual system use:
- Field studies
- The design team goes to the users and sees whether they use the system as expected
- Diary studies
- Users write out a few lines describing their experience with the system over the last few hours
- Logging and ‘Customer Experience Programs’
- LOG EVERYTHING!
- Exploratory questions: hope something interesting shows up
- Difficult to analyze
- Aside: in controlled experiments, log everything (until the point at which it slows down the UI)
- Targeted data collection
- How often are specific features used?
- Characterize their activities
- LOG EVERYTHING!
- User feedback and gripe lines
Crowd-Sourced Experiments
Mechanical turk et al.:
- Workers complete ‘Human intelligence tasks’
- They have a HIT approval rating that can be used for filtering
- Problems with noisy data and criteria for exclusion
- Include ‘attention check’ questions
- A significant proportion of ‘workers’ are bots
- Valuable during COVID, when face-to-face studies were not possible
Formal Empirical Evaluation
When you want to see how a small number of competing solutions perform.
This requires strict, statistically testable hypotheses: one option is better/worse than another, or there is no evidence of a difference.
Measure the participants’ response to manipulation of experimental conditions.
The results should be repeatable, so the experimental methods must be defined rigorously; however, such experiments are time-consuming and expensive.
Ethics
Testing can be distressing.
As an experimenter you care about overall results, not individual ones, but if a subject makes a mistake it can make them feel embarrassed and inadequate, especially if other subjects can see what they are doing.
Treat subjects with respect; at the very least, ensure the experience is not negative.
Before the test:
- Don’t waste their time; use pilots to debug experiments/questionnaires and ensure everything is ready when they arrive
- Make them comfortable
- Emphasize that the system, not the user, is being tested
- Let them know they can stop at any time
- Privacy: let them know individual test results will be confidential
- Inform: explain what is being monitored and answer their questions
- Only use volunteers: informed consent form required
During the test:
- Make them comfortable
- Relaxed atmosphere
- Never indicate displeasure with the subject’s performance
- Avoid disruptions
- Stop the test if it becomes too unpleasant
- Privacy: do not allow management to observe the test
Controlled Experiments
Characteristics:
- Lucid and testable hypothesis
- Know exactly why you are conducting it and what data you are hoping to get out of it to expose the success/failure of the hypothesis
- Quantitative measurements
- Measure of confidence in results (statistics)
- Is A > B, A < B, or is there no discernible difference?
- Does the experiment successfully discriminate between outcomes?
- Replicability
- Control of variables and conditions
- Removal of experimenter bias; ensure it is objective
Research Questions
Congratulations! You have invented ABC. Now you need a research question/hypothesis:
- Let’s do a user study of ABC because it’s required for my PhD
- Is ABC any good?
- Does ABC beat the competition?
- Is ABC faster than the competition?
- Is ABC faster than XYZ after 10 minutes of use?
- Is ABC faster and less error prone than XYZ after 10 minutes of use?
Most research questions are comparative:
- Is it faster, more accurate, preferred etc. (in relation to the baseline(s))
- Is there a difference when compared to the baseline?
- How big is the difference (and is it a practical difference)?
- How likely is it that the results were due to chance?
Null Hypothesis Significance Testing (NHST):
- Widely used set of techniques for dichotomous testing
- The hypothesis should be expressed as a negative (e.g. XYZ is not faster than the baseline, ABC)
- $H_0: \mu_1 = \mu_2$ (average performance of 1 and 2 are the same)
- $H_1: \mu_1 \neq \mu_2$
- Reject the null hypothesis, $H_0$, when $p < \alpha$ (a minimal R sketch of this decision rule follows this list)
  - Given that the null hypothesis is true, the probability of observing data as extreme as what we saw should be very low ($p < \alpha$)
  - $\alpha$ is usually $0.05$
- Failure to reject does not mean that ‘they are the same’
- It could be that they are the same or that the experiment was not sensitive enough (e.g. too few participants)
- Reject or fail to reject; never accept the null hypothesis
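As a minimal sketch of this decision rule in R (the two interface names, the sample size of 20 and the simulated effect are invented purely for illustration):

set.seed(1)
abc <- rnorm(20, mean = 30, sd = 5)   # hypothetical completion times (s) for ABC
xyz <- rnorm(20, mean = 34, sd = 5)   # hypothetical completion times (s) for XYZ
result <- t.test(abc, xyz)            # H0: the two means are equal
if (result$p.value < 0.05) {
  print("Reject the null hypothesis")
} else {
  print("Fail to reject the null hypothesis (NOT 'they are the same')")
}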
The t-statistic is a signal-to-noise ratio, $t = \frac{\bar{x}_1 - \bar{x}_2}{s / \sqrt{n}}$, where:
- $\bar{x}_1 - \bar{x}_2$ is the signal: the magnitude of difference between the means
- $s$ is the standard deviation (the noise)
- $n$ is the number of data points
We want to increase the signal-to-noise ratio, so we need to reduce the denominator:
- Reduce $s$:
  - Better training:
    - There will be a large amount of variance in the first few trials (power law of learning)
    - If you only care about performance in proficient users, the first few trials are just noise
    - Hence, more training will get the participants out of this area, reducing noise
  - Outlier removal
  - Log transformation (both of these are sketched in R below)
- Increase $n$:
  - Diminishing returns due to the square root
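As a small illustration of the $s$-reducing steps above, here is a hedged R sketch; the trial times and the 2-standard-deviation cutoff are arbitrary choices for the example, not values from these notes:

times <- c(1.2, 1.4, 1.3, 1.5, 6.8, 1.1, 1.6)               # hypothetical trial times (s) with one slow outlier
trimmed <- times[abs(times - mean(times)) < 2 * sd(times)]  # outlier removal (2 SD is just one common cutoff)
logged <- log(trimmed)                                      # log transformation compresses the long right tail of time data
sd(times); sd(trimmed)                                      # the single outlier accounts for most of the noise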
Aside - the ‘file drawer’ effect:
- ‘Unsuccessful’ experiments (those that fail to reject the null hypothesis) are ‘uninteresting’ and tend to go unpublished
- Survivorship bias: 19 studies correctly failing to reject the null hypothesis go unpublished, while the one study that (by chance) claims a significant effect gets published
- Do enough experiments and some will get lucky (simulated below)
- https://xkcd.com/882
- https://xkcd.com/1478
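A quick R simulation of that ‘get lucky’ effect (the sample size of 20 per group is arbitrary): with no true difference at all, about 1 in 20 experiments still comes out ‘significant’ at $\alpha = 0.05$.

set.seed(42)
# 1000 experiments comparing two groups drawn from the SAME population (no real effect)
p_values <- replicate(1000, t.test(rnorm(20), rnorm(20))$p.value)
mean(p_values < 0.05)   # roughly 0.05: ~5% of null experiments 'succeed' by chance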
Internal vs external validity:
- External validity:
- Broad truth of the interface: if people used ABC, would the world be better?
- Findings are broad/real (e.g. is ABC any good?)
- Makes the world better
- Internal validity:
- Precise and replicable, but gets away from the fundamental truth we are trying to get to
- Findings are valid under specific circumstances that may not reflect real world usage by the general population
- e.g. valid for undergraduate psychology students at UC
Multiple experiments, some with high internal validity and others with high external validity, can be combined to overcome the shortcomings of both.
Be careful in generalizing conclusions:
- ABC was better than XYZ; ensure you identify the right cause for the improvement
- When generalizing, identify the human factor underlying the difference and rephrase the research question around that human factor
- e.g. list of bookmarks vs 3D, spatial layout where all the items were shown at once
- Can’t conclude that 3D is better than 2D; would need to compare against a 2D, spatial layout rather than a list
Point analysis versus depth/theory/model:
- Identify and include salient secondary factors: is the result generally true or only true under the tested conditions?
Experimental Terminology
Independent variables:
- Controlled conditions
- Manipulated independent of behavior
- May arise from participant classification
- e.g. male/female
- Discrete values: independent variable levels
- Called ‘Factors’ in ANOVA
Dependent variables:
- Measured variables
- Dependent on participant’s response to manipulation of IVs
Within vs. between subjects:
- How IVs are administered between/within subjects
- Within subjects: each participant tested on all levels
- Use this whenever you can
- Participants act as a control for their own variability (some people are just fast, some are just slow)
- Can measure relative performance for each subject
- Fewer participants required
- But need to account for learning/fatigue effects
- Every participant must be tested on every single level; otherwise their data must be thrown out
- Between subjects: each participant tested on a single level
- Sometimes necessary if using participant classification (e.g. male/female)
- Variability between participants is left uncontrolled
- Don’t mix within and between subject treatments within a single factor
Counterbalancing:
- When using within-subjects designs, the order of exposure must be varied to control for learning/fatigue effects (a simple rotation is sketched below)
- Participants divided into groups; different order for each group
- Group becomes a between subjects factor
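A minimal sketch of such an ordering in R, using a simple rotation (condition names are hypothetical; fully balancing immediate carry-over effects needs a balanced Latin square, which is not shown here):

conditions <- c("A", "B", "C")
n <- length(conditions)
# Each row is the presentation order for one group: rotate the conditions by one position per group
orders <- t(sapply(0:(n - 1), function(shift) conditions[((0:(n - 1) + shift) %% n) + 1]))
orders   # group 1: A B C, group 2: B C A, group 3: C A B; 'group' becomes a between-subjects factor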
A tiny, noisy sample is drawn from the population; statistics computed on the sample are used to make inferences about the population:

Population --(+ noise)--> Tiny sample
    ^                          |
    | inference about          |
    | the population           v
    +-------- Statistics <-----+
Data Analysis
T-Test
Determines whether two samples are likely to be from different populations.
Paired T-Test (within subjects): each participant is tested under both conditions.
Unpaired T-Test (between subjects): independent samples; each participant is only tested under one condition.
# Each row is one participant (for paired data); one column per condition
data <- read.table('filename', header=TRUE)
# paired=TRUE for within-subjects, paired=FALSE for between-subjects (independent samples)
t.test(data$conditionA, data$conditionB, paired=TRUE)
# If paired=TRUE, the values on each row must belong to the same participant
# t-ratio: signal to noise; the bigger the absolute value, the better
# p-value: reject the null hypothesis if p is less than $\alpha = 0.05$
Lots of additional information available through pairing, dramatically increasing sensitivity: t-ratio will usually be much larger and p-value smaller.
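A hedged R sketch of that sensitivity gain, analysing the same simulated data both ways (the per-participant speeds and the 0.3 s effect are invented numbers):

set.seed(7)
base <- rnorm(12, mean = 5, sd = 1.5)           # each participant's own overall speed
conditionA <- base + rnorm(12, sd = 0.2)        # times under condition A
conditionB <- base + 0.3 + rnorm(12, sd = 0.2)  # condition B is ~0.3 s slower
t.test(conditionA, conditionB, paired = TRUE)$p.value   # pairing cancels out per-participant variability
t.test(conditionA, conditionB, paired = FALSE)$p.value  # unpaired: participant differences count as noise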
Correlation: Relating Datasets
Determining the strength of the relationship between variables (e.g. is typing and pointing speed correlated?).
Many different models available (e.g. linear, power, exponential), but always look at the graph to see if the model fits.
Common models:
- Pearson’s $r$ (for linear correlation)
  - Correlation coefficient between -1 and 1
  - Cohen’s rule of thumb:
    - 0.1 - 0.3 is ‘small’
    - 0.3 - 0.5 is ‘medium’
    - 0.5 - 1.0 is ‘large’
- Spearman’s $\rho$ (for ranked data); both are sketched with cor.test below
Remember that correlation does not mean causation.
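A short R sketch of both correlation measures using cor.test (the typing and pointing data are simulated here, not real measurements):

set.seed(3)
typing <- rnorm(30, mean = 60, sd = 10)          # hypothetical typing speeds (wpm)
pointing <- 0.5 * typing + rnorm(30, sd = 8)     # hypothetical pointing speeds, loosely related to typing
plot(typing, pointing)                           # always look at the graph before trusting a model
cor.test(typing, pointing, method = "pearson")   # Pearson's r: linear correlation
cor.test(typing, pointing, method = "spearman")  # Spearman's rho: rank correlation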
Regression: Relating Datasets
Predicting one value from another.
Line of best fit:
- Linear
- $R^2$: coefficient of determination
  - (same $r$ as Pearson’s, but upper case for some reason)
  - Between 0 and 1
  - Proportion of variability explained by the model
  - A value of 0.8 or larger is good for human performance
  - Fitts’ law experiments usually give values around 0.95 (a small lm() sketch follows this list)
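A small R sketch of a linear fit and its $R^2$; the Fitts’-law-style numbers below are made up for illustration:

id <- c(1, 2, 3, 4, 5, 6)                     # hypothetical index of difficulty (bits)
mt <- c(0.41, 0.58, 0.74, 0.93, 1.07, 1.26)   # hypothetical movement times (s)
fit <- lm(mt ~ id)                            # line of best fit
summary(fit)$r.squared                        # coefficient of determination (0..1)
plot(id, mt); abline(fit)                     # check the fit visually, not just the number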
Analysis of Variance (ANOVA)
T-tests allow us to compare two samples that differ in the value of a single independent variable. But what if the independent variable (factor) can take on more than two values?
We could simply compare all pairs exhaustively, but if the IV can take on many levels the number of pairwise comparisons grows quickly, and each extra test adds another chance of incorrectly rejecting the null hypothesis by luck.
ANOVA supports factors with more than two levels, and handles multiple factors, while reducing the risk of incorrectly rejecting the null hypothesis by asking whether all conditions come from the same population:
If there is only one factor (independent variable), it is called one-way ANOVA. Factors can be either within or between subjects (although you cannot mix both within a single factor).
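A minimal one-way ANOVA sketch in R; the long-format data frame and its column names are invented for the example:

d <- data.frame(
  time = c(1.2, 1.4, 1.1, 1.6, 1.8, 1.7, 2.1, 2.3, 2.0),        # measured times (DV)
  layout = factor(rep(c("list", "grid", "spatial"), each = 3))  # one factor with three levels
)
fit <- aov(time ~ layout, data = d)   # one factor => one-way ANOVA (between subjects here)
summary(fit)                          # F-ratio and p-value: are all three levels from the same population?
# A within-subjects factor would need an Error(participant/layout) term in the formula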