Designers have complete knowledge of their interface and hence are uniquely unqualified to assess its usability.
This makes them blind to mismatches between the user’s model and the designer’s model. To find these mismatches, it is important to record realistic interactions; simple observation is insufficient.
Designers must mistrust their interfaces; what is difficult for a user may be obvious to them.
“Think Aloud” Evaluation
Prompt subjects to verbalize their thoughts as they work through the system:
- What they are trying to do
- Why they did the action they did
- How they interpret feedback from the systems
It is hard to talk and concentrate on the task at the same time - you may get a lot of incomprehensible mumbling so the facilitator must ensure they give good and continual prompts to the user.
Apart from the prompts, it should be one-way communication from the subject - otherwise, you will pollute the user’s model.
It is also likely to be very uncomfortable, unpleasant and difficult for the subjects - do your best to make them comfortable.
Cooperative Evaluation
A variation of “think aloud”. In “think aloud”, it feels as if the user is being studied; in cooperative evaluation, two subjects study the system together, with natural two-way communication.
Sometimes, one of the subjects is a confederate - someone involved with the system.
The two subjects work together to solve the problem. It is more comfortable for the subjects, and comments about failures of the system emerge much more naturally.
Interviews
The more obvious the technique appears, the less preparation designers intuitively think they need to put into it: designing good interviews (and questionnaires) is difficult, and they are expensive in terms of time for both the designers and the users.
Interviews are:
- Good for probing particular issues
- Can lead to constructive suggestions
- Prone to post-hoc rationalization
Plan a central set of questions in advance, both for consistency between interviewees and to focus the interview, but still be willing to explore interesting leads.
Questionnaires
Expensive to prepare but cheap to administer - evaluator not required.
NB: ~20% response rate.
Questionnaires can give quantitative data (e.g. 30% of users did xyz) and qualitative data (e.g. why did you like x?). Question types:
- Open-ended comments give important insights
- Closed questions restrict responses and give quantitative data - make sure there is no ambiguity in the options
- Likert items: level of agreement with a statement
- Ranked choice questions are good for forcing comparisons
- e.g. ‘Was A better than B?’ is preferred over asking ‘How much did you like A?’ and ‘How much did you like B?’ together; the comparison derived from the latter often contains a lot of noise
Questionnaires are over-determined user interfaces - a badly-designed question may ‘box in’ the user. Hence, when designing questions:
- What purpose does the question serve? What information are you hoping to get?
- Know how you will analyze the results
- For each quantitative question, consider adding a qualitative one asking why they picked the result
- Iterate
- Know the dissemination method
Continuous Evaluation
Monitoring actual system use:
- Field studies
- The design team goes to the users and sees whether they use the system as expected
- Diary studies
- Users write out a few lines describing their experience with the system over the last few hours
- Logging and ‘Customer Experience Programs’
- LOG EVERYTHING!
- Exploratory questions: hope something interesting shows up
- Difficult to analyze
- Aside: in controlled experiments, log everything (until the point at which it slows down the UI)
- Targeted data collection
- How often are specific features used?
- Characterize their activities
- LOG EVERYTHING!
- User feedback and gripe lines
Crowd-Sourced Experiments
Mechanical turk et al.:
- Workers complete ‘Human intelligence tasks’
- They have a HIT approval rating that can be used for filtering
- Problems with noisy data and criteria for exclusion
- Include ‘attention check’ questions
- A significant proportion of ‘workers’ are bots
- Valuable during COVID, when face-to-face studies were not possible
Formal Empirical Evaluation
When you want to see how a small number of competing solutions perform.
This requires strict, statistically testable hypotheses: one option is better/worse than another, or there is no evidence of a difference.
Measure the participants’ response to manipulation of experimental conditions.
The results should be repeatable, so the experimental methods must be defined rigorously; however, such experiments are time-consuming and expensive.
Ethics
Testing can be distressing.
As an experimenter you care about overall results, not individual ones, but if a subject makes a mistake it can make them feel embarrassed and inadequate, especially if other subjects can see what they are doing.
Treat subjects with respect; at the very least, ensure the experience is not negative.
Before the test:
- Don’t waste their time; use pilots to debug experiments/questionnaires and ensure everything is ready when they arrive
- Make them comfortable
- Emphasize that the system, not the user, is being tested
- Let them know they can stop at any time
- Privacy: let them know individual test results will be confidential
- Inform: explain what is being monitored and answer their questions
- Only use volunteers: informed consent form required
During the test:
- Make them comfortable
- Relaxed atmosphere
- Never indicate displeasure with the subject’s performance
- Avoid disruptions
- Stop the test if it becomes too unpleasant
- Privacy: do not allow management to observe the test
Controlled Experiments
Characteristics:
- Lucid and testable hypothesis
- Know exactly why you are conducting it and what data you are hoping to get out of it to expose the success/failure of the hypothesis
- Quantitative measurements
- Measure of confidence in results (statistics)
- Is A > B, A < B, or is there no discernible difference?
- Does the experiment successfully discriminate between outcomes?
- Replicability
- Control of variables and conditions
- Removal of experimenter bias; ensure it is objective
Research Questions
Congratulations! You have invented ABC. Now you need a research question/hypothesis:
- Let’s do a user study of ABC because it’s required for my PhD
- Is ABC any good?
- Does ABC beat the competition?
- Is ABC faster than the competition?
- Is ABC faster than XYZ after 10 minutes of use?
- Is ABC faster and less error prone than XYZ after 10 minutes of use?
Most research questions are comparative:
- Is it faster, more accurate, preferred etc. (in relation to the baseline(s))
- Is there a difference when compared to the baseline?
- How big is the difference (and is it a practical difference)?
- How likely is it that the results were due to chance?
Null Hypothesis Significance Testing (NHST):
- Widely used set of techniques for dichotomous testing
- The hypothesis should be expressed as a negative (e.g. XYZ is not faster than the baseline, ABC)
- $H_0: \mu_1 = \mu_2$ (average performance of 1 and 2 are the same)
- $H_1: \mu_1 \neq \mu_2$
- Reject the null hypothesis, $H_0$, when $p < \alpha$ (a minimal R sketch of this decision rule follows this list)
  - Given that the null hypothesis is true, the probability of observing data as extreme as what we saw should be very low ($p < \alpha$)
  - $\alpha$ is usually $0.05$
- Failure to reject does not mean that ‘they are the same’
- It could be that they are the same or that the experiment was not sensitive enough (e.g. too few participants)
- Reject or fail to reject; never accept the null hypothesis
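As a minimal sketch of this decision rule in R (the two interface names, the sample size of 20 and the simulated effect are invented purely for illustration):

set.seed(1)
abc <- rnorm(20, mean = 30, sd = 5)   # hypothetical completion times (s) for ABC
xyz <- rnorm(20, mean = 34, sd = 5)   # hypothetical completion times (s) for XYZ
result <- t.test(abc, xyz)            # H0: the two means are equal
if (result$p.value < 0.05) {
  print("Reject the null hypothesis")
} else {
  print("Fail to reject the null hypothesis (NOT 'they are the same')")
}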
The t-statistic is a signal-to-noise ratio, $t = \frac{\bar{x}_1 - \bar{x}_2}{s / \sqrt{n}}$, where:
- $\bar{x}_1 - \bar{x}_2$ is the signal: the magnitude of difference between the means
- $s$ is the standard deviation (the noise)
- $n$ is the number of data points
We want to increase the signal-to-noise ratio, so we need to reduce the denominator:
- Reduce $s$:
  - Better training:
    - There will be a large amount of variance in the first few trials (power law of learning)
    - If you only care about performance in proficient users, the first few trials are just noise
    - Hence, more training will get the participants out of this area, reducing noise
  - Outlier removal
  - Log transformation (both of these are sketched in R below)
- Increase $n$:
  - Diminishing returns due to the square root
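As a small illustration of the $s$-reducing steps above, here is a hedged R sketch; the trial times and the 2-standard-deviation cutoff are arbitrary choices for the example, not values from these notes:

times <- c(1.2, 1.4, 1.3, 1.5, 6.8, 1.1, 1.6)               # hypothetical trial times (s) with one slow outlier
trimmed <- times[abs(times - mean(times)) < 2 * sd(times)]  # outlier removal (2 SD is just one common cutoff)
logged <- log(trimmed)                                      # log transformation compresses the long right tail of time data
sd(times); sd(trimmed)                                      # the single outlier accounts for most of the noise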
Aside - the ‘file drawer’ effect:
- ‘Unsuccessful’ experiments (those that fail to reject the null hypothesis) are ‘uninteresting’ and tend to go unpublished
- Survivorship bias: 19 studies correctly failing to reject the null hypothesis go unpublished, while the one study that (by chance) claims a significant effect gets published
- Do enough experiments and some will get lucky (simulated below)
- https://xkcd.com/882
- https://xkcd.com/1478
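A quick R simulation of that ‘get lucky’ effect (the sample size of 20 per group is arbitrary): with no true difference at all, about 1 in 20 experiments still comes out ‘significant’ at $\alpha = 0.05$.

set.seed(42)
# 1000 experiments comparing two groups drawn from the SAME population (no real effect)
p_values <- replicate(1000, t.test(rnorm(20), rnorm(20))$p.value)
mean(p_values < 0.05)   # roughly 0.05: ~5% of null experiments 'succeed' by chance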
Internal vs external validity:
- External validity:
- Broad truth of the interface: if people used ABC, would the world be better?
- Findings are broad/real (e.g. is ABC any good?)
- Makes the world better
- Internal validity:
- Precise and replicable, but gets away from the fundamental truth we are trying to get to
- Findings are valid under specific circumstances that may not reflect real world usage by the general population
- e.g. valid for undergraduate psychology students at UC
Multiple experiments, some with high internal validity and others with high external validity, can be combined to overcome the shortcomings of both.
Be careful in generalizing conclusions:
- ABC was better than XYZ; ensure you identify the right cause for the improvement
- When generalizing, identify the human factor underlying the difference and rephrase the research question around that human factor
- e.g. list of bookmarks vs 3D, spatial layout where all the items were shown at once
- Can’t conclude that 3D is better than 2D; would need to compare against a 2D, spatial layout rather than a list
Point analysis versus depth/theory/model:
- Identify and include salient secondary factors: is the result generally true or only true under the tested conditions?
Experimental Terminology
Independent variables:
- Controlled conditions
- Manipulated independent of behavior
- May arise from participant classification
- e.g. male/female
- Discrete values: independent variable levels
- Called ‘Factors’ in ANOVA
Dependent variables:
- Measured variables
- Dependent on participant’s response to manipulation of IVs
Within vs. between subjects:
- How IVs are administered between/within subjects
- Within subjects: each participant tested on all levels
- Use this whenever you can
- Participants act as a control for their own variability (some people are just fast, some are just slow)
- Can measure relative performance for each subject
- Fewer participants required
- But need to account for learning/fatigue effects
- Every participant must be tested on every single level; otherwise their data must be thrown out
- Between subjects: each participant tested on a single level
- Sometimes necessary if using participant classification (e.g. male/female)
- Variability between participants is left uncontrolled
- Don’t mix within and between subject treatments within a single factor
Counterbalancing:
- When using within-subjects designs, the order of exposure must be varied to control for learning/fatigue effects (a simple rotation is sketched below)
- Participants divided into groups; different order for each group
- Group becomes a between subjects factor
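A minimal sketch of such an ordering in R, using a simple rotation (condition names are hypothetical; fully balancing immediate carry-over effects needs a balanced Latin square, which is not shown here):

conditions <- c("A", "B", "C")
n <- length(conditions)
# Each row is the presentation order for one group: rotate the conditions by one position per group
orders <- t(sapply(0:(n - 1), function(shift) conditions[((0:(n - 1) + shift) %% n) + 1]))
orders   # group 1: A B C, group 2: B C A, group 3: C A B; 'group' becomes a between-subjects factor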
A tiny, noisy sample is drawn from the population; statistics computed on the sample are used to make inferences about the population:

Population --(+ noise)--> Tiny sample
    ^                          |
    | inference about          |
    | the population           v
    +-------- Statistics <-----+
Data Analysis
T-Test
Determines whether two samples are likely to be from different populations.
Paired T-Test (within subjects): each participant is tested under both conditions.
Unpaired T-Test (between subjects): independent samples; each participant is only tested under one condition.
# Each row is one participant (for paired data); one column per condition
data <- read.table('filename', header=TRUE)
# paired=TRUE for within-subjects, paired=FALSE for between-subjects (independent samples)
t.test(data$conditionA, data$conditionB, paired=TRUE)
# If paired=TRUE, the values on each row must belong to the same participant
# t-ratio: signal to noise; the bigger the absolute value, the better
# p-value: reject the null hypothesis if p is less than $\alpha = 0.05$
Lots of additional information available through pairing, dramatically increasing sensitivity: t-ratio will usually be much larger and p-value smaller.
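A hedged R sketch of that sensitivity gain, analysing the same simulated data both ways (the per-participant speeds and the 0.3 s effect are invented numbers):

set.seed(7)
base <- rnorm(12, mean = 5, sd = 1.5)           # each participant's own overall speed
conditionA <- base + rnorm(12, sd = 0.2)        # times under condition A
conditionB <- base + 0.3 + rnorm(12, sd = 0.2)  # condition B is ~0.3 s slower
t.test(conditionA, conditionB, paired = TRUE)$p.value   # pairing cancels out per-participant variability
t.test(conditionA, conditionB, paired = FALSE)$p.value  # unpaired: participant differences count as noise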
Correlation: Relating Datasets
Determining the strength of the relationship between variables (e.g. is typing and pointing speed correlated?).
Many different models available (e.g. linear, power, exponential), but always look at the graph to see if the model fits.
Common models:
- Pearson’s $r$ (for linear correlation)
  - Correlation coefficient between -1 and 1
  - Cohen’s rule of thumb:
    - 0.1 - 0.3 is ‘small’
    - 0.3 - 0.5 is ‘medium’
    - 0.5 - 1.0 is ‘large’
- Spearman’s $\rho$ (for ranked data); both are sketched with cor.test below
Remember that correlation does not mean causation.
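A short R sketch of both correlation measures using cor.test (the typing and pointing data are simulated here, not real measurements):

set.seed(3)
typing <- rnorm(30, mean = 60, sd = 10)          # hypothetical typing speeds (wpm)
pointing <- 0.5 * typing + rnorm(30, sd = 8)     # hypothetical pointing speeds, loosely related to typing
plot(typing, pointing)                           # always look at the graph before trusting a model
cor.test(typing, pointing, method = "pearson")   # Pearson's r: linear correlation
cor.test(typing, pointing, method = "spearman")  # Spearman's rho: rank correlation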
Regression: Relating Datasets
Predicting one value from another.
Line of best fit:
- Linear
- $R^2$: coefficient of determination
  - (same $r$ as Pearson’s, but upper case for some reason)
  - Between 0 and 1
  - Proportion of variability explained by the model
  - A value of 0.8 or larger is good for human performance
  - Fitts’ law experiments usually give values around 0.95 (a small lm() sketch follows this list)
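A small R sketch of a linear fit and its $R^2$; the Fitts’-law-style numbers below are made up for illustration:

id <- c(1, 2, 3, 4, 5, 6)                     # hypothetical index of difficulty (bits)
mt <- c(0.41, 0.58, 0.74, 0.93, 1.07, 1.26)   # hypothetical movement times (s)
fit <- lm(mt ~ id)                            # line of best fit
summary(fit)$r.squared                        # coefficient of determination (0..1)
plot(id, mt); abline(fit)                     # check the fit visually, not just the number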
Analysis of Variance (ANOVA)
T-tests allow us to compare two samples that differ in the value of a single independent variable. But what if the independent variable (factor) can take on more than two values?
We could simply compare all pairs exhaustively, but if the IV can take on many levels the number of pairwise comparisons grows quickly, and each extra test adds another chance of incorrectly rejecting the null hypothesis by luck.
ANOVA supports factors with more than two levels, and handles multiple factors, while reducing the risk of incorrectly rejecting the null hypothesis by asking whether all conditions come from the same population:
If there is only one factor (independent variable), it is called one-way ANOVA. Factors can be either within or between subjects (although you cannot mix both within a single factor).
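A minimal one-way ANOVA sketch in R; the long-format data frame and its column names are invented for the example:

d <- data.frame(
  time = c(1.2, 1.4, 1.1, 1.6, 1.8, 1.7, 2.1, 2.3, 2.0),        # measured times (DV)
  layout = factor(rep(c("list", "grid", "spatial"), each = 3))  # one factor with three levels
)
fit <- aov(time ~ layout, data = d)   # one factor => one-way ANOVA (between subjects here)
summary(fit)                          # F-ratio and p-value: are all three levels from the same population?
# A within-subjects factor would need an Error(participant/layout) term in the formula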