MTurk, or Amazon’s Mechanical Turk, is an online platform where people (called “requesters”) post various tasks (e.g., surveys, manual currency conversion, spreadsheet formatting, audio transcription) for other people (called “Turkers”, “workers”, or “participants”) to complete for a fee.
MTurk is very cheap ($0.80 or less per participant in a 10-min survey), very easy to use (a task usually can be created and posted in under half an hour of work), and very quick (thousands of results can come back in under a day).
However, MTurk has limitations: it does not recruit an ideally representative sample of the United States, and some Turkers may misrepresent themselves or spam answers randomly to earn money as quickly as possible without caring about results, giving you junk data that requires significant work before meaningful results can be drawn from it. Turkers often have prior exposure to many study materials, which can bias your sample, and the minority of Turkers with the most prior exposure (so-called “superturkers”) are the most likely to take any particular study. There’s also a significant coordination problem in ensuring that those studying animal advocacy use the large but limited Turker population carefully.
Who uses MTurk for research?
MTurk is used frequently by academics for a variety of original research (e.g., Lawson, Lenz, Baker, & Myers, 2010; Suri & Watts, 2011; Adkison, O'Connor, Chaiton, & Schwartz, 2015; Huber, Hill, & Lenz, 2012; Jonason, Icho, & Ireland, 2016; Frankowski, et al., 2016). Stewart, et al. (2015) report that “hundreds of studies” use MTurk, and Casey, et al. (2016) state that “the 300 most influential social science journals published over 500 papers that relied on Mechanical Turk data”.
MTurk has also been used by scientists to answer questions related to people’s views about animals (Rothgerber, 2014; Lyerly & Reeve, 2015) and by people directly in the animal advocacy community -- Vegan Outreach has used MTurk twice to study leaflet effectiveness (Norris, 2014; Norris & Roberts, 2016), and other studies by other animal advocacy groups are forthcoming.
What is the problem of representativeness?
Imagine a gum company considering selling blue or red bubblegum and wanting to know which color the United States (US) population prefers. Conducting an experiment on a college campus may give information about which color is preferred, but only among a limited (much younger and more educated) segment of the US population, not the entire US, and therefore could be misleading about how well the products would perform nationally.
This mismatch between those surveyed and the population of interest is a problem with the representativeness of the study. For example, an animal advocacy organization interested in assessing leaflet effectiveness may be interested in college populations because leaflets are usually handed out on college campuses -- college samples are likely much more representative than MTurk for this particular example. Similarly, online ads sometimes target females aged 13-25 (Mercy For Animals, 2016), which may mean a representative study of online ads would look solely at internet-using young women instead of the broader US population.
How representative is the MTurk population?
Ross, et al. (2010) found that only about 50% of Turkers were from the US, with an increasing share of workers from India, who often use MTurk as their full-time job. However, an MTurk demographic tracker provided by Ipeirotis (2010), which allows more in-depth monitoring, suggests a return toward US workers, at roughly 70% of the population. Still, care must be taken to restrict to a US-only sample for US research, which can be done through a combination of IP geo-targeting and asking participants their country of origin.
US Turkers have been noted to be less African American (5% on MTurk compared to 12% in the US) and more liberal (53% liberal on MTurk compared to 20% in the US) than the US population generally (Paolacci, Chandler, & Ipeirotis, 2010; Ross, et al., 2010; Burhmester, Kwang, & Gosling, 2011; Berinsky, Huber, & Lenz, 2012; Sides, 2012; Shapiro, Chandler, & Mueller, 2013; Huff & Tingley, 2015). The mean age was slightly younger than the US average, and education was slightly higher than average (Paolacci, Chandler, & Ipeirotis, 2010). Heterosexuality rates were also found to match the US average (Shapiro, Chandler, & Mueller, 2013, p. 216).
There is disagreement in the literature about whether US Turkers skew more male or female. Sides (2012) found that “the MTurk sample is younger, more male, poorer, and more highly educated than Americans generally,” and Rouse (2015) also found a male-heavy sample (56% male). This is also supported by the latest results from Ipeirotis (2010)’s demographic tracker. However, a number of other studies have found MTurk to be more female (Paolacci, Chandler, & Ipeirotis, 2010; Burhmester, Kwang, & Gosling, 2011; Berinsky, Huber, & Lenz, 2012; Shapiro, Chandler, & Mueller, 2013). Ross, et al. (2010) argue that there has been a demographic shift on MTurk toward a larger male population, though this does not line up with the publication years of the studies in this review, perhaps because of random noise.
One particular concern is that while the shape of MTurk’s income distribution matched the US population, MTurk was biased toward lower-income people -- 67% of Turkers report earning below $60K/yr, compared to 45% of a US internet population (Paolacci, Chandler, & Ipeirotis, 2010, p. 42). While Lusk (2014) finds that vegetarians tend to have higher incomes, it is unclear whether there is a relationship between income and vegetarianism, or whether MTurk’s income skew would introduce any notable bias. Furthermore, Vegetarian Research Group (2015) finds the opposite effect, stating that lower-income people are more likely to be vegetarian.
Kahan (2013) is particularly concerned that not only does MTurk fail to recruit a sufficient number of conservatives, it likely does not recruit representative conservatives -- even the conservatives who are recruited may be very unlike conservatives generally, due to having different motivations (and likely different incomes). However, this theoretical concern has not been borne out empirically -- Clifford, Jewell, and Waggoner (2015) found that conservatives recruited on MTurk matched those recruited by other sampling methods.
Another possible concern is that Turkers are much less likely than the US population to have ever been married and are significantly more likely to rent rather than own their home (Berinsky, Huber, & Lenz, 2012, p358). This should not be a severe detriment to animal advocacy research.
Lastly, an additional problem is that representativeness may not be consistent between MTurk studies, as samples can differ somewhat between different times of day and different points in a serial recruitment process without replacement (Casey, et al., 2016).
Overall, the research consensus is that US Turkers are noticeably more representative of the US population than a number of online sampling, college sampling, or convenience sampling methods (Paolacci, Chandler, & Ipeirotis, 2010; Burhmester, Kwang, & Gosling, 2011; Berinsky, Huber, & Lenz, 2012; Casler, Bickel, & Hackett, 2013; Huff & Tingley, 2015; Shapiro, Chandler, & Mueller, 2015; Rouse, 2015; Clifford, Jewell, & Waggoner, 2015), though Fort, Adda, and Cohen (2010), Sides (2012), and Kahan (2013) dissent.
Notably, many common psychological, economic, and political science experiments have been replicated successfully on MTurk (Paolacci, Chandler, & Ipeirotis, 2010; Horton, Rand, & Zeckhauser, 2011; Berinsky, Huber, & Lenz, 2012; Goodman, Cryder, & Cheema, 2012; Casler, Bickel, & Hackett, 2013).
MTurk is significantly less representative than professional surveys (e.g., the Annenberg National Election Survey) that attempt to collect representative samples at significantly higher cost, but MTurk is still very usable when funding is constrained.
What are the potential risks of Turkers in particular, beyond non-representativeness?
The largest concern people seem to have about MTurk, in my experience, is about data quality. Specifically, if participants are all paid to take surveys, what incentivizes them to answer accurately, and how do we know that they do? For example, Oppenheimer, Meyvis, & Davidenko (2009) found that online survey participants are often less attentive than those watched by experimenters in a lab, meaning they may pay less attention to the treatment and bias the experiment. Especially given that the payment is small and fixed, the only way Turkers can increase their pay per hour worked is to work faster.
However, MTurk has a built-in mechanism for dealing with this effect -- the MTurk platform can be used to “reject” Turkers who fail to complete catch trials (Paolacci, Chandler, & Ipeirotis, 2010), and it’s common for people to only recruit from Turkers with a >95% task acceptance rate (Berinsky, Huber, & Lenz, 2012; Peer, Vosgerau, & Acquisti, 2013). This provides an additional incentive for Turkers to maintain accuracy rates and not speed through surveys, lest they lose access to some MTurk revenue.
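As a concrete illustration, here is a minimal sketch (assuming the boto3 MTurk client, with a hypothetical external survey URL and placeholder title, reward, and sample size) of posting a HIT restricted to US-based workers with a ≥95% approval rate; the qualification type IDs are Amazon’s built-in system qualifications.

```python
import boto3

# Production endpoint; AWS credentials are assumed to be configured already.
mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester.us-east-1.amazonaws.com",
)

qualification_requirements = [
    {   # Built-in "Locale" qualification: restrict to US-based workers.
        "QualificationTypeId": "00000000000000000071",
        "Comparator": "EqualTo",
        "LocaleValues": [{"Country": "US"}],
        "ActionsGuarded": "DiscoverPreviewAndAccept",
    },
    {   # Built-in "percent assignments approved" qualification: >= 95%.
        "QualificationTypeId": "000000000000000000L0",
        "Comparator": "GreaterThanOrEqualTo",
        "IntegerValues": [95],
        "ActionsGuarded": "DiscoverPreviewAndAccept",
    },
]

# The survey itself lives at an external (hypothetical) URL shown in a frame.
external_question = """
<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://example.com/my-survey</ExternalURL>
  <FrameHeight>600</FrameHeight>
</ExternalQuestion>
"""

hit = mturk.create_hit(
    Title="10-minute opinion survey",
    Description="Share your opinions in a short 10-minute survey.",
    Keywords="survey, opinion, research",
    Reward="0.80",                        # USD per assignment, as a string
    MaxAssignments=400,                   # desired sample size
    AssignmentDurationInSeconds=30 * 60,
    LifetimeInSeconds=7 * 24 * 60 * 60,
    Question=external_question,
    QualificationRequirements=qualification_requirements,
)
print(hit["HIT"]["HITId"])
```

Asking participants their country of origin inside the survey itself, as suggested earlier, provides a second check on top of the locale qualification.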
Additionally, accuracy can be monitored using the MTurk platform. “Catch trials”, or questions designed to test for reading comprehension, can be employed to identify and filter out respondents who are not paying attention (Oppenheimer, Meyvis, & Davidenko, 2009). MTurk also makes it possible to monitor completion times of subjects, and inhumanly fast completion times can be filtered out.
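To make this concrete, here is a minimal sketch of that filtering step in Python/pandas, assuming a hypothetical survey export with a catch-question column and a recorded completion time; the column names and the catch answer are placeholders.

```python
import pandas as pd

# Hypothetical export from the survey platform: one row per respondent.
responses = pd.read_csv("survey_export.csv")

# Respondents who answered the catch/comprehension question correctly.
passed_catch = responses["catch_question"] == "the leaflet was about chickens"

# Respondents whose completion time is plausible for a 10-minute survey
# (anything under two minutes here is treated as inhumanly fast).
plausible_speed = responses["duration_seconds"] >= 120

# Flag rather than silently delete, so the analysis can be reported both ways.
responses["attention_flag"] = ~(passed_catch & plausible_speed)
print(responses["attention_flag"].value_counts())
```

Whether flagged respondents should actually be excluded is a judgment call, for the reasons discussed next.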
Such catch trials are highly recommended by Goodman, Cryder, and Cheema (2013), though other researchers argue that careful attention should be paid to not overusing worker filtering, as this often risks introducing more bias than it removes (Peer, Vosgerau, & Acquisti, 2013; Chandler, Mueller, & Paolacci, 2014). Rouse (2015) also reported that filtering out people who failed checks did not significantly impact the reliability of any measures studied.
Counterintuitively, however, such catch trials may not be necessary. Most research has found that MTurk participants performed better on reading comprehension questions than a non-MTurk online sample answering the same questions (Berinsky, Huber, & Lenz, 2012; Peer, Vosgerau, & Acquisti, 2013), which may indicate that participants on MTurk actually pay closer attention than a typical person would, though Goodman, Cryder, and Cheema (2013) found that Turkers underperformed college samples. Peer, Vosgerau, and Acquisti (2013) also found that restricting to workers with >95% task acceptance rates outperforms comprehension check questions for improving data accuracy.
The possibly more attentive participation of Turkers could be a concern for animal advocacy research, since in reality people may not pay close attention to animal advocacy materials. Ensuring that Turkers pay attention to materials and filtering out those who fail comprehension check questions may therefore lead to a population that is far more attentive than the population an intervention is usually implemented on.
How many participants fail catch trials?
It depends on the difficulty of the catch trials. Rouse (2015) found that ~5% of his population did not pass checks, while Antin & Shaw (2012) found 5.6% of theirs. These numbers can vary widely -- in an experiment I personally ran, I found 10-30% of people would fail comprehension checks. More importantly, survey completion rates and catch trial pass rates have equaled or exceeded those of other online survey samples and traditional college student samples (Paolacci, Chandler, & Ipeirotis, 2010; Berinsky, Huber, & Lenz, 2012). However, care must be taken to select catch trials that participants do not have prior exposure to (see Kahan, 2013).
Is there any risk of strong social desirability bias?
Social desirability bias is a documented phenomenon in which participants tend to answer in ways they believe are desirable, or in the way they expect the experimenter wants, rather than answering truthfully. Berinsky, Huber, and Lenz (2012) speculate that because Turkers are paid with the possibility of having their answers rejected, they may be especially susceptible to this bias. In animal advocacy research, this creates a large concern that treatment participants may figure out the purpose of the study and report reduced animal product consumption.
Antin & Shaw (2012) investigated social desirability on MTurk explicitly and reported that participants did give socially desirable responses. However, they did not compare social desirability rates to any non-MTurk sample, so it’s unclear whether social desirability bias is a particular problem on MTurk. Behrend, et al. (2011) and Gamblin, et al. (2016) both asked Turker participants to fill out the Marlowe-Crowne social desirability scale, a psychologically validated questionnaire designed to assess social desirability bias (see Crowne & Marlowe, 1960), and found that MTurk samples were more prone to social desirability bias than college student samples.
Overall, it’s difficult to assess the degree to which social desirability bias may be a problem on MTurk, though there are some reasons to think it may be a particular challenge. Comparing social desirability bias between MTurk and non-MTurk samples could be a productive avenue for future research.
How valid are MTurk results, assuming representativeness and filtering out bad responses?
Even with these conditions, the fundamental limitation is that the MTurk laboratory is not the real world. MTurk participants are in a more controlled environment; while this is useful for ensuring experimental conditions are met, it also means participants interact with materials differently than they would in the field, so external validity is quite questionable.
Is there any risk of the same Turkers doing the same survey over and over for the money, biasing the results?
There’s usually low risk of receiving multiple responses from the same Turker, since each worker has a unique identifier (ID) that must be tied to a unique credit card (Paolacci, Chandler, & Ipeirotis, 2010), and these unique IDs can be monitored to avoid duplication. Berinsky, Huber, and Lenz (2012) and Horton, Rand, and Zeckhauser (2011) both found minimal evidence of duplicate work during their experimental replications.
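For example, a minimal sketch of that monitoring step (again assuming a hypothetical survey export with worker IDs and submission times recorded) might look like:

```python
import pandas as pd

responses = pd.read_csv("survey_export.csv")

# Keep only each worker's first submission; later submissions from the same
# Worker ID are treated as duplicate work (or set aside for manual review).
deduplicated = responses.sort_values("submit_time").drop_duplicates(
    subset="worker_id", keep="first"
)
print(f"Dropped {len(responses) - len(deduplicated)} duplicate submissions.")
```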
However, Chandler, Mueller, and Paolacci (2014) pooled many MTurk experiments and found experimental and theoretical reasons for concern over duplicate work. Kahan (2013) also suggests that many ways of detecting duplicate work could be defeated by sufficiently motivated Turkers, such as by using VPNs, though there is no way to know how many do this.
What other concerns might there be about the MTurk platform?
Huff and Tingley (2015), Kahan (2013), and Chandler, Mueller, and Paolacci (2014) all mention a particular concern: because Turkers remain active on the platform for months and because questions don’t vary much between different psychological studies, participants can end up pre-exposed to common survey questions, which can lower effect sizes (Chandler, et al., 2015). They suggest avoiding reuse of common psychological scales, tracking the worker IDs of those who have previously completed related surveys, and being careful not to readmit them into future studies with the same questions.
Stewart, et al. (2015) find a power law among Turkers, where a large majority of tasks are done by a small minority of Turkers, commonly called “superturkers”. These superturkers are overwhelmingly likely to take your study and are also overwhelmingly likely to have prior exposure to previous studies.
Goodman, Cryder, & Cheema (2013) raise concerns about MTurk questions that ask participants about their personal knowledge (e.g., how many countries are in Africa?), because a sizable minority of participants appear to look up the answer on the internet, even when they are not incentivized for correct answers.
How many unique respondents are on MTurk?
MTurk is not bottomless in its participant pool. MTurk itself claims access to 500,000 people, but the true number is likely significantly lower -- Stewart, et al. (2015) estimate that, at any given time, there are approximately 7,300 Turkers available to complete a given survey, with this pool being 50% refreshed in approximately seven months. Note that this estimate is restricted to US workers who have completed fifty or more prior tasks at 95% or more accuracy; it’s probably possible to obtain more participants by altering these restrictions. Additionally, a separate replication in Stewart, et al. (2015) found an estimated population of 16,306 Turkers using the same sample restrictions.
A separate, informal estimate by Ipeirotis, one of the leading MTurk researchers, suggests an availability of 1,000-10,000 Turker full-time equivalents at any given time.
While not intentionally trying to find a maximum sample size, Casey, et al. (2016) were able to sample 9,770 unique respondents over eight weeks, using the same selection criteria as Stewart, et al. (2015). Similarly, Huff & Tingley (2015) were able to collect a sample of 15,584 unique respondents “over time”, though it was not made clear how long a timeframe this took.
Therefore, completing a 10,000 person study could take months or years, which could be a substantial concern given that these samples may be necessary for animal advocacy researchers attempting to detect small treatment effects.
This concern is further compounded by the numerous animal advocacy researchers running studies on MTurk, creating serious concerns of participant contamination and competition over the limited MTurk population. This could be resolved by careful coordination between researchers using MTurk and by attempting to identify Turker IDs that have previously participated in animal advocacy studies.
If an experiment wanted to zero in on a particular population, could it do so over MTurk?
Yes. At its most basic, MTurk allows broad data collection from the entire MTurk population for a per-worker fee, after which researchers can filter the data down to the particular subset of interest. In a more advanced approach, MTurk allows recording unique Turker IDs and pre-screening them by assigning virtual “qualifications” that only the researcher can see. Researchers can then very cheaply screen an MTurk population for the chosen demographic and follow up with the screened population, without the screened population being aware they were screened. An initial demographic survey can be completed for less than $0.10 per worker, allowing an initial population to be screened at a lower cost than collecting full data and subsetting it.
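As a rough sketch of this pre-screening workflow (assuming the boto3 MTurk client; the qualification name, worker IDs, and values are placeholders), the researcher creates a private qualification, grants it to workers who matched the screen, and then restricts the follow-up HIT to that qualification:

```python
import boto3

mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester.us-east-1.amazonaws.com",
)

# 1. Create a private qualification visible only to this requester.
qual = mturk.create_qualification_type(
    Name="Demographic screen, batch 1",      # placeholder name
    Description="Granted to workers who completed our screening HIT.",
    QualificationTypeStatus="Active",
)
qual_id = qual["QualificationType"]["QualificationTypeId"]

# 2. After the cheap screening HIT, grant the qualification to workers who
#    matched the target demographic (worker IDs here are placeholders).
for worker_id in ["A1EXAMPLEID", "A2EXAMPLEID"]:
    mturk.associate_qualification_with_worker(
        QualificationTypeId=qual_id,
        WorkerId=worker_id,
        IntegerValue=1,
        SendNotification=False,   # workers are not told they were screened
    )

# 3. The follow-up HIT is then restricted to pre-screened workers.
followup_requirements = [{
    "QualificationTypeId": qual_id,
    "Comparator": "EqualTo",
    "IntegerValues": [1],
    "ActionsGuarded": "DiscoverPreviewAndAccept",
}]
# ...pass followup_requirements to create_hit() as QualificationRequirements.
```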
How would a typical animal advocacy experiment look on the MTurk platform?
A typical MTurk experiment might take the form of posting a job on MTurk (called a HIT, or Human Intelligence Task) with a particular title and description. This title and description should be vague yet enticing, usually mentioning the length of the survey, since describing specific characteristics of the survey in the description can induce non-response bias, as found by Ross, et al. (2010).
A HIT collects responses from Turkers until the desired sample size is reached and then the HIT should be closed.
The HIT itself will usually contain instructions and a link to an external survey hosted on a platform such as Qualtrics or SurveyMonkey. Using these platforms, data is collected and Turkers are randomly split into treatment and control groups. Qualtrics and SurveyMonkey can also automatically record Turker IDs, allowing tracking between experiments and follow-ups while maintaining the anonymity of the Turker.
If a baseline-endline study is desired, follow ups with specific Turkers can be completed if their unique ID is known.
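One way to do this (a sketch assuming boto3’s NotifyWorkers operation, with placeholder worker IDs) is to message baseline participants directly when the endline HIT goes live, while also restricting that HIT to a qualification granted at baseline, as described earlier:

```python
import boto3

mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester.us-east-1.amazonaws.com",
)

# Worker IDs recorded at baseline (placeholders here).
baseline_workers = ["A1EXAMPLEID", "A2EXAMPLEID", "A3EXAMPLEID"]

# NotifyWorkers accepts up to 100 worker IDs per call, so send in batches.
for i in range(0, len(baseline_workers), 100):
    mturk.notify_workers(
        Subject="Follow-up survey now available",
        MessageText=(
            "Thank you for taking our earlier survey. A short follow-up HIT "
            "is now available to you on Mechanical Turk."
        ),
        WorkerIds=baseline_workers[i:i + 100],
    )
```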
What kind of wage is offered on the platform?
Burhmester, Kwang, and Gosling (2011) find that most participants on MTurk are not motivated by the money, though Paolacci, Chandler, and Ipeirotis (2010) find contradictory evidence that the majority of participants do report being motivated by money. Furthermore, the latter find that 13.4% of Turkers report using MTurk as their primary source of income (Ibid.). Regardless of motivation, the empirical evidence is clear that offering a larger wage does not seem to noticeably increase the quality of participant responses (Burhmester, Kwang, & Gosling, 2011; Rouse, 2015).
That being said, the speed at which participants sign up to take the study can be meaningfully affected by offering a higher wage, even if participant quality does not increase (Burhmester, Kwang, & Gosling, 2011; Berinsky, Huber, & Lenz, 2012). For example, Berinsky, Huber, and Lenz (2012) found that paying $2.40/hr (33% of the US federal minimum wage (FMW)) was able to recruit 200 participants per day, and upping the pay to $4.20/hr (57% FMW) yielded 300 participants per day. Burhmester, Kwang, and Gosling (2011) found that offering $0.04/hr (0.5% FMW) yielded 127 participants per day and $6/hr (82% FMW) yielded 972 participants per day, with many ranges in between.
One may be tempted to pay well on MTurk, perhaps even offering the US FMW, which is rare on the platform (Fort, Adda, and Cohen, 2010; see also this TurkerNation forum thread). However, doing so could introduce a non-response bias -- Chandler, Mueller, and Paolacci (2014) raise this concern theoretically, cautioning one to neither underpay nor overpay, since Turkers intentionally seek out lucrative studies and avoid underpaying ones. Also, Stewart, et al. (2015) found that, counterintuitively, increasing pay could decrease the size of the population willing to complete the study.
Therefore, I recommend offering a wage of $3/hr-$5/hr, which appears close to the mean wage offered by most studies and is respectably above the average wage on the platform. Notably, this conflicts with Casey, et al. (2016), who state “minimum acceptable pay norms of $0.10 per minute” ($6/hr, or 83% FMW), but this appears to be a statement based more on the ethics of justice (which are certainly important and could prevail depending on your point of view) than on data accuracy.
Lastly, it’s not entirely clear, but the wages discussed here seem to be inclusive of Amazon’s MTurk service fees, which are usually 20% but can be gamed. This means that offering $5/hr would result in a true payment of $4/hr to the Turker after service fees are taken into account. It’s also worth noting that MTurk fees doubled from 10% to 20% in early 2015, which means it may not be straightforward to compare wages offered now with wages offered in studies prior to 2015.
What kind of response rate can be expected for the second wave of a baseline-endline study?
MTurk response rates for longitudinal studies can be low, especially over long timeframes. Chandler, Mueller, and Paolacci (2014) found response rates higher than 60% for recontact within three months, but the response rate dropped to 44% when trying to recontact one year later -- worse than college student surveys (~73% response after one year), though still impressive given how easy it is to recontact. Additionally, Burhmester, Kwang, and Gosling (2011) reported a 60% response rate when recontacting after three weeks, and Shapiro, Chandler, and Mueller (2013) reported an 80% response rate after one week. Norris & Roberts (2016), an animal advocacy MTurk study, found a 63% response rate after an initial period of three months.
Notably, incentives are typically doubled for the endline of a baseline-endline study (Burhmester, Kwang, & Gosling, 2011) so as to increase the response rate.
How much would a MTurk study cost?
Costs can vary a lot depending on sample size, survey length, use of demographic targeting, and whether multiple waves are desired.
400 participants for a 5-min study would cost about $134, assuming total costs of $0.067 per participant-minute.
A 15-min baseline-endline study with 2,000 endline participants would require 3,333 participants at baseline, given the expected 60% recontact rate. With a baseline wage of $4/hr and an endline wage of $8/hr (inclusive of MTurk’s 20% service fees), the total would be $7,333 ($1/person at baseline for 3,333 people and $2/person at endline for 2,000 people).
Recruiting 2,000 women aged 18-25 for a 30-min baseline-endline study would require 3,333 such women at baseline, which would in turn require approximately 5,555 women aged 18-25 to respond to an initial demographic screen, and over 33,000 people to take that screen in order to find those 5,555 women. At $0.05/person for the screen, $1/person for the baseline, and $2/person for the endline, the total would be roughly $8,983. Such a screen may be very time-intensive to conduct, since you would have to wait for the Turker pool to replenish a few times.
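For transparency, here is the arithmetic behind these three estimates as a small Python sketch; every figure is an illustrative assumption from the examples above, not an MTurk price quote.

```python
# Example 1: 400 participants, 5-minute survey, ~$0.067 total cost per
# participant-minute.
simple = 400 * 5 * 0.067                        # ≈ $134

# Example 2: 2,000 endline participants at a ~60% recontact rate implies
# ~3,333 at baseline; 15 min at $4/hr ≈ $1/person, endline at $8/hr ≈ $2/person.
baseline_n = round(2000 / 0.6)                  # ≈ 3,333 participants
two_wave = baseline_n * 1 + 2000 * 2            # ≈ $7,333

# Example 3: add a $0.05/person demographic screen of ~33,000 workers to find
# ~5,555 women aged 18-25, ~3,333 of whom complete the 30-minute baseline.
screened = 33000 * 0.05 + 3333 * 1 + 2000 * 2   # ≈ $8,983

print(round(simple), two_wave, round(screened))
```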
Conclusion
While there are some dissenters, overall the consensus is that MTurk is worthwhile for social science research and has benefits that exceed the limitations. However, MTurk’s advantages and disadvantages should be balanced wisely.
MTurk is a convenient, reasonably representative, on-demand survey workforce, available at impressively low time and monetary cost. This is especially important for funding-constrained animal advocates who require large sample sizes in order to discern the possibly very small expected effects of advocacy materials. This could make MTurk well suited for animal advocacy research; however, be aware that MTurk is not the same as the real world, which means results may not be externally valid.
In particular, when working on MTurk, I recommend that one:
- Explicitly limit to US participants if studying the US population, by asking a nationality question and using geo-IP screening.
- Monitor other key demographics, like gender, liberalism, and potentially race.
- Consider using demographic pre-screening to ensure your sample matches the kind of population you expect to target with your animal advocacy.
- Capture Turker IDs to avoid duplicate work and to recontact if desired.
- Ask comprehension questions and document how many Turkers passed them, though consider carefully whether these participants should be dropped from the study.
- Note how people may pay too much attention to your materials, which may be unlike real-life conditions.
- Be careful of social desirability bias and consider disguising the intent of your surveys as much as possible.
- Avoid putting study-specific language in HIT titles or descriptions.
- Avoid re-using common psychological scales.
- Avoid asking factual questions that can be looked up on the internet to test knowledge.
- Don't stray too far from the $3/hr-$5/hr wage, even if justice tempts greater pay or budget restrictions tempt lesser pay.
- Carefully record your entire MTurk recruitment process, including title, description, survey methodology, and wage offered, since these variables can influence survey responses.
- Know what restrictions the reality of MTurk may place on your sample size, especially given the typical need for a large sample size in animal advocacy research.
- Coordinate with others completing animal advocacy studies on MTurk, so as to not needlessly compete and contaminate a limited participant pool.
With non-response bias, social desirability bias, and Turkers paying more attention to study materials than normal, most of the biases on the MTurk platform point toward producing larger effects than one would expect outside MTurk. This means that if one cannot find an effect on MTurk, it is even less likely that one would find an effect in a setting where the effects are not magnified by bias. This could make MTurk a very effective initial screen for determining which effects to follow up on in a more expensive manner, which could be very good for initial animal advocacy results.
In short, you get what you pay for. MTurk seems to be a great testing ground for initial evidence of effects, but any detected effects need to be validated with more careful (and unfortunately more expensive) research methods.
Special thanks to Kieran Greig, Joey Savoie, and Marcus Davis for reviewing the initial draft and suggesting edits.