What languages are spoken by crowdsourced workers?

Monday, October 18th, 2010

Language and cognition research that used to cost thousands of dollars and take several months can now be completed in a matter of hours for a few dollars. This is thanks to the recent proliferation of crowdsourcing technologies like Amazon Mechanical Turk (AMT) and CrowdFlower, which have opened up new avenues of research by engaging a ready, online workforce through microtasking. But we haven’t yet seen much work in cross-language crowdsourced studies (with some notable exceptions from Johns Hopkins, like Ann Irvine and Alexandre Klementiev’s ‘Using Mechanical Turk to Annotate Lexicons for Less Commonly Used Languages’). There have been a few studies of the demographics of people using AMT (especially ‘The New Demographics of Mechanical Turk’, by Panos Ipeirotis), but no-one had thought to simply ask the workers what languages they spoke. So I did.

Languages spoken by AMT workers

I asked a few hundred AMT workers what languages they spoke other than English. This was staggered over a number of tasks at different times of the day and week to get as much variety as possible. You are welcome to play with the results: crowdsourcing_languages.csv (contact me directly for expanded metadata). This gave about 2,100 data points in total.

The languages were (deep breath): Afrikaans, Albanian, American Sign Language, Ancient Greek, Arabic, Assamese, Badaga, Bengali, Bhojpuri, Bulgarian, Cantonese, Catalan, Caucasian, Cebuano, Chattisgarhi, Chinese, Coorgi, Creole, Croatian, Czech, Danish, Dusun, Dutch, Esperanto, Estonian, Farsi, Finnish, Flemish, French, Fuzhou, Galician, Garwali, German, Greek, Gujarati, Haryanvi, Hawaiian, Hebrew, Hindi, Hokkien, Hungarian, Icelandic, Ilocano, Ilonggo, Indonesian, Irish, Italian, Japanese, Kadazan, Kannada, Kiswahili, Klingon, Konkani, Korean, Kurdish, Kutchi, Latin, Latvian, Lithuanian, Macedonian, Maithli, Malay, Malayalam, Mandarin, Manipuri, Marathi, Marwari, Nepali, Norwegian, Orriya, Pa Dutch, Pig Latin, Plattduitsch, Polish, Portuguese, Punjabi, Pushto, Rajasthani, Romanian, Russian, Sanskrit, Serbian, Shanghainese, Sindhi, Slovenian, Sowrashtra, Spanish, Swedish, Swiss German, Tagalog, Tamil, Telugu, Thai, Tulu, Turkish, Ukrainian, Urdu, Vietnamese, Visayan, Yiddish and Yupik! By smoothing estimates, it is safe to predict that at least a few hundred more are spoken by AMT workers.
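The post does not say which smoothing method was used. As an illustration only, here is a minimal Python sketch of one common way to estimate how many unseen categories remain: the bias-corrected Chao1 estimator, which relies on how many languages were reported exactly once or exactly twice. The function name and the toy data are invented for this example and are not taken from the survey.

```python
from collections import Counter


def chao1_estimate(reports):
    """Estimate the total number of distinct languages, seen plus unseen.

    `reports` is a flat list with one language name per response.
    Uses the bias-corrected Chao1 formula:
        S_obs + f1 * (f1 - 1) / (2 * (f2 + 1))
    where f1 and f2 are the numbers of languages reported exactly once
    and exactly twice.
    """
    counts = Counter(reports)
    observed = len(counts)
    singletons = sum(1 for c in counts.values() if c == 1)
    doubletons = sum(1 for c in counts.values() if c == 2)
    unseen = singletons * (singletons - 1) / (2 * (doubletons + 1))
    return observed + unseen


# Toy data (hypothetical): five observed languages, three reported only once.
toy_reports = ["Tamil"] * 3 + ["Hindi"] * 2 + ["Yupik", "Klingon", "Dusun"]
print(chao1_estimate(toy_reports))
```

The more languages that turn up only once, the larger the estimate of what is still missing, which is the intuition behind expecting many more languages than the ones listed.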

There are some fuzzy (and not so fuzzy) interpretations. Hindi and Urdu are one language with some minor dialectal variation, as are Indonesian and Malay. At the other end, a number of the participants who reported speaking ‘Chinese’ probably speak any number of related languages, as distinct languages are often called ‘dialects’ within China, especially in relation to the more prestigious languages. ‘Pig Latin’ is not a language. The one person who claimed to speak Klingon … well, who knows, perhaps they do.

I combined the results with the WALS database to map the lineage and origin of many of the languages, showing a huge geographical bias in the distribution. The world’s languages are concentrated in or near the tropics, but those spoken here were predominantly European or non-tropical Asian in origin. Despite that, it is great to see a scattering of less widely-spoken languages like Kadazan (Austronesian) and Yupik (Eskimo-Aleut), showing that despite the biases in overall volume there is a very rich variety of languages spoken by AMT workers. Six of the ten most commonly spoken (Tamil, Malayalam, Telugu, Kannada, Marathi and Gujarati) do not yet have online translation tools via Google or Bing, so there is clearly great scope to support online translation for new languages, too.

To populate the map in an interesting way, I also calculated the most frequent language reported at each hour of the day, restricting this so that each language appears in only one timezone. This gives us 24 languages (see below), one for each hour of the day. I’ve added these to the map at midday for the timezone in which they were most frequently spoken. This is more for visual effect than anything else, but it does give an idea of the optimal time to run tasks for any specific language, and it strongly correlates with the part of the world that the language originates in (there are surprisingly few crossing lines).
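The exact procedure behind this per-hour calculation is not spelled out here, so the following Python sketch is only one plausible reading: count every (hour, language) pair, then hand out the strongest pairs first, skipping any hour or language that has already been used. The function name, the greedy strategy, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def top_language_per_hour(reports):
    """Pick one language per hour, using each language at most once.

    `reports` is a list of (hour, language) pairs, one per response.
    Greedy assumption: the most frequent (hour, language) pairs are
    assigned first; ties are broken by hour, then by language name.
    """
    counts = Counter(reports)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    assigned = {}          # hour -> language
    used_languages = set()
    for (hour, language), _count in ranked:
        if hour not in assigned and language not in used_languages:
            assigned[hour] = language
            used_languages.add(language)
    return assigned


# Toy data (hypothetical): Tamil dominates both hours, but may be used only once.
toy = ([(9, "Tamil")] * 5 + [(10, "Tamil")] * 4
       + [(10, "Hindi")] * 3 + [(9, "Hindi")] * 2)
print(top_language_per_hour(toy))
```

In the toy data Tamil is the most frequent language at both hours, but because it is claimed by hour 9 first, hour 10 falls through to Hindi. An hour can end up with no language at all if every language reported in it has already been assigned elsewhere.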

The most common language per timezone, limited to one timezone per language

In a recent paper we argued that ‘introspective’ analysis of invented sentences was no longer a required fallback for language studies, as we can quickly obtain speaker judgments about sentences through crowdsourced experiments (with Steven Bethard, Victor Kuperman, Vicky Tzuyin Lai, Robin Melnick, Christopher Potts, Tyler Schnoebelen and Harry Tily, “Crowdsourcing and language studies: the new generation of linguistic data”). Given the variety of languages here, it is safe to say that researchers also don’t need to limit studies to their native languages. Even language researchers who choose not to undertake fieldwork can now contribute to our knowledge of the world’s linguistic diversity, making for an exciting future for our field.

There is the potential to be more proactive than this study in seeking out speakers of other languages through crowdsourcing platforms. Scott Novotney and Chris Callison-Burch recently found additional Korean speakers for one study by creating a new task asking people to ‘find a Korean speaker’ (‘Cheap, Fast and Good Enough: Automatic Speech Recognition with Non-Expert Transcription’). They shared the income with those people who found the speakers of Korean, finding an unlikely combination of affordable outsourcing and pyramid schemes for the forces of good. I wish I’d thought of that. Another proactive approach would be to partner directly with organizations that establish microtasking centers around the world, like Samasource. Their workers are concentrated in areas of high linguistic diversity, and while most complete tasks for western businesses, there is no doubt that many would enjoy contributing to research about their more local languages. The least well-studied languages are in less-resourced parts of the world, and so the wages typically paid on crowdsourcing platforms could provide a competitive income while contributing to an individual’s work experience and digital skills.

This is something I have found regularly in this and other studies: crowdsourced participants enjoy contributing to science and making use of their rich linguistic knowledge. I also asked the crowdsourced workers for their thoughts on language studies. It is their linguistic knowledge and performance that we benefit from researching, so I’ll let them conclude:

It’s really nice to remind me what I am.

I am hoping that MTurk will provide more opportunities for translations in Spanish!

What about localizing mturk in different languages so that any one can easily work in mturk.

It is a very enjoyable thing to do research in linguistics.

I’m an outlier, because I love languages so much, but you can count me in, I suppose. Good luck with your research.

Used to be fluent in Latin, but it’s hard to stay in practice with a dead language.

Tamil is my mother tongue … the creativity will shine in mother tongue only.

This is not only work. It can improve our knowledge also.