Duolingo: A Case Study in AI Ethics Implementation

Research paper summary

Sep 23, 2024

Hi AI ethics enthusiasts,

New paper out!

There are so many AI ethics frameworks out there. Most of them are high-level, abstract, and far from implementation. My new co-authored paper bridges this gap. It was written with the practitioners themselves, and it showcases how an organization can write practical AI ethics principles and then implement them.

The case study is Duolingo’s English Test. And my fabulous co-authors are part of the Duolingo English Test team:

Jill Burstein led the paper, as well as the Duolingo English Test responsible AI efforts.
Alina (Olteanu) von Davier is the Chief of Assessment at Duolingo
Geoff LaFlair is an Assessment Scientist at Duolingo
Kevin Yancey is a Senior Staff AI Research Engineer

The paper is forthcoming in the Handbook for Assessment in the Service of Learning (Editorial Team: Eleanor Armour-Thomas, Eva L. Baker, Howard Everson, Edmund W. Gordon, Steve Sireci, and Eric Tucker).

In this entry, I summarize the paper and explain why it is important. For dessert, an AI-generated take on this post!

The Duolingo English Test (DET)

Duolingo’s English Test is a digital-first, high-stakes, computer-adaptive measure of English language proficiency, primarily used for admissions to English-medium higher education institutions (similar to the TOEFL).

It leverages various AI capabilities, including automated item generation, writing and speaking evaluation, and plagiarism detection.

Duolingo's AI Ethics Principles

Duolingo has developed four main AI ethics standards for the Duolingo English Test (DET). Each standard includes goals and subgoals, all geared towards implementation. You can read them in full here, and these are the standards at a high level:

1. The Validity and Reliability standard - ensures the test is suitable for its intended purpose. The Validity standard evaluates construct relevance and accuracy, and the Reliability standard focuses on consistency.

2. The Fairness standard - promotes democratization and social justice through increased access, accommodations, and inclusion, representative test-taker demographics, and avoiding algorithms known to contain or generate bias.

3. The Privacy and Security standard - ensures (a) compliance with relevant laws and regulations governing the collection and use of test taker data; (b) test-taker privacy, and (c) secure test administration.

4. The Accountability and Transparency standard - aims to gain trust from stakeholders through proper governance and documentation of AI used on the test.

Implementation Example:

The Six-Step Process for Writing Exam Questions

The paper illustrates the implementation of the standards using various examples. One of them is Duolingo’s six-step process for using generative AI to produce test items, such as questions and texts. This process illustrates the application of the Validity and Reliability standard and the Fairness standard.

Here are some highlights of the process (read more about it in the full paper):

Step 1: Articulate the target construct

Human subject-matter experts define what the test will evaluate and how.

This step implements the validity and reliability standard by ensuring that test items are aligned with what should be measured.

Step 2: Specify tasks and scoring systems

Clearly specify a task and scoring system. This includes AI feature development and evaluation that operationalizes elements articulated by human SMEs.
Evaluate AI scoring system accuracy and fairness, leveraging human expertise.
Develop (a) explainable scoring methods, and (b) interpretable AI features used for scoring that have clear alignment with domain constructs
Identify AI methods for item creation, leveraging human expertise to efficiently create valid and reliable test items.

Like the previous step, this step implements the validity and reliability standard by ensuring that test items are aligned with what should be measured. In addition, it implements the fairness standard by incorporating fairness into the scoring system.
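To make "evaluate AI scoring system accuracy and fairness" a bit more concrete, here is a minimal sketch of one way such a check could look. It is an illustration only, not Duolingo's method: the paper summary above does not specify metrics, so the function name `agreement_by_group`, the `tolerance` parameter, the group labels, and the sample scores are all assumptions. The idea is to compare AI scores against human scores separately for each test-taker group and see whether the gap differs between groups.

```python
from collections import defaultdict
from statistics import mean


def agreement_by_group(records, tolerance=0.5):
    """Compare AI scores with human scores, per test-taker group.

    records: iterable of (group, human_score, ai_score) tuples.
    All names and the tolerance value are hypothetical.
    """
    diffs = defaultdict(list)
    for group, human, ai in records:
        diffs[group].append(ai - human)

    report = {}
    for group, ds in diffs.items():
        report[group] = {
            "n": len(ds),
            # Positive means the AI scores higher than humans on average.
            "mean_diff": mean(ds),
            # Share of responses where AI and human scores are close.
            "within_tolerance": sum(abs(d) <= tolerance for d in ds) / len(ds),
        }
    return report


# Made-up example data: (group, human score, AI score)
records = [
    ("A", 4.0, 4.0),
    ("A", 3.0, 3.5),
    ("B", 4.0, 3.0),
    ("B", 2.0, 2.0),
]
report = agreement_by_group(records)
for group, stats in report.items():
    print(group, stats)
```

In this toy data, the AI agrees with human raters less often for group B than for group A. A gap like that is the kind of signal human experts would want to investigate before trusting the scoring system.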

Step 3: Develop prompts for AI content generation

Subject matter experts create prompts that elicit content and questions from a large language model that align with the test specifications articulated in the previous steps.

This step implements the validity and reliability standard, as subject matter experts are the ones developing prompts.

Step 4: Use AI for large-scale content and task generation

Use GPT for large-scale generation of content and tasks, such as potential answers for multiple choice questions, and evaluate their quality.

Step 5: Human review of generated content

Conduct fairness and bias (FAB) reviews and item quality reviews (IQR)
Ensure content is factually accurate and free from culturally sensitive or inaccessible topics

This step implements the fairness standard in proactively reviewing the generated content for fairness and bias issues. Moreover, it implements the validity and reliability standard in reviewing the content for factual accuracy.

Step 6: Improve the item design process based on feedback

Use information from reviews to enhance item generation procedures and prompts
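For readers who think in code, steps 3 to 6 can be pictured as a small generate, review, and feedback loop. The sketch below is my own simplification under stated assumptions, not the DET pipeline: `fake_generator` stands in for a large language model call, the two lambda "reviewers" stand in for human FAB and IQR reviewers, and every name and sample text is hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Item:
    text: str
    flags: list = field(default_factory=list)  # names of reviews the item failed


def generate_items(prompt, n, generator):
    """Step 4: produce n candidate items from an expert-written prompt."""
    return [Item(generator(prompt, i)) for i in range(n)]


def review(items, reviewers):
    """Step 5: record which reviews each item fails, then split the pool."""
    accepted, rejected = [], []
    for item in items:
        for name, check in reviewers.items():
            if not check(item.text):
                item.flags.append(name)
        (rejected if item.flags else accepted).append(item)
    return accepted, rejected


def feedback_summary(rejected):
    """Step 6: count rejection reasons to guide prompt and process changes."""
    return Counter(flag for item in rejected for flag in item.flags)


# Hypothetical stand-in for a language model; returns canned text.
def fake_generator(prompt, i):
    samples = [
        "The library opens at nine.",
        "",
        "Thanksgiving dinner traditions.",
    ]
    return samples[i % len(samples)]


# Hypothetical stand-ins for human judgments (real reviews are done by people).
reviewers = {
    "item_quality": lambda text: len(text) > 0,
    "fairness_and_bias": lambda text: "Thanksgiving" not in text,
}

items = generate_items("Write a short reading passage.", 3, fake_generator)
accepted, rejected = review(items, reviewers)
summary = feedback_summary(rejected)
print(len(accepted), "accepted;", dict(summary))
```

The point of the design is that rejected items are not simply discarded. The reasons for rejection are tallied and fed back, which mirrors how step 6 uses review information to improve prompts and generation procedures.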

Conclusion

You can read more about the six-step process and other examples of implementation in the full paper.

This case study provides a practical example of how AI ethics principles can be developed and implemented in a real-world setting. It shows how companies can go beyond abstract principles and commitments.

Moreover, this paper demonstrates how companies can be transparent and accountable. Most companies provide little to no detail about how they use AI and which guardrails, if any, they implement to ensure positive outcomes. Papers like this one increase stakeholder trust and create opportunities for feedback on the AI ethics efforts themselves, which is crucial for growth.

Get in touch if you’d like to have a paper like this about your own company!

Dessert

An AI-generated take on this post!

Ready for More?

Check out our comprehensive resources, workshops, and consulting services at www.techbetter.ai, and follow us on LinkedIn: Ravit's page, TechBetter's page.

Karen Smiley

Agile Analytics and Beyond

Sep 24

This is cool timing, Ravit! Just today I published an "AI, Software, & Wetware" interview with Dr. Mary Marcel. One of her stories in the interview is about using Duolingo, and why it's one of the few tools incorporating AI that she does use. https://sixpeas.substack.com/p/aisw-016-ai-software-wetware-dr-mary-marcel
