AI-Generated Text Detection

People

Author: Anton Shapkin (aa.shapkin.aa@gmail.com)

Scientific adviser: Akim Tsvigun

Description

This project aims to develop a tool for detecting AI-generated text and code. With the advent of ChatGPT and continued advances in AI, distinguishing human-written from machine-generated text is becoming increasingly challenging. A generative model can, in principle, return any sequence of tokens from its vocabulary, and therefore any text those tokens can express. Even OpenAI is actively working on this problem, but achieving high accuracy in the general case remains difficult.

Task definition

Develop and implement a classification model that identifies texts generated by artificial intelligence, in order to enhance information accuracy.

Data

Dataset #1: https://www.kaggle.com/competitions/llm-detect-ai-generated-text/data

This dataset consists of approximately 10,000 essays written by humans and various generative models. It includes a prompt (instruction) for each essay. The goal is to identify whether a given text X, corresponding to prompt P, was generated by an AI or written by a human.
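As a rough illustration of the (P, X, label) setup described above, the sketch below joins essays to their prompts. The column names (`prompt_id`, `instructions`, `text`, `generated`) and the sample rows are assumptions made for this example and should be checked against the actual competition files.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Example:
    prompt: str      # instruction P given to the writer
    text: str        # essay X to classify
    generated: int   # 1 = AI-generated, 0 = human-written


def load_examples(essays_file, prompts_file):
    """Join essays to their prompts on the (assumed) prompt_id column."""
    prompts = {row["prompt_id"]: row["instructions"]
               for row in csv.DictReader(prompts_file)}
    return [
        Example(prompts[row["prompt_id"]], row["text"], int(row["generated"]))
        for row in csv.DictReader(essays_file)
    ]


# Tiny in-memory stand-ins for the real CSV files (made-up rows).
essays_csv = io.StringIO(
    "id,prompt_id,text,generated\n"
    "a1,0,Cars shape how cities grow.,0\n"
    "b2,1,The Electoral College is a process.,1\n"
)
prompts_csv = io.StringIO(
    "prompt_id,instructions\n"
    "0,Write an essay about car-free cities.\n"
    "1,Write a letter about the Electoral College.\n"
)

examples = load_examples(essays_csv, prompts_csv)
for ex in examples:
    print(ex.generated, "|", ex.prompt, "|", ex.text)
```

With the real data, the two `io.StringIO` objects would be replaced by opened CSV files.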

Dataset #2: https://github.com/vivek3141/ghostbuster-data/tree/master

In the GHOSTBUSTER paper (https://arxiv.org/pdf/2305.15047.pdf), the researchers introduced three new datasets: a writing dataset (based on the subreddit r/WritingPrompts), a news dataset (based on Reuters), and a student essay dataset (based on IvyPanda). Note that the legality of the writing dataset may now be in question.

Dataset #3: from the DetectGPT paper (https://arxiv.org/pdf/2301.11305.pdf)

In this study, the researchers did not provide their evaluation dataset but described the methodology for its generation.

Plan

A review of the relevant literature reveals two weaknesses. First, there is no standardized benchmark: each method is evaluated on its own dataset. Second, existing datasets contain only English text, whereas current models are also proficient in other languages and in code generation. Our plan therefore includes:

Benchmark Preparation (Approximate Deadline: March 22)

Combine existing datasets
Generate new data for different languages: Russian, Chinese, German, French
Generate new data for code: Python, Java, Kotlin, C++
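The benchmark-preparation steps above could be sketched as merging all sources into one record schema. The `Record` fields, the whitespace-based deduplication, and the sample rows are hypothetical choices for this example, not a format the project has committed to.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    text: str
    label: int     # 1 = AI-generated, 0 = human-written
    source: str    # which dataset the record came from
    domain: str    # natural or programming language, e.g. "en", "python"


def normalize(text):
    """Collapse whitespace so trivially different copies hash the same."""
    return " ".join(text.split())


def combine(*datasets):
    """Merge datasets, keeping the first copy of each normalized text."""
    seen, merged = set(), []
    for dataset in datasets:
        for rec in dataset:
            key = hashlib.sha256(normalize(rec.text).encode("utf-8")).hexdigest()
            if key not in seen:
                seen.add(key)
                merged.append(rec)
    return merged


# Made-up records standing in for the real sources.
kaggle = [Record("An essay about cars.", 0, "kaggle", "en")]
ghostbuster = [
    Record("An  essay about cars.", 0, "ghostbuster", "en"),  # duplicate
    Record("Breaking news text.", 1, "ghostbuster", "en"),
]
new_code = [Record("def add(a, b):\n    return a + b", 1, "generated", "python")]

benchmark = combine(kaggle, ghostbuster, new_code)
print(len(benchmark), [r.source for r in benchmark])
```

Keeping `source` and `domain` on every record would make it possible to report results per dataset and per language later on.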

Evaluate Baselines and Existing Methods (Approximate Deadline: April 22)
Implement Our Approach (Optional)
Create UI (Telegram Bot)

The bot will return a label for a given text: 1 if the text is AI-generated and 0 if it was written by a human.
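A minimal sketch of how detector scores might be evaluated and then turned into the bot's 0/1 label. The plan does not name a metric, so the use of AUROC here is an assumption, as are the 0.5 threshold and the made-up scores.

```python
def auroc(labels, scores):
    """Probability that a random AI text scores above a random human text.

    Ties count as half a win. Quadratic in dataset size, fine for a sketch.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def to_label(score, threshold=0.5):
    """Map a detector score to the bot's output: 1 = AI-generated, 0 = human."""
    return 1 if score >= threshold else 0


# Hypothetical detector outputs for four texts (1 = AI-generated).
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]

print(auroc(labels, scores))            # 0.75
print([to_label(s) for s in scores])    # [1, 0, 1, 0]
```

AUROC does not depend on the threshold, so it can be used to compare methods, while the threshold itself would be tuned separately for the bot.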
