DeepClapback

A Reddit comment dataset which searches for 'clapbacks' (comments which are scored higher than the original comments) and set up CSVs for a classification model.

Article on V1: "Can DeepClapback learn to lol?" https://blog.goodaudience.com/can-deepclapback-learn-when-to-lol-e4a2092a8f2c

Setup

Install PostgreSQL, Python / PIP. Create a database called 'comments'.

Download a monthly archive from https://files.pushshift.io/reddit/comments

unzstd RC_2014-02.zst --memory=2048MB
rm RC_2014-02.zst
python3 clean_clapbacks_csv.py
psql -d comments

Import the CSV into the database:

CREATE TABLE comments_2014_08 (
    id VARCHAR(20),
    body TEXT,
    score INT,
    parent_id VARCHAR(20),
    author VARCHAR(30),
    subreddit TEXT,
    created_utc BIGINT,
    parent_score INT,
    cleantext TEXT
);
\copy comments_2014_02 FROM '2014_02.csv' CSV;

Scripts to index rows, find clapbacks, and label relevant parent comments:

pip3 install psycopg2
python3 label_db_clapbacks.py

Find top terms (appear in top 25 clapback-meme replies for the month, for 2 or more months):

python3 find_top_terms.py

Use the top terms to create one clapbacks table, with parent_body column filled (will remove some comments from early January where the parent comment is missing from the year).

python3 extract_clapbacks.py

Save the clapbacks to a CSV file:

\copy clapbacks to 'clapbacks.csv' DELIMITER ',' CSV HEADER;

To list seq2seq clapbacks, run this script which picks several freeform clapbacks, i.e. they are not repetitive meme replies. The number of comments picked from each month is monthly_seq (default of 20,000 / 12), with only the highest score replies within each month being selected. The IAmA subreddit is excluded, because answers generally score higher than questions.

python3 seq2seq_clapbacks.py

\copy seq_clapbacks to 'seq2seq_clapbacks.csv' DELIMITER ',' CSV HEADER;

Content Warning

Comments and responses in the Reddit archives and output datasets all include NSFW language and links!

When you use the dataset, you can use the subreddit and score columns to perhaps filter out NSFW and otherwise toxic content.

License

Reddit comments are properties of Reddit and comment owners using their Terms of Service

MIT license on this version

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
training-models		training-models
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
clapbacks.csv		clapbacks.csv
extract_clapbacks.py		extract_clapbacks.py
find_top_terms.py		find_top_terms.py
label_db_clapbacks.py		label_db_clapbacks.py
nomeme.sql		nomeme.sql
reddit_to_csv.py		reddit_to_csv.py
seq2seq_clapbacks.csv		seq2seq_clapbacks.csv
seq2seq_clapbacks.py		seq2seq_clapbacks.py
v2-notes.txt		v2-notes.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DeepClapback

Setup

Content Warning

License

About

Releases

Packages

Languages

License

Georeactor/reddit-one-ups

Folders and files

Latest commit

History

Repository files navigation

DeepClapback

Setup

Content Warning

License

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages