Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue elimination - database migration #1011

Open
mvl22 opened this issue Jan 5, 2022 · 8 comments
Open

Issue elimination - database migration #1011

mvl22 opened this issue Jan 5, 2022 · 8 comments

Comments

@mvl22
Copy link
Member

mvl22 commented Jan 5, 2022

We discussed in depth the data migration for the elimination of issues to leave discussions only.

NB: Issue is a container but is not actually heavily used in the logic, so we can leave it lying about in parallel if necessary.

For each thread:

  • We add to the thread properties:
    geometry
    tags - merged and uniqued

  • We make a new first message, consisting of the following from the issue:
    "Describe the problem" as message
    Any photo as attachment
    Any link as attachment
    Any deadline as attachment
    Issue creator as the person
    Issue created_at as the time

  • The issue title is not used.

At end of message could inject a new paragraph:
[This message was auto-created from an overall description of this issue that was created in an earlier version of this website.]

If you can detect identical content between issue and first message then don't add the first message.

Data cleansing:

  • MLS and volunteers will add tags to issue/thread where there currently is none, so we can enforce tag being a required field
  • If an issue has multiple threads, MLS and volunteers will add a curated tag representing the issue, e.g. bradford-canal-route for an issue "Canal route could be upgraded"
@mvl22
Copy link
Member Author

mvl22 commented Jan 15, 2022

Just to say we now have a volunteer for the tagging work, so if you can generate the data for this I can get them started.

They'll also be working on dealing with variant tagging, which would be good to clean up. Is it easy to generate a list of all tags and their frequency? That should hopefully enable us to determine similar tags.

@nikolai-b
Copy link
Contributor

nikolai-b commented Jan 16, 2022

To get the issues with threads where neither the issue or the thread have tags I've used:

File.open("issues_without_tags.txt", "w") do |file|
  Issue.joins(:threads).where(<<~SQL
    not exists (select 1 from issue_tags where issue_tags.issue_id = issues.id) 
    and not exists (select 1 from message_thread_tags where message_thread_tags.thread_id = message_threads.id)
  SQL
  ).distinct.select("issues.id, issues.title").find_each { |iss| file << "https://www.cyclescape.org/issues/#{iss.to_param}\n" }
end

issues_without_tags.txt

I'd like to add a validation to ensure all future message threads have tags so we don't get new ones. Given the issue might be tagged I could do something like pre-fill the new thread's tags with the issues ones if you want but also starting with a blank doesn't seem unreasonable.

For issues with multiple threads:

headers = ["Issue URL", "Tags on the issue only"]
CSV.open("issues_with_multiple_threads.csv", "w", write_headers: true, headers: headers) do |csv|
  Issue.where(id: iss_ids).left_joins(:tags).group(:id, :title).select("issues.id, title, json_agg(tags.name) as tag_names").find_each { |iss| csv << ["https://www.cyclescape.org/issues/#{iss.to_param}", iss.tag_names] }
end;

issues_with_multiple_threads.csv

Finally the top tags:

headers = ["Tag name", "Tag frequency (including issues, threads and library items)"]
CSV.open("tag_with_freq.csv", "w", write_headers: true, headers: headers) do |csv|
  Tag.top_tags_fresh(100000).map { |tg| csv << [tg.name, tg.tag_count] }
end

tag_with_freq.csv

@mvl22
Copy link
Member Author

mvl22 commented Jan 17, 2022

Could we get the counts for those added to the stats page somewhere so I can keep an eye on that as we work through it at this end? I don’t mind running the queries on the DB manually from time to time to get the data output but a quick view for the three stats would be helpful.

I'd like to add a validation to ensure all future message threads have tags so we don't get new ones.

Yes, definitely. Soon as we have the current data cleaned up, we should apply that to editing also so that the constraint is in before we do the main site migration.

Finally the top tags

I see we have some tags with zero references to them. I suggest we zap those as they aren't useful and will just create autocomplete noise.

nikolai-b added a commit that referenced this issue Jan 17, 2022
@nikolai-b
Copy link
Contributor

nikolai-b commented Jan 17, 2022

I've added a new admin pages with the untagged issues and issues with multiple threads at /admin/home

I'll add the validation once all issues have tags otherwise it will get confusing.

I've removed the tags with no references, I'll make an issue to check this every so often.

@nikolai-b
Copy link
Contributor

Issues have voting 👍 and 👎 whereas threads have favourites ⭐ . We have 912 instances where a user has 👍 or 👎 and issue. Do we just ignore these? It matters for "popular" part of the new discussions design.

@mvl22
Copy link
Member Author

mvl22 commented Feb 19, 2022

Issues have voting 👍 and 👎 whereas threads have favourites ⭐ . We have 912 instances where a user has 👍 or 👎 and issue. Do we just ignore these? It matters for "popular" part of the new discussions design.

Popular discussions is simply the discussions with the most messages in.

Favourites are purely just a booking mechanism for the user. They are not used in determining popularity. Other users cannot see who has favourited what.

Issues have voting, but you can basically entirely ignore that now. That data was never very interesting and will just fade away.

@mvl22
Copy link
Member Author

mvl22 commented Aug 8, 2022

To get the issues with threads where neither the issue or the thread have tags I've used

Action: Nikolai to make an admin page to help this be worked through more quickly.

@mvl22
Copy link
Member Author

mvl22 commented Sep 4, 2022

To get the issues with threads where neither the issue or the thread have tags

/stats/issues_untagged

This report now has zero hits, so it would be safe to start enforcing at the database level that an issue/thread must have tags.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants