
Sprint Updates

tarunima edited this page Nov 25, 2021 · 103 revisions

November 8 to November 19 2021

Memebox

OGBV Annotation

  • We developed a tool to annotate images for the OGBV project.
  • We vetted a few available tools (discussion here: https://github.com/tattle-made/OGBV/discussions/2) but went ahead with a bespoke tool, mainly because we need multiple users to log in and work on the same project, and we need the software to support multiple languages.
  • This is our second iteration of a bespoke annotation tool. If you are working with, or need, multi-user annotation tools equipped for multimodal and multilingual data, please ping us. We should be able to help.
  • Added a notebook that computes Krippendorff's alpha, using the simpledorff library, on a sample of annotators' data.
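
For readers unfamiliar with the agreement score mentioned above, here is a minimal from-scratch sketch of Krippendorff's alpha for nominal labels. It is not the notebook's code and does not use the library's API; the function name and the input layout (item id mapped to a list of labels) are assumptions made for illustration.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` maps an item id to the list of labels it received, one per
    annotator who labelled it. Missing labels are simply absent.
    """
    coincidences = Counter()  # (label_a, label_b) -> weighted pair count
    for labels in units.values():
        m = len(labels)
        if m < 2:
            continue  # a single label carries no agreement information
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()  # label -> marginal total
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least two pairable labels")

    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - observed / expected
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement.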

OGBV Scraper

  • Updated the Twitter scraper to upload additional fields: whether the post is a reply, whether it is a retweet, the language of the tweet, and the timestamp of scraping.

October 25 to November 5 2021

OGBV project

  • Pushed website for OGBV project: https://tattle.co.in/products/ogbv
  • Built a proof of concept for a custom multi-user, multilingual annotation UI for annotating tweets, to train an ML model for OGBV detection.
  • Added documentation for the Twitter and Instagram scrapers: https://github.com/tattle-made/OGBV
  • Updated the Twitter scraper to iterate over the dates in a given range when scraping tweets.
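
Iterating over a date range, as in the scraper update above, can be sketched as follows. This is only an illustration under assumptions: the helper name `daily_windows` and the `since:`/`until:` output format are hypothetical and are not taken from the scraper's source.

```python
from datetime import date, timedelta


def daily_windows(start, end):
    """Yield (since, until) pairs covering [start, end) one day at a time."""
    day = start
    while day < end:
        yield day, day + timedelta(days=1)
        day += timedelta(days=1)


for since, until in daily_windows(date(2021, 10, 25), date(2021, 10, 28)):
    # A real scraper would run its search for this window here.
    print(f"since:{since.isoformat()} until:{until.isoformat()}")
```

Splitting the range into one-day windows keeps each request small and makes it easy to resume from the last completed day.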

Search Updates

Our documentation for the search server existed as markdown files on GitHub. We used Gatsby to convert it into a dedicated documentation site.



October 11 to October 22 2021

OGBV

Added Twitter and Instagram scrapers. They scrape posts, upload the images/videos to an S3 bucket and finally store the post metadata in MongoDB: https://github.com/tattle-made/OGBV/tree/main/Scrapers

Search Updates

Added some end-to-end tests for the index and search workflow. The changes haven't been merged to master yet but can be perused here - feat : add test for /represent image: https://github.com/tattle-made/tattle-api/commit/1e3d8d1b40c45e76afec0b299a41777b8e48d2f3

Deployed Metabase to Kubernetes

The official method for installing Metabase on Kubernetes does not seem to be supported anymore. We did a barebones install of Metabase backed by our SQL database. The config files for this are available here: https://gist.github.com/dennyabrain/bfb01368f15fec57dd5c195ba6ecdbbb

---

Sep 6 to Sep 17 2021:

Search Updates

Continued work on the v1.0 release of the search engine. Refactored modules into core, features and operators. Began adding tests for the core modules. A documentation website describing the architecture and how to contribute to the search engine will be released later this month.
---

August 23 to Sep 4 2021:

Fact check DB/Dashboard:

  • Made the number of articles in each week visible on the dashboard
  • Fixed a typo in the dashboard
  • Renamed the tattle-research repository on GitHub to factchecking-sites-scraper. This was long due. Updated the documentation for the repository.

Search Improvement

Continued work on the v1.0 release of the search engine. Incorporated Gatsby to generate documentation for the project from its markdown files.

Other Tasks:

  • Converged on a convention for GitHub issues. Old issues were either closed or categorized.
  • Preliminary work to collect OGBV content from Instagram at scale.
---

August 9 to August 20 2021

Website Changes

  • Created a dedicated space on our website to host updates on our research front
  • Added links to our contributors' websites/portfolios on our community page

Search Improvement

  • Work on the v1.0 release of the search engine is underway. Spent the week adding unit tests to the various media operators and updating the project documentation. A dedicated webpage documenting the search server in detail is coming soon.
---

July 26 to August 6 2021


This sprint was slow on tech development. We focused on publishing the report: A Case Study of the Information Chaos During India's Second Covid-19 Wave (https://tattle.co.in/articles/covid-whatsapp-public-groups/)

---

July 12 to July 23 2021


Improvements to Search Server

Made the search server's operation modes configurable via a .yml file instead of hardcoding them. Mainly, this lets us selectively enable or disable the loading of heavy ML models into memory, depending on whether one intends to run the server with the features that depend on those models.
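
The idea of gating model loading on configuration can be sketched as below. This is an assumption-laden illustration, not the server's code: the `CONFIG` dict stands in for whatever the parsed .yml file contains, and every key, operator name and loader here is hypothetical.

```python
# Stand-in for the result of parsing a .yml file; every key is hypothetical.
CONFIG = {
    "operators": {
        "image_vectors": {"enabled": True},
        "video_vectors": {"enabled": False},
    }
}

# Registry of loader functions; a real server would load ML models here.
LOADERS = {
    "image_vectors": lambda: "image-model",
    "video_vectors": lambda: "video-model",
}


def load_enabled_operators(config, loaders):
    """Call only the loaders whose operator is enabled in the config."""
    loaded = {}
    for name, options in config.get("operators", {}).items():
        if options.get("enabled", False):
            loaded[name] = loaders[name]()
    return loaded


operators = load_enabled_operators(CONFIG, LOADERS)
print(sorted(operators))  # only the enabled operators were loaded
```

Because a disabled operator's loader is never called, its model never occupies memory.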

OGBV project with CIS

WhatsApp group analysis report

Amongst other small tasks, we worked on creating a Web UI for the upcoming report that emphasizes the interactive visualizations.



June 28 to July 09 2021

WhatsApp Scraper/Data

Attempted:

  • Cross analysis of WhatsApp data with external data containing fact-checked information.
  • Word frequency analysis of text in text and image items.
  • Number of external links.

Known Issues:

  • Google Translate issues: for example, common emoticons were translated to the word 'bamboo'. Context is also sometimes lost in translation.
  • WhatsApp's "export chat with media" option does not export reliably.

New Issues from Analysis:

  • Change the WhatsApp scraper to also check for variation in the in-between rows across different exports of the same chat.

Building a performant interactive t-SNE visualization for web

We needed to show ~2600 images on a webpage at once in a way that encouraged users to explore the underlying dataset. We used the t-SNE algorithm to lay out these images as nodes in 2D space.

We explored HTML canvas and SVG, along with React, to render these nodes. We stuck with SVG for now, given the ease with which we could tap into HTML events to add interactions to it.

Screenshots from the Work in Progress visualisations here:

T-Sne Viz



June 11 to June 25 2021


Website Tweaks

  • Fixed website links. Updated privacy policy. Addressed certain open issues like this, this and this.

Search Service Optimizations

  • Refactored the source code to separate out reusable chunks of code meant to deal with different media types - text, images and videos.

Data Collection from Messaging App

  • We wrapped up a small scale 2 month data collection exercise. With the goal to learn more about the kinds of conversations happening around the second wave of Corona in India, we joined 20 public messaging groups and collected media from them. We'll share observations from this exercise in a forthcoming report.

  • Experimented with ML based and other image processing approaches to anonymise social media content for more ethical reporting on chat apps.



May 31 to June 11 2021


Search Server

  • Created an API endpoint to index media by sending files directly to our search server, as opposed to doing this via an S3 bucket. This substantially reduced the indexing latency. A blog post detailing this will follow soon.

Data Collection

  • We continued data collection from the public Covid relief groups on WhatsApp that we joined in late April.

Data visualization exploration

  • Building on the work we did here, we spent time exploring ways to visualize clusters of large numbers of images using algorithms like t-SNE and data visualization libraries like d3.js
---

April 15 to May 28


Realtime Collaborative Annotation Engine :

  • We improved upon our barebones annotation web UI. This time we added support for multiple people to work on a project simultaneously and annotate media posts. Since different teams might have different metadata they want to annotate on an image, this new engine provides the flexibility to define the annotation form schema dynamically. Source code : tattle-made/collaborative-media-annotator

Supporting dra.ft members with datasets

  • We queried our fact check article datasets and created custom subsets containing articles of certain themes. These were made available to members of the dra.ft festival.

Search Server optimizations

  • As we tacked on one feature after another on our search engine, it has grown to become big in terms of its storage and memory requirements. We've been digging deep into our dependency tree and docker layers to find ways to reduce the cost of developing, iterating and running this server. We will try to document this in a blog post later on.

Data Collection from WhatsApp:

  • We started collecting data from twenty Covid relief groups on WhatsApp. The goal is to do a small scale study of the types of conversations happening around Covid on WhatsApp during the second wave of the pandemic in India.

Vaccine Hesitancy Article:



Sprint March 15 2021 to April 15 2021


Kosh

Privacy and Security Audit

  • Signed off on privacy policy and security policy documents with our partner organisation to lay the foundation for implementing workflows and processes to prevent or mitigate any privacy or security related incidents.

Fact Check Article Dashboard

  • Refactored the dashboard code to make adding weekly data easier.
  • Added a CI/CD workflow using GitHub Actions to automate production deploys

Scrapers



Sprint Jan 1, 2021 to Jan 30 2021


Scrapers

  • WhatsApp Scraper: tested and pushed the WhatsApp scraper https://github.com/tattle-made/whatsapp-scraper/tree/master/python_scraper… This is a handy tool for anyone who wants to archive their WhatsApp group content. It consolidates exported WhatsApp chats into one database.

    • Features tested: deduplication across multiple exports of the same group; managing time differences across multiple exports using correlation; anonymizing phone numbers and group names.
  • Restructured fact checking sites scraper, thanks to @su__deep

    We discovered that our fact checking article scraper was missing articles from some domains. We took that opportunity to update the way we structure our scrapers.
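
Deduplication across multiple exports of the same WhatsApp group, listed among the tested features above, can be sketched in its simplest form as follows. This is a hypothetical illustration, not the scraper's implementation: it assumes messages are already parsed into `(timestamp, sender, text)` tuples with consistent timestamps, so it does not cover the correlation step used to handle time differences between exports.

```python
def merge_exports(*exports):
    """Merge several exports of one chat, keeping each message once.

    Each export is a list of (timestamp, sender, text) tuples. Two messages
    count as duplicates only if all three fields match exactly.
    """
    seen = set()
    merged = []
    for export in exports:
        for message in export:
            if message not in seen:
                seen.add(message)
                merged.append(message)
    # ISO-formatted timestamps sort correctly as plain strings.
    return sorted(merged, key=lambda message: message[0])


first = [("2021-01-01T10:00", "A", "hi"), ("2021-01-01T10:05", "B", "hello")]
second = [("2021-01-01T10:05", "B", "hello"), ("2021-01-01T10:09", "A", "bye")]
print(merge_exports(first, second))
```

One known limitation of exact matching: a sender who genuinely posts the same text twice at the same timestamp is collapsed into a single message.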

Data Science

  • Participated in SemEval 2021 Task 6 on "Detection of Persuasion Techniques in Texts and Images": https://propaganda.math.unipd.it/semeval2021task6/ https://twitter.com/proppy

  • Submitted results on the development and test sets of the challenge for the following 3 tasks.

  • Given memes -
    1. Classified memes with propaganda techniques given only meme text

    2. Classified memes with propaganda technique and identified the matching span of text using only meme text

    3. Classified propaganda in memes using both meme image and accompanying text

  • This collaboration happened organically between Yohan, @su__deep and @KruttikaNadig on our slack group and we are very excited to see how this plays out.

  • Improvement to multilingual search: replaced word2vec word embeddings with pretrained Sentence Transformer embeddings (https://sbert.net/index.html)

  • This pretrained model generates a vector representation of input text in Indian languages like Hindi, Bengali, Gujarati, Marathi, Tamil, Malayalam etc. The full list can be found here: https://sbert.net/docs/pretrained_models.html… This helps us improve our multilingual search capabilities.
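
Once text is represented as vectors, search reduces to ranking indexed items by similarity to the query vector. The sketch below shows that ranking step with cosine similarity. The tiny 3-dimensional vectors and the document ids are made up for illustration; real sentence embeddings would come from the pretrained model and have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vector, indexed):
    """Return document ids ordered from most to least similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), doc_id)
        for doc_id, vector in indexed.items()
    ]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


# Toy vectors standing in for real sentence embeddings.
indexed = {
    "post_hindi": [0.9, 0.1, 0.0],
    "post_tamil": [0.1, 0.9, 0.1],
}
print(rank_by_similarity([1.0, 0.2, 0.0], indexed))
```

Because a multilingual model places sentences with similar meaning close together regardless of language, the same ranking step works across languages.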

Infrastructure

  • We've been optimizing our infrastructure costs by taking advantage of Kubernetes features. One such optimization came in the form of the Elasticsearch operator, which helped us host Elasticsearch in our own cluster instead of using 3rd party managed offerings.
---

Sprint ending Dec 4 2020


Engaging Everyday Chat App Users:

  • Recruited more users for the Khoj pilot
  • Continued forwarding existing fact checks in response to their queries and shared a digest of curated fact checking articles every other day.

Search

  • Set up a Kubernetes volume to persistently store word2vec vectors in our infrastructure
  • Deployed Tattle Search and an Elasticsearch cluster in our Kubernetes cluster

Data Science

  • Wrote a script to enhance our weekly data analysis that will cluster all the duplicate/near-duplicate images scraped during that week - tattle-made/data-experiments
  • Trained an XGBoost claims detection model on our annotated social media dataset with nearly 80% cross-validation accuracy - tattle-made/content-relevance
  • Modified our 'Themes in Factchecking Articles' dashboard generation script to store the English translations of article headlines in our database as we generate them - tattle-made/data-experiments
  • This will allow for better data analysis as the Indian language factchecking articles in Tattle's archive will become searchable with English queries
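
Grouping duplicate and near-duplicate images, as in the weekly clustering script above, is often done by comparing compact image hashes. The sketch below shows only the grouping step, under assumptions: each image is presumed to already have an integer perceptual hash, and the threshold, names and union-find approach are illustrative choices, not a description of the script in tattle-made/data-experiments.

```python
def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def cluster_near_duplicates(hashes, max_distance=4):
    """Group ids whose hashes are within `max_distance` bits, transitively."""
    parent = {key: key for key in hashes}

    def find(key):
        while parent[key] != key:
            parent[key] = parent[parent[key]]  # path halving
            key = parent[key]
        return key

    ids = list(hashes)
    for i, first in enumerate(ids):
        for second in ids[i + 1:]:
            if hamming(hashes[first], hashes[second]) <= max_distance:
                parent[find(first)] = find(second)

    clusters = {}
    for key in ids:
        clusters.setdefault(find(key), []).append(key)
    return sorted(sorted(group) for group in clusters.values())


hashes = {"a.jpg": 0b11110000, "b.jpg": 0b11110001, "c.jpg": 0b00001111}
print(cluster_near_duplicates(hashes, max_distance=2))
```

The pairwise comparison is quadratic in the number of images, which is acceptable for a week's worth of data but would need an index for much larger sets.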

Research

  • Submitted a collaborative data annotation activity proposal to Mozfest

Infrastructure

  • Created a helper module for our various services to connect to their respective Mongo database with one line of code - https://github.com/tattle-made/sharechat-scraper/blob/master/db_config.py
  • This is part of our plan to have a common set of helper modules and functions for our web scrapers, search engines and WhatsApp chat archiver
  • Did a proof of concept for complex data routing with RabbitMQ, which sends different data from Pandas dataframes to different queues and consumers based on routing keys for each type of data - tattle-made/pipelines


Sprint Ending Nov 8 2020


Search

Data science

  • Deployed a Luigi data pipeline cron job to process multimedia posts from our database and flag them if they contain keywords that appear frequently in fact-checking articles
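
The flagging step inside such a pipeline can be sketched as a simple keyword match. This is a hypothetical illustration that leaves out Luigi and the database entirely; the function name, the whitespace tokenization and the sample keywords are all assumptions.

```python
def flag_post(text, keywords):
    """Return the sorted keywords that occur as whole words in `text`."""
    tokens = {token.strip(".,!?;:\"'()") for token in text.lower().split()}
    return sorted(keyword for keyword in keywords if keyword.lower() in tokens)


frequent_keywords = {"vaccine", "lockdown"}
print(flag_post("New vaccine rumour spreads.", frequent_keywords))
```

A post would be flagged when the returned list is non-empty. Multi-word keywords would need phrase matching, which this sketch does not attempt.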

Engaging Everyday Chat App Users 

  • Recruited 3 users (7 more pending) for the Khoj Pilot Study. All users are women aged 40-60.
Khoj Ad english Khoj Ad english
  • Created WhatsApp-sharing-friendly images.
  • Created a draft of the guiding document for the pilot


Sprint Ending Oct 26 2020


Search

Data science

Engaging Everyday Chat App Users 

  • Brainstormed with the team to gather everyone’s ideas about the scope of the Khoj Pilot with 40-60 year old users. Defined categories of behavioural nudges that would be effective with our demographic in slowing the spread of misinformation.


Sprint Ending Oct 11 2020


Infrastructure

Engaging Everyday Chat App Users in Verification

Search

  • We committed scripts for a) bulk indexing media from datasets into our simple realtime search engine based on date range or random batch selection, b) reporting the success/failure of each one back to the database via an additional RabbitMQ queue and receiver and c) retrying failures - https://github.com/tattle-made/sharechat-scraper/tree/development
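
The retry-and-report flow described above can be sketched in-process as follows. This is a simplified, hypothetical illustration: the RabbitMQ queue and the database write are replaced by two returned lists, and `fake_index` is a stand-in for the real indexing call.

```python
def index_with_retries(items, index_one, max_attempts=3):
    """Index each item, retrying failures; return (succeeded, failed) lists."""
    succeeded, failed = [], []
    for item in items:
        for attempt in range(1, max_attempts + 1):
            try:
                index_one(item)
            except Exception:
                if attempt == max_attempts:
                    failed.append(item)  # give up and report the failure
            else:
                succeeded.append(item)
                break
    return succeeded, failed


# Usage with a stand-in indexer: "flaky" fails once, "bad" always fails.
calls = {}


def fake_index(item):
    calls[item] = calls.get(item, 0) + 1
    if item == "bad" or (item == "flaky" and calls[item] == 1):
        raise RuntimeError("indexing failed")


succeeded, failed = index_with_retries(["ok", "flaky", "bad"], fake_index)
print(succeeded, failed)
```

In a queue-based setup the two lists would instead become status messages published for a receiver to write back to the database.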

Data science

FOSS contributions



Sprint Ending 27th September 2020


Infrastructure

Search

  • We created manifest files for our real time search engine so that we could deploy it to our Kubernetes cluster. We will upload them to GitHub shortly, after removing the confidential fields. Installed RabbitMQ into our cluster via Helm charts and created a custom service to expose its web management UI
  • Committed scripts to index images and videos from our database into our simple real time search engine. This will allow exact image / video search for media that has circulated on social networks. The search will also return information about the matched media’s source - https://github.com/tattle-made/simple-rt-search/tree/development

WhatsApp Scraper

  • We incorporated Service Accounts into our WhatsApp Scraper. This makes it easy for others to archive WhatsApp groups that they are part of. Once someone exports the chat from a WhatsApp group to their Google Drive, they can share that folder with our service account’s email id. This enables our bot to scrape the data and archive it. Email us at [email protected] if you are part of any WhatsApp group whose content you wish to archive.

Data science

  • Launched an interactive dashboard for exploring themes in factchecking articles we scrape each week - https://services.tattle.co.in/khoj/dashboard
  • Began testing and tuning our multimodal content relevance classifier on a dataset of 2000 Hindi social media posts that we annotated in-house

FOSS contributions

  • Shoutout to @duggalsu for fixing a bug in one of our social media scrapers that was causing ‘reposted’ media to be saved incorrectly


Sprint Ending 4th September 2020


Build a high quality Sharechat dataset

  • We annotated 2000 Sharechat posts for 7 categories with moderate to excellent inter-annotator agreement.
  • The annotated dataset will be released later along with details about the sampling strategy, categories and methodology.

Data science

  • We have started sharing weekly data insights in the form of data visualisations on our public channels. These show evolving trends and viral content (including some viral misinformation) on Indian social media and chat apps. GitHub commit - https://github.com/tattle-made/data-experiments/blob/master/eda_insights_templates.ipynb

  • We worked on a simple realtime search engine that can be used to index audio, video and images. There is a spectrum of search needs in the misinformation domain, but identifying exact duplicates of images, videos and audio and retrieving the metadata associated with them forms the bulk of the challenge. We’ve fine-tuned our simple search project as a standalone repo and added documentation for it here - https://github.com/tattle-made/simple-rt-search

Engaging Everyday Chat App Users in Verification

  • We finalized our illustrations for the app. A sneak peek here :
Khoj english Khoj hindi


Sprint Ending 21st August 2020


Build a high quality Sharechat dataset

  • We expanded the Sharechat scraper’s scope to include tag creation dates, unique tag identifiers, reported/rejected post counts and verified account status. This allows deeper analysis of temporal trends, influencer generated content and the platform’s approach to content curation within News / Politics / Health related content tags.
  • We deployed a ‘virality scraper’ that tracks the likes, shares and views of fresh Sharechat posts from day 0 to day 5 of their lifespan and will provide insights into the life cycle of posts containing misinformation. For instance, this scraper would let us identify content that goes viral in 24 hrs.

Engaging Every-Day Chat App Users in Verification

  • Search plays a crucial role in reducing manual effort across different aspects of fact checking. Using Elasticsearch we are able to search across different textual fields across our app - user query, community response, metadata etc. Youtube Demo
  • We used Appbaseio to prototype quickly and efficiently.


Sprint Ending 7th August 2020


Stabilize Infrastructure

  • Archive server ReplicaSet has been deployed successfully on the k8s dev cluster, with a single redis pod
  • With this, all the primary PoCs for Kubernetes are completed, and the basic streamlined deployment pipeline is in place
  • We stumbled upon a strange bug wrt redis deployment that hasn’t been fixed yet. Details here.
  • We’ve set up Sematext to monitor our app logs and infrastructure health. This enables us to debug deployed software in real time and builds the foundation for optimizing the resources taken up by our software.
  • Monitoring of the ShareChat Scraper has been implemented in the Sematext app

Sematext

Engaging Every-Day Chat App Users in Verification

  • Further progress was made on the Community Response section of the app. We now support showing responses of the following types - text, image, url.
Khoj WIP
  • The query submission page underwent a visual design revamp. Usability concerns regarding the “Choose from screenshot” and “Use Recently Copied Text” persist and are up for feedback.
Submission
  • Corresponding to this is the Community UI that lets any verified member of the community answer queries coming from Khoj users. Youtube Demo The section of interest is the Response section, which lets us respond to a user with text, an image and the URL of an article. We are opening up the Community UI for a closed pilot. If you would like to join a community that responds to queries about WhatsApp messages/misinformation for uncles and aunties around the country, ping us.

  • We have added the following illustrations in the onboarding screens for the Khoj App. They are still undergoing modifications

onboarding_1 onboarding_2

Create a High Quality WhatsApp Dataset

  • The WhatsApp data dump contains user phone numbers that we don’t want to store in our database. We have implemented a technique to obfuscate phone numbers in a way that preserves privacy but still allows some basic analysis
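
One common way to obfuscate identifiers while keeping them usable for analysis is keyed hashing: the same phone number always maps to the same token, so per-sender counts still work, but the number itself is never stored. The sketch below illustrates that approach with Python's standard `hmac` module. It is an assumption about how such obfuscation can be done, not a description of the technique Tattle implemented, and the function name and key are hypothetical.

```python
import hashlib
import hmac


def pseudonymize_phone(phone, secret_key):
    """Map a phone number to a stable token that does not reveal the number."""
    # Normalise formatting so "+91 98765 43210" and "919876543210" match.
    digits = "".join(ch for ch in phone if ch.isdigit())
    digest = hmac.new(secret_key, digits.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


key = b"demo-key-keep-the-real-one-secret"  # hypothetical key for illustration
token = pseudonymize_phone("+91 98765 43210", key)
print(token)
```

The key matters: phone numbers come from a small space, so an unkeyed hash could be reversed by trying every number, whereas a secret key prevents that as long as it stays private.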

Build a high quality Sharechat dataset

  • We have integrated duplicate detection into our scraping cron jobs and are now archiving 5k unique posts per day. We have started tracking the sources of the duplicates to help improve our scraper targeting.
  • We are storing HTML previews of the daily scraped posts for the convenience of journalists, researchers and anyone else who’d like to work with our data. These include media thumbnails and metadata in a tabular format.


Sprint Ending 24th July 2020


Stabilize Infrastructure

  • Kubernetes was configured to trigger from Github actions
  • Sharechat Scraper Service now gets built, uploaded, and deployed to 2 k8s replica pods on commit
  • SCS cron job is deployed and tested (awaiting testing for re-deployment with static Docker image tag)
  • SCS REST server CICD on k8s is implemented and tested
  • Khoj API is deployed and CICD on k8s is implemented and tested
  • You can see the Github workflow files for the ShareChat scraper REST API and Khoj API here and here

Engaging Every-Day Chat App Users in Verification

  • We’re exploring different ways to engage users of the Khoj Android app. The app now supports showing multiple types of community response. In the image you see a red box that we are calling the “summary_card”, which summarizes the query and response in a shareable byte. You also see a feedback section that lets a user tell us if they are happy with our response.
Khoj UI
  • Introducing Ruhi Maasi: we have been using this stakeholder internally as a stand-in for the WhatsApp user we want to get to eventually. We now have a visual representation of her that we plan to use in our product illustrations
Ruhi Masi

Create a High Quality WhatsApp Dataset

  • The WhatsApp scraper community UI now enables users to moderate scraped WhatsApp messages. This includes deleting useless messages and tagging messages. Youtube Demo

Build a high quality ShareChat dataset

  • We now have over 200k unique ShareChat posts along with their metadata in our database. This includes images, videos and text from tags related to Politics, Health and Patriotism in Hindi, Marathi, Bengali and Rajasthani.

Content Relevance Pipeline

  • We have completed building a multimodal machine learning pipeline that can classify if a piece of multilingual multimedia is relevant to the Tattle archive with ~80% accuracy. Relevant is defined as that which could be potential misinformation or is of historical value. This will help us triage new data we collect from WhatsApp and other sources.