- Abdul Sattar Mapara
- Saket Chopade
- Rohan Salvi
- Pritam Kumar Sahoo
- Dr. U.A. Deshpande Sir (VNIT, Nagpur)
- Dr. Sagar Sunkle Sir (TRDDC, Pune)
The aim of the project is to gather time-stamped factual information about a given topic/entity from a given set of documents (Brokerage Reports).
More precisely, given a set of documents (brokerage reports in PDF format) about a company, a bank, or any other organization, published over a period of 1-2 years, the system is expected to extract factual information about that entity in the form of semi-structured statements and classify each statement as an increasing or decreasing trend. The extracted facts are then grouped by date/month.
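As an illustration of the expected output, the sketch below shows one possible way to hold time-stamped facts and group them by month. The `Fact` class, its field names, the `group_by_month` helper, and the sample values are all assumptions made up for this example; they are not taken from the project's source code.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Fact:
    """One extracted semi-structured statement (hypothetical layout)."""
    published: date  # date the statement is attributed to
    subject: str
    verb: str
    obj: str
    trend: str       # "increasing" or "decreasing"


def group_by_month(facts):
    """Group facts under a 'YYYY-MM' key, in chronological order."""
    grouped = defaultdict(list)
    for fact in sorted(facts, key=lambda f: f.published):
        grouped[fact.published.strftime("%Y-%m")].append(fact)
    return dict(grouped)


# Made-up sample values, for illustration only.
facts = [
    Fact(date(2019, 2, 11), "net profit", "rose", "12% year on year", "increasing"),
    Fact(date(2019, 1, 28), "operating margin", "declined", "to 18%", "decreasing"),
    Fact(date(2019, 2, 3), "loan book", "grew", "9%", "increasing"),
]
timeline = group_by_month(facts)
for month, items in timeline.items():
    print(month, [(f.subject, f.trend) for f in items])
```

Sorting before grouping keeps both the months and the facts within each month in date order, which is convenient when the result is rendered as a timeline.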
- Collecting and Processing the reports
  - Brokerage Reports collected from trendlyne.com
  - PDF -> Text conversion
  - Text -> Sentence (Sentence Tokenization)
  - Pass through spaCy pipeline for tokenization (into tokens), lemmatization, Part of Speech Tagging, Dependency Parse tree generation, Named Entity Recognition
- Extraction of Date/Timestamp
  - Using Named Entity Recognition
  - Using Metadata associated with the reports
- Extraction of Facts in the form of Semi Structured Statements
  - Using Textacy library
  - Using Dependency Parse tree generated by spaCy (Custom Approach)
  - Explored relation extraction using Stanford Open IE
- Sentiment Analysis (Sentence Classification)
  - Dictionary based approach
  - Machine learning based approach (using Support Vector Machines)
  - Deep learning based approach (using Convolutional Neural Networks)

  Note: Conversion of words to numbers done using custom word2vec model
- Application (using Flask framework) for demonstration of the project
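
To give a feel for the dictionary based approach listed above, here is a minimal sketch that labels a sentence as an increasing or decreasing trend by counting cue words. The word lists, the `classify_trend` function, and the `"unknown"` fallback for ties are assumptions for illustration; the project's actual lexicon and scoring rule may differ.

```python
import re

# Hypothetical cue-word lists; a real lexicon would be much larger.
INCREASING = {"rose", "rise", "grew", "growth", "increase", "increased",
              "improved", "higher", "surged"}
DECREASING = {"fell", "fall", "declined", "decline", "decrease", "decreased",
              "dropped", "lower", "shrank"}


def classify_trend(sentence):
    """Return 'increasing', 'decreasing', or 'unknown' from cue-word counts."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    score = sum(t in INCREASING for t in tokens) - sum(t in DECREASING for t in tokens)
    if score > 0:
        return "increasing"
    if score < 0:
        return "decreasing"
    return "unknown"  # no cue words, or the two counts cancel out


print(classify_trend("Net profit rose 12% year on year."))
print(classify_trend("Operating margin declined to 18%."))
```

A lookup like this needs no training data, but it ignores negation and context, which is one reason to compare it against the SVM and CNN classifiers.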
This repository contains the source code written during the project for accomplishing the required tasks and experimentation.
This branch (master) contains the source code of the application developed for demonstration.
Video - Download Final-Year-Project-Demo