UCSD MAS DSE203 final project.
The project tries to answer questions/queries in the area of innovation by combining multiple data sources.
The focus of this project is to demonstrate ETL (Extract, Transform, Load) concepts. The goal of the project is to use multiple data sources of disparate formats (structured, semi-structured, and unstructured).
Sagar Jogadhenu, Prakhar Shukla, Laben Fisher
Build a knowledge base to facilitate easy querying of research areas and to link companies working in them
- Partnership Targeting: Find synergies between companies working in similar areas/technologies
- Money Flow/Research Trends: Understand where investments are being prioritized
- Understanding effects of social policy on investments: How much funding disadvantaged or minority-owned small businesses are receiving
It is recommended to just run the neo4j_graph_etl.ipynb notebook to generate the graph database and to run the associated queries. If that is all that is needed, skip this complete installation section and install only the packages needed for that notebook. The reason is that this project takes a modular approach, with intermediate data files generated after each stage of the pipeline. All intermediate files have already been generated, so the final module, which creates the graph database and runs the queries, can be executed without rerunning the other notebooks. If the desire is to run the complete pipeline end to end, follow the rest of this readme file, then execute the notebooks in the following order:
- Ensure all input files are available. Some input files, such as the patents XML, need to be downloaded using the URLs provided above
- Run the following notebooks (in any order) from the preprocessing folder to create the technical dictionary:
- process_ieee_thesaurus_acm_terms.ipynb - Produces tech_terms.txt in preprocessed_files folder
- generate_nontech_terms.ipynb - Produces non_tech.txt in preprocessed_files folder
- Run the tech_term_classifier.ipynb notebook in the model folder. It produces a model file, trained_tech_classifier_model.joblib. Note that we manually compressed this file to upload it to GitHub, so you will only see the zip file; the generated file is too large to upload as is (see the snippet after this list for unpacking and loading it).
- Run the following notebooks from the preprocessing folder in any order:
- process_patent_xml.ipynb - Generates patents.json in preprocessed_files folder
- process_sbir_csv.ipynb - Generates sbir_1k_sample.csv in preprocessed_files folder
- llama_similarity.ipynb - Generates llama_similarity.csv in preprocessed_files folder
- Finally, run neo4j_graph_etl.ipynb from the top-level folder. It generates a knowledge graph that can be accessed either via the notebook or via Neo4j Desktop.
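For reference, the compressed classifier can be unpacked and loaded along these lines (a minimal sketch; the paths assume the repository layout described in this readme):

```python
import zipfile
import joblib

# Unpack the compressed classifier (committed as a zip because the raw
# .joblib file is too large for GitHub), then load it with joblib.
with zipfile.ZipFile("model/trained_tech_classifier_model.joblib.zip") as zf:
    zf.extractall("model")

clf = joblib.load("model/trained_tech_classifier_model.joblib")
```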
If it is desired to run each notebook to validate the notebooks or to update intermediate files, then running each of these installation scripts up front will be useful. When running the preprocessing files, errors did arise when files were run multiple times and when multiple kernels were running at once. It is recommended to run one file at a time and, when done, shut down the kernel before running the next file, just to be certain there are no conflicts.
!pip install bs4
!pip install import-ipynb
!pip install jsonpath-ng
!pip install spacy
One of these two methods for installing en_core_web_lg should work for the target environment:
!pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.0.0/en_core_web_lg-3.0.0.tar.gz
!python -m spacy download en_core_web_lg
To download the en_core_sci_lg language model used for the tests, uncomment and run the following lines. The en_core_sci_lg-0.5.3 model was built against spacy 3.6.1 and will produce a warning saying it may not operate correctly with spacy 3.7.2. For this project spacy 3.7.2 is needed; the model operated without issue and can be run in this manner. Future iterations of this effort could resolve this discrepancy between versions of dependencies.
!pip install scispacy
!pip install --upgrade scipy
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_core_sci_lg-0.5.3.tar.gz
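After installation, a quick sanity check along these lines confirms both language models load (the example sentence is illustrative):

```python
import spacy

# Both models should load without errors; en_core_sci_lg may emit the
# version-compatibility warning noted above but still works.
nlp_web = spacy.load("en_core_web_lg")
nlp_sci = spacy.load("en_core_sci_lg")

doc = nlp_sci("Convolutional neural networks accelerate image classification.")
print([ent.text for ent in doc.ents])
```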
- ast
- bs4
- import_ipynb
- io
- itertools
- joblib
- json
- llama_index
- lxml
- nltk
- os
- pandas
- py2neo
- Python 3.9
- re
- requests
- scipy 1.11.4
- sklearn
- spacy 3.7.2
- subprocess
- time
- en_core_web_lg
- en_core_sci_lg-0.5.3
Note: Other files will be needed if notebooks in the deprecated folder are run. The deprecated folder contains EDA and ML tests that are not needed in the deliverable of the current baseline. They are there for reference purposes only.
These URLs are where the base-level data for this project comes from. The patent data will need to be manually downloaded and placed in the input_files folder. The SBIR Award Data will be accessed at run time, so an internet connection will be needed. The IEEE and ACM data files are already stored in the input_files folder.
Patent Data: https://bulkdata.uspto.gov/data/patent/application/redbook/fulltext/2023/ipa230720.zip
SBIR Award Data: https://data.www.sbir.gov/awarddatapublic/award_data.csv
IEEE Data: https://www.ieee.org/publications/services/thesaurus-access-page.html
ACM Data: https://csrc.nist.gov/glossary
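A hedged sketch of fetching the raw inputs (the URLs and the input_files location are taken from this readme; the patent archive is large, so expect the download to take a while):

```python
import io
import zipfile
import requests

PATENT_URL = "https://bulkdata.uspto.gov/data/patent/application/redbook/fulltext/2023/ipa230720.zip"
SBIR_URL = "https://data.www.sbir.gov/awarddatapublic/award_data.csv"

# Download and extract the patent XML bundle into input_files/.
resp = requests.get(PATENT_URL, timeout=600)
resp.raise_for_status()
with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
    zf.extractall("input_files")

# The SBIR award CSV can also be cached locally, although the notebooks
# read it directly from the URL at run time.
with open("input_files/award_data.csv", "wb") as f:
    f.write(requests.get(SBIR_URL, timeout=600).content)
```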
A key component of this project is building a classifier that can take a term/phrase and identify whether it is a technical term/phrase or not. This is useful for processing abstracts from the SBIR and patent databases and filtering out any non-technical terms. To build the classifier, we need:
- a set of technical terms, obtained by processing the IEEE Thesaurus and vocabulary from ACM. The technical terms are extracted using the notebook process_ieee_thesaurus_acm_terms.ipynb from the preprocessing folder.
- a set of non-technical terms, obtained by taking a sample of abstracts from the SBIR dataset, extracting entities using Spacy, and filtering out any technical terms. The non-technical terms are extracted using the notebook generate_nontech_terms.ipynb from the preprocessing folder. The notebook spacy_helper_methods.ipynb has Spacy-based helper functions to lemmatize and extract entities from SBIR abstracts; it is loaded by the other notebooks.
- Finally, a binary classifier using RandomForest is created. This is implemented in tech_term_classifier.ipynb from the model folder.
- The trained model trained_tech_classifier_model.joblib.zip is stored as a zip file in the model folder and can be used in subsequent stages.
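A minimal sketch of the training step is shown below; the TF-IDF character n-gram featurization and hyperparameters are illustrative assumptions, not necessarily what tech_term_classifier.ipynb uses:

```python
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Labeled vocabulary produced by the preprocessing notebooks.
tech = [t.strip() for t in open("preprocessed_files/tech_terms.txt") if t.strip()]
non_tech = [t.strip() for t in open("preprocessed_files/non_tech.txt") if t.strip()]

X = tech + non_tech
y = [1] * len(tech) + [0] * len(non_tech)

# Character n-grams cope reasonably well with short multi-word phrases.
model = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
    ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
])
model.fit(X, y)

joblib.dump(model, "model/trained_tech_classifier_model.joblib")
```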
Once the necessary preparation is done, we are ready to do entity extraction. Entity extraction is done on the patent dataset, which is in XML format, and on the SBIR dataset, which is in CSV format. For the patents XML file, additional preprocessing is done. The steps involved in extracting technical terms are the same for both datasets (see the sketch after the notebook list below):
- Lemmatize the abstract field of each dataset, and additionally the claims field in the patent dataset - Spacy is used
- Extract entities from the lemmatized text for each column - Scispacy is used
- Run the entities extracted in step 2 through the binary RandomForest classifier that was previously trained to identify technical terms
- Save the filtered results to patents.json and sbir_1k_sample.csv respectively. Note that these files only contain 1k sample records; uncomment cells in the notebooks to get the full set of records.
The notebooks used to generate technical entities are:
- process_patent_xml.ipynb
- process_sbir_csv.ipynb
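A condensed sketch of the shared extraction steps, assuming the saved classifier accepts raw strings (as in the training sketch above; the notebooks' exact feature handling and column names may differ):

```python
import joblib
import spacy

nlp_web = spacy.load("en_core_web_lg")   # used for lemmatization
nlp_sci = spacy.load("en_core_sci_lg")   # scispacy model used for entity extraction
clf = joblib.load("model/trained_tech_classifier_model.joblib")

def extract_tech_terms(abstract: str) -> list[str]:
    # 1. Lemmatize the abstract text.
    lemmatized = " ".join(tok.lemma_ for tok in nlp_web(abstract))
    # 2. Extract candidate entities from the lemmatized text.
    candidates = [ent.text for ent in nlp_sci(lemmatized).ents]
    # 3. Keep only candidates the binary classifier labels as technical.
    return [c for c in candidates if clf.predict([c])[0] == 1]

print(extract_tech_terms("A deep learning approach for radar signal processing."))
```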
Semantic similarity between abstracts of the SBIR and patent datasets is computed using LlamaIndex with OpenAI. The semantic similarity is implemented in the notebook llama_similarity.ipynb in the preprocessing folder. This notebook first creates a set of tuples, with one element of each tuple representing an SBIR record and the other element representing a patent record. The abstracts corresponding to each tuple are passed to the LlamaIndex semantic similarity function. Only tuples that pass the 0.8 similarity threshold are retained. The output is stored in llama_similarity.csv in the preprocessed_files folder.
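A hedged sketch of the scoring step, assuming the SemanticSimilarityEvaluator available in recent llama_index releases (it embeds both texts with the default OpenAI embedding model, so OPENAI_API_KEY must be set); the notebook's exact call and the example abstracts below are illustrative:

```python
import asyncio
from llama_index.core.evaluation import SemanticSimilarityEvaluator

# Embedding-based similarity; pairs below the 0.8 threshold are dropped.
evaluator = SemanticSimilarityEvaluator(similarity_threshold=0.8)

async def score_pair(sbir_abstract: str, patent_abstract: str) -> float:
    result = await evaluator.aevaluate(response=sbir_abstract, reference=patent_abstract)
    return result.score

sbir_abstract = "An autonomous drone platform for crop monitoring."            # from an SBIR record
patent_abstract = "Unmanned aerial vehicle system for agricultural sensing."   # from a patent record

score = asyncio.run(score_pair(sbir_abstract, patent_abstract))
if score >= 0.8:
    print("retain pair, similarity =", score)
```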
Now that we have generated all the necessary files, the knowledge graph in Neo4j is created by further processing the data and creating nodes and edges. This is performed by neo4j_graph_etl.ipynb at the top level. For this notebook to run, the Neo4j Desktop application may need to be open to get the connection details that need to be updated in the notebook.
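A minimal py2neo sketch of the connection and node/edge creation; the bolt URI, credentials, labels, and relationship types below are placeholders, as the actual schema and connection details come from the notebook and your Neo4j Desktop instance:

```python
from py2neo import Graph, Node, Relationship

# Connection details come from the Neo4j Desktop database (bolt URI, user, password).
graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

# Illustrative node/edge shapes; the notebook defines the actual schema.
company = Node("Company", name="Example Corp")
term = Node("TechTerm", name="machine learning")
graph.merge(company, "Company", "name")
graph.merge(term, "TechTerm", "name")
graph.create(Relationship(company, "WORKS_IN", term))

# Example query: companies linked to a technical term.
for record in graph.run(
    "MATCH (c:Company)-[:WORKS_IN]->(t:TechTerm {name: $name}) RETURN c.name",
    name="machine learning",
):
    print(record["c.name"])
```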