Plugins and project ideas collection #4338

ines · 2019-09-29T16:34:21Z

ines
Sep 29, 2019
Maintainer

I was going through the existing enhancement issues again and thought it'd be nice to collect ideas for spaCy plugins and related projects. There are always people in the community who are looking for new things to build, so here's some inspiration ✨ For existing plugins and projects, check out the spaCy universe.

If you have questions about the projects I suggested, or the spaCy plugin system in general, I should also be able to help. And if you're looking for collaborators or there's a plugin you'd love to see built, feel free to comment here as well.

Various ideas

#2466: Export spaCy models for use in Java environments
#2264: Render dependency graph with graphviz
#2625: Visualising NER activations inside spaCy models
#4441: Extracting verb phrases (VP) similar to noun phrases; see also the docs on syntax iterators

Visual Studio Code extension (#2969)

I started on a little spaCy snippets extension ages ago and never really quite finished it. But I always thought it'd be cool to have a spaCy extension with some helpers and maybe some deeper pipeline, data structures and model inspection tools. I haven't really worked with VSCode plugins (yet), but maybe someone from the community has an idea and/or experience? Would be cool to work on this together!

Wrappers for debugging pipeline components

Inspired by this Stack Overflow question: https://stackoverflow.com/a/57964354/6400719. Could be a helper that wraps the nlp object and logs processing time and other useful details. I also have a bunch of draft code I'm happy to share if someone wants to work on this. (Also see #3943 for related functionality we want to ship in spaCy.)

spaCy + Apache Beam

Thread with notebook and discussion: https://twitter.com/swartchris8/status/1194192895244480512 A package could, for instance, wrap the boilerplate code so all the user has to do is pass in an nlp object and config options (what should be extracted).

Translations of the spaCy course

The spaCy course is open-source and on GitHub and the content is released under a CC BY-NC license. Translating it to other languages could be really cool, to make it easier for people to get started 🙂

Chinese: in progress (@GoooIce)

Implemented

Pandas helpers and utilities (#3702)

I think some helpers for pandas could be a nice spaCy plugin? We wouldn't want to ship anything that depends on pandas in the core library, but I can totally see a little helper library that depends on spaCy and pandas and includes useful functions to represent a spaCy Doc as a dataframe.

✅ See: https://github.com/yash1994/dframcy

Project starter as GitHub repo template

GitHub now supports template repos, so it could be cool to have a "spaCy project starter" template that's set up as a Python package, includes some basic scaffolding around loading models and processing texts, and maybe exposes a small REST API using FastAPI.

microsoft/cookiecutter-spacy-fastapi by @kabirkhan
See also the project templates for spaCy v3 at https://github.com/explosion/projects, and in particular https://github.com/explosion/projects/tree/v3/integrations/fastapi

GoooIce · 2019-10-17T03:48:13Z

GoooIce
Oct 17, 2019

I am translating the course into Chinese:
goooice/spacy-course
course.spacy.cn.miantu.net

yash1994 · 2019-10-17T08:47:08Z

yash1994
Oct 17, 2019

I've made a utility module to integrate Pandas Dataframe with spaCy. https://github.com/yash1994/dframcy

ines · 2019-10-18T09:05:13Z

ines
Oct 18, 2019
Maintainer Author

@GoooIce Woooow, this is really cool! Let me know if you have questions or need help. If you give me the text, I can also make a Chinese version of the logo 😃

@yash1994 Nice, thanks for sharing! Do you want to submit it to the spaCy Universe (see here for details)?

Also, one small suggestion: I think it'd be cleaner if your custom classes like DframCy took a loaded nlp object instead of just the model name. Users often want to load their models in a custom way, decide what to enable/disable or use a blank language class instead. If you let the user load the nlp object themselves, they have full flexibility, and your wrapper won't have to consider all possible options under the hood. It also makes it easier to reuse the same nlp object.
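A minimal sketch of the design suggested above, not dframcy's actual API: the class name `TokenTable`, its `attrs` parameter and the stand-in `fake_nlp` are all hypothetical. The wrapper only assumes it receives an already-loaded callable that turns text into an iterable of token-like objects, so the snippet runs without a model. With spaCy you would pass in whatever `spacy.load(...)` or `spacy.blank(...)` returned.

```python
from types import SimpleNamespace


class TokenTable:
    """Hypothetical wrapper: takes a loaded nlp object, not a model name."""

    def __init__(self, nlp, attrs=("text", "pos_")):
        self.nlp = nlp  # the caller decides how the pipeline is loaded
        self.attrs = tuple(attrs)

    def to_rows(self, text):
        """Return one dict per token; pandas.DataFrame(rows) accepts this shape."""
        doc = self.nlp(text)
        return [{attr: getattr(token, attr) for attr in self.attrs} for token in doc]


def fake_nlp(text):
    """Stand-in for a spaCy pipeline so the example runs without a model."""
    return [SimpleNamespace(text=word, pos_="X") for word in text.split()]


table = TokenTable(fake_nlp)
rows = table.to_rows("hello spaCy world")
print(rows[0])
```

Because the wrapper never loads anything itself, the same nlp object can be shared between several wrappers, and disabling components or using a blank language class stays entirely in the caller's hands.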

yash1994 · 2019-10-18T09:25:18Z

yash1994
Oct 18, 2019

Thank you @ines, for your suggestions. I understood the point you've made, will make necessary changes in the code and submit a pull request for spaCy universe submission. Thanks again for your time.

kabirkhan · 2019-10-21T21:44:57Z

kabirkhan
Oct 21, 2019

@ines It's not a GitHub template repo, but it's a pretty great start with Cookiecutter:
https://github.com/microsoft/cookiecutter-spacy-fastapi

The API follows the rather opinionated API request/response format of Azure Search Cognitive Skills cause Microsoft

PR to add to universe is here:
#4498

kabirkhan · 2019-10-21T21:46:47Z

kabirkhan
Oct 21, 2019

If there's interest in a Template Repo I can also contribute that pretty easily.

@ines I'm super interested in working on the debugging of pipeline components. I wrote a quick wrapper around the Language class to time pipeline steps and I've found it to be really useful despite its hacky nature. Would love to see the draft code you mentioned and I can start working on that.

ines · 2019-10-22T13:19:03Z

ines
Oct 22, 2019
Maintainer Author

@kabirkhan Thanks – just shared the cookiecutter template on Twitter!

And here's one draft of a SpacyDebugger – it's actually more comments and TODOs than actual code, but it outlines a few ideas I've had (like, having the debugger store the metadata via extension attributes on the Doc so you can process a bunch of objects and then analyse them later).

import copy
import datetime
from spacy.tokens import Doc


class SpacyDebugger(object):
    def __init__(self, nlp):
        self.orig_nlp = nlp
        self.nlp = self.wrap_pipeline(nlp)
        Doc.set_extension("debug_start_times", default={})
        # TODO: add extension method that calculates execution time based on
        # start and end for given component, e.g. doc._.debug_exec_time("ner")
        # TODO: method on Doc that writes everything to a log file?

    def make_debug_component(self, name):
        def debug_component(doc):
            # TODO: add option to not print but store timestamp in extension
            # attributes on the Doc (for each component) in 
            # doc._.debug_start_times
            # TODO: use logging module instead of print
            # TODO: option to generate visualization?
            print(f"Before '{name}'", datetime.datetime.now().timestamp())
            return doc

        return debug_component

    def wrap_pipeline(self, nlp):
        nlp = copy.deepcopy(nlp)
        # We don't want to modify this while we're looping over it
        pipeline = list(nlp.pipeline)
        for name, pipe in pipeline:
            debug_component = self.make_debug_component(name)
            nlp.add_pipe(debug_component, before=name, name=f"debug_{name}")
            # TODO: add component after and also log end times
        return nlp
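One possible way to cover the "log end times" TODO is to wrap each component instead of inserting extra components around it. The sketch below is an alternative to the draft, not a completion of it: the helpers `timed` and `run_pipeline` are hypothetical, and the components are plain `doc -> doc` callables working on a list of strings, so it runs without spaCy.

```python
import time


def timed(name, component, timings):
    """Wrap a `doc -> doc` callable and record its wall-clock duration."""

    def wrapper(doc):
        start = time.perf_counter()
        result = component(doc)
        # Append, so timings from several docs can be analysed later
        timings.setdefault(name, []).append(time.perf_counter() - start)
        return result

    return wrapper


def run_pipeline(pipeline, doc):
    """Apply (name, component) pairs in order, like looping over nlp.pipeline."""
    for _, component in pipeline:
        doc = component(doc)
    return doc


# Hypothetical stand-in components operating on a plain list of strings
pipeline = [
    ("lower", lambda doc: [t.lower() for t in doc]),
    ("dedupe", lambda doc: sorted(set(doc))),
]
timings = {}
wrapped = [(name, timed(name, comp, timings)) for name, comp in pipeline]
result = run_pipeline(wrapped, ["B", "a", "b"])
print(result, {name: len(values) for name, values in timings.items()})
```

Collecting durations in a dict keyed by component name keeps the measurements separate from the processed objects. Storing them on the Doc via extension attributes, as the draft proposes, would be the natural next step.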

skvrahul · 2020-02-28T04:26:33Z

skvrahul
Feb 28, 2020

@ines I also would like to work on contributing to the Debugging Wrapper. Could I take that up?

093093 · 2020-06-22T14:03:41Z

093093
Jun 22, 2020

Hi,
I am willing to contribute to this project. May I get a module to work on?

yash1994 · 2020-06-26T15:30:49Z

yash1994
Jun 26, 2020

Hi all,
To leverage spaCy's clean APIs for text annotation/processing in Golang, I've built a Golang wrapper module, spacy-go, using gRPC. Suggestions and PRs are welcome.

kr-prince · 2020-08-07T18:54:35Z

kr-prince
Aug 7, 2020

@ines I have some ideas for enhancing the phrase matcher and token matcher, such as conditional matches and letting the user control whether nested matches are allowed. I also have a fair amount of experience in Python and would like to contribute. How do I take this forward?

rohts-patil · 2020-08-26T13:44:25Z

rohts-patil
Aug 26, 2020

@ines
Is there any way to export spaCy models for use in Java, as described in #2466? I see that issue has been closed.

svlandeg · 2020-10-21T14:09:21Z

svlandeg
Oct 21, 2020
Maintainer

Hi @rohts-patil: issue #2466 has been closed and merged with this master thread, to keep a better overview of projects that people could work on. Nobody has taken that particular challenge up yet though, as far as I know.

May I get a module to work on?

@093093 We typically don't really "assign" issues to people. You can find various project ideas in this thread, and you can start working on a PR if you're interested in solving any of them :-)

randomgambit · 2020-10-27T21:02:30Z

randomgambit
Oct 27, 2020

Hello there, I am following up on #4441 and I am happy to share my working code.

This seems to correctly extract the VP by exploiting the natural ordering of the tokens in the sentence as well as using the dependency tree. Please let me know if you find it useful or if you see some improvements!

Thanks!

import spacy

# Load the small English model
# Note: despite the comment above, en_core_web_lg is the large English model
nlp = spacy.load("en_core_web_lg")

text = '''
After many years Spacy has suddently become a monster-package in the NLP world.
'''
doc = nlp(text)

getVP(doc, r'Spacy')
Out[58]: ['has suddently become a monster']

The annotated function is shown below:

import re


def getVP(nlpdoc, mytoken):
    
    mylist = []
    
    patt = re.compile(mytoken)
    
    for token in nlpdoc:
        if token.pos_ == 'VERB' or token.pos_ == 'AUX':
            #print('    ')
            print(token.text)
            #print('    ')
            #get children on verb/aux
            nodechild = token.children
            getchild1 = []
            getchild2 = []
            #iterate over the children
            for child in nodechild:
                getchild1.append(child)
                #get children of children
                listchild = list(child.children)
                for grandchild in listchild:
                    getchild2.append(grandchild)
            #print('children are ' + str(getchild1)) 
            #print('grandchildren are ' + str(getchild2))
            #check if the target token is a child or a grandchild of the verb
            test1 = [patt.search(tok.lemma_) for tok in getchild1]
            test2 = [patt.search(tok.lemma_) for tok in getchild2]
            #if YES, then parse the VP
            if any(test1) or any(test2):
                
                fulltok = token.text
                myiter = token
                #the VP can actually start a bit before the VERB, so we look for the leftmost AUX/VERBS
                candidates = [lefty for lefty in token.lefts]
                candidates = [lefty for lefty in candidates if lefty.pos_ in ['AUX', 'VERB']]
                #if we find one, then we start concatenating the tokens from there
                if candidates:
                    fulltok = candidates[0].text
                    myiter = candidates[0]

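                #stop at the last token of the doc, where nbor() would raise an IndexError
                while myiter.i + 1 < len(myiter.doc) and myiter.nbor().pos_ in ['VERB','PART','ADV','ADJ','ADP','NUM','DET','NOUN','PROPN','AUX']: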
                    fulltok = fulltok + ' '+ myiter.nbor().text
                    myiter = myiter.nbor()
                mylist.append(fulltok)
    return mylist

randomgambit · 2020-10-28T13:49:52Z

randomgambit
Oct 28, 2020

@svlandeg let me know if you have any questions. I know that for some convoluted examples the function can return VPs and sub-VPs (much like the NP) and I think this could be fixed by indexing the start-end of the largest VP.
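The "index the start-end of the largest VP" idea could look roughly like the sketch below. It works on plain `(start, end)` token offsets (end exclusive) rather than spaCy objects, and the helper name `keep_largest` is hypothetical. As far as I know, `spacy.util.filter_spans` applies a similar longest-first rule to `Span` objects, so it may be usable directly if the function returned spans instead of strings.

```python
def keep_largest(spans):
    """Keep the longest (start, end) spans; drop any span overlapping a kept one."""
    # Longest first; ties are broken by the earliest start
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept, seen = [], set()
    for start, end in ordered:
        # Skip the span if any of its token positions is already covered
        if not any(i in seen for i in range(start, end)):
            kept.append((start, end))
            seen.update(range(start, end))
    return sorted(kept)


# Hypothetical candidate VPs: (4, 6) and (5, 8) are sub-spans of (3, 8)
candidates = [(3, 8), (4, 6), (5, 8), (10, 12)]
print(keep_largest(candidates))
```

Sorting by length before filtering means a sub-VP can never displace the larger VP that contains it, which is the behaviour described above.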

svlandeg · 2020-10-28T18:57:30Z

svlandeg
Oct 28, 2020
Maintainer

Hi @randomgambit! Would you feel like submitting a PR with this code? That is easier to review ;-) Perhaps run the algorithm on the example English sentences, and show the output?

randomgambit · 2020-10-28T18:59:22Z

randomgambit
Oct 28, 2020

Hi @svlandeg, I would love to but have never submitted a PR. Perhaps you could create one, and I will be happy to run it on the example English sentences if you show me where they are!

svlandeg · 2020-10-28T19:12:27Z

svlandeg
Oct 28, 2020
Maintainer

There's a first time for everything! ;-) If you feel like figuring out how it works, there's some documentation on contributing here: https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md#contributing-to-the-code-base

In a nutshell, you'll want to first create your own fork of spaCy, create a new branch from master, and start pushing your commits to that local branch of yours. The VP chunker code should probably live in https://github.com/explosion/spaCy/blob/master/spacy/lang/en/syntax_iterators.py and could follow the general outline of def noun_chunks, and be stored in SYNTAX_ITERATORS["verb_chunks"].

Ideally, you could also extend the unit tests to include some verb chunking, cf for instance here: https://github.com/explosion/spaCy/blob/master/spacy/tests/lang/en/test_noun_chunks.py
I guess we could either rename that file to "test_syntax_iterators", or have a second file next to it that's called "test_verb_chunks", either way you could use it again as a sort of template.

Then when you've written all code and it looks good to you, you can go to your repo and create a PR against spaCy's master branch, and describe the algorithm you've implemented. That's also where you could include the output on the English example sentences, or any other examples you've tested this on. I typically find it easier to review with some examples & explanation.

If you're still a bit unsure whether everything is right, you can choose the option "Create draft pull request" instead of the default "Create pull request". As long as the PR is in draft, we'll assume you're working on it and let you finish that first.

I think it could be a nice learning opportunity, and it would make sure the contribution is properly attributed to you. But let us know if you don't have the time/means to create the PR right now ;-)

randomgambit · 2020-12-09T17:22:45Z

randomgambit
Dec 9, 2020

Hello there @svlandeg and the great Spacy team!

I have been thinking about this, and I think I need some help for the following issue. Consider the following simple sentence:

expert spacy users, as measured by the recent polls, have been very kind

I would like to extract the following information

expert spacy users have been very kind

This is Subject + Verb essentially, but I am struggling a bit to write this correctly.

I think the idea would be to look for a VERB (here "been") and then search for dependents of the VERB that are NOUNs. Finally, add all the ADJ or PROPN dependents of that noun. But that seems a bit clunky to me, and I wonder if there is something much simpler that can be done here with Spacy.

Any ideas?

Thank you!!
