Creation of a Classifier that would have both WEKA and regex-based classifier in consideration #77

Open
1130695 opened this issue May 18, 2017 · 11 comments
@1130695

1130695 commented May 18, 2017

As in the title. It would be pretty sweet if we could build a classifier that classifies a page using both regex and WEKA. This would help us get more precise output.

We could start off by building one with the "baseline" behaviour and adapt it from there.

@aecio, you said this was somewhat easy to do, but I don't know exactly how the weka classification options you have available work. Which ones would work best together? Authority with BS?

@aecio
Member

aecio commented May 18, 2017

Are we talking about a new link classifier? I assumed it was a target page classifier.

@1130695
Author

1130695 commented May 19, 2017

My bad, we're talking about a new target page classifier. I was kinda sleepy -.- Forget about the link classification example.

@aecio
Member

aecio commented May 23, 2017

I was thinking of having a more generic classifier that can combine a list of any other existing classifiers. For example, it would be configured like this:

type: combiner
parameters:
  boolean_operator: OR
  classifiers:
    - type: url_regex
      parameters:
        regular_expressions: [
          "https?://www\\.somedomain\\.com/forum/.*",
          ".*/thread/.*",
          ".*/archive/index.php/t.*",
        ]
    - type: weka
      parameters:
        features_file: pageclassifier.features
        model_file: pageclassifier.model

The classifiers key could be a list of any available target page classifiers, each along with its parameters.
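
To make the proposed behaviour concrete, here is a minimal Python sketch of a flat combiner (not the project's actual implementation). Everything in it is an assumption for illustration: the page is a plain dict, url_regex is assumed to require a full match of the URL against any one pattern, and fake_weka is a made-up stand-in for a trained weka model.

```python
import re
from typing import Callable, List

# Hypothetical page representation: {"url": ..., "body": ...}
Page = dict
# A classifier is modeled as a function from a page to True (relevant) or False.
Classifier = Callable[[Page], bool]


def url_regex(patterns: List[str]) -> Classifier:
    """Relevant if the URL fully matches any pattern (assumed semantics)."""
    compiled = [re.compile(p) for p in patterns]
    return lambda page: any(rx.fullmatch(page["url"]) for rx in compiled)


def combiner(boolean_operator: str, classifiers: List[Classifier]) -> Classifier:
    """Combine classifiers with one operator: OR means any, AND means all."""
    if boolean_operator == "OR":
        return lambda page: any(c(page) for c in classifiers)
    if boolean_operator == "AND":
        return lambda page: all(c(page) for c in classifiers)
    raise ValueError(f"unsupported boolean_operator: {boolean_operator}")


def fake_weka(page: Page) -> bool:
    """Stand-in for the weka classifier; a real one would load the model files."""
    return "forum" in page["body"].lower()


forum_urls = url_regex([
    r"https?://www\.somedomain\.com/forum/.*",
    r".*/thread/.*",
    r".*/archive/index.php/t.*",
])
either = combiner("OR", [forum_urls, fake_weka])
both = combiner("AND", [forum_urls, fake_weka])
```

With OR, a page is accepted as soon as one classifier accepts it, so the regex list can catch pages the model misses and vice versa. With AND, the regex acts as a filter in front of the model.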

@julianafreire
Member

I like the idea. This would be very useful.
Are you planning to support complex boolean expressions? Or just plain "OR" and plain "AND"?

@aecio
Member

aecio commented May 23, 2017

What do you mean by complex boolean expressions? Could you give an example?

@julianafreire
Member

E.g., (A AND B) OR (C AND D)

I think we can start with just simple queries A AND B AND C ..., or A OR B OR C...
This would already be very useful.

@1130695
Author

1130695 commented May 23, 2017

Yup, this would do the trick, @aecio! Adding this format means that the one you made a month ago wouldn't be necessary anymore, since this one would take everything into consideration.

Regarding @julianafreire's question, I think @aecio already implemented it, if you check the regex classifier he made a while back. I'm not sure, but I think you can apply it in this scenario as well.

For example url_regex (parameter_1 OR parameter_2) AND body_regex (parameter_1 AND parameter_2 AND parameter_3) OR weka (...)

@1130695
Author

1130695 commented May 23, 2017

#69

@aecio
Member

aecio commented May 23, 2017

Got it. I think nesting combiner classifiers would enable complex boolean expressions and could be easily supported. It would also enable combination of weka models with arbitrary regex-based classifiers. For example:

type: combiner
parameters:
  boolean_operator: OR
  classifiers:
    - type: combiner
      parameters:
        boolean_operator: AND
        classifiers: 
          - type: url_regex
            parameters:
              regular_expressions: ["https?://www\\.somedomain\\.com/forum/.*"]
          - type: weka
            parameters:
              features_file: model_01/pageclassifier.features
              model_file: model_01/pageclassifier.model
    - type: weka
      parameters:
        features_file: model_02/pageclassifier.features
        model_file: model_02/pageclassifier.model
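
One way to see why nesting is enough for complex expressions is the following Python sketch, which recursively builds a classifier from the nested config above, assuming it has already been parsed from YAML into dicts. It is only an illustration under stated assumptions: the build function, the dict-based page, the full-match semantics for url_regex, and the FAKE_MODELS lookup (standing in for real weka models) are all hypothetical.

```python
import re
from typing import Callable, Dict

# Hypothetical page representation: {"url": ..., "body": ...}
Page = dict
Classifier = Callable[[Page], bool]

# Stand-ins for trained models, keyed by model_file. A real "weka" type would
# load the model and features files instead of looking up a function here.
FAKE_MODELS: Dict[str, Classifier] = {
    "model_01/pageclassifier.model": lambda page: "reply" in page["body"],
    "model_02/pageclassifier.model": lambda page: "question" in page["body"],
}


def build(config: dict) -> Classifier:
    """Recursively turn a parsed classifier config into a function."""
    kind, params = config["type"], config["parameters"]
    if kind == "url_regex":
        compiled = [re.compile(p) for p in params["regular_expressions"]]
        # Assumed semantics: the URL must fully match at least one pattern.
        return lambda page: any(rx.fullmatch(page["url"]) for rx in compiled)
    if kind == "weka":
        return FAKE_MODELS[params["model_file"]]
    if kind == "combiner":
        # Recursion is what makes nested combiners work.
        children = [build(c) for c in params["classifiers"]]
        op = {"AND": all, "OR": any}[params["boolean_operator"]]
        return lambda page: op(child(page) for child in children)
    raise ValueError(f"unknown classifier type: {kind}")


# (url_regex AND model_01) OR model_02, mirroring the nested config above.
CONFIG = {
    "type": "combiner",
    "parameters": {
        "boolean_operator": "OR",
        "classifiers": [
            {"type": "combiner", "parameters": {
                "boolean_operator": "AND",
                "classifiers": [
                    {"type": "url_regex", "parameters": {"regular_expressions": [
                        r"https?://www\.somedomain\.com/forum/.*"]}},
                    {"type": "weka", "parameters": {
                        "features_file": "model_01/pageclassifier.features",
                        "model_file": "model_01/pageclassifier.model"}},
                ]}},
            {"type": "weka", "parameters": {
                "features_file": "model_02/pageclassifier.features",
                "model_file": "model_02/pageclassifier.model"}},
        ],
    },
}

classify = build(CONFIG)
```

Because a combiner is itself just another classifier type, an expression such as (A AND B) OR (C AND D) needs no special syntax: each parenthesized group becomes one nested combiner entry.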

@1130695
Author

1130695 commented May 23, 2017

@aecio, the boolean_operator inside "classifiers" means a page can pass url_regex OR weka, right? Could we still have a boolean_operator for the expressions inside url_regex, body_regex, etc., like you did before? Not sure if you had this in mind with that structure, but pointing it out anyway 📦

@aecio
Member

aecio commented May 29, 2017

@1130695 boolean_operator in this case is how you combine the nested classifiers. You could nest ANY classifier like title_regex, url_regex, weka, and even the regex classifier, which accepts another boolean_operator parameter for combining the regular expressions for each field.

aecio added this to the next milestone Feb 1, 2018