
Spot precision option #2

Open · wants to merge 8 commits into master
Conversation

rabarona (Owner)

No description provided.


oni-ml main component uses Spark and Spark SQL to analyze network events and produce a list of least probable events or most suspicious.
spot-ml main component uses Spark and Spark SQL to analyze network events and produce a list of least probable events or most suspicious.


Suggestion: "produce a list of those considered the most unlikely."


As a first step, users need to decide whether to change from 64-bit to 32-bit floating point probabilities; if they do, the document probability distribution lookup table will be half the size and more easily broadcast.
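To make the size argument concrete, here is a minimal Scala sketch (the map contents are invented for illustration, not the project's actual structures): a Float occupies 4 bytes against a Double's 8, so converting the values roughly halves the table's payload.

```scala
// Hypothetical document-probability lookup table at 64-bit precision.
val probabilities64: Map[String, Double] = Map("doc-1" -> 0.0137, "doc-2" -> 0.0021)

// The same table at 32-bit precision: each value drops from 8 bytes to 4,
// so the broadcast payload shrinks by roughly half (modulo per-entry overhead).
val probabilities32: Map[String, Float] =
  probabilities64.map { case (doc, p) => doc -> p.toFloat }
```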

If users want to cut the payload on half, they should set precision option to 32.


Suggestion: "If users want to cut payload memory consumption roughly in half, they should set the precision option to 32."

* limitations under the License.
*/

package org.apache.spot.utilities.transformation


I think we can just have the utilities package; the things in transformation don't really interact with each other much

rabarona (Owner, Author):

Yeah, it was more for classification.

*
*/

sealed trait PrecisionUtility extends Serializable {


perhaps more clear: FloatingPointPrecisionReducer

rabarona (Owner, Author):

But then we have the toDoubles() method; it reduces the precision and then increases it back. What about FloatingPointPrecision or FloatingPointPrecisionUtility?
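For readers following along, here is a minimal sketch of the sealed-trait design under discussion, using the FloatingPointPrecisionUtility name the PR later adopted; apart from toDoubles, the member names are assumptions drawn from this thread rather than the PR's exact code:

```scala
sealed trait FloatingPointPrecisionUtility extends Serializable {
  type TargetType
  def toTargetType(value: Double): TargetType
  def toDoubles(values: Seq[TargetType]): Seq[Double]
}

// 64-bit implementation: no reduction in precision is done.
object FloatingPointPrecisionUtility64 extends FloatingPointPrecisionUtility {
  type TargetType = Double
  def toTargetType(value: Double): Double = value
  def toDoubles(values: Seq[Double]): Seq[Double] = values
}

// 32-bit implementation: narrow to Float going in, widen back coming out.
object FloatingPointPrecisionUtility32 extends FloatingPointPrecisionUtility {
  type TargetType = Float
  def toTargetType(value: Double): Float = value.toFloat
  def toDoubles(values: Seq[Float]): Seq[Double] = values.map(_.toDouble)
}
```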

* PrecisionUtility will transform a number from Double to Float if the precision option is set to 32 bit;
* if the default or 64 bit is selected, it will just return the same number as type Double.
*
* This abstract class permits the execution of a single path during the entire analysis. Instead of checking what


no need to provide "why" in the scaladoc
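The "single path" point in that scaladoc can be read as: resolve the precision option once, up front, and pass the chosen implementation through the whole analysis, so nothing downstream branches on 32 vs 64 bit. A hedged sketch reusing the trait from the sketch above (the selector function is hypothetical):

```scala
// Hypothetical selector: map the command-line precision option to one
// implementation; everything after this works against the trait.
def selectPrecisionUtility(precisionOption: String): FloatingPointPrecisionUtility =
  precisionOption match {
    case "32" => FloatingPointPrecisionUtility32
    case _    => FloatingPointPrecisionUtility64 // 64 bit is the default
  }
```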


/**
* Converts a number into the precision type; it can be Float (32) or Double (64).
* For the Double implementation it will just return the same value without any transformation.


strike this; it should go with the implementation of the Double version

def toDoubles[A <% Traversable[TargetType], B <% Traversable[Double]](targetTypeIterable: A): B

/**
* Converts a DataFrame column from Seq[Double] to a Seq[TargetType]. If the TargetType is Double, it will


comment about what it will do in the case of Double should go with the Double implementation


/**
* PrecisionUtility implementation for Double.
* Users who don't want to reduce the workload can continue working with Doubles. This implementation will receive


no need to explain use case beyond "no reduction in precision is done"

@NathanSegerlind left a comment


I really prefer this version of the precision utility that better uses the Scala type system, thanks for doing it :)

approved but for a few comment/naming things... to be honest, I don't like the "transformation" subpackage of utilities, but that is a trivial change

Ricardo Barona added 7 commits May 31, 2017 12:34

… Float and reduce the payload when joining original data and LDA results (document probabilities).

Added parameters for using Doubles (64 bit) or Float (32 bit); added a parameter for Spark's autoBroadcastJoinThreshold, which will change the default value for the job execution at runtime.
Added new properties to the spot.conf file, one for the scaling option and one more for the autoBroadcastJoinThreshold.
Added a unit test for the new functionality.
Slightly changed the package structure for utilities: moved all the transformation utilities to a new package called transformation.
Fixed a few markdown issues, and replaced oni references with spot.
Added documentation for the 2 new options, SCALING_OPTION and SPK_AUTO_BRDCST_JOIN_THR.

…ilities.

Changed the command line option from scaling to precision.
Changed the name of PrecisionUtility to FloatingPointPrecisionUtility.
Minor documentation fixes.
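Regarding the autoBroadcastJoinThreshold commit above: spark.sql.autoBroadcastJoinThreshold is a standard Spark SQL setting, and a job can override it when the session is built. A minimal sketch (the app name and the 100 MB value are placeholders, not the project's defaults):

```scala
import org.apache.spark.sql.SparkSession

// Raise the broadcast-join threshold (in bytes) for this run so the halved
// probability table is more likely to qualify for a broadcast join.
val spark = SparkSession.builder()
  .appName("spot-ml-precision-example") // placeholder name
  .config("spark.sql.autoBroadcastJoinThreshold", 100L * 1024 * 1024)
  .getOrCreate()
```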
rabarona pushed a commit that referenced this pull request Jun 22, 2017