Publicly Verifiable & Private Collaborative ML Model Training #6317

ewynx · 2024-10-22T18:17:26Z

Summary

Verifiable & Private Collaborative Machine Learning (ML) Model Training will allow multiple parties to jointly train machine learning models while maintaining the privacy of their datasets. In addition, a collaborative zero-knowledge proof of the computation is delivered, giving public training verifiability. This approach can be applied in areas such as medical research and financial analysis, enabling collaboration without loss of privacy while maintaining compliance with regulations like GDPR.

This proposal aims to implement logistic regression in Noir and combine it with MPC to support collaborative training. Logistic regression is a widely used algorithm in Machine Learning for classification tasks. By integrating it into Noir, we will establish a fundamental building block for advancing Privacy-Preserving Machine Learning (PPML). Leveraging co-noir, we extend ZK proofs to a multi-party setting, enabling collaborative training and the collaborative generation of a proof.

The key deliverables include a fully optimized and tested logistic regression algorithm in Noir, performance benchmarks, and the open sourcing of all code and documentation to drive future development. As part of the project, we will apply the collaborative training to small datasets like the Iris plants dataset, and assess the performance and quality of the training. Based on these initial results, we will consider other datasets with more samples, features, and complexity.

Methodology

High-level strategy

The project combines three main components as follows:

For ZK proofs, we will write functionality in Noir.
For MPC, we will use co-noir for secure collaborative training of an ML model.
For ML, we will implement logistic regression.

All code deliverables will be written in Noir and can be executed with nargo instead of co-noir as well (which changes the setting from collaborative to individual).

We aim to be in close contact with the Taceo team, which is actively developing co-noir, to ensure compatibility between our implementation and the features that co-noir supports. Given that co-noir is a rapidly evolving project, we see this as an opportunity to build a compelling application that leverages this new tool while simultaneously testing it and providing valuable feedback for its continued development. Furthermore, verifiable training in Noir is useful even without MPC, which makes this tool valuable on its own.

Detailed approach

After initial research, we defined the building blocks needed, reviewed available libraries from the Noir community, and determined suitable training datasets for testing. This led to the following approach (for a complete list of tasks, see Timeline and Deliverables).

For logistic regression, we need fixed point arithmetic and matrix arithmetic, including matrix inversion, and support for matrices of fixed points. The currently available libraries for the necessary arithmetic are outdated or incomplete, which makes them unsuitable for direct use.
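To make the fixed-point requirement concrete, here is a minimal Python sketch of the idea (the deliverable itself would be written in Noir). It stores each number as an integer scaled by a power of two. The 16 fractional bits and all function names are illustrative assumptions, not choices taken from the proposal.

```python
# Fixed-point numbers stored as scaled integers: value ≈ raw / 2**FRAC_BITS.
# FRAC_BITS = 16 is an illustrative assumption, not a value from the proposal.
FRAC_BITS = 16
SCALE = 1 << FRAC_BITS


def to_fixed(x: float) -> int:
    """Encode a float as a scaled integer, rounding to the nearest step."""
    return round(x * SCALE)


def to_float(a: int) -> float:
    """Decode a scaled integer back to a float."""
    return a / SCALE


def fx_add(a: int, b: int) -> int:
    """Addition needs no rescaling because both operands share one scale."""
    return a + b


def fx_mul(a: int, b: int) -> int:
    """The raw product has scale 2**(2*FRAC_BITS); shift once to rescale.

    Python's >> floors toward negative infinity, so the result is truncated.
    """
    return (a * b) >> FRAC_BITS


def fx_matmul(A, B):
    """Matrix product over fixed-point entries (lists of lists of ints).

    Each dot product is accumulated at double scale and rescaled once.
    """
    rows, inner, cols = len(A), len(B), len(B[0])
    return [
        [sum(A[i][k] * B[k][j] for k in range(inner)) >> FRAC_BITS
         for j in range(cols)]
        for i in range(rows)
    ]


product = fx_mul(to_fixed(1.5), to_fixed(-2.25))
print(to_float(product))
```

Inside a circuit, the truncation in the multiplication step has to be constrained rather than simply computed, which is presumably why the range-check support mentioned under Assumptions matters for this library.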

To implement logistic regression training, we will use the arithmetic tools described above to fit the model by maximum likelihood. Maximum likelihood estimation requires maximizing the log-likelihood function, a problem that can be solved numerically with the Newton-Raphson method, which will be implemented in our deliverables.
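As a rough illustration of the Newton-Raphson fit, the Python sketch below trains a binary logistic regression model on a toy dataset. It rests on several assumptions that are not from the proposal: it uses ordinary floats instead of fixed-point values, it solves the linear system for each step by Gaussian elimination rather than forming an explicit matrix inverse, and the function names and toy data are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_logistic_newton(X, y, iterations=10):
    """Fit binary logistic regression weights by Newton-Raphson.

    X: feature rows whose first entry is 1.0 (intercept); y: 0/1 labels.
    """
    d = len(X[0])
    w = [0.0] * d
    for _ in range(iterations):
        p = [sigmoid(sum(wj * xj for wj, xj in zip(w, row))) for row in X]
        # Gradient of the log-likelihood: X^T (y - p)
        grad = [sum((yi - pi) * row[j] for row, yi, pi in zip(X, y, p))
                for j in range(d)]
        # Negative Hessian: X^T S X with S = diag(p_i * (1 - p_i))
        H = [[sum(pi * (1 - pi) * row[j] * row[k] for row, pi in zip(X, p))
              for k in range(d)] for j in range(d)]
        step = solve(H, grad)
        w = [wj + sj for wj, sj in zip(w, step)]
    return w


# Toy, non-separable data: one feature plus an intercept column.
X_toy = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0],
         [1.0, 2.5], [1.0, 3.0], [1.0, 3.5], [1.0, 4.0]]
y_toy = [0, 0, 1, 0, 1, 0, 1, 1]
print(fit_logistic_newton(X_toy, y_toy))
```

The toy labels overlap on purpose: on perfectly separable data the maximum likelihood estimate does not exist and the Newton-Raphson steps grow without bound.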

To ensure the compatibility of our functionality with co-noir, we will continuously test our implementation with it, making sure all functionality is supported in both the ZK and ZK + MPC settings.

As a test project, we will train a logistic regression model for the Iris plants dataset. This is a small dataset of 150 samples and three classes. We consider this a good starting point for evaluating the performance of the implementation with the ZK + MPC overhead.

To validate the accuracy of the obtained model, we will take the parameters of the trained model and measure the accuracy with a traditional fractional number representation used in libraries like numpy in Python. Checking the accuracy in this way gives us more confidence in the computation of the metric, given that the training is done using fixed-point, which has a lower precision compared to the built-in representation of fractional numbers in standard ML libraries.
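The validation step can be pictured with the Python sketch below: the fixed-point weights produced by training are decoded to floats, and accuracy is then measured in ordinary floating-point arithmetic. Plain Python floats stand in for numpy here. The 16-bit scale, the binary (two-class) setting, the function names, and the sample values are all assumptions made for illustration; the proposal does not specify them.

```python
FRAC_BITS = 16  # hypothetical fixed-point scale, not taken from the proposal


def decode(raw_weights, frac_bits=FRAC_BITS):
    """Convert scaled-integer weights back to floats."""
    return [r / (1 << frac_bits) for r in raw_weights]


def predict(weights, row):
    """Binary prediction; weights[0] is the intercept."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], row))
    # sigmoid(z) >= 0.5 exactly when z >= 0, so the sigmoid can be skipped.
    return 1 if z >= 0 else 0


def accuracy(weights, X, y):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(predict(weights, row) == label for row, label in zip(X, y))
    return correct / len(y)


# Made-up example: raw weights encoding intercept -2.0 and slope 1.0.
raw = [-131072, 65536]
X_eval = [[0.5], [1.5], [2.5], [3.5]]
y_eval = [0, 0, 1, 0]
print(accuracy(decode(raw), X_eval, y_eval))
```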

The project's initial phase will conclude with benchmarking, testing, and documentation. Following this, we will focus on optimizations across the code, particularly in fixed-point arithmetic, matrix arithmetic, and logistic regression itself.

Assumptions

For this project, we depend on the support offered by co-noir. Hence, we will use the Rep3 protocol (which refers to the protocol of Araki et al. with the modifications presented in Eerikson et al.) and the protocol based on Shamir secret sharing. Given that co-noir currently supports a security model of honest majority and semi-honest adversaries, we will use the same security model.
For the implementation of fixed-point arithmetic, we need support for range checks in co-noir. According to the Taceo team, this feature will be added soon, within the time frame of this project.
We assume that there are three parties engaged in the protocol and that one of them is corrupt. For the Shamir secret-sharing protocol, we could increase the number of parties (keeping the honest-majority constraint) as we evaluate the performance of our implementation.
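For readers unfamiliar with the three-party setting, the Python sketch below shows replicated secret sharing in the general style that Rep3-type protocols build on: a value is split into three additive shares, and each party holds two of them. This is a simplified illustration under stated assumptions, not co-noir's implementation. The modulus is an arbitrary illustrative prime, and all function names are hypothetical.

```python
import secrets

# Illustrative prime modulus (an assumption); not the field used by co-noir.
P = 2**61 - 1


def share(x: int):
    """Split x into additive shares x1 + x2 + x3 = x (mod P).

    Party i receives the pair (x_i, x_{i+1}), so a single party is always
    missing one share and learns nothing about x on its own.
    """
    x1 = secrets.randbelow(P)
    x2 = secrets.randbelow(P)
    x3 = (x - x1 - x2) % P
    return [(x1, x2), (x2, x3), (x3, x1)]


def reconstruct(shares, i: int = 0) -> int:
    """Recover x from party i and party i+1, who jointly hold all shares."""
    a = shares[i]
    b = shares[(i + 1) % 3]
    return (a[0] + a[1] + b[1]) % P


def add_shared(sx, sy):
    """Addition is local: each party adds its own pairs, no messages needed."""
    return [((a0 + b0) % P, (a1 + b1) % P)
            for (a0, a1), (b0, b1) in zip(sx, sy)]


total = add_shared(share(5), share(7))
print(reconstruct(total))
```

Because any two parties can reconstruct the secret, this layout only protects privacy when at most one of the three parties is corrupt, which matches the honest-majority assumption stated above.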

Timeline and Deliverables

The project is divided into four deliverables:

Deliverable 1: Fixed point arithmetic & matrix arithmetic libraries.
Deliverable 2: Logistic regression functionality, tested with co-noir.
Deliverable 3: Application on the Iris plants dataset, using co-noir. The deliverable will contain a report of the accuracy of the trained model using numpy.
Deliverable 4: Benchmarking & docs.

The expected timeline is as follows:

Week 1: Fixed point library
Week 2: Matrix arithmetic library
Week 3: Testing of libraries in co-noir
Week 4-6: Implement Logistic regression, testing usage with co-noir
Week 7: Create an example project that uses the Iris plants dataset.
Week 8-9: Assess the accuracy of the trained model using numpy in Python & create a report.
Week 10-11: Benchmarks, tests, docs.
Week 11-12: Optimizations for fixed point, matrix arithmetic & logistic regression library.

Team

The team consists of two cryptographic engineers and a project manager, all of whom have previously collaborated on Noir-related work. With their combined experience in Noir, MPC, machine learning, and working with emerging technologies like ZK and MPC, the team is well-equipped to execute this project.

Team members

Teresa Li is a Project Manager at HashCloak with a background in quantitative finance and experience in equity research, investments, and risk management. Amongst her contributions at HashCloak, she has managed the most recent Noir-related work that was delivered (see below), with the same team as proposed for this project. She will oversee this project and be the main point of contact.

Elena Fuentes (@ewynx) is a Cryptographic Researcher & Engineer at HashCloak, and has recently contributed functionality for Noir support in zk-regex, as well as documentation and mobile benchmarking for the Noir bignum library, and docs for noir_base64 and string_search. In 2023, she won 2nd prize for Best Noir App at the ETHGlobal hackathon. Her other contributions at HashCloak include implementing elliptic curve cryptography and SHA-512 in Sway, and working with other zero-knowledge languages and frameworks such as Plonky2, Halo2, and o1js.

Hernan Vanegas (@hdvanegasm) is a Cryptographic Researcher & Engineer at HashCloak and has extensive experience with both MPC and machine learning. He has worked with the MP-SPDZ framework to implement MPC protocols. He also applies MPC techniques to machine learning in research projects such as secure support vector machine training and dynamic programming problems like the edit distance problem. Also, he has experience with reinforcement learning techniques, and he has participated in research projects that combine information security and computer vision. Amongst his recent contributions at HashCloak, he has worked on benchmarking Noir libraries, bignum, rsa and zk-regex, as well as the assessment of threshold ECDSA protocols.

Relevant work

The following is relevant work done at HashCloak that involves Noir:

The development of new functionalities in the zk-regex library: Features/HC improvements for zk-regex Noir support zk-regex#8
The benchmarking of the Bignum and RSA libraries: https://github.com/hashcloak/noir-bigint-bench
The benchmarking and examples for zk-regex library: https://github.com/hashcloak/zk-regex-tests-and-examples/tree/main

Additionally, HashCloak has been actively involved in the MPC field in the following projects:

Threshold ECDSA protocols.
Application of MPC and dynamic programming problems to block building.
Enhancement of privacy in smart contracts using MPC.

Start Date

We would like to propose a start date of November 11th, 2024, which would allow us to begin promptly after the announcement date, if selected.

Savio-Sou · 2024-10-24T14:59:05Z

Maintainer

Hi @ewynx, thank you for submitting the proposal!

Could you elaborate further on:

What are the benefits to training the model with ZK + MPC, versus just ZK (which can also achieve both privacy and verifiability)?
How will this project utilize private shared states, as in how will people benefit from continually interacting and updating some private states?
- Is the model training not a one-off operation?
How is the example application on the Iris plants dataset expected to work?
The potential applications that were listed
Any preliminary estimations on the practicality / feasibility of the application?
- Hopefully a reasonable concern, as ZKML has gained limited traction due to, I assume, performance overheads

2 replies

ewynx Oct 24, 2024
Author

Hi @Savio-Sou! Thanks for your questions, here are the answers:

Training the model with ZK + MPC gives us two additional benefits / features. First, this setting allows the model to be trained on data that is split amongst different parties. They can train the model in collaboration while maintaining data privacy. Second, while ZK generates just the proof, adding MPC also gives the output of the computation. In our case, this output is the weights of the trained model.
For this project, the training is not a continuous task. However, in modern machine learning applications the weights of the models are updated continuously as new data is available. In the current setting, this would mean running the training again. In the future, this project can be enhanced to support continuous update of the weights of the model using new data samples.
As mentioned before, for the current project training the model is a one-off operation, which uses a split dataset. For the example application on the Iris plants dataset we would manually split the data and divide it amongst the participating parties. The idea is that each party wants to keep its part of the dataset private, which we would simulate by splitting up the dataset ourselves. When the application runs, the output consists of both the weights of the trained model and the (collaborative) proof of this computation. The trained model would be able to classify Iris plants into 3 classes based on the input features.
The aim of this project is to provide functionality that performs classification and serves as a building block towards more complicated models. Example applications that use logistic regression are: fraud detection, spam detection, credit scoring, medical diagnosis and medical image analysis. Each application makes a prediction by classifying input data into a specified class. Further, logistic regression can serve as a building block for more complicated ML models like neural networks.
We have some data from training a model on a breast cancer dataset with and without MPC. This dataset contains 569 samples with 30 features and classifies into 2 classes:

Training using scikit-learn (without MPC):
Running time: 0.463 seconds
Accuracy of the model = 0.951

Secure training using MP-SPDZ (with MPC):
Running time: 567.009 seconds
Data sent = 3272.96 MB in ~921779 rounds (just one party)
Global data sent = 9758.52 MB (all parties)
Accuracy of the model = 0.944

Of course, this doesn't contain the ZK element, which is what this project aims to investigate. It serves as a foundation to further improve and enhance performance.
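For the Iris example described earlier in this reply, the manual split could look roughly like the Python sketch below. It assumes a horizontal split, where each party receives a disjoint subset of the sample rows. The reply does not say how the data would be divided, so this is only one possibility, and the function name and seed are hypothetical.

```python
import random


def split_among_parties(samples, num_parties=3, seed=0):
    """Shuffle the sample rows and deal them round-robin to the parties."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = samples[:]       # copy, leaving the caller's list untouched
    rng.shuffle(shuffled)
    return [shuffled[i::num_parties] for i in range(num_parties)]


# Stand-in for the 150 Iris rows: row indices instead of real feature data.
parts = split_among_parties(list(range(150)))
print([len(part) for part in parts])
```

Each party would then feed only its own subset into the collaborative computation as a private input.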

Savio-Sou Oct 24, 2024
Maintainer

Thanks for the clarifications, they are very helpful.

Savio-Sou · 2024-10-24T23:27:02Z

Maintainer

Week 1: Fixed point library

Week 2: Matrix arithmetic library

They are likely outdated at this point, but you might find the existing libraries on Awesome Noir helpful for shaving some development time off: https://github.com/noir-lang/awesome-noir/?tab=readme-ov-file#libraries

1 reply

ewynx Oct 24, 2024
Author

Thanks, yes, we looked at them and will revamp/add according to our needs 👍

Savio-Sou · 2024-10-25T13:23:49Z

Maintainer

Deadline to update proposals is now extended to October 28th, 2024 if it helps buy a bit of extra time to supplement details. Good luck!
https://github.com/orgs/noir-lang/discussions/6289#discussioncomment-11052718

0 replies

Mikerah · 2024-10-25T15:13:22Z

@Savio-Sou Thank you for the deadline extension on updating our proposal. Is there anything specific that you want us to add to the proposal itself that would clarify things beyond the questions you've already asked?

1 reply

Savio-Sou Oct 28, 2024
Maintainer

Not necessarily. The deadline was extended across the board, simply dropped the comment as an FYI in case it comes in handy 🙌
