Open Grant Proposal: NGPU -- AI DePIN

Project Name: NGPU

Proposal Category: Integrations

Individual or Entity Name: Metadata Labs Inc.

Proposer: Alain Garner

Project Repo(s): https://github.com/NGPU-Community/ngpu-cli, https://github.com/NGPU-Community/ngpu-business-backend, https://github.com/NGPU-Community/ngpu-contract, https://github.com/NGPU-Community/ngpu-node-client, among others.

(Optional) Filecoin ecosystem affiliations: None

(Optional) Technical Sponsor: None

Do you agree to open source all work you do on behalf of this RFP under the MIT/Apache-2 dual-license?: Yes

Project Summary

NGPU stands for Next-GPU. With the rise of the AI wave, a multitude of applications has emerged. However, the cost of GPU-powered AI inference has skyrocketed, and these resources are largely monopolized by big companies, significantly hindering the equitable development of applications, especially for creative startup teams. Meanwhile, many GPU resources sit idle.

NGPU, as a decentralized GPU computing network, is dedicated to providing accessible GPU nodes without any entry barriers, offering cost-effective, user-friendly, and stable GPU computing resources to various AI applications. This enables enterprise-level AI inference services while also offering idle GPU nodes an opportunity to generate income, making full use of every resource.

Impact

AI inference relies on large-scale models and massive datasets, with files often reaching tens or even hundreds of gigabytes in size. Ensuring the stable and reliable storage and management of these vast datasets has become one of the major challenges in the AI DePIN industry. Traditional Web2 storage solutions often face issues such as data tampering, high costs, and access delays when handling such large-scale data. These problems not only affect the efficiency of AI inference but can also lead to data loss or corruption, posing significant risks to the stability and reliability of the entire system.

To address this challenge, we are exploring more reasonable storage solutions to support the application of large models and big data in AI inference. After thorough research, the unique technologies and services of IPFS and Filecoin have captured our attention.

We will store all data from the machine-learning lifecycle on IPFS to ensure an audit trail, and back that data up to Filecoin to make it more permanent. The datasets tagged to the machine-learning lifecycle, including their features, will be stored as well, so that models deployed in production can be audited for fairness further down the road.
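As a minimal sketch of this flow (assuming a local IPFS daemon and the py-ipfs-http-client package; this is not NGPU's actual code), the example below pins a model artifact and its dataset metadata to IPFS and records the returned CIDs as an audit trail. The Filecoin backup itself is a separate, provider-specific deal-making step and is not shown:

```python
# Sketch: pin a model artifact and its dataset metadata to IPFS so that
# every lifecycle step leaves an auditable, content-addressed CID.
import json
import ipfshttpclient

client = ipfshttpclient.connect("/ip4/127.0.0.1/tcp/5001/http")

# Add the model weights file; IPFS returns a content-addressed CID,
# so any later tampering would change the hash and be detectable.
model = client.add("model_weights.bin")
model_cid = model["Hash"]

# Store dataset metadata (features, version, lineage) alongside the model.
meta_cid = client.add_json({
    "model_cid": model_cid,
    "dataset": "train-v3",                     # illustrative name
    "features": ["age", "income", "region"],   # illustrative feature list
})

# Appending each CID to a manifest yields the audit trail; backing these
# CIDs up to Filecoin via a storage deal would happen in a separate step.
with open("audit_log.jsonl", "a") as log:
    log.write(json.dumps({"model": model_cid, "metadata": meta_cid}) + "\n")
```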

Outcomes

With the rise of the AI wave, applications are emerging one after another, yet GPU-centered AI inference compute is expensive and monopolized by large companies, greatly hindering the equal development of applications, especially for creative start-up teams. At the same time, many GPU resources remain idle. NGPU, as a Web3 decentralized GPU computing network, is dedicated to providing accessible GPU nodes without entry barriers, offering cost-effective, user-friendly, and stable GPU computing resources for AI applications, thereby enabling enterprise-level AI inference services. It also gives idle GPU nodes an opportunity to earn income, making full use of every resource.

NGPU's main functions include:

  • Computing power providers can register, install, and uninstall nodes, and query node information and incentives.
  • Computing consumers can browse the product list provided by the platform, upload their own images to generate Gen-Instances, create workspaces based on their own and platform images, dispatch them, check the issuance status, and prepare for subsequent AI service requests.
  • AI applications can call the AI service APIs provided by Gen-Instances, check their status, and receive results.

The following three parts correspond to the functions above.

  • Node part: If you want to provide GPU for NGPU, read this note
  • Cloud part: If you want to use the GPU computing power in NGPU, read this note
  • Caller part: If you want to call AI on NGPU, read this note
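To make the caller part concrete, here is a hypothetical sketch of the request-and-poll flow described above. The gateway URL, paths, and JSON fields are invented for illustration; the real endpoints are documented in the caller note:

```python
# Hypothetical illustration only: submitting an AI task to a Gen-Instance
# and polling for the result. All endpoint names and fields are invented.
import time
import requests

BASE = "https://api.example-ngpu-gateway.io"  # placeholder gateway URL

# Submit an inference task to a Gen-Instance.
task = requests.post(f"{BASE}/v1/tasks", json={
    "instance": "stable-diffusion-xl",        # illustrative instance name
    "input": {"prompt": "a watercolor fox"},
}).json()

# Poll the task status until a result is ready, matching the flow above
# ("request AI service APIs ... check their status, receiving results").
while True:
    status = requests.get(f"{BASE}/v1/tasks/{task['id']}").json()
    if status["state"] in ("succeeded", "failed"):
        break
    time.sleep(2)

print(status.get("result"))
```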

Compared with other DePIN projects that simply rent out individual GPU computing nodes, NGPU's main innovations include intelligent allocation, pay-per-task GPU usage billing, and efficient HARQ network transmission. These technologies enable real-time perception of AI application load, dynamic adjustment of resource usage, and auto-scaling, security, high reliability, and low latency on permissionlessly accessible GPU nodes.

During the development of NGPU, we encountered the following major issues and developed our own technologies to address them.

  1. Instability of Decentralized Computing Nodes: Compared to the reliable service quality of high-grade IDCs built by big companies, GPU nodes with permissionless access might be personal gaming computers, which are highly likely to go offline randomly. To address this, NGPU developed the Smart Allocation framework, which on one hand, monitors the status of each GPU node in real-time and configures redundant nodes besides the working nodes to switch when a working node goes offline; on the other hand, it designed an incentive and staking mechanism to encourage stable online presence.

  2. Various Specifications of Computing Node Networks and Hardware: Facing various specifications of GPU computing nodes, NGPU measures the AI inference capability of the nodes and combines it with the measurement of storage and network bandwidth to form a unified computing power value. This standardizes the node computing power, providing a basis for Allocation and incentives. Additionally, NGPU utilizes HARQ (Hybrid Automatic Repeat reQuest) to optimize the efficiency of public network transmission, achieving over a 100-fold speed improvement under strong network noise, compensating for the network deficiencies of decentralized computing nodes.

  3. Significant Daily Fluctuations in AI Application Loads: Various AI applications, especially in the Web3 field, face load peaks and valleys. Typically, GPU nodes are rented based on peak loads to ensure service stability. However, NGPU calculates costs based on the actual usage of (GPU computing power * duration) through smart allocation, ensuring that every penny spent by the AI application provider goes towards their actual business needs. This not only enhances usage efficiency on the relatively low-cost decentralized GPU power but also significantly reduces GPU computing power costs, achieving fair access to GPU computing power.
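A worked example of the pay-per-task arithmetic in point 3, with made-up unit prices and power values:

```python
# Worked example of pay-per-task billing: cost follows the actual
# (GPU computing power * duration) consumed, not a peak-load rental.
# The unit price and power values below are illustrative numbers only.
def task_cost(power_units: float, duration_s: float, price_per_unit_hour: float) -> float:
    """Cost = normalized GPU power value * hours of use * unit price."""
    return power_units * (duration_s / 3600.0) * price_per_unit_hour

# An app that runs 1,200 one-second tasks on a 2.5-unit node pays only for
# those 1,200 seconds of 2.5-unit usage, rather than renting the node 24/7.
per_task = task_cost(power_units=2.5, duration_s=1.0, price_per_unit_hour=0.60)
print(f"per task: ${per_task:.6f}, per 1,200 tasks: ${1200 * per_task:.2f}")
```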

Adoption, Reach, and Growth Strategies

The most valued customers of NGPU are:
1. Individual developers and small B-end development teams who need AI computing power
2. Individuals and organizations who own computing power

As AI continues to develop, the demand for compute power in AI inference will increase significantly. However, most individual developers and small B-end development teams face the challenges of being unable to accurately estimate compute consumption and having to develop numerous non-core business modules when creating AI-based products. Building a decentralized elastic computing power network and offering a wide range of open-source model functionalities along with various SDKs can effectively address these pain points.

We will build a decentralized GPU network with pooled computing power to provide developers and projects requiring AI computing resources with low-cost and easily accessible power. At the same time, we will integrate Filecoin technology to achieve true decentralization for AI.

We have already built a decentralized GPU network and now need to deploy it on Filecoin for incentive system storage and distribution. In terms of go-to-market strategy, we will build a business development team to reach out to those in need of computing power. Additionally, we are collaborating with major mining hardware manufacturers and computing power providers to secure a substantial amount of GPU resources.
Profit is generated by charging individual developers and small B-end development teams for compute usage and SDK fees.

On the other hand, NGPU's competitors include decentralized computing cloud projects such as IO.net and Akash. The key difference between NGPU and these platforms is that IO.net and Akash require users to rent entire computing resources, while NGPU provides computing power at the AI task level. This helps address the "compute anxiety" faced by users who struggle to accurately estimate their compute needs. Additionally, NGPU offers various open-source AI agent interfaces to compute power users, transforming the decentralized platform from a traditional compute rental service into a computing power service provider.

Development Roadmap

Here are the three stages for integrating NGPU with Filecoin.

  • Stage 1: Development (30 days, $10K)
    The NGPU computing network needs to migrate data originally stored through nodes to Filecoin and IPFS storage, which requires modifications across the infrastructure, client, and AI containers. This migration will enhance the network's data accessibility, security, and scalability, leveraging IPFS's decentralized storage capabilities. The following outlines the detailed plan and necessary modifications to achieve this migration, requiring a team of four people working for one month.
  1. Infrastructure Changes
    The infrastructure must be adapted to integrate IPFS as the primary storage backend. This involves setting up IPFS nodes to ensure reliable and efficient storage, retrieval, and management of data. The team will configure IPFS gateways and APIs to facilitate seamless data interaction between the NGPU network and IPFS. This step includes ensuring that IPFS nodes are optimized for performance, have sufficient redundancy, and are correctly scaled to handle the anticipated data volume. The team will also need to establish monitoring tools to oversee IPFS storage health, data integrity, and performance.

  2. Client Modifications
    The client software must be modified to interact with IPFS instead of traditional node-based storage. This includes updating the data upload, retrieval, and management functions to work with IPFS protocols such as content addressing and peer-to-peer data exchange. The client will need enhancements to handle IPFS hashes (CIDs), ensuring that data references remain consistent across the system (a minimal sketch follows this list). Additionally, security measures will be updated to ensure that data permissions and access controls are maintained within the IPFS environment. These changes will require rigorous testing to confirm that data handling remains efficient and error-free.

  3. AI Container Adaptations
    AI containers that perform training and inference tasks must be updated to fetch and store data directly from IPFS. This involves modifying the data ingestion pipelines within the AI workflows to interact with IPFS nodes seamlessly. The containers will need to be reconfigured to ensure compatibility with IPFS’s content-addressable storage model, allowing for efficient and secure data retrieval during AI computations. The team will also optimize data access patterns to reduce latency and improve overall system performance, especially during high-load scenarios where rapid access to large datasets is critical.

  4. Team Allocation and Timeline
    To complete these modifications, the team of four—comprising infrastructure engineers, software developers, and AI specialists—will work in parallel across the different components. Each team member will be responsible for a specific area: one will focus on infrastructure integration, one on client software changes, one on AI container adjustments, and one on testing and quality assurance. The timeline of one month will be structured into phases, including initial planning, development, integration, testing, and deployment. Regular syncs will ensure that all components are compatible and that the migration to IPFS is smooth and successful.
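A minimal sketch of the client-side CID handling described in step 2, assuming a local IPFS daemon and the py-ipfs-http-client package (not NGPU's actual client code):

```python
# Sketch: uploads go through IPFS content addressing, and reads verify
# that the retrieved bytes still match the recorded CID.
import ipfshttpclient

client = ipfshttpclient.connect("/ip4/127.0.0.1/tcp/5001/http")

def upload(data: bytes) -> str:
    """Store data on IPFS and return its CID as the canonical reference."""
    return client.add_bytes(data)

def fetch_verified(cid: str) -> bytes:
    """Fetch by CID; IPFS re-derives the hash on read, so a successful
    retrieval implies the content is intact and untampered."""
    return client.cat(cid)

cid = upload(b"example model shard")
assert fetch_verified(cid) == b"example model shard"
print("round-trip ok:", cid)
```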

  • Stage 2: Begin allocating AI models and data to Filecoin storage (next 30 days, $10K)
    After development, a strategic migration plan will be implemented on the current software version. This plan involves utilizing the existing node storage system while simultaneously integrating Filecoin to create multiple redundant backups. The primary goal is to validate Filecoin’s network transmission efficiency and storage reliability, ultimately achieving a scenario where 33% of the network's data is securely stored on Filecoin.

    Migration Strategy and Execution

  1. Parallel Data Migration: Data will be transferred in parallel from the original node-based storage to Filecoin. This dual-storage approach ensures data redundancy and provides a seamless fallback to the existing system if any issues arise during the transition phase (a sketch of this dual-write approach follows this list).

  2. Multi-backup Strategy: To test Filecoin’s reliability, data will be stored redundantly across multiple nodes within the Filecoin network. This setup will allow for real-world testing of Filecoin’s storage capabilities, including data retrieval speeds, fault tolerance, and the integrity of stored data.

  3. Performance and Reliability Testing: Throughout the migration process, continuous monitoring and performance testing will be conducted to assess Filecoin’s transmission speeds and storage consistency. This evaluation will include stress tests under various network conditions to ensure Filecoin meets the NGPU computing network's performance requirements.

  4. Progressive Data Allocation: The migration will be carefully managed to incrementally increase the amount of data stored on Filecoin, with the target of 33%. This phased approach allows for continuous validation and adjustment, ensuring that any challenges are addressed without compromising data accessibility.
    By implementing this strategy, the NGPU computing network aims to leverage Filecoin’s decentralized storage advantages while ensuring data reliability and performance, laying the groundwork for broader future adoption.
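The dual-write idea in steps 1 and 4 can be sketched as follows; the helper names and the probabilistic ramp-up are illustrative assumptions, not NGPU code:

```python
# Sketch: every object is written to the existing node storage, and a
# configurable fraction is also written to Filecoin/IPFS, ramping toward
# the 33% target while the original path remains the fallback.
import random

FILECOIN_SHARE = 0.33  # progressive allocation target from step 4

def store(key: str, data: bytes, node_store: dict, filecoin_store: dict) -> None:
    node_store[key] = data                    # existing path stays authoritative
    if random.random() < FILECOIN_SHARE:      # probabilistic ramp; a real system
        filecoin_store[key] = data            # would track exact quotas instead

node, fil = {}, {}
for i in range(10_000):
    store(f"blob-{i}", b"\x00", node, fil)
print(f"share on Filecoin: {len(fil) / len(node):.1%}")  # approaches 33%
```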

  • Stage 3: Optimization for performance, cost, and reliability (next 30 days, $5K)
    After validating Filecoin’s storage and transmission capabilities, the NGPU network will enter a phase of optimizing its storage strategy based on various factors such as node performance, cost, and reliability. This phase aims to establish a balanced and cost-effective storage model that leverages the strengths of both Filecoin and traditional node storage, moving towards routine utilization of Filecoin combined with existing node storage.

    Storage Strategy Optimization

  1. Performance Assessment: Nodes will be continuously monitored and evaluated based on their performance metrics, including data access speeds, latency, and storage efficiency. High-performance nodes will be prioritized for critical data that requires frequent access, ensuring optimal compute and retrieval times.

  2. Cost Optimization: Storage costs across different nodes and Filecoin will be analyzed to identify the most cost-effective configuration. By comparing the price-per-gigabyte and associated transaction costs, the system can dynamically allocate data to the lowest-cost options while maintaining the required performance standards.

  3. Reliability Tuning: The reliability of each node and Filecoin’s network will be assessed by tracking historical uptime, error rates, and successful data retrievals. Data redundancy will be strategically applied to less reliable nodes, whereas more reliable nodes will handle critical or sensitive information, reducing the overall risk of data loss.

  4. Dynamic Storage Allocation: The storage system will be adjusted dynamically based on real-time performance and cost analytics. The system will implement automated policies that distribute data between Filecoin and node storage based on current network conditions, ensuring an optimal balance between speed, cost, and reliability (a sketch of such a policy follows below).

  5. Routine Use and Scaling: As the optimized strategy proves effective, the NGPU network will integrate Filecoin into its routine operations, scaling up its usage as part of a hybrid storage model. This approach will ensure a scalable and flexible storage environment capable of adapting to changing needs and workloads.

By refining the storage strategy, the NGPU network will ultimately achieve a more efficient, reliable, and cost-effective system, fully utilizing the benefits of Filecoin alongside traditional node storage.
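The dynamic allocation policy in step 4 might look like the following sketch, where the weights and metrics are illustrative assumptions rather than NGPU's actual tuning:

```python
# Sketch: score each storage backend on measured latency, cost, and
# reliability, then place new data on the highest-scoring one.
from dataclasses import dataclass

@dataclass
class Backend:
    name: str
    latency_ms: float    # observed retrieval latency
    cost_per_gb: float   # price per GB stored
    uptime: float        # historical success rate, 0..1

def score(b: Backend, w_lat: float = 0.4, w_cost: float = 0.3, w_rel: float = 0.3) -> float:
    # Higher uptime is better; lower latency and cost are better.
    return w_rel * b.uptime - w_lat * (b.latency_ms / 1000) - w_cost * b.cost_per_gb

backends = [
    Backend("node-storage", latency_ms=80, cost_per_gb=0.12, uptime=0.97),
    Backend("filecoin", latency_ms=400, cost_per_gb=0.02, uptime=0.995),
]
target = max(backends, key=score)
print("place new data on:", target.name)
```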

Total Budget Requested

| Milestone # | Description | Deliverables | Completion Date | Funding |
|---|---|---|---|---|
| 1 | Re-architect NGPU's storage on the Filecoin network | New version, tested and ready for deployment | 30 days | $10K |
| 2 | Begin allocating AI models and data to Filecoin storage | More than 33% of data stored on Filecoin | 60 days | $10K |
| 3 | Form closer customization with the Filecoin storage mechanism | Preferred node providers selected for long-term, stable partnerships | 90 days | $5K |

Maintenance and Upgrade Plans

We will maintain and upgrade NGPU along the following ten aspects.

  1. Community Feedback and Collaboration: Actively gather feedback from the user community, incorporate suggestions, and collaborate with stakeholders to refine and improve the network continuously.

  2. Regular System Updates: Ensure continuous updates of system software, GPU drivers, and libraries to maintain compatibility, performance, and security.

  3. Scalability Enhancements: Implement upgrades to support scaling the network efficiently, accommodating more GPUs, nodes, and computing tasks as demand grows.

  4. Performance Optimization: Continuously optimize the network's algorithms and workload distribution strategies to improve computational speed, reduce latency, and maximize GPU utilization.

  5. Fault Tolerance and Redundancy: Enhance fault tolerance mechanisms, including automated failover, load balancing, and data redundancy, to ensure high availability and minimize downtime.

  6. Security Upgrades: Regularly update security protocols, including data encryption, access controls, and monitoring, to protect against cyber threats and unauthorized access.

  7. Resource Management Improvements: Upgrade resource allocation and scheduling systems to better manage GPU workloads, prioritize tasks, and maximize efficiency.

  8. Monitoring and Analytics Tools: Develop advanced monitoring and analytics tools to provide real-time insights into network performance, detect anomalies, and predict maintenance needs.

  9. User Interface and Accessibility: Continuously improve user interfaces and APIs to enhance accessibility, ease of use, and integration capabilities for developers and end-users.

  10. Energy Efficiency Initiatives: Implement energy-saving strategies, including optimizing power usage across the network to reduce operational costs and environmental impact.

Here is our plan.

  • 2024 Q4:
    Active development: storage reconstruction based on Filecoin and IPFS
    Safety and stability
    More instances: LLM chat (text2text)
    Scheduling capacity of 5,000 requests/s, latency within 8 s
    vCluster and fallback
    Robust validation
    DePIN USB key

  • 2025 H1:
    Efficiency and Affordability
    Strategic Partnership with Top AI Developers
    More Multimodal instances
    Scheduling capacity of 20,000 requests/s, latency within 5 s
    Heterogeneous GPU support

  • 2025 H2:
    TGE
    Scheduling capacity of 100M requests/s, latency within 3 s
    Zone-based operation
    Integration with more GPU networks

Team

Team Members

  • Alain Garner: Co-founder @symphony, Offchain.
  • Dr. Gene Wu: Sr. Software Engineer @ Google, AI research & development
  • Ivan Mu: Sr. Software Engineer @ Google, ML infra & scheduling

Team Member LinkedIn Profiles

Alain Garner: https://www.linkedin.com/in/alaingarner/

Team Website

Website is https://ngpu.ai/.

Relevant Experience

The NGPU team members possess in-depth knowledge of GPU clusters and comprehensive project experience. Over the past six years, Alain has been involved in multiple Web3 startups, successfully leading projects such as Comtech. Meanwhile, Gene and Ivan, as core members of the Google TPU team, have successfully built large-scale AI training and inference compute clusters. These are essential elements for the success of an AI DePIN project and will undoubtedly drive NGPU toward its goals.

Team code repositories

Here are the repositories of some competitors' projects:

  1. IO.net, https://io.net/, https://github.com/ionet-official
  2. Tao bittensor, https://bittensor.org/, https://github.com/opentensor
  3. Akash, https://akash.network/, https://github.com/akash-network

Additional Information

We learned about the Open Grants Program through our social network.
Our contact email is [email protected].
Our Twitter is @ngpu_ai, where we post project updates.
