Publications

AcclMT: A Highly Resource-Efficient and Flexible Poseidon Hash-Based Merkle Tree Architecture

Published in Proceedings of the 62nd ACM/IEEE Design Automation Conference, 2025

Abstract: Merkle Tree is a fundamental cryptographic primitive in Zero-Knowledge Proof (ZKP) protocols, sharing significant computational workloads with the Number Theoretic Transform (NTT) in zk-STARK schemes. Merkle Tree is a tree structure where nodes are primarily generated through hash computations. Among them, Poseidon Hash, as a ZK-friendly hash function, has emerged as one of the most widely adopted choices. Therefore, hardware acceleration of building Merkle Tree based on Poseidon Hash can significantly enhance the performance of ZKP protocols. We propose AcclMT, a highly resource-efficient and flexible Poseidon Hash-based Merkle Tree architecture. Our design employs hardware-software co-design and optimizes the hashing data flow, resulting in an area-efficient Poseidon Hash engine that improves modular multiplication resource utilization. Furthermore, AcclMT uses these engines alongside hierarchical on-chip cache and optimized task scheduling for building large Merkle Trees. It also supports flexible parameter configurations for various requirements. Experimental results show that our proposed Poseidon Hash engine achieves a 14.3 × speedup compared to the latest FPGA-based work. By improving resource utilization, it also reduces area usage by 14.8% compared to unoptimized design. AcclMT achieves up to 1665 × speedup over software implementations in building Merkle tree, with average utilization of 95.9% and 99.2% for the two hash engines.

Recommended citation: Changxu Liu, Hao Zhou, Lan Yang, Yifei Feng, Zheng Wu, Zhuoyuan Yang, Yinlong Li, Shiyong Wu and Fan Yang. "AcclMT: A Highly Resource-Efficient and Flexible Poseidon Hash-Based Merkle Tree Architecture." In Proceedings of the 62nd ACM/IEEE Design Automation Conference, pp. 1-6. 2025. Waiting Process...

Myosotis: An Efficiently Pipelined and Parameterized Multi-Scalar Multiplication Architecture via Data Sharing

Published in IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, 2024

Abstract: Zero-knowledge proof (ZKP) is a widely used privacy-preserving technology, where Multi-Scalar Multiplication (MSM) accounts for over 70% of the computational workload. The acceleration of MSM can enhance the overall performance of ZKP, making it a focal point of community attention. However, in practical applications involving the deployment of multiple MSM accelerators, existing designs often overlook strategies for optimizing bandwidth and area efficiency. To address this, we propose Myosotis, an efficiently pipelined and parameterized Multi-Scalar Multiplication architecture. By sharing input data and allocating cache effectively, it mitigates average transmission bandwidth in runtime. Myosotis also supports the use of multiple Point Addition (PADD) units to achieve performance gains, balancing area overhead and latency for improved area efficiency. Different parameter selection enables a trade-off between the performance, area, and bandwidth of the MSM accelerator. When benchmarking with MSM degrees between 2^18 and 2^26, our proposed baseline design achieves up to 3.32× and 6.72× speedups over state-of-the-art FPGA and ASIC designs. Compared to the baseline, Myosotis with two window MSMs and one PADD unit reduces bandwidth demand by 43% while maintaining similar area and latency. On the other hand, Myosotis with three window MSMs and two PADD units decreases latency by 43% and bandwidth by 17%, with only a 9% area increase.

Recommended citation: Liu, Changxu, Hao Zhou, Lan Yang, Zheng Wu, Patrick Dai, Yinlong Li, Shiyong Wu, and Fan Yang. "Myosotis: An Efficiently Pipelined and Parameterized Multi-Scalar Multiplication Architecture via Data Sharing." IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2024). https://ieeexplore.ieee.org/abstract/document/10818748/

ReZK: A Highly Reconfigurable Accelerator for Zero-Knowledge Proof

Published in IEEE Transactions on Circuits and Systems I: Regular Papers, 2024

Abstract: Zero-knowledge proof (ZKP) plays a significant role in privacy protection technology. However, the proof generation phase requires considerable time and hardware resources. In this phase, Number Theoretic Transform or Inverse Number Theoretic Transform (NTT/INTT) in polynomial computation, as well as Multiple Scalar Multiplication (MSM), are bottlenecks that dominate the execution time. In this paper, we propose a highly reconfigurable accelerator ReZK to accelerate ZKP proof generation phase, focusing on NTT/INTT and MSM. According to the configurations, ReZK can be configured as NTT, INTT, and MSM with variable sizes and bit-widths by adjusting the data path between on-chip memories and arithmetic cores. As the basic unit of arithmetic cores, the reconfigurable processing element (PE) in ReZK is composed of pipelined modular multipliers and modular adders that support variable bit-widths. It can perform butterfly or arithmetic operations. Based on the reconfigurable PEs, the ReZK core can implement NTT/INTT with different sizes and bit-widths, or a fully pipelined point adder (PADD). Additionally, we propose a modularized MSM scheduling architecture to support various bit-widths. The on-chip memories are also well organized for reuse. In NTT/INTT mode, 4-way 256-bit or 2-way 384-bit NTT/INTT can be computed in parallel. In MSM mode, for different elliptic curves, ReZK is capable of processing 4-way 256-bit or 2-way 384-bit MSM in parallel.

Recommended citation: Zhou, Hao, Changxu Liu, Lan Yang, Li Shang, and Fan Yang. "ReZK: A Highly Reconfigurable Accelerator for Zero-Knowledge Proof." IEEE Transactions on Circuits and Systems I: Regular Papers (2024). https://ieeexplore.ieee.org/document/10714365

PriorMSM: An Efficient Acceleration Architecture for Multi-Scalar Multiplication

Published in ACM Trans. Des. Autom. Electron. Syst., Vol. 29, No. 5, Article 77, 2024

Abstract: Multi-scalar multiplication (MSM), crucial for zero-knowledge proof (ZKP), is computationally intensive, and this paper introduces PriorMSM, an efficient acceleration architecture that utilizes a Priority-Based Scheduling Mechanism (PBSM) and a parallel bucket aggregation algorithm to improve performance. Evaluations show that PriorMSM achieves up to 10.9× speedup over previous hardware implementations and 3.9× over GPU implementations.

Recommended citation: Liu, Changxu, Hao Zhou, Patrick Dai, Li Shang, and Fan Yang. "Priormsm: An efficient acceleration architecture for multi-scalar multiplication." ACM Transactions on Design Automation of Electronic Systems 29, no. 5 (2024): 1-26. https://dl.acm.org/doi/10.1145/3678006

Gypsophila: A Scalable and Bandwidth-Optimized Multi-Scalar Multiplication Architecture

Published in Proceedings of the 61st ACM/IEEE Design Automation Conference, 2024

Abstract: Multi-Scalar Multiplication (MSM) is a fundamental cryptographic primitive, which plays a crucial role in Zero-knowledge proof systems. In this paper, we optimize the single MSM Process Element (PE) utilizing buckets with fewer conflicts, enhanced by Greedy-based scheduling, to achieve higher efficiency. The evaluation results show our optimized single MSM PE achieving a speedup of over two times on average, peaking at 3.63 times compared to previous works. Furthermore, we introduce Gypsophila, a scalable and bandwidth-optimized architecture for implementing multiple MSM PEs. Leveraging the characteristics of the bucket method, we optimize the data flow by balancing the throughput of bucket classification, bucket aggregation, and result aggregation in MSM. Simultaneously, multiple PEs with different data access patterns share a universal point input channel and post-processing unit, which improves the module utilization and mitigates the bandwidth pressure. Gypsophila with 16 PEs, accomplishes 16 MSM tasks in a mere 1.01% additional time, showcasing an approximate 7.8% reduction in area, with only about 1/16 of the bandwidth requirement, compared with 16 PEs without input channel and post-process unit sharing.

Recommended citation: Liu, Changxu, Hao Zhou, Lan Yang, Jiamin Xu, Patrick Dai, and Fan Yang. "Gypsophila: A scalable and bandwidth-optimized multi-scalar multiplication architecture." In Proceedings of the 61st ACM/IEEE Design Automation Conference, pp. 1-6. 2024. https://dl.acm.org/doi/10.1145/3649329.3658259

HMNTT: A Highly Efficient MDC-NTT Architecture for Privacy-preserving Applications

Published in The 34th edition of Great Lakes Symposium on VLSI (GLSVLSI), 2024

Abstract: In privacy-preserving applications like Post-Quantum Cryptography (PQC) and Fully Homomorphic Encryption (FHE), polynomial multiplication is common, and the Number Theoretic Transform (NTT) is a key algorithm for reducing its complexity. In this paper, we present HMNTT, a highly efficient MDC-NTT architecture. Utilizing the four-step NTT algorithm and a pipelined transpose module, HMNTT offers a highly efficient and scalable architecture for handling NTT with large degrees. We optimize the processing element (PE) to alleviate backpressure and data conflicts in data flow. Leveraging FPGA characteristics, we construct a modular multiplication module to reduce resource usage and improve operating frequency. Evaluation results indicate that HMNTT achieves an average of 2.34× and 1.26× reduction in Area-Time Product compared to the latest pipelined NTT architectures.

Recommended citation: Liu, Changxu, Danqing Tang, Jie Song, Hao Zhou, Shoumeng Yan, and Fan Yang. "HMNTT: A Highly Efficient MDC-NTT Architecture for Privacy-preserving Applications." In Proceedings of the Great Lakes Symposium on VLSI 2024, pp. 7-12. 2024. https://dl.acm.org/doi/abs/10.1145/3649476.3658734

A Fully Pipelined Reconfigurable Montgomery Modular Multiplier Supporting Variable Bit-Widths

Published in IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, 2024

Abstract: Recently, there has been increased emphasis on privacy-preserving computation technologies, such as homomorphic encryption (HE) and zero-knowledge proof (ZKP). Modular multiplication is a critical component for both HE and ZKP. Variable bit-width is a must for many applications of privacy-preserving computation, due to variable bit-width requirements for different cryptography schemes. However, the majority of modular multipliers that support variable bit-width configurations exhibit relatively low throughput. This work presents a fully pipelined Montgomery modular multiplier with variable bit-width support. Truncated multipliers are introduced to reduce the resources of modular multipliers in our approach. In order to meet different bit-width requirements, the proposed modular multiplier can be dynamically reconfigured. The proposed design can support widely used bit-width configurations, specifically, 384-bit, 256-bit, and 128-bit. 256-bit and 128-bit modes support parallel computation of 2 and 6 sets of operands, respectively. Compared with existing variable bit-width modular multipliers, the proposed reconfigurable modular multiplier significantly improves the throughputs with even lower resources.

Recommended citation: Zhou, Hao, Changxu Liu, Lan Yang, Li Shang, and Fan Yang. "A Fully Pipelined Reconfigurable Montgomery Modular Multiplier Supporting Variable Bit-Widths." IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2024). https://ieeexplore.ieee.org/abstract/document/10551385