Moon Gu Choi, Jae Hyun In, Hanchan Song, Gwangmin Kim, Hakseung Rhee, Woojoon Park and Kyung Min Kim*
Department of Materials Science and Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, 34141, Republic of Korea. E-mail: km.kim@kaist.ac.kr
First published on 15th October 2024
A memristive crossbar array can execute Boolean logic operations directly within the memory, which is highly noteworthy as it addresses the data bottleneck issue in traditional von Neumann computing. Although its potential has been widely demonstrated, achieving practical levels of operational reliability and computational efficiency remains a challenge. Here, we introduce a three-input majority logic gate supported by near-memory operations, serving as a universal gate and achieving both robust reliability and high efficiency in versatile logic operations. We fabricated a highly reliable HfOx-based memristive array, incorporating a series resistor to increase the reset voltage of the memristor, thereby increasing the operational voltage margin of the gate operation. This ensured reliable operation of the majority gate, resulting in successful experimental proof of combined 1-bit full adder and subtractor operations performed in 5 steps using 7 cells. Additionally, we propose that an N-bit parallel prefix adder (PPA) operation is possible in O(log2N) steps, by taking advantage of the parallel operation capability of the majority gate. This achieves 8.5× higher spatiotemporal efficiency than the previously reported NOR-based logic system in 64-bit adder operation. Moreover, as N increases, the spatiotemporal efficiency further improves, which significantly enhances the applicability of memristive logic-in-memory.
New concepts
The practical implementation of memristive logic-in-memory (LIM) technology in a crossbar array (CBA) is still challenging since there is no optimal solution that can achieve the highest computational efficiency while overcoming the memristor's intrinsic switching voltage variations. In this study, we present a comprehensive understanding of the conditions under which a logic gate can be both reliable and efficient at the same time, which has not been studied before. We fabricated a reset voltage-modulated memristor to ensure sufficient voltage margin, achieving the reliability of logic operation. Then, we designed a 3-input majority (MAJ) logic with a near-memory approach. The spatiotemporal efficiency was increased 4.5 times in a 1-bit full adder and 8.5 times in a 64-bit parallel prefix adder compared to the 2-input NOR logic system. The results highlight the capability of parallel operation of our MAJ logic and the strength of effective utilization of near-memory circuits. This study proposes a completely new approach and overcomes the limitations of conventional LIM techniques. It is expected that our approach will generate further new ideas and studies in the future.
Since the first stateful logic via material implication was proposed in 2010 by Borghetti et al.,11 many studies have demonstrated various methods to implement various logic gates, highlighting the importance of LIM in the CBA and its widespread interest. In the early stages of studies, the focus was on finding ‘effective’ new gates leading to improved computational efficiency. Notably, Sun et al. reached a pinnacle by proposing the Carry gate (utilizing three input cells and one output cell) and the Sum gate (utilizing four input cells and one output cell).15 However, these seemingly effective gates assumed constant switching voltages, i.e., ideal operation of memristors, leading to consensus on their limitations in practical applications, where the switching voltage variations are inevitable. As a result, subsequent research on memristive LIM has primarily focused on ensuring operational reliability while satisfying improved computational efficiency.
The operational reliability of a given logic gate can be quantified by the variation tolerance factor (VTF).14 This refers to the maximum acceptable switching voltage variation that still guarantees correct gate operation. Most logic operations have been implemented through conditional set switching; therefore, the VTF can be read as the acceptable set voltage (VSET) variation. The VTF strongly depends on the ratio of the maximum reset voltage to the maximum set voltage (ρ = |VRESET,Max/VSET,Max|); the higher the ρ value, the higher the VTF. Meanwhile, the set voltage variation (ΔSET) is defined as (VSET,Max − VSET,Min)/VSET,Max. In this context, variation-tolerant gate operation means that the VTF is larger than ΔSET. Therefore, increasing |VRESET,Max| is an effective method for achieving a high VTF and variation-tolerant gate operation. The detailed explanation and calculation process of the VTF value of various logic gates can be found in Supplementary Note 1 in the ESI.†
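The definitions above can be sketched in a few lines of Python. This is a minimal illustration only: the function names are hypothetical, the voltage lists are made-up values rather than measured data, and "Max" is assumed to mean the largest magnitude.

```python
def set_variation(v_set):
    """Return ΔSET = (VSET,Max − VSET,Min) / VSET,Max using voltage magnitudes."""
    mags = [abs(v) for v in v_set]
    return (max(mags) - min(mags)) / max(mags)


def reset_to_set_ratio(v_set, v_reset):
    """Return ρ = |VRESET,Max / VSET,Max| using the largest magnitudes."""
    return max(abs(v) for v in v_reset) / max(abs(v) for v in v_set)


def is_variation_tolerant(vtf, delta_set):
    """A gate is variation-tolerant when its VTF exceeds ΔSET."""
    return vtf > delta_set


# Hypothetical switching voltages in volts (illustrative, not measured data).
V_SET = [0.80, 0.90, 1.00]
V_RESET = [-1.10, -1.15, -1.20]

delta_set = set_variation(V_SET)
rho = reset_to_set_ratio(V_SET, V_RESET)
print(f"delta_SET = {delta_set:.2f}, rho = {rho:.2f}")
print("tolerant at VTF = 0.333:", is_variation_tolerant(0.333, delta_set))
```

The VTF itself is passed in as a number here because its gate-specific derivation is given only in the ESI.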
Meanwhile, logic gates can be classified into two types: set-inhibition gates and reset-inhibition gates. The set-inhibition gates, including NOR and NOT gates, require inhibition cells to suppress set switching because they are subjected to a high voltage drop during gate operation. These gates have a constant VTF regardless of the reset switching behavior. In this context, it is worth emphasizing the study previously published by Kim et al.14 In that study, the authors suggested that NOR, NOT, and BUFFER gates (namely, NOR-based logic systems) theoretically have the highest VTF of 0.333, enabling the implementation of the most robust LIM. Using only these gates, a 1-bit full adder operation could be executed in 12 steps over 13 cells. This can serve as a benchmark, and further research is required to explore more efficient methods.
Unlike set-inhibition gates, reset-inhibition gates, such as the AND gate, have a VTF that depends on the reset switching voltage; as the reset voltage (VRESET) increases, the VTF increases. Fig. 1a represents three examples of typical bipolar resistive switching curves with a VRESET to VSET ratio (ρ) of 0.5, 1.0, and 1.5. The set and reset voltage ranges are indicated as blue and yellow boxes, respectively. Fig. 1b conceptually illustrates the VTF for the three cases. In the reset-inhibition gates, as ρ increases (i.e., as VRESET increases), the VTF increases. When ρ = 0.5 or 1.0, the VTF is not enough to cover the full range of the VSET distribution, meaning that there is a chance of gate operation error, whereas when ρ = 1.5, error-free gate operation is guaranteed.
Fig. 1c compares the VTF dependency between set-inhibition and reset-inhibition gates with respect to ρ. Here, when ρ increases, the reset-inhibition gates can be more practically viable, as the VTF is higher than the VSET ranges. (A comparison of the VTF value of other logic gates with different ρ values is summarized in Fig. S1 in the ESI.†) Consequently, reset-inhibition gates can achieve higher robustness as ρ is tuned to higher levels. However, few studies have demonstrated this approach of concurrently tuning the memristive device and developing the gating strategy.
In this study, we propose a memristive majority (MAJ) gate for the first time, which achieves the highest computational efficiency among the memristive LIM technologies under the robustness-ensured condition. The proposed MAJ gate is a three-input logic gate whose output is identical to the carry-out bit, generating output ‘1’ only if at least two inputs are ‘1’.24,25 The MAJ gate requires near-memory circuits to supply one input in the form of a voltage. While this incurs some additional cost, the benefit can outweigh that cost, which will be systematically demonstrated later.20 The MAJ gate falls into the reset-inhibition gates. Therefore, to experimentally validate the efficiency of the MAJ gate, we developed a HfOx-based memristor array with an increased VRESET (giving VRESET/VSET = 1.21), which achieved a VTF of 0.4, the highest VTF ever reported. More discussion on the superiority of the HfOx-based memristor compared to other memristors can be found in Supplementary Note 2 in the ESI.† Moreover, the MAJ gate allowed us to execute a combined 1-bit full adder and full subtractor using 7 memristors in 5 steps, resulting in a 4.5× increase in spatiotemporal efficiency compared to the benchmark NOR-based logic system. Furthermore, we applied our MAJ-based logic system to the N-bit parallel-prefix adder (PPA) and demonstrated a required number of steps of O(log2N), achieving 8.5× the spatiotemporal efficiency in 64-bit addition compared to the benchmark. This remarkably increased efficiency can be explained by the very small number of data manipulation steps and the parallel operation of the MAJ gate, which will be further discussed later. The efficiency increases as the bit size increases, highlighting its strong potential for large-scale and energy-efficient Boolean computing.
We demonstrated the MAJ gate operation on a memristive CBA. Here, we designed a series resistor (RS) incorporated CBA. The incorporated RS plays a crucial role for two key reasons. Firstly, it ensures the switching endurance of memristor cells. The RS establishes a self-limited switching configuration during set switching, thereby reducing the chance of over-setting and enhancing cyclic reliability.26,27 Additionally, the RS acts as a voltage divider in the LRS, increasing the VRESET while not affecting the VSET.28 This increased VRESET reduces the probability of unintended reset switching during logic operations, thereby ensuring a sufficient operational voltage margin.
To harness these advantages, we fabricated a Ta/HfO2/Pt memristor array in which the RS was carefully controlled. An optical microscope image of the RS-controlled array and a TEM image of the HfOx memristor can be found in Fig. S2 in the ESI.† The array was designed to precisely control the line resistance working as the RS. In the array, the wire part of the top Ta electrode had a resistance of ∼200 Ω. The bottom Pt electrode was designed with a diamond-shaped bridge structure to minimize the wire resistance. As a result, the Pt wire resistance was ∼50 Ω for the cell closest to the contact pad and ∼150 Ω for the cell farthest from the pad. Consequently, the cells in the array had the RS in the range of 250 to 350 Ω. Note that this level of the RS variation does not significantly affect the switching characteristics of the device (see Supplementary Note 3 in the ESI,† for a more detailed discussion on the RS variation on the gate operation). The HfO2-based memristor has been widely investigated for its high endurance and stable retention properties.29–31 More detailed device fabrication procedures can be found in the Experimental section.
Fig. 2b compares the I–V curves of two devices, one with 300 Ω of RS (blue lines) and the other without the RS (gray lines), to show the VRESET change caused by the RS. After including the RS, the average value of VRESET changed from −0.85 V to −1.2 V owing to the voltage divider effect, while VSET was unchanged, resulting in an increase in their ratio (ρ = |VRESET,Max/VSET,Max|) from 0.9 to 1.21 (see Supplementary Note 4 in the ESI,† for a more detailed explanation regarding the difference in the voltage divider effect during set and reset switching). The coefficient of variation (σ/μ) was 7.8% for VSET and 3.6% for VRESET. (The raw I–V curves for 50 switching cycles, and their VSET and VRESET distribution, can be found in Fig. S3 in the ESI.†) Notably, the device exhibited a self-limited switching behavior, where set-switching no longer progresses beyond a certain current level. This occurs due to the presence of a series resistance component, even without applying compliance current, ensuring the device's endurance and retention reliability.26 Consequently, the device shows robust endurance up to 10⁶ cycles (data shown in Fig. S4 in the ESI†), stable retention up to 10⁴ seconds at room temperature (data shown in Fig. S5 in the ESI†), and excellent device-to-device uniformity in the array (uniformity results shown in Fig. S6 in the ESI†). The main reason for incorporating the RS is to increase the VTF. The VTF is defined as the maximum allowable ΔSET, where ΔSET = (VSET,Max − VSET,Min)/VSET,Max. Fig. 2c shows the VTF of the MAJ gate under different ρ values. After incorporating the RS, ρ increased from 0.9 to 1.21, and consequently, the VTF increased from 0.3 to 0.4. Considering that the ΔSET in our device was 0.28, measured from Fig. S3 in the ESI,† the MAJ gate operation can be error-free with more room for accepting additional variations, such as those that may originate from device-to-device differences.
Fig. 2d shows the basic unit of the MAJ gate, comprising a 3 × 1 array configuration. Note that our MAJ gate operation can be divided into AND gate (when the input C = 0) and OR gate (when the input C = 1) depending on the value of VC. The optimum MAJ gate operation voltages are VCOND = −1.8 V, VPGM = −0.35 V, and VC = −0.65 V (for C = 1) or 0 V (for C = 0). Here, we denoted the VCOND and VPGM as VP and VQ, respectively. (A detailed operating voltage calculation process for the BUFFER, NOT, and MAJ gates can be found in Fig. S7–S9 in the ESI.†) Fig. 2e shows the successfully demonstrated MAJ gate operation results from the prepared array for all of the eight (3-bit) input conditions. The left panel shows the results for four input conditions when C = 0 so VC = 0 V, which is identical to the AND gate. The right panel shows the results for another four input conditions when C = 1 so VC = −0.65 V, identical to the OR gate. Here, the initial states of RA, RB, and RY0 and the final state of RY (blue line) were read at 0.25 V (gray line). The final states of RA and RB are not shown but their values remain unchanged after the gate execution. After reading the initial states, MAJ gate operations were executed by applying the operating voltages of VC (0 V when C = 0 and −0.65 V when C = 1, green line), VP (−1.8 V, red line), and VQ (−0.35 V, yellow line). The pulse rising time, width, and falling time were 1 μs, 10 μs, and 1 μs, respectively.
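The AND/OR behaviour described above can be checked with a purely logical model. The sketch below is not an electrical simulation: it only maps the input C to the quoted VC value and confirms that the majority function reduces to AND for C = 0 and OR for C = 1. The names `maj3`, `V_C` and `truth_table` are illustrative.

```python
from itertools import product

# Voltage applied for the third input C, as quoted in the text (volts).
V_C = {0: 0.0, 1: -0.65}


def maj3(a, b, c):
    """Three-input majority: 1 when at least two inputs are 1."""
    return int(a + b + c >= 2)


def truth_table():
    """Return rows (a, b, c, V_C, y) for all eight input conditions."""
    return [(a, b, c, V_C[c], maj3(a, b, c))
            for c, a, b in product((0, 1), repeat=3)]


for a, b, c, v_c, y in truth_table():
    mode = "OR" if c else "AND"
    print(f"A={a} B={b} C={c} (V_C={v_c:+.2f} V, {mode} mode) -> Y={y}")
```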
The MAJ gate by itself is not functionally complete, meaning its cascading cannot produce all Boolean logic, but it becomes complete when combined with a NOT gate.32 In addition, we demonstrated the BUFFER gate to manipulate data location. (Experimental demonstration of the NOT gate can be found in Fig. S10 in the ESI,† and the BUFFER gate in Fig. S11 in the ESI.†) The VTFs of the NOT and BUFFER gates are 0.333 and 0.877, respectively, both larger than the ΔSET of 0.28. Therefore, they are also practically viable gates with our device.
Additionally, a resistance-to-voltage converter (RVC) circuit is required in the peripheral area. This RVC circuit, consisting of a comparator, D flip-flop, inverter, and transistors, functions to read the data from the memristor cell, temporarily store it, and then generate a voltage signal corresponding to the read data. (The RVC's full circuit design and timing diagram can be found in Fig. S13 in the ESI.† A feasible array and circuitry architecture are shown in Fig. S14 in the ESI.†) The RVC consumes negligible energy, suggesting that using the RVC does not harm the performance in terms of energy consumption. (A comparison of energy consumption between the RVC circuit and the memristor is calculated in Fig. S15 in the ESI.†) More efficient use of the RVC will be discussed in the next section.
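As a small illustration of this completeness argument, the sketch below builds NAND and XOR from the majority function and NOT. It assumes that a constant 0 or 1 can be supplied as the third MAJ input, as in the AND and OR modes described earlier; the function names are illustrative.

```python
from itertools import product


def maj3(a, b, c):
    """Three-input majority: 1 when at least two inputs are 1."""
    return int(a + b + c >= 2)


def not_gate(a):
    return 1 - a


def and_gate(a, b):
    # MAJ with the third input fixed to 0 (assumed constant input).
    return maj3(a, b, 0)


def or_gate(a, b):
    # MAJ with the third input fixed to 1 (assumed constant input).
    return maj3(a, b, 1)


def nand_gate(a, b):
    # NAND is universal on its own, so MAJ + NOT + constants is complete.
    return not_gate(and_gate(a, b))


def xor_gate(a, b):
    return and_gate(or_gate(a, b), nand_gate(a, b))


for a, b in product((0, 1), repeat=2):
    assert nand_gate(a, b) == 1 - (a & b)
    assert xor_gate(a, b) == a ^ b
print("NAND and XOR reproduced from MAJ + NOT")
```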
In this standby state, the first step of the logic operation is to fetch the data of a using the RVC. The RVC stores the data temporarily in its D flip-flop, which can be used multiple times during the operation. Moreover, we can use the opposite value of a (i.e., ā) from the negative output of the D flip-flop. The second step is executing the MAJ gate for calculating cout using Rb, Rc, and Va as inputs and T1 as the output, giving a logical equation of MAJ3(a, b, cin) → T1. The third step is executing the MAJ gate again for calculating bout using Rb, Rc, and Vā as inputs and T2 as the output, giving MAJ3(ā, b, cin) → T2. The fourth step is inverting cout using the NOT gate, NOT(T1) → T3. The last step is executing the MAJ gate using T2, T3, and Va as inputs and T4 as the output, MAJ3(a, T2, T3) → T4, which results in the sum (= difference) bit (more detailed logical expressions for this calculation process can be found in Note 5 in the ESI†).
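The five-step sequence can be verified at the logic level with a short script. This is a sketch under the assumption that each cell simply holds 0 or 1; the function name is hypothetical, and T1 to T4 follow the labels in the text.

```python
from itertools import product


def maj3(x, y, z):
    """Three-input majority: 1 when at least two inputs are 1."""
    return int(x + y + z >= 2)


def full_adder_subtractor(a, b, c_in):
    """Return (c_out, b_out, sum_or_difference) following the five steps."""
    a_bar = 1 - a               # step 1: fetch a; the flip-flop also gives its complement
    t1 = maj3(a, b, c_in)       # step 2: carry-out
    t2 = maj3(a_bar, b, c_in)   # step 3: borrow-out
    t3 = 1 - t1                 # step 4: NOT(T1)
    t4 = maj3(a, t2, t3)        # step 5: sum bit, identical to the difference bit
    return t1, t2, t4


for a, b, c_in in product((0, 1), repeat=3):
    c_out, b_out, s = full_adder_subtractor(a, b, c_in)
    assert 2 * c_out + s == a + b + c_in   # full adder identity
    assert s - 2 * b_out == a - b - c_in   # full subtractor identity
print("adder and subtractor outputs agree for all 8 inputs")
```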
Fig. 3d shows the experimental results of the combined 1-bit FA and FS for all of the eight input conditions. The HRS (logical ‘0’) and LRS (logical ‘1’) are shown in blue and red colors, respectively. The first three columns (a, b, cin) are input cells for the eight input cases, and the other four columns (T1, T2, T3, and T4) are output cells during the logic operation. In summary, both 1-bit FA and FS could be achieved in 5 steps with 7 cells, giving a spatiotemporal cost (STC) of 35. (The STC is a measure of the computational efficiency of the memristive LIM, which can be calculated by multiplying the number of required steps by the number of used cells.) Our proposed method is 4.5× more efficient than the NOR-based 1-bit FA, whose STC is 156 (ref. 14). (A comprehensive specification comparison of the 1-bit full adder is summarized in Table S1 in the ESI.†) Furthermore, outputs of the FS can be obtained simultaneously, which was not possible in the NOR-based logic system, making our system more feasible for the ALU.
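The STC arithmetic can be reproduced directly from the step and cell counts quoted in the text (5 steps over 7 cells for the MAJ-based design, 12 steps over 13 cells for the NOR-based benchmark). The function name is illustrative.

```python
def spatiotemporal_cost(steps, cells):
    """STC = number of steps multiplied by number of cells used."""
    return steps * cells


stc_maj = spatiotemporal_cost(steps=5, cells=7)
stc_nor = spatiotemporal_cost(steps=12, cells=13)
gain = stc_nor / stc_maj

print(f"MAJ-based STC = {stc_maj}, NOR-based STC = {stc_nor}, gain = {gain:.1f}x")
```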
Such parallel operations can be effectively used in multi-bit adder operations. While there are various methods to implement the multi-bit adder,34–38 we have chosen to demonstrate the carry lookahead adder (CLA) operation, which significantly utilizes parallel operations and therefore maximizes the feasibility of the MAJ gate. Fig. 4b shows the initial data map for the 4-bit CLA operation. Here, the diagram shows that four cout bits (MAJ3(ai, bi, ci) → ci+1, where i = 1, 2, 3, 4) are sequentially obtained via the MAJ gate. Notably, even though ci is not aligned with ai and bi, the logic operation is possible without a data copy operation, because the RVC can fetch the ci from any location. Consequently, only 2 steps (data fetching and MAJ gate execution) are required to obtain a single cout, so 2N steps are required for all cout generation during an N-bit CLA. After all cout bits are generated in series, all sum bits can be calculated in parallel using MAJ and NOT gates through the approach of Fig. 4a. It requires only three steps (step 1: MAJ3(āi, bi, ci) → mi+1; step 2: inversion of ci+1 with the NOT gate; step 3: MAJ3 of ai, mi+1, and the inverted ci+1, giving the sum bit) regardless of the number of bits (N), thanks to the advantage of the parallel operation. The data mi is logically the same as the borrow-out bit (bi) in FS operation; however, we denoted it as mi to prevent any confusion (more detailed logic calculation process can be found in Supplementary Note 5, ESI†). In summary, with our proposed logic system, an N-bit adder operation can be achieved in 2N + 3 steps. (The entire step-by-step operation procedure of the 4-bit CLA can be found in Fig. S16 in the ESI.†)
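A logic-level sketch of this N-bit procedure is given below. It assumes a carry-in of 0 and counts steps as described in the text (two per serial carry, three for the parallel sum phase); it models only the Boolean behaviour, not the array mapping, and the function names are hypothetical.

```python
def maj3(a, b, c):
    """Three-input majority: 1 when at least two inputs are 1."""
    return int(a + b + c >= 2)


def to_bits(value, n):
    """Return the n least significant bits of value, LSB first."""
    return [(value >> i) & 1 for i in range(n)]


def maj_serial_carry_adder(x, y, n, c_in=0):
    """Add two n-bit integers; return (result, step_count)."""
    a, b = to_bits(x, n), to_bits(y, n)
    steps = 0

    # Serial carry generation: fetch c_i, then one MAJ execution (2 steps each).
    c = [c_in]
    for i in range(n):
        c.append(maj3(a[i], b[i], c[i]))
        steps += 2

    # Parallel sum phase: three steps regardless of n.
    m = [maj3(1 - a[i], b[i], c[i]) for i in range(n)]   # borrow-like bit
    steps += 1
    c_bar = [1 - c[i + 1] for i in range(n)]             # NOT of each carry-out
    steps += 1
    s = [maj3(a[i], m[i], c_bar[i]) for i in range(n)]   # sum bits
    steps += 1

    result = sum(bit << i for i, bit in enumerate(s)) + (c[n] << n)
    return result, steps


print(maj_serial_carry_adder(11, 6, 4))
```

For n = 4 the step counter gives 2 × 4 + 3 = 11, matching the 2N + 3 expression in the text.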
PPA is well-known for high parallelism during its operation, calculating both cout and sum bits in parallel.32,39–41 In conventional N-bit adders, such as CLA, cout are obtained sequentially (in series) so that it is the bottleneck of the entire process. However, PPA executes all operations in parallel, giving the number of required steps to follow a logarithmic function (∼log2N). This makes it highly efficient, particularly as the number of bits increases, compared to other conventional adders, whose number of required steps follows a linear function (∼N). In 2017, Pudi et al. proposed the circuit design of the PPA using the MAJ gate, where a recursive majority logic formula was used to maximize parallelism.41
Fig. 5a illustrates the overall schematic of the cout generation of the N-bit PPA. All of the input sets are computed simultaneously in each logic round. Therefore, PPA circuits require O(log2N) rounds to add N bits. While the logic steps required for each round differ among PPA types such as the Ladner–Fischer adder (LFA),32 Kogge–Stone adder (KSA),39 and Sklansky tree adder (STA),40 the fastest one is the KSA due to its greater utilization of parallel operations. This characteristic, specialized for parallel operations, makes it optimal for utilizing the MAJ gate, although the KSA has yet to be evaluated in stateful logic. The circuit diagram of the 8-bit KSA constructed solely using MAJ gates can be found in Fig. S17 in the ESI.†41 It comprises three levels to obtain all cout. Fig. 5b shows a data map in the array (left) and a logic operation procedure (right) for 8-bit cout generation. In the KSA, logic operations are conducted at the round unit, where each round includes a set of parallel MAJ gate operations, and at the k-th level, 2^(k−1) cout bits are obtained. Consequently, in the N-bit KSA, log2N logic rounds (i.e., l = log2N, where l is the final round) are used to calculate all cout. In our logic system, 4 steps are required at each logic round except for the final round (2 steps), as shown in Fig. 5b. As a result, calculating all cout for an N-bit KSA using the MAJ gate requires 4log2N − 2 steps.
The next process is to relocate all cout into a one-dimensional line to calculate sum bits in parallel, since the previously obtained cout are stored in various rows, as shown in Fig. 5b (c2 at the 3rd row, c3 and c4 at the 5th, and c5, c6, c7, and c8 at the 7th). This relocation process utilizes the BUFFER gate, which is also viable for parallel operation, and thus it requires log2N − 1 steps for moving the cout from round 1 to round l − 1 to the final row. (Fig. S18 in the ESI,† shows schematically the cout generation and relocation process of the KSA.) Lastly, all sum bits are calculated in parallel within 4 steps, which is the same process as the CLA in Fig. 4b. (Fig. S19 in the ESI,† shows a step-by-step sum calculation procedure.) As a result, the total number of required steps for an N-bit KSA operation is 5log2N + 1 (i.e., 4log2N − 2 + log2N − 1 + 4). (8-bit and 16-bit KSA block diagrams are shown in Fig. S20 and S21 in the ESI,† respectively.)
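To illustrate why a Kogge–Stone structure needs only log2N rounds, the sketch below uses the conventional generate/propagate formulation. Note that this is a swapped-in textbook technique, not the majority-only recursive formula of ref. 41 or the array-level procedure of Fig. 5b; it assumes N is a power of two and a carry-in of 0.

```python
def kogge_stone_add(x, y, n):
    """Add two n-bit integers with a Kogge-Stone prefix network.

    Returns (result, rounds), where rounds is the number of prefix rounds.
    """
    a = [(x >> i) & 1 for i in range(n)]
    b = [(y >> i) & 1 for i in range(n)]
    g = [ai & bi for ai, bi in zip(a, b)]   # generate
    p = [ai ^ bi for ai, bi in zip(a, b)]   # propagate
    p_initial = p[:]                        # needed later for the sum bits

    rounds = 0
    distance = 1
    while distance < n:
        # Every position is updated in parallel from the previous round's values.
        g_next, p_next = g[:], p[:]
        for i in range(distance, n):
            g_next[i] = g[i] | (p[i] & g[i - distance])
            p_next[i] = p[i] & p[i - distance]
        g, p = g_next, p_next
        distance *= 2
        rounds += 1

    carries = [0] + g                       # carries[i] is the carry into bit i
    s = [p_initial[i] ^ carries[i] for i in range(n)]
    result = sum(bit << i for i, bit in enumerate(s)) + (carries[n] << n)
    return result, rounds


print(kogge_stone_add(200, 100, 8))
```

The round counter doubles the look-back distance each time, so 8 bits take 3 rounds and 64 bits take 6.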
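The step budget above can be tallied with a short helper. It simply encodes the three counts stated in the text (carry generation, relocation, and sum calculation) and assumes N is a power of two; the function name is illustrative.

```python
def ksa_step_breakdown(n):
    """Return (carry, relocate, summation, total) step counts for an n-bit KSA."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two, at least 2")
    rounds = n.bit_length() - 1     # log2(n) for a power of two
    carry = 4 * rounds - 2          # 4 steps per round, 2 in the final round
    relocate = rounds - 1           # BUFFER moves from rounds 1 .. l-1
    summation = 4                   # parallel sum calculation
    return carry, relocate, summation, carry + relocate + summation


for n in (8, 16, 64):
    print(n, ksa_step_breakdown(n))
```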
Fig. 5c and d compare the required steps and cells, respectively, for the N-bit KSA using the MAJ gate proposed in this study and for the N-bit CLA using the NOR gate from ref. 14. Although the MAJ-based operation requires more cells as N increases, the number of steps decreases drastically, making it much more efficient. For the 64-bit adder computation, the number of required cells is 1024, which is 23% more than for the NOR-based operation, but the total number of required steps is 31, which is 90% less. As a result, as shown in Fig. 5e, the STC of the 64-bit adder operation using the MAJ gate is only 12% of the NOR-based one, which is an 8.5× higher spatiotemporal efficiency, suggesting its high feasibility for use in large-scale computation. Table 1 presents a comprehensive benchmarking of various memristive logic approaches for implementing N-bit adders. Meanwhile, the energy consumption using the MAJ gate is about 10% higher than using NOR-based logic. A comparison of energy consumption in the 64-bit adder between the NOR-based and MAJ-based logic systems can be found in Fig. S22 in the ESI.† The MAJ-based logic requires slightly more switching events compared to the NOR-based systems because it only utilizes set switching, whereas NOR-based systems use both set and reset switching to perform the same tasks. However, this difference is minor, and in logic applications that demand fast operation, the advantage of MAJ-based logic in terms of higher spatiotemporal efficiency should be emphasized.
Reference | Adder type | Logic state variable | Reliability ensured? | Steps (X) | Cells (Y) | 8-bit adder STC (= X × Y, N = 8) |
---|---|---|---|---|---|---|
Kim et al.14 | CLA | Resistance | Yes | 5N + 3 | 13N | 4472 |
Cheng et al.12 | CLA | Resistance | No | 6N | 11N − 11 | 3696 |
Song et al.18 | CLA | Resistance & voltage | No | 3N | 9N − 9 | 1512 |
Ruben et al.32 | PPA (LFA) | Resistance & voltage | No | 4log2N + 6 | 48N + 96 | 8640 |
This work | PPA (KSA) | Resistance & voltage | Yes | 5log2N + 1 | 2Nlog2N + 4N | 1280 |
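The STC column can be recomputed from the step and cell expressions in the table. The sketch below copies those expressions as written, assumes N is a power of two so that log2N is an integer, and uses shortened row labels of our own choosing.

```python
# (steps, cells) formulas copied from the table; k stands for log2(N).
ADDERS = {
    "Kim et al. (CLA)": (lambda n, k: 5 * n + 3, lambda n, k: 13 * n),
    "Cheng et al. (CLA)": (lambda n, k: 6 * n, lambda n, k: 11 * n - 11),
    "Song et al. (CLA)": (lambda n, k: 3 * n, lambda n, k: 9 * n - 9),
    "Ruben et al. (LFA)": (lambda n, k: 4 * k + 6, lambda n, k: 48 * n + 96),
    "This work (KSA)": (lambda n, k: 5 * k + 1, lambda n, k: 2 * n * k + 4 * n),
}


def stc(name, n):
    """Return steps x cells for the named adder at bit width n (power of two)."""
    k = n.bit_length() - 1
    steps, cells = ADDERS[name]
    return steps(n, k) * cells(n, k)


for name in ADDERS:
    print(f"{name}: 8-bit STC = {stc(name, 8)}, 64-bit STC = {stc(name, 64)}")

ratio_64 = stc("Kim et al. (CLA)", 64) / stc("This work (KSA)", 64)
print(f"64-bit STC ratio, NOR-based CLA over MAJ-based KSA: {ratio_64:.1f}")
```

Evaluating the same formulas at N = 64 reproduces the 31 steps and 1024 cells quoted in the text for the MAJ-based KSA.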
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4mh01196a
This journal is © The Royal Society of Chemistry 2025