Le Zhang,a Todd B. Krause,c Harnimarta Deol,a Bipin Pandey,a Qifan Xiao,a Hyun Meen Park,a Brent L. Iverson,*a Danny Law*bc and Eric V. Anslyn*a
aDepartment of Chemistry, The University of Texas at Austin, TX 78712, USA. E-mail: iversonb@austin.utexas.edu; anslyn@austin.utexas.edu
bDepartment of Linguistics, The University of Texas at Austin, TX 78712, USA. E-mail: dannylaw@austin.utexas.edu
cLinguistics Research Center, The University of Texas at Austin, TX 78712, USA
First published on 11th March 2024
Sequence-defined polymers (SDPs) are currently being investigated for use as information storage media. As the number of monomers in the SDPs increases, with a corresponding increase in mathematical base, the use of tandem-MS for de novo sequencing becomes more challenging. In contrast, chain-end degradation routines are truly de novo, potentially allowing very large mathematical bases for encoding. While alphabetic scripts have a few dozen symbols, logographic scripts, such as Chinese, can have several thousand symbols. Using a new in situ consecutive click reaction approach on an oligourethane backbone for writing, and a previously reported chain-end degradation routine for reading, we encoded/decoded a Confucius proverb written in Chinese characters using two encoding schemes: Unicode and Zhèng Mă. Unicode is an internationally standardized arbitrary string of hexadecimal (base-16) symbols which efficiently encodes uniquely identifiable symbols but requires complete fidelity of transmission, or context-based inferential strategies, to be interpreted. The Zhèng Mă approach encodes with a base-26 system using the visual characteristics and internal composition of Chinese characters themselves, which leads to greater ambiguity of encoded strings, but more robust retrievability of information from partial or corrupted encodings. The application of information-encoded oligourethanes to two different encoding systems allowed us to establish their flexibility and versatility for data storage. We found the oligourethanes immensely adaptable to both encoding schemes for Chinese characters, and we highlight the expected tradeoff between the efficiency and uniqueness of Unicode encoding on the one hand, and the fidelity to a script's particular visual characteristics on the other.
The prior use of SDPs for information encoding has focused on text passages in languages written in alphabetic systems.20,25 Expanding the palette of encoding monomers available allows an exploration of novel strategies for encoding different writing systems. Binary encodings are natural in the context of computers' recognition of a simple on/off distinction. Alphabetic scripts typically consist of a small number of characters and little meaningful information based on visual similarities between those characters. However, many East Asian writing systems are logographic, where the symbols can represent whole words. The characters in such systems often number in the tens of thousands. Morpho-syllabic Chinese characters each represent a syllable with distinct meanings, but also contain visual elements that meaningfully relate those characters to other visually similar characters. Further, the symbols historically cue aspects of the character's meaning or pronunciation, and in some cases visually disambiguate words that have the same pronunciation (homophones). Different methods of encoding and decoding Chinese characters make different decisions about what meaningful aspects of these visual relations between characters are encoded or ignored.27
Here, we apply our chemical methods to two existing encoding schemes that are attuned to different characteristics of logographic writing systems26–30 to establish the SDPs' adaptability, by encoding and decoding a Confucius proverb (Fig. 2). In advance of creating SDPs for encoding Chinese characters, we reviewed several linguistic approaches and selected two representative encoding schemes: Unicode and the Zhèng Mă (ZM) method, which privilege informational efficiency and visual fidelity, respectively.
Unicode aims to identify individual characters uniquely and efficiently across all major world scripts. Though issues surrounding an original symbol's subsequent variation in distinct milieus persist in Unicode, its adoption presents a notable expansion beyond the more limited ASCII-based encoding used in our previous work.19 However, Unicode does not encode visual information about characters or their internal composition, but instead arbitrarily assigns codes within a particular range. Thus, similar codes rarely imply similar characters, and vice versa. If even one element of the code point is lost or corrupted, an incorrectly identified character will be unrelated to the intended character. By contrast, a system based on the visual composition of characters encodes meaningful information with each element of the code string, so that mistakes in the encoding or decoding process may still yield characters similar to the intended target. While earlier work explored efficiency, e.g., through Huffman encoding,19 the present work seeks instead to explore the range of encoding styles supported by SDPs in an effort to spur novel approaches to preserving unique characteristics among the world's writing systems.
To explore SDPs' range of applicability using the same monomers, we sought a character encoding scheme capturing visual characteristics and internal composition of Chinese characters. Historically several such schemes have been used. Among the earliest, the Four-Corner (FC) method,33 devised in the 1920s, distinguishes 10 basic stroke shapes. It encodes each character by 4 digits recording the stroke shapes in a character's four corners. However, the resulting codes are far from unique: the FC method's conflict code rate (CCR, roughly the percentage of Chinese characters whose code corresponds to more than one character) approaches 85%.34 Thus, we felt the FC method was not optimal for using SDPs, where the symbol's context is unknown.
Several approaches reduce such ambiguities.29,35 The ZM method, from the early 1990s, reduces ambiguities36 to a CCR of just over 9%.34 In contrast to the FC method, ZM foregrounds characters' internal structure and maps Chinese characters to the standard QWERTY keyboard. ZM decomposes a character into distinct compositional elements (similar but not identical to traditional ‘radicals’), known as roots: groups of strokes that always appear as a unit, whether as a standalone character, or as a component repeated within numerous other characters. ZM divides roots into two classes, primary and secondary (Fig. 3). It maps primary roots to 1-letter strings of the QWERTY keyboard (the 26 letters of the English alphabet), and secondary roots to 2-letter strings. The method then decomposes a character into a sequence of primary and secondary roots in left-to-right, top-to-bottom order, and encodes the character by the sequence of strings corresponding to the roots. But ZM also imposes a set of rules to stipulate that no complete character code exceed 4 letters on the keyboard.36 With a list of the predefined correspondences between QWERTY letters and primary or secondary roots, a user can generate the 4-letter ZM code for any Chinese character. Thus, with 4 elements over a 26-symbol base, this allows 26^4 = 456,976 potential codes, roughly 10 times the current number of Chinese characters. This makes the ZM method a visually attuned system for encoding Chinese characters which can be stored as individual 4-mer oligourethanes using a base-26 encoding capacity.
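The arithmetic behind these code-space sizes can be checked with a short Python sketch. The function names are illustrative only and are not taken from the in-house notebook; the last line also counts the codes available when shorter ZM strings are permitted, as discussed later in the text.

```python
import math

def code_space(base: int, length: int) -> int:
    """Number of distinct strings of exactly `length` symbols over `base` symbols."""
    return base ** length

def code_space_up_to(base: int, max_length: int) -> int:
    """Number of distinct strings of length 1..max_length."""
    return sum(base ** n for n in range(1, max_length + 1))

def bits_per_symbol(base: int) -> float:
    """Information carried by one monomer when `base` side chains are available."""
    return math.log2(base)

print(code_space(26, 4))            # 456976 four-letter ZM codes
print(code_space(16, 4))            # 65536 four-digit hexadecimal code points
print(code_space_up_to(26, 4))      # 475254 ZM codes of one to four letters
print(f"{bits_per_symbol(26):.2f}") # bits per base-26 monomer
```

Under these assumptions, a base-26 monomer carries somewhat more information than a base-16 one (log2 26 versus log2 16 = 4 bits), which is the sense in which larger bases allow denser storage.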
Because, unlike Unicode, ZM does not achieve total uniqueness (∼9% CCR),34 a single Chinese character might not correspond to a single code, and vice versa. Some individual characters correspond to a variety of codes simply due to ambiguity in the order for listing the roots comprising the character: e.g., 近 (jìn, be near) corresponds to ZM codes PDW and WPD. Considering this potential for ambiguity, one benefit of an encoding scheme motivated by the visual layout of a character is that incorrectly identified characters will likely be visually similar to the intended character. Thus, while errors are more likely using ZM than Unicode, ZM errors will plausibly involve visually similar characters, rather than an entirely unrelated (and possibly not even Chinese) character, as might be the case with a Unicode error. We selected a quote (see below) to illustrate these redundancies, and we explore different heuristics needed to incorporate ZM into a viable SDP data storage workflow using OUs as the example.
To demonstrate our synthetic approach and its ability to encode the complexity of a logographic writing system, we chose an eight-character proverb from the Analects of Confucius: 性相近也，習相遠也,38,39 roughly “By nature [people] are near each other; by habitual action they become farther apart”.40 To further probe the versatility and adaptability of the approach, we encoded the proverb in both traditional and simplified Chinese characters (Table 1). The former appear in manuscripts through the centuries, but also find current use in Hong Kong, Taiwan, and other diaspora communities; the latter stem in part from earlier informal writing practices but were formalized over the 20th century into a system streamlined for modern writing needs in the People's Republic of China. While either system could in theory encode either script, we utilized the ZM method to encode the more recent simplified characters and Unicode for the more numerous traditional characters.
Starting with our published conditions for oligourethane synthesis,19 monomer 2 was first appended to the resin. However, instead of deprotecting the Fmoc to then add another monomer, we exposed the resin (1 eq.) to 0.25 equivalents of CuI, 0.5 equivalents of sodium ascorbate and 0.5 equivalents of tris(benzyltriazolylmethyl)amine (TBTA) for the CuAAC click reaction, along with 5 equivalents of the specific alkyne desired for encoding (see below). Then, following Fmoc deprotection, a second monomer 2 was coupled, and so on (Fig. 5), until the synthesis of the entire oligomer was complete. In this manner, we could achieve an oligourethane capable of carrying as many different R groups as necessary for the mathematical base we are writing in (base-16 for Unicode, base-26 for ZM). Considering the number of mass-differentiated terminal alkynes that exist, the mathematical base can be substantially increased, which is a significant advance for the field of digital polymers and information encoding, because larger bases allow for denser information storage.
In our very first test of the CuAAC click reaction on resin, an acceptable yield of 94% was achieved. However, our biggest concern was that accumulation of CuI and ascorbic acid over multiple cycles of organic solvents and reagents would damage the resin, possibly via Fenton-type chemistry. Because we needed to run several consecutive click reactions to have multiple different R groups on the oligourethane string, we anticipated that several exposures to CuI and ascorbic acid in conjunction with repeated swelling and shrinking of the resin throughout the steps could lead to loss of function. However, based on our results, the resins are robust enough to tolerate the repeated exposures, highlighting the power and utility of the CuAAC click chemistry.46 At the end of the synthesis, as we have previously published, the chromophore NBD was appended for analysis by LC-MS.
With the reaction conditions described above, we successfully synthesized 12 urethane-based oligomers (2 dimers, 2 trimers and 8 tetramers). The initial synthesis step (both coupling and click reaction) consistently proceeds smoothly. We attribute the high conversion to the absence of inorganic salt accumulation and the ready accessibility of the short chain on the resin. Generally, the conversions decrease as the number of steps increases, and some truncated oligomers are observed. The conversions of the 12 oligomers ranged from 38% to 90%. Out of the 12 synthesized oligomers, 9 yielded more than 60% conversion, while only 3 urethane oligomers resulted in less than 50% conversion, which, unsurprisingly, were all tetramers. However, one of the tetramers (oligomer 2) yielded an 81% conversion, which is notably close to the conversions observed for dimers and trimers. This illustrates that the reactivity of different click reaction partners (alkynes) influences the conversions, in addition to the number of steps. As this is a consecutive reaction that needs no stepwise purification, the stepwise conversions are not calculated. It is worth noting that only small amounts of material (<1 mg) are required for the sequencing step after the target oligomers are made. Hence, we do not collect the entire sample from the HPLC, nor do we calculate a yield because the resin loading is often variable and imprecise, just as with solid-phase peptide synthesis where yields are routinely not reported.
We first used Unicode for traditional Chinese characters. Molecular-level encoding in hexadecimal required that each symbol be represented by appending a single alkyne, of sixteen, as a coupling partner on the azido side chain of a monomer along the oligomer backbone. Therefore, a library of sixteen different commercially available mass-separated terminal alkynes was identified (Fig. 6). Two chemical principles were used to guide library design. First, the masses of all the terminal alkynes differed by at least 2 atomic mass units to enable robust differentiation by LC-MS. Second, no reactive nucleophilic functional groups were present, thereby avoiding side-reactions during urethane coupling. During the building of this library, it was quite easy to identify 32, 64, and 128 commercially available alkynes that fit our criteria, which speaks to the future possibilities for highly dense information storage using this approach to writing.
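The first design rule, a minimum mass separation between every pair of alkynes, lends itself to a simple automated check. The sketch below is illustrative only: the masses are hypothetical placeholders rather than the alkynes of Fig. 6, and the second rule (absence of nucleophilic functional groups) is not modeled.

```python
from itertools import combinations

def min_mass_gap(masses):
    """Smallest pairwise mass difference (in mass units) within a candidate library."""
    return min(abs(a - b) for a, b in combinations(masses, 2))

def is_valid_library(masses, min_gap=2.0):
    """True if every pair of candidates differs by at least `min_gap` mass units."""
    return min_mass_gap(masses) >= min_gap

# Hypothetical masses for illustration; NOT the actual library of Fig. 6.
candidate = [68.06, 82.08, 96.09, 102.05, 116.06]
too_close = [68.06, 69.06]

print(is_valid_library(candidate))  # True: the smallest gap is about 6
print(is_valid_library(too_close))  # False: the gap is only 1
```

A check of this kind scales to larger candidate sets (32, 64, or 128 alkynes) because it only needs the list of masses.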
Table 2 shows the hexadecimal Unicode code points for the traditional Chinese characters of the proverb discussed above. The individual hexadecimal symbols were assigned in 1-to-1 fashion to sixteen different alkynes (Fig. 6). After assigning monomers to the code points, we successfully synthesized the required eight oligomers (see ESI III(d)(1)†) via a combination of consecutive solid-phase CuAAC clicks and urethane coupling reactions, followed by prep-HPLC for purification. The O-terminus of each OU starts with the resin-preloaded alaninol (labeled with # in the sequence) or phenylalaninol (labeled with * in the sequence), which we have reported acts as a convenient indexing tool (Alaindex or Pheindex) for starting to read the mass spectra.19
The eight oligomers were sequenced in a 2:1 MeOH/H2O mixture with K3PO4 at 70 °C and submitted to LC-MS analysis at specific intervals for a period of 4 h. As a representative example, Fig. 7 shows that chain-end degradation removes each monomer from the O-terminus, thus truncating the oligomers iteratively. 27 out of 32 masses were observed clearly and distinctly. The precursor 4-mers #8fd1, #7fd2, and #9060 each overlapped with one of their truncated oligomers under the low-resolution LC-MS conditions due to their similar polarity. It is worth noting that the length of the truncated oligomers does not correlate with polarity, so the retention times of the species in an LC trace do not follow chain length. However, one can easily identify which LC peaks grew and diminished in sequence over time. Using mass spectrometry, we could easily observe +1 and +2 charged species, facilitating identification of all the species from the intensity differences between oligomers and truncated oligomers and from the mass differences (see ESI III(c)†).
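The principle of reading a sequence from the mass differences between successive truncated oligomers can be sketched in Python. This is a simplified illustration, not the authors' workflow: the residue masses and ladder values are hypothetical round numbers, and charge states, the indexing monomer, and the NBD cap are ignored.

```python
def decode_ladder(ladder, residue_masses, tol=0.5):
    """Read symbols from successive mass differences of a degradation ladder.

    ladder: masses of the precursor and its truncated products, in any order.
    residue_masses: dict mapping a symbol to the mass lost when that
        monomer is removed (hypothetical values in this sketch).
    tol: allowed deviation when matching a mass difference to a residue.
    """
    ordered = sorted(ladder, reverse=True)  # heaviest (full length) first
    symbols = []
    for heavier, lighter in zip(ordered, ordered[1:]):
        delta = heavier - lighter
        matches = [s for s, m in residue_masses.items() if abs(m - delta) <= tol]
        if len(matches) != 1:
            raise ValueError(f"ambiguous or unknown mass difference: {delta:.2f}")
        symbols.append(matches[0])
    return "".join(symbols)

# Hypothetical residue masses for four hexadecimal symbols.
residues = {"8": 300.0, "f": 320.0, "d": 310.0, "1": 290.0}
# Hypothetical ladder: full-length oligomer down to the remaining end fragment.
ladder = [500.0, 790.0, 1100.0, 1420.0, 1720.0]

print(decode_ladder(ladder, residues))  # 8fd1
```

Because only differences between observed masses are used, the order in which peaks elute does not matter, which is consistent with retention time not tracking chain length.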
Fig. 7 (a) The Unicode code point for the corresponding Chinese character, and the associated oligourethane. (b) The LC trace of sequencing oligourethanes with K3PO4; the reaction was heated at 70 °C in a microwave. (c) The corresponding exact masses of the oligomer and each truncated oligomer (see corresponding mass spectra in ESI†).
Having thus decoded the stored Unicode code points, we recorded the hexadecimal codes in a Python list. We then passed this list to a short Python function, in a Jupyter notebook developed in house, which prints the characters corresponding to the Unicode code points. This reconstructed the original text without errors and without any foreknowledge of the proverb, as in our previous work.19,20
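A minimal sketch of such a conversion is shown below. It is not the in-house notebook code; the function name is an assumption, and the three code points are those of the oligomers #8fd1, #7fd2, and #9060 mentioned above.

```python
def decode_code_points(hex_codes):
    """Convert a list of hexadecimal Unicode code points into a text string."""
    return "".join(chr(int(code, 16)) for code in hex_codes)

# Code points read from the sequenced oligomers (index symbol '#' removed).
codes = ["8fd1", "7fd2", "9060"]
text = decode_code_points(codes)
print(text)

# Round trip: re-encoding the decoded text recovers the same code points.
print([f"{ord(ch):04x}" for ch in text])
```

The round trip in the last line is a convenient sanity check, since any single misread hexadecimal digit yields a different and generally unrelated character.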
With the success of our encoding of traditional Chinese characters with Unicode, we moved to encoding the same proverb in simplified Chinese characters with the ZM method.47 Thus, the proverb was converted to a base-26 symbolic system simply by increasing our library to 26 terminal alkynes (Fig. 8). In addition, ZM only requires four letters as a maximum code length but permits shorter codes. This provides opportunities for employing single and short string oligourethanes (e.g., 2-mers, 3-mers, or 4-mers) to encode a single character. When including the indexing monomer, this led to three 2-mers, two 3-mers and three 4-mers (Table 3), corresponding to the eight simplified Chinese characters (see ESI III(d)(2)†). The synthesis of the OUs was performed as for Unicode encoding, by iterative couplings, deprotections, solid phase CuAAC clicks, and capping with NBD. Cleavage from the resin was performed with 1% trifluoroacetic acid (TFA) in dichloromethane (DCM) for 10 min. Purification with HPLC was performed before sequencing.
As with the Unicode oligomers, we sequenced these oligomers concurrently via chain-end degradation in a 2:1 MeOH/H2O mixture with K3PO4 at 70 °C in a heated shaker. These reactions were monitored by LC-MS every 60 min for 4 h. 23 of 24 masses (three 2-mers, two 3-mers and three 4-mers) were observed clearly and distinctly in the 470 nm channel under the generalized low-resolution LC-MS conditions (Fig. 9). The precursor 4-mer #bdrw overlapped with one of its truncated oligomers. As discussed above, a lack of resolution between the precursor and a truncated oligomer does not cause any issues. We ran the decoding process with the in-house software to uncover the information stored within the oligourethanes. Specifically, the resulting ZM codes were passed to a Python list and fed to a specific function in the Jupyter notebook to render the appropriate Chinese characters, with additional heuristics described below to deal with ambiguous or non-unique ZM codes. Once again, the workflow returned the proper Chinese text with no errors and with no foreknowledge of the proverb.
Fig. 9 (a) The ZM code and corresponding Chinese character, and the associated oligourethane. (b) The LC trace of sequencing oligourethanes with K3PO4; the reaction was heated at 70 °C in a microwave. (c) The corresponding exact masses of the oligomer and each truncated oligomer (see corresponding mass spectra in ESI†).
A principal motivation of our foray into the ZM encoding was to open the door to applying visually based encoding systems to information storage in a chemical modality, irrespective of the use of oligourethanes. In this regard, ZM's occasional lack of uniqueness provides a novel challenge to chemical encoding. To overcome the obstacles posed and create a straightforward map from text to chemical storage and back, we explored the use of heuristics. Where a single code does not uniquely specify a single Chinese character, the redundancy can derive from the encoding of multiple-character phrases. We therefore chose only single-character correspondences, omitting multiple-character strings. And when a single character corresponds to more than one code, this often derives from “shortcuts”: i.e., additional shorter codes to represent a character. We therefore restricted consideration to the longest code available for any given character: e.g., BRW, WBR, and BDRW can all represent 远 (yuăn, be far), and so we chose the longest, BDRW. This remained practical because the oligourethane synthesis routine is so simple. Nevertheless, ZM also retains same-length ambiguities such as PDW and WPD for 近 (jìn, be near), and BDRW and WBRD for 远 (yuăn, be far). Resolution of such cases required an additional heuristic, applying alphabetical order and choosing the first code: thus, we chose BDRW over WBRD for 远 (yuăn, be far).
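The two code-selection heuristics (longest code first, then alphabetical order among codes of equal length) can be expressed compactly. The following is a minimal sketch using the codes quoted in the text; the function name is hypothetical and this is not the in-house implementation.

```python
def canonical_code(codes):
    """Choose one ZM code for a character from its alternatives.

    Heuristic 1: prefer the longest code (discard 'shortcut' codes).
    Heuristic 2: among codes of equal length, take the first alphabetically.
    """
    return min(codes, key=lambda code: (-len(code), code))

# Alternative codes quoted in the text for yuan (be far) and jin (be near).
print(canonical_code(["BRW", "WBR", "BDRW", "WBRD"]))  # BDRW
print(canonical_code(["PDW", "WPD"]))                  # PDW
```

Sorting on the pair (negative length, code) applies both rules in a single pass, so the choice is deterministic regardless of the order in which a lookup table lists the alternatives.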
With these heuristics, we succeeded in closing the encoding loop: text is input and converted uniquely to ZM codes, which are then converted to unique OUs. Conversely, chain-end degradation yields a sequence of ZM codes, and these codes are then converted to unique Chinese characters to reproduce the original text (cf. Fig. 2). While our heuristics allowed correct identification of all characters, the lack of uniqueness introduces the possibility of incorrect identification of characters in the decoding process. But a distinct advantage of a visually based encoding scheme like ZM is that such errors will often visually approximate the target character: considering the code BDRW for 远 (yuăn, be far), if we had misread the final W as D, we would have obtained BDRD for 元 (yuán, first); or misreading the final W as G gives BDRG for 顽 (wán, obstinate), containing the same central element 元 present in 远. Thus, if the wrong character is selected, that character may share visual similarities with the target character, helping a competent reader to infer the correct intended character (though, as Table 3 shows with codes YI and YT, respectively 也 and 习, this similarity has limits). Finally, we note that the Python scripts automated the process of sifting through correspondence tables, matching Chinese characters with the corresponding Unicode or ZM codes; this step could be performed manually, avoiding the computer's binary altogether. Only the Unicode and ZM encodings are inherent to the procedure.
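The effect of a single misread letter can be explored by enumerating all one-letter substitutions of a code and looking them up in a correspondence table. The sketch below uses a deliberately tiny table containing only the three codes discussed above, with the characters matching the pinyin and glosses given in the text; a real ZM table would be far larger, and the function name is hypothetical.

```python
import string

def substitution_neighbors(code, table):
    """Return table entries reachable by misreading exactly one letter of `code`."""
    hits = {}
    for i, original in enumerate(code):
        for letter in string.ascii_uppercase:
            if letter == original:
                continue  # not a misread
            candidate = code[:i] + letter + code[i + 1:]
            if candidate in table:
                hits[candidate] = table[candidate]
    return hits

# Minimal illustrative table: only the codes discussed in the text.
zm_table = {"BDRW": "远", "BDRD": "元", "BDRG": "顽"}

print(substitution_neighbors("BDRW", zm_table))
```

With this table, the neighbors of BDRW are BDRD and BDRG, both of which share the root encoded by BDR with the intended character.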
Footnote
† Electronic supplementary information (ESI) available: Detailed experimental procedures, sequencing experiments, supplementary data, and spectral data for all new compounds. Detailed instructions for the interpretation and user manual of Python script. See DOI: https://doi.org/10.1039/d3sc06189b
This journal is © The Royal Society of Chemistry 2024