Drew A. Leins *a, Steven B. Haase b, Mohammed Eslami c, Joshua Schrier d and Jared T. Freeman a
aAptima, Inc., USA. E-mail: drewleins@gmail.com
bDuke University, Depts. of Biology and Medicine, USA
cNetrias, LLC., USA
dFordham University, Department of Chemistry, USA
First published on 29th November 2022
Many domains across physical, life, and social sciences suffer from incomplete models of constructs (e.g., organisms, environments, behaviors), which hinders reproducibility and the pace of discovery. Critically, the prevailing research paradigm, of individuals or small groups working within the resource constraints of their own organization, does little to support model completion and discovery. It does not integrate capabilities, enable investigators to generate data at scale, or offer a path to sharing knowledge at the level of data (versus at the level of conclusions). To develop and deploy a new paradigm for conducting science, the Defense Advanced Research Projects Agency (DARPA) created the Synergistic Discovery and Design (SD2) program. The SD2 program proposed a novel method of conducting science that (1) integrates the capabilities of multiple organizations across multiple domains and (2) makes implicit knowledge explicit at all stages of the scientific process. It assembled and integrated a sociotechnical system that aimed to overcome the limitations of conventional scientific methods. In this paradigm, scientists and technologists collaborated to develop technologies, share data, and conduct science in ways that were faster, more efficient, more complete, and more productive than was possible outside of this program. This paper describes the SD2 approach to developing that sociotechnical system (selectively applying conventional methods of science, embracing a more collaborative paradigm, and establishing an infrastructure) to drive discovery.
This paradigm can also be operationally efficient because it allows investigators to leverage their own strengths, knowledge, and previous methods and results. Such an approach allows for continuity in developing, conducting, and disseminating a line of research. For example, iterative or derivative methods that address the same research question may yield results that are comparable across experiments and thus easily aggregated and interpreted. The individualist paradigm also consolidates scientific knowledge so that the individuals who know the most about a dataset are those who generated it, thus accelerating the process of reporting results. In short, the individualist paradigm allows investigators to conduct principled, efficient, and sometimes impactful science. However, it imposes constraints that put grand discovery out of reach, slow the pace of progress, and breed confusion through a lack of reproducibility.
The prevailing approach to authoring protocols constrains scientists to designing experiments that feature and leverage their own organization's resources. As noted previously, capabilities may not be compatible across organizations; thus, even in response to a common problem, different labs often author protocols that call for different experimental materials, instruments, and resources. These differences necessarily render the datasets incompatible and ill-suited for integration. Instead, aggregation of relevant scientific knowledge must occur at the level of conclusions. This may be acceptable when conclusions are compatible;10 however, it becomes problematic when conclusions are incompatible because they are based on different findings yielded by different experimental methods. Identifying the precise point in the research cycle at which variance was introduced and influenced the results is critical, but very difficult across organizations working in siloed, parallel fashion.
Within the individualist paradigm, it has become common to collaborate in ways that separate data-collection groups and activities from data-analysis groups and activities. This style of collaboration is prevalent where the data collected are large (e.g., ‘omics data sets). As experimentalists advance from design to experiment to data collection, implicit knowledge (e.g., the intent of an experiment), if not made explicit, may fail to propagate to the data analysis groups (or if it propagates, it does so ambiguously, leaving room for confusion). The lack of integration between experimentalists and analysts in this scenario can result in conclusions that are misaligned with experimental intent and methods. Consequently, it becomes difficult to reconcile conclusions derived from analysis across experiments.
Moreover, a methodological paradigm that fails to support the aggregation of knowledge before the level of conclusions is necessarily slow and difficult to accelerate. When national- or global-level problems require rapid, large-scale responses, interested and capable organizations cannot simply join the cause and contribute their data unless those data are compatible. Instead, responding organizations typically operate in parallel, siloed fashion, executing full experimental cycles before sharing potentially incompatible conclusions. For example, in response to the COVID-19 pandemic, there was a large-scale, globally distributed effort to generate data relevant to pathogen genomics, clinical outcomes, and epidemiological factors. Recognizing the value of rapid dissemination of information, many publications provided open access to COVID-19 research and released data prior to peer review.13 Useful as these measures were, the data that were made available were not necessarily easy to aggregate for analysis, not least because data collection methods were not standardized or transparent and there was no globally accessible infrastructure for transforming and aggregating idiosyncratic data for analysis.14 Consequently, reconciling and compiling these data required extensive manual labor, a suboptimal outcome. By contrast, a paradigm that aggregates knowledge at the level of data supports scaling and acceleration to address a wide range of problems more quickly and comprehensively.
In short, the individualist model of science fails to support reproducibility, data sharing, advances in experimentation and analytics, and rapid progress to common conclusions. Hurdles to generating, analyzing, and sharing data at scale prevent scientists from accelerating model development and discovery. A new paradigm is needed to overcome these hurdles.
One can look to a recent example in the study of protein structure, the success of AlphaFold, to illustrate the benefits of collaborative science and, in particular, of standardized data formats and advances in data analytics.16 For decades, numerous researchers worked individually to document and collect many high-resolution structures of diverse proteins. To be publishable, these data had to be in a standard format and posted to a shared repository, the Worldwide Protein Data Bank (wwPDB).17 This data bank, which currently supports large-scale data sharing and reuse, took more than 20 years to gain traction with scientific, publishing, and funding communities. It was founded in 1971 by a relatively small community of scientists who were willing to share their data. At that time, the larger scientific community was less inclined to share data, at least in part because of unresolved questions regarding the scientific and commercial value of withholding data (e.g., to improve their accuracy or realize financial gains).18 It was not until 1989 that the International Union of Crystallography's (IUCr) Biological Macromolecule Commission addressed these concerns and articulated its policy that structures be deposited in the PDB prior to publication and that their public release could be delayed for only a limited time.19 Then, in 1992, more than 20 years after the PDB was founded, the NIH and other funding agencies began to adopt formal policies mandating deposition of research results in the PDB.20 Since the year 2000, over 209,000 datasets have been deposited to the PDB. These data are freely and publicly available. Consequently, they have been used to inform a vast array of prospective solutions associated with predicting protein structure. For example, the ten most cited structures in this database had been cited, collectively, over 31,000 times through 2018.21 Most recently, wwPDB datasets were used to train novel data-driven neural-network models that provide a computationally efficient method of predicting structures that is as accurate as experimental methods that are far more costly and less efficient.22 The SD2 program aimed to accomplish a similar outcome, though more quickly, in synthetic biology and materials chemistry, as these fields have suffered from a paucity of large, accessible datasets conducive to applications of advances in AI/ML (though see resources such as SynBioHub,23 the Cambridge Structural Database24 [CSD], and the Inorganic Crystal Structure Database25 [ICSD], which provide access to large repositories of structural data for some biological and chemical constructs). Here, we describe components of a framework for collaborative science that aggregates and integrates scientific insight at the level of data and generates datasets of sufficient quantity and quality to support better experimental decision-making, leverage advances in analytics, and accelerate discovery.
• Discovery, responsible for applying machine learning and other sophisticated analytics to data, to inform model development.
• Design, responsible for designing model constructs and developing protocols for testing those constructs.
• Experimentation, responsible for executing protocols and generating raw data.
• Data Infrastructure, responsible for providing and maintaining the hardware and software for transferring, storing, curating, and processing data at Petabyte scale.
• Socio-technical Integration, responsible for facilitating, monitoring, and measuring collaboration in, and the evolution of, the sociotechnical system.
Separating design, discovery, experimentation, and data infrastructure created loops of data flow that required the scientists and technologists of SD2 (designated “program performers”) to collaborate to identify the most propitious ways to generate, share, and operate on data. Performers collaborated through several methods, including attending an in-person hackathon and quarterly program-wide integration meetings, participating in small groups to define and refine workflows and roles, interfacing on collaboration-focused platforms (e.g., Slack, Google Drive, GitHub, GitLab), and leveraging a shared infrastructure for housing data and software (the SD2 Environment or “SD2E”).26 Through these methods, performers negotiated the scientific questions they would address, the methods they would use to address them, and the division of labor across teams addressing a challenge. Simultaneously, performers identified friction points that slowed communication or the transfer or use of data and developed solutions to mitigate those friction points.
For example, during a week-long program-wide hackathon at the start of the program, existing datasets relevant to each scientific topic were delivered to teams of performers clustered by topic. As is typical of most research datasets, these were labeled and organized idiosyncratically, making it difficult for performers to understand the intent of the activities that generated the data and to identify methods for analyzing them. It became clear that if multiple teams were to operate on the same datasets, those teams needed a mechanism for understanding the data. This led to negotiating methods for describing activities, data, and metadata in a standardized way, so that any performer on the program could, in principle, understand and operate on any SD2 dataset (akin to implementing FAIR data standards).27 Thus began development of the sociotechnical infrastructure that evolved into a system of people, technologies, and processes (the sociotechnical system, or STS) that guided the work within, and the collaboration across, each technical area.
This system allowed design teams to send models to experiment planners, who compiled and pushed protocols to experimental labs, which generated and shared data with analysts, who operated on the data and fed results back to designers for model refinement (see Fig. 2). This became known as the design-build-test-learn (DBTL) loop. The general workflow of this loop has its roots in Walter Shewhart's specification–production–inspection cycle for quality control and iterative product development in manufacturing (1939), which W. Edwards Deming popularized in the 1950s as the plan-do-check-act (PDCA) cycle, essentially a formalized version of the scientific method applied to manufacturing. The broad contours of iterative ligand synthesis and validation have existed for some time in pharmaceutical research, and the practice certainly predates any explicit name for it. Early invocations refer to cycles of design, synthesize, test, interpret28 or design, test, make, analyze,29 which consciously adopt quality-management terminology for the discovery process. As implemented in SD2, the DBTL loop was an architecture for collaboratively and iteratively generating and analyzing data at scale and for enhancing scientific models in ways not achievable through individualist science.
To facilitate protocol authoring and sharing across organizations, the SD2 program developed a suite of standardization tools (e.g., document templates and software applications) that integrated with commonly used document and spreadsheet formats. One of those tools, the Experiment Request, is a Google Docs-based template for organizing the prose of a protocol—the details that characterize a protocol—into a machine-readable format for driving an experiment.30 Critically, it is a dynamic document that records and timestamps changes to the experimental goal, rationale, execution plan (including a matrix of materials, parameters, and measurements), and analysis plan. It provides users with references to relevant artifacts, matrices of experimental materials and variables, and parameters for candidate analytics (e.g., expected data types and assertions of measurement dependencies). It also identifies metadata requirements, potential or expected challenges to running the experiment, and links to newly generated data. This template evolved with use over time, standardized the language of experimental protocols, and resulted in a more streamlined DBTL workflow. Consequently, performers spent fewer hours searching for clarification of critical details such as experimental parameters or data and metadata structures.
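As an illustration of the kind of structured information an Experiment Request captures, the following Python sketch models the request as a simple record with a timestamped change log. The ExperimentRequest class and its field names are hypothetical, chosen only to mirror the elements described above; the actual template is a Google Docs document rather than code.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Tuple

@dataclass
class ExperimentRequest:
    """Illustrative stand-in for an SD2 Experiment Request; field names are hypothetical."""
    goal: str
    rationale: str
    execution_plan: List[Dict]        # matrix rows of materials, parameters, and measurements
    analysis_plan: List[str]          # candidate analytics, expected data types, dependencies
    metadata_requirements: List[str]
    anticipated_challenges: List[str] = field(default_factory=list)
    data_links: List[str] = field(default_factory=list)
    change_log: List[Tuple[str, str]] = field(default_factory=list)

    def record_change(self, description: str) -> None:
        # Timestamp every revision so the request remains a living, auditable document.
        self.change_log.append((datetime.now(timezone.utc).isoformat(), description))
```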
To further streamline the cross-organization workflow, program performers developed the Intent Parser,31 an application that processes the Experiment Request to further standardize the language in a protocol by mapping constructs in the request to canonical definitions in a shared database (e.g., a “Data Dictionary”)32 and linking these constructs to descriptions in a widely accessible resource (e.g., SynBioHub).23 When the Intent Parser encounters a novel term (for example, an idiosyncratic shorthand label for a material or procedure), it flags it for clarification. This allows scientists either to adopt others' (perhaps more common) terms and revise their protocol or to add new terms to the shared database. Thus, the Intent Parser helped standardize protocols while also allowing flexible protocol authoring. Critically, it reduced the need for all collaborators to know and use the same language during planning, as it discovered language discrepancies and incrementally nudged collaborators toward a shared lexicon.
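A minimal sketch of the term-reconciliation step described above, assuming a shared dictionary of canonical labels; the dictionary contents and the reconcile_terms function are illustrative and do not reflect the Intent Parser's actual implementation or API.

```python
# Illustrative only: canonical terms would live in a shared "Data Dictionary",
# with entries linked out to a public resource such as SynBioHub.
DATA_DICTIONARY = {
    "m9 minimal media": "M9 Minimal Media",
    "iptg": "IPTG",
    "od600": "Optical Density (600 nm)",
}

def reconcile_terms(protocol_terms, dictionary=DATA_DICTIONARY):
    """Map local shorthand to canonical labels and flag unrecognized terms for clarification."""
    mapped, flagged = {}, []
    for term in protocol_terms:
        canonical = dictionary.get(term.strip().lower())
        if canonical is not None:
            mapped[term] = canonical
        else:
            # The author either adopts an existing canonical term or registers a new one.
            flagged.append(term)
    return mapped, flagged

mapped, flagged = reconcile_terms(["IPTG", "M9 minimal media", "our_strain_7b"])
# mapped -> {'IPTG': 'IPTG', 'M9 minimal media': 'M9 Minimal Media'}; flagged -> ['our_strain_7b']
```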
This approach to protocol authoring defines the materials and methods of an experiment with sufficient precision to ensure reproducibility (or at least to raise the likelihood of reproducibility), yet it grants scientists license to use some local terminology by automating the standardization of terms. It also allowed a level of efficiency and productivity not realized prior to developing these applications: in a four-month period following development of this approach, 19 users from various organizations generated 34 experiment requests across three different protocols, yielding over 16,000 experimental samples generated across multiple sites. Moreover, the standardized language and formats of these protocols allowed simpler and faster analysis and more effective data sharing across the program.31 It is important to note, however, that these applications were born of a struggle to develop a single, automated tool for ingesting the prose of any synthetic biology experiment protocol and translating it into a protocol that is executable by one of multiple laboratories with different infrastructures and capabilities. The work of developing such an automated tool (for example, reconciling the different methods, materials, and lexicons of over a dozen organizations) proved to be too costly in time and resources to pursue. Instead, the SD2 community decided that the efficiency afforded by the combination of small modifications to the content and structure of written protocols and the relatively simple automation of the Experiment Request and Intent Parser tools outweighed the cost of developing a fully automated protocol translation tool.
Whereas the synthetic biology working groups maintained and reconciled existing languages for representing new protocols, the materials chemistry working group had to invent new processes for representing protocols because the methods for testing their prescribed reactions had never before been automated. Hence, this group adopted a co-design model in which it simultaneously developed the underlying experimental laboratory protocol (designated as a “workflow” of reaction and characterization steps) and the software to support experimental specification and reporting. The experimental development was geographically distributed, with one team working on bench-scale development and the other team focused on automating those processes. A third team of software developers served as a conduit between the two experimental groups, capturing and documenting the necessary specifications and metadata required to perform the experiment. This informed both the software development as well as the evolution of the protocol and ensured that the protocol could be executed at different sites with different capabilities.
Aquarium is a lab-management software application designed to support human-executed experimentation at the benchtop (though it can also drive automated experimentation). It is a user-friendly application that offers predefined, common experimental procedures and allows users to define their own protocol templates. Every defined, executable protocol step is represented as a unit that authors can select and chain to generate simple or complex protocols. This standardization generates protocols that are easy to repeat and share. By installing and using Aquarium at multiple performer sites, the SD2 program developed a network of laboratories that can leverage each other's capabilities to generate complementary datasets and test for reproducibility.
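The following sketch illustrates the compositional idea behind Aquarium-style protocol authoring: each executable step is a reusable unit that authors chain into a protocol. The classes and example steps are hypothetical and simplified; they are not Aquarium's actual object model.

```python
class ProtocolStep:
    """One predefined, executable unit of work (hypothetical representation)."""
    def __init__(self, name, instructions, parameters=None):
        self.name = name
        self.instructions = instructions
        self.parameters = parameters or {}

class Protocol:
    """Chain standardized steps into a protocol that is easy to repeat and share."""
    def __init__(self, title):
        self.title = title
        self.steps = []

    def chain(self, step):
        self.steps.append(step)
        return self  # enables fluent chaining of simple or complex protocols

inoculate = ProtocolStep("Inoculate culture", "Inoculate 5 mL of medium from a single colony.")
measure = ProtocolStep("Plate reader", "Measure OD600 at regular intervals.", {"interval_h": 1})
growth_curve = Protocol("Growth curve").chain(inoculate).chain(measure)
```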
Autoprotocol is an experiment-building language designed to produce machine-executable protocols that enable automated experimentation. Similar to Aquarium, Autoprotocol allows an author to identify experimental parameters, compile a protocol of executable steps, and share the protocol with an appropriately equipped lab. Autoprotocol differs slightly from Aquarium in that it requires greater knowledge of coding; however, with modification, it can offer a user interface that allows scientists to select protocol elements and configure them in a workflow that can be easily read and interpreted by a human lab tech or an automated agent.
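As a rough illustration of what a machine-executable protocol looks like in this style, the sketch below assembles an Autoprotocol-like JSON document pairing container references with instructions. Field names and operation schemas are simplified for illustration; consult the Autoprotocol specification for the exact format.

```python
import json

# Simplified, Autoprotocol-style structure: named container "refs" plus an ordered
# list of machine-executable "instructions". Schemas here are illustrative only.
protocol = {
    "refs": {
        "culture_plate": {"new": "96-flat", "store": {"where": "cold_4"}},
    },
    "instructions": [
        {"op": "incubate", "object": "culture_plate", "where": "warm_37",
         "duration": "4:hour", "shaking": True},
        {"op": "spectrophotometry", "object": "culture_plate", "mode": "absorbance",
         "wavelength": "600:nanometer"},
    ],
}

# Serialize for hand-off to an appropriately equipped automated lab.
print(json.dumps(protocol, indent=2))
```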
The software supporting materials chemistry experiments, ESCALATE, originated from cross-organization work and thus emphasizes cross-lab interactions. Because the labs participating in SD2 had different levels of automation, ESCALATE was developed to support both human and machine instructions. It was originally developed as a lightweight framework to prototype designs using a shared file system and spreadsheets. It was forged in the context of designing experiments to support metal halide perovskite crystallization. Using the lessons learned from the co-design process, a more general version was developed to allow for arbitrary experiments. Borrowing lessons learned from the development of Aquarium and Autoprotocol, the underlying model of experiments in ESCALATE allows for import and export of Autoprotocol experiment specifications. In principle, this can also be adapted for import and export of other emerging standards for describing reaction data, such as those found in the Open Reaction Database37 and those using the Universal Chemical Description Language (χDL).38
Aquarium, Autoprotocol, ESCALATE, and similar applications support collaborative science and reproducibility in several ways. First, these applications help standardize technical language, resulting in protocols that can be read and understood by users who share that language. Second, they allow scientists to author a protocol to be executed in any compatibly equipped lab. Collaborations can exist in which partner labs offer collaborators a menu of experimental capabilities from which to choose. Thus, for example, a scientist with no access to measurement devices could author a protocol to be run in a fully instrumented lab because both the author and the lab run instances of Aquarium. These applications can also be force multipliers. A scientific question requiring quick production of large datasets (e.g., data characterizing a novel virus at the start of a pandemic) could be addressed via multiple labs running the same protocol, thus consolidating an otherwise distributed workforce. A protocol that might have been executed by few can now be executed by many.
To support capability integration even further, at the time of this publication, an application—the Open Protocol Interface Language (OPIL)—is in development to translate protocols between Aquarium and Autoprotocol.39 With a common protocol language, the data and workflows from labs with different operating systems (e.g., human operated or semi-automated) will be interoperable. This will allow more efficient vetting of protocols and convergence on candidates for automated or high-throughput execution. As more labs integrate in this way, they can scale up responses to emergent problems without requiring investment in new infrastructure for automated workflows.
Performers on SD2 also used metadata to describe analyses. Downstream, at the “learn” stage of the DBTL loop, many parameters were required to configure analytics. These parameters became metadata that described an analysis and allowed for easier sharing and interpretation of results. For example, when ensuring the quality of data and results, analysts could refer to metadata representing analytic choices (e.g., to use one control sample rather than another) to consider the validity of those choices and determine the quality of the results. Thus, by developing tools and processes for capturing metadata at various stages of the DBTL loop, SD2 performers could share data, make like-for-like comparisons at each stage, and be confident that they understood what the data and results represented.
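A hedged sketch of the idea of recording analytic configuration as metadata alongside results; the analysis name, parameters, and paths below are hypothetical placeholders rather than SD2 conventions.

```python
# Illustrative only: store the configuration of an analysis next to its outputs so that
# downstream consumers can judge the validity of the analytic choices that produced them.
analysis_record = {
    "analysis": "flow_cytometry_gating",            # hypothetical analysis name
    "parameters": {
        "control_sample": "wildtype_no_inducer",    # e.g., which control was used
        "normalization": "median_fluorescence",
        "software_version": "0.4.2",
    },
    "inputs": ["path/to/raw/plate_03.fcs"],          # hypothetical placeholder paths
    "outputs": ["path/to/processed/gated_counts.csv"],
}
```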
The materials chemistry working group capitalized even further on metadata representations, eventually using them to refine their experimental search space. They used statistical analysis of metadata to identify anomalies associated with variations in laboratory conditions, which were in turn used as hypotheses for subsequent experiments.40 This type of “automated serendipity” enables flexible experimentation in which variations are permitted, documented, and analyzed for insights that inform model (or search space) refinement.
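One way to picture this “automated serendipity” is a simple screen over per-experiment metadata that flags runs whose recorded laboratory conditions deviate strongly from the norm; flagged runs become candidate hypotheses for follow-up. The field names and z-score threshold below are illustrative assumptions, not the statistical methods actually used in ref. 40.

```python
import statistics

def flag_condition_anomalies(runs, field, z_threshold=3.0):
    """Return runs whose metadata value for `field` lies far from the group mean.

    `runs` is a list of per-experiment metadata dicts, e.g. {"ambient_humidity": 41.2, ...}.
    """
    values = [r[field] for r in runs if field in r]
    if len(values) < 2:
        return []
    mean, stdev = statistics.mean(values), statistics.stdev(values)
    if stdev == 0:
        return []
    return [r for r in runs
            if field in r and abs(r[field] - mean) / stdev > z_threshold]
```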
In addition to developing metadata tools and processes, SD2 performers leveraged a centralized, performer-maintained infrastructure to facilitate collaborative science and engineering, and in particular, data sharing. This infrastructure provided access to a research computing ecosystem with tremendous capacity to host software and to store and process data at scale.41 Importantly, it provided access to resources to allow a technically diverse user-base (from novice, point-and-click users to expert software developers) to participate in collaborative science and engineering. For example, it served as a platform for hosting the previously described data integration and metadata representation tools, as well as a diverse set of tools for analyzing data. Performers could access a dataset, a reference file for tracking metadata, an array of tools for analyzing the data, and tutorials or documentation for using the analysis tools. It was the core of development activities and a space where performers could contribute to the standards, methods, and tools that advanced a flexible and extensible approach to collaborative science.
Critically, the SD2 program leveraged a central authority to develop and impose mandates for sharing data. This was a key programmatic feature whose consequences (namely, quick negotiation and adoption of processes for sharing data) can be contrasted with the consequences of lacking a central authority. Recall that the PDB was conceived by practitioners in 1971,42 but deposition was not mandated by journals and funding institutions until 1992.20 In contrast, SD2 performers developed and adopted standards and a formal agreement for data sharing relatively quickly within the program. The program then promoted these standards in outreach to publishers, funders, and military research laboratories. However, it is important to note that the program reached consensus on some standards only after experiencing significant challenges to sharing and operating on data. Early in the program, the data scientists who were responsible for discovery through application of novel analytics, and who relied on the data being FAIR (findable, accessible, interoperable, and reusable),27 were unable to apply their analytics across early datasets because those datasets were not necessarily interoperable or reusable. It took requests for better data “FAIRness” and quality to induce program performers to develop the processes and leverage the resources described in this section.
With standardized, shareable protocols, distributed labs can run the same experiment and generate replicate datasets. Collaborators who share an authoring and execution platform can leverage excess capacity for generating data in their partners' labs. Thus, if one lab is running experiments at full capacity, it can request to use spare capacity at a partner lab. In addition, distributed labs with complementary capabilities can collaborate to design and run experiments they would not be able to run alone. In SD2, this enabled execution of different types of experiments (for example, single-crystal versus thin-film synthesis of perovskite crystals) across laboratories to study common phenomena.40
With automation, collaborators can scale their workflows to generate more data more efficiently. Leveraging automation to enhance data generation can take at least two forms. It can be as simple as adding automation to support human operations at the benchtop, or it can represent a fundamental departure from the benchtop approach, by replacing human operators with machinery. The SD2 program introduced automation to support data generation in both ways: (1) we paired software applications with human technicians (the semi-automated lab) and (2) we paired software applications with machinery (the fully automated lab).
In the semi-automated SD2 lab, applications such as Aquarium support competent, reliable execution and documentation of protocols. They provide human lab technicians explicit instructions for executing every step of a protocol. Lab techs can explore information associated with each step, preview future steps, and review past steps, to enhance their understanding of the protocol. During execution, the application prompts the lab tech to indicate when a step is complete and whether the step was modified in any way. Thus, the lab tech can annotate a step, for example, to record observations, changes to steps, durations of processes, or tips for implementing techniques. After a lab tech has completed a step and recorded any necessary metadata, the application advances to the next step and provides associated instructions. Thus, the lab tech will complete the protocol stepwise, receiving prompts to ensure that the actual conditions of execution are documented at each step. This approach allows the lab tech to exercise discretion in executing the protocol, to modify steps when appropriate, and to document the modifications. Of course, allowing lab techs discretion over how to execute a protocol can be problematic, as it requires competency. Applications such as Aquarium address this concern in two ways. First, they reduce the baseline level of competency required by lab techs by including contextual details (e.g., tips for techniques) and alternative operations. Consequently, lab techs do not have to rely on their own, potentially limited, knowledge of scientific principles or experience at the bench to determine whether an operation should be executed as written. Second, in cases in which lab tech competency is deficient or unknown, protocol authors can disable the features of the application that allow lab techs to exercise discretion (e.g., suggestions of alternative materials or procedures). In these cases, the application simply guides the lab tech rigidly through the protocol. This can alleviate some of the cognitive demand experienced by lab techs, for example in evaluating options or executing complex actions, and free up resources to focus on the important aspects of the protocol, thus reducing errors.13
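A minimal sketch of the guided, step-by-step execution pattern described above, assuming a simple list-of-dicts protocol representation; the prompts and the annotation format are illustrative and are not Aquarium's actual interface.

```python
def run_guided_protocol(steps, allow_discretion=True):
    """Walk a technician through protocol steps one at a time, capturing execution metadata.

    `steps` is a list of dicts such as {"name": ..., "instructions": ..., "tips": ...}.
    """
    execution_log = []
    for i, step in enumerate(steps, start=1):
        print(f"Step {i}/{len(steps)}: {step['name']}")
        print(step["instructions"])
        if allow_discretion and step.get("tips"):
            print(f"Tips: {step['tips']}")                 # contextual support for the technician
        modified = False
        if allow_discretion:
            modified = input("Was this step modified? [y/N] ").strip().lower() == "y"
        note = input("Notes (observations, durations, changes): ").strip()
        execution_log.append({"step": step["name"], "modified": modified, "note": note})
    return execution_log  # becomes part of the experiment's metadata record
```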
Software applications that reduce the competence needed to execute experimental protocols should, however, be implemented strategically. Although they support efficient, reliable data generation, they represent an ambiguous good from an educator's perspective. On the one hand they may allow students or other workers with lower levels of training to participate in the research process; on the other hand, they reduce agency and the development of higher order skills associated with entry-level scientific apprenticeships. This tradeoff is not necessarily detrimental. As more scientific tasks become automated, the need for scientists to develop expertise relevant to formerly manual tasks decreases and is replaced by a need to develop expertise and knowledge relevant to automation. A discussion of the opportunities and training needs associated with automation in the chemical sciences can be found in a recent report by the US National Academy of Sciences.43 Incorporating automation technologies into pedagogical training is not inherently new,44,45 but recent efforts have focused on training students in the combination of experimental hardware and planning algorithms46,47 and on closely adjacent enabling technologies such as computer vision48,49 and speech recognition.50 In addition to producing a more technologically skilled workforce, this also provides an opportunity to create a more inclusive scientific workforce, as laboratory automation can remove barriers for students with visual or physical disabilities.51
In the fully automated lab, scientists design and submit protocols that can be read and executed by machines. In a lab with closed work cells, there is typically no human–machine interaction beyond uploading and selecting a protocol. The progress of an experiment is tracked through automated documentation of the movement of materials through a workflow. This is generally an efficient means of generating data, as it is not constrained by human resources (e.g., time, attention, availability). However, protocol flexibility and modification opportunities are limited, as changing hardware and software parameters can be resource intensive. Hence, automated workflows are ideal for protocols that have been pilot tested and vetted. This concern can be partially mitigated by an open, modular infrastructure in which workflows can move from module to module, or machine to machine, with humans transferring materials from one module to another. For example, one agent performs strain construction; a lab tech then delivers the product to another agent for incubation, and then to a third agent that draws and measures samples. Here, the progress of an experiment is tracked by human observers who can flexibly input metadata for each observation. In this modular workflow, small changes can be made at different points in a protocol (e.g., at strain construction or at sampling) if the downstream effects are acceptable and documented. A practical (but not insurmountable) technical challenge is developing automated modules that span the diverse range of activities (e.g., perturbation or materials handling) that are characteristic of biological and chemical protocols.
After the initial investment of writing and pilot testing a protocol, an automated, high-throughput approach offers significant gains in efficiency. Writing code to instruct an automated lab agent is more efficient than training a human lab tech to execute the same protocol. Moreover, an automated agent should, on average, offer greater reliability and availability than a human operator. Given a reliable source of power and regular maintenance, a machine's output should far surpass that of a human operator over increasingly longer intervals. Thus, optimized efficiency and maximized output are the primary benefits to this model. However, just as one must conduct a cost-benefit analysis before leveraging full automation, one must also conduct a cost-benefit analysis of exporting protocols to high-throughput labs. In SD2, scientists weighed the error rate of low-throughput, benchtop testing against the cost of that error rate amplified by high-throughput testing. In many instances, it was worth advancing a protocol to high-throughput testing; however, in some cases, it was justifiable to generate small datasets for emerging unvetted protocols or protocols that required extensive human operations. This was particularly the case when advancing a protocol to automated or high-throughput testing required extensive modifications either because the automated infrastructure lacked the machinery to perform a traditionally manual operation (e.g., mixing a sample by using a vortex device), thus requiring identification and description of a replacement procedure, or because the automated infrastructure had to be recoded to perform the required procedure. To guide decisions on whether to pursue high- versus low-throughput testing, performers developed a decision tree that prompted them to consider the capability needed to execute the protocol, how likely the protocol was to require modification, how much data were needed to answer the scientific question, and the cost of using each approach (e.g., benchtop vs. high throughput; see Fig. 3).
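The decision criteria described above can be pictured as simple branching logic. The sketch below is a hedged approximation; the thresholds, ordering, and parameter names are illustrative, and the program's actual decision tree (Fig. 3) is more detailed.

```python
def choose_execution_path(automated_lab_has_capability, protocol_likely_to_change,
                          samples_needed, cost_per_sample_benchtop,
                          cost_per_sample_automated, scale_threshold=1000):
    """Rough benchtop-vs-high-throughput decision logic (illustrative thresholds only)."""
    if not automated_lab_has_capability:
        return "benchtop"        # required manual operation cannot yet be automated
    if protocol_likely_to_change:
        return "benchtop"        # unvetted or volatile protocols stay low-throughput
    if samples_needed < scale_threshold:
        return "benchtop"        # dataset too small to repay automation overhead
    if cost_per_sample_automated <= cost_per_sample_benchtop:
        return "high_throughput"
    return "benchtop"

# Example: a vetted protocol needing 5000 samples at lower per-sample automated cost.
print(choose_execution_path(True, False, 5000, 2.50, 0.40))  # -> "high_throughput"
```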
By and large, the approach toward automation on SD2 (develop and deploy automation strategically and judiciously) is compatible with recent social critiques suggesting that aspirations toward full automation are suboptimal.52 Transforming certain types of laboratory tasks through automation creates a need for highly technical development and maintenance (requiring specialized expertise) and for precarious “piecework” to deal with errors in automation and interfaces between automated agents. Thus, full automation is likely to be unnecessary, potentially doomed to failure, and (if successful) exclusive of certain types of scientific investigation, thus landing scientists back in the confines of individualist science (albeit an automated individualist science). Instead, a better strategy may be to develop “islands of automation” in the laboratory, surrounded by manual tasks, thus enabling more flexible and inclusive science.
To address genetic-design challenge problems on SD2, experts in ML collaborated with experts in synthetic biology. In these collaborations, synthetic biologists explained the specific discovery objectives that could benefit from ML techniques, while ML experts explained the data requirements for training relevant ML algorithms. Together, they developed experimental designs and analytic workflows to comprehensively predict the outcome of an experiment, given data at scale. As a specific example, applying four inducers to an organism to stimulate a response requires data to be generated in five conditions (one control + 4 inducers) at the various time points of interest (e.g., log and stationary phase). At a sample level, this yields a total of 10 conditions for which “responses” can be observed, which is too few to train any ML model. By contrast, by scoping observations from the sample level to the level of the transcriptome (specifically, the response of a particular gene's expression given the features of that gene), the amount of data becomes the product of the organism's transcriptome size (∼4000 genes) and the number of conditions (10), which yields a total of ∼40,000 observations. When run across three replicates per condition, this yields a total of 120,000 observations. This is enough data to train a variety of models, given accurate identification of genetic features that can be linked to the gene's expression, such as its role in a network of genes. Using machine learning, the experimental conditions, and a vectorized representation of a gene's role in the network that represents the organism, SD2 performers were able to achieve greater than 90% accuracy in predicting whether a gene would be dysregulated, and an R² of ∼0.6 in quantifying the level of its dysregulation.53
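The scaling argument above, restated as a short calculation using the counts given in the text:

```python
inducers = 4
conditions = (inducers + 1) * 2          # 4 inducers + 1 control, at 2 time points: 10 conditions
genes = 4000                              # approximate transcriptome size
replicates = 3

sample_level_observations = conditions                      # 10: too few to train an ML model
gene_level_observations = genes * conditions                # ~40,000 observations
total_observations = gene_level_observations * replicates   # ~120,000 observations

print(sample_level_observations, gene_level_observations, total_observations)  # 10 40000 120000
```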
In the materials chemistry thrust, work focused on accelerating the Edisonian trial-and-error process by using data at scale. A variety of experiment planning algorithms were tested for their ability to support interpolation of results,54 extrapolation to new chemical systems,36 combination of model predictions to identify anomalies,55 as well as active-learning54 and active meta-learning approaches56 for crystal growth control. These activities culminated in a competition between algorithms developed in the different problem domains.57 Automation was necessary to accumulate the initial datasets needed for algorithm development and testing, as well as to define statistically significant performance baselines. However, it should be emphasized that many of these methods are applicable to manual experimentation now that these initial data exist.56
The SD2 community applied collaboratively developed tools and methods—for example, automated data generation, pre-processing, normalization, and analysis—to several scientific questions. Among other advances, these efforts led to faster and more accurate predictions of protein stability,58–60 faster discovery of perovskite crystals,54,55,61 and more accurate predictions of the impact of synthetic biological circuits on host organisms.62
This supports scalable, reproducible data generation and analysis by multiple groups. Protocol sharing and harmonization supports high-throughput experimentation. Data sharing at all levels supports experimental design conducive to AI/ML analytic approaches. Automated pipelines for pre-processing, normalization, and quality control increase the speed with which raw data become ready for analysis.30 Standardized data frames enable automated batch analysis of new data along with old data, substantially reducing the time and resources otherwise needed to complete an experimental cycle. In contrast to the individualist approach, the speed of discovery in a collaborative paradigm scales with the addition of new laboratories, because shared data accumulate more rapidly and accelerate all processes downstream of data production. At the end of the 4-year program, SD2 performers generated measurements of progress relative to an estimated pre-SD2 baseline.63 Analysts classified and normalized this diverse set of performer-generated, often domain-specific statistics. They reported that SD2 performers increased the number of constructs (e.g., perovskite crystals) designed or discovered by 12×; increased the complexity of model designs (e.g., genetic networks) by 20×; increased the speed of model design, build, test, or analysis by 10×; increased labor efficiency (e.g., output per individual) by nearly 4×; and increased the accuracy of model predictions (e.g., protein stability) by 3× (see Fig. 5).64 Critical to supporting rapid responses to emergent large-scale problems, the sociotechnical system developed in SD2 yielded notable gains in the speed of completing experimental cycles. Cycle speeds for designing proteins, engineering genetic components and networks, and synthesizing perovskite crystals increased dramatically (by 3–81×) over pre-SD2 rates.
In convening an STS, the enterprise will benefit from activities (for example, knowledge-capture activities) that help identify the details of collaborators' methods that are critical to supporting collaboration and discovery. In the same way that making implicit knowledge explicit is critical to sharing knowledge and enabling discovery at the level of the science itself, making methods and metamethods explicit is critical to sharing the knowledge (e.g., procedural, semantic, and institutional knowledge) that enables collaboration.
Throughout the collaboration, and especially early in the process, collaborators should be allowed to fail and encouraged to fail fast. Many professionals across industry and academia perceive failure as a negative outcome. This may obstruct their ability to view failure as an opportunity to improve or advance their work; consequently, rather than recognizing and embracing failure, they push back and persist in an unrecognized failure mode. SD2 saw both responses. Not surprisingly, when SD2 performers recognized failure, accepted it, and sought opportunities to improve, they experienced success much more rapidly.
Collaborators must be flexible about retaining versus omitting metadata processes. Metadata collection, representation, and availability ended up being hallmark characteristics of the SD2 STS. However, it was challenging to identify what metadata to collect and how to organize them, because the metadata requirements depended on how the data would be used. Metadata uses are not always discernible early in research, and they can rarely be anticipated for future research that will be conducted beyond the time and sociotechnical boundaries of the immediate STS. Moreover, some metadata requirements cannot be easily identified until a problem is discovered. Thus, flexibility is critical to managing metadata processes.
Regarding metadata and automation, if metadata processes (for example, collection and management) are candidates for automation, collaborators should consider automating them prior to running experiments. This may require temporal staggering of work: it may not be possible to generate data while data-management solutions are still being engineered, but staggering the work in this way can prevent confusion and wasted effort.
Regarding automation in general, collaborators should carefully consider the goals of introducing automation as a solution. It can enhance efficiency and free humans to engage in more fruitful activities, but it comes at a cost in set-up time, diagnosis, and flexibility. SD2 performers were initially focused on automating the full DBTL loop, which turned out to be too large and risky an undertaking for processes that were still evolving, as there was high potential for wasting time automating nascent, volatile processes. The balance performers struck was to automate processes that were well defined and did not demand high flexibility. This yielded speedy execution for a large subset of data-production processes. In addition, performers worked to identify the processes that required the most human effort and to automate those, which reduced processing time and human errors. Ultimately, automation should support scientific outcomes, and it should support rather than fully replace the work of relevant human operators.
Finally, to engage potential users of emergent methods, technologies, and standards, and to avoid simply creating a new silo of information, collaborators should conduct outreach activities with relevant stakeholders. SD2 performers did so to promote the use of their scientific approach and the standards and tools that they developed. For example, the program sponsored a meeting with funders and publishers of science to address issues of open science, such as the publication of datasets. SD2 performers also engaged stakeholders, such as U.S. military research laboratories, in nearly all of their program-wide meetings, in part to elicit guidance that could shape the products of the program. In addition, SD2 performers published their research methods and findings extensively, and many of the tools and datasets developed in SD2 were made public at https://github.com/SD2E.65 However, as with all DARPA programs, outreach activities were not formally funded after the program ended. Thus, continued outreach, training, and maintenance of the SD2 infrastructure, tools, and data will likely require the development of an entity (e.g., similar to U.S. Department of Energy laboratories or NIST) or a long-term program to fund research infrastructure. In the absence of such an entity or program, outreach beyond the lifespan of a government-funded program is left in the hands of individuals and groups who are invested in transitioning program products into adjacent research and development spaces.
Although the work of SD2 was domain-specific, the collaborative paradigm it adopted can be generalized to support multidisciplinary work across several domains in the life, physical, and social sciences. The SD2 sociotechnical system can serve as a model for developing similar systems that support a national infrastructure that is equipped and ready to respond to emergent, high-dimensional problems that require diverse resources and capabilities to generate and analyze data at scale. Retaining the status quo means failing to optimize resources, capabilities, and output at every stage of the scientific method, and hence missing opportunities to maximize discovery.
The discoveries that will advance science in profound ways will be made possible by collaborative, multidisciplinary efforts. These efforts require practices and incentives for sharing methods and data, and for identifying and leveraging complementary capabilities. Such practices will allow for the efficient generation and analysis of quality data at scale, and that, in turn, will lead to discovery.