Sabrina Hamla *a, Pierre-Yves Sacré a, Allison Derenne b, Kheiro-Mouna Derfoufi b, Ben Cowper c, Claire I. Butré d, Arnaud Delobel d, Erik Goormaghtigh b, Philippe Hubert a and Eric Ziemons a
aUniversity of Liege (ULiege), CIRM, Vibra-Sante Hub, Department of Pharmacy, Laboratory of Pharmaceutical Analytical Chemistry, Liege, Belgium. E-mail: sabrina.hamla@uliege.be; Tel: +3243664316
bCenter for Structural Biology and Bioinformatics, Laboratory for the Structure and Function of Biological Membranes, ULB, Campus Plaine CP206/02, 1050 Brussels, Belgium
cNational Institute for Biological Standards and Control, Blanche Lane, South Mimms, Potters Bar, Hertfordshire EN6 3QG, UK
dQuality Assistance, Techno Parc de Thudinie 2, 6536 Thuin, Belgium
First published on 17th February 2022
Almost 60% of commercialized pharmaceutical proteins are glycosylated. Glycosylation is considered a critical quality attribute, as it affects the stability, bioactivity and safety of proteins. Hence, the development of analytical methods to characterise the composition and structure of glycoproteins is crucial. Existing methods are time-consuming, expensive, and require significant sample preparation steps, which can compromise the robustness of the analyses. In this work, we suggest the use of fast, direct, and simple Fourier transform infrared (FT-IR) spectroscopy combined with a chemometric strategy to address this challenge. In this context, a database of FT-IR spectra of glycoproteins was built, and the glycoproteins were characterised by reference methods (MALDI-TOF, LC-ESI-QTOF and LC-FLR-MS) to estimate the mass ratio between carbohydrates and proteins and to determine the monosaccharide composition. The FT-IR spectra were processed first by Partial Least Squares Regression (PLSR), one of the most used regression algorithms in spectroscopy, and secondly by Support Vector Regression (SVR). SVR has emerged in recent years and is now considered a powerful alternative to PLSR, thanks to its ability to flexibly model nonlinear relationships. The results provide clear evidence of the efficiency of combining FT-IR spectroscopy and SVR modelling to characterise glycosylation in therapeutic proteins. The SVR models showed better predictive performances than the PLSR models in terms of RMSECV, RMSEP, R2CV, R2Pred and RPD. This tool offers several potential applications, such as comparing the glycosylation of a biosimilar and the original molecule, monitoring batch-to-batch homogeneity, and in-process control.
Glycosylation is considered a critical quality attribute because it impacts the stability, pharmacokinetics and pharmacodynamics (PK/PD), and the safety (immunogenicity) of the product.5 It is a complex process involving hundreds of enzymes specific to each cell. This post-translational modification (PTM) is influenced by many parameters during the manufacturing process, including the host system type (mammalian cells, yeast strains, plant cells, insect cells or genetically modified animals) and the environmental conditions of culture (bioreactor type, culture media and process parameters).6–8 Even though these parameters are controlled during production, the process generally generates micro- and macro-heterogeneity in the glycosylation of proteins. Micro-heterogeneity corresponds to heterogeneity in the structures of the glycans occupying a particular glycosylation site, whereas macro-heterogeneity can be defined as the difference in the frequency of occupation of the glycosylation sites in a glycoprotein batch.1,8 These differences significantly affect the protein glycosylation profile, and therefore the quality and safety of the final therapeutic product. Hence, the regulatory authorities require the systematic characterisation of the composition and structure of glycoproteins throughout the drug development and manufacturing processes to ensure the quality and consistency of the final drug product.9,10
Current glycosylation analysis methods are complex and time-consuming, with long sample preparation protocols requiring a minimum of three steps: glycan release, labelling and glycan purification. Each step can suffer from several potential sources of error.11 Therefore, the efficiency is not always consistent and can lead to variations in the results from one laboratory to another.7 To overcome these limitations, it is proposed to use a vibrational spectroscopy technique, FT-IR spectroscopy. This technique has already been used to analyse biomolecules for a wide range of applications. For example, it is a tool of choice for the structural characterisation of proteins.12–15 It has also been used to monitor the carbohydrate content, notably in algae cultures.16 The study by Khajehpour et al. demonstrated the correlation between the intensity of the spectral bands between 1200 and 1000 cm−1 and the amount of carbohydrates in glycoproteins. These absorption bands are mainly due to the stretching of the C–C and C–O bonds of the carbohydrate skeleton.17 Moreover, a recent study by Derenne et al.5 showed that the use of FT-IR for the analysis of glycosylation profiles presents several advantages. Firstly, it is a simple, fast, and non-destructive tool. Secondly, FT-IR spectra of glycoproteins provide an averaged but accurate fingerprint of the glycosylation profile. This signature reflects not only important differences, such as the presence or absence of certain monosaccharides, but also small modifications of the global content of glycans or monosaccharides.
The objective of the present work is twofold. First, the study aims to demonstrate the use of FT-IR spectra to analyse the total quantity of carbohydrates present in various types of glycoproteins. Second, the study intends to extract from FT-IR spectra the relative amount of each major monosaccharide (mannose, N-acetylglucosamine, galactose, fucose and sialic acid). Therefore, several regression models were calibrated to assess the global rate of glycosylation and determine each monosaccharide's relative quantity. In this work, the global rate of glycosylation is defined as the ratio between the mass of glycans and the total mass of the protein. To build such predictive models, infrared spectra of reference samples were measured. Two databases of FT-IR spectra were constituted: the first one included 18 proteins and was used to calibrate the model for the global rate of glycosylation. The second one gathered 32 proteins and was exploited to calibrate the models used to predict the relative quantity of each monosaccharide. The global rate of glycosylation and the composition in monosaccharides were determined using reference MALDI and LC-FLR-MS methods respectively.
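The definition of the global rate of glycosylation can be illustrated with a short sketch. It assumes, as a simplification that the text does not state explicitly, that the glycan mass is approximated by the difference between the measured intact mass and the theoretical mass of the amino-acid sequence. The function name and the example masses are hypothetical.

```python
def global_glycosylation_rate(intact_mass_da: float, sequence_mass_da: float) -> float:
    """Return the glycan mass as a percentage (w/w) of the intact glycoprotein mass.

    Assumption: glycan mass = intact mass - theoretical sequence mass.
    Negative differences (measurement scatter) are clipped to zero.
    """
    if intact_mass_da <= 0:
        raise ValueError("intact mass must be positive")
    glycan_mass_da = max(intact_mass_da - sequence_mass_da, 0.0)
    return 100.0 * glycan_mass_da / intact_mass_da


# Hypothetical glycoprotein: 45 kDa sequence, 50 kDa intact mass
rate = global_glycosylation_rate(50_000.0, 45_000.0)
print(f"{rate:.2f} % (w/w)")
```

Under this assumption, a non-glycosylated protein whose intact mass does not exceed its sequence mass is assigned a rate of 0% (w/w).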
Two regression algorithms were considered and compared to correlate the spectral information and the quantitative reference values. These methods differ in their ability to model complex, potentially nonlinear, relationships. Partial Least Squares Regression (PLSR), one of the most used regression models for analysing spectroscopic data, was applied first. PLSR, whose main strengths are its simplicity and interpretability,18 is often used to process high dimensional data.19 It assumes a linear relationship between the parameter to be predicted and the intensity of the spectral absorption bands. Therefore, it is appropriate for the analysis of chemical processes that follow the Beer–Lambert law.19–21 PLSR was applied to manage highly correlated and possibly noisy predictor variables. It is based on a dimension reduction process that deals with the strong collinearities of the spectral data.22 In most cases, PLSR methods are able to cope with low degrees of non-linearity by including additional latent variables. However, spectra can contain highly nonlinear effects for various reasons, such as differences in viscosity, temperature, or chemical or biological composition of the sample matrix.19 In such cases, PLSR may not predict the parameter of interest well enough. The Support Vector Regression (SVR) method has emerged as a powerful alternative to PLSR thanks to its many attractive features. First, it can find nonlinear global solutions and thus properly model complex nonlinear relationships while selecting only the samples representative of the problem, which are then called support vectors. Second, this method gives high prediction accuracy18,19 and is less sensitive to spectral noise. Moreover, SVR models are robust and can handle possible spectral variations due to nonlinear interference.23
- Infliximab (Remicade® 100 mg, Janssen Biologics)
- Cetuximab (Erbitux® 5 mg mL−1, Merck KGaA)
- Nivolumab (Opdivo® 10 mg mL−1, Bristol Myers Squibb)
- Aflibercept (Zaltrap® 25 mg mL−1, Sanofi-Aventis)
The following 13 other glycoproteins were purchased from Sigma-Aldrich (Merck):
- Alpha1-acid glycoprotein
- Alpha-crystallin
- Apo-transferrin
- Carboxypeptidase Y
- Conalbumin
- Avidin
- Fetuin
- Lactoferrin
- Lectin from Glycine max
- Lectin from Phaseolus vulgaris
- Lectin from Maackia amurensis
- Peroxidase
- Ribonuclease B
Finally, the last protein called PPp was provided by Xpress Biologics (Liege, Belgium). It was produced in Pichia pastoris and was received at a concentration of 25.97 mg mL−1 in PBS buffer (pH 7.2). The proteins were dissolved in 0.9% NaCl at 10 mg mL−1.
Size exclusion spin columns were used to remove residual salts and excipients present in the formulations of the therapeutic proteins. These excipients (mannitol, Tween 80, Tween 20, trehalose, sodium citrate, sodium acetate, etc.) can indeed interfere with FT-IR measurements in the spectral region between 1200 and 950 cm−1 because the vibrational frequencies of their chemical bonds are close to those of carbohydrates, as shown in Fig. S1-A and S1-B.† Therefore, as explained in the work of Derenne et al.,5 a buffer exchange step with 0.9% NaCl was performed for the FT-IR measurements using Micro Bio-Spin™ P-6 Gel columns (Tris buffer, sample volume 10–75 μL, 6000 Da MW limit). For the MALDI measurements, the samples were prepared using the same method, with (NH4)2CO3 for the buffer exchange.
The 27 proteins and their producers are as follows:
- Adalimumab* (Humira® 40 mg, Abbvie)
- Aflibercept* (Zaltrap® 25 mg mL−1, Sanofi-Aventis)
- Avelumab* (Bavencio® 20 mg mL−1, Merck)
- Bevacizumab* (Avastin® 25 mg mL−1, Roche Pharma)
- Cetuximab* (Erbitux® 5 mg mL−1, Merck KGaA)
- Daratumumab (Darzalex® 20 mg mL−1, Janssen Biologics)
- Durvalumab (Imfinzi® 50 mg mL−1, AstraZeneca)
- Golimumab (Simponi® 100 mg, Janssen Biologics)
- Infliximab (Remicade® 100 mg, Janssen Biologics)
- Ipilimumab* (Yervoy® 5 mg mL−1, Bristol Myers Squibb)
- Natalizumab (Tysabri® 300 mg, Biogen)
- Nivolumab* (Opdivo® 10 mg mL−1, Bristol Myers Squibb)
- Ocrelizumab (Ocrevus® 300 mg, Roche)
- Omalizumab (Xolair® 150 mg mL−1, Novartis)
- Panitumumab* (Vectibix® 25 mg mL−1, Amgen)
- Pembrolizumab* (Keytruda® 50 mg, Merck)
- Pertuzumab* (Perjeta® 30 mg mL−1, Roche Pharma)
- Ramucirumab* (Cyramza® 10 mg mL−1, Eli Lilly)
- Rituximab* (Mabthera® 10 mg mL−1, Roche Pharma)
- Secukinumab (Cosentyx® 150 mg mL−1, Novartis)
- Tocilizumab* (Roactemra® 20 mg mL−1, Roche)
- Trastuzumab* (Herceptin® 150 mg, Roche Pharma)
- Trastuzumab emtansine (Kadcyla® 160 mg, Roche Pharma)
- Vedolizumab (Entyvio® 60 mg mL−1, Takeda)
- Biosimilar of Rituximab* (Truxima® 10 mg mL−1, Celltrion)
- Biosimilar of Infliximab (Remsima® 100 mg, Celltrion)
- Biosimilar of Infliximab (Inflectra® 100 mg, Hospira)
The following glycoproteins were purchased from Sigma-Aldrich (Merck):
- Alpha1-acid glycoprotein
- Etanercept (European Pharmacopoeia Reference Standard)
- Avidin
- IgG1 Kappa from Human Myeloma
- Ribonuclease B
Salts and excipients were removed from all samples using size exclusion spin columns, following the method used previously for the analysis of the global rate of glycosylation.
- Sialic acid
- Galactose
- Fucose
- N-Acetylglucosamine
- D-Mannose*
Glycans (Fig. S3†) were purchased from Dextra at 1 mg mL−1:
- FA2, FA2G1, FA2G2
- M5, M6, M7
Alpha crystallin, PPp and the three lectins were diluted to 0.1 mg mL−1 in 0.1% formic acid. All the other proteins were diluted to 0.2 mg mL−1 with 0.1% formic acid. The samples were loaded on the target using the dried droplet method with the addition of 0.5 μL sample followed by 0.5 μL of the matrix (10 mg mL−1 sinapinic acid in H2O/ACN/FA 70/30/0.1). The target was dried under vacuum.
Eluent A was 0.1% formic acid (FA) in H2O, and eluent B was 0.1% FA in acetonitrile. The elution profile was as follows: 0–2 min, isocratic at 5% B; 2–3 min, linear gradient to 20% B; 3–13 min, linear gradient from 20% B to 90% B; 13–15 min, isocratic at 90% B; 15–17 min, linear gradient to 10% B; 17–19 min, linear gradient to 90% B; 19–21 min, linear gradient to 5% B, followed by 2 min isocratic at 5% B. The flow rate was 0.3 mL min−1. Lactoferrin and Ribonuclease B were diluted to 0.5 mg mL−1 in eluent A. 2 μL of each sample were injected into the column, which was thermostated at 80 °C. The samples were kept at 5 °C and the detection was performed at a wavelength of 280 nm using a UV detector.
The mass spectra were acquired on the m/z range of 400 to 5000 in positive ion mode. The capillary voltage was set at 2.5 kV, the sample cone at 120 V and the source operated at 100 °C. Nitrogen was used as desolvation gas (500 °C, 800 L h−1) and cone gas at 100 L h−1. An on-line mass correction was applied using Leucine Enkephalin. Molecular mass was calculated by deconvoluting the mass spectra using MaxEnt1 algorithm.
Furthermore, a background was recorded before the start of the measurements and prior to each new sample. Regarding the global rate of glycosylation analysis, for each glycoprotein, 6 spectra were recorded with 6 distinct deposits. As for the analysis of the composition of monosaccharides, three independent samples were prepared for each glycoprotein on three distinct days to obtain triplicate measurements. In total, 18 deposits were made, resulting in the measurement of 18 spectra. Therefore, for the analysis of the global rate of glycosylation, a total of 108 spectra were acquired (18 glycoproteins × 6 spectra), and for the analysis of the composition of monosaccharides, a total of 846 spectra were acquired (one batch for 17 glycoproteins and two batches for 15 glycoproteins: 47 samples × 18 spectra). Finally, for the investigation of the cause of non-linearity, the glycans (FA2, FA2G1, and the high mannose: Man-1, Man-3, Man-5, Man-6, Man-8) were studied. In this context, 6 spectra (6 distinct deposits) were recorded for each sample.
The reference water vapour spectrum was obtained, in the absence of sample, as the difference between spectra recorded before and after purging the sample compartment with dry air. The reference water vapour spectrum was thus recorded under the same conditions as the sample spectra and was subtracted from all FT-IR spectra, with 1956–1935 cm−1 as reference peak.24,25
The preprocessing retained for the model of the global rate of glycosylation is the Savitzky–Golay 1st derivative (polynomial order: 2, window size: 15) to improve the signal/noise ratio and the standard normal variate (SNV) to reduce the effect of the variation of the signal linked to the quantity of samples deposited on the crystal and the impact of sample drying on the crystal.
The FT-IR spectra used to build the monosaccharide model were preprocessed by the Savitzky–Golay 2nd derivative (polynomial order: 3, window size: 15) and SNV.
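The two preprocessing pipelines described above can be sketched as follows. This is a minimal NumPy illustration, not the software used in the study: the function names are hypothetical, the derivative is expressed per data point rather than per cm−1, the half-window at each edge is simply discarded, and the synthetic Gaussian band is a placeholder for real spectra. Library routines such as SciPy's `savgol_filter` are a common alternative to the hand-written filter.

```python
from math import factorial

import numpy as np


def savgol_coefficients(window: int, polyorder: int, deriv: int) -> np.ndarray:
    """Least-squares weights giving the `deriv`-th derivative at the window centre."""
    half = window // 2
    offsets = np.arange(-half, half + 1, dtype=float)
    design = np.vander(offsets, polyorder + 1, increasing=True)  # offset**0 .. offset**p
    # Row `deriv` of the pseudo-inverse estimates polynomial coefficient a_deriv;
    # the derivative at the centre equals deriv! * a_deriv.
    return factorial(deriv) * np.linalg.pinv(design)[deriv]


def savgol_derivative(spectra, window=15, polyorder=2, deriv=1):
    """Filter each spectrum (row) and drop the unreliable half-window at both edges."""
    spectra = np.atleast_2d(np.asarray(spectra, dtype=float))
    coeffs = savgol_coefficients(window, polyorder, deriv)
    half = window // 2
    filtered = np.array([np.convolve(row, coeffs[::-1], mode="same") for row in spectra])
    return filtered[:, half:-half]


def snv(spectra):
    """Standard normal variate: centre and scale each spectrum individually."""
    spectra = np.atleast_2d(np.asarray(spectra, dtype=float))
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std


# Placeholder data: one band recorded with three different deposited amounts
wavenumbers = np.linspace(1200.0, 950.0, 251)
band = np.exp(-0.5 * ((wavenumbers - 1080.0) / 25.0) ** 2)
spectra = np.array([scale * band for scale in (0.5, 1.0, 2.0)])

# Global rate model: 1st derivative (order 2, window 15) followed by SNV
global_rate_input = snv(savgol_derivative(spectra, window=15, polyorder=2, deriv=1))
# Monosaccharide models: 2nd derivative (order 3, window 15) followed by SNV
monosaccharide_input = snv(savgol_derivative(spectra, window=15, polyorder=3, deriv=2))
```

In this toy example the three spectra differ only by a multiplicative factor, so they become identical after SNV, which mirrors the stated purpose of SNV: reducing the effect of the quantity of sample deposited on the crystal.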
All instrumental techniques, including FT-IR spectroscopy, are affected by noise. In our case, noise is considered an additional spectral perturbation that is not related to the chemical nature of the sample26 but rather to the parameters of the spectrophotometer, such as the number of scans and the spectral resolution.27 It should be noted that a low sample concentration generates a weak signal, which is more strongly affected by noise. In this study, the noise was defined as the standard deviation in the spectral region from 2000 to 1900 cm−1 (where there is no absorption band of biological origin). The signal was defined as the maximum intensity between 1180 and 965 cm−1 in the spectrum after subtraction of a baseline drawn through these two points. The signal-to-noise ratio was calculated for each IR spectrum to assess its spectral quality. All IR spectra recorded with a signal-to-noise ratio of less than 60 were eliminated. In addition, spectral smoothing was applied to the remaining IR spectra to further reduce the noise level. It was carried out by apodizing the Fourier transform using a Gaussian function with a final resolution of 4 cm−1.
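The signal-to-noise criterion defined above can be sketched in a few lines. The function name, the use of the sample standard deviation, and the synthetic spectrum are assumptions made for illustration; only the spectral windows and the threshold of 60 come from the text.

```python
import numpy as np


def signal_to_noise(wavenumbers, absorbance,
                    signal_range=(1180.0, 965.0), noise_range=(2000.0, 1900.0)):
    """Signal: highest point above a straight baseline drawn between the two
    ends of `signal_range`. Noise: standard deviation inside `noise_range`."""
    wn = np.asarray(wavenumbers, dtype=float)
    ab = np.asarray(absorbance, dtype=float)

    def window(limits):
        low, high = min(limits), max(limits)
        mask = (wn >= low) & (wn <= high)
        order = np.argsort(wn[mask])          # ascending wavenumber for np.interp
        return wn[mask][order], ab[mask][order]

    _, noise_values = window(noise_range)
    noise = noise_values.std(ddof=1)

    wn_sig, ab_sig = window(signal_range)
    baseline = np.interp(wn_sig, [wn_sig[0], wn_sig[-1]], [ab_sig[0], ab_sig[-1]])
    signal = (ab_sig - baseline).max()
    return signal / noise


# Synthetic spectrum: one carbohydrate-like band plus a small alternating perturbation
wavenumbers = np.arange(2000.0, 900.0, -1.0)
absorbance = np.exp(-0.5 * ((wavenumbers - 1080.0) / 30.0) ** 2)
absorbance = absorbance + 0.001 * (-1.0) ** np.arange(wavenumbers.size)

snr = signal_to_noise(wavenumbers, absorbance)
keep_spectrum = snr >= 60          # spectra below the threshold are discarded
print(f"S/N = {snr:.0f}, kept: {keep_spectrum}")
```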
Principal component analysis (PCA) was performed on the spectra of each glycoprotein to detect outliers. The first principal component (PC) is the direction that minimizes the total distance between the data and their projection onto it, which is equivalent to maximizing the variance of the projected points. An 85% Hotelling's T2 confidence ellipse was built around the mean position of each glycoprotein in the PC1–PC2 score plot. For each glycoprotein, the FT-IR spectra falling outside the confidence ellipse were considered outliers. The PCA was performed on both the training set and the test set (Table S4†). Future samples will be projected onto the defined PCA model to ensure that they share the same variance structure as the calibration data set. This verification is performed by examining the orthogonal and score distances of the new samples to the pre-defined PCA space. A new spectrum exhibiting orthogonal and score distances higher than the 85% confidence limit will be considered an outlier.
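A minimal sketch of this outlier screening is given below. The text does not specify how the 85% limit was computed, so the sketch assumes the large-sample chi-square approximation for two components, for which the critical value has the closed form −2 ln(1 − p). The function name and the synthetic replicate spectra are hypothetical.

```python
import numpy as np


def hotelling_t2_outliers(spectra, confidence=0.85):
    """Flag spectra outside the Hotelling's T2 ellipse in the PC1-PC2 score plane.

    Assumption: the limit is the chi-square quantile with 2 degrees of freedom,
    -2 * ln(1 - confidence), used as a large-sample approximation.
    """
    X = np.asarray(spectra, dtype=float)
    centred = X - X.mean(axis=0)
    # PCA via singular value decomposition of the mean-centred spectra
    U, s, _ = np.linalg.svd(centred, full_matrices=False)
    scores = U[:, :2] * s[:2]                       # PC1 and PC2 scores
    score_variance = scores.var(axis=0, ddof=1)
    t2 = np.sum(scores ** 2 / score_variance, axis=1)
    limit = -2.0 * np.log(1.0 - confidence)
    return t2, limit, t2 > limit


# 18 synthetic replicate spectra of one glycoprotein plus one deviant spectrum
rng = np.random.default_rng(0)
replicates = 1.0 + 0.01 * rng.normal(size=(18, 50))
deviant = replicates[:1] + 1.0
t2, limit, flagged = hotelling_t2_outliers(np.vstack([replicates, deviant]))
print("T2 limit:", round(limit, 3), "| flagged indices:", np.flatnonzero(flagged))
```

With an 85% limit, roughly 15% of perfectly normal spectra are expected to fall outside the ellipse by construction, so the criterion is deliberately strict.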
In the last step, the models were evaluated by external validation, by projecting an independent test set. These models were evaluated using the Root Mean Square Error of Prediction (RMSEP) and R2 of prediction.
Accordingly, the performances of the PLSR and SVR models were compared on the basis of the calibration criteria (RMSECV and R2CV) and the external validation criteria (RMSEP and R2 of prediction). Low values of RMSECV and RMSEP indicate high accuracy, and a high value of R2 indicates that the model correctly handles the spectral variability and is therefore able to accurately estimate the concentration. Additionally, the ratio of performance to deviation (RPD) was calculated by dividing the standard deviation (SD) of the reference values of the samples in the validation set by the standard error of prediction (SEP).36,37
The Ratio of Performance to Deviation (RPD) corresponds to the factor by which the prediction accuracy improves relative to simply using the mean composition. Ideally, for a well-calibrated model, the ratio is greater than two.37,38 Three categories of RPD were used to interpret the model's reliability: RPD > 2, excellent model; 1.4 < RPD < 2, fair model; and RPD < 1.4, non-reliable model.
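The evaluation criteria can be written out explicitly. The sketch below is illustrative: the function names are hypothetical, the handling of the category boundaries (exactly 1.4 or 2) is an assumption since the text leaves them open, and the RMSEP is used as a stand-in for the SEP, which is reasonable only when the prediction bias is negligible. The numerical inputs are the SD and RMSEP values reported for the global rate of glycosylation model.

```python
import math


def rmse(measured, predicted):
    """Root mean square error between reference and predicted values."""
    squared = [(m - p) ** 2 for m, p in zip(measured, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(measured, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot


def rpd(sd_reference, sep):
    """Ratio of performance to deviation: SD of reference values / SEP."""
    return sd_reference / sep


def classify_rpd(value):
    """Categories from the text; boundary values are assigned to 'fair' (assumption)."""
    if value > 2.0:
        return "excellent"
    if value >= 1.4:
        return "fair"
    return "non-reliable"


# Global rate of glycosylation model: SD = 6.57, RMSEP = 0.64 (SVR) and 4.05 (PLSR)
for name, rmsep in (("SVR", 0.64), ("PLSR", 4.05)):
    ratio = rpd(6.57, rmsep)
    print(f"{name}: RPD = {ratio:.2f} -> {classify_rpd(ratio)}")
```

With these inputs the ratios come out at about 10.3 for SVR and 1.6 for PLSR, in line with the RPD values reported for the two models.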
Proteins | Theoretical mass of the sequence (Da) | Intact mass (Da) | Global rate of glycosylation (% (w/w)) | Measurement method |
---|---|---|---|---|
Aflibercept | 96… | 114… | 15.43 | MALDI-TOF |
Alpha1-acid glycoprotein | 21… | 36… | 41.29 | MALDI-TOF |
Alpha-crystallin | 19… | 19… | 0.63 | MALDI-TOF |
Apo-transferrin | 75… | 80… | 6.05 | MALDI-TOF |
Avidin | 14… | 15… | 10.10 | MALDI-TOF |
Carboxypeptidase Y | 47… | 57… | 17.78 | MALDI-TOF |
Cetuximab | 145… | 153… | 5.44 | MALDI-TOF |
Conalbumin | 75… | 77… | 2.58 | MALDI-TOF |
PPp | 7700 | 7686 | 0 | MALDI-TOF |
Fetuin | 36… | 46… | 21.71 | MALDI-TOF |
Infliximab | 145… | 150… | 3.13 | MALDI-TOF |
Lactoferrin | 76… | 81… | 6.64 | ESI-QTOF |
Lectin (Glycine max) | 27… | 29… | 6.36 | MALDI-TOF |
Lectin (Maackia amurensis) | 27… | 31… | 15.44 | MALDI-TOF |
Lectin (Phaseolus vulgaris) | 27… | 29… | 8.55 | MALDI-TOF |
Nivolumab | 143… | 147… | 2.31 | MALDI-TOF |
Peroxidase | 33… | 43… | 22.56 | MALDI-TOF |
Ribonuclease B | 13… | 15… | 8.96 | ESI-QTOF |
Fig. 1 shows all the recorded spectra, Fig. 2 shows the results of data modelling for the prediction of the global rate of glycosylation, and Table 2 lists the performance criteria of both the PLSR and SVR models. These results confirm that the global rate of glycosylation can be measured by FT-IR. The SVR model performs better than PLSR in cross-validation (R2CV 0.99 vs. 0.97; RMSECV 0.38 vs. 2.20) and in external validation (R2Pred 0.99 vs. 0.89; RMSEP 0.64 vs. 4.05) over a calibration range from 2 to 41% (w/w).
Step | Criterion | PLSR | SVR |
---|---|---|---|
Calibration | Number of latent variables | 7 | |
Calibration | Number of support vectors | | 113 |
Calibration | R2Cal | 0.98 | 1.00 |
Calibration | R2CV | 0.96 | 0.99 |
Calibration | RMSEC (% (w/w)) | 1.70 | 0.17 |
Calibration | RMSECV (% (w/w)) | 2.20 | 0.38 |
Validation | R2Pred | 0.84 | 0.99 |
Validation | RMSEP (% (w/w)) | 4.05 | 0.64 |
The optimal SVR parameters selected by cross-validation were C = 10, ε = 0.0001 and γ = 0.01. A small ε leads to a narrow margin, so a large number of support vectors were selected in the model. In this study, SVR was considered an adequate method for the prediction of the global rate of glycosylation, as it handles the complex relationship between FT-IR spectra and the global rate of glycosylation better than PLSR. This is possible because SVR fits the data so that the errors of as many calibration samples as possible fall within a given threshold (±ε). The possibly nonlinear problem is transformed into a linear one by a kernel mapping, here the Gaussian radial basis function (RBF) kernel. The RBF kernel is typically applied to strongly nonlinear, complex systems or when there is no prior knowledge of the data set. Finally, the RPD of 10.26 for the SVR model, compared to 1.62 for the PLSR model (SD = 6.57), confirms that the model of the global rate of glycosylation built with SVR presents good accuracy.
Fig. 3 shows the measured spectra. The quantitative models predicting the amount of the monosaccharides (mannose, N-acetylglucosamine, galactose, fucose, sialic acid) were built based on the FT-IR spectra of these compounds. In addition, a signal-to-noise filter, as described previously, was applied to improve the performance of the models. This methodology allowed us to overcome possible spectral variability (Fig. S5†). As a result, 233 spectra were removed, and 613 spectra were retained to build the regression models in the spectral range between 1200 and 900 cm−1. These spectra are shown in Fig. S4.† 354 spectra (58%) were used as the calibration set and 259 spectra (42%) as the test set. The distribution of the samples for calibration and validation of this model is given in Table S4.† Table 3 presents the performances of both PLSR and SVR models in predicting the amount of each monosaccharide. Fig. 4 and 5 show the results of data modelling by the PLSR model and the SVR model, respectively. These results show that the SVR models have low errors in calibration (RMSEC), cross-validation (RMSECV) and external validation (RMSEP). The SVR models also have high R2CV and R2Pred values, indicating that they captured most of the correlation between the spectral data and the quantitative values.
Step | Criterion | Mannose PLSR | Mannose SVR | N-Acetylglucosamine PLSR | N-Acetylglucosamine SVR | Galactose PLSR | Galactose SVR | Fucose PLSR | Fucose SVR | Sialic acid PLSR | Sialic acid SVR |
---|---|---|---|---|---|---|---|---|---|---|---|
Calibration | Number of latent variables | 6 | | 5 | | 7 | | 5 | | 3 | |
Calibration | Number of support vectors | | 262 | | 263 | | 308 | | 247 | | 191 |
Calibration | R2Cal | 0.98 | 1.00 | 0.98 | 1.00 | 0.94 | 1.00 | 0.98 | 1.00 | 0.98 | 1.00 |
Calibration | R2CV | 0.98 | 0.99 | 0.98 | 0.99 | 0.94 | 0.99 | 0.98 | 0.99 | 0.98 | 1.00 |
Calibration | RMSEC (% (w/w)) | 0.94 | 0.13 | 0.81 | 0.08 | 1.11 | 0.08 | 0.31 | 0.03 | 0.90 | 0.05 |
Calibration | RMSECV (% (w/w)) | 0.99 | 0.40 | 0.83 | 0.27 | 1.17 | 0.46 | 0.33 | 0.09 | 0.91 | 0.14 |
Validation | R2Pred | 0.85 | 0.84 | 0.91 | 0.93 | 0.80 | 0.90 | 0.91 | 0.97 | 0.98 | 0.99 |
Validation | RMSEP (% (w/w)) | 1.91 | 0.84 | 1.34 | 1.09 | 2.31 | 1.27 | 0.47 | 0.29 | 0.99 | 0.55 |
Validation | SD | 2.10 | 2.10 | 3.87 | 3.87 | 3.47 | 3.47 | 1.64 | 1.64 | 4.29 | 4.29 |
Validation | RPD | 1.10 | 2.51 | 2.88 | 3.56 | 1.50 | 2.73 | 3.47 | 5.58 | 4.33 | 7.77 |
Fig. 4 Measured versus predicted amounts of monosaccharides obtained by PLS regression between 1200 and 950 cm−1 for the analysis of the amount of each monosaccharide. ◆ Test set ● Training set.
Fig. 5 Measured versus predicted amounts of monosaccharides obtained by SVR regression between 1200 and 950 cm−1 to analyse the quantity of each monosaccharide. ◆ Test set ● Training set.
Optimized SVR parameters are presented in Table S5.† The Ratio of Performance to Deviation (RPD) of each PLSR and SVR model was also calculated and is presented in Table 3. The SVR models all reach RPD values greater than two, indicating good performance. Moreover, for both PLSR and SVR, the influence of the composition of the test and training sets was evaluated through the construction of two additional models. These models were built by varying the distribution of proteins between the test and training sets, as shown in Tables S6 and S7.† The analytical performances of each respective model are shown in Tables S8 and S9.† Here again, the SVR models have low errors in calibration (RMSEC), cross-validation (RMSECV) and external validation (RMSEP), together with high R2CV and R2Pred values.
Table S10† shows the Ratio of Performance to Deviation calculated for a first distribution of samples between the test and training sets (RPD1) and for a second distribution (RPD2). In both cases, the SVR models exhibit good RPD values (greater than 2), which confirms the robustness of these models. It emerges from this study that it is possible to measure the relative amounts of monosaccharides in protein glycosylation by FT-IR using SVR.
The theory underlying the application of PLSR assumes that spectra follow the Beer–Lambert law, $A = \varepsilon c l$, where the absorbance $A$ depends linearly on the molar extinction coefficient $\varepsilon$ of the analyte, on its concentration $c$, and on the optical path length $l$ travelled by the beam. A system is thus considered linear if such a simple proportionality holds. For a mixture of $K$ constituents, the equation becomes $D = c_1 s_1 + c_2 s_2 + \dots + c_K s_K$, where $c_i$ is the concentration of constituent $i$ and $s_i$ is its spectrum.
Therefore, it has been attempted to show that a more complex relationship exists by comparing theoretical (weighted sum of the spectra of each monosaccharide) and experimentally measured glycans spectra. In this context, the spectra of glycans (FA2, FA2G1, as well as the high mannose: Man-1, Man-3, Man-5, Man-6, Man-8) were studied.
For example, the theoretical spectrum of glycan FA2 is computed as:
$D_{\mathrm{FA2}} = 4\,s_{\text{N-acetylglucosamine}} + 3\,s_{\text{mannose}} + 1\,s_{\text{fucose}}$

since FA2 is known to be a combination of 4 N-acetylglucosamine, 3 mannose and 1 fucose residues. Comparing this weighted spectrum with the reference spectrum of glycan FA2, Fig. 6 clearly shows that the calculated spectrum differs from the reference spectrum in terms of intensity and positions of the IR bands. Similar results are observed for other glycans, as illustrated in Fig. S6–S9.†
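The construction of such a theoretical spectrum amounts to a weighted sum, as the following sketch shows. The FA2 composition (4 N-acetylglucosamine, 3 mannose, 1 fucose) comes from the text; the function names, the Gaussian placeholder spectra and the normalised RMS difference used for the comparison are assumptions made for illustration and are not the authors' procedure.

```python
import numpy as np


def theoretical_spectrum(composition, monosaccharide_spectra):
    """Weighted sum of monosaccharide spectra: D = sum_i n_i * s_i."""
    return sum(count * np.asarray(monosaccharide_spectra[name], dtype=float)
               for name, count in composition.items())


def normalised_rms_difference(spectrum_a, spectrum_b):
    """RMS difference after scaling each spectrum to unit maximum (assumed metric)."""
    a = np.asarray(spectrum_a, dtype=float)
    b = np.asarray(spectrum_b, dtype=float)
    a = a / np.abs(a).max()
    b = b / np.abs(b).max()
    return float(np.sqrt(np.mean((a - b) ** 2)))


def gaussian_band(axis, centre, width=20.0):
    """Placeholder band shape standing in for a measured spectrum."""
    return np.exp(-0.5 * ((axis - centre) / width) ** 2)


wavenumbers = np.linspace(1200.0, 950.0, 251)
reference_spectra = {                      # hypothetical band positions
    "N-acetylglucosamine": gaussian_band(wavenumbers, 1075.0),
    "mannose": gaussian_band(wavenumbers, 1050.0),
    "fucose": gaussian_band(wavenumbers, 1100.0),
}
fa2_composition = {"N-acetylglucosamine": 4, "mannose": 3, "fucose": 1}

d_fa2 = theoretical_spectrum(fa2_composition, reference_spectra)

# A hypothetical "measured" glycan spectrum with shifted bands, for comparison only
measured_fa2 = 7.0 * gaussian_band(wavenumbers, 1065.0, width=30.0)
print("normalised RMS difference:", round(normalised_rms_difference(d_fa2, measured_fa2), 3))
```

A difference of zero would mean that the glycan spectrum is exactly additive in its monosaccharides; a non-zero value is the kind of departure from additivity discussed here.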
As expected, the measured FT-IR spectrum of a glycan is not the simple sum of the different monosaccharide spectra. This supports the view that a more complex relationship exists between the spectral data and the amount of monosaccharides. It might be explained by the different environments surrounding the chemical bonds (different protein sequences implying different conformations), which induce minor differences in the vibrational modes. Glycans contain the same monosaccharides but arranged in different complex structures (Fig. S3†). Thus, the modes of vibration of a monosaccharide can be influenced by non-covalent interactions, and the vibrations of one molecule can affect those of its neighbours.
First, the global rate of glycosylation on the intact proteins was modelled. To increase the range of application of the model, a wide variety of proteins was included in the calibration. To build the model, SVR regression was used in the spectral range between 1179 and 965 cm−1. The model presents good predictive performances in terms of RMSEP with 0.64% (w/w) and of RMSECV with 0.38% (w/w).
The second part of the study is specifically related to antibodies and their biosimilars, which represent a large part of the biotherapeutic market. This study demonstrated the capacity of FT-IR spectroscopy to quantify the relative amount of each monosaccharide. In this context, the regression models were established on the spectral region of glycans, between 1200 and 950 cm−1. The SVR models outperformed the PLSR ones, with good performances in terms of RMSEP and RMSECV, underlining the high robustness and high predictive accuracy of the models. Finally, the Ratio of Performance to Deviation (RPD) was evaluated for the different models. The RPD values of the SVR models were consistently higher than those of the PLSR models, which confirms the accuracy of the SVR models.
This approach, based on FT-IR spectroscopy combined with SVR models, paves the way to three potential applications: comparing the glycosylation of a biosimilar and the original molecule, monitoring batch-to-batch homogeneity, and in-process control.
Footnote
† Electronic supplementary information (ESI) available. See DOI: 10.1039/d1an00697e
This journal is © The Royal Society of Chemistry 2022 |