Background and Introduction: Osteocytes, the most numerous bone cells, produce sclerostin. A predictive model of the sclerostin protein sequence can support the design of novel medications and the regeneration of alveolar bone in periodontitis and other oral bone disorders, including osteoporosis. In protein engineering, neural networks examine protein variants and predict their effects on structure and function. Proteins with improved function and stability have been engineered using large language models (LLMs) and convolutional neural networks (CNNs). Sequence-based models, especially protein LLMs, predict variant effects, fitness, post-translational modifications, biophysical properties, and protein structure, while CNNs trained on structural data have improved enzyme function. It remains unknown whether these models differ fundamentally or make similar predictions. This study applies pre-trained language models to predict Wnt-sclerostin protein sequences relevant to alveolar bone formation. Methods: Sclerostin and related proteins (Q9BQB4, Q9BQB4-1, Q9BQB4-2, Q6X4U4, O75197) were retrieved by UniProt ID and quality-checked, and their FASTA sequences were analyzed with DeepBIO, a one-stop web service that lets researchers build deep-learning architectures for biological questions and uses deep learning to analyze, improve, and visualize biological sequencing data. The LLM-based Reformer and the APPNP, TextRGNN, VDCNN, and RNN_CNN models were trained on sequence-based datasets split into training and test sets; each dataset was randomly partitioned into 1000 training and 200 testing sequences to tune hyperparameters and measure performance. Results: Reformer, APPNP, TextRGNN, VDCNN, and RNN_CNN achieved 93, 64, 51, 91, and 64 percent accuracy, respectively. Conclusion: Protein sequence-based large language models are growing rapidly, and their research and development is helping solve complex challenges.
In recent years, there has been a major evolution in our understanding of bone health and regulation [1, 2]. The Wnt signaling pathway is one of the major mechanisms controlling bone homeostasis; sclerostin is a critical regulator in this system. The purpose of this article is to present a thorough summary of the functions of sclerostin and Wnt signaling in bone health. The Wnt signaling system tightly regulates bone resorption and formation. It comprises the \(\beta\)-catenin-dependent canonical Wnt pathway and the \(\beta\)-catenin-independent non-canonical Wnt pathway [3, 4]. While osteoclastogenesis, bone resorption, and bone remodeling are involved in the non-canonical pathway, the canonical pathway controls osteoblast proliferation, differentiation, and survival [5].
The most prevalent cells in bone tissue [6, 7, 8], osteocytes, are the main producers of the protein sclerostin. Sclerostin is a negative regulator of the Wnt signaling cascade: it binds the LRP5/6 receptors and prevents activation of β-catenin. Through this mechanism, sclerostin suppresses osteoblast proliferation and activity, which ultimately lowers bone mass and inhibits the formation of new bone [9]. When the Wnt signaling pathway is activated, it positively controls osteoblastogenesis and encourages new bone growth. When Wnt ligands such as Wnt1, Wnt3a, and Wnt10b engage LRP5/6 receptors, β-catenin [10] is stabilized, accumulates in the nucleus, and drives expression of target genes important for osteoblast differentiation and function [11, 12]. Sclerostin, in contrast, binds to LRP5/6 and blocks this process by preventing β-catenin activation [13, 14].
Numerous bone disorders have been linked to the deregulation of sclerostin and abnormalities in the Wnt signaling pathway. For example, osteoporosis, osteogenesis imperfecta, and juvenile idiopathic arthritis have been linked to elevated sclerostin levels and decreased Wnt signaling activity [15, 16, 17, 18]. These abnormalities weaken bones and raise fracture risk by decreasing bone production and increasing bone resorption. Therapeutic strategies that target sclerostin have been investigated because of its critical function in maintaining bone homeostasis.
In periodontitis [19] and other oral bone disorders, including osteoporosis, a predictive model of sclerostin protein sequences can help promote healthy alveolar bone and inform the design of novel medications. Neural networks and other machine learning models are increasingly used in protein engineering to investigate and forecast the effects of protein variants on structure and function [18, 19, 20]. Convolutional neural networks (CNNs) and large language models (LLMs) [21, 22, 23] have been used to create proteins with improved stability and function. Protein LLMs [24], particularly sequence-based models, have successfully predicted protein structure, post-translational modifications, variant effects, and biophysical characteristics, while CNNs trained on structural data have increased enzyme activity. It is unclear whether these models are fundamentally different or produce comparable predictions. This work aims to predict Wnt-sclerostin protein sequences in alveolar bone formation using pre-trained language models.
The following sclerostin and related proteins were downloaded by UniProt ID: Q9BQB4, Q9BQB4-1, Q9BQB4-2, Q6X4U4, and O75197. Their sequences were identified and quality-checked, and the FASTA sequences were analyzed with the DeepBIO web server.
DeepBIO is a one-stop web service for researchers wishing to build a deep-learning architecture for any biological question. It uses deep learning techniques to evaluate, improve, and visualize biological sequencing data, and it divides sequence-based datasets into training and test sets. We randomly divided each dataset into 1000 training and 200 testing sequences to tune hyperparameters and assess performance.
The large language model and other sequence-prediction algorithms used were Reformer, APPNP, TextRGNN, VDCNN, and RNN_CNN (see Table 1).
Parameter | VDCNN | RNN_CNN |
---|---|---|
Cuda | TRUE | TRUE |
Seed | 43 | 43 |
num_workers | 4 | 4 |
num_class | 2 | 2 |
Kmer | 3 | 3 |
save_figure_type | png | png |
Mode | train-test | train-test |
Type | prot | prot |
Model | VDCNN | RNN_CNN |
datatype | userprovide | userprovide |
interval_log | 10 | 10 |
interval_valid | 1 | 1 |
interval_test | 1 | 1 |
Epoch | 50 | 50 |
optimizer | Adam | Adam |
loss_func | CE | CE |
batch_size | 32 | 32 |
LR | 0.0001 | 0.0001 |
Reg | 0.0025 | 0.0025 |
Gamma | 2 | 2 |
Alpha | 0.25 | 0.25 |
max_len | 52 | 52 |
dim_embedding | 32 | 32 |
minimode | modelCompare | modelCompare |
if_use_FL | 0 | 0 |
if_data_aug | 1 | 1 |
if_data_enh | 0 | 0 |
CDHit | ['1'] | ['1'] |
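For record-keeping, the Table 1 settings can be mirrored in a plain Python dictionary. This is illustrative only: DeepBIO is configured through its web interface, and these key names are informal labels rather than an official DeepBIO API.

```python
# Illustrative configuration mirroring Table 1; keys are informal labels,
# not an official DeepBIO API.
config = {
    "cuda": True,
    "seed": 43,
    "num_workers": 4,
    "num_class": 2,          # binary classification
    "kmer": 3,
    "mode": "train-test",
    "type": "prot",          # protein sequences
    "model": "VDCNN",        # or "RNN_CNN", "Reformer", "APPNP", "TextRGNN"
    "epoch": 50,
    "optimizer": "Adam",
    "loss_func": "CE",       # cross-entropy loss
    "batch_size": 32,
    "lr": 1e-4,
    "reg": 0.0025,
    "max_len": 52,           # maximum sequence length
    "dim_embedding": 32,
}
```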
Reformer
The Reformer is a natural language processing model introduced by Google researchers in the 2019 paper "Reformer: The Efficient Transformer." It builds on the Transformer architecture popularized by BERT and GPT (Generative Pre-trained Transformer). Its main contribution is efficiency in handling long-range dependencies: compared to classic Transformers, it reduces the time and memory complexity of attention over long sequences with locality-sensitive hashing (LSH). Reversible residual layers also make Reformer training memory-efficient, which is crucial for long sequences, where standard Transformers often run out of memory.
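As a rough sketch of the LSH idea only (not the Reformer implementation itself), the snippet below buckets shared query/key vectors with random projections so that attention is computed only within each bucket; the sequence length, embedding dimension, and `n_buckets` are arbitrary toy values.

```python
import numpy as np

def lsh_buckets(x, n_buckets, rng):
    """Assign each vector a bucket via random projections (angular LSH)."""
    # x: (seq_len, dim); project onto n_buckets // 2 random directions
    projections = rng.normal(size=(x.shape[1], n_buckets // 2))
    rotated = x @ projections                        # (seq_len, n_buckets//2)
    scores = np.concatenate([rotated, -rotated], 1)  # (seq_len, n_buckets)
    return scores.argmax(axis=1)                     # bucket id per position

rng = np.random.default_rng(43)
qk = rng.normal(size=(52, 32))         # toy shared query/key vectors
buckets = lsh_buckets(qk, n_buckets=8, rng=rng)

# Attention is then restricted to positions in the same bucket, which is what
# lets Reformer avoid the quadratic cost of full attention on long sequences.
for b in np.unique(buckets):
    idx = np.where(buckets == b)[0]
    scores = qk[idx] @ qk[idx].T / np.sqrt(qk.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)    # softmax within the bucket
```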
APPNP
APPNP, which stands for Approximate Personalized Propagation of Neural Predictions, is an approach for semi-supervised learning on graphs. It combines the strengths of two techniques. Personalized PageRank is a variant of Google's PageRank that ranks network nodes by their importance relative to a specific "seed" node, so a node's value depends on its own connections and its neighbors' connections, viewed from that focus point. Neural networks are powerful models that can learn complex relationships and patterns from data, often achieving impressive results across many tasks.
APPNP Leverages these Two Techniques in a Two-step Process
Predict: a neural network uses each node's features to produce an initial prediction, gathering local information around the node. Propagate: an approximation of Personalized PageRank then "spreads" these predictions across the graph, taking nearby nodes and the focus node into account. This propagation stage refines the initial predictions and incorporates global context by sharing information between nodes.
APPNP's fundamental benefit is its ability to use information from a large, configurable neighborhood around each node while preserving computational efficiency and a small number of parameters. This makes it useful for semi-supervised classification, where little labeled data is available but a large network of unlabeled data exists.
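A minimal sketch of the predict-then-propagate scheme, assuming per-node predictions `H` from any neural network and a row-normalized adjacency matrix `A_hat`; `alpha` is the teleport probability and `K` the number of propagation steps.

```python
import numpy as np

def appnp_propagate(H, A_hat, alpha=0.1, K=10):
    """Approximate personalized-PageRank propagation of per-node predictions H."""
    # H: (n_nodes, n_classes) initial neural-network predictions
    # A_hat: (n_nodes, n_nodes) normalized adjacency matrix with self-loops
    Z = H.copy()
    for _ in range(K):
        # each step mixes neighborhood information with the original predictions
        Z = (1 - alpha) * (A_hat @ Z) + alpha * H
    return Z
```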
TextRGNN
TextRGNN is a residual graph neural network (GNN) architecture for text classification. It was introduced in a December 2021 preprint and has performed well on many datasets.
Here is a breakdown of TextRGNN's key features:
TextRGNN initializes graph node embeddings with a probabilistic language model (PLM) to improve semantic information capture, using the PLM's knowledge of word relationships and syntax to enrich the GNN's message-passing process.
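The sketch below illustrates the general idea only, not the published TextRGNN code: node embeddings (which would come from a pretrained language model) are refined by residual message passing over a word graph. All shapes and values here are hypothetical.

```python
import numpy as np

def residual_message_passing(X, A_hat, n_layers=2):
    """Refine PLM-initialized node embeddings X with residual graph updates."""
    H = X
    for _ in range(n_layers):
        H = H + np.tanh(A_hat @ H)   # residual connection around each update
    return H

# Toy example: 4 graph nodes with 8-dimensional embeddings that would,
# in practice, come from a pretrained language model's embedding layer.
rng = np.random.default_rng(43)
X = rng.normal(size=(4, 8))
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)
A_hat = A / A.sum(axis=1, keepdims=True)   # row-normalized adjacency
H = residual_message_passing(X, A_hat)
```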
VDCNN
VDCNN, or Very Deep Convolutional Neural Network, is an architecture designed for text classification. It applies small convolutions and pooling operations at the character level to obtain strong results. A breakdown of VDCNN:
Architectural Modularity
Depth can be chosen from 9, 17, 29, or 49 layers to suit dataset size and complexity.
Character-level Processing
Works directly with text characters to capture fine-grained classification information.
Pooling and Small Convolutions
Uses convolutional layers with small filters (size 3 or 5) and max pooling to reduce dimensionality, lower computational cost, and improve noise resistance.
Multiple Nonlinear Activations
ReLU activations throughout the network introduce non-linearity, improving feature extraction and representation. In the final layer, global average pooling aggregates information from all feature maps, collecting global context for efficient classification.
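A much shallower toy model in the spirit of VDCNN, showing character embeddings, small (size-3) convolutions with pooling, ReLU activations, and global average pooling; the layer sizes are illustrative, not those of the 9/17/29/49-layer variants.

```python
import torch
import torch.nn as nn

class TinyVDCNN(nn.Module):
    """Toy character-level CNN in the spirit of VDCNN (much shallower)."""
    def __init__(self, n_chars=70, n_classes=2, emb_dim=16):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emb_dim)
        self.convs = nn.Sequential(
            nn.Conv1d(emb_dim, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                # global average pooling
        )
        self.fc = nn.Linear(128, n_classes)

    def forward(self, char_ids):                    # char_ids: (batch, seq_len)
        x = self.embed(char_ids).transpose(1, 2)    # (batch, emb_dim, seq_len)
        x = self.convs(x).squeeze(-1)               # (batch, 128)
        return self.fc(x)
```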
RNN_CNN
RNN-CNN, or Recurrent Neural Network-Convolutional Neural Network, is a deep learning architecture that combines the capabilities of both networks to solve complicated problems, especially in NLP and computer vision. Text or video input data is preprocessed before being fed into the network; this may entail text tokenization or video frame extraction.
The CNN uses convolutional layers to extract local information from the preprocessed input. These features capture edges, textures, and shapes in images, or word frequencies and grammatical structures in text. In the RNN sequence-modeling stage, the extracted features are fed into the RNN, which processes them sequentially and stores their associations in its memory. Using the processed features and its internal memory, the RNN generates the required output, which could be a classification label, a caption, or a sequence prediction.
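A toy sketch of this CNN-then-RNN pipeline, with a convolution extracting local features and a GRU modeling their order; the vocabulary size and dimensions are placeholders.

```python
import torch
import torch.nn as nn

class CNNThenRNN(nn.Module):
    """Toy hybrid: a convolution extracts local features, a GRU models their
    order, and the final hidden state is classified."""
    def __init__(self, n_tokens=25, n_classes=2, emb_dim=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(n_tokens, emb_dim)
        self.conv = nn.Conv1d(emb_dim, hidden, kernel_size=3, padding=1)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, token_ids):                     # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)     # (batch, emb_dim, seq_len)
        x = torch.relu(self.conv(x)).transpose(1, 2)  # (batch, seq_len, hidden)
        _, h = self.rnn(x)                            # h: (1, batch, hidden)
        return self.fc(h.squeeze(0))                  # (batch, n_classes)
```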
The study reveals that prediction accuracy varies across the evaluated models, with the LLM-based model yielding good accuracy. This reflects the bias/variance trade-off in machine learning, in which convolutional layers carry an inductive bias suited to spatial data.
Model Name | ACC | Sensitivity | Specificity | AUC |
---|---|---|---|---|
Reformer | 0.885 | 0.88 | 0.89 | 0.936 |
APPNP | 0.56 | 0.48 | 0.64 | 0.614 |
TextRGNN | 0.5 | 0 | 1 | 0.511 |
VDCNN | 0.875 | 0.87 | 0.88 | 0.914 |
RNN_CNN | 0.625 | 0.66 | 0.59 | 0.649 |
The "sensitivity" of Reformer, AAPNP, TEXTRGNN, VDCNN, and RNN CNN is 0.88, 0.48, 0.87, and 0.66 for TP / (TP + FN). The model’s ability to correctly identify negative cases is known as its specificity or genuine negative rate. The results show that the specificities of Reformer, AAPNP, TEXTRGNN, VDCNN, and RNN CNN are TN / (TN + FP) -0.89, 0.64,1, 0.88, and 0.59, respectively.
ROC Curve
The receiver operating characteristic (ROC) curve shows how classification thresholds affect a model's true positive rate (sensitivity) and false positive rate (1 - specificity). Curves closer to the upper left corner of the plot indicate better performance; the ROC curves show that Reformer and VDCNN are accurate, whereas APPNP, TextRGNN, and RNN_CNN are moderate.
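A minimal sketch of how such a curve is produced with scikit-learn, using hypothetical labels and scores rather than the study's outputs:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                   # hypothetical labels
y_score = [0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.8, 0.3]  # predicted P(positive)

fpr, tpr, _ = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

plt.plot(fpr, tpr, label=f"AUC = {auc:.3f}")
plt.plot([0, 1], [0, 1], linestyle="--")            # chance diagonal
plt.xlabel("False positive rate (1 - specificity)")
plt.ylabel("True positive rate (sensitivity)")
plt.legend()
plt.show()
```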
Precision-Recall Curve
The precision-recall curve (PRC) shows the trade-off between recall and precision for a binary classifier at varying probability thresholds. Recall is the fraction of actual positives that are correctly predicted, whereas precision is the fraction of positive predictions that are correct, so the curve reveals performance under class imbalance. The area under the precision-recall curve (AUC-PR) is a common summary statistic for classifier performance; Reformer and VDCNN show higher AUC-PR values, indicating better performance.
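The corresponding precision-recall computation, again with hypothetical labels and scores:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, average_precision_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                   # hypothetical labels
y_score = [0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.8, 0.3]  # predicted P(positive)

precision, recall, _ = precision_recall_curve(y_true, y_score)
ap = average_precision_score(y_true, y_score)       # area under the PR curve

plt.plot(recall, precision, label=f"AUC-PR = {ap:.3f}")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend()
plt.show()
```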
An epoch plot graphs a machine learning model's accuracy and loss over the course of training and is useful for detecting overfitting and other model flaws. Epoch plots display the number of epochs, or iterations the model was trained for, on the x-axis, and model accuracy or loss on the y-axis. The loss shows how well the model predicts an input's output; accuracy measures the fraction of correct predictions.
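A minimal sketch of an epoch plot with a fabricated training history; in practice the values come from the training log.

```python
import matplotlib.pyplot as plt

# Fabricated history for illustration; real values come from the training log.
epochs = range(1, 51)
train_loss = [1.0 / e for e in epochs]
val_acc = [0.5 + 0.4 * (1 - 1.0 / e) for e in epochs]

fig, ax1 = plt.subplots()
ax1.plot(epochs, train_loss, color="tab:red")
ax1.set_xlabel("Epoch")
ax1.set_ylabel("Loss", color="tab:red")
ax2 = ax1.twinx()                      # second y-axis for accuracy
ax2.plot(epochs, val_acc, color="tab:blue")
ax2.set_ylabel("Accuracy", color="tab:blue")
plt.show()
```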
SHAP Values
SHAP values quantify how much each feature contributes to a machine-learning model's prediction. They are computed by considering all possible feature combinations and the contribution a feature makes when added to each subset of features. A positive SHAP value (shown in red) means the feature pushes the prediction higher; a negative SHAP value (shown in blue) means the feature pushes it lower.
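A small sketch of computing and plotting SHAP values with the shap library on a surrogate model and synthetic features; none of this reproduces the study's actual explainer.

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

# Hypothetical features and labels; in practice X would hold sequence-derived features.
rng = np.random.default_rng(43)
X = rng.normal(size=(200, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = RandomForestClassifier(random_state=43).fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
# Older shap versions return one array per class, newer ones a single 3-D array;
# either way, take the slice for the positive class before plotting.
sv_pos = shap_values[1] if isinstance(shap_values, list) else shap_values[..., 1]
shap.summary_plot(sv_pos, X)   # red = pushes the prediction up, blue = down
```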
UpSet Plot
An UpSet plot compares intersection sizes to show how frequently elements are shared across groups; larger intersections indicate more overlap between groups than smaller ones. Vertical UpSet plots show intersections as rows and sets as matrix columns, with filled cells in each row indicating which sets participate in that intersection.
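A short sketch using the upsetplot package with hypothetical per-model sets (for example, which toy sequences each model classified correctly):

```python
import matplotlib.pyplot as plt
from upsetplot import from_contents, plot

# Hypothetical contents: which (toy) sequences each model classified correctly.
correct = {
    "Reformer": {"seq1", "seq2", "seq3", "seq4"},
    "VDCNN":    {"seq1", "seq2", "seq4"},
    "RNN_CNN":  {"seq2", "seq5"},
}
plot(from_contents(correct))   # bar heights give intersection sizes; the dot
plt.show()                     # matrix shows which sets each bar combines
```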
UMAP is a nonlinear dimensionality-reduction technique that embeds high-dimensional data in a low-dimensional space. It builds a weighted graph over the high-dimensional data, with edge strength indicating how "near" points are, and then projects this graph into fewer dimensions while preserving its structure, so that points close together in high-dimensional space remain close in the embedding. The resulting plot reveals clustering patterns found by the algorithm.
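A minimal umap-learn sketch on synthetic features; in DeepBIO the inputs would be learned sequence representations rather than random numbers.

```python
import numpy as np
import umap

# Synthetic stand-in features; real inputs would be learned sequence representations.
X = np.random.default_rng(43).normal(size=(200, 32))

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=43)
embedding = reducer.fit_transform(X)   # (200, 2) coordinates for plotting
```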
Understanding the mechanisms and regulation of sclerostin protein sequences can provide insights into developing therapies for various bone-related disorders, including those affecting the alveolar bone. Protein sequence prediction using Large Language Models (LLMs) [24, 25] is a rapidly emerging field with exciting protein engineering and drug discovery possibilities. LLMs trained on massive protein databases can learn the underlying patterns and rules of protein sequences. This allows them to generate new sequences with desired properties, like increased stability, specific binding affinities, or even new functionalities. LLMs can statistically predict the most likely missing amino acids when faced with incomplete protein sequences based on the surrounding context. This can be crucial for structural modeling and understanding protein function [26, 27]. Analyzing the relationships between different protein sequences is key to understanding their evolution and function. LLMs can help uncover these relationships by learning the subtle changes in sequences that translate to functional differences.
Sclerostin protein sequence prediction shows accuracies for Reformer, APPNP, TextRGNN, VDCNN, and RNN_CNN of 93%, 64%, 51%, 91%, and 64%, respectively (Table 1, Figures 1-7).
Several models have been developed. ProteinBERT [28, 29] is a deep language model specifically designed for proteins, combining language modeling with Gene Ontology (GO) annotation prediction. It offers efficient and flexible handling of biological sequences with local and global representations, and it achieves near-state-of-the-art performance on various protein properties [30, 31], making it an efficient framework for rapidly training protein predictors even with limited labeled data. Transformer-based architectures have revolutionized protein design, enabling the creation of tailored proteins for various applications. ProtGPT2 [11, 32], a language model trained on protein space, generates de novo protein sequences that follow natural principles, displaying natural amino acid propensities while remaining only distantly related to natural sequences, thereby exploring unexplored regions of protein space. Sclerostin protein sequence prediction is useful for designing novel drugs and increasing alveolar bone formation [14, 16, 17, 33, 34, 35]. Anti-sclerostin monoclonal antibodies have shown significant osteoanabolic effects in animal studies, including increased bone mineral density in mice and reversal of bone loss in ovariectomized rats, and anti-sclerostin therapy improved fracture healing, alveolar bone repair, and callus density in nonhuman primates. Sclerostin plays a crucial role in alveolar bone formation; the alveolar bone [36] surrounds the teeth and provides support and stability, and sclerostin is produced and secreted primarily by osteocytes, the mature bone cells within bone tissue. In summary, sclerostin is vital in alveolar bone remodeling, restraining excessive bone formation and maintaining a balanced remodeling process. This AI model will help predict difficult sequence information and aid novel protein drug designs targeting sclerostin.
This predictive AI model will solve complex sclerostin protein sequences and help design novel drugs to target sclerostin for alveolar bone formation.
The authors declare no conflict of interest. All authors read and approved the final version of the paper.
All authors contributed equally to this paper.