In vivo RNA structure determination
All the RNA structures in different cell lines are performed from the same experimental methods and downstream analysis pipeline. The experimental details of icSHAPE are in the paper (Sun et al., 2021). The downstream computational analysis of icSHAPE-pipe is in the paper (Li et al., 2020).
Input data of PrismNet
The length of the input sequence is 101nt (A, T, C, G). The values of its matched structures range from 0 to 1. 0 means the nucleotide prefers to double-stranded RNA structure and 1 means the nucleotide prefers to single-stranded RNA structure. The input sequence can be provided by FASTA format, the matched structural information is also provided by FASTA format (-1 or NULL for the NAN scores). The users can search the human, mouse or yeast genome structures by the genome localization (chr: start position) if they don’t have the structure information. The users can also only provide the sequence information by the Sequence mode.
PrismNet architecture
The input data of PrismNet are encoded as X∈RN×L×D, which is a tensor. N is the sampels, L=101 is the length, and D=5 is the input data dimensions, including 4 dimension of sequence information by one hot encoding and the fifth dimension for structure information. The binding probability of Y∈RN is computed by:
Y = PrismNet(X)
PrismNet is then defined by the following set of functions.
PrismNet(X)=σ(FC(fR (fS (fC (X)))))
σ is a sigmoid function, FC is a fully connection layer, and fC, fS, fR are the formulation of the convolutional block, the squeeze-excitation block (SE)(Hu et al., 2018) and the residual blocks defined in this work (Zarnegar et al., 2016) respectively.
Pre-training models of PrismNet
In total, we trained 256 PrismNet models for 168 human proteins using all the known CLIP data collected from POSTAR and ENCODE (eCLIP) (Van Nostrand et al., 2020; Zhu et al., 2019). The users can train their own models if they have new CLIP data for other RBPs or RBP binding in the new conditions. The scripts of PrismNet model architecture are available from github (https://github.com/kuixu/PrismNet).
Saliency map and integrative motifs of PrismNet
For each query sequence, a saliency map will be calculated to present the contribution of each nucleotide that is particularly influential to the decision whether the input is positive or negative. The sequence and structure logo was generated by SmoothGrad (Wattenberg, 2017).
The sequence and structural integrative motifs are pre-calculated from exact RBP binding sites of the whole transcriptome, which are extracted from the saliency maps with predicted binding probabilities >= 0.8.(Sun et al., 2021). The generation steps are as follows: 1. We predict the binding sites with a binding probability>0.8 from the whole transcriptome with sequence and structure information input of each RBP. We choose sequence and structure information both because these information provides a more complete picture of protein binding preferences. 2. Then we generate the saliency map of all the positive binding sites. 3. We retain 20% of the sliding windows (20nt) with the highest response signals of saliency map. We merge the overlapped windows. 4. We then enumerate the sequence-and-structure patterns of all k-mers (k= 6) in these windows. For each position, the sequence component is the nucleotide, while the structure component is labeled as “U “for unpaired nucleotide (icSHAPE>=0.223), and “P” for paired nucleotide (icSHAPE<0.223). 5. The similar k-mers are combined to build the sequence and structure integrative motif. Notably, the pre-calculated motifs are fixed and will not change with the users' input sequences or structures.
How to predict?
a) On the pages of Predict, there are three modes are shown: “Sequence&Structure” could predict RBP-RNA interaction using RNA sequence and RNA structures; “Sequence” mode only depends on the RNA sequence information; “Structure” mode only depends on the RNA structure information; If the input is large and may take longer time to predict the results, we suggest the user upload a file using the “Batch prediction” mode and get the results by a URL linkage.
To predict RNA-binding protein (RBP) and RNA interaction, users are required to input RNA sequences and structures in a FASTA-like format. From the available 256 pre-trained models for 168 human proteins, users must select at least one Protein(Cell line) Model for analysis. The web server uses a default binding probability threshold of 0.8 to classify each site as a binding site. However, users have the option to modify this threshold to suit their needs. In addition, users can access input examples by clicking the "Example" button. If desired, users may also provide their email address to receive a notification with the URL to retrieve the results (optional).
b) The user also can search the RNA structural information in different conditions when unclose the structure search page. There are two modes are shown: “coordinate” mode can search the RNA structure by genomic coordinate. The max length is 50,000nt; “Transcript” mode can search the RNA structure by transcript Ensembl ID (such as ENST00000331789). If the interested sequence are not from human or mouse, such as the E. coli or plant, User can search the SHAPE score by RASP database which collected all the available SHAPE data.
What’s the predicted results?
a) When you "Submit" the data to infer and wait for several seconds (The URL for retrieving results is provided), you will see the inferred results as the following interface. The “Summary” shows the RBP binding sites of each input sequence. The user can click the binding sites to get the detail information. The "Query information" shows all the sequence information prepared for the PrismNet models. If the length of input sequence is longer than 101nt, we will split it into 101nt with 20nt sliding window and name it as name_start_end (such as ENST00000389680(21_121). The following "Prediction results" shows the inferred results of all 101nt sequences.
b) The corresponding sequence information follow the "Rank" can be queried in the above "Query information". The "Probability" shows the probability of RNA sequence bound to proteins, and the "Bind" give the judgement of RBP-RNA interaction according to the default binding probability threshold "0.8". "HAR" (High Attention Regions) give the most important regions of RBP-RNA interaction. The user could click the button "View/Hide" to view and hide the saliency map and motifs.
c) The user can get the results by the URL for the “Batch prediction”. Please download it on time.
How to train new model?
On the Model Training page, users have the option to upload a file that shows new RBP binding sites for training a new PrismNet model automatically by the web server. The file need to be named RBP-cellLine-genome-CLIP.tsv, such as IGF2BP1-K562-hg38-CLIP.tsv, or a zip file containing multiple tsv files. Please provide 15000 samples for every tsv file, including 5000 positive samples and 10000 negative samples. If you want to publish the pre-trained model, please choose 'YES' in 'Publish Models'.
After submitting the task, please be sure to wait for the file to be uploaded successfully (it takes some time to be successfully uploaded). When you see the following pop-up window, it indicates that the task has been successfully submitted. You can retrieve the training process by the provided URL.
When the model training is finished, you can download the pre-trained model and generated integrative motifs through the URL provided above. If you have choosed to publish the model, you can use it on the 'Predict' page.
Reference
1. Hu, J., Shen, L., and Sun, G. (2018). Squeeze-and-Excitation Networks. Paper presented at: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
2. Li, P., Shi, R., and Zhang, Q.C. (2020). icSHAPE-pipe: A comprehensive toolkit for icSHAPE data analysis and evaluation. Methods 178, 96-103.
3. Sun, L., Xu, K., Huang, W., Yang, Y.T., Li, P., Tang, L., Xiong, T., and Zhang, Q.C. (2021). Predicting dynamic cellular protein-RNA interactions by deep learning using in vivo RNA structures. Cell Res, 31, 495-516.
4. Van Nostrand, E.L., Freese, P., Pratt, G.A., Wang, X., Wei, X., Xiao, R., Blue, S.M., Chen, J.Y., Cody, N.A.L., Dominguez, D., et al. (2020). A large-scale binding and functional map of human RNA-binding proteins. Nature 583, 711-719.
5. Wattenberg, D.S.a.N.T.a.B.K.a.F.B.V.e.g.a.M. (2017). SmoothGrad: removing noise by adding noise. ArXiv abs/1706.03825.
6. Zarnegar, B.J., Flynn, R.A., Shen, Y., Do, B.T., Chang, H.Y., and Khavari, P.A. (2016). irCLIP platform for efficient characterization of protein-RNA interactions. Nature methods 13, 489-492.
7. Zhu, Y., Xu, G., Yang, Y.T., Xu, Z., Chen, X., Shi, B., Xie, D., Lu, Z.J., and Wang, P. (2019). POSTAR2: deciphering the post-transcriptional regulatory logics. Nucleic acids research 47, D203-D211.