UniProt dataset
UniProt is a comprehensive protein database. We download its reviewed protein sequences (550k) and keep only those within a length limit of 100 residues, giving D_r (57k examples). We then use a community reimplementation of AlphaFold to predict the secondary structure of each sequence in D_r. After filtering out low-quality examples, we obtain D_s (46k examples), which contains both sequence and secondary structure information.
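The length filter that produces D_r can be sketched as follows. This is a minimal illustration, not the pipeline actually used: it assumes the sequences are in FASTA format and that the limit of 100 is inclusive. The names `parse_fasta`, `filter_by_length`, and `MAX_LEN`, as well as the toy records, are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple

MAX_LEN = 100  # assumption: the length limit of 100 is inclusive


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(
    records: Iterable[Tuple[str, str]], max_len: int = MAX_LEN
) -> Dict[str, str]:
    """Keep only records whose sequence has at most max_len residues."""
    return {header: seq for header, seq in records if len(seq) <= max_len}


# Toy input with made-up records: one short sequence split over two
# lines, and one 150-residue sequence that exceeds the limit.
toy_fasta = ">toy_short\nMKTAYIAKQR\nQISFVKSHFS\n>toy_long\n" + "A" * 150

d_r = filter_by_length(parse_fasta(toy_fasta.splitlines()))
print(list(d_r))
```

On real data, the same two functions could be applied to the lines of a downloaded FASTA file opened with `open(path)`.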
BibTeX: