SynthLC

SynthLC models lung cancer patient data. Each patient has the following attributes:

  • Age category
  • Sex
  • Smoking habit
  • Comorbidities
  • Biomarkers
  • Drugs taken
  • Relapse

The attribute values are assigned at random, so the data do not reflect any patterns found in real lung cancer patients. In addition to a generator for synthetic lung cancer data, the entry provides pre-generated datasets modeling 1,000, 10,000, and 100,000 patients. The following files are part of this SynthLC entry:

  • SynthLC CSV: The 1k, 10k, and 100k datasets in CSV format.
  • SynthLC RDF: The 1k, 10k, and 100k datasets in RDF.
  • SynthLC Virtuoso: The 1k, 10k, and 100k datasets preloaded in Virtuoso 07.20.3238.
  • SynthLC Shapes: 25 SHACL shapes consisting of biomarker, drug, and relapse combinations. There are two variants: one uses SPARQL constraints, while the other uses a non-standard approach that specifies the shapes' targets via a query.
  • SynthLC Generator: The script used to create the 1k, 10k, and 100k datasets and the shapes. It can also be used to create more shapes or datasets of different sizes.
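
To illustrate what assigning attribute values at random can look like, the following is a minimal Python sketch. It is not the SynthLC Generator itself: the value sets, column names, the 50% inclusion probability, and the helper functions (generate_patients, to_csv) are all assumptions made for illustration, and the actual script may differ in every one of these details.

```python
import csv
import io
import random

# Hypothetical value sets; the real SynthLC generator may use other categories.
AGE_CATEGORIES = ["young", "old"]
SEXES = ["female", "male"]
SMOKING_HABITS = ["non-smoker", "former smoker", "current smoker"]
COMORBIDITIES = ["diabetes", "hypertension", "COPD"]
BIOMARKERS = ["EGFR", "ALK", "KRAS"]
DRUGS = ["drug_a", "drug_b", "drug_c"]

# Hypothetical column names, one per attribute listed in the description.
FIELDS = ["patient_id", "age_category", "sex", "smoking_habit",
          "comorbidities", "biomarkers", "drugs", "relapse"]


def random_subset(rng, values):
    """Return a random, possibly empty, subset of values (assumed 50% per item)."""
    return [v for v in values if rng.random() < 0.5]


def generate_patients(n, seed=0):
    """Create n synthetic patients with independently drawn attribute values."""
    rng = random.Random(seed)  # seeded so that a run can be reproduced
    patients = []
    for i in range(n):
        patients.append({
            "patient_id": i,
            "age_category": rng.choice(AGE_CATEGORIES),
            "sex": rng.choice(SEXES),
            "smoking_habit": rng.choice(SMOKING_HABITS),
            # Multi-valued attributes are joined with "|" to fit one CSV cell.
            "comorbidities": "|".join(random_subset(rng, COMORBIDITIES)),
            "biomarkers": "|".join(random_subset(rng, BIOMARKERS)),
            "drugs": "|".join(random_subset(rng, DRUGS)),
            "relapse": rng.choice([True, False]),
        })
    return patients


def to_csv(patients):
    """Serialize the patients to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(patients)
    return buf.getvalue()
```

Calling generate_patients(1000) would produce a dataset of the same size as the 1k variant. Because every attribute is drawn independently, any correlation seen in such data is a chance effect, which matches the statement above that no real-world patterns can be observed.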

Cite this as

Philipp D. Rohde, Maria-Esther Vidal (2024). Dataset: SynthLC. https://doi.org/10.57702/oyfz6rmc

DOI retrieved: December 2, 2024

Additional Info

Field         Value
Created       November 26, 2024
Last update   December 2, 2024
License       cc-by: Creative Commons Attribution
Version       1.0
Author        Philipp D. Rohde
More Authors  Maria-Esther Vidal