Data Requirements

SeqUIaSCOPE accepts four types of genomic data: somatic variant calling results, germline variant calling results, fusion gene detection results, and gene expression profiles. Not all four are required — you can upload any combination depending on your analysis needs.

Before uploading, make sure your files follow the naming and format rules described on this page. The application automatically scans your directory for files and uses these rules to identify which file belongs to which patient and dataset type. Errors at upload time are almost always caused by naming issues.


Input Files Overview

Somatic Variant Calling
.tsv: annotated somatic variants

Tumor DNA .bam + .bai: for IGV
Normal DNA .bam + .bai: for IGV
mutation_loads.tsv: tumour mutational burden

Germline Variant Calling
.tsv: annotated germline variants

Normal DNA .bam + .bai: for IGV

Fusion Gene Detection
.tsv or .xlsx: combined Arriba + STARFusion results

Tumor RNA .bam + .bai: for IGV snapshots
Chimeric .bam + .bai: for IGV snapshots
Arriba .pdf + .tsv: for expanded row preview

Expression Profile
.tsv or .xlsx: gene expression data

One file per tissue if comparing multiple reference tissues

Tip: When you have a choice, always prefer .tsv over .xlsx. TSV files load faster and are less likely to cause formatting issues.

Fusion gene detection requires results from both Arriba and STARFusion as upstream tools. The required input file is a pre-processed table combining the output of both callers. The Arriba .pdf and .tsv output files are separate and optional — they unlock the expanded row preview in the Fusion Genes module.


File Naming Rules

SeqUIaSCOPE identifies files by scanning the full file path (folder names + file name together). It looks for specific keywords anywhere in the path to decide what type each file is. This means a keyword can appear in the filename itself or in a parent folder name — both work.

Keyword Reference

| File type | Path must contain | Path must NOT contain |
|---|---|---|
| Somatic variant file | somatic | |
| Germline variant file | germline | |
| Fusion gene file | fusion or fuze | arriba, STAR |
| Tumor RNA BAM | user-defined pattern (set during upload) | Chimeric, transcriptome |
| Chimeric BAM | user-defined pattern (set during upload) | transcriptome |
| Arriba output files | arriba | discarded, STAR |
| Expression file | expression or RNAseq | report, genes_of_interest |
| TMB file | filename must be exactly mutation_loads | |

The keyword can live anywhere in the path — these two examples both work:

project_root/patient_001/somatic_variants.tsv    # keyword in filename
project_root/somatic_data/patient_001/variants.tsv  # keyword in folder name
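
To preview how your own paths would be classified, the keyword table can be approximated in a few lines of Python. This is a minimal sketch, not the application's actual code: the function name classify_path, the rule order (first match wins), and the case-sensitive substring test are all assumptions made for illustration. BAM files are left out because they are matched by user-defined patterns.

```python
from pathlib import PurePosixPath
from typing import Optional

# (file type, keywords of which at least one must appear, keywords that must not appear)
# Taken from the keyword table; the rule order is an assumption of this sketch.
KEYWORD_RULES = [
    ("somatic", ("somatic",), ()),
    ("germline", ("germline",), ()),
    ("fusion", ("fusion", "fuze"), ("arriba", "STAR")),
    ("arriba", ("arriba",), ("discarded", "STAR")),
    ("expression", ("expression", "RNAseq"), ("report", "genes_of_interest")),
]


def classify_path(path: str) -> Optional[str]:
    """Return the dataset type suggested by the keywords in a full path."""
    # The TMB file is recognised by its exact file name, not by a keyword.
    if PurePosixPath(path).stem == "mutation_loads":
        return "tmb"
    for file_type, required, forbidden in KEYWORD_RULES:
        has_required = any(keyword in path for keyword in required)
        has_forbidden = any(keyword in path for keyword in forbidden)
        if has_required and not has_forbidden:
            return file_type
    return None  # no rule matched


for example in (
    "project_root/patient_001/somatic_variants.tsv",
    "project_root/somatic_data/patient_001/variants.tsv",
):
    print(example, "->", classify_path(example))
```

Both example paths are reported as somatic, one through the filename and one through the folder name.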

Patient ID in path: Every per-patient file must have the patient ID somewhere in its full path (either in the filename or a parent folder). Files without the patient ID will not be matched. The TMB file (mutation_loads) and the Genes of Interest file are the only exceptions — they are shared across all patients and do not need a patient ID in the path.
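
The patient ID rule can be checked before upload with a short script. The sketch below assumes a plain substring test on the full path; the function name group_by_patient is hypothetical and the application may match IDs differently. With a plain substring test, an ID such as patient_1 would also match a path containing patient_10, so distinct, fixed-width IDs are the safer choice.

```python
from collections import defaultdict


def group_by_patient(paths, patient_ids):
    """Group paths by the patient IDs found in them; collect the rest separately."""
    grouped = defaultdict(list)
    unmatched = []
    for path in paths:
        hits = [pid for pid in patient_ids if pid in path]
        if hits:
            for pid in hits:
                grouped[pid].append(path)
        else:
            # Shared files such as mutation_loads end up here, which is expected.
            unmatched.append(path)
    return dict(grouped), unmatched


grouped, unmatched = group_by_patient(
    [
        "project_root/patient_001/somatic_variants.tsv",
        "project_root/somatic_data/mutation_loads.tsv",
    ],
    ["patient_001", "patient_002"],
)
print(grouped)
print(unmatched)
```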

BAM Files

BAM files don’t use the keyword system. Instead, you define patterns during upload (e.g. tumor, FFPE, chimeric) and the application finds BAM files whose names match those patterns. See Upload Data — Step 1 for details.

For every .bam file, a corresponding index file must exist in the same directory, named either file.bam.bai or file.bai.
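
A quick way to find BAM files without an index before uploading is sketched below. It uses only the Python standard library; the function name bams_missing_index is hypothetical, and the check covers exactly the two index names mentioned above.

```python
from pathlib import Path


def bams_missing_index(root):
    """Return BAM files under root with neither file.bam.bai nor file.bai beside them."""
    missing = []
    for bam in sorted(Path(root).rglob("*.bam")):
        candidates = (
            bam.with_name(bam.name + ".bai"),  # e.g. tumor.bam.bai
            bam.with_suffix(".bai"),           # e.g. tumor.bai
        )
        if not any(candidate.is_file() for candidate in candidates):
            missing.append(bam)
    return missing
```

Calling bams_missing_index("project_root") returns an empty list when every BAM file has an index in the same directory.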

Arriba Output Files

If you provide Arriba output files, both the .pdf and .tsv must be present as a matched pair (same base name, different extension). A .pdf without a matching .tsv will be ignored, and vice versa.
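
The pairing rule can be sketched as follows. The function name arriba_pairs is hypothetical, and the sketch assumes that a pair must sit in the same directory, which the text does not state explicitly.

```python
from pathlib import Path


def arriba_pairs(paths):
    """Split Arriba .pdf/.tsv paths into matched pairs and unpaired leftovers."""
    by_base = {}
    for path in map(Path, paths):
        if path.suffix in (".pdf", ".tsv"):
            # Key on directory + base name, so only the extension may differ.
            by_base.setdefault(path.with_suffix(""), {})[path.suffix] = path
    pairs = [(found[".pdf"], found[".tsv"]) for found in by_base.values() if len(found) == 2]
    unpaired = [path for found in by_base.values() if len(found) == 1 for path in found.values()]
    return pairs, unpaired
```

Files returned in the unpaired list are the ones that would be ignored under the rule above.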

Expression Files with Multiple Tissues

When comparing against multiple reference tissues, provide one file per tissue. The tissue name must appear in the filename (e.g. blood_expression.tsv, liver.tsv). Use underscores instead of spaces — blood_vessel.tsv, not blood vessel.tsv.

Tissue names you enter in the upload form must match the names in the filenames. If no tissue names are provided, the application looks for a single expression file per patient containing expression or RNAseq in its path.
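
To confirm that every tissue name you plan to enter has a matching file, something like the following can help. It is a rough sketch under the assumption that matching is a plain substring test on the filename; the function name files_for_tissues is hypothetical. Under that assumption a short name such as blood would also match blood_vessel.tsv, so check the result for tissues that list more than one file.

```python
from pathlib import PurePosixPath


def files_for_tissues(paths, tissue_names):
    """Map each tissue name to the files whose file name contains it."""
    result = {}
    for tissue in tissue_names:
        result[tissue] = [p for p in paths if tissue in PurePosixPath(p).name]
    return result


matches = files_for_tissues(
    ["patient_001/blood_expression.tsv", "patient_001/liver.tsv"],
    ["blood", "liver", "kidney"],
)
for tissue, files in matches.items():
    print(tissue, files if files else "no file found")
```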


Directory Layouts

SeqUIaSCOPE is flexible — your data can be organised in many different ways as long as the naming rules above are respected. Below are three common layouts that all work correctly. The key principle is that the patient ID and dataset keyword must both appear somewhere in each file’s path.

Choose your root directory carefully. In the upload form you select a single root directory that contains all your patient data. Pick the most specific folder that still contains everything — selecting a very broad directory (e.g. your home folder) may cause the scanner to pick up unrelated files and confuse file matching.
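
One way to judge whether a candidate root is too broad is to count what it contains before selecting it. The sketch below is only a convenience script, not part of SeqUIaSCOPE; the function name summarise_root is hypothetical.

```python
from collections import Counter
from pathlib import Path


def summarise_root(root):
    """Count files under root by extension to judge how broad the directory is."""
    return Counter(
        path.suffix.lower() or "(none)"
        for path in Path(root).rglob("*")
        if path.is_file()
    )
```

A root that reports thousands of files with unrelated extensions is probably a level or two too high.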

Option A — one folder per data type

Best when your pipeline already separates outputs by analysis type.

project_root/                           ← select this as root
├── somatic_data/
│   ├── mutation_loads.tsv
│   ├── patient_001/
│   │   ├── somatic_variants.tsv
│   │   ├── tumor.bam  +  tumor.bam.bai
│   │   └── normal.bam  +  normal.bam.bai
│   └── patient_002/ ...
├── germline_data/
│   ├── patient_001/
│   │   ├── germline_variants.tsv
│   │   └── normal.bam  +  normal.bam.bai
│   └── patient_002/ ...
├── fusion_data/
│   ├── patient_001/
│   │   ├── fusions.tsv
│   │   ├── fusion.bam  +  fusion.bam.bai
│   │   ├── chimeric.bam  +  chimeric.bam.bai
│   │   ├── arriba_report.pdf
│   │   └── arriba_report.tsv
│   └── patient_002/ ...
└── expression_data/
    ├── patient_001/
    │   └── expression.tsv
    └── patient_002/ ...

Option B — BAM files stored separately from analysis results

Common when primary (alignment) and secondary (variant/fusion calling) outputs live in separate trees.

project_root/                           ← select this as root
├── alignments/
│   ├── DNA/
│   │   ├── patient_001/
│   │   │   ├── tumor.bam  +  tumor.bam.bai
│   │   │   └── normal.bam  +  normal.bam.bai
│   │   └── patient_002/ ...
│   └── RNA/
│       ├── patient_001/
│       │   ├── fusion.bam  +  fusion.bam.bai
│       │   └── chimeric.bam  +  chimeric.bam.bai
│       └── patient_002/ ...
└── results/
    ├── somatic_data/
    │   ├── mutation_loads.tsv
    │   ├── patient_001/ somatic_variants.tsv
    │   └── patient_002/ ...
    ├── germline_data/
    │   ├── patient_001/ germline_variants.tsv
    │   └── patient_002/ ...
    ├── fusion_data/
    │   ├── patient_001/
    │   │   ├── fusions.tsv
    │   │   ├── arriba_report.pdf
    │   │   └── arriba_report.tsv
    │   └── patient_002/ ...
    └── expression_data/
        ├── patient_001/ expression.tsv
        └── patient_002/ ...

Option C — all files flat per patient

Works for smaller projects or when a single pipeline writes everything into one folder per patient.

project_root/                           ← select this as root
├── mutation_loads.tsv                  ← in root (shared across patients)
├── patient_001/
│   ├── somatic_variants.tsv
│   ├── germline_variants.tsv
│   ├── fusions.tsv
│   ├── tumor.bam  +  tumor.bam.bai
│   ├── normal.bam  +  normal.bam.bai
│   ├── fusion.bam  +  fusion.bam.bai
│   ├── chimeric.bam  +  chimeric.bam.bai
│   ├── arriba_report.pdf
│   ├── arriba_report.tsv
│   └── expression.tsv
└── patient_002/ ...

Required Columns

Each data file must contain a set of required columns. Column names are case-sensitive. Any additional columns in your file are allowed — they will appear in the table but without custom formatting or labels.

Somatic Variant File

| Column | Type | Description |
|---|---|---|
| var_name | string | Variant identifier |
| gene_symbol | string | Gene symbol |
| tumor_variant_freq | numeric | Variant allele frequency in tumour |
| tumor_depth | integer | Sequencing depth at variant position in tumour |
| gene_region | string | Genomic region (e.g. exon, intron, splice) |
| gnomAD_NFE | numeric | gnomAD Non-Finnish European allele frequency |
| consequence | string | Variant consequence (e.g. missense_variant, stop_gained) |
| HGVSc | string | HGVS coding sequence notation |
| HGVSp | string | HGVS protein sequence notation |
| variant_type | string | Variant type (e.g. SNV, insertion, deletion) |
| all_full_annot_name | string | Full annotation name |
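
Because column names are case-sensitive, it can save time to check a file's header before uploading. The following is a minimal sketch for .tsv files using Python's csv module; the function name missing_columns is hypothetical, and .xlsx files would need a separate reader.

```python
import csv

# Required somatic columns as listed in the table above.
SOMATIC_REQUIRED = [
    "var_name", "gene_symbol", "tumor_variant_freq", "tumor_depth",
    "gene_region", "gnomAD_NFE", "consequence", "HGVSc", "HGVSp",
    "variant_type", "all_full_annot_name",
]


def missing_columns(tsv_path, required):
    """Return required column names absent from the TSV header (exact, case-sensitive match)."""
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle, delimiter="\t"), [])
    return [column for column in required if column not in header]
```

A header containing hgvsc instead of HGVSc would be reported as missing HGVSc, which mirrors the case-sensitivity rule.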

Optional columns recognised with custom labels:

| Column | Description |
|---|---|
| in_library | Number of times var_name was observed in the project cohort (added automatically if absent) |
| clinvar_sig | ClinVar clinical significance |
| clinvar_DBN | ClinVar disease name |
| CGC_Somatic | Cancer Gene Census somatic annotation |
| fOne | fOne database annotation |
| COSMIC | COSMIC database annotation |
| HGMD | HGMD database annotation |
| snpDB | dbSNP annotation |
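
If you prefer to supply in_library yourself, a cohort count can be derived from var_name. The sketch below counts the number of patients carrying each variant; whether the application counts patients or individual rows is not stated here, so treat that choice, the function name in_library_counts, and the placeholder variant names as assumptions.

```python
from collections import Counter


def in_library_counts(variants_by_patient):
    """Count, for each var_name, how many patients carry it."""
    counts = Counter()
    for var_names in variants_by_patient.values():
        counts.update(set(var_names))  # set(): count each patient at most once
    return counts


counts = in_library_counts({
    "P001": ["var_A", "var_B"],  # placeholder variant names
    "P002": ["var_A"],
})
print(counts["var_A"], counts["var_B"])  # 2 1
```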

TMB File (mutation_loads)

| Column | Type | Description |
|---|---|---|
| patient | string | Patient ID, must match the IDs entered in the upload form |
| TMB | numeric | Tumour mutational burden value |

Example:

| patient | TMB |
|---|---|
| P001 | 0.17 |
| P002 | 0.74 |
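
Since the patient column must match the IDs entered in the upload form, a quick cross-check can catch mismatches early. This is a minimal sketch for a tab-separated mutation_loads file; the function name tmb_for_patients is hypothetical.

```python
import csv


def tmb_for_patients(tmb_path, patient_ids):
    """Read a mutation_loads TSV and report upload-form IDs that have no TMB row."""
    with open(tmb_path, newline="", encoding="utf-8") as handle:
        rows = csv.DictReader(handle, delimiter="\t")
        tmb = {row["patient"]: float(row["TMB"]) for row in rows}
    missing = [pid for pid in patient_ids if pid not in tmb]
    return tmb, missing
```

With the two example rows above, calling it with the IDs P001 and P003 would list P003 as missing.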

Germline Variant File

| Column | Type | Description |
|---|---|---|
| var_name | string | Variant identifier |
| gene_symbol | string | Gene symbol |
| variant_freq | numeric | Variant allele frequency |
| coverage_depth | integer | Sequencing depth at variant position |
| gene_region | string | Genomic region (e.g. exon, intron, splice) |
| gnomAD_NFE | numeric | gnomAD Non-Finnish European allele frequency |
| clinvar_sig | string | ClinVar clinical significance |
| consequence | string | Variant consequence |
| HGVSc | string | HGVS coding sequence notation |
| HGVSp | string | HGVS protein sequence notation |
| variant_type | string | Variant type (e.g. SNV, insertion, deletion) |
| all_full_annot_name | string | Full annotation name |

Optional columns:

| Column | Description |
|---|---|
| in_library | Number of times var_name was observed in the project cohort (added automatically if absent) |
| clinvar_DBN | ClinVar disease name |
| CGC_Germline | Cancer Gene Census germline annotation |
| trusight_genes | TruSight gene panel annotation |
| fOne | fOne database annotation |
| snpDB | dbSNP annotation |

Fusion Gene File

This is a pre-processed file combining results from both Arriba and STARFusion. Column names chrom1/chrom2 are accepted as alternatives to chr1/chr2 and will be renamed automatically.

| Column | Type | Description |
|---|---|---|
| gene1 | string | First fusion partner gene name |
| gene2 | string | Second fusion partner gene name |
| chr1 (or chrom1) | string | Chromosome of first partner (with chr prefix, e.g. chr2) |
| chr2 (or chrom2) | string | Chromosome of second partner |
| pos1 | numeric | Genomic position of first partner breakpoint |
| pos2 | numeric | Genomic position of second partner breakpoint |
| strand1 | string | Strand orientation of first partner (+ or -) |
| strand2 | string | Strand orientation of second partner (+ or -) |
| arriba.called | boolean | Whether Arriba called this fusion |
| starfus.called | boolean | Whether STARFusion called this fusion |
| overall_support | numeric | Overall read support for the fusion |
| arriba.confidence | string | Arriba confidence level (high, medium, or low) |
| arriba.site1 | string | Breakpoint site annotation for first partner |
| arriba.site2 | string | Breakpoint site annotation for second partner |
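
The automatic renaming of chrom1/chrom2 amounts to a simple alias lookup on the header. The sketch below illustrates the idea only; the names COLUMN_ALIASES and normalise_fusion_header are hypothetical and this is not the application's own code.

```python
# Accepted alternative names and the column names they are renamed to.
COLUMN_ALIASES = {"chrom1": "chr1", "chrom2": "chr2"}


def normalise_fusion_header(columns):
    """Rename chrom1/chrom2 to chr1/chr2 and leave every other column unchanged."""
    return [COLUMN_ALIASES.get(column, column) for column in columns]


print(normalise_fusion_header(["gene1", "gene2", "chrom1", "chrom2", "pos1", "pos2"]))
# ['gene1', 'gene2', 'chr1', 'chr2', 'pos1', 'pos2']
```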

Optional columns:

| Column | Description |
|---|---|
| DB_count | Number of databases listing this fusion |
| DB_list | Database names |
| arriba.split_reads | Split reads supporting the fusion (Arriba) |
| arriba.discordant_mates | Discordant mate pairs (Arriba) |
| arriba.break_coverage | Coverage at first breakpoint (Arriba) |
| arriba.break2_coverage | Coverage at second breakpoint (Arriba) |
| arriba.break_seq | Sequence at breakpoint (Arriba) |
| starfus.split_reads | Split reads supporting the fusion (STARFusion) |
| starfus.discordant_mates | Discordant mate pairs (STARFusion) |
| starfus.counter_fusion1 | Counter fusion reads for first gene (STARFusion) |
| starfus.counter_fusion2 | Counter fusion reads for second gene (STARFusion) |
| starfus.splice_type | Splice junction type (STARFusion) |
| starfus.break_seq | Breakpoint sequence (STARFusion) |

Expression Profile File

| Column | Type | Description |
|---|---|---|
| sample | string | Sample ID |
| feature_name | string | Gene name |
| geneid | string | Gene ID (Ensembl, RefSeq, or other) |
| all_kegg_gene_names | string | KEGG gene names |
| log2FC | numeric | Log2 fold change (tumour vs. reference) |
| p_value | numeric | P-value |
| p_adj | numeric | Adjusted p-value |

Optional columns:

| Column | Description |
|---|---|
| pathway | Pathway name (retrieved automatically from KEGG if absent) |
| refseq_id | RefSeq gene ID |
| type | Gene type |
| gene_definition | Gene description |
| num_of_paths | Number of pathways the gene participates in |

Genes of Interest (GOI)

SeqUIaSCOPE ships with a built-in genes of interest list used for expression highlighting and network visualisation. To use a custom list instead, provide a .tsv or .xlsx file and update the path in reference_paths.json (see Configuration).

| Column | Required | Description |
|---|---|---|
| gene | required | Gene name |
| pathway | optional | Pathway name (retrieved from KEGG automatically if absent) |

If the pathway column is provided, your own pathway classifications are used instead of KEGG.

Example:

| gene | pathway |
|---|---|
| BRCA1 | DNA damage/repair |
| TP53 | RTK Signalling |