Data Requirements

SeqUIaSCOPE accepts four types of genomic data: somatic variant calling results, germline variant calling results, fusion gene detection results, and gene expression profiles. Not all four are required — you can upload any combination depending on your analysis needs.

Before uploading, make sure your files follow the naming and format rules described on this page. The application automatically scans your directory for files and uses these rules to identify which file belongs to which patient and dataset type. Errors at upload time are almost always caused by naming issues.

Input Files Overview

Somatic Variant Calling

Required

.tsv — annotated somatic variants

Optional

Tumor DNA .bam + .bai — for IGV

Normal DNA .bam + .bai — for IGV

mutation_loads.tsv — tumour mutational burden

Germline Variant Calling

Required

.tsv — annotated germline variants

Optional

Normal DNA .bam + .bai — for IGV

Fusion Gene Detection

Required

.tsv or .xlsx — combined Arriba + STARFusion results

Optional

Tumor RNA .bam + .bai — for IGV snapshots

Chimeric .bam + .bai — for IGV snapshots

Arriba .pdf + .tsv — for expanded row preview

Expression Profile

Required

.tsv or .xlsx — gene expression data

Optional

One file per tissue if comparing multiple reference tissues

Tip: When you have a choice, always prefer .tsv over .xlsx. TSV files load faster and are less likely to cause formatting issues.

Fusion gene detection requires results from both Arriba and STARFusion as upstream tools. The required input file is a pre-processed table combining the output of both callers. The Arriba .pdf and .tsv output files are separate and optional — they unlock the expanded row preview in the Fusion Genes module.

File Naming Rules

SeqUIaSCOPE identifies files by scanning the full file path (folder names + file name together). It looks for specific keywords anywhere in the path to decide what type each file is. This means a keyword can appear in the filename itself or in a parent folder name — both work.

Keyword Reference

File type	Path must contain	Path must NOT contain
Somatic variant file	`somatic`	—
Germline variant file	`germline`	—
Fusion gene file	`fusion` or `fuze`	`arriba`, `STAR`
Tumor RNA BAM	user-defined pattern (set during upload)	`Chimeric`, `transcriptome`
Chimeric BAM	user-defined pattern (set during upload)	`transcriptome`
Arriba output files	`arriba`	`discarded`, `STAR`
Expression file	`expression` or `RNAseq`	`report`, `genes_of_interest`
TMB file	file name (without extension) must be exactly `mutation_loads`	—

For example, both of the following paths are recognised as somatic variant files:

project_root/patient_001/somatic_variants.tsv    # keyword in filename
project_root/somatic_data/patient_001/variants.tsv  # keyword in folder name
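The keyword table above can be read as a small set of include/exclude rules. The sketch below is one possible way to express them for pre-checking a directory before upload; it is not SeqUIaSCOPE's actual scanner. The function name `classify_path`, the labels it returns, the rule order, and the case-sensitive substring matching are all assumptions made for illustration. BAM files are left out because they are matched by user-defined patterns.

```python
from pathlib import PurePosixPath
from typing import Optional

# (label, path must contain any of, path must not contain any of)
# Keywords are copied from the Keyword Reference table; matching them
# case-sensitively as plain substrings is an assumption of this sketch.
KEYWORD_RULES = [
    ("somatic", ("somatic",), ()),
    ("germline", ("germline",), ()),
    ("fusion", ("fusion", "fuze"), ("arriba", "STAR")),
    ("arriba", ("arriba",), ("discarded", "STAR")),
    ("expression", ("expression", "RNAseq"), ("report", "genes_of_interest")),
]


def classify_path(path: str) -> Optional[str]:
    """Return a hypothetical dataset label for a file path, or None."""
    # The TMB file is identified by its file name alone, so check it first;
    # otherwise a parent folder such as somatic_data/ would claim it.
    if PurePosixPath(path).stem == "mutation_loads":
        return "tmb"
    for label, required, forbidden in KEYWORD_RULES:
        has_keyword = any(word in path for word in required)
        is_excluded = any(word in path for word in forbidden)
        if has_keyword and not is_excluded:
            return label
    return None


if __name__ == "__main__":
    for example in (
        "project_root/patient_001/somatic_variants.tsv",
        "project_root/somatic_data/patient_001/variants.tsv",
        "project_root/fusion_data/patient_001/arriba_results.tsv",
    ):
        print(example, "->", classify_path(example))
```

Running a check like this over your own tree before uploading can reveal files that match no rule, or that match a different rule than you expected.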

Patient ID in path: Every per-patient file must have the patient ID somewhere in its full path (either in the filename or a parent folder). Files without the patient ID will not be matched. The TMB file (mutation_loads) and the Genes of Interest file are the only exceptions — they are shared across all patients and do not need a patient ID in the path.
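To illustrate the patient ID rule, the following sketch groups file paths by the patient ID found in them. It assumes a plain substring match on the full path, and the helper name `group_by_patient` is hypothetical; the application's real matching logic may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_patient(paths: Iterable[str], patient_ids: Iterable[str]) -> Dict[str, List[str]]:
    """Map each patient ID to the paths that contain it (substring match)."""
    ids = list(patient_ids)
    grouped: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        for patient_id in ids:
            # The ID may sit in the file name or in any parent folder.
            if patient_id in path:
                grouped[patient_id].append(path)
    return dict(grouped)


if __name__ == "__main__":
    files = [
        "project_root/somatic_data/patient_001/variants.tsv",
        "project_root/somatic_data/patient_002_somatic.tsv",
        "project_root/somatic_data/mutation_loads.tsv",  # shared, no patient ID
    ]
    print(group_by_patient(files, ["patient_001", "patient_002"]))
```

With substring matching, an ID that is a prefix of another (for example `P1` and `P10`) would match both patients' files, so IDs of equal length with zero padding are the safer choice under this assumption.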

BAM Files

BAM files don’t use the keyword system. Instead, you define patterns during upload (e.g. tumor, FFPE, chimeric) and the application finds BAM files whose names match those patterns. See Upload Data — Step 1 for details.

For every .bam file, a corresponding index file must exist in the same directory, named either file.bam.bai or file.bai.
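The index rule can be checked before upload with a few lines of Python. This is a minimal sketch using only the standard library; `find_bam_index` is a hypothetical helper name, not part of SeqUIaSCOPE.

```python
from pathlib import Path
from typing import Optional


def find_bam_index(bam: Path) -> Optional[Path]:
    """Return the index file next to a BAM file, or None if it is missing."""
    candidates = [
        bam.with_name(bam.name + ".bai"),  # file.bam.bai
        bam.with_suffix(".bai"),           # file.bai
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


def bams_without_index(root: Path) -> list:
    """List every .bam file under root that has no index beside it."""
    return sorted(p for p in root.rglob("*.bam") if find_bam_index(p) is None)
```

Calling `bams_without_index(Path("project_root"))` returns the BAM files that still need an index, which can then be created with a tool such as `samtools index`.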

Arriba Output Files

If you provide Arriba output files, both the .pdf and .tsv must be present as a matched pair (same base name, different extension). A .pdf without a matching .tsv will be ignored, and vice versa.
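The pairing rule can be sketched as follows. The helper name `pair_arriba_outputs` is hypothetical, and the sketch assumes that "same base name" means the same directory and the same file name without its extension.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Iterable, List, Tuple


def pair_arriba_outputs(paths: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (pdf, tsv) pairs that share a directory and base name."""
    by_base = defaultdict(dict)
    for raw in paths:
        path = PurePosixPath(raw)
        extension = path.suffix.lower()
        if extension in (".pdf", ".tsv"):
            # Key on the path without its extension, so only files in the
            # same folder with the same base name end up together.
            by_base[path.with_suffix("")][extension] = raw
    # Keep complete pairs only; unmatched files are dropped.
    return sorted(
        (found[".pdf"], found[".tsv"])
        for found in by_base.values()
        if len(found) == 2
    )


if __name__ == "__main__":
    print(pair_arriba_outputs([
        "patient_001/arriba_results.pdf",
        "patient_001/arriba_results.tsv",
        "patient_002/arriba_results.pdf",  # no matching .tsv, so it is skipped
    ]))
```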

Expression Files with Multiple Tissues

When comparing against multiple reference tissues, provide one file per tissue. The tissue name must appear in the filename (e.g. blood_expression.tsv, liver.tsv). Use underscores instead of spaces — blood_vessel.tsv, not blood vessel.tsv.

Tissue names you enter in the upload form must match the names in the filenames. If no tissue names are provided, the application looks for a single expression file per patient containing expression or RNAseq in its path.
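As a rough illustration of how tissue names relate to file names, the sketch below looks up the files whose name contains a given tissue name. The helper `files_for_tissue` is hypothetical, and replacing spaces with underscores before matching is a convenience added here, not documented application behaviour.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def files_for_tissue(tissue: str, paths: Iterable[str]) -> List[str]:
    """Return the paths whose file name contains the tissue name."""
    # File names must use underscores, so normalise the query the same way.
    key = tissue.strip().replace(" ", "_")
    return [p for p in paths if key in PurePosixPath(p).name]


if __name__ == "__main__":
    candidates = [
        "expression_data/patient_001/blood_vessel.tsv",
        "expression_data/patient_001/liver.tsv",
    ]
    print(files_for_tissue("blood vessel", candidates))
```

Because this is a substring match, a tissue called `blood` would also match `blood_vessel.tsv`; choosing tissue names that are not contained in one another avoids that ambiguity in this sketch.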

Directory Layouts

SeqUIaSCOPE is flexible — your data can be organised in many different ways as long as the naming rules above are respected. Below are three common layouts that all work correctly. The key principle is that the patient ID and dataset keyword must both appear somewhere in each file’s path.

Choose your root directory carefully. In the upload form you select a single root directory that contains all your patient data. Pick the most specific folder that still contains everything — selecting a very broad directory (e.g. your home folder) may cause the scanner to pick up unrelated files and confuse file matching.

Option A — one folder per data type

Best when your pipeline already separates outputs by analysis type.

project_root/                           ← select this as root
├── somatic_data/
│   ├── mutation_loads.tsv
│   ├── patient_001/
│   │   ├── somatic_variants.tsv
│   │   ├── tumor.bam  +  tumor.bam.bai
│   │   └── normal.bam  +  normal.bam.bai
│   └── patient_002/ ...
├── germline_data/
│   ├── patient_001/
│   │   ├── germline_variants.tsv
│   │   └── normal.bam  +  normal.bam.bai
│   └── patient_002/ ...
├── fusion_data/
│   ├── patient_001/
│   │   ├── fusions.tsv
│   │   ├── fusion.bam  +  fusion.bam.bai
│   │   ├── chimeric.bam  +  chimeric.bam.bai
│   │   ├── arriba_results.pdf
│   │   └── arriba_results.tsv
│   └── patient_002/ ...
└── expression_data/
    ├── patient_001/
    │   └── expression.tsv
    └── patient_002/ ...

Option B — BAM files stored separately from analysis results

Common when primary (alignment) and secondary (variant/fusion calling) outputs live in separate trees.

project_root/                           ← select this as root
├── alignments/
│   ├── DNA/
│   │   ├── patient_001/
│   │   │   ├── tumor.bam  +  tumor.bam.bai
│   │   │   └── normal.bam  +  normal.bam.bai
│   │   └── patient_002/ ...
│   └── RNA/
│       ├── patient_001/
│       │   ├── fusion.bam  +  fusion.bam.bai
│       │   └── chimeric.bam  +  chimeric.bam.bai
│       └── patient_002/ ...
└── results/
    ├── somatic_data/
    │   ├── mutation_loads.tsv
    │   ├── patient_001/ somatic_variants.tsv
    │   └── patient_002/ ...
    ├── germline_data/
    │   ├── patient_001/ germline_variants.tsv
    │   └── patient_002/ ...
    ├── fusion_data/
    │   ├── patient_001/
    │   │   ├── fusions.tsv
    │   │   ├── arriba_results.pdf
    │   │   └── arriba_results.tsv
    │   └── patient_002/ ...
    └── expression_data/
        ├── patient_001/ expression.tsv
        └── patient_002/ ...

Option C — all files flat per patient

Works for smaller projects or when a single pipeline writes everything into one folder per patient.

project_root/                           ← select this as root
├── mutation_loads.tsv                  ← in root (shared across patients)
├── patient_001/
│   ├── somatic_variants.tsv
│   ├── germline_variants.tsv
│   ├── fusions.tsv
│   ├── tumor.bam  +  tumor.bam.bai
│   ├── normal.bam  +  normal.bam.bai
│   ├── fusion.bam  +  fusion.bam.bai
│   ├── chimeric.bam  +  chimeric.bam.bai
│   ├── arriba_results.pdf
│   ├── arriba_results.tsv
│   └── expression.tsv
└── patient_002/ ...

Required Columns

Each data file must contain a set of required columns. Column names are case-sensitive. Any additional columns in your file are allowed — they will appear in the table but without custom formatting or labels.
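Because column names are case-sensitive, a quick header check before upload can catch mismatches such as `hgvsc` in place of `HGVSc`. The sketch below uses only the standard library and takes the somatic variant columns as its example; `missing_columns` and `SOMATIC_REQUIRED` are illustrative names, not part of the application.

```python
import csv
from typing import Iterable, List, TextIO

# Required columns of the somatic variant file, as listed on this page.
SOMATIC_REQUIRED = [
    "var_name", "gene_symbol", "tumor_variant_freq", "tumor_depth",
    "gene_region", "gnomAD_NFE", "consequence", "HGVSc", "HGVSp",
    "variant_type", "all_full_annot_name",
]


def missing_columns(handle: TextIO, required: Iterable[str]) -> List[str]:
    """Return the required columns absent from a TSV header (case-sensitive)."""
    # Only the first row is read; an empty file yields an empty header.
    header = next(csv.reader(handle, delimiter="\t"), [])
    present = set(header)
    return [column for column in required if column not in present]
```

A typical call opens the file with `open(path, newline="", encoding="utf-8")` and passes the handle together with the column list for that file type; an empty result means all required columns are present.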

Somatic Variant File

Column	Type	Description
`var_name`	string	Variant identifier
`gene_symbol`	string	Gene symbol
`tumor_variant_freq`	numeric	Variant allele frequency in tumour
`tumor_depth`	integer	Sequencing depth at variant position in tumour
`gene_region`	string	Genomic region (e.g. `exon`, `intron`, `splice`)
`gnomAD_NFE`	numeric	gnomAD Non-Finnish European allele frequency
`consequence`	string	Variant consequence (e.g. `missense_variant`, `stop_gained`)
`HGVSc`	string	HGVS coding sequence notation
`HGVSp`	string	HGVS protein sequence notation
`variant_type`	string	Variant type (e.g. `SNV`, `insertion`, `deletion`)
`all_full_annot_name`	string	Full annotation name

Optional columns recognised with custom labels:

Column	Description
`in_library`	Number of times `var_name` was observed in the project cohort (added automatically if absent)
`clinvar_sig`	ClinVar clinical significance
`clinvar_DBN`	ClinVar disease name
`CGC_Somatic`	Cancer Gene Census somatic annotation
`fOne`	fOne database annotation
`COSMIC`	COSMIC database annotation
`HGMD`	HGMD database annotation
`snpDB`	dbSNP annotation

TMB File (`mutation_loads`)

Column	Type	Description
`patient`	string	Patient ID — must match the IDs entered in the upload form
`TMB`	numeric	Tumour mutational burden value

patient	TMB
P001	0.17
P002	0.74

Germline Variant File

Column	Type	Description
`var_name`	string	Variant identifier
`gene_symbol`	string	Gene symbol
`variant_freq`	numeric	Variant allele frequency
`coverage_depth`	integer	Sequencing depth at variant position
`gene_region`	string	Genomic region (e.g. `exon`, `intron`, `splice`)
`gnomAD_NFE`	numeric	gnomAD Non-Finnish European allele frequency
`clinvar_sig`	string	ClinVar clinical significance
`consequence`	string	Variant consequence
`HGVSc`	string	HGVS coding sequence notation
`HGVSp`	string	HGVS protein sequence notation
`variant_type`	string	Variant type (e.g. `SNV`, `insertion`, `deletion`)
`all_full_annot_name`	string	Full annotation name

Optional columns:

Column	Description
`in_library`	Number of times `var_name` was observed in the project cohort (added automatically if absent)
`clinvar_DBN`	ClinVar disease name
`CGC_Germline`	Cancer Gene Census germline annotation
`trusight_genes`	TruSight gene panel annotation
`fOne`	fOne database annotation
`snpDB`	dbSNP annotation

Fusion Gene File

This is a pre-processed file combining results from both Arriba and STARFusion. Column names chrom1/chrom2 are accepted as alternatives to chr1/chr2 and will be renamed automatically.

Column	Type	Description
`gene1`	string	First fusion partner gene name
`gene2`	string	Second fusion partner gene name
`chr1` (or `chrom1`)	string	Chromosome of first partner (with `chr` prefix, e.g. `chr2`)
`chr2` (or `chrom2`)	string	Chromosome of second partner
`pos1`	numeric	Genomic position of first partner breakpoint
`pos2`	numeric	Genomic position of second partner breakpoint
`strand1`	string	Strand orientation of first partner (`+` or `-`)
`strand2`	string	Strand orientation of second partner (`+` or `-`)
`arriba.called`	boolean	Whether Arriba called this fusion
`starfus.called`	boolean	Whether STARFusion called this fusion
`overall_support`	numeric	Overall read support for the fusion
`arriba.confidence`	string	Arriba confidence level (`high`, `medium`, or `low`)
`arriba.site1`	string	Breakpoint site annotation for first partner
`arriba.site2`	string	Breakpoint site annotation for second partner

Optional columns:

Column	Description
`DB_count`	Number of databases listing this fusion
`DB_list`	Database names
`arriba.split_reads`	Split reads supporting the fusion (Arriba)
`arriba.discordant_mates`	Discordant mate pairs (Arriba)
`arriba.break_coverage`	Coverage at first breakpoint (Arriba)
`arriba.break2_coverage`	Coverage at second breakpoint (Arriba)
`arriba.break_seq`	Sequence at breakpoint (Arriba)
`starfus.split_reads`	Split reads supporting the fusion (STARFusion)
`starfus.discordant_mates`	Discordant mate pairs (STARFusion)
`starfus.counter_fusion1`	Counter fusion reads for first gene (STARFusion)
`starfus.counter_fusion2`	Counter fusion reads for second gene (STARFusion)
`starfus.splice_type`	Splice junction type (STARFusion)
`starfus.break_seq`	Breakpoint sequence (STARFusion)
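The automatic renaming of `chrom1`/`chrom2` described at the start of this section amounts to a simple alias lookup. The following sketch shows the idea on a list of header names; `normalise_fusion_header` is a hypothetical helper and does not reflect how SeqUIaSCOPE implements the step internally.

```python
from typing import Iterable, List

# Accepted alternative column names and the names they are renamed to.
COLUMN_ALIASES = {"chrom1": "chr1", "chrom2": "chr2"}


def normalise_fusion_header(columns: Iterable[str]) -> List[str]:
    """Rename accepted aliases; leave every other column name unchanged."""
    return [COLUMN_ALIASES.get(name, name) for name in columns]


if __name__ == "__main__":
    print(normalise_fusion_header(["gene1", "gene2", "chrom1", "chrom2", "pos1"]))
```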

Expression Profile File

Column	Type	Description
`sample`	string	Sample ID
`feature_name`	string	Gene name
`geneid`	string	Gene ID (Ensembl, RefSeq, or other)
`all_kegg_gene_names`	string	KEGG gene names
`log2FC`	numeric	Log2 fold change (tumour vs. reference)
`p_value`	numeric	P-value
`p_adj`	numeric	Adjusted p-value

Optional columns:

Column	Description
`pathway`	Pathway name (retrieved automatically from KEGG if absent)
`refseq_id`	RefSeq gene ID
`type`	Gene type
`gene_definition`	Gene description
`num_of_paths`	Number of pathways the gene participates in

Genes of Interest (GOI)

SeqUIaSCOPE ships with a built-in genes of interest list used for expression highlighting and network visualisation. To use a custom list instead, provide a .tsv or .xlsx file and update the path in reference_paths.json (see Configuration).

Column	Required	Description
`gene`	✅	Gene name
`pathway`	optional	Pathway name — retrieved from KEGG automatically if absent

If the pathway column is provided, your own pathway classifications are used instead of KEGG.

gene	pathway
BRCA1	DNA damage/repair
TP53	RTK Signalling
