Genomic intervals

Initialize SeQuiLaSession and download sample data (check Initialize section for details)

[1]:
%run initialize.ipynb

Create an additional table with features

[2]:
ss.sql(f'''
    CREATE TABLE IF NOT EXISTS targets(contig String,pos_start Integer,pos_end Integer) \
    USING csv \
    OPTIONS (path "file:///{bed_path}", delimiter "\t")''')

[2]:
DataFrame[]
[3]:
ss.sql('SELECT replace(contig,"chr","") AS contig, pos_start, pos_end FROM targets LIMIT 5').toPandas()
[3]:
contig pos_start pos_end
0 1 4505 4745
1 1 4806 4926
2 1 5614 5853
3 1 6462 6582
4 1 6697 6937

Run an interval join query with grouping between reads and targets tables

[4]:
ss.sql(f'''
SELECT replace(targets.contig,"chr","") AS contig,targets.pos_start,targets.pos_end, count(*) AS cnt FROM reads JOIN targets
     ON (replace(targets.contig,"chr","") = reads.contig
     AND
     CAST(reads.pos_end AS INTEGER)>=CAST(targets.pos_start AS INTEGER)
     AND
     CAST(reads.pos_start AS INTEGER)<=CAST(targets.pos_end AS INTEGER)
     )
     GROUP BY replace(targets.contig,"chr",""),targets.pos_start,targets.pos_end LIMIT 5''').toPandas()
[4]:
contig pos_start pos_end cnt
0 1 4806 4926 16
1 1 5614 5853 7
2 1 7074 7194 13
3 1 6697 6937 18
4 1 6462 6582 5
[ ]: