Coverage analysis

Initialize the SeQuiLaSession and download sample data (see the Initialize section for details).

[1]:
%run initialize.ipynb

coverage

coverage(tableName: String, sampleId: String, refPath: String)

Computes read coverage from alignment data.

input parameters

  • tableName - registered table over alignment files (see File formats for details)

  • sampleId - name of the sample, corresponding to the name of the alignment file without its file extension

  • refPath - path to the reference file
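
As a small illustration of the sampleId convention above, the sketch below derives a sample name from an alignment file path using only the Python standard library. The path is hypothetical, and dropping only the final extension is an assumption; file names with several dots may need different handling.

```python
from pathlib import Path

# Hypothetical alignment file path, used only for illustration.
alignment_path = "/data/bam/sample1.bam"

# Path.stem drops the directory and the final extension only.
sample_id = Path(alignment_path).stem
print(sample_id)
```

The resulting string is what would be passed as the second argument of coverage.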

returned columns

Block-level or base-level coverage with the following columns:

  • sample_id - name of the sample, corresponding to the name of the alignment file without its file extension

  • contig - contig name

  • pos_start - start position of a block

  • pos_end - end position of a block

  • coverage - depth of coverage for a block or single position
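
To make the relation between blocks and base-level coverage concrete, here is a minimal sketch that expands blocks into one record per position. It assumes pos_start and pos_end are both inclusive; the Block type and the to_base_level helper are hypothetical names introduced for this example and are not part of SeQuiLa.

```python
from typing import Iterable, Iterator, NamedTuple


class Block(NamedTuple):
    contig: str
    pos_start: int  # assumed inclusive
    pos_end: int    # assumed inclusive
    coverage: int


def to_base_level(blocks: Iterable[Block]) -> Iterator[Block]:
    """Yield one single-position record for every position in each block."""
    for b in blocks:
        for pos in range(b.pos_start, b.pos_end + 1):
            yield Block(b.contig, pos, pos, b.coverage)


# Two example blocks on contig "1".
blocks = [Block("1", 36, 37, 3), Block("1", 38, 40, 4)]
base_level = list(to_base_level(blocks))

for record in base_level:
    print(record.contig, record.pos_start, record.coverage)
```

A block therefore stands for a run of adjacent positions that share the same depth, which keeps the result much smaller than a per-position listing.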

[2]:
ss.sql(f"SELECT * FROM coverage('{table_name}','{sample_id}', '{ref_path}') LIMIT 5").toPandas()
[2]:
   contig  pos_start  pos_end  ref  coverage
0       1         34       34    R         1
1       1         35       35    R         2
2       1         36       37    R         3
3       1         38       40    R         4
4       1         41       49    R         5

To include positions (organized in blocks or at base level) with a depth of coverage equal to 0, set the following parameter:

[3]:
ss.sql("SET spark.biodatageeks.coverage.allPositions=true")
ss.sql(f"SELECT * FROM coverage('{table_name}','{sample_id}', '{ref_path}') LIMIT 5").toPandas()
[3]:
   contig  pos_start  pos_end  ref  coverage
0       1         34       34    R         1
1       1         35       35    R         2
2       1         36       37    R         3
3       1         38       40    R         4
4       1         41       49    R         5
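
Conceptually, including zero-coverage positions means that gaps between covered blocks are also reported. The sketch below illustrates that idea in plain Python for a single contig. It is not SeQuiLa's implementation: the fill_zero_blocks helper, the 1-based inclusive coordinates, and the contig length of 60 are all assumptions made for the example.

```python
from typing import List, Tuple

# (pos_start, pos_end, coverage), 1-based and inclusive (assumed).
CovBlock = Tuple[int, int, int]


def fill_zero_blocks(blocks: List[CovBlock], contig_length: int) -> List[CovBlock]:
    """Insert zero-coverage blocks into the gaps of a sorted block list."""
    filled: List[CovBlock] = []
    next_pos = 1  # first position not yet reported
    for start, end, cov in blocks:
        if start > next_pos:
            # Gap before this block: report it with coverage 0.
            filled.append((next_pos, start - 1, 0))
        filled.append((start, end, cov))
        next_pos = end + 1
    if next_pos <= contig_length:
        # Trailing gap up to the end of the contig.
        filled.append((next_pos, contig_length, 0))
    return filled


covered = [(34, 34, 1), (35, 35, 2), (41, 49, 5)]
filled = fill_zero_blocks(covered, contig_length=60)

for block in filled:
    print(block)
```

The actual rows returned with allPositions enabled depend on SeQuiLa and on the reference used, so treat this only as a picture of what a zero-coverage block represents.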

For more details on other coverage-related parameters, including read filtering, please refer to the SeQuiLa documentation.

[4]:
ss.stop()