Coverage analysis
Initialize a SeQuiLaSession and download the sample data (see the Initialize section for details).
[1]:
%run initialize.ipynb
coverage
coverage(tableName: String, sampleId: String, refPath: String)
Compute read coverage from alignment data.
input parameters
tableName - registered table over alignment files (see File formats for details)
sampleId - name of the sample, corresponding to the name of the alignment file without its file extension
refPath - path to the reference file
returned columns
Block-level or base-level coverage with the following columns:
sample_id - name of the sample, corresponding to the name of the alignment file without its file extension
contig - contig name
pos_start - start position of a block
pos_end - end position of a block
coverage - depth of coverage for a block or single position
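To make the idea of a block concrete, here is a minimal sketch of how per-base depths can be collapsed into runs of equal coverage. It is an illustration only, not SeQuiLa's implementation: the function name `depths_to_blocks` and the input depths are hypothetical, and positions are assumed to be 1-based and inclusive.

```python
from itertools import groupby


def depths_to_blocks(contig, start, depths):
    """Collapse per-base depths into (contig, pos_start, pos_end, coverage) blocks.

    `start` is the position of depths[0]; block ends are inclusive (assumption).
    """
    blocks = []
    pos = start
    for depth, run in groupby(depths):
        length = sum(1 for _ in run)  # number of consecutive bases with this depth
        blocks.append((contig, pos, pos + length - 1, depth))
        pos += length
    return blocks


# Hypothetical per-base depths for positions 34..40 of contig "1"
per_base = [1, 2, 3, 3, 4, 4, 4]
blocks = depths_to_blocks("1", 34, per_base)
for block in blocks:
    print(block)
```

A block whose pos_start equals its pos_end describes a single position, which is why the coverage column applies to both block-level and base-level output.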
[2]:
ss.sql(f"SELECT * FROM coverage('{table_name}','{sample_id}', '{ref_path}') LIMIT 5").toPandas()
[2]:
| | contig | pos_start | pos_end | ref | coverage |
|---|---|---|---|---|---|
| 0 | 1 | 34 | 34 | R | 1 |
| 1 | 1 | 35 | 35 | R | 2 |
| 2 | 1 | 36 | 37 | R | 3 |
| 3 | 1 | 38 | 40 | R | 4 |
| 4 | 1 | 41 | 49 | R | 5 |
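Block output like this can be summarized without expanding it to single positions. The sketch below computes the number of covered bases and the length-weighted mean depth for the five rows shown above. The helper `mean_depth` is hypothetical, and the calculation assumes that pos_start and pos_end are both inclusive.

```python
# Rows copied from the output above: (contig, pos_start, pos_end, coverage)
rows = [
    ("1", 34, 34, 1),
    ("1", 35, 35, 2),
    ("1", 36, 37, 3),
    ("1", 38, 40, 4),
    ("1", 41, 49, 5),
]


def mean_depth(rows):
    """Return (covered_bases, mean_depth), weighting each block by its length."""
    total_bases = sum(end - start + 1 for _, start, end, _ in rows)
    total_depth = sum((end - start + 1) * cov for _, start, end, cov in rows)
    return total_bases, total_depth / total_bases


bases, depth = mean_depth(rows)
print(bases, depth)
```

Weighting by block length matters here: a plain average of the coverage column would treat the 9-base block at 41-49 the same as the single-base block at 34.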
To include positions (organized in blocks or at base level) with a depth of coverage equal to 0, set the following parameter:
[3]:
ss.sql("SET spark.biodatageeks.coverage.allPositions=true")
ss.sql(f"SELECT * FROM coverage('{table_name}','{sample_id}', '{ref_path}') LIMIT 5").toPandas()
[3]:
| | contig | pos_start | pos_end | ref | coverage |
|---|---|---|---|---|---|
| 0 | 1 | 34 | 34 | R | 1 |
| 1 | 1 | 35 | 35 | R | 2 |
| 2 | 1 | 36 | 37 | R | 3 |
| 3 | 1 | 38 | 40 | R | 4 |
| 4 | 1 | 41 | 49 | R | 5 |
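As a rough illustration of what including zero-coverage positions means conceptually, the sketch below fills the gaps between covered blocks with blocks of coverage 0. It is not SeQuiLa's code: the function `fill_zero_blocks`, the input blocks, and the contig bounds (1 to 60) are hypothetical, and coordinates are assumed to be inclusive.

```python
def fill_zero_blocks(blocks, contig_start, contig_end):
    """Insert zero-coverage blocks into the gaps between covered blocks.

    `blocks` is a list of (pos_start, pos_end, coverage) tuples that do not overlap.
    """
    filled = []
    cursor = contig_start  # first position not yet accounted for
    for pos_start, pos_end, cov in sorted(blocks):
        if pos_start > cursor:
            filled.append((cursor, pos_start - 1, 0))  # uncovered gap
        filled.append((pos_start, pos_end, cov))
        cursor = pos_end + 1
    if cursor <= contig_end:
        filled.append((cursor, contig_end, 0))  # uncovered tail of the contig
    return filled


# Hypothetical covered blocks on a contig spanning positions 1..60
covered = [(34, 34, 1), (35, 35, 2), (41, 49, 5)]
filled = fill_zero_blocks(covered, 1, 60)
for block in filled:
    print(block)
```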
For more details on other coverage-related parameters, including read filtering, please refer to the SeQuiLa documentation.
[4]:
ss.stop()