Pileup analysis
Initialize SeQuiLaSession and download sample data (see the Initialize section for details)
[1]:
%run initialize.ipynb
pileup
pileup(tableName: String, sampleId: String, referencePath: String, includeAlts: Boolean, includeBaseQual: Boolean = False)
Compute a pileup of reads from alignment data
input parameters
tableName - registered table over alignment files (see File formats for details)
sampleId - name of the sample, corresponding to the name of the alignment file without a file extension
referencePath - a local path to the reference in FASTA format (it should be indexed and available on all computing nodes when running in distributed mode). It can also be distributed at runtime using the --files parameter to pyspark-shell
includeAlts - determines whether alts should be included in the output
includeBaseQual - determines whether base qualities should be computed and included in the output (defaults to False). Please note that calculating base qualities has significant impact on performance (see Benchmark page for details).
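As a rough illustration of how these parameters map onto the SQL call used in this notebook, here is a minimal sketch of a query builder. The helper `build_pileup_query` and the sample values (`reads`, `NA12878`, `/data/ref.fasta`) are hypothetical and not part of SeQuiLa; the sketch only formats a string and assumes the two flags are passed as the SQL literals `true` and `false`, in the order listed above.

```python
def build_pileup_query(table_name, sample_id, ref_path,
                       include_alts, include_base_qual=False, limit=None):
    """Hypothetical helper: format a SQL query that calls pileup(...)."""
    for value in (table_name, sample_id, ref_path):
        # Refuse values that would break out of the quoted SQL literal.
        if "'" in str(value):
            raise ValueError(f"single quote not allowed in argument: {value!r}")

    def sql_bool(flag):
        # Booleans are written as the SQL literals true/false.
        return "true" if flag else "false"

    query = (
        f"SELECT * FROM pileup('{table_name}', '{sample_id}', '{ref_path}', "
        f"{sql_bool(include_alts)}, {sql_bool(include_base_qual)})"
    )
    if limit is not None:
        query += f" LIMIT {int(limit)}"
    return query


# Placeholder argument values, for illustration only.
query = build_pileup_query("reads", "NA12878", "/data/ref.fasta",
                           include_alts=True, limit=10)
print(query)
```

The resulting string could then be passed to `ss.sql(...)`. Leaving `include_base_qual` off by default mirrors the performance note above.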
returned columns
Blocks of pileup with the following columns:
sample_id - name of the sample, corresponding to the name of the alignment file without a file extension
contig - contig name
pos_start - start position of a block
pos_end - end position of a block
ref - reference bases covered by a block
coverage - depth of coverage for a block or single position
countRef - number of reads whose base at a given position matches the reference
alts - map of alts in the format (ASCII code of alt: coverage)
quals - map of base qualities in the format (base: Array[qualities])
[2]:
ss.sql(f'''SELECT contig, pos_start, pos_end, ref, coverage, countRef, alts \
FROM pileup('{table_name}', '{sample_id}', '{ref_path}', true, true) LIMIT 10''').toPandas()
[2]:
| | contig | pos_start | pos_end | ref | coverage | countRef | alts |
---|---|---|---|---|---|---|---|
0 | 1 | 34 | 34 | C | 1 | 1 | None |
1 | 1 | 35 | 35 | C | 2 | 2 | None |
2 | 1 | 36 | 37 | CT | 3 | 3 | None |
3 | 1 | 38 | 40 | AAC | 4 | 4 | None |
4 | 1 | 41 | 49 | CCTAACCCT | 5 | 5 | None |
5 | 1 | 50 | 67 | AACCCTAACCCTAACCCT | 6 | 6 | None |
6 | 1 | 68 | 68 | A | 7 | 7 | None |
7 | 1 | 69 | 69 | A | 7 | 6 | {99: 1} |
8 | 1 | 70 | 74 | CCCTA | 7 | 7 | None |
9 | 1 | 75 | 75 | A | 7 | 6 | {99: 1} |
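Because adjacent positions with the same coverage are collapsed into blocks, it can be handy to expand a block back into per-position records. The following is a minimal sketch in plain Python. It assumes, based only on the output above, that `pos_end` is inclusive and that `ref` holds exactly one base per position in the block. The function `expand_block` is hypothetical and not part of SeQuiLa.

```python
def expand_block(contig, pos_start, pos_end, ref, coverage):
    """Yield one (contig, position, ref_base, coverage) tuple per block position."""
    # Assumption: inclusive coordinates, one reference base per position.
    if len(ref) != pos_end - pos_start + 1:
        raise ValueError("ref length does not match the block span")
    for offset, base in enumerate(ref):
        yield (contig, pos_start + offset, base, coverage)


# Two blocks copied from the output above (rows 2 and 3).
blocks = [("1", 36, 37, "CT", 3), ("1", 38, 40, "AAC", 4)]
per_position = [row for block in blocks for row in expand_block(*block)]
for row in per_position:
    print(row)
```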
to_charmap
to_charmap(quals: Map(Base: Short, Array[BaseQuality]))
Convert the binary representation of base qualities into a human-readable map
example:
[3]:
ss.sql(f'''SELECT quals, to_charmap(quals) AS quals_decoded \
FROM pileup('{table_name}', '{sample_id}', '{ref_path}', true, true) LIMIT 10''').toPandas()
[3]:
| | quals | quals_decoded |
---|---|---|
0 | None | None |
1 | None | None |
2 | None | None |
3 | None | None |
4 | None | None |
5 | None | None |
6 | None | None |
7 | {65: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 3, 0, 0], 99: [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]} | {'A': {'C': 1, 'D': 1, '=': 1, 'G': 3}, 'c': {'#': 1}} |
8 | None | None |
9 | {65: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 3, 0, 0], 99: [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]} | {'A': {'@': 1, 'C': 1, 'D': 1, 'G': 3}, 'c': {'#': 1}} |
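The relationship between the two columns can be reproduced in plain Python. The sketch below is inferred from the example output only, not taken from the SeQuiLa implementation: it assumes each array index is a Phred quality score, each array value is the number of reads with that score, and decoded quality characters use the Phred+33 offset. The function `decode_quals` is hypothetical.

```python
PHRED_OFFSET = 33  # assumption: qualities are rendered as Phred+33 characters


def decode_quals(quals):
    """Turn {base_code: [count per quality]} into {base: {quality_char: count}}."""
    if quals is None:
        return None
    decoded = {}
    for base_code, counts in quals.items():
        decoded[chr(base_code)] = {
            chr(quality + PHRED_OFFSET): count
            for quality, count in enumerate(counts)
            if count > 0  # keep only qualities that were actually observed
        }
    return decoded


# Row 7 from the output above, written compactly (41 entries per base).
row7 = {
    65: [0] * 28 + [1] + [0] * 5 + [1, 1, 0, 0, 3, 0, 0],
    99: [0, 0, 1] + [0] * 38,
}
print(decode_quals(row7))
```

Under these assumptions the result matches the `quals_decoded` value shown for row 7.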
to_char
to_char(alts: Map(Alt: Short, coverage: Short))
Convert the binary representation of alts into a human-readable map, with strand information encoded as lower/upper case
example:
[4]:
ss.sql(f'''SELECT alts, alts_to_char(alts) AS alts_decoded \
FROM pileup('{table_name}', '{sample_id}', '{ref_path}', true, true) LIMIT 10''').toPandas()
[4]:
| | alts | alts_decoded |
---|---|---|
0 | None | None |
1 | None | None |
2 | None | None |
3 | None | None |
4 | None | None |
5 | None | None |
6 | None | None |
7 | {99: 1} | {'c': 1} |
8 | None | None |
9 | {99: 1} | {'c': 1} |
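The decoding shown above can be mimicked in plain Python, which is useful once the result has been collected into pandas. This is a minimal sketch that assumes the map keys are ASCII codes, as described in the returned columns. The helpers `decode_alts` and `split_by_case` are hypothetical. The documentation does not say which letter case corresponds to which strand, so the sketch only separates the two groups without naming them.

```python
def decode_alts(alts):
    """Turn {ascii_code: coverage} into {base_char: coverage}."""
    if alts is None:
        return None
    return {chr(code): count for code, count in alts.items()}


def split_by_case(decoded):
    """Separate upper-case and lower-case alts (the two strand encodings)."""
    upper = {base: n for base, n in decoded.items() if base.isupper()}
    lower = {base: n for base, n in decoded.items() if base.islower()}
    return upper, lower


decoded = decode_alts({99: 1})  # row 7 from the output above
print(decoded)
print(split_by_case(decoded))
```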
[5]:
ss.stop()