Pileup analysis

[1]:

%run initialize.ipynb


pileup

pileup(tableName: String, sampleId: String, referencePath: String, includeAlts: Boolean, includeBaseQual: Boolean = False)

input parameters

• tableName - registered table over alignment files (see File formats for details)

• sampleId - name of the sample, corresponding to the name of the alignment file without its file extension

• referencePath - a local path to the reference in FASTA format (it should be indexed and available on all computing nodes when running in distributed mode). The file can also be distributed at runtime using the --files parameter to pyspark-shell

• includeAlts - determines whether alts should be included in the output

• includeBaseQual - determines whether base qualities should be computed and included in the output (defaults to False). Please note that calculating base qualities has a significant impact on performance (see Benchmark page for details).
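The parameters above map positionally onto the SQL call. A minimal sketch of a helper that assembles the query string, assuming the five-argument call form used in this notebook's examples. `build_pileup_query` is a hypothetical helper, not part of the library, and the table name, sample name, and reference path are placeholders:

```python
def build_pileup_query(table_name, sample_id, ref_path,
                       include_alts=False, include_base_qual=False,
                       columns="*", limit=None):
    """Assemble a SQL string that calls the pileup table function.

    Hypothetical helper for illustration only. The argument order follows
    the examples in this notebook: alts flag first, then base-quality flag.
    """
    # Render the flags as lowercase true/false, as in the notebook examples
    flags = ", ".join(str(bool(f)).lower()
                      for f in (include_alts, include_base_qual))
    query = (
        f"SELECT {columns} "
        f"FROM pileup('{table_name}', '{sample_id}', '{ref_path}', {flags})"
    )
    if limit is not None:
        query += f" LIMIT {int(limit)}"
    return query


# Placeholder values; substitute your registered table, sample and reference
query = build_pileup_query("reads", "sample1", "/data/ref.fasta",
                           include_alts=True, limit=10)
print(query)
```

The resulting string can be passed to ss.sql(...) in the same way as the hand-written queries in this notebook. Leaving include_base_qual off unless quals are needed avoids the performance cost mentioned above.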

returned columns

Blocks of pileup with the following columns:

• sample_id - name of the sample, corresponding to the name of the alignment file without its file extension

• contig - contig name

• pos_start - start position of a block

• pos_end - end position of a block

• ref - reference bases covered by the block

• coverage - depth of coverage for a block or single position

• countRef - number of reads whose base at a given position equals the reference base

• alts - map of alts in format (ASCII code of alt: coverage)

• quals - map of base qualities in format (ASCII code of base: array of read counts indexed by base quality)
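Because adjacent positions with identical values are collapsed into blocks, per-position analysis may need the blocks expanded again. A minimal sketch in plain Python, assuming pos_end is inclusive and the ref string holds one base per position in the block, as the example output in this notebook suggests. `expand_block` is a hypothetical helper, not a library function:

```python
def expand_block(contig, pos_start, pos_end, ref, coverage):
    """Yield one (contig, position, ref_base, coverage) tuple per position."""
    # Assumption: the block spans pos_start..pos_end inclusive
    if len(ref) != pos_end - pos_start + 1:
        raise ValueError("ref length does not match block span")
    for offset, base in enumerate(ref):
        yield (contig, pos_start + offset, base, coverage)


# Block from this notebook's example output: contig 1, 38-40, ref AAC, coverage 4
rows = list(expand_block("1", 38, 40, "AAC", 4))
print(rows)
```

For large results it is usually preferable to do this kind of expansion inside Spark rather than after toPandas(); the sketch only illustrates how a block relates to its positions.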

[2]:

ss.sql(f'''SELECT contig, pos_start, pos_end, ref, coverage, countRef, alts \
FROM  pileup('{table_name}', '{sample_id}', '{ref_path}', true, true) LIMIT 10''').toPandas()

[2]:

   contig  pos_start  pos_end  ref                 coverage  countRef  alts
0  1       34         34       C                   1         1         None
1  1       35         35       C                   2         2         None
2  1       36         37       CT                  3         3         None
3  1       38         40       AAC                 4         4         None
4  1       41         49       CCTAACCCT           5         5         None
5  1       50         67       AACCCTAACCCTAACCCT  6         6         None
6  1       68         68       A                   7         7         None
7  1       69         69       A                   7         6         {99: 1}
8  1       70         74       CCCTA               7         7         None
9  1       75         75       A                   7         6         {99: 1}

to_charmap

to_charmap(quals: Map(Base: Short, Array[BaseQuality]))

example:

[3]:

ss.sql(f'''SELECT quals, to_charmap(quals) AS quals_decoded \
FROM  pileup('{table_name}', '{sample_id}', '{ref_path}', true, true) LIMIT 10''').toPandas()

[3]:

quals quals_decoded
0 None None
1 None None
2 None None
3 None None
4 None None
5 None None
6 None None
7 {65: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 3, 0, 0], 99: [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]} {'A': {'C': 1, 'D': 1, '=': 1, 'G': 3}, 'c': {'#': 1}}
8 None None
9 {65: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 3, 0, 0], 99: [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]} {'A': {'@': 1, 'C': 1, 'D': 1, 'G': 3}, 'c': {'#': 1}}
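The decoding shown above can be reproduced in plain Python. The sketch below is inferred from the example output only, not from the library source: it assumes that each map key is the ASCII code of a base, that the array index is the base quality, that the array value is the number of reads with that quality, and that quality characters use a Phred+33 offset. `decode_quals` and `counts_array` are hypothetical helpers:

```python
PHRED_OFFSET = 33  # assumption: Phred+33, consistent with the example output


def decode_quals(quals):
    """Turn {base_code: [count per quality]} into {base: {quality_char: count}}."""
    if quals is None:
        return None
    decoded = {}
    for base_code, counts in quals.items():
        # Keep only qualities that were actually observed
        decoded[chr(base_code)] = {
            chr(quality + PHRED_OFFSET): count
            for quality, count in enumerate(counts)
            if count > 0
        }
    return decoded


def counts_array(nonzero, length=41):
    """Build a dense count array from a sparse {quality: count} mapping."""
    return [nonzero.get(q, 0) for q in range(length)]


# Row 7 of the example output, written sparsely for readability
row7 = {
    65: counts_array({28: 1, 34: 1, 35: 1, 38: 3}),
    99: counts_array({2: 1}),
}
print(decode_quals(row7))
```

Under these assumptions the result matches the quals_decoded value shown for row 7.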

to_char

to_char(alts: Map(Alt: Short, coverage:Short))

example:

[4]:

ss.sql(f'''SELECT alts, alts_to_char(alts) AS alts_decoded \
FROM  pileup('{table_name}', '{sample_id}', '{ref_path}', true, true) LIMIT 10''').toPandas()

[4]:

   alts     alts_decoded
0  None     None
1  None     None
2  None     None
3  None     None
4  None     None
5  None     None
6  None     None
7  {99: 1}  {'c': 1}
8  None     None
9  {99: 1}  {'c': 1}
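The same mapping can be sketched in plain Python, which is handy when alts have already been collected to the driver. This is inferred from the example output (key 99 becomes 'c'), not from the library source, and `decode_alts` is a hypothetical helper:

```python
def decode_alts(alts):
    """Turn {ASCII code of alt: coverage} into {alt character: coverage}."""
    if alts is None:
        return None
    return {chr(code): coverage for code, coverage in alts.items()}


decoded = decode_alts({99: 1})  # alts value from rows 7 and 9 above
print(decoded)
print(decode_alts(None))        # positions without alts stay None
```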
[5]:

ss.stop()