Pileup analysis

Initialize SeQuiLaSession and download sample data (see the Initialize section for details)

[1]:
%run initialize.ipynb

pileup

pileup(tableName: String, sampleId: String, referencePath: String, includeAlts: Boolean, includeBaseQual: Boolean = False)

Compute a read pileup from alignment data

input parameters

  • tableName - registered table over alignment files (see File formats for details)

  • sampleId - name of the sample, corresponding to the name of the alignment file without the file extension

  • referencePath - local path to the reference in FASTA format (it should be indexed and available on all computing nodes when running in distributed mode). It can also be distributed at runtime using the --files parameter of pyspark-shell

  • includeAlts - determines whether alts should be included in the output

  • includeBaseQual - determines whether base qualities should be computed and included in the output (defaults to False). Note that calculating base qualities has a significant impact on performance (see the Benchmark page for details).
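The order of these arguments matters when pileup is called from SQL. The following is a minimal sketch of a helper that assembles the query text. `build_pileup_query` and the sample values passed to it are hypothetical and not part of SeQuiLa; the argument order (includeAlts before includeBaseQual) is an assumption based on the parameter list above and the example call on this page.

```python
def build_pileup_query(table_name, sample_id, ref_path,
                       include_alts=False, include_base_qual=False,
                       columns="*", limit=None):
    """Assemble a SQL string that calls the pileup table function.

    Hypothetical helper for illustration: it only formats text and does
    not check that the table or the reference file actually exist.
    """
    # Boolean literals are written in lower case in the SQL text
    alts_flag = "true" if include_alts else "false"
    quals_flag = "true" if include_base_qual else "false"
    query = (
        f"SELECT {columns} "
        f"FROM pileup('{table_name}', '{sample_id}', '{ref_path}', "
        f"{alts_flag}, {quals_flag})"
    )
    if limit is not None:
        query += f" LIMIT {int(limit)}"
    return query


# Placeholder values, chosen only for the example
query = build_pileup_query("reads", "sample1", "/data/ref.fasta",
                           include_alts=True, limit=10)
print(query)
```

The resulting string can be passed to ss.sql(...) in the same way as the queries in the cells on this page. Leaving include_base_qual at False avoids the performance cost mentioned above.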

returned columns

Blocks of pileup with the following columns:

  • sample_id - name of the sample, corresponding to the name of the alignment file without the file extension

  • contig - contig name

  • pos_start - start position of a block

  • pos_end - end position of a block

  • ref - reference bases covered by a block

  • coverage - depth of coverage for a block or single position

  • countRef - number of reads whose base at a given position is equal to the reference

  • alts - map of alts in format (ASCII code of alt: coverage)

  • quals - map of base qualities in format (base: Array[qualities])

[2]:
ss.sql(f'''SELECT contig, pos_start, pos_end, ref, coverage, countRef, alts \
      FROM  pileup('{table_name}', '{sample_id}', '{ref_path}', true, true) LIMIT 10''').toPandas()
[2]:
   contig  pos_start  pos_end  ref                 coverage  countRef  alts
0  1       34         34       C                   1         1         None
1  1       35         35       C                   2         2         None
2  1       36         37       CT                  3         3         None
3  1       38         40       AAC                 4         4         None
4  1       41         49       CCTAACCCT           5         5         None
5  1       50         67       AACCCTAACCCTAACCCT  6         6         None
6  1       68         68       A                   7         7         None
7  1       69         69       A                   7         6         {99: 1}
8  1       70         74       CCCTA               7         7         None
9  1       75         75       A                   7         6         {99: 1}
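Because the result is organised in blocks, downstream code sometimes needs per-position values. The sketch below expands one block into single positions and counts non-reference reads. It assumes that pos_end is inclusive and that ref holds exactly one base per position, which is inferred from the output above (for example, block 36-37 has ref CT) rather than from a documented guarantee. Both function names are hypothetical.

```python
def expand_block(pos_start, pos_end, ref):
    """Return (position, reference base) pairs for one pileup block.

    Assumes pos_end is inclusive and ref has one base per position.
    """
    length = pos_end - pos_start + 1
    if len(ref) != length:
        raise ValueError("ref length does not match the block span")
    return [(pos_start + i, base) for i, base in enumerate(ref)]


def non_ref_count(coverage, count_ref):
    """Number of reads that do not match the reference at a position."""
    return coverage - count_ref


# Values taken from rows 2 and 7 of the output above
print(expand_block(36, 37, "CT"))
print(non_ref_count(7, 6))
```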

to_charmap

to_charmap(quals: Map(Base: Short, Array[BaseQuality]))

Convert the binary representation of base qualities into a human-readable map

example:

[3]:
ss.sql(f'''SELECT quals, to_charmap(quals) AS quals_decoded \
      FROM  pileup('{table_name}', '{sample_id}', '{ref_path}', true, true) LIMIT 10''').toPandas()
[3]:
quals quals_decoded
0 None None
1 None None
2 None None
3 None None
4 None None
5 None None
6 None None
7 {65: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 3, 0, 0], 99: [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]} {'A': {'C': 1, 'D': 1, '=': 1, 'G': 3}, 'c': {'#': 1}}
8 None None
9 {65: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 3, 0, 0], 99: [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]} {'A': {'@': 1, 'C': 1, 'D': 1, 'G': 3}, 'c': {'#': 1}}
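The mapping between the two columns can be reproduced in plain Python, which may help when reading the raw quals values. This is a sketch under assumptions inferred only from the output above: each array index is taken to be a quality score, each value the number of reads with that score, and the decoded character is the score shifted by 33 (Phred+33). `decode_quals` is a hypothetical name and is not the implementation of to_charmap.

```python
PHRED_OFFSET = 33  # assumed Phred+33 character encoding


def decode_quals(quals):
    """Turn {base ASCII code: [count per quality score]} into
    {base character: {quality character: count}}, skipping zero counts."""
    if quals is None:
        return None
    decoded = {}
    for base_code, counts in quals.items():
        decoded[chr(base_code)] = {
            chr(score + PHRED_OFFSET): n
            for score, n in enumerate(counts)
            if n > 0
        }
    return decoded


# Rebuild the two arrays shown in row 7 of the output above
upper_a = [0] * 41
upper_a[28], upper_a[34], upper_a[35], upper_a[38] = 1, 1, 1, 3
lower_c = [0] * 41
lower_c[2] = 1

print(decode_quals({65: upper_a, 99: lower_c}))
```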

alts_to_char

alts_to_char(alts: Map(Alt: Short, coverage: Short))

Convert the binary representation of alts into a human-readable map, with strand information encoded as lower/upper case

example:

[4]:
ss.sql(f'''SELECT alts, alts_to_char(alts) AS alts_decoded \
      FROM  pileup('{table_name}', '{sample_id}', '{ref_path}', true, true) LIMIT 10''').toPandas()
[4]:
   alts     alts_decoded
0  None     None
1  None     None
2  None     None
3  None     None
4  None     None
5  None     None
6  None     None
7  {99: 1}  {'c': 1}
8  None     None
9  {99: 1}  {'c': 1}
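The same decoding can be sketched on the Python side, for instance after the result has been collected with toPandas(). The keys of alts are ASCII codes, so chr() recovers the base character and its case. `decode_alts` and `merge_strands` are hypothetical names; the second function simply sums lower-case and upper-case entries for the same base and makes no claim about which case denotes which strand.

```python
def decode_alts(alts):
    """Turn {ASCII code: coverage} into {base character: coverage}."""
    if alts is None:
        return None
    return {chr(code): cov for code, cov in alts.items()}


def merge_strands(decoded):
    """Sum coverage of lower-case and upper-case entries per base."""
    merged = {}
    for base, cov in decoded.items():
        key = base.upper()
        merged[key] = merged.get(key, 0) + cov
    return merged


# Value taken from row 7 of the output above
decoded = decode_alts({99: 1})
print(decoded)
print(merge_strands(decoded))
```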
[5]:
ss.stop()