{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Pileup analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Initialize SeQuiLaSession and download sample data (check Initialize section for details)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%run initialize.ipynb" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## `pileup`\n", "`pileup(tableName: String, sampleId: String, referencePath: String, includeBaseQual: Boolean = False)`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Compute read pileup from alignment data\n", "\n", "#### input parameters\n", "* tableName - registered table over alignment files (see File formats for details)\n", "* sampleId - name of the sample, corresponding to the name of the alignment file without a file extension\n", "* referencePath - a local path to the reference in FASTA format (should be indexed and available on all computing nodes if run in distributed mode). It can also be distributed at runtime using the `--files` parameter to `pyspark-shell`\n", "* includeAlts - determines whether alts should be included in the output\n", "* includeBaseQual - determines whether base qualities should be computed and included in the output (defaults to False). 
Please note that calculating base qualities has a **significant** impact on performance (see Benchmark page for details).\n", "\n", "#### returned columns\n", "Blocks of pileup with the following columns:\n", "\n", "* sample_id - name of the sample, corresponding to the name of the alignment file without a file extension\n", "* contig - contig name\n", "* pos_start - start position of a block\n", "* pos_end - end position of a block\n", "* coverage - depth of coverage for a block or single position\n", "* countRef - depth of coverage of reads whose base at a given position matches the reference\n", "* alts - map of alts in format `(ASCII code of alt: coverage)`\n", "* quals - map of base qualities in format `(base: Array[qualities])`\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ss.sql(f'''SELECT contig, pos_start, pos_end, ref, coverage, countRef, alts \\\n", " FROM pileup('{table_name}', '{sample_id}', '{ref_path}', true, true) LIMIT 10''').toPandas()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## `to_charmap`\n", "`to_charmap(quals: Map(Base: Short, Array[BaseQuality]))`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Convert the binary representation of base qualities into a human-readable map\n", "\n", "#### example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ss.sql(f'''SELECT quals, to_charmap(quals) AS quals_decoded \\\n", " FROM pileup('{table_name}', '{sample_id}', '{ref_path}', true, true) LIMIT 10''').toPandas()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## `to_char`\n", "`to_char(alts: Map(Alt: Short, coverage: Short))`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Convert the binary representation of alts into a human-readable map, with strand information encoded as lower/upper case\n", "\n", "#### example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, 
"outputs": [], "source": [ "ss.sql(f'''SELECT alts, alts_to_char(alts) AS alts_decoded \\\n", " FROM pileup('{table_name}', '{sample_id}', '{ref_path}', true, true) LIMIT 10''').toPandas()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ss.stop()" ] } ], "metadata": { "kernelspec": { "display_name": "pysequila", "language": "python", "name": "pysequila" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.9" } }, "nbformat": 4, "nbformat_minor": 4 }