Quick start
Download sample data
[1]:
%%bash
mkdir -p data
cd data
wget --quiet http://biodatageeks.ii.pw.edu.pl/sequila/data/NA12878.multichrom.md.bam
wget --quiet http://biodatageeks.ii.pw.edu.pl/sequila/data/NA12878.multichrom.md.bam.bai
wget --quiet http://biodatageeks.ii.pw.edu.pl/sequila/data/Homo_sapiens_assembly18_chr1_chrM.small.fasta
wget --quiet http://biodatageeks.ii.pw.edu.pl/sequila/data/Homo_sapiens_assembly18_chr1_chrM.small.fasta.fai
wget --quiet https://raw.githubusercontent.com/bigdatagenomics/mango/master/mango-pileup/examples/data/alignments.ga4gh.chr17.1-250.json
[2]:
import os
base_path = f"{os.getcwd()}/data"
bam_path = f'{base_path}/NA12878.multichrom.md.bam'
ref_path=f'{base_path}/Homo_sapiens_assembly18_chr1_chrM.small.fasta'
sample_id= 'NA12878'
table_name = 'reads'
app_name = "sequila"
Check if your env variables are properly set
Ensure that:
you use Java 11 in have set JAVA_HOME accordingly,
Apache Spark 3.1.2 in SPARK_HOME,
SeQuiLa Scala package is in your PYSPARK_SUBMIT_ARGS (i.e. –package parameter)
[3]:
%%bash
echo "JAVA_HOME is $JAVA_HOME"
echo "SPARK_HOME is $SPARK_HOME"
echo "PYSPARK_SUBMIT_ARGS is $PYSPARK_SUBMIT_ARGS"
java -version
spark-shell --version
JAVA_HOME is /tmp/jovyan/.sdkman/candidates/java/current
SPARK_HOME is /usr/local/spark
PYSPARK_SUBMIT_ARGS is --packages org.biodatageeks:sequila_2.12:0.7.3 --driver-memory 4g pyspark-shell
openjdk version "11.0.11" 2021-04-20
OpenJDK Runtime Environment AdoptOpenJDK-11.0.11+9 (build 11.0.11+9)
OpenJDK 64-Bit Server VM AdoptOpenJDK-11.0.11+9 (build 11.0.11+9, mixed mode)
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/usr/local/spark-3.1.2-bin-hadoop2.7/jars/spark-unsafe_2.12-3.1.2.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.1.2
/_/
Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 11.0.11
Branch HEAD
Compiled by user centos on 2021-05-24T04:46:13Z
Revision de351e30a90dd988b133b3d00fa6218bfcaba8b8
Url https://github.com/apache/spark
Type --help for more information.
Install pysequila and pandas
[4]:
!pip install pysequila==$VERSION pandas
Requirement already satisfied: pysequila==0.3.0 in /tmp/jovyan/venv/pysequila/lib/python3.7/site-packages (0.3.0)
Requirement already satisfied: pandas in /tmp/jovyan/venv/pysequila/lib/python3.7/site-packages (1.1.4)
Requirement already satisfied: findspark in /tmp/jovyan/venv/pysequila/lib/python3.7/site-packages (from pysequila==0.3.0) (1.4.2)
Requirement already satisfied: pyspark==3.1.2 in /tmp/jovyan/venv/pysequila/lib/python3.7/site-packages (from pysequila==0.3.0) (3.1.2)
Requirement already satisfied: setuptools in /tmp/jovyan/venv/pysequila/lib/python3.7/site-packages (from pysequila==0.3.0) (59.2.0)
Requirement already satisfied: typeguard==2.9.1 in /tmp/jovyan/venv/pysequila/lib/python3.7/site-packages (from pysequila==0.3.0) (2.9.1)
Requirement already satisfied: py4j==0.10.9 in /tmp/jovyan/venv/pysequila/lib/python3.7/site-packages (from pyspark==3.1.2->pysequila==0.3.0) (0.10.9)
Requirement already satisfied: pytz>=2017.2 in /tmp/jovyan/venv/pysequila/lib/python3.7/site-packages (from pandas) (2021.3)
Requirement already satisfied: python-dateutil>=2.7.3 in /tmp/jovyan/venv/pysequila/lib/python3.7/site-packages (from pandas) (2.8.2)
Requirement already satisfied: numpy>=1.15.4 in /tmp/jovyan/venv/pysequila/lib/python3.7/site-packages (from pandas) (1.21.4)
Requirement already satisfied: six>=1.5 in /tmp/jovyan/venv/pysequila/lib/python3.7/site-packages (from python-dateutil>=2.7.3->pandas) (1.16.0)
Initialize PySequila Session
[5]:
from pysequila import SequilaSession
ss = SequilaSession \
.builder \
.appName(f'{app_name}') \
.getOrCreate()
Create a table over BAM files
[6]:
ss.sql(f'''CREATE TABLE IF NOT EXISTS {table_name} \
USING org.biodatageeks.sequila.datasources.BAM.BAMDataSource \
OPTIONS(path "{bam_path}")''')
[6]:
DataFrame[]
Run a simple select statement
[7]:
import pandas as pd
pd.options.display.max_columns = None
ss.sql(f'''SELECT sample_id, contig, pos, cigar, seq \
FROM {table_name} LIMIT 5''').toPandas()
[7]:
sample_id | contig | pos | cigar | seq | |
---|---|---|---|---|---|
0 | NA12878 | MT | 7 | 101M | AGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTT... |
1 | NA12878 | MT | 9 | 101M | GTCTGTCACCCTTGTAGCCGCTCACGGGAGCTCTCCATGCATTTGG... |
2 | NA12878 | MT | 10 | 101M | TCTATCCCCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGT... |
3 | NA12878 | MT | 20 | 101M | TATTATCCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCT... |
4 | NA12878 | MT | 25 | 76M | ACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGG... |
Calculate pileup
[8]:
ss.sql(f'''SELECT contig, pos_start, pos_end, ref, coverage, countRef, alts \
FROM pileup('{table_name}', '{sample_id}', '{ref_path}', true, true ) LIMIT 10''').toPandas()
[8]:
contig | pos_start | pos_end | ref | coverage | countRef | alts | |
---|---|---|---|---|---|---|---|
0 | 1 | 34 | 34 | C | 1 | 1 | None |
1 | 1 | 35 | 35 | C | 2 | 2 | None |
2 | 1 | 36 | 37 | CT | 3 | 3 | None |
3 | 1 | 38 | 40 | AAC | 4 | 4 | None |
4 | 1 | 41 | 49 | CCTAACCCT | 5 | 5 | None |
5 | 1 | 50 | 67 | AACCCTAACCCTAACCCT | 6 | 6 | None |
6 | 1 | 68 | 68 | A | 7 | 7 | None |
7 | 1 | 69 | 69 | A | 7 | 6 | {99: 1} |
8 | 1 | 70 | 74 | CCCTA | 7 | 7 | None |
9 | 1 | 75 | 75 | A | 7 | 6 | {99: 1} |
[9]:
ss.stop()