Quick start

Download sample data

[1]:
%%bash
mkdir -p data
cd data
wget --quiet http://biodatageeks.ii.pw.edu.pl/sequila/data/NA12878.multichrom.md.bam
wget --quiet http://biodatageeks.ii.pw.edu.pl/sequila/data/NA12878.multichrom.md.bam.bai
wget --quiet http://biodatageeks.ii.pw.edu.pl/sequila/data/Homo_sapiens_assembly18_chr1_chrM.small.fasta
wget --quiet http://biodatageeks.ii.pw.edu.pl/sequila/data/Homo_sapiens_assembly18_chr1_chrM.small.fasta.fai
wget --quiet https://raw.githubusercontent.com/bigdatagenomics/mango/master/mango-pileup/examples/data/alignments.ga4gh.chr17.1-250.json
[2]:
import os
base_path = f"{os.getcwd()}/data"

bam_path = f'{base_path}/NA12878.multichrom.md.bam'
ref_path=f'{base_path}/Homo_sapiens_assembly18_chr1_chrM.small.fasta'
sample_id= 'NA12878'
table_name = 'reads'

app_name = "sequila"

Check if your env variables are properly set

Ensure that:

  • you use Java 11 in have set JAVA_HOME accordingly,

  • Apache Spark 3.1.2 in SPARK_HOME,

  • SeQuiLa Scala package is in your PYSPARK_SUBMIT_ARGS (i.e. –package parameter)

[3]:
%%bash
echo "JAVA_HOME is $JAVA_HOME"
echo "SPARK_HOME is $SPARK_HOME"
echo "PYSPARK_SUBMIT_ARGS is $PYSPARK_SUBMIT_ARGS"
java -version
spark-shell --version
JAVA_HOME is /tmp/jovyan/.sdkman/candidates/java/current
SPARK_HOME is /usr/local/spark
PYSPARK_SUBMIT_ARGS is --packages org.biodatageeks:sequila_2.12:0.7.5 --driver-memory 4g pyspark-shell
openjdk version "11.0.11" 2021-04-20
OpenJDK Runtime Environment AdoptOpenJDK-11.0.11+9 (build 11.0.11+9)
OpenJDK 64-Bit Server VM AdoptOpenJDK-11.0.11+9 (build 11.0.11+9, mixed mode)
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/usr/local/spark-3.1.2-bin-hadoop2.7/jars/spark-unsafe_2.12-3.1.2.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.2
      /_/

Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 11.0.11
Branch HEAD
Compiled by user centos on 2021-05-24T04:46:13Z
Revision de351e30a90dd988b133b3d00fa6218bfcaba8b8
Url https://github.com/apache/spark
Type --help for more information.

Install pysequila and pandas

[4]:
!pip install pysequila==$VERSION pandas
Requirement already satisfied: pysequila==0.3.3+g904094a-SNAPSHOT in /tmp/jovyan/venv/pysequila/lib/python3.7/site-packages (0.3.3+g904094a.snapshot)
Requirement already satisfied: pandas in /tmp/jovyan/venv/pysequila/lib/python3.7/site-packages (1.1.4)
Requirement already satisfied: findspark in /tmp/jovyan/venv/pysequila/lib/python3.7/site-packages (from pysequila==0.3.3+g904094a-SNAPSHOT) (1.4.2)
Requirement already satisfied: setuptools in /tmp/jovyan/venv/pysequila/lib/python3.7/site-packages (from pysequila==0.3.3+g904094a-SNAPSHOT) (59.2.0)
Requirement already satisfied: pyspark==3.1.2 in /tmp/jovyan/venv/pysequila/lib/python3.7/site-packages (from pysequila==0.3.3+g904094a-SNAPSHOT) (3.1.2)
Requirement already satisfied: typeguard==2.9.1 in /tmp/jovyan/venv/pysequila/lib/python3.7/site-packages (from pysequila==0.3.3+g904094a-SNAPSHOT) (2.9.1)
Requirement already satisfied: py4j==0.10.9 in /tmp/jovyan/venv/pysequila/lib/python3.7/site-packages (from pyspark==3.1.2->pysequila==0.3.3+g904094a-SNAPSHOT) (0.10.9)
Requirement already satisfied: python-dateutil>=2.7.3 in /tmp/jovyan/venv/pysequila/lib/python3.7/site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2017.2 in /tmp/jovyan/venv/pysequila/lib/python3.7/site-packages (from pandas) (2021.3)
Requirement already satisfied: numpy>=1.15.4 in /tmp/jovyan/venv/pysequila/lib/python3.7/site-packages (from pandas) (1.21.4)
Requirement already satisfied: six>=1.5 in /tmp/jovyan/venv/pysequila/lib/python3.7/site-packages (from python-dateutil>=2.7.3->pandas) (1.16.0)

Initialize PySequila Session

[5]:
from pysequila import SequilaSession
ss = SequilaSession \
    .builder \
    .appName(f'{app_name}') \
    .getOrCreate()

Create a table over BAM files

[6]:
ss.sql(f'''CREATE TABLE IF NOT EXISTS {table_name} \
         USING org.biodatageeks.sequila.datasources.BAM.BAMDataSource \
         OPTIONS(path "{bam_path}")''')
[6]:
DataFrame[]

Run a simple select statement

[7]:
import pandas as pd
pd.options.display.max_columns = None
ss.sql(f'''SELECT sample_id, contig, pos, cigar, seq \
        FROM {table_name} LIMIT 5''').toPandas()
[7]:
sample_id contig pos cigar seq
0 NA12878 MT 7 101M AGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTT...
1 NA12878 MT 9 101M GTCTGTCACCCTTGTAGCCGCTCACGGGAGCTCTCCATGCATTTGG...
2 NA12878 MT 10 101M TCTATCCCCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGT...
3 NA12878 MT 20 101M TATTATCCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCT...
4 NA12878 MT 25 76M ACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGG...

Calculate pileup

[8]:
ss.sql(f'''SELECT contig, pos_start, pos_end, ref, coverage, countRef, alts \
      FROM  pileup('{table_name}', '{sample_id}', '{ref_path}', true, true ) LIMIT 10''').toPandas()

[8]:
contig pos_start pos_end ref coverage countRef alts
0 1 34 34 C 1 1 None
1 1 35 35 C 2 2 None
2 1 36 37 CT 3 3 None
3 1 38 40 AAC 4 4 None
4 1 41 49 CCTAACCCT 5 5 None
5 1 50 67 AACCCTAACCCTAACCCT 6 6 None
6 1 68 68 A 7 7 None
7 1 69 69 A 7 6 {99: 1}
8 1 70 74 CCCTA 7 7 None
9 1 75 75 A 7 6 {99: 1}
[9]:
ss.stop()