fastq-dump SRR8181428 --split-files --gzip
fastq-dump SRR8181429 --split-files --gzip

SRR8181428_1.fastq.gz: Read 1
SRR8181428_2.fastq.gz: Read 2
SRR8181428_3.fastq.gz: Index

cellranger count --id=RetinalBatchE2 \
   --fastqs=/path/to/SRR8181428 \
   --sample=SRR8181428 \
   --create-bam=true \
   --transcriptome=/path/to/Cell_Ranger_References/refdata-gex-mm10-2020-A

cellranger count --id=RetinalBatchF2 \
   --fastqs=/path/to/SRR8181429 \
   --sample=SRR8181429 \
   --create-bam=true \
   --transcriptome=/path/to/Cell_Ranger_References/refdata-gex-mm10-2020-A

import warnings
warnings.filterwarnings('ignore')

import os
import numpy as np
import pandas as pd
import scanpy as sc
import matplotlib.pyplot as plt
import seaborn as sns

sc.logging.print_header()
sc.settings.set_figure_params(dpi=100, facecolor='white')

scanpy==1.10.4 anndata==0.11.3 umap==0.5.7 numpy==2.1.3 scipy==1.15.1 pandas==2.2.2 scikit-learn==1.6.1 statsmodels==0.14.4 igraph==0.11.8 pynndescent==0.5.13

adata_E2 = sc.read_10x_mtx(path='./data/RetinalBatchE2/outs/filtered_feature_bc_matrix/')
adata_F2 = sc.read_10x_mtx(path='./data/RetinalBatchF2/outs/filtered_feature_bc_matrix/')

adata_E2

AnnData object with n_obs × n_vars = 3392 × 32285
    var: 'gene_ids', 'feature_types'

adata_F2

AnnData object with n_obs × n_vars = 3611 × 32285
    var: 'gene_ids', 'feature_types'

adata = adata_E2.concatenate(adata_F2, batch_categories=['E2', 'F2'])
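
The concatenation above appends the batch label to every barcode, which is why the observation names in the outputs of this notebook end in `-E2` or `-F2`. A minimal pure-Python sketch of that renaming follows; the barcode lists are made up for illustration and are not read from the data.

```python
# Toy barcodes standing in for the obs_names of each batch (illustrative values).
barcodes = {
    "E2": ["AAACCTGAGATGTCGG-1", "AAACCTGCAATCCAAC-1"],
    "F2": ["AAACCTGAGATGTCGG-1", "TTTGTCATCAGTTCGA-1"],
}

# Append "-<batch>" so that identical barcodes from different batches stay unique.
merged = [
    (f"{barcode}-{batch}", batch)
    for batch, names in barcodes.items()
    for barcode in names
]

obs_names = [name for name, _ in merged]
print(obs_names)
```

Note that `AnnData.concatenate` is deprecated in recent anndata releases in favour of `anndata.concat`; the call is kept here because the recorded outputs were produced with it.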

adata.obs

adata.var

adata.X

<Compressed Sparse Row sparse matrix of dtype 'float32'
	with 14123833 stored elements and shape (7003, 32285)>

adata

AnnData object with n_obs × n_vars = 7003 × 32285
    obs: 'batch'
    var: 'gene_ids', 'feature_types'

# Remove cells in which extremely few genes were detected
sc.pp.filter_cells(adata, min_genes=200)
# Remove genes that were detected in extremely few cells
sc.pp.filter_genes(adata, min_cells=20)
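
What the two filters compute can be sketched with NumPy on a toy count matrix. This is an illustration of the thresholds only, assuming a dense matrix and made-up values; it is not scanpy's implementation.

```python
import numpy as np

# Toy count matrix: 4 cells (rows) x 5 genes (columns); values are made up.
counts = np.array([
    [3, 0, 1, 0, 2],
    [0, 0, 0, 0, 1],
    [5, 2, 0, 1, 4],
    [1, 0, 2, 0, 0],
])

min_genes = 2  # keep cells with at least this many detected genes
min_cells = 2  # keep genes detected in at least this many cells

# Number of genes with a nonzero count in each cell.
genes_per_cell = (counts > 0).sum(axis=1)
keep_cells = genes_per_cell >= min_genes
filtered = counts[keep_cells]

# Number of remaining cells in which each gene is detected.
cells_per_gene = (filtered > 0).sum(axis=0)
keep_genes = cells_per_gene >= min_cells
filtered = filtered[:, keep_genes]

print(filtered.shape)
```

Because the gene filter runs after the cell filter, `n_cells` is counted over the cells that survived the first step.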

adata

AnnData object with n_obs × n_vars = 7001 × 15001
    obs: 'batch', 'n_genes'
    var: 'gene_ids', 'feature_types', 'n_cells'

adata.var['mt'] = adata.var_names.str.startswith('mt-')
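
The prefix match is case-sensitive: mouse mitochondrial gene symbols start with lowercase `mt-`, whereas human symbols start with `MT-`. A small sketch with a hand-picked list of gene symbols (chosen for illustration) shows which names the test flags.

```python
# Illustrative gene symbols; mouse mitochondrial genes start with "mt-",
# human ones with "MT-".
mouse_genes = ["Xkr4", "mt-Nd1", "mt-Co1", "Mrpl15", "Mtor"]

# Case-sensitive prefix match, equivalent to var_names.str.startswith('mt-').
is_mt = [gene.startswith("mt-") for gene in mouse_genes]
print(dict(zip(mouse_genes, is_mt)))
```

Nuclear genes such as `Mrpl15` or `Mtor` are not flagged, because they do not carry the `mt-` prefix.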

adata.var['mt']

Xkr4              False
Gm1992            False
Gm19938           False
Rp1               False
Mrpl15            False
                  ...  
CAAA01118383.1    False
Vamp7             False
Tmlhe             False
4933409K07Rik     False
AC149090.1        False
Name: mt, Length: 15001, dtype: bool

# inplace: whether to write the computed metrics to .obs and .var.
# percent_top: which proportions of top genes to cover; if empty or None, they are not calculated. Values are 1-indexed, so percent_top=[50] gives the cumulative proportion up to the 50th most expressed gene.
# log1p: set to False to skip computing log1p-transformed annotations.
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
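
The per-cell metrics written to `.obs` can be reproduced by hand. The sketch below assumes a small dense toy matrix in which the last gene plays the role of a mitochondrial gene; the numbers are made up.

```python
import numpy as np

# Toy counts: 3 cells x 4 genes; the last gene plays the role of an "mt-" gene.
counts = np.array([
    [10, 0, 5, 5],
    [0, 2, 0, 2],
    [7, 1, 1, 1],
], dtype=float)
is_mt = np.array([False, False, False, True])

n_genes_by_counts = (counts > 0).sum(axis=1)      # genes detected per cell
total_counts = counts.sum(axis=1)                 # total counts per cell
total_counts_mt = counts[:, is_mt].sum(axis=1)    # counts from mt genes
pct_counts_mt = 100 * total_counts_mt / total_counts

print(n_genes_by_counts, total_counts, pct_counts_mt)
```

The same arithmetic matches the notebook's own `adata.obs` output: for the first cell, 112 mitochondrial counts out of 3776 total counts give 100 × 112 / 3776 ≈ 2.966.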

sc.pl.violin(adata, ["n_genes_by_counts", "total_counts", "pct_counts_mt"], jitter=0.4, multi_panel=True)

sc.pl.scatter(adata, 'total_counts', 'n_genes_by_counts', color='pct_counts_mt', size=40)

# displot is a figure-level function and creates its own figure,
# so no plt.figure() call is needed beforehand.
sns.displot(adata.obs['pct_counts_mt'], kde=False)
plt.show()

# Same histogram, restricted to cells with less than 10% mitochondrial counts.
sns.displot(adata.obs['pct_counts_mt'][adata.obs['pct_counts_mt'] < 10], kde=False)
plt.show()

print('Total number of cells: {:d}'.format(adata.n_obs))

sc.pp.filter_cells(adata, min_counts = 2000)
print('Number of cells after min count filter: {:d}'.format(adata.n_obs))

sc.pp.filter_cells(adata, max_counts = 13000)
print('Number of cells after max count filter: {:d}'.format(adata.n_obs))

sc.pp.filter_cells(adata, min_genes = 1000)
print('Number of cells after gene filter: {:d}'.format(adata.n_obs))

# Take an explicit copy so that adata is a real AnnData object rather than a view.
adata = adata[adata.obs['pct_counts_mt'] < 6].copy()
print('Number of cells after MT filter: {:d}'.format(adata.n_obs))

Total number of cells: 7001
Number of cells after min count filter: 5174
Number of cells after max count filter: 4981
Number of cells after gene filter: 4976
Number of cells after MT filter: 4773

adata

AnnData object with n_obs × n_vars = 4773 × 15001
    obs: 'batch', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'n_counts'
    var: 'gene_ids', 'feature_types', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'

adata.layers['counts'] = adata.X.copy()

# normalize_per_cell is deprecated; normalize_total performs the same scaling to 10,000 counts per cell.
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
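
The normalization step scales every cell to 10,000 total counts and then applies a log(1 + x) transform. A minimal NumPy sketch on made-up counts, assuming a dense matrix:

```python
import numpy as np

# Toy counts: 2 cells x 4 genes (made-up values).
counts = np.array([
    [10, 0, 5, 5],
    [0, 2, 0, 2],
], dtype=float)
target_sum = 1e4

# Scale each cell so that its counts sum to target_sum ...
size_factors = counts.sum(axis=1, keepdims=True) / target_sum
normalized = counts / size_factors

# ... then apply log(1 + x) to compress the dynamic range.
log_normalized = np.log1p(normalized)

print(normalized.sum(axis=1))
```

After scaling, cells with different sequencing depths become comparable; the log transform keeps zeros at zero.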

adata.raw = adata

sc.pp.highly_variable_genes?

Signature:
sc.pp.highly_variable_genes(
    adata: 'AnnData',
    *,
    layer: 'str | None' = None,
    n_top_genes: 'int | None' = None,
    min_disp: 'float' = 0.5,
    max_disp: 'float' = inf,
    min_mean: 'float' = 0.0125,
    max_mean: 'float' = 3,
    span: 'float' = 0.3,
    n_bins: 'int' = 20,
    flavor: "Literal['seurat', 'cell_ranger', 'seurat_v3', 'seurat_v3_paper']" = 'seurat',
    subset: 'bool' = False,
    inplace: 'bool' = True,
    batch_key: 'str | None' = None,
    check_values: 'bool' = True,
) -> 'pd.DataFrame | None'
Docstring:
Annotate highly variable genes :cite:p:`Satija2015,Zheng2017,Stuart2019`.

Expects logarithmized data, except when `flavor='seurat_v3'`/`'seurat_v3_paper'`, in which count
data is expected.

Depending on `flavor`, this reproduces the R-implementations of Seurat
:cite:p:`Satija2015`, Cell Ranger :cite:p:`Zheng2017`, and Seurat v3 :cite:p:`Stuart2019`.

`'seurat_v3'`/`'seurat_v3_paper'` requires `scikit-misc` package. If you plan to use this flavor, consider
installing `scanpy` with this optional dependency: `scanpy[skmisc]`.

For the dispersion-based methods (`flavor='seurat'` :cite:t:`Satija2015` and
`flavor='cell_ranger'` :cite:t:`Zheng2017`), the normalized dispersion is obtained
by scaling with the mean and standard deviation of the dispersions for genes
falling into a given bin for mean expression of genes. This means that for each
bin of mean expression, highly variable genes are selected.

For `flavor='seurat_v3'`/`'seurat_v3_paper'` :cite:p:`Stuart2019`, a normalized variance for each gene
is computed. First, the data are standardized (i.e., z-score normalization
per feature) with a regularized standard deviation. Next, the normalized variance
is computed as the variance of each gene after the transformation. Genes are ranked
by the normalized variance.
Only if `batch_key` is not `None`, the two flavors differ: For `flavor='seurat_v3'`, genes are first sorted by the median (across batches) rank, with ties broken by the number of batches a gene is a HVG.
For `flavor='seurat_v3_paper'`, genes are first sorted by the number of batches a gene is a HVG, with ties broken by the median (across batches) rank.

The following may help when comparing to Seurat's naming:
If `batch_key=None` and `flavor='seurat'`, this mimics Seurat's `FindVariableFeatures(…, method='mean.var.plot')`.
If `batch_key=None` and `flavor='seurat_v3'`/`flavor='seurat_v3_paper'`, this mimics Seurat's `FindVariableFeatures(..., method='vst')`.
If `batch_key` is not `None` and `flavor='seurat_v3_paper'`, this mimics Seurat's `SelectIntegrationFeatures`.

See also `scanpy.experimental.pp._highly_variable_genes` for additional flavors
(e.g. Pearson residuals).

Parameters
----------
adata : 'AnnData'
    The annotated data matrix of shape `n_obs` × `n_vars`. Rows correspond
    to cells and columns to genes.
layer : 'str | None', optional (default: None)
    If provided, use `adata.layers[layer]` for expression values instead of `adata.X`.
n_top_genes : 'int | None', optional (default: None)
    Number of highly-variable genes to keep. Mandatory if `flavor='seurat_v3'`.
min_mean : 'float', optional (default: 0.0125)
    If `n_top_genes` unequals `None`, this and all other cutoffs for the means and the
    normalized dispersions are ignored. Ignored if `flavor='seurat_v3'`.
max_mean : 'float', optional (default: 3)
    If `n_top_genes` unequals `None`, this and all other cutoffs for the means and the
    normalized dispersions are ignored. Ignored if `flavor='seurat_v3'`.
min_disp : 'float', optional (default: 0.5)
    If `n_top_genes` unequals `None`, this and all other cutoffs for the means and the
    normalized dispersions are ignored. Ignored if `flavor='seurat_v3'`.
max_disp : 'float', optional (default: inf)
    If `n_top_genes` unequals `None`, this and all other cutoffs for the means and the
    normalized dispersions are ignored. Ignored if `flavor='seurat_v3'`.
span : 'float', optional (default: 0.3)
    The fraction of the data (cells) used when estimating the variance in the loess
    model fit if `flavor='seurat_v3'`.
n_bins : 'int', optional (default: 20)
    Number of bins for binning the mean gene expression. Normalization is
    done with respect to each bin. If just a single gene falls into a bin,
    the normalized dispersion is artificially set to 1. You'll be informed
    about this if you set `settings.verbosity = 4`.
flavor : "Literal['seurat', 'cell_ranger', 'seurat_v3', 'seurat_v3_paper']", optional (default: 'seurat')
    Choose the flavor for identifying highly variable genes. For the dispersion
    based methods in their default workflows, Seurat passes the cutoffs whereas
    Cell Ranger passes `n_top_genes`.
subset : 'bool', optional (default: False)
    Inplace subset to highly-variable genes if `True` otherwise merely indicate
    highly variable genes.
inplace : 'bool', optional (default: True)
    Whether to place calculated metrics in `.var` or return them.
batch_key : 'str | None', optional (default: None)
    If specified, highly-variable genes are selected within each batch separately and merged.
    This simple process avoids the selection of batch-specific genes and acts as a
    lightweight batch correction method. For all flavors, except `seurat_v3`, genes are first sorted
    by how many batches they are a HVG. For dispersion-based flavors ties are broken
    by normalized dispersion. For `flavor = 'seurat_v3_paper'`, ties are broken by the median
    (across batches) rank based on within-batch normalized variance.
check_values : 'bool', optional (default: True)
    Check if counts in selected layer are integers. A Warning is returned if set to True.
    Only used if `flavor='seurat_v3'`/`'seurat_v3_paper'`.

Returns
-------
Returns a :class:`pandas.DataFrame` with calculated metrics if `inplace=False`, else returns an `AnnData` object where it sets the following field:

`adata.var['highly_variable']` : :class:`pandas.Series` (dtype `bool`)
    boolean indicator of highly-variable genes
`adata.var['means']` : :class:`pandas.Series` (dtype `float`)
    means per gene
`adata.var['dispersions']` : :class:`pandas.Series` (dtype `float`)
    For dispersion-based flavors, dispersions per gene
`adata.var['dispersions_norm']` : :class:`pandas.Series` (dtype `float`)
    For dispersion-based flavors, normalized dispersions per gene
`adata.var['variances']` : :class:`pandas.Series` (dtype `float`)
    For `flavor='seurat_v3'`/`'seurat_v3_paper'`, variance per gene
`adata.var['variances_norm']`/`'seurat_v3_paper'` : :class:`pandas.Series` (dtype `float`)
    For `flavor='seurat_v3'`/`'seurat_v3_paper'`, normalized variance per gene, averaged in
    the case of multiple batches
`adata.var['highly_variable_rank']` : :class:`pandas.Series` (dtype `float`)
    For `flavor='seurat_v3'`/`'seurat_v3_paper'`, rank of the gene according to normalized
    variance, in case of multiple batches description above
`adata.var['highly_variable_nbatches']` : :class:`pandas.Series` (dtype `int`)
    If `batch_key` is given, this denotes in how many batches genes are detected as HVG
`adata.var['highly_variable_intersection']` : :class:`pandas.Series` (dtype `bool`)
    If `batch_key` is given, this denotes the genes that are highly variable in all batches

Notes
-----
This function replaces :func:`~scanpy.pp.filter_genes_dispersion`.
File:      ~/miniforge3/envs/pags2024/lib/python3.11/site-packages/scanpy/preprocessing/_highly_variable_genes.py
Type:      function

# Flag the top 2,000 highly variable genes (subset=False, so genes are annotated, not removed)
sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor='seurat')
print('\n','Number of highly variable genes: {:d}'.format(np.sum(adata.var['highly_variable'])))

 Number of highly variable genes: 2000
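
The docstring above describes the dispersion-based flavors as normalizing each gene's dispersion within a bin of mean expression. The following is a simplified sketch of that idea on simulated data, assuming five equal-width bins and a variance-to-mean dispersion; it is not scanpy's actual implementation, and all names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Simulated normalized (non-log) expression: 200 cells x 50 genes.
expr = rng.gamma(shape=2.0, scale=rng.uniform(0.1, 5.0, size=50), size=(200, 50))

mean = expr.mean(axis=0)
var = expr.var(axis=0, ddof=1)
dispersion = var / mean                     # variance-to-mean ratio per gene

stats = pd.DataFrame({
    "mean": np.log1p(mean),
    "dispersion": np.log(dispersion),
})

# Bin genes by mean expression and z-score the dispersion within each bin.
stats["bin"] = pd.cut(stats["mean"], bins=5)
grouped = stats.groupby("bin", observed=True)["dispersion"]
stats["dispersion_norm"] = (
    stats["dispersion"] - grouped.transform("mean")
) / grouped.transform("std")

# Flag the n_top genes with the largest normalized dispersion.
n_top = 10
stats["highly_variable"] = False
top = stats["dispersion_norm"].nlargest(n_top).index
stats.loc[top, "highly_variable"] = True

print(int(stats["highly_variable"].sum()))
```

Normalizing within bins means a gene is compared only with genes of similar mean expression, so highly expressed genes do not dominate the selection.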


sc.pl.highly_variable_genes(adata)

adata.var

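# use_highly_variable is deprecated since scanpy 1.10; mask_var restricts PCA to the flagged genes.
sc.pp.pca(adata, n_comps=50, mask_var='highly_variable', svd_solver='arpack')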

print(adata.obsm['X_pca'].shape)

(4773, 50)
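
For intuition, PCA coordinates and the explained-variance ratio can be obtained from a singular value decomposition of the centered matrix. The sketch below uses simulated data and a full SVD; the notebook itself relies on scanpy with the ARPACK solver, so this is an illustration rather than the same code path.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))       # simulated cells x genes matrix
n_comps = 5

# Center each gene, then decompose.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

X_pca = U[:, :n_comps] * S[:n_comps]                 # cell coordinates
explained_variance = S**2 / (X.shape[0] - 1)
variance_ratio = explained_variance[:n_comps] / explained_variance.sum()

print(X_pca.shape, variance_ratio.round(3))
```

The rows of `Vt` play the role of the gene loadings, and `variance_ratio` is the quantity that a variance-ratio plot displays per component.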

sc.pl.pca(adata, color='total_counts')

sc.pl.pca(adata, color='Xkr4')

sc.pl.pca(adata, color='total_counts', 
          components=['1,2', '2,3', '1,3'])

# Passing projection='3d' gives a 3D plot (hard to read)
sc.pl.pca(adata, color='total_counts', projection='3d')

sc.pl.pca(adata, color='batch', projection='3d')

sc.pl.pca_variance_ratio(adata, n_pcs=50)

sc.pp.neighbors(adata)

adata.obsp['connectivities']

<Compressed Sparse Row sparse matrix of dtype 'float32'
	with 96074 stored elements and shape (4773, 4773)>
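
The neighbor graph stored in `.obsp` is sparse because each cell is connected only to its nearest neighbors. A brute-force sketch of a k-nearest-neighbor graph on simulated PCA coordinates is shown below. It assumes Euclidean distance and a binary adjacency matrix, whereas scanpy uses an approximate search and weighted connectivities.

```python
import numpy as np

rng = np.random.default_rng(0)
X_pca = rng.normal(size=(50, 10))    # simulated cells x principal components
k = 5                                # neighbors per cell (excluding the cell itself)

# Pairwise Euclidean distances between all cells.
diff = X_pca[:, None, :] - X_pca[None, :, :]
dist = np.sqrt((diff**2).sum(axis=-1))
np.fill_diagonal(dist, np.inf)       # a cell is not its own neighbor here

# Indices of the k closest cells for every cell.
knn_idx = np.argsort(dist, axis=1)[:, :k]

# Binary adjacency matrix of the directed kNN graph.
adjacency = np.zeros_like(dist, dtype=bool)
rows = np.repeat(np.arange(X_pca.shape[0]), k)
adjacency[rows, knn_idx.ravel()] = True

print(adjacency.sum(axis=1))
```

Each row of `adjacency` has exactly `k` entries, so the number of stored elements grows linearly with the number of cells rather than quadratically.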

sc.tl.tsne(adata)

sc.pl.tsne(adata, color='total_counts')

sc.tl.umap(adata)

sc.pl.umap(adata, color='total_counts')

sc.pl.umap(adata, color='batch')

sc.tl.leiden(adata, resolution=0.5, key_added='leiden_r0.5')

adata.obs

sc.pl.umap(adata,
           color=['batch', 'leiden_r0.5'],
           ncols=2,
           frameon=False)

import scvi

scvi.model.SCVI.setup_anndata(
    adata,
    layer='counts',
    batch_key='batch',
)

model = scvi.model.SCVI(adata)

# Training and saving are commented out here; a previously trained model is loaded instead.
#model.train()

#model.save('./models/scVI_model', overwrite=True)

model = scvi.model.SCVI.load('./models/scVI_model', adata=adata)

INFO     File ./models/scVI_model/model.pt already downloaded

adata.obsm['X_scVI'] = model.get_latent_representation()

adata.layers['scvi_normalized'] = model.get_normalized_expression(library_size=1e4)

adata

AnnData object with n_obs × n_vars = 4773 × 15001
    obs: 'batch', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'n_counts', 'leiden_r0.5', '_scvi_batch', '_scvi_labels'
    var: 'gene_ids', 'feature_types', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p', 'hvg', 'pca', 'batch_colors', 'neighbors', 'tsne', 'umap', 'leiden_r0.5', 'leiden_r0.5_colors', '_scvi_uuid', '_scvi_manager_uuid'
    obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_scVI'
    varm: 'PCs'
    layers: 'counts', 'scvi_normalized'
    obsp: 'distances', 'connectivities'

sc.pp.neighbors(adata, use_rep="X_scVI")
sc.tl.umap(adata)
sc.pl.umap(adata, color='batch')

sc.tl.leiden(adata, key_added="leiden_scVI", resolution=1.0)
sc.pl.umap(adata,
           color=['batch', 'leiden_scVI'],
           ncols=2,
           frameon=False)

# Compare the genes Top2a and Mcm6 to check whether dropout has been corrected.
sc.pl.scatter(adata, x='Top2a', y='Mcm6', color='leiden_scVI')

sc.pl.scatter(adata, x='Top2a', y='Mcm6', use_raw=False, layers='scvi_normalized', color='leiden_scVI')

results = []
for batch in ['F2', 'E2']:
    tmp_solo_model = scvi.external.SOLO.from_scvi_model(
        model, restrict_to_batch=batch)
    tmp_solo_model.train()
    result = tmp_solo_model.predict(soft=False)
    # For reasons that are unclear, SOLO appends "-0" to the index of its predictions, so strip it
    result.index = result.index.str.replace("-0$", "", regex=True)
    results.append(result)
results = pd.concat(results)
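
The `-0$` pattern used in the loop removes only a trailing `-0`. A short sketch with the standard `re` module shows the effect; the index values are made up, and the third one is a hypothetical name containing an internal `-0`.

```python
import re

# Illustrative SOLO-style index values with a trailing "-0".
solo_index = [
    "AAACCTGAGATGTCGG-1-E2-0",
    "TTTGTCATCAGTTCGA-1-F2-0",
    "ACGT-0-1-E2-0",
]

# "$" anchors the match to the end, so an internal "-0" is left untouched.
cleaned = [re.sub(r"-0$", "", name) for name in solo_index]
print(cleaned)
```

Anchoring the pattern matters: without `$`, every `-0` in a name would be removed and the cleaned names might no longer match `adata.obs.index`.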

INFO     Creating doublets, preparing SOLO model.

GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

Training:   0%|          | 0/400 [00:00<?, ?it/s]

Monitored metric validation_loss did not improve in the last 30 records. Best score: 0.120. Signaling Trainer to stop.
INFO     Creating doublets, preparing SOLO model.

GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

Training:   0%|          | 0/400 [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=400` reached.

results[adata.obs.index]

AAACCTGAGATGTCGG-1-E2    singlet
AAACCTGCAATCCAAC-1-E2    singlet
AAACCTGTCCAATGGT-1-E2    singlet
AAACGGGAGGCAATTA-1-E2    doublet
AAACGGGCACGGCGTT-1-E2    singlet
                          ...   
TTTGTCAAGGGCTCTC-1-F2    singlet
TTTGTCACATCGATGT-1-F2    singlet
TTTGTCAGTAGAGGAA-1-F2    doublet
TTTGTCAGTCATCCCT-1-F2    singlet
TTTGTCATCAGTTCGA-1-F2    singlet
Length: 4773, dtype: object

adata.obs['SOLO_prediction'] = results[adata.obs.index]

adata.obs['SOLO_prediction'].value_counts()

SOLO_prediction
singlet    4426
doublet     347
Name: count, dtype: int64

sc.pl.umap(adata, color='SOLO_prediction', frameon=False)

DEs = model.differential_expression(groupby='leiden_scVI')
DEs.head()

DE...:   0%|          | 0/13 [00:00<?, ?it/s]

markers = {}
for c in adata.obs['leiden_scVI'].unique():
    comparison_label = "{} vs Rest".format(c)
    cluster_df = DEs.loc[DEs['comparison'] == comparison_label]
    cluster_df = cluster_df[cluster_df['lfc_mean'] > 0]  # lfc_mean > 0: genes expressed more highly in this cluster than in the rest
    cluster_df = cluster_df[cluster_df['bayes_factor'] > 3]
    cluster_df = cluster_df[cluster_df['non_zeros_proportion1'] > 0.1]
    # Keep the first three genes that pass all filters
    markers[c] = cluster_df.index.tolist()[:3]
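
The marker selection is a chain of row filters on the differential expression table. The sketch below applies the same thresholds to a toy table; the gene names, values and the helper `top_markers` are hypothetical and only illustrate the logic.

```python
import pandas as pd

# Toy DE table; gene names and values are made up for illustration.
de = pd.DataFrame(
    {
        "comparison": ["0 vs Rest", "0 vs Rest", "0 vs Rest", "1 vs Rest"],
        "lfc_mean": [3.5, -1.2, 4.0, 2.0],
        "bayes_factor": [3.9, 4.5, 2.1, 3.5],
        "non_zeros_proportion1": [0.85, 0.60, 0.40, 0.05],
    },
    index=["GeneA", "GeneB", "GeneC", "GeneD"],
)

def top_markers(table, cluster, n=3):
    """Return up to n genes passing all filters for one cluster, in table order."""
    mask = (
        (table["comparison"] == f"{cluster} vs Rest")
        & (table["lfc_mean"] > 0)
        & (table["bayes_factor"] > 3)
        & (table["non_zeros_proportion1"] > 0.1)
    )
    return table.index[mask].tolist()[:n]

markers = {cluster: top_markers(de, cluster) for cluster in ["0", "1"]}
print(markers)
```

Which three genes are kept depends on the row order of the table; in the `DEs.head()` output of this notebook the rows appear to be ordered by decreasing `proba_de`.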

markers

{'3': ['Ccn1', 'Top2a', 'Gpx3'],
 '2': ['Nell1', 'Myo16', 'St18'],
 '0': ['Prc1', 'Pif1', 'Kif23'],
 '5': ['Kif19a', 'Egflam', 'Lhx4'],
 '8': ['Ccn1', 'Ddit4l', 'Top2a'],
 '6': ['Ccn1', 'Slfn9', 'Ccne2'],
 '11': ['Chst9', 'Pmfbp1', 'Gm32647'],
 '1': ['Thsd7b', 'D130079A08Rik', 'Cntn2'],
 '9': ['Sp8', 'Pou4f2', 'Unc5d'],
 '10': ['Lrfn5', 'Synm', 'Chl1'],
 '7': ['Gm32629', 'Krt73', 'Rgs16'],
 '4': ['Penk', 'Mybl1', 'Neurog2'],
 '12': ['Grm1', 'Fxyd7', 'Car10']}

sc.tl.dendrogram(adata, groupby='leiden_scVI', use_rep='X_scVI')

sc.pl.dotplot(
    adata,
    markers,
    groupby='leiden_scVI',
    dendrogram=True,
    color_map="Blues",
    swap_axes=True,
    use_raw=True,
    standard_scale='var')
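
With `standard_scale='var'`, each gene is rescaled to the range 0 to 1 across groups by subtracting its minimum and dividing by its maximum. A NumPy sketch on made-up per-cluster means:

```python
import numpy as np

# Mean expression per cluster (rows) and gene (columns); values are made up.
mean_expr = np.array([
    [0.0, 2.0, 10.0],
    [1.0, 2.0, 30.0],
    [4.0, 6.0, 20.0],
])

# Per gene: subtract the minimum, then divide by the (shifted) maximum.
shifted = mean_expr - mean_expr.min(axis=0)
scaled = shifted / shifted.max(axis=0)

print(scaled)
```

After scaling, colors are comparable across genes with very different expression levels, at the cost of hiding absolute magnitudes.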

sc.pl.heatmap(
    adata,
    markers,
    groupby='leiden_scVI',
    layer='scvi_normalized',
    standard_scale="var",
    dendrogram=True,
    figsize=(8, 12)
)

adata.write(filename='./data/retinal_1.h5ad')

Output of adata.var after concatenation:
	gene_ids	feature_types
Xkr4	ENSMUSG00000051951	Gene Expression
Gm1992	ENSMUSG00000089699	Gene Expression
Gm19938	ENSMUSG00000102331	Gene Expression
Gm37381	ENSMUSG00000102343	Gene Expression
Rp1	ENSMUSG00000025900	Gene Expression
...	...	...
AC124606.1	ENSMUSG00000095523	Gene Expression
AC133095.2	ENSMUSG00000095475	Gene Expression
AC133095.1	ENSMUSG00000094855	Gene Expression
AC234645.1	ENSMUSG00000095019	Gene Expression
AC149090.1	ENSMUSG00000095041	Gene Expression

Output of adata.var after highly variable gene selection:
	gene_ids	feature_types	n_cells	mt	n_cells_by_counts	mean_counts	pct_dropout_by_counts	total_counts	highly_variable	means	dispersions	dispersions_norm
Xkr4	ENSMUSG00000051951	Gene Expression	2817	False	2817	1.286673	59.762891	9008.0	True	1.159619	1.738898	1.326996
Gm1992	ENSMUSG00000089699	Gene Expression	533	False	533	0.088416	92.386802	619.0	False	0.115807	0.803469	-0.190706
Gm19938	ENSMUSG00000102331	Gene Expression	588	False	588	0.096843	91.601200	678.0	False	0.165617	0.838521	-0.025215
Rp1	ENSMUSG00000025900	Gene Expression	21	False	21	0.004142	99.700043	29.0	True	0.014691	1.690833	3.998795
Mrpl15	ENSMUSG00000033845	Gene Expression	1963	False	1963	0.359234	71.961148	2515.0	False	0.617765	0.783958	-0.605477
...	...	...	...	...	...	...	...	...	...	...	...	...
CAAA01118383.1	ENSMUSG00000063897	Gene Expression	205	False	205	0.029853	97.071847	209.0	False	0.066917	0.785096	-0.277454
Vamp7	ENSMUSG00000051412	Gene Expression	1735	False	1735	0.315241	75.217826	2207.0	False	0.534669	0.744384	-0.595847
Tmlhe	ENSMUSG00000079834	Gene Expression	738	False	738	0.117269	89.458649	821.0	False	0.200372	0.883025	0.184900
4933409K07Rik	ENSMUSG00000095552	Gene Expression	20	False	20	0.002857	99.714327	20.0	False	0.003138	0.025211	-3.865090
AC149090.1	ENSMUSG00000095041	Gene Expression	2249	False	2249	0.533210	67.876018	3733.0	False	0.692310	1.040684	0.181685

Output of adata.obs after Leiden clustering (resolution 0.5):
	batch	n_genes	n_genes_by_counts	total_counts	total_counts_mt	pct_counts_mt	n_counts	leiden_r0.5
AAACCTGAGATGTCGG-1-E2	E2	1776	1776	3776.0	112.0	2.966102	3776.0	1
AAACCTGCAATCCAAC-1-E2	E2	1697	1697	3113.0	97.0	3.115965	3113.0	1
AAACCTGTCCAATGGT-1-E2	E2	2102	2102	3925.0	91.0	2.318471	3925.0	8
AAACGGGAGGCAATTA-1-E2	E2	3440	3440	10695.0	224.0	2.094437	10695.0	4
AAACGGGCACGGCGTT-1-E2	E2	2019	2019	4322.0	228.0	5.275335	4322.0	4
...	...	...	...	...	...	...	...	...
TTTGTCAAGGGCTCTC-1-F2	F2	3009	3009	7465.0	143.0	1.915606	7465.0	9
TTTGTCACATCGATGT-1-F2	F2	2507	2507	5026.0	135.0	2.686033	5026.0	3
TTTGTCAGTAGAGGAA-1-F2	F2	2573	2573	6615.0	176.0	2.660620	6615.0	0
TTTGTCAGTCATCCCT-1-F2	F2	2389	2389	4923.0	121.0	2.457851	4923.0	5
TTTGTCATCAGTTCGA-1-F2	F2	1610	1610	2835.0	54.0	1.904762	2835.0	3

Output of DEs.head():
	proba_de	proba_not_de	bayes_factor	scale1	scale2	delta	lfc_mean	lfc_median	lfc_std	...	raw_mean1	raw_mean2	non_zeros_proportion1	non_zeros_proportion2	raw_normalized_mean1	raw_normalized_mean2	is_de_fdr_0.05	comparison	group2
Prc1	0.9806	0.0194	3.922891	0.000584	0.000093	0.25	3.560691	3.684415	1.862878	...	3.777792	0.459458	0.849850	0.185050	5.998869	0.816749	True	0 vs Rest	Rest
Pif1	0.9804	0.0196	3.912431	0.000042	0.000006	0.25	4.036641	4.243926	2.424492	...	0.201201	0.033114	0.148649	0.025810	0.314090	0.054497	True	0 vs Rest	Rest
Kif23	0.9796	0.0204	3.871609	0.000229	0.000041	0.25	3.767542	3.817360	2.176383	...	1.487993	0.194058	0.668168	0.110299	2.423138	0.367420	True	0 vs Rest	Rest
Sapcd2	0.9792	0.0208	3.851782	0.000012	0.000002	0.25	4.814266	4.931459	3.254749	...	0.039039	0.007305	0.036036	0.007305	0.053764	0.014499	True	0 vs Rest	Rest
Ccnb1	0.9792	0.0208	3.851782	0.000462	0.000095	0.25	3.523082	3.689767	2.410103	...	3.084096	0.465302	0.728228	0.154371	4.645929	0.852238	True	0 vs Rest	Rest
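
The `bayes_factor` column in the table above is consistent with the natural logarithm of the posterior odds, log(proba_de / proba_not_de). This reading is inferred from the numbers in the table rather than stated in the notebook; the check below uses the values printed for Prc1.

```python
import math

# proba_de and proba_not_de for Prc1, copied from the table above.
proba_de = 0.9806
proba_not_de = 0.0194

# Natural log of the posterior odds of differential expression.
bayes_factor = math.log(proba_de / proba_not_de)
print(round(bayes_factor, 4))

# Probability of differential expression that corresponds to a Bayes factor of 3.
threshold_proba = math.exp(3) / (1 + math.exp(3))
print(round(threshold_proba, 3))
```

The result agrees with the tabulated 3.922891 to within rounding. Under the same reading, the `bayes_factor > 3` cutoff used for marker selection corresponds to a posterior probability of differential expression of roughly 0.95.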

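scRNA-seq Analysis with Python, Part 1

Data used

Analysis steps outside Python

Downloading the data

Processing with Cell Ranger

Importing libraries

Loading the data

Preprocessing

n_genes_by_counts: number of genes with at least one count in each cell
total_counts: total number of counts in each cell
pct_counts_mt: percentage of counts that come from mitochondrial genes

Normalization

Feature selection (highly variable genes)

Dimensionality reduction

Principal component analysis (PCA)

t-distributed stochastic neighbor embedding (t-SNE)

UMAP

Clustering

Graph-based clustering: Leiden clustering

Using deep generative models

Training the scVI model (batch correction)

Dimensionality reduction with the scVI latent representation + UMAP

Clustering with the scVI latent representation + Leiden

Doublet detection

DEG analysis

Saving the data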

Output of adata.obs after concatenation:
	batch
AAACCTGAGATGTCGG-1-E2	E2
AAACCTGCAATCCAAC-1-E2	E2
AAACCTGCACATGGGA-1-E2	E2
AAACCTGCAGCAGTTT-1-E2	E2
AAACCTGGTTCCTCCA-1-E2	E2
...	...
TTTGTCACATCGATGT-1-F2	F2
TTTGTCAGTAGAGGAA-1-F2	F2
TTTGTCAGTCATCCCT-1-F2	F2
TTTGTCATCAGTTCGA-1-F2	F2
TTTGTCATCCGAATGT-1-F2	F2
