python - How to obtain random access of a gzip compressed file
According to this FAQ on zlib.net, it is possible to:
access data randomly in a compressed stream
I know about the module Bio.bgzf of Biopython 1.60, which:
supports reading and writing BGZF files (Blocked GNU Zip Format), a variant of GZIP with efficient random access, most commonly used as part of the BAM file format and in tabix. It uses Python's zlib library internally, and provides a simple interface like Python's gzip library.
But for my use case I don't want to use that format. Basically I want something that emulates the code below:
    import gzip

    large_integer_new_line_start = 10**9
    with gzip.open('large_file.gz', 'rt') as f:
        f.seek(large_integer_new_line_start)
but with the efficiency that zlib itself can offer for random access to a compressed stream. How do I leverage that random access capability in Python?
I gave up on doing random access on the gzipped file using Python. Instead I converted my gzipped file to a block-gzipped file with a block compression/decompression utility on the command line:
zcat large_file.gz | bgzip > large_file.bgz
Then I used Biopython to tell me the virtual_offset of line number 1 million of the bgzipped file. I was then able to rapidly seek to that virtual_offset afterwards:
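    from Bio import bgzf

    file = 'large_file.bgz'

    # First pass: read sequentially up to line 1 million and record where it starts.
    handle = bgzf.BgzfReader(file)
    for i in range(10**6):
        handle.readline()
    virtual_offset = handle.tell()
    line1 = handle.readline()
    handle.close()

    # Second pass: jump straight to the recorded virtual offset.
    handle = bgzf.BgzfReader(file)
    handle.seek(virtual_offset)
    line2 = handle.readline()
    handle.close()

    assert line1 == line2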
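For anyone wondering what the virtual_offset actually is: as far as I understand the BGZF format, it is a single integer that packs the start of the compressed block (upper 48 bits) together with the position inside that block's decompressed data (lower 16 bits). Below is a minimal pure-Python sketch of that arithmetic. The two function names are chosen here for illustration; Bio.bgzf ships helpers with the same names, which should be preferred in real code.

```python
def make_virtual_offset(block_start, within_block):
    """Pack a BGZF block start and an in-block offset into one integer."""
    if not 0 <= within_block < 2**16:
        raise ValueError("within-block offset must fit in 16 bits")
    if not 0 <= block_start < 2**48:
        raise ValueError("block start must fit in 48 bits")
    return (block_start << 16) | within_block


def split_virtual_offset(virtual_offset):
    """Undo make_virtual_offset: return (block_start, within_block)."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF


# Illustrative values only: a block starting at compressed byte 100000,
# and a position 10 bytes into that block's decompressed data.
voffset = make_virtual_offset(100000, 10)
block_start, within_block = split_virtual_offset(voffset)
```

This is why seeking is fast: the reader can jump to block_start in the compressed file, decompress only that one block, and skip within_block bytes.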
See also the SO answer by Mark Adler on examples/zran.c in the zlib distribution.
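If converting to BGZF is not an option, the idea behind zran.c (scan the stream once, remember resume points, then restart decompression from the nearest one) can be approximated with the standard zlib module. The sketch below is not a port of zran.c: instead of saving the 32 KiB window and bit offset to an index, it keeps in-memory copies of the decompressor made with Decompress.copy(). It assumes a single-member gzip file, and build_index, read_at, span and chunk are hypothetical names and parameters chosen for this example.

```python
import bisect
import gzip
import io
import zlib


def build_index(raw, span=1 << 20, chunk=64 * 1024):
    """Scan a single-member gzip stream once and collect checkpoints.

    Each checkpoint is (uncompressed_offset, compressed_offset, decompressor_copy),
    taken roughly every `span` uncompressed bytes.
    """
    decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)  # 16 => expect a gzip header
    checkpoints = [(0, 0, decomp.copy())]
    out_pos = 0
    raw.seek(0)
    while not decomp.eof:
        data = raw.read(chunk)
        if not data:
            break
        out_pos += len(decomp.decompress(data))
        if not decomp.eof and out_pos - checkpoints[-1][0] >= span:
            checkpoints.append((out_pos, raw.tell(), decomp.copy()))
    return checkpoints


def read_at(raw, checkpoints, offset, size, chunk=64 * 1024):
    """Return up to `size` uncompressed bytes starting at uncompressed `offset`."""
    starts = [c[0] for c in checkpoints]
    i = bisect.bisect_right(starts, offset) - 1
    out_pos, in_pos, saved = checkpoints[i]
    decomp = saved.copy()  # copy again so the stored checkpoint stays reusable
    raw.seek(in_pos)
    skip = offset - out_pos  # bytes to discard between checkpoint and target
    buf = bytearray()
    while len(buf) < skip + size and not decomp.eof:
        data = raw.read(chunk)
        if not data:
            break
        buf += decomp.decompress(data)
    return bytes(buf[skip:skip + size])
```

Usage would look like opening the file with open('large_file.gz', 'rb'), calling build_index once, and then calling read_at for each offset of interest. The trade-off is memory: every checkpoint holds a full decompressor state, so a smaller span gives faster seeks at the cost of more RAM, and the index cannot be saved to disk the way a zran-style index can.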