Package htsjdk.variant.bcf2
Class BCF2Utils
java.lang.Object
htsjdk.variant.bcf2.BCF2Utils
Common utilities for working with BCF2 files.
Includes convenience methods for encoding and decoding BCF2 type descriptors (size + type).
- Since:
- 5/12
-
Method Summary
Modifier and Type / Method / Description
- static String collapseStringList(List<String> strings)
  Collapse multiple strings into a comma-separated list: ["s1", "s2", "s3"] => ",s1,s2,s3"
- static int decodeSize(byte typeDescriptor)
- static BCF2Type decodeType(byte typeDescriptor)
- static int decodeTypeID(byte typeDescriptor)
- static BCF2Type determineIntegerType(int value)
- static BCF2Type determineIntegerType(int[] values)
- static BCF2Type determineIntegerType(List<Integer> values)
- static byte encodeTypeDescriptor(int nElements, BCF2Type type)
- explodeStringList(String collapsed)
  Inverse operation of collapseStringList.
- static boolean headerLinesAreOrderedConsistently(VCFHeader outputHeader, VCFHeader genotypesBlockHeader)
  Are the elements and their order in the output and input headers consistent, so that we can write out the raw genotypes block without decoding and recoding it? If the order of INFO, FILTER, or contig elements in the output header differs from that in the input header, we must decode the blocks using the input header and then recode them based on the new output order.
- static boolean isCollapsedString
- makeDictionary(VCFHeader header)
  Create a strings dictionary from the VCF header. The dictionary is an ordered list of common VCF identifiers (FILTER, INFO, and FORMAT fields).
- static BCF2Type maxIntegerType(BCF2Type t1, BCF2Type t2)
  Returns the maximum BCF2 integer size of t1 and t2. For example, if t1 == INT8 and t2 == INT16, returns INT16.
- static byte readByte(InputStream stream)
- static final File shadowBCF
  Returns a good name for a shadow BCF file for vcfFile.
- static boolean sizeIsOverflow(byte typeDescriptor)
- static <T> List<T> toList
  Helper function that takes an object and returns a list representation of it: o == null => []; o is a list => o; else => [o]
-
Field Details
-
MAX_ALLELES_IN_GENOTYPES
public static final int MAX_ALLELES_IN_GENOTYPES
-
OVERFLOW_ELEMENT_MARKER
public static final int OVERFLOW_ELEMENT_MARKER
-
MAX_INLINE_ELEMENTS
public static final int MAX_INLINE_ELEMENTS
-
INTEGER_TYPES_BY_SIZE
-
ID_TO_ENUM
-
-
Method Details
-
makeDictionary
Create a strings dictionary from the VCF header. The dictionary is an ordered list of common VCF identifiers (FILTER, INFO, and FORMAT fields). Note that it is critical that the list be deduplicated and sorted in a consistent manner each time: BCF2 offsets are encoded relative to this dictionary, so if it is not determined exactly the same way as in the header each time, the offsets no longer resolve to the intended strings.
- Parameters:
header - the VCFHeader from which to build the dictionary
- Returns:
- a non-null dictionary of elements, may be empty
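The following is a minimal sketch of why a deduplicated, consistently ordered dictionary matters. It does not use htsjdk: the class DictionarySketch and its input (a plain list of header IDs) are hypothetical, and first-seen order is an assumption standing in for whatever ordering the real method applies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical illustration, not the htsjdk implementation.
final class DictionarySketch {
    // Deduplicate while keeping first-seen order, so the same input
    // always yields the same dictionary (assumed ordering rule).
    static List<String> makeDictionary(List<String> idsInHeaderOrder) {
        return new ArrayList<>(new LinkedHashSet<>(idsInHeaderOrder));
    }

    public static void main(String[] args) {
        List<String> dict = makeDictionary(Arrays.asList("PASS", "DP", "q10", "DP"));
        // An encoded offset is just an index into this list.
        System.out.println(dict + " offset of q10 = " + dict.indexOf("q10"));
    }
}
```

If a reader rebuilt the dictionary in a different order, the same offset would select a different identifier, which is the failure the note above warns about.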
-
encodeTypeDescriptor
-
decodeSize
public static int decodeSize(byte typeDescriptor)
decodeTypeID
public static int decodeTypeID(byte typeDescriptor)
decodeType
-
sizeIsOverflow
public static boolean sizeIsOverflow(byte typeDescriptor)
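As a rough illustration of what a "size + type" descriptor byte looks like, the sketch below packs the element count into the high four bits and a type ID into the low four bits, which is the layout described by the BCF2 specification. It is a standalone sketch, not the htsjdk source: the class TypeDescriptorSketch, the use of a plain int type ID instead of BCF2Type, and the marker value 15 are assumptions made for the example.

```java
// Standalone sketch of the descriptor byte layout; not the htsjdk source.
final class TypeDescriptorSketch {
    // Assumed value: a size nibble of 15 means the real length is stored separately.
    static final int OVERFLOW_MARKER = 15;

    static byte encode(int nElements, int typeId) {
        // Counts that do not fit in four bits are stored as the overflow marker.
        int size = Math.min(nElements, OVERFLOW_MARKER);
        return (byte) ((size << 4) | (typeId & 0x0F));
    }

    static int decodeSize(byte descriptor) {
        return (descriptor & 0xF0) >> 4;   // high nibble
    }

    static int decodeTypeId(byte descriptor) {
        return descriptor & 0x0F;          // low nibble
    }

    static boolean sizeIsOverflow(byte descriptor) {
        return decodeSize(descriptor) == OVERFLOW_MARKER;
    }

    public static void main(String[] args) {
        byte d = encode(3, 1); // three elements of the (illustrative) type ID 1
        System.out.println(decodeSize(d) + " " + decodeTypeId(d) + " " + sizeIsOverflow(d));
        // prints "3 1 false"
    }
}
```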
readByte
- Throws:
IOException
-
collapseStringList
Collapse multiple strings into a comma-separated list, with a leading comma: ["s1", "s2", "s3"] => ",s1,s2,s3"
- Parameters:
strings - list of strings, size > 1
-
explodeStringList
Inverse operation of collapseStringList: ",s1,s2,s3" => ["s1", "s2", "s3"]
- Parameters:
collapsed
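The round trip between the two forms can be sketched as follows. This is a standalone illustration rather than the library code; the class name StringListSketch is hypothetical, and treating a leading comma as the test for a collapsed string is an assumption based on the examples above.

```java
import java.util.Arrays;
import java.util.List;

// Standalone sketch of the collapse/explode round trip; not the htsjdk source.
final class StringListSketch {
    // ["s1", "s2", "s3"] => ",s1,s2,s3"
    static String collapse(List<String> strings) {
        StringBuilder sb = new StringBuilder();
        for (String s : strings) {
            sb.append(',').append(s);
        }
        return sb.toString();
    }

    // Assumption: a collapsed string is recognized by its leading comma.
    static boolean isCollapsed(String s) {
        return !s.isEmpty() && s.charAt(0) == ',';
    }

    // ",s1,s2,s3" => ["s1", "s2", "s3"]
    static List<String> explode(String collapsed) {
        if (!isCollapsed(collapsed)) {
            throw new IllegalArgumentException("not a collapsed string: " + collapsed);
        }
        // Drop the leading comma, then split on the remaining separators.
        return Arrays.asList(collapsed.substring(1).split(","));
    }

    public static void main(String[] args) {
        String c = collapse(Arrays.asList("s1", "s2", "s3"));
        System.out.println(c + " -> " + explode(c));
    }
}
```

In this sketch, elements that themselves contain commas do not survive the round trip, because the separator is not escaped.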
-
isCollapsedString
-
shadowBCF
Returns a good name for a shadow BCF file for vcfFile:
foo.vcf => foo.bcf
foo.xxx => foo.xxx.bcf
If the resulting BCF file cannot be written, returns null. This happens, for example, when vcfFile = /dev/null.
- Parameters:
vcfFile
- Returns:
- the BCF file, or null if it cannot be written
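The naming rule alone can be sketched on plain strings. The helper below is hypothetical (ShadowNameSketch.shadowName is not part of htsjdk), it works on path strings rather than File objects, and it deliberately omits the writability check that makes the real method return null.

```java
// Hypothetical sketch of the naming rule only; the writability check is omitted.
final class ShadowNameSketch {
    private static final String VCF_EXT = ".vcf";
    private static final String BCF_EXT = ".bcf";

    static String shadowName(String vcfPath) {
        if (vcfPath.endsWith(VCF_EXT)) {
            // foo.vcf => foo.bcf
            return vcfPath.substring(0, vcfPath.length() - VCF_EXT.length()) + BCF_EXT;
        }
        // foo.xxx => foo.xxx.bcf
        return vcfPath + BCF_EXT;
    }

    public static void main(String[] args) {
        System.out.println(shadowName("foo.vcf")); // prints "foo.bcf"
        System.out.println(shadowName("foo.xxx")); // prints "foo.xxx.bcf"
    }
}
```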
-
determineIntegerType
-
determineIntegerType
-
maxIntegerType
Returns the maximum BCF2 integer size of t1 and t2. For example, if t1 == INT8 and t2 == INT16, returns INT16.
- Parameters:
t1
t2
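One way to picture maxIntegerType together with the determineIntegerType overloads is the sketch below. It is not the htsjdk code: IntType is a hypothetical stand-in for BCF2Type, and the value ranges are an assumption (each width is taken to exclude its most negative value, on the premise that it is reserved as a sentinel).

```java
// Hypothetical sketch; IntType stands in for BCF2Type and the ranges are assumed.
final class IntegerTypeSketch {
    // Declared from narrowest to widest so ordinal() reflects size.
    enum IntType { INT8, INT16, INT32 }

    // The wider of the two types, e.g. max(INT8, INT16) == INT16.
    static IntType max(IntType t1, IntType t2) {
        return t1.ordinal() >= t2.ordinal() ? t1 : t2;
    }

    // Smallest type that can hold a single value (assumed ranges, see lead-in).
    static IntType determine(int value) {
        if (value > Byte.MIN_VALUE && value <= Byte.MAX_VALUE) return IntType.INT8;
        if (value > Short.MIN_VALUE && value <= Short.MAX_VALUE) return IntType.INT16;
        return IntType.INT32;
    }

    // Smallest type that can hold every value in the array.
    static IntType determine(int[] values) {
        IntType widest = IntType.INT8;
        for (int v : values) {
            widest = max(widest, determine(v));
            if (widest == IntType.INT32) break; // cannot get any wider
        }
        return widest;
    }

    public static void main(String[] args) {
        System.out.println(determine(new int[] {1, 300})); // prints "INT16"
    }
}
```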
-
determineIntegerType
-
toList
Helper function that takes an object and returns a list representation of it:
o == null => []
o is a list => o
else => [o]
- Parameters:
c - the class of the object
o - the object to convert to a Java List
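The three cases can be sketched as below. This is an illustration, not the library source: the class ToListSketch is hypothetical, and the parameter types Class<T> and Object are assumptions inferred from the parameter descriptions.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative sketch of the three cases; parameter types are assumed.
final class ToListSketch {
    @SuppressWarnings("unchecked")
    static <T> List<T> toList(Class<T> c, Object o) {
        if (o == null) {
            return Collections.emptyList();            // o == null => []
        }
        if (o instanceof List) {
            return (List<T>) o;                        // o is a list => o (element types are not checked)
        }
        return Collections.singletonList(c.cast(o));   // else => [o]
    }

    public static void main(String[] args) {
        System.out.println(toList(String.class, null));
        System.out.println(toList(String.class, Arrays.asList("a", "b")));
        System.out.println(toList(String.class, "a"));
    }
}
```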
-
headerLinesAreOrderedConsistently
public static boolean headerLinesAreOrderedConsistently(VCFHeader outputHeader, VCFHeader genotypesBlockHeader)
Are the elements and their order in the output and input headers consistent, so that we can write out the raw genotypes block without decoding and recoding it?
If the order of INFO, FILTER, or contig elements in the output header differs from that in the input header, we must decode the blocks using the input header and then recode them based on the new output order. If they are consistent, we can simply pass through the raw genotypes block bytes, which is a huge performance win for large blocks.
Many common operations on BCF2 files (merging them for -nt, selecting a subset of records, etc.) don't modify the ordering of the header fields and so can safely pass through the genotypes undecoded. Some operations, namely those that add filters or info fields, can change the ordering of the header fields and so produce invalid BCF2 files if the genotypes aren't decoded.
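The pass-through decision described here can be sketched in a simplified form. Everything in the block is hypothetical: each header is reduced to an ordered list of ID strings, consistency is taken to mean that the two lists are equal (the real check on VCFHeader objects may be more permissive), and recode is a placeholder for the decode-and-recode step.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Simplified, hypothetical model of the pass-through decision.
final class HeaderOrderSketch {
    // Assumption: headers are consistent when their ID lists match element by element.
    static boolean orderedConsistently(List<String> outputIds, List<String> inputIds) {
        return outputIds.equals(inputIds);
    }

    // Pass the raw bytes through when the ordering matches; otherwise recode.
    static byte[] genotypesToWrite(byte[] rawBlock, List<String> outputIds,
                                   List<String> inputIds, UnaryOperator<byte[]> recode) {
        return orderedConsistently(outputIds, inputIds) ? rawBlock : recode.apply(rawBlock);
    }

    public static void main(String[] args) {
        List<String> in = Arrays.asList("DP", "AF");
        List<String> reordered = Arrays.asList("AF", "DP");
        System.out.println(orderedConsistently(in, in));        // prints "true"
        System.out.println(orderedConsistently(reordered, in)); // prints "false"
    }
}
```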
-