Interface HtsCodec<D extends HtsDecoderOptions,E extends HtsEncoderOptions>
- Type Parameters:
D - the decoder options type for this codec
E - the encoder options type for this codec
- All Superinterfaces:
Upgradeable
- All Known Subinterfaces:
HaploidReferenceCodec, ReadsCodec, VariantsCodec
- All Known Implementing Classes:
BAMCodec, BAMCodecV1_0, CRAMCodec, CRAMCodecV2_1, CRAMCodecV3_0, CRAMCodecV3_1, FASTACodecV1_0, HtsgetBAMCodec, HtsgetBAMCodecV1_2, SAMCodec, SAMCodecV1_0, VCFCodec, VCFCodecV3_2, VCFCodecV3_3, VCFCodecV4_0, VCFCodecV4_1, VCFCodecV4_2, VCFCodecV4_3
Base interface implemented by all htsjdk.beta.plugin codecs.
Codec Components
Each version of a file format supported by the htsjdk.beta.plugin framework is
represented by a trio of components:
- a codec that implements HtsCodec
- an encoder that implements HtsEncoder
- a decoder that implements HtsDecoder
The HtsCodec is a lightweight and long-lived object that resides in an
HtsCodecRegistry. A registry is used to resolve requests for
an HtsEncoder or HtsDecoder that matches a given resource. The HtsEncoder
and HtsDecoder objects do the work of actually writing and reading records to and from
underlying resources.
A default, static, immutable HtsCodecRegistry is populated with
HtsCodecs that are discovered and instantiated statically via a ServiceLoader,
and can be accessed using HtsDefaultRegistry. A private, mutable
registry can be created at runtime via HtsCodecRegistry.createPrivateRegistry(), and populated
dynamically by calls to HtsCodecRegistry.registerCodec(HtsCodec).
The primary responsibility of an HtsCodec is to satisfy requests made by the framework during
codec resolution, inspecting and recognizing input URIs and stream resources that match the
supported format and version, and providing an HtsEncoder or HtsDecoder on demand, once
a match is made.
Content Types
The plugin framework supports four different types of HTS data, called content types:
- HtsContentType.ALIGNED_READS
- HtsContentType.HAPLOID_REFERENCE
- HtsContentType.VARIANT_CONTEXTS
- HtsContentType.FEATURES
For each content type, there is a corresponding set of codec/decoder/encoder interfaces that
are implemented by components that support that content type. These interfaces extend generic base
interfaces, and provide generic parameter type instantiations appropriate for that content type.
As an example, see ReadsDecoder which defines the interface for
all HtsDecoders for the HtsContentType.ALIGNED_READS content
type. The different implementations of component trios for a given content type all use the same
content-type-specific interfaces, but each over a different combination of underlying file format
and version.
The generic, base interfaces that are common to all codecs, encoders, and decoders are:
- HtsCodec: base codec interface
- HtsEncoder: base encoder interface
- HtsEncoderOptions: base options interface for encoders
- HtsDecoder: base decoder interface
- HtsDecoderOptions: base options interface for decoders
- a class with string constants for each supported file format for that type
- Bundle: an optional type-specific Bundle implementation
The packages containing the content type-specific interface definitions for each of the four different content types are:
- For HtsContentType.ALIGNED_READS codecs, see the htsjdk.beta.plugin.reads package
- For HtsContentType.HAPLOID_REFERENCE codecs, see the htsjdk.beta.plugin.hapref package
- For HtsContentType.VARIANT_CONTEXTS codecs, see the htsjdk.beta.plugin.variants package
- For HtsContentType.FEATURES codecs, see the htsjdk.beta.plugin.features package
Example Content Type: Reads
As an example, the htsjdk.beta.plugin.reads package defines the following interfaces
that extend the generic base interfaces for codecs with content type HtsContentType.ALIGNED_READS:
- ReadsCodec: reads codec interface, extends the generic HtsCodec interface
- ReadsEncoder: reads encoder interface, extends the generic HtsEncoder interface
- ReadsEncoderOptions: reads encoder options, extends the generic HtsEncoderOptions interface
- ReadsDecoder: reads decoder interface, extends the generic HtsDecoder interface
- ReadsDecoderOptions: reads decoder options, extends the generic HtsDecoderOptions interface
- ReadsFormats: a class with string constants for each possible supported reads file format
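As an illustration of how a content-type-specific interface can fix the generic parameters of a base interface, here is a minimal, self-contained sketch. All of the `Sketch*` types, and the `describeDecoder` method, are hypothetical stand-ins invented for this example. They are not the htsjdk interfaces and omit nearly all of the real methods.

```java
// Hypothetical stand-ins for the generic base interfaces (not the htsjdk types).
interface SketchDecoderOptions {}
interface SketchEncoderOptions {}

interface SketchCodec<D extends SketchDecoderOptions, E extends SketchEncoderOptions> {
    String getFileFormat();
    // Hypothetical method, used only to show that D is visible to implementations.
    String describeDecoder(D decoderOptions);
}

// Content-type-specific options, analogous in spirit to the reads options interfaces.
final class SketchReadsDecoderOptions implements SketchDecoderOptions {
    final boolean eagerlyDecode;
    SketchReadsDecoderOptions(boolean eagerlyDecode) { this.eagerlyDecode = eagerlyDecode; }
}
final class SketchReadsEncoderOptions implements SketchEncoderOptions {}

// The content-type-specific codec interface instantiates the generic parameters once...
interface SketchReadsCodec
        extends SketchCodec<SketchReadsDecoderOptions, SketchReadsEncoderOptions> {}

// ...so every format/version implementation sees the reads-specific option types.
final class SketchBamLikeCodec implements SketchReadsCodec {
    @Override public String getFileFormat() { return "BAM"; }
    @Override public String describeDecoder(SketchReadsDecoderOptions decoderOptions) {
        return getFileFormat() + " decoder, eager=" + decoderOptions.eagerlyDecode;
    }
}
```

A second implementation for another format or version would implement the same `SketchReadsCodec` interface, which is the sense in which component trios for one content type share interfaces.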
Codec Resolution
The plugin framework uses registered codecs to conduct a series of probes into the structure and format of an input or output resource in order to find a matching codec that can produce an encoder or decoder for that resource. The framework uses the values returned from the codec methods to prune the list of candidate codecs until a match is found. During codec resolution, the codec methods are called in a defined order. See the HtsCodecResolver methods for more detail on the resolution protocol:
- HtsCodecResolver.resolveForDecoding(Bundle)
- HtsCodecResolver.resolveForEncoding(Bundle)
- HtsCodecResolver.resolveForEncoding(Bundle, HtsVersion)
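The pruning idea can be sketched with a simplified model. `MiniCodec` and `MiniResolver` below are hypothetical types invented for illustration. They are not the HtsCodecResolver implementation, which handles versions, bundles, and error reporting that this sketch ignores. The sketch assumes only what the method descriptions on this page state: codecs that own a URI short-circuit stream probing, and otherwise candidates are filtered first by URI and then by signature.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical, simplified stand-in for a codec; not the htsjdk HtsCodec interface.
interface MiniCodec {
    String name();
    boolean ownsURI(URI uri);
    boolean canDecodeURI(URI uri);
    boolean canDecodeSignature(byte[] probe);
}

final class MiniResolver {
    // Prune the registered codecs down to the names of those that claim the resource.
    static List<String> resolve(List<MiniCodec> registered, URI uri, byte[] streamPrefix) {
        List<MiniCodec> owners = registered.stream()
                .filter(c -> c.ownsURI(uri))
                .collect(Collectors.toList());
        List<MiniCodec> candidates;
        if (!owners.isEmpty()) {
            // A custom URI handler claimed the resource: skip stream probing entirely.
            candidates = owners;
        } else {
            // Otherwise filter on the URI first, then on the stream signature.
            candidates = registered.stream()
                    .filter(c -> c.canDecodeURI(uri))
                    .filter(c -> c.canDecodeSignature(streamPrefix))
                    .collect(Collectors.toList());
        }
        return candidates.stream().map(MiniCodec::name).collect(Collectors.toList());
    }

    // Builds a file-style codec that matches on an extension and a plaintext magic prefix.
    static MiniCodec fileCodec(String name, String extension, String magic) {
        return new MiniCodec() {
            @Override public String name() { return name; }
            @Override public boolean ownsURI(URI uri) { return false; }
            @Override public boolean canDecodeURI(URI uri) {
                return uri.getPath() != null && uri.getPath().endsWith(extension);
            }
            @Override public boolean canDecodeSignature(byte[] probe) {
                return new String(probe, StandardCharsets.US_ASCII).startsWith(magic);
            }
        };
    }
}
```

With two codecs registered for the same hypothetical extension but different magic prefixes, the URI filter keeps both and the signature filter selects one.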
Formats That Use a Custom URI or Protocol Scheme
Many file formats consist of a single file that resides on a file system that is supported by a
java.nio file system provider. Codecs that support such formats are generally agnostic
about the IOPath or URI protocol scheme used to identify their resources, and assume that file contents
can be accessed directly via a single stream created via a java.nio file system provider.
However, some file formats use a specific, well-known URI format or protocol scheme, often
to identify a remote or otherwise specially-formatted resource, such as a local database
that is distributed across multiple physical files. Codecs for these formats may bypass direct
java.nio file system access, and instead use specialized code to access their underlying resources.
For example, the BAMCodecV1_0 assumes that IOPath
resources can be accessed as a stream on a single file via either the "file://" protocol, or
other protocols such as gs:// or hdfs:// that have java.nio file system providers. It does
not require or assume a particular URI format, and is agnostic about URI scheme.
In contrast, the HtsgetBAMCodecV1_2 codec
is a specialized codec that handles remote resources via the "http://" protocol.
It uses http to access the underlying resource, and bypasses direct java.nio
file system access.
Codecs for formats that use a custom URI format or protocol scheme such as htsget must be
able to determine if they can decode or encode a resource purely by inspecting the IOPath/URI, and
should follow these guidelines:
- return true when ownsURI(IOPath) is presented with an IOPath with a conforming URI
- return true when canDecodeURI(IOPath) is presented with an IOPath with a conforming URI
- ensure that for a given IOPath, ownsURI(IOPath) == canDecodeURI(IOPath)
- always return 0 from the getSignatureProbeLength() method
- always return 0 from the getSignatureLength() method
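A minimal sketch of a codec that follows these guidelines is shown below. The class name, the "exampleget" scheme, and the path rule are all invented for illustration. The class operates on java.net.URI rather than IOPath so that it is self-contained, and it does not implement the real HtsCodec interface.

```java
import java.net.URI;

// Hypothetical custom-URI codec sketch; the scheme and path rule are assumptions.
final class CustomUriCodecSketch {
    // Assumed convention: an invented "exampleget" scheme with a non-empty path.
    // Only the URI text is examined; no connection is opened.
    private static boolean conforms(URI uri) {
        return "exampleget".equalsIgnoreCase(uri.getScheme())
                && uri.getPath() != null
                && uri.getPath().length() > 1;
    }

    // Both URI methods delegate to one predicate, so they can never disagree.
    boolean ownsURI(URI uri) { return conforms(uri); }
    boolean canDecodeURI(URI uri) { return conforms(uri); }

    // No stream signature is probed for custom URI handlers.
    int getSignatureProbeLength() { return 0; }
    int getSignatureLength() { return 0; }
}
```

Routing both methods through a single private predicate is one simple way to guarantee that ownsURI and canDecodeURI return the same value for a given URI.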
Codec Implementation Guidelines
- An HtsCodec class should implement only a single version of a single file format.
- HtsCodec instances may be shared across multiple registries, and should generally be immutable (HtsEncoder and HtsDecoder implementations may be mutable).
- For file formats that use a separate index resource to handle index queries, the getDecoder(Bundle, HtsDecoderOptions) implementation should not attempt to automatically resolve the companion index in order to satisfy index queries if the index resource is not provided in the input bundle. HtsDecoders for such file formats should only satisfy index queries if the input bundle explicitly specifies the index resource. For file formats that do not require a separate index resource to be specified (such as those that rely on a remote access mechanism), it is permissible to satisfy index queries without requiring the index resource to be included in the bundle.
- Codecs should avoid throwing exceptions from methods used during codec resolution (which includes all methods other than getDecoder(Bundle, HtsDecoderOptions) and getEncoder(Bundle, HtsEncoderOptions)).
-
Method Summary
| Modifier and Type | Method | Description |
| --- | --- | --- |
| boolean | canDecodeSignature(SignatureStream signatureStream, String sourceName) | Determine if the codec can decode an input stream by inspecting a signature embedded within the stream. |
| boolean | canDecodeURI(IOPath ioPath) | Determine if the URI for ioPath (obtained via IOPath.getURI()) conforms to the expected URI format for this codec's file format. |
| HtsContentType | getContentType() | Get the HtsContentType for this codec. |
| HtsDecoder<?, ? extends HtsRecord> | getDecoder(Bundle inputBundle, D decoderOptions) | Get an HtsDecoder to decode the provided inputs. |
| default String | getDisplayName() | Get a user-friendly display name for this codec. |
| HtsEncoder<?, ? extends HtsRecord> | getEncoder(Bundle outputBundle, E encoderOptions) | Get an HtsEncoder to encode to the provided outputs. |
| String | getFileFormat() | Get the name of the file format supported by this codec. |
| int | getSignatureLength() | Get the number of bytes in the format and version signature used by the file format supported by this codec. |
| default int | getSignatureProbeLength() | Get the number of bytes needed by this codec to probe an input stream for a format/version signature, and determine if it can supply a decoder for the stream. |
| HtsVersion | getVersion() | Get the version of the file format returned by getFileFormat() that is supported by this codec. |
| default boolean | ownsURI(IOPath ioPath) | Determine if this codec "owns" the URI contained in ioPath (see IOPath.getURI()). |

Methods inherited from interface htsjdk.beta.plugin.Upgradeable
runVersionUpgrade
-
Method Details
-
getContentType
HtsContentType getContentType()
Get the HtsContentType for this codec.
- Returns:
the HtsContentType for this codec. The HtsContentType determines the interfaces, including the HEADER and RECORD types, used by this codec's HtsEncoder and HtsDecoder. Each implementation of a given content type exposes the same interfaces, but over a different file format or version. For example, both the BAM and HTSGET_BAM codecs have content type HtsContentType.ALIGNED_READS and are derived from ReadsCodec, but the serialized file formats and access mechanisms for the two codecs are different.
-
getFileFormat
String getFileFormat()
Get the name of the file format supported by this codec. The format name defines the underlying format handled by this codec, and also corresponds to the format of the primary bundle resource that is required when decoding or encoding (see BundleResourceType and BundleResource.getFileFormat()).
- Returns:
the name of the underlying file format handled by this codec
-
getVersion
HtsVersion getVersion()
Get the version of the file format returned by getFileFormat() that is supported by this codec.
- Returns:
the file format version (HtsVersion) supported by this codec
-
getDisplayName
default String getDisplayName()
Get a user-friendly display name for this codec. It is recommended that the display name minimally include both the name of the supported file format and the supported version.
- Returns:
a user-friendly display name for this codec
-
ownsURI
default boolean ownsURI(IOPath ioPath)
Determine if this codec "owns" the URI contained in ioPath (see IOPath.getURI()).
A codec "owns" the URI only if it has specific requirements on the URI protocol scheme, URI format, or query parameters that go beyond a simple file extension, AND it explicitly recognizes the URI as conforming to those requirements. File formats that only require a specific file extension should always return false from ownsURI(IOPath), and should instead use the extension as a filter in canDecodeURI(IOPath).
Returning true from this method will cause the framework to bypass the stream-oriented signature probing that is used to resolve inputs to a codec handler. During codec resolution, if any registered codec returns true for this method on ioPath, the signature probing protocol will instead:
- immediately prune the list of candidate codecs to only those that return true for this method on ioPath
- not attempt to obtain an InputStream on the IOPath containing the URI, on the assumption that special handling is required in order to access the underlying resource (e.g., an htsget codec would claim an "http://" URI if the rest of the URI conforms to the expected format for that codec's protocol)
Any codec that returns true from ownsURI(IOPath) for a given IOPath must also return true from canDecodeURI(IOPath) for the same IOPath. For custom URI handlers, codecs should avoid making remote calls to determine the suitability or accessibility of the input resource; the return value for this method should be based only on the format of the URI that is presented. Operations that require remote access and can fail, such as validating server connectivity, authentication, or authorization, should be deferred until data is requested by the caller via the codec's HtsEncoder or HtsDecoder. Since this method is used during codec resolution, implementations should avoid calling methods that may throw exceptions.
- Parameters:
ioPath - the ioPath to inspect
- Returns:
true if the ioPath's URI represents a custom URI that this codec handles
-
canDecodeURI
boolean canDecodeURI(IOPath ioPath)
Determine if the URI for ioPath (obtained via IOPath.getURI()) conforms to the expected URI format for this codec's file format. Most implementations only look at the file extension (see IOPath.hasExtension(java.lang.String)). For codecs that implement formats that use specific, well-known file extensions, the codec should reject inputs that do not conform to any of the accepted extensions. If the format does not use a specific extension, or if the codec cannot determine if it can decode the underlying resource without inspecting the underlying stream, it is safe to return true, after which the framework will call this codec's canDecodeSignature(SignatureStream, String) method, at which time the codec can inspect the actual underlying stream via the SignatureStream.
Implementations should generally not inspect the URI's protocol scheme unless the file format supported by the codec requires the use of a specific protocol scheme. For codecs that do own a specific scheme or URI format, the return values for ownsURI(IOPath) and canDecodeURI(IOPath) must always be the same (both true or both false) for a given IOPath. For codecs that do not use a custom URI (and rely on NIO access), ownsURI(IOPath) should always return false, with only the value returned from canDecodeURI(IOPath) varying based on features such as file extension probes.
It is never safe to attempt to directly inspect the underlying stream for ioPath in this method. If the stream needs to be inspected, it should be done using the signature stream when the canDecodeSignature(SignatureStream, String) method is called.
For custom URI handlers (see ownsURI(IOPath)), codecs should avoid making remote calls to determine the suitability of the input resource; the return value for this method should be based only on the format of the URI that is presented. Since this method is used during codec resolution, implementations should avoid calling methods that may throw exceptions.
- Parameters:
ioPath - to be decoded
- Returns:
true if the codec can provide a decoder for this URI
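An extension-only URI check might look like the following sketch. The class name and the ".xyz" extensions are hypothetical, and java.net.URI is used in place of IOPath so that the example is self-contained. A real codec would more likely call IOPath.hasExtension(String).

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;

final class ExtensionFilterSketch {
    // Hypothetical accepted extensions for an imaginary file format.
    private static final List<String> EXTENSIONS = Arrays.asList(".xyz", ".xyz.gz");

    // Looks only at the URI text: no stream is opened and no remote call is made.
    // The protocol scheme is deliberately not inspected.
    static boolean canDecodeURI(URI uri) {
        String path = uri.getPath();
        if (path == null) {
            return false; // opaque URIs (for example "mailto:...") have no path component
        }
        return EXTENSIONS.stream().anyMatch(path::endsWith);
    }
}
```

Because the scheme is ignored, the same check accepts a conforming path whether it is reached through "file://" or through another scheme backed by a java.nio file system provider.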
-
canDecodeSignature
boolean canDecodeSignature(SignatureStream signatureStream, String sourceName)
Determine if the codec can decode an input stream by inspecting a signature embedded within the stream. The signatureStream will contain only a fragment of the actual input stream, taken from the start of the stream, the size of which will be the lesser of:
- the number of bytes returned by getSignatureProbeLength()
- the entire input stream, for streams that are smaller than getSignatureProbeLength()
Codecs that handle custom URIs that reference remote resources (those that return true for ownsURI(IOPath)) should generally not inspect the stream, and should return false from this method, since the method will never be called with any resource for which ownsURI(IOPath) returned true. Since this method is used during codec resolution, implementations should avoid calling methods that may throw exceptions.
- Parameters:
signatureStream - the stream to be inspected for the resource's embedded signature and version
sourceName - a display name describing the source of the input stream, for use in error messages
- Returns:
true if this codec recognizes the stream by its signature, and can provide a decoder to decode the stream, otherwise false
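A signature check over the probe fragment could be sketched as follows. The class name and the "XYZ1" signature are invented for an imaginary uncompressed format, and a plain byte array stands in for the SignatureStream.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class SignatureCheckSketch {
    // Hypothetical plaintext signature for an imaginary format, version 1.
    private static final byte[] SIGNATURE = "XYZ1".getBytes(StandardCharsets.US_ASCII);

    static int getSignatureLength() { return SIGNATURE.length; }

    // The probe can be shorter than the signature when the whole input is tiny,
    // so the length is checked first and "no match" is reported without throwing.
    static boolean canDecodeSignature(byte[] probe) {
        if (probe == null || probe.length < SIGNATURE.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(probe, SIGNATURE.length), SIGNATURE);
    }
}
```

Returning false for a short or unrecognized fragment, instead of throwing, keeps the method consistent with the guidance on exceptions during codec resolution.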
-
getSignatureLength
int getSignatureLength()
Get the number of bytes in the format and version signature used by the file format supported by this codec.
- Returns:
if the file format supported by this codec is not remote, and is accessible via a local file or stream, the size of the unique signature/version for this file format; otherwise 0.
Note: Codecs that are custom URI handlers (those that return true for ownsURI(IOPath)) should always return 0 from this method. Since this method is used during codec resolution, implementations should avoid calling methods that may throw exceptions.
-
getSignatureProbeLength
default int getSignatureProbeLength()
Get the number of bytes needed by this codec to probe an input stream for a format/version signature, and determine if it can supply a decoder for the stream.
- Returns:
the number of bytes this codec must consume from a stream in order to determine whether it can decode that stream. This number may differ from the actual signature size as returned by getSignatureLength() for codecs that support compressed or encrypted streams, since they may require a larger and more semantically meaningful input fragment (such as an entire encrypted or compressed block) in order to inspect the plaintext signature.
Therefore signatureProbeLength should be expressed in "compressed/encrypted" space rather than "plaintext" space. The length returned from this method is used to determine the size of the SignatureStream that is subsequently passed to canDecodeSignature(SignatureStream, String).
Note: Codecs that are custom URI handlers (those that return true for ownsURI(IOPath)) should always return 0 from this method. Since this method is used during codec resolution, implementations should avoid calling methods that may throw exceptions.
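The difference between the probe length and the signature length can be illustrated with a gzip-compressed stream. In this sketch, the class name, the "XYZ1" signature, and the 64 KiB probe size are assumptions chosen for illustration, and byte arrays stand in for the SignatureStream. The plaintext signature is 4 bytes, but the codec needs enough compressed bytes to decompress it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class CompressedProbeSketch {
    // Hypothetical plaintext signature of an imaginary compressed format.
    private static final byte[] SIGNATURE = "XYZ1".getBytes(StandardCharsets.US_ASCII);
    // Assumed probe size, in compressed space: room for one whole compressed block.
    static final int SIGNATURE_PROBE_LENGTH = 64 * 1024;

    // Helper that produces sample input for the sketch.
    static byte[] gzip(byte[] plain) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(plain);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // The fragment handed to the codec: at most SIGNATURE_PROBE_LENGTH compressed bytes.
    static byte[] probeFragment(byte[] compressedStream) {
        int length = Math.min(compressedStream.length, SIGNATURE_PROBE_LENGTH);
        return Arrays.copyOf(compressedStream, length);
    }

    // Decompresses just enough of the probe to compare against the plaintext signature.
    static boolean canDecodeSignature(byte[] probe) {
        byte[] plain = new byte[SIGNATURE.length];
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(probe))) {
            int read = 0;
            while (read < plain.length) {
                int n = gz.read(plain, read, plain.length - read);
                if (n < 0) {
                    return false; // decompressed data is shorter than the signature
                }
                read += n;
            }
            return Arrays.equals(plain, SIGNATURE);
        } catch (IOException e) {
            return false; // not gzip, or truncated: report "no match" rather than throwing
        }
    }
}
```

Here the signature length would be 4 while the probe length is far larger, because the probe is measured in compressed bytes.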
-
getDecoder
HtsDecoder<?, ? extends HtsRecord> getDecoder(Bundle inputBundle, D decoderOptions)
Get an HtsDecoder to decode the provided inputs. The input bundle must contain resources of the type required by this codec. To find a codec appropriate for decoding a given resource, use an HtsCodecResolver obtained from an HtsCodecRegistry.
The framework will never call this method unless either ownsURI(IOPath), or canDecodeURI(IOPath) and canDecodeSignature(SignatureStream, String), return true for inputBundle.
- Parameters:
inputBundle - input to be decoded. To get a decoder for use with index queries that use HtsQuery methods, the bundle must contain an index resource.
decoderOptions - options for the decoder to use
- Returns:
an HtsDecoder that can decode the provided inputs
-
getEncoder
HtsEncoder<?, ? extends HtsRecord> getEncoder(Bundle outputBundle, E encoderOptions)
Get an HtsEncoder to encode to the provided outputs. The output bundle must contain resources of the type required by this codec. To find a codec appropriate for encoding a given resource, use an HtsCodecResolver obtained from an HtsCodecRegistry. The framework will never call this method unless either ownsURI(IOPath), or canDecodeURI(IOPath) returned true for outputBundle.
- Parameters:
outputBundle - target output for the encoder
encoderOptions - encoder options to use
- Returns:
an HtsEncoder suitable for writing to the provided outputs
-