Package org.biojavax.bio.seq.io
Class EMBLFormat
- java.lang.Object
- org.biojavax.bio.seq.io.RichSequenceFormat.BasicFormat
- org.biojavax.bio.seq.io.RichSequenceFormat.HeaderlessFormat
- org.biojavax.bio.seq.io.EMBLFormat
- All Implemented Interfaces:
- SequenceFormat, RichSequenceFormat
public class EMBLFormat extends RichSequenceFormat.HeaderlessFormat
Format reader for EMBL files. This version of EMBL format will generate and write RichSequence objects. Loosely based on code from the old, deprecated org.biojava.bio.seq.io.EmblLikeFormat object.
This format reads both Pre-87 and 87+ versions of EMBL, and writes both as well. By default, it writes the most recent version. To write an earlier one, specify the format by passing one of the constants defined in this class to writeSequence(Sequence, String, Namespace).
- Since:
- 1.5
- Author:
- Richard Holland, Jolyon Holdstock, Mark Schreiber
Nested Class Summary
- static class EMBLFormat.Terms: Implements some EMBL-specific terms.
Nested classes/interfaces inherited from interface org.biojavax.bio.seq.io.RichSequenceFormat
RichSequenceFormat.BasicFormat, RichSequenceFormat.HeaderlessFormat
Field Summary
- protected static java.lang.String ACCESSION_TAG
- protected static java.lang.String AUTHORS_TAG
- protected static java.lang.String COMMENT_TAG
- protected static java.lang.String CONSORTIUM_TAG
- protected static java.lang.String CONTIG_TAG
- protected static java.lang.String DATABASE_XREF_TAG
- protected static java.lang.String DATE_TAG
- protected static java.util.regex.Pattern dbxp
- protected static java.lang.String DEFINITION_TAG
- protected static java.lang.String DELIMITER_TAG
- protected static java.util.regex.Pattern dp
- static java.lang.String EMBL_FORMAT: The name of the current format
- static java.lang.String EMBL_PRE87_FORMAT: The name of the Pre-87 format
- protected static java.lang.String END_SEQUENCE_TAG
- protected static java.lang.String FEATURE_HEADER_TAG
- protected static java.lang.String FEATURE_TAG
- protected static java.util.regex.Pattern headerLine
- protected static java.lang.String KEYWORDS_TAG
- protected static java.lang.String LOCATOR_TAG
- protected static java.lang.String LOCUS_TAG
- protected static java.util.regex.Pattern lp
- protected static java.util.regex.Pattern lpPre87
- protected static java.lang.String ORGANELLE_TAG
- protected static java.lang.String ORGANISM_TAG
- protected static java.util.regex.Pattern readableFileNames
- protected static java.lang.String REFERENCE_POSITION_TAG
- protected static java.lang.String REFERENCE_TAG
- protected static java.lang.String REFERENCE_XREF_TAG
- protected static java.lang.String REMARK_TAG
- protected static java.util.regex.Pattern rpp
- protected static java.lang.String SOURCE_TAG
- protected static java.lang.String START_SEQUENCE_TAG
- protected static java.lang.String TITLE_TAG
- protected static java.lang.String TPA_TAG
- protected static java.lang.String VERSION_TAG
- protected static java.util.regex.Pattern vp
Constructor Summary
- EMBLFormat()
Method Summary
- boolean canRead(java.io.BufferedInputStream stream): Check to see if a given stream is in our format.
- boolean canRead(java.io.File file): Check to see if a given file is in our format.
- java.lang.String getDefaultFormat(): getDefaultFormat returns the String identifier for the default sub-format written by a SequenceFormat implementation.
- SymbolTokenization guessSymbolTokenization(java.io.BufferedInputStream stream): On the assumption that the stream is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it.
- SymbolTokenization guessSymbolTokenization(java.io.File file): On the assumption that the file is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it.
- boolean readRichSequence(java.io.BufferedReader reader, SymbolTokenization symParser, RichSeqIOListener rlistener, Namespace ns): Reads a sequence from the given buffered reader using the given tokenizer to parse sequence symbols.
- boolean readSequence(java.io.BufferedReader reader, SymbolTokenization symParser, SeqIOListener listener): Read a sequence and pass data on to a SeqIOListener.
- void writeSequence(Sequence seq, java.io.PrintStream os): writeSequence writes a sequence to the specified PrintStream, using the default format.
- void writeSequence(Sequence seq, java.lang.String format, java.io.PrintStream os): writeSequence writes a sequence to the specified PrintStream, using the specified format.
- void writeSequence(Sequence seq, java.lang.String format, Namespace ns): As per writeSequence(Sequence, Namespace), except that it also takes a format parameter.
- void writeSequence(Sequence seq, Namespace ns): Writes a sequence out to the output stream given by beginWriting() using the default format of the implementing class.
Methods inherited from class org.biojavax.bio.seq.io.RichSequenceFormat.HeaderlessFormat
beginWriting, finishWriting
-
Methods inherited from class org.biojavax.bio.seq.io.RichSequenceFormat.BasicFormat
getElideComments, getElideFeatures, getElideReferences, getElideSymbols, getLineWidth, getPrintStream, setElideComments, setElideFeatures, setElideReferences, setElideSymbols, setLineWidth, setPrintStream
Field Detail
-
EMBL_PRE87_FORMAT
public static final java.lang.String EMBL_PRE87_FORMAT
The name of the Pre-87 format.
- See Also:
- Constant Field Values
-
EMBL_FORMAT
public static final java.lang.String EMBL_FORMAT
The name of the current format.
- See Also:
- Constant Field Values
-
LOCUS_TAG
protected static final java.lang.String LOCUS_TAG
- See Also:
- Constant Field Values
-
ACCESSION_TAG
protected static final java.lang.String ACCESSION_TAG
- See Also:
- Constant Field Values
-
VERSION_TAG
protected static final java.lang.String VERSION_TAG
- See Also:
- Constant Field Values
-
DEFINITION_TAG
protected static final java.lang.String DEFINITION_TAG
- See Also:
- Constant Field Values
-
DATE_TAG
protected static final java.lang.String DATE_TAG
- See Also:
- Constant Field Values
-
DATABASE_XREF_TAG
protected static final java.lang.String DATABASE_XREF_TAG
- See Also:
- Constant Field Values
-
SOURCE_TAG
protected static final java.lang.String SOURCE_TAG
- See Also:
- Constant Field Values
-
ORGANISM_TAG
protected static final java.lang.String ORGANISM_TAG
- See Also:
- Constant Field Values
-
ORGANELLE_TAG
protected static final java.lang.String ORGANELLE_TAG
- See Also:
- Constant Field Values
-
REFERENCE_TAG
protected static final java.lang.String REFERENCE_TAG
- See Also:
- Constant Field Values
-
REFERENCE_POSITION_TAG
protected static final java.lang.String REFERENCE_POSITION_TAG
- See Also:
- Constant Field Values
-
REFERENCE_XREF_TAG
protected static final java.lang.String REFERENCE_XREF_TAG
- See Also:
- Constant Field Values
-
AUTHORS_TAG
protected static final java.lang.String AUTHORS_TAG
- See Also:
- Constant Field Values
-
CONSORTIUM_TAG
protected static final java.lang.String CONSORTIUM_TAG
- See Also:
- Constant Field Values
-
TITLE_TAG
protected static final java.lang.String TITLE_TAG
- See Also:
- Constant Field Values
-
LOCATOR_TAG
protected static final java.lang.String LOCATOR_TAG
- See Also:
- Constant Field Values
-
REMARK_TAG
protected static final java.lang.String REMARK_TAG
- See Also:
- Constant Field Values
-
KEYWORDS_TAG
protected static final java.lang.String KEYWORDS_TAG
- See Also:
- Constant Field Values
-
COMMENT_TAG
protected static final java.lang.String COMMENT_TAG
- See Also:
- Constant Field Values
-
FEATURE_HEADER_TAG
protected static final java.lang.String FEATURE_HEADER_TAG
- See Also:
- Constant Field Values
-
FEATURE_TAG
protected static final java.lang.String FEATURE_TAG
- See Also:
- Constant Field Values
-
CONTIG_TAG
protected static final java.lang.String CONTIG_TAG
- See Also:
- Constant Field Values
-
TPA_TAG
protected static final java.lang.String TPA_TAG
- See Also:
- Constant Field Values
-
START_SEQUENCE_TAG
protected static final java.lang.String START_SEQUENCE_TAG
- See Also:
- Constant Field Values
-
DELIMITER_TAG
protected static final java.lang.String DELIMITER_TAG
- See Also:
- Constant Field Values
-
END_SEQUENCE_TAG
protected static final java.lang.String END_SEQUENCE_TAG
- See Also:
- Constant Field Values
-
dp
protected static final java.util.regex.Pattern dp
-
lp
protected static final java.util.regex.Pattern lp
-
lpPre87
protected static final java.util.regex.Pattern lpPre87
-
vp
protected static final java.util.regex.Pattern vp
-
rpp
protected static final java.util.regex.Pattern rpp
-
dbxp
protected static final java.util.regex.Pattern dbxp
-
readableFileNames
protected static final java.util.regex.Pattern readableFileNames
-
headerLine
protected static final java.util.regex.Pattern headerLine
-
-
Method Detail
-
canRead
public boolean canRead(java.io.File file) throws java.io.IOException
Check to see if a given file is in our format. Some formats may be able to determine this by filename, whilst others may have to open the file and read it to see what format it is in. A file is in EMBL format if its name contains the word eem or edat, or the first line matches the EMBL format for the ID line.
- Specified by:
- canRead in interface RichSequenceFormat
- Overrides:
- canRead in class RichSequenceFormat.BasicFormat
- Parameters:
- file - the File to check.
- Returns:
- true if the file is readable by this format, false if not.
- Throws:
- java.io.IOException - in case the file is inaccessible.
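The two-step check described above (file name first, then the first line) can be sketched without BioJava on the classpath. This is a minimal illustration and not the library's implementation: the class name `EmblFileSniffer`, both regular expressions, and the choice of ISO-8859-1 decoding are assumptions. The name test copies the wording above literally, and the actual `readableFileNames` and `headerLine` patterns may differ.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.regex.Pattern;

// Hypothetical helper, not part of BioJava.
class EmblFileSniffer {
    // Assumption: a literal reading of "name contains the word eem or edat".
    static final Pattern NAME_HINT =
            Pattern.compile(".*(eem|edat).*", Pattern.CASE_INSENSITIVE);

    // Assumption: an ID line starts with "ID", whitespace, then some content.
    static final Pattern ID_LINE = Pattern.compile("^ID\\s+\\S.*");

    static boolean looksLikeEmbl(File file) throws IOException {
        // Step 1: accept on the file name alone, without opening the file.
        if (NAME_HINT.matcher(file.getName()).matches()) {
            return true;
        }
        // Step 2: otherwise open the file and test only its first line.
        // ISO-8859-1 maps every byte to a character, so decoding cannot fail.
        try (BufferedReader br =
                Files.newBufferedReader(file.toPath(), StandardCharsets.ISO_8859_1)) {
            String first = br.readLine();
            return first != null && ID_LINE.matcher(first).matches();
        }
    }
}
```

Because the name test runs first, a file whose name matches is accepted without being opened, so an `IOException` can only arise in the second step.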
-
guessSymbolTokenization
public SymbolTokenization guessSymbolTokenization(java.io.File file) throws java.io.IOException
On the assumption that the file is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it. For formats that only accept one tokenization, just return it without checking the file. For formats that accept multiple tokenizations, it's up to you how you do it. Always returns a DNA tokenizer.
- Specified by:
- guessSymbolTokenization in interface RichSequenceFormat
- Overrides:
- guessSymbolTokenization in class RichSequenceFormat.BasicFormat
- Parameters:
- file - the File object to guess the format of.
- Returns:
- a SymbolTokenization to read the file with.
- Throws:
- java.io.IOException - if the file is unrecognisable or inaccessible.
-
canRead
public boolean canRead(java.io.BufferedInputStream stream) throws java.io.IOException
Check to see if a given stream is in our format. A stream is in EMBL format if its first line matches the EMBL format for the ID line.
- Parameters:
- stream - the BufferedInputStream to check.
- Returns:
- true if the stream is readable by this format, false if not.
- Throws:
- java.io.IOException - in case the stream is inaccessible.
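Testing the first line of a stream is only useful if the stream can still be parsed from the beginning afterwards. One common way to achieve that with a `BufferedInputStream` is `mark`/`reset`, sketched below. Whether EMBLFormat does exactly this is not stated on this page; the class name `StreamPeek`, the 2000-byte limit, and the simple `"ID "` prefix test are assumptions made for illustration.

```java
import java.io.BufferedInputStream;
import java.io.IOException;

// Hypothetical helper, not part of BioJava.
class StreamPeek {
    // Assumption: an ID line fits comfortably within this many bytes.
    static final int PEEK_LIMIT = 2000;

    // Returns the first line of the stream and rewinds it to where it started.
    static String peekFirstLine(BufferedInputStream in) throws IOException {
        in.mark(PEEK_LIMIT);
        StringBuilder sb = new StringBuilder();
        try {
            // Never read more than PEEK_LIMIT bytes, so the mark stays valid.
            for (int i = 0; i < PEEK_LIMIT; i++) {
                int b = in.read();
                if (b == -1 || b == '\n') {
                    break;
                }
                if (b != '\r') {
                    sb.append((char) b); // bytes treated as ASCII/Latin-1
                }
            }
        } finally {
            in.reset();
        }
        return sb.toString();
    }

    // Assumption: a line beginning with "ID " counts as an EMBL ID line.
    static boolean startsWithIdLine(BufferedInputStream in) throws IOException {
        return peekFirstLine(in).startsWith("ID ");
    }
}
```

After `startsWithIdLine` returns, the next `read()` on the same stream yields the first byte again, so the stream can be handed to a parser unchanged.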
-
guessSymbolTokenization
public SymbolTokenization guessSymbolTokenization(java.io.BufferedInputStream stream) throws java.io.IOException
On the assumption that the stream is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it. For formats that only accept one tokenization, just return it without checking the stream. For formats that accept multiple tokenizations, it's up to you how you do it. Always returns a DNA tokenizer.
- Parameters:
- stream - the BufferedInputStream object to guess the format of.
- Returns:
- a SymbolTokenization to read the stream with.
- Throws:
- java.io.IOException - if the stream is unrecognisable or inaccessible.
-
readSequence
public boolean readSequence(java.io.BufferedReader reader, SymbolTokenization symParser, SeqIOListener listener) throws IllegalSymbolException, java.io.IOException, ParseException
Read a sequence and pass data on to a SeqIOListener.
- Parameters:
- reader - The stream of data to parse.
- symParser - A SymbolParser defining a mapping from character data to Symbols.
- listener - A listener to notify when data is extracted from the stream.
- Returns:
- a boolean indicating whether or not the stream contains any more sequences.
- Throws:
- IllegalSymbolException - if it is not possible to translate character data from the stream into valid BioJava symbols.
- java.io.IOException - if an error occurs while reading from the stream.
- ParseException
-
readRichSequence
public boolean readRichSequence(java.io.BufferedReader reader, SymbolTokenization symParser, RichSeqIOListener rlistener, Namespace ns) throws IllegalSymbolException, java.io.IOException, ParseException
Reads a sequence from the given buffered reader using the given tokenizer to parse sequence symbols. Events are passed to the listener, and the namespace used for sequences read is the one given. If the namespace is null, then the default namespace for the parser is used, which may depend on individual implementations of this interface.
- Parameters:
- reader - the input source
- symParser - the tokenizer which understands the sequence being read
- rlistener - the listener to send sequence events to
- ns - the namespace to read sequences into.
- Returns:
- true if there is more to read after this, false otherwise.
- Throws:
- IllegalSymbolException - if the tokenizer couldn't understand one of the sequence symbols in the file.
- java.io.IOException - if there was a read error.
- ParseException
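The boolean return value is what lets a caller loop over a file holding several entries: call the read method repeatedly until it reports that nothing is left. The sketch below mimics that contract with a stand-in reader rather than calling BioJava. The class `RecordLoop` and both of its methods are hypothetical, and the assumption that each entry ends with a line starting with `//` is a simplification of the real parser.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the "returns true if there is more to read" contract.
class RecordLoop {
    // Reads one entry's lines up to the "//" terminator (assumed record end).
    // Returns true if non-whitespace input remains after this entry.
    static boolean readRecord(BufferedReader reader, List<String> lines)
            throws IOException {
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.startsWith("//")) {
                break;
            }
            lines.add(line);
        }
        // Look ahead one character at a time, skipping trailing whitespace.
        while (true) {
            reader.mark(1);
            int c = reader.read();
            if (c == -1) {
                return false;       // end of input: nothing more to read
            }
            if (!Character.isWhitespace(c)) {
                reader.reset();     // put the character back for the next entry
                return true;
            }
        }
    }

    // Typical caller loop: keep reading while the previous call reported more.
    static int countRecords(BufferedReader reader) throws IOException {
        int count = 0;
        boolean more = true;
        while (more) {
            List<String> lines = new ArrayList<>();
            more = readRecord(reader, lines);
            if (!lines.isEmpty()) {
                count++;
            }
        }
        return count;
    }
}
```

A caller of `readRichSequence` would follow the same loop shape as `countRecords`, passing its tokenizer, listener and namespace on each iteration.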
-
writeSequence
public void writeSequence(Sequence seq, java.io.PrintStream os) throws java.io.IOException
writeSequence writes a sequence to the specified PrintStream, using the default format.
- Parameters:
- seq - the sequence to write out.
- os - the printstream to write to.
- Throws:
- java.io.IOException
-
writeSequence
public void writeSequence(Sequence seq, java.lang.String format, java.io.PrintStream os) throws java.io.IOException
writeSequence writes a sequence to the specified PrintStream, using the specified format.
- Parameters:
- seq - a Sequence to write out.
- format - a String indicating which sub-format of those available from a particular SequenceFormat implementation to use when writing.
- os - a PrintStream object.
- Throws:
- java.io.IOException - if an error occurs.
-
writeSequence
public void writeSequence(Sequence seq, Namespace ns) throws java.io.IOException
Writes a sequence out to the output stream given by beginWriting() using the default format of the implementing class. If a namespace is given, sequences will be written with that namespace; otherwise they will be written with the default namespace of the implementing class (which is usually the namespace of the sequence itself). If you pass this method a sequence which is not a RichSequence, it will attempt to convert it using RichSequence.Tools.enrich(). This does not guarantee a perfect conversion, so it is better to use RichSequences to start with. Namespace is ignored as EMBL has no concept of it.
- Parameters:
- seq - the sequence to write
- ns - the namespace to write it with
- Throws:
- java.io.IOException - in case it couldn't write something
-
writeSequence
public void writeSequence(Sequence seq, java.lang.String format, Namespace ns) throws java.io.IOException
As per writeSequence(Sequence, Namespace), except that it also takes a format parameter. This can be any of the formats defined as constants in this class.
- Parameters:
- seq - see writeSequence(Sequence, Namespace)
- format - the format to use.
- ns - see writeSequence(Sequence, Namespace)
- Throws:
- java.io.IOException - see writeSequence(Sequence, Namespace)
-
getDefaultFormat
public java.lang.String getDefaultFormat()
getDefaultFormat returns the String identifier for the default sub-format written by a SequenceFormat implementation.
- Returns:
- a String.
-
-