Class WarcRecordUtils

  • All Implemented Interfaces:
    org.archive.format.ArchiveFileConstants, org.archive.format.warc.WARCConstants

    public final class WarcRecordUtils
    extends Object
    implements org.archive.format.warc.WARCConstants
    Utilities for working with WARCRecords (from archive.org APIs).
    • Nested Class Summary

      • Nested classes/interfaces inherited from interface org.archive.format.warc.WARCConstants

        org.archive.format.warc.WARCConstants.WARCRecordType
    • Field Summary

      • Fields inherited from interface org.archive.format.ArchiveFileConstants

        ABSOLUTE_OFFSET_KEY, CDX, CDX_FILE, CDX_LINE_BUFFER_SIZE, CRLF, DATE_FIELD_KEY, DEFAULT_DIGEST_METHOD, DOT_COMPRESSED_FILE_EXTENSION, DUMP, GZIP_DUMP, HEADER, INVALID_SUFFIX, LENGTH_FIELD_KEY, MIMETYPE_FIELD_KEY, NOHEAD, OCCUPIED_SUFFIX, ORIGIN_FIELD_KEY, READER_IDENTIFIER_FIELD_KEY, RECORD_IDENTIFIER_FIELD_KEY, SINGLE_SPACE, TYPE_FIELD_KEY, URL_FIELD_KEY, VERSION_FIELD_KEY
      • Fields inherited from interface org.archive.format.warc.WARCConstants

        COLON_SPACE, COMPRESSED_WARC_FILE_EXTENSION, CONTENT_DESCRIPTION, CONTENT_LENGTH, CONTENT_TYPE, DEFAULT_ENCODING, DEFAULT_MAX_WARC_FILE_SIZE, DOT_COMPRESSED_WARC_FILE_EXTENSION, DOT_WARC_FILE_EXTENSION, FTP_CONTROL_CONVERSATION_MIMETYPE, HEADER_FIELD_SEPARATOR, HEADER_KEY_BLOCK_DIGEST, HEADER_KEY_CONCURRENT_TO, HEADER_KEY_DATE, HEADER_KEY_ETAG, HEADER_KEY_FILENAME, HEADER_KEY_ID, HEADER_KEY_IP, HEADER_KEY_LAST_MODIFIED, HEADER_KEY_PAYLOAD_DIGEST, HEADER_KEY_PROFILE, HEADER_KEY_REFERS_TO, HEADER_KEY_REFERS_TO_DATE, HEADER_KEY_REFERS_TO_FILE_OFFSET, HEADER_KEY_REFERS_TO_FILENAME, HEADER_KEY_REFERS_TO_TARGET_URI, HEADER_KEY_TRUNCATED, HEADER_KEY_TYPE, HEADER_KEY_URI, HEADER_LINE_ENCODING, HTTP_REQUEST_MIMETYPE, HTTP_RESPONSE_MIMETYPE, HTTP_RESPONSE_MIMETYPE_NS, MAX_LINE_LENGTH, MAX_WARC_HEADER_LINE_LENGTH, NAMED_FIELD_CHECKSUM_LABEL, NAMED_FIELD_DESCRIPTION, NAMED_FIELD_FILEDESC, NAMED_FIELD_IP_LABEL, NAMED_FIELD_RELATED_LABEL, NAMED_FIELD_TRUNCATED, NAMED_FIELD_TRUNCATED_VALUE_HEAD, NAMED_FIELD_TRUNCATED_VALUE_LENGTH, NAMED_FIELD_TRUNCATED_VALUE_TIME, NAMED_FIELD_TRUNCATED_VALUE_UNSPECIFIED, NAMED_FIELD_WARCFILENAME, PLACEHOLDER_RECORD_LENGTH_STRING, PROFILE_REVISIT_IDENTICAL_DIGEST, PROFILE_REVISIT_NOT_MODIFIED, TRUNCATED_VALUE_UNSPECIFIED, TYPE, WARC_FIELDS_TYPE, WARC_FILE_EXTENSION, WARC_HEADER_ENCODING, WARC_ID, WARC_MAGIC, WARC_VERSION, WSP
    • Method Detail

      • fromBytes

        public static org.archive.io.warc.WARCRecord fromBytes​(byte[] bytes)
                                                        throws IOException
        Converts raw bytes into a WARCRecord.
        Parameters:
        bytes - raw bytes
        Returns:
        parsed WARCRecord
        Throws:
        IOException - if there is an issue
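        To illustrate what the raw bytes of a record look like, here is a minimal sketch that reads only the WARC header block: a version line, then "Name: value" fields, then a blank line. The class WarcHeaderSketch and the method parseHeader are hypothetical names for this example. The sketch uses only the JDK and is not how fromBytes is implemented.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; not part of WarcRecordUtils or the archive.org APIs.
public class WarcHeaderSketch {

    /** Parses the named fields of a WARC header block into an ordered map. */
    public static Map<String, String> parseHeader(byte[] bytes) {
        // Only the header prefix is used, so undecodable payload bytes do no harm here.
        String text = new String(bytes, StandardCharsets.UTF_8);
        int end = text.indexOf("\r\n\r\n"); // blank line that ends the header block
        String header = end >= 0 ? text.substring(0, end) : text;
        String[] lines = header.split("\r\n");
        if (!lines[0].startsWith("WARC/")) {
            throw new IllegalArgumentException("Missing WARC version line");
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 1; i < lines.length; i++) {
            int colon = lines[i].indexOf(':'); // split at the first colon only
            if (colon > 0) {
                fields.put(lines[i].substring(0, colon).trim(),
                           lines[i].substring(colon + 1).trim());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String record = "WARC/1.0\r\n"
            + "WARC-Type: response\r\n"
            + "WARC-Target-URI: http://example.com/\r\n"
            + "Content-Type: application/http; msgtype=response\r\n"
            + "Content-Length: 0\r\n"
            + "\r\n";
        Map<String, String> fields = parseHeader(record.getBytes(StandardCharsets.UTF_8));
        System.out.println(fields.get("WARC-Type")); // prints response
    }
}
```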
      • toBytes

        public static byte[] toBytes​(org.archive.io.warc.WARCRecord record)
                              throws IOException
        Converts WARC record into raw bytes.
        Parameters:
        record - contents of the WARC response record
        Returns:
        raw contents
        Throws:
        IOException - if there is an issue
      • getWarcResponseMimeType

        public static String getWarcResponseMimeType​(byte[] contents)
        Extracts the MIME type of a WARC response record, that is, a record whose "WARC-Type" is "response". Note that this MIME type is different from the "Content-Type" in the WARC header.
        Parameters:
        contents - raw contents of the WARC response record
        Returns:
        MIME type
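        The distinction above can be sketched as follows. This example assumes the contents begin with an HTTP status line and header lines, and it reads the MIME type from the HTTP "Content-Type" header. The class MimeTypeSketch and the method httpMimeType are hypothetical names, and the logic is an illustration rather than the implementation of getWarcResponseMimeType.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical illustration; not part of WarcRecordUtils.
public class MimeTypeSketch {

    /** Returns the MIME type from the HTTP Content-Type header, or null if absent. */
    public static String httpMimeType(byte[] contents) {
        // ISO-8859-1 maps each byte to one char, so header bytes decode losslessly.
        String text = new String(contents, StandardCharsets.ISO_8859_1);
        int headerEnd = text.indexOf("\r\n\r\n"); // blank line that ends the HTTP headers
        String headers = headerEnd >= 0 ? text.substring(0, headerEnd) : text;
        for (String line : headers.split("\r\n")) {
            int colon = line.indexOf(':');
            if (colon > 0
                    && line.substring(0, colon).trim().equalsIgnoreCase("Content-Type")) {
                String value = line.substring(colon + 1).trim();
                int semi = value.indexOf(';'); // drop parameters such as charset
                String mime = semi >= 0 ? value.substring(0, semi) : value;
                return mime.trim().toLowerCase(Locale.ROOT);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String http = "HTTP/1.1 200 OK\r\n"
            + "Content-Type: text/html; charset=UTF-8\r\n"
            + "Content-Length: 5\r\n"
            + "\r\n"
            + "hello";
        System.out.println(httpMimeType(http.getBytes(StandardCharsets.ISO_8859_1))); // prints text/html
    }
}
```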
      • getContent

        public static byte[] getContent​(org.archive.io.warc.WARCRecord record)
                                 throws IOException
        Extracts raw contents from a WARCRecord (including HTTP headers).
        Parameters:
        record - the WARCRecord
        Returns:
        raw contents
        Throws:
        IOException - if there is an issue
      • getBodyContent

        public static byte[] getBodyContent​(org.archive.io.warc.WARCRecord record)
                                     throws IOException
        Extracts contents of the body from a WARCRecord. Excludes HTTP headers.
        Parameters:
        record - the WARCRecord
        Returns:
        contents of the body
        Throws:
        IOException - if there is an issue
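        The difference between getContent (HTTP headers included) and getBodyContent (HTTP headers excluded) can be illustrated with a small sketch that splits raw content at the first blank line (CRLF CRLF). The class BodySplitSketch and its methods are hypothetical names for this example. The sketch assumes CRLF line endings and is not the implementation of getBodyContent.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration; not part of WarcRecordUtils.
public class BodySplitSketch {

    /** Returns the index of the first byte after CRLF CRLF, or -1 if not found. */
    public static int bodyStart(byte[] content) {
        for (int i = 0; i + 3 < content.length; i++) {
            if (content[i] == '\r' && content[i + 1] == '\n'
                    && content[i + 2] == '\r' && content[i + 3] == '\n') {
                return i + 4;
            }
        }
        return -1;
    }

    /** Returns the bytes that follow the HTTP headers. */
    public static byte[] bodyOnly(byte[] content) {
        int start = bodyStart(content);
        // Assumption for this sketch: with no header terminator, treat everything as body.
        return start < 0 ? content.clone()
                         : Arrays.copyOfRange(content, start, content.length);
    }

    public static void main(String[] args) {
        byte[] content = ("HTTP/1.1 200 OK\r\n"
            + "Content-Type: text/plain\r\n"
            + "\r\n"
            + "hello").getBytes(StandardCharsets.ISO_8859_1);
        byte[] body = bodyOnly(content);
        System.out.println(new String(body, StandardCharsets.ISO_8859_1)); // prints hello
    }
}
```

        Working on bytes rather than a decoded string keeps binary bodies, such as images, intact.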