Package io.archivesunleashed.data
Class WarcRecordUtils
java.lang.Object
- io.archivesunleashed.data.WarcRecordUtils
All Implemented Interfaces:
org.archive.format.ArchiveFileConstants, org.archive.format.warc.WARCConstants
public final class WarcRecordUtils extends Object implements org.archive.format.warc.WARCConstants
Utilities for working with WARCRecords (from archive.org APIs).
-
-
Field Summary
-
Fields inherited from interface org.archive.format.ArchiveFileConstants
ABSOLUTE_OFFSET_KEY, CDX, CDX_FILE, CDX_LINE_BUFFER_SIZE, CRLF, DATE_FIELD_KEY, DEFAULT_DIGEST_METHOD, DOT_COMPRESSED_FILE_EXTENSION, DUMP, GZIP_DUMP, HEADER, INVALID_SUFFIX, LENGTH_FIELD_KEY, MIMETYPE_FIELD_KEY, NOHEAD, OCCUPIED_SUFFIX, ORIGIN_FIELD_KEY, READER_IDENTIFIER_FIELD_KEY, RECORD_IDENTIFIER_FIELD_KEY, SINGLE_SPACE, TYPE_FIELD_KEY, URL_FIELD_KEY, VERSION_FIELD_KEY
-
Fields inherited from interface org.archive.format.warc.WARCConstants
COLON_SPACE, COMPRESSED_WARC_FILE_EXTENSION, CONTENT_DESCRIPTION, CONTENT_LENGTH, CONTENT_TYPE, DEFAULT_ENCODING, DEFAULT_MAX_WARC_FILE_SIZE, DOT_COMPRESSED_WARC_FILE_EXTENSION, DOT_WARC_FILE_EXTENSION, FTP_CONTROL_CONVERSATION_MIMETYPE, HEADER_FIELD_SEPARATOR, HEADER_KEY_BLOCK_DIGEST, HEADER_KEY_CONCURRENT_TO, HEADER_KEY_DATE, HEADER_KEY_ETAG, HEADER_KEY_FILENAME, HEADER_KEY_ID, HEADER_KEY_IP, HEADER_KEY_LAST_MODIFIED, HEADER_KEY_PAYLOAD_DIGEST, HEADER_KEY_PROFILE, HEADER_KEY_REFERS_TO, HEADER_KEY_REFERS_TO_DATE, HEADER_KEY_REFERS_TO_FILE_OFFSET, HEADER_KEY_REFERS_TO_FILENAME, HEADER_KEY_REFERS_TO_TARGET_URI, HEADER_KEY_TRUNCATED, HEADER_KEY_TYPE, HEADER_KEY_URI, HEADER_LINE_ENCODING, HTTP_REQUEST_MIMETYPE, HTTP_RESPONSE_MIMETYPE, HTTP_RESPONSE_MIMETYPE_NS, MAX_LINE_LENGTH, MAX_WARC_HEADER_LINE_LENGTH, NAMED_FIELD_CHECKSUM_LABEL, NAMED_FIELD_DESCRIPTION, NAMED_FIELD_FILEDESC, NAMED_FIELD_IP_LABEL, NAMED_FIELD_RELATED_LABEL, NAMED_FIELD_TRUNCATED, NAMED_FIELD_TRUNCATED_VALUE_HEAD, NAMED_FIELD_TRUNCATED_VALUE_LENGTH, NAMED_FIELD_TRUNCATED_VALUE_TIME, NAMED_FIELD_TRUNCATED_VALUE_UNSPECIFIED, NAMED_FIELD_WARCFILENAME, PLACEHOLDER_RECORD_LENGTH_STRING, PROFILE_REVISIT_IDENTICAL_DIGEST, PROFILE_REVISIT_NOT_MODIFIED, TRUNCATED_VALUE_UNSPECIFIED, TYPE, WARC_FIELDS_TYPE, WARC_FILE_EXTENSION, WARC_HEADER_ENCODING, WARC_ID, WARC_MAGIC, WARC_VERSION, WSP
-
-
Method Summary
Modifier and Type, Method, Description
- static org.archive.io.warc.WARCRecord fromBytes(byte[] bytes): Converts raw bytes into a WARCRecord.
- static byte[] getBodyContent(org.archive.io.warc.WARCRecord record): Extracts contents of the body from a WARCRecord.
- static byte[] getContent(org.archive.io.warc.WARCRecord record): Extracts raw contents from a WARCRecord (including HTTP headers).
- static String getWarcResponseMimeType(byte[] contents): Extracts the MIME type of WARC response records.
- static byte[] toBytes(org.archive.io.warc.WARCRecord record): Converts WARC record into raw bytes.
-
-
-
Method Detail
-
fromBytes
public static org.archive.io.warc.WARCRecord fromBytes(byte[] bytes) throws IOException
Converts raw bytes into a WARCRecord.
- Parameters:
bytes - raw bytes
- Returns:
parsed WARCRecord
- Throws:
IOException - if an I/O error occurs
-
toBytes
public static byte[] toBytes(org.archive.io.warc.WARCRecord record) throws IOException
Converts WARC record into raw bytes.
- Parameters:
record - contents of the WARC response record
- Returns:
raw contents
- Throws:
IOException - if an I/O error occurs
-
getWarcResponseMimeType
public static String getWarcResponseMimeType(byte[] contents)
Extracts the MIME type of WARC response records, that is, records whose "WARC-Type" is "response". Note that this is different from the "Content-Type" in the WARC header.
- Parameters:
contents - raw contents of the WARC response record
- Returns:
MIME type
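The distinction above can be illustrated with a small standalone sketch. It assumes that the MIME type of a response record comes from the "Content-Type" header of the captured HTTP response inside the record's contents, which the description implies but does not state. The class `HttpMimeTypeSketch` and the method `mimeTypeOf` are hypothetical names used for illustration only; this is not the implementation of getWarcResponseMimeType, and it uses only the Java standard library.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical helper for illustration; not part of WarcRecordUtils. */
final class HttpMimeTypeSketch {
  private HttpMimeTypeSketch() {}

  /**
   * Returns the lowercased MIME type found in the HTTP Content-Type header
   * of the given raw contents, or null if no such header is present.
   */
  static String mimeTypeOf(byte[] contents) {
    // ISO-8859-1 maps every byte to exactly one char, so nothing is lost.
    String text = new String(contents, StandardCharsets.ISO_8859_1);

    // The HTTP header section ends at the first blank line (CRLF CRLF).
    int end = text.indexOf("\r\n\r\n");
    String headers = end >= 0 ? text.substring(0, end) : text;

    for (String line : headers.split("\r\n")) {
      int colon = line.indexOf(':');
      if (colon <= 0) {
        continue; // status line or malformed header line
      }
      String name = line.substring(0, colon).trim();
      if (name.equalsIgnoreCase("Content-Type")) {
        String value = line.substring(colon + 1).trim();
        // Drop parameters such as "; charset=UTF-8".
        int semi = value.indexOf(';');
        if (semi >= 0) {
          value = value.substring(0, semi).trim();
        }
        return value.toLowerCase(Locale.ROOT);
      }
    }
    return null;
  }
}
```

Only the header section is scanned, so a "Content-Type:" string that happens to appear in the response body is ignored.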
-
getContent
public static byte[] getContent(org.archive.io.warc.WARCRecord record) throws IOException
Extracts raw contents from a WARCRecord (including HTTP headers).
- Parameters:
record - the WARCRecord
- Returns:
raw contents
- Throws:
IOException - if an I/O error occurs
-
getBodyContent
public static byte[] getBodyContent(org.archive.io.warc.WARCRecord record) throws IOException
Extracts contents of the body from a WARCRecord. Excludes HTTP headers.
- Parameters:
record - the WARCRecord
- Returns:
contents of the body
- Throws:
IOException - if an I/O error occurs
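The difference between getContent (HTTP headers included) and getBodyContent (HTTP headers excluded) can be pictured as a split at the blank line that ends the HTTP header section. The following is a minimal standalone sketch of that idea, not the library's implementation: the class `HttpBodySketch` and the method `bodyOf` are hypothetical names, and returning the input unchanged when no separator is found is an assumption made for this example.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of WarcRecordUtils. */
final class HttpBodySketch {
  private HttpBodySketch() {}

  /**
   * Returns the bytes that follow the first CRLF CRLF sequence, i.e. the
   * body without HTTP headers. If no separator is found, a copy of the
   * whole input is returned (an assumption made for this sketch).
   */
  static byte[] bodyOf(byte[] contents) {
    for (int i = 0; i + 3 < contents.length; i++) {
      boolean separator = contents[i] == '\r'
          && contents[i + 1] == '\n'
          && contents[i + 2] == '\r'
          && contents[i + 3] == '\n';
      if (separator) {
        // Skip the four separator bytes and keep the rest.
        return Arrays.copyOfRange(contents, i + 4, contents.length);
      }
    }
    return contents.clone();
  }
}
```

Working on bytes rather than on a decoded string keeps binary bodies, such as images, intact.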
-
-