Filters detected content (regex).
a list of regular expressions
      
    
      Filters detected dates.
a list of dates
      
    
      Filters detected domains (regex).
a list of URLs for the source domains
      
    
      Filters detected HTTP status codes.
a list of HTTP status codes
      
    
      Filters detected language.
a set of ISO 639-2 codes
      
    
      Filters ArchiveRecord MIME types (as reported by the web server).
a list of MIME types
      
    
      Filters detected MIME types (as detected by Tika).
a list of MIME types
      
    
      Filters detected URL patterns (regex).
a list of regular expressions
      
    
      Filters detected URLs.
a list of urls
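
The filters in this group can be pictured with a minimal sketch on plain Scala collections. It assumes they drop matching records (the mirror image of the "Removes all ... but selected" group), and that a record matches when a pattern is found anywhere in its content. `Record`, `discardContent`, `discardHttpStatus` and `sample` are hypothetical names for illustration, not the toolkit's own API.

```scala
import scala.util.matching.Regex

object DiscardSketch {
  // Hypothetical stand-in for an archive record; not the toolkit's own type.
  final case class Record(url: String, httpStatus: String, content: String)

  // Drop every record whose content contains a match for at least one pattern.
  // (Assumption: partial matches count; the real filter may require a full match.)
  def discardContent(records: Seq[Record], patterns: Set[Regex]): Seq[Record] =
    records.filterNot(r => patterns.exists(_.findFirstIn(r.content).isDefined))

  // Drop every record whose HTTP status code is in the given set.
  def discardHttpStatus(records: Seq[Record], codes: Set[String]): Seq[Record] =
    records.filterNot(r => codes.contains(r.httpStatus))

  // Small in-memory sample used for illustration only.
  val sample: Seq[Record] = Seq(
    Record("http://example.com/", "200", "Welcome to the archive"),
    Record("http://example.com/missing", "404", "Not found"),
    Record("http://example.org/ads", "200", "Buy now! Limited offer")
  )
}
```

For example, `DiscardSketch.discardHttpStatus(DiscardSketch.sample, Set("404"))` leaves the two records with status 200.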
      
    
      Removes all content that does not match the given regular expressions.
a list of regular expressions to keep
      
    
      Removes all data that does not match the selected dates.
a list of dates
the selected DateComponent enum value
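
How a date filter with a selectable component might work is sketched below. It assumes crawl dates are strings beginning with `yyyyMMdd` and that the enum offers the components `YYYY`, `MM`, `DD`, `YYYYMM` and `YYYYMMDD`. The `DateComponent` hierarchy, `extract` and `keepDate` here are hypothetical stand-ins, not the toolkit's definitions.

```scala
object DateFilterSketch {
  // Hypothetical stand-in for the DateComponent enum mentioned above.
  sealed trait DateComponent
  case object YYYY extends DateComponent
  case object MM extends DateComponent
  case object DD extends DateComponent
  case object YYYYMM extends DateComponent
  case object YYYYMMDD extends DateComponent

  // Extract the selected component from a crawl date that starts with yyyyMMdd.
  // Dates shorter than eight characters are not handled in this sketch.
  def extract(crawlDate: String, component: DateComponent): String = component match {
    case YYYY     => crawlDate.substring(0, 4)
    case MM       => crawlDate.substring(4, 6)
    case DD       => crawlDate.substring(6, 8)
    case YYYYMM   => crawlDate.substring(0, 6)
    case YYYYMMDD => crawlDate.substring(0, 8)
  }

  // Keep only the crawl dates whose selected component appears in `dates`.
  def keepDate(crawlDates: Seq[String], dates: List[String], component: DateComponent): Seq[String] =
    crawlDates.filter(d => dates.contains(extract(d, component)))
}
```

With this sketch, filtering on `List("2008")` with the `YYYY` component keeps every crawl date from 2008 regardless of month or day.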
      
    
      Removes all data except that from the selected source domains.
a list of URLs for the source domains
      
    
      Removes all data that does not have one of the selected HTTP status codes.
a list of HTTP status codes
      
    
      Removes all data except images.
      
    
      Removes all data not in the selected languages.
a set of ISO 639-2 codes
      
    
      Removes all data except the specified MIME types.
a list of MIME types
      
    
      Removes all data except the selected MIME types, as detected by Tika.
a list of MIME types
      
    
      Removes all data except URLs matching the selected patterns.
a list of regular expressions
      
    
      Removes all data except the selected exact URLs.
a list of URLs to keep
      
    
      Removes all non-HTML data (images, executables, etc.), keeping only HTML text.
      
    
      Extracts a webgraph with columns for crawl date, source url, destination url, and anchor text.
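
The shape of such a webgraph can be sketched with plain collections: one row per hyperlink, carrying the four columns named above. The sketch assumes the outgoing links of each page have already been extracted. `Page`, `Link`, `Edge` and `webgraph` are hypothetical names, and the real method presumably returns a Spark structure instead of a `Seq`.

```scala
object WebgraphSketch {
  // Hypothetical input: a crawled page with its outgoing links already extracted.
  final case class Link(destination: String, anchorText: String)
  final case class Page(crawlDate: String, url: String, links: Seq[Link])

  // One row per link: crawl date, source URL, destination URL, anchor text.
  final case class Edge(crawlDate: String, src: String, dest: String, anchor: String)

  // Flatten pages into edges; pages without links contribute no rows.
  def webgraph(pages: Seq[Page]): Seq[Edge] =
    for {
      page <- pages
      link <- page.links
    } yield Edge(page.crawlDate, page.url, link.destination, link.anchorText)
}
```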
      
    
      Extracts webpages with columns for crawl date, url, MIME type, and content.
      
    
A wrapper class around RDD that allows RDDs of type ARCRecord and WARCRecord to be queried via a fluent API.
To load such an RDD, please see RecordLoader.
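
The wrapper idea can be illustrated with a minimal, dependency-free sketch: an implicit class adds chainable filter methods to a collection of records. A plain `Seq` stands in for the RDD here, and `Record`, `RecordOps` and the three method names are hypothetical, chosen only to echo the filters described above.

```scala
object FluentSketch {
  // Hypothetical stand-in for an archive record; not the toolkit's own type.
  final case class Record(url: String, domain: String, httpStatus: String, mimeType: String)

  // Implicit wrapper that adds chainable filters to any Seq[Record].
  implicit class RecordOps(records: Seq[Record]) {
    def keepHttpStatus(codes: Set[String]): Seq[Record] =
      records.filter(r => codes.contains(r.httpStatus))

    def keepMimeTypes(types: Set[String]): Seq[Record] =
      records.filter(r => types.contains(r.mimeType))

    def discardDomains(domains: Set[String]): Seq[Record] =
      records.filterNot(r => domains.contains(r.domain))
  }

  // Small in-memory sample used for illustration only.
  val sample: Seq[Record] = Seq(
    Record("http://example.com/index.html", "example.com", "200", "text/html"),
    Record("http://example.com/logo.png", "example.com", "200", "image/png"),
    Record("http://ads.example/banner.html", "ads.example", "200", "text/html"),
    Record("http://example.com/old.html", "example.com", "404", "text/html")
  )

  // Each call returns a Seq[Record], so the implicit wrapper applies again
  // and the filters can be chained.
  val htmlPages: Seq[Record] =
    sample
      .keepHttpStatus(Set("200"))
      .keepMimeTypes(Set("text/html"))
      .discardDomains(Set("ads.example"))
}
```

Because every filter returns the same collection type it was called on, calls compose in any order, which is what makes the API read fluently.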