Filters MimeTypes from RDDs.

Removes all content that does not match the given regular expressions.
a list of regular expressions to keep
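
The semantics of this filter can be sketched on a plain Scala collection instead of a Spark RDD. The `Page` case class, the `keepContent` name, and the sample data below are assumptions made for illustration, not the library's confirmed API. The sketch also assumes a record is kept when at least one expression matches anywhere in its content; the actual implementation may require a full match.

```scala
import scala.util.matching.Regex

// Hypothetical stand-in for an archive record; not the library's record type.
final case class Page(url: String, content: String)

object ContentFilterSketch {
  // Keep a page when at least one pattern matches somewhere in its content.
  def keepContent(pages: Seq[Page], patterns: Set[Regex]): Seq[Page] =
    pages.filter(p => patterns.exists(_.findFirstIn(p.content).isDefined))

  def main(args: Array[String]): Unit = {
    val pages = Seq(
      Page("http://example.org/a", "Spark makes archive analysis easier"),
      Page("http://example.org/b", "Nothing relevant here")
    )
    // Only the first page contains a match for one of the patterns.
    val kept = keepContent(pages, Set("(?i)spark".r, "archive[sd]?".r))
    kept.foreach(p => println(p.url))
  }
}
```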
      
    
Removes all data that does not have a selected date.
a list of dates to keep
the selected DateComponent enum value
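
To illustrate how a date component could select which part of a crawl date is compared, here is a minimal sketch. It assumes crawl dates are strings in `yyyyMMdd` form. The `Component` values, `DatedRecord`, and `keepDate` are hypothetical names for this example; the library's `DateComponent` enum may define different members.

```scala
// Hypothetical date components; the library's DateComponent enum may differ.
sealed trait Component
case object Year extends Component
case object Month extends Component
case object YearMonth extends Component

// Hypothetical record with a crawl date in yyyyMMdd form.
final case class DatedRecord(url: String, crawlDate: String)

object DateFilterSketch {
  // Extract the part of a yyyyMMdd date that the component selects.
  def extract(date: String, component: Component): String = component match {
    case Year      => date.substring(0, 4)
    case Month     => date.substring(4, 6)
    case YearMonth => date.substring(0, 6)
  }

  // Keep records whose extracted date part appears in `dates`.
  def keepDate(records: Seq[DatedRecord],
               dates: List[String],
               component: Component): Seq[DatedRecord] =
    records.filter(r => dates.contains(extract(r.crawlDate, component)))

  def main(args: Array[String]): Unit = {
    val records = Seq(
      DatedRecord("http://example.org/a", "20080430"),
      DatedRecord("http://example.org/b", "20091115")
    )
    // Compare on the year only, keeping records crawled in 2008.
    keepDate(records, List("2008"), Year).foreach(r => println(r.url))
  }
}
```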
      
    
Removes all data except records from the selected source domains.
a Set of URLs for the source domains to keep
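
A domain filter of this kind can be sketched by extracting the host from each URL and testing set membership. The helper names `domainOf` and `keepDomains` are assumptions for this example, as is the choice to strip a leading `www.` and to drop URLs that cannot be parsed; the library may normalize domains differently.

```scala
import java.net.URI
import scala.util.Try

object DomainFilterSketch {
  // Extract the host of a URL, dropping a leading "www."; None if unparsable.
  def domainOf(url: String): Option[String] =
    Try(Option(new URI(url).getHost)).toOption.flatten.map(_.stripPrefix("www."))

  // Keep URLs whose domain is in the given set.
  def keepDomains(urls: Seq[String], domains: Set[String]): Seq[String] =
    urls.filter(u => domainOf(u).exists(domains.contains))

  def main(args: Array[String]): Unit = {
    val urls = Seq(
      "http://www.example.org/a",
      "http://other.net/b",
      "not a url"
    )
    keepDomains(urls, Set("example.org")).foreach(println)
  }
}
```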
      
    
Removes all data except images.

Removes all data not in the selected languages.
a Set of ISO 639-2 codes

Removes all data except the selected MIME types.
a Set of MIME types to keep

Removes all data except records whose URLs match the selected URL patterns.
a Set of regular expressions to keep
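
Pattern-based URL filtering can be sketched as follows. The `keepUrlPatterns` name and the choice to require each pattern to match the whole URL are assumptions for this example, not confirmed behavior of the library.

```scala
import scala.util.matching.Regex

object UrlPatternFilterSketch {
  // Keep URLs that are matched in full by at least one pattern.
  def keepUrlPatterns(urls: Seq[String], patterns: Set[Regex]): Seq[String] =
    urls.filter(u => patterns.exists(_.pattern.matcher(u).matches()))

  def main(args: Array[String]): Unit = {
    val urls = Seq(
      "http://example.org/blog/1",
      "http://example.org/about"
    )
    // The pattern covers every URL under /blog/ on example.org.
    val blogOnly = Set("""http://example\.org/blog/.*""".r)
    keepUrlPatterns(urls, blogOnly).foreach(println)
  }
}
```

Because the whole URL must match, a pattern usually needs a trailing `.*` to cover everything below a path prefix.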
      
    
Removes all data except the selected exact URLs.
a Set of URLs to keep

Removes all non-HTML data (images, executables, etc.), keeping only HTML pages.

A wrapper class around RDD that allows RDDs of type ARCRecord and WARCRecord to be queried via a fluent API.
To load such an RDD, please see RecordLoader.
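
The wrapper idea can be sketched with an implicit class that adds chainable filters to a collection. This example uses a plain `Seq` and a hypothetical `Rec` case class rather than Spark RDDs of ARCRecord or WARCRecord. The method names `keepMimeTypes`, `keepLanguages`, and `discardMimeTypes` are assumed for illustration and are not taken from the text above.

```scala
// Hypothetical record type; the library wraps ARCRecord/WARCRecord instead.
final case class Rec(url: String, mimeType: String, language: String)

object FluentSketch {
  // Enrich Seq[Rec] with chainable filters, mirroring how a wrapper class
  // can add query methods to an RDD.
  implicit class RecOps(val recs: Seq[Rec]) {
    def keepMimeTypes(types: Set[String]): Seq[Rec] =
      recs.filter(r => types.contains(r.mimeType))

    def keepLanguages(langs: Set[String]): Seq[Rec] =
      recs.filter(r => langs.contains(r.language))

    def discardMimeTypes(types: Set[String]): Seq[Rec] =
      recs.filterNot(r => types.contains(r.mimeType))
  }

  def main(args: Array[String]): Unit = {
    val recs = Seq(
      Rec("http://example.org/a", "text/html", "eng"),
      Rec("http://example.org/b", "image/png", "eng"),
      Rec("http://example.org/c", "text/html", "fra")
    )
    // Each filter returns a Seq[Rec], so calls chain left to right.
    val result = recs.keepMimeTypes(Set("text/html")).keepLanguages(Set("eng"))
    result.foreach(r => println(r.url))
  }
}
```

Since every method returns the wrapped collection type, filters compose in any order without intermediate variables.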