io.archivesunleashed

WARecordRDD

implicit class WARecordRDD extends Serializable

A wrapper class around RDD that allows RDDs of type ARCRecord and WARCRecord to be queried via a fluent API.

To load such an RDD, please see RecordLoader.
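
As a rough illustration of the implicit-class pattern behind this fluent API, the sketch below uses plain Scala collections instead of Spark. `Rec`, `RecOps`, `FluentSketch`, and the sample data are hypothetical stand-ins and are not part of io.archivesunleashed. Only the method names and parameter types come from this page; the filter bodies are assumptions about their behaviour.

```scala
// Hypothetical stand-in for ArchiveRecord; the real type is not modelled here.
final case class Rec(url: String, domain: String, httpStatus: String, mimeType: String)

object FluentSketch {
  // Same shape as WARecordRDD: an implicit class that adds keep*/discard*
  // methods to an existing collection type (Seq here, RDD in the toolkit).
  implicit class RecOps(recs: Seq[Rec]) {
    // Assumed semantics: keep records whose domain is in the set.
    def keepDomains(urls: Set[String]): Seq[Rec] =
      recs.filter(r => urls.contains(r.domain))

    // Assumed semantics: drop records whose status code is in the set.
    def discardHttpStatus(statusCodes: Set[String]): Seq[Rec] =
      recs.filterNot(r => statusCodes.contains(r.httpStatus))

    // Assumed semantics: keep records whose MIME type is in the set.
    def keepMimeTypes(mimeTypes: Set[String]): Seq[Rec] =
      recs.filter(r => mimeTypes.contains(r.mimeType))
  }

  val sample: Seq[Rec] = Seq(
    Rec("http://example.com/", "example.com", "200", "text/html"),
    Rec("http://example.com/missing", "example.com", "404", "text/html"),
    Rec("http://example.org/logo.png", "example.org", "200", "image/png")
  )

  // Every filter returns the collection type it was called on, so calls chain.
  val kept: Seq[Rec] = sample
    .keepDomains(Set("example.com"))
    .discardHttpStatus(Set("404"))
}
```

Because the class is implicit, the wrapper is normally never constructed by hand: once it is in scope, the filter methods appear directly on the underlying collection.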

Linear Supertypes
Serializable, AnyRef, Any

Instance Constructors

  1. new WARecordRDD(rdd: RDD[ArchiveRecord])

Value Members

  1. def all(): DataFrame
  2. def audio(): DataFrame
  3. def discardContent(contentREs: Set[Regex]): RDD[ArchiveRecord]

    Filters out records whose content matches one of the given regular expressions.

    contentREs

    a set of regular expressions

  4. def discardDate(date: String): RDD[ArchiveRecord]

    Filters out records with the given date.

    date

    the date to discard

  5. def discardDomains(urls: Set[String]): RDD[ArchiveRecord]

    Filters out records from the given source domains.

    urls

    a set of URLs for the source domains

  6. def discardHttpStatus(statusCodes: Set[String]): RDD[ArchiveRecord]

    Filters out records with the given HTTP status codes.

    statusCodes

    a set of HTTP status codes

  7. def discardLanguages(lang: Set[String]): RDD[ArchiveRecord]

    Filters out records detected as being in the given languages.

    lang

    a set of ISO 639-2 codes

  8. def discardMimeTypes(mimeTypes: Set[String]): RDD[ArchiveRecord]

    Filters out records with the given MIME types, as reported by the web server.

    mimeTypes

    a set of MIME types

  9. def discardMimeTypesTika(mimeTypes: Set[String]): RDD[ArchiveRecord]

    Filters out records with the given MIME types, as detected by Tika.

    mimeTypes

    a set of MIME types

  10. def discardUrlPatterns(urlREs: Set[Regex]): RDD[ArchiveRecord]

    Filters out records whose URL matches one of the given regular expressions.

    urlREs

    a set of regular expressions

  11. def discardUrls(urls: Set[String]): RDD[ArchiveRecord]

    Filters out records with the given URLs.

    urls

    a set of URLs

  12. def imagegraph(): DataFrame
  13. def images(): DataFrame
  14. def keepContent(contentREs: Set[Regex]): RDD[ArchiveRecord]

    Removes all records whose content does not match one of the given regular expressions.

    contentREs

    a set of regular expressions to keep

  15. def keepDate(dates: List[String], component: DateComponent = DateComponent.YYYYMMDD): RDD[ArchiveRecord]

    Removes all data that does not have one of the selected dates.

    dates

    a list of dates

    component

    the selected DateComponent enum value

  16. def keepDomains(urls: Set[String]): RDD[ArchiveRecord]

    Removes all data but selected source domains.

    urls

    a set of URLs for the source domains

  17. def keepHttpStatus(statusCodes: Set[String]): RDD[ArchiveRecord]

    Removes all data that does not have one of the selected HTTP status codes.

    statusCodes

    a set of HTTP status codes

  18. def keepImages(): RDD[ArchiveRecord]

    Removes all data except images.

  19. def keepLanguages(lang: Set[String]): RDD[ArchiveRecord]

    Removes all data not in the selected languages.

    lang

    a set of ISO 639-2 codes

  20. def keepMimeTypes(mimeTypes: Set[String]): RDD[ArchiveRecord]

    Removes all data but the selected MIME types.

    mimeTypes

    a set of MIME types

  21. def keepMimeTypesTika(mimeTypes: Set[String]): RDD[ArchiveRecord]

    Removes all data but the selected MIME types, as detected by Tika.

    mimeTypes

    a set of MIME types

  22. def keepUrlPatterns(urlREs: Set[Regex]): RDD[ArchiveRecord]

    Removes all data but selected URL patterns.

    urlREs

    a set of regular expressions

  23. def keepUrls(urls: Set[String]): RDD[ArchiveRecord]

    Removes all data but selected exact URLs.

    urls

    a set of URLs to keep

  24. def keepValidPages(): RDD[ArchiveRecord]

    Removes all non-HTML data (images, executables, etc.), keeping only HTML pages.

  25. def pdfs(): DataFrame
  26. def presentationProgramFiles(): DataFrame
  27. def removeFiledesc(): RDD[ArchiveRecord]

    Filters out filedesc:// and dns: records.

  28. def spreadsheets(): DataFrame
  29. def videos(): DataFrame
  30. def webgraph(): DataFrame

    Extracts a webgraph with columns for crawl date, source url, destination url, and anchor text.

  31. def webpages(): DataFrame

    Extracts webpages with columns for crawl date, URL, MIME type, and content.

  32. def wordProcessorFiles(): DataFrame
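
The date and URL filters listed above differ mainly in how a value is compared: by a selected date component, by exact membership, or by regular expression. The following sketch models those three comparisons on plain strings. Everything in it is an assumption made for illustration: `FilterSketch` and its `Component` type are hypothetical stand-ins for the toolkit's DateComponent, crawl dates are assumed to be digit strings beginning with yyyyMMdd, and a URL is assumed to have to match a pattern in full.

```scala
import scala.util.matching.Regex

object FilterSketch {
  // Hypothetical stand-in for DateComponent; only three components are modelled.
  sealed trait Component
  case object YYYY extends Component
  case object YYYYMM extends Component
  case object YYYYMMDD extends Component

  // Assumption: a crawl date is a digit string that starts with yyyyMMdd.
  def datePart(crawlDate: String, c: Component): String = c match {
    case YYYY     => crawlDate.take(4)
    case YYYYMM   => crawlDate.take(6)
    case YYYYMMDD => crawlDate.take(8)
  }

  // Keeps crawl dates whose selected component is one of the given dates.
  def keepDate(crawlDates: Seq[String],
               dates: List[String],
               component: Component = YYYYMMDD): Seq[String] =
    crawlDates.filter(d => dates.contains(datePart(d, component)))

  // Exact-URL filter: plain set membership.
  def keepUrls(urls: Seq[String], wanted: Set[String]): Seq[String] =
    urls.filter(wanted.contains)

  // Pattern filter. Assumption: the whole URL must match a pattern.
  def keepUrlPatterns(urls: Seq[String], urlREs: Set[Regex]): Seq[String] =
    urls.filter(u => urlREs.exists(_.pattern.matcher(u).matches()))
}
```

Under these assumptions, a pattern such as `"http://example\\.com/b/.*".r` selects every URL under that path, while a bare `"example".r` selects nothing because it does not cover the whole URL.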