io.archivesunleashed

WARecordRDD

implicit class WARecordRDD extends Serializable

A wrapper class around RDD that allows RDDs of ARCRecord and WARCRecord to be queried via a fluent API.

To load such an RDD, please see RecordLoader.
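
As a rough illustration of the fluent style, the sketch below loads records and chains two of the filters documented on this page. It assumes the aut library and Spark are on the classpath, and that `RecordLoader.loadArchives(path, sc)` returns an `RDD[ArchiveRecord]` (check the RecordLoader page for your version). The names `LoadSketch`, `validPages`, and `loadValidPages` are made up for this example.

```scala
import io.archivesunleashed._
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object LoadSketch {
  // Hypothetical helper: chains two filters from this page.
  // The implicit WARecordRDD wrapper is applied at each call,
  // and each filter hands back a plain RDD[ArchiveRecord].
  def validPages(records: RDD[ArchiveRecord]): RDD[ArchiveRecord] =
    records
      .removeFiledesc()
      .keepValidPages()

  // Assumption: RecordLoader.loadArchives(path, sc) yields RDD[ArchiveRecord].
  def loadValidPages(path: String, sc: SparkContext): RDD[ArchiveRecord] =
    validPages(RecordLoader.loadArchives(path, sc))
}
```

Keeping the filter chain in a function that takes an RDD, separate from loading, makes it possible to exercise the chain on any `RDD[ArchiveRecord]`, including an empty one.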

Linear Supertypes
Serializable, AnyRef, Any

Instance Constructors

  1. new WARecordRDD(rdd: RDD[ArchiveRecord])
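
Because WARecordRDD is declared as an implicit class, the constructor is rarely called directly: with the wrapper in scope, the compiler wraps an RDD[ArchiveRecord] whenever one of the methods below is invoked on it. The following self-contained sketch shows that mechanism with plain Scala collections instead of Spark. `Rec`, `RecOps`, and the filter bodies are hypothetical stand-ins with assumed semantics, not the library's implementation.

```scala
object ImplicitWrapperSketch {
  // Hypothetical stand-in for ArchiveRecord (not the library type).
  final case class Rec(url: String, domain: String)

  // Same shape as WARecordRDD: an implicit class wrapping a collection.
  implicit class RecOps(recs: Seq[Rec]) extends Serializable {
    // Keep only records whose domain is in the given set (assumed semantics).
    def keepDomains(domains: Set[String]): Seq[Rec] =
      recs.filter(r => domains.contains(r.domain))

    // Drop records whose URL is in the given set (assumed semantics).
    def discardUrls(urls: Set[String]): Seq[Rec] =
      recs.filterNot(r => urls.contains(r.url))
  }

  val recs: Seq[Rec] = Seq(
    Rec("http://a.example/1", "a.example"),
    Rec("http://a.example/2", "a.example"),
    Rec("http://b.example/1", "b.example")
  )

  // Explicit construction and the implicit conversion give the same result.
  val viaConstructor: Seq[Rec] = new RecOps(recs).keepDomains(Set("a.example"))
  val viaImplicit: Seq[Rec] = recs.keepDomains(Set("a.example"))

  // Each method returns the unwrapped collection, so calls chain.
  val chained: Seq[Rec] =
    recs.keepDomains(Set("a.example")).discardUrls(Set("http://a.example/2"))
}
```

Returning the unwrapped collection from every method is what lets the calls chain: the implicit conversion is simply applied again at each step.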

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##(): Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. def all(): DataFrame
  5. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  6. def audio(): DataFrame
  7. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native() @HotSpotIntrinsicCandidate()
  8. def discardContent(contentREs: Set[Regex]): RDD[ArchiveRecord]

    Discards records whose content matches any of the given regular expressions.

    contentREs

    a set of regular expressions

  9. def discardDate(date: String): RDD[ArchiveRecord]

    Discards records with the given date.

    date

    the date to discard, as a string

  10. def discardDomains(urls: Set[String]): RDD[ArchiveRecord]

    Discards records from the given source domains.

    urls

    a set of URLs for the source domains

  11. def discardHttpStatus(statusCodes: Set[String]): RDD[ArchiveRecord]

    Discards records with any of the given HTTP status codes.

    statusCodes

    a set of HTTP status codes

  12. def discardLanguages(lang: Set[String]): RDD[ArchiveRecord]

    Discards records whose detected language is in the given set.

    lang

    a set of ISO 639-2 codes

  13. def discardMimeTypes(mimeTypes: Set[String]): RDD[ArchiveRecord]

    Discards records whose MIME type, as reported by the web server, is in the given set.

    mimeTypes

    a set of MIME types

  14. def discardMimeTypesTika(mimeTypes: Set[String]): RDD[ArchiveRecord]

    Discards records whose MIME type, as detected by Tika, is in the given set.

    mimeTypes

    a set of MIME types

  15. def discardUrlPatterns(urlREs: Set[Regex]): RDD[ArchiveRecord]

    Discards records whose URL matches any of the given regular expressions.

    urlREs

    a set of regular expressions

  16. def discardUrls(urls: Set[String]): RDD[ArchiveRecord]

    Discards records with any of the given URLs.

    urls

    a set of URLs

  17. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  18. def equals(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  19. final def getClass(): Class[_]
    Definition Classes
    AnyRef → Any
    Annotations
    @native() @HotSpotIntrinsicCandidate()
  20. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @native() @HotSpotIntrinsicCandidate()
  21. def imagegraph(): DataFrame
  22. def images(): DataFrame
  23. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  24. def keepContent(contentREs: Set[Regex]): RDD[ArchiveRecord]

    Keeps only records whose content matches one of the given regular expressions.

    contentREs

    a set of regular expressions to keep

  25. def keepDate(dates: List[String], component: DateComponent = DateComponent.YYYYMMDD): RDD[ArchiveRecord]

    Keeps only records with one of the selected dates.

    dates

    a list of dates

    component

    the selected DateComponent enum value

  26. def keepDomains(urls: Set[String]): RDD[ArchiveRecord]

    Keeps only records from the selected source domains.

    urls

    a set of URLs for the source domains

  27. def keepHttpStatus(statusCodes: Set[String]): RDD[ArchiveRecord]

    Keeps only records with one of the selected HTTP status codes.

    statusCodes

    a set of HTTP status codes

  28. def keepImages(): RDD[ArchiveRecord]

    Removes all data except images.

  29. def keepLanguages(lang: Set[String]): RDD[ArchiveRecord]

    Keeps only records in one of the selected languages.

    lang

    a set of ISO 639-2 codes

  30. def keepMimeTypes(mimeTypes: Set[String]): RDD[ArchiveRecord]

    Keeps only records with one of the specified MIME types.

    mimeTypes

    a set of MIME types

  31. def keepMimeTypesTika(mimeTypes: Set[String]): RDD[ArchiveRecord]

    Keeps only records with one of the selected MIME types, as detected by Tika.

    mimeTypes

    a set of MIME types

  32. def keepUrlPatterns(urlREs: Set[Regex]): RDD[ArchiveRecord]

    Keeps only records whose URL matches one of the selected patterns.

    urlREs

    a set of regular expressions

  33. def keepUrls(urls: Set[String]): RDD[ArchiveRecord]

    Keeps only records whose URL exactly matches one of the selected URLs.

    urls

    a set of URLs to keep

  34. def keepValidPages(): RDD[ArchiveRecord]

    Removes all non-HTML data (images, executables, etc.), keeping only valid HTML pages.

  35. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  36. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native() @HotSpotIntrinsicCandidate()
  37. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native() @HotSpotIntrinsicCandidate()
  38. def pdfs(): DataFrame
  39. def presentationProgramFiles(): DataFrame
  40. def removeFiledesc(): RDD[ArchiveRecord]

    Filters out filedesc:// and dns: records.

  41. def spreadsheets(): DataFrame
  42. final def synchronized[T0](arg0: ⇒ T0): T0
    Definition Classes
    AnyRef
  43. def toString(): String
    Definition Classes
    AnyRef → Any
  44. def videos(): DataFrame
  45. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  46. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  47. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  48. def webgraph(): DataFrame

    Extracts a webgraph with columns for crawl date, source url, destination url, and anchor text.

  49. def webpages(): DataFrame

    Extracts webpages with columns for crawl date, url, MIME type, and content.

  50. def wordProcessorFiles(): DataFrame
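
The keep* and discard* filters above each return an RDD[ArchiveRecord], so they can be chained before one of the DataFrame extractors is called. A minimal sketch, assuming the aut library and Spark are on the classpath. The names `FilterSketch`, `narrow`, and `linkGraph`, the domain, the status code, the URL pattern, and the `yyyyMMdd` date string are illustrative assumptions, not values taken from the library.

```scala
import io.archivesunleashed._
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

object FilterSketch {
  // Hypothetical helper: narrow the records with filters from this page.
  def narrow(records: RDD[ArchiveRecord]): RDD[ArchiveRecord] =
    records
      .keepHttpStatus(Set("200"))               // status codes are passed as strings
      .keepDomains(Set("example.com"))          // illustrative domain
      .discardUrlPatterns(Set(""".*\.pdf""".r)) // a Set of scala.util.matching.Regex
      .keepDate(List("20150101"))               // component defaults to DateComponent.YYYYMMDD;
                                                // the yyyyMMdd string format is an assumption

  // Extract the link graph described under webgraph() from the narrowed records.
  def linkGraph(records: RDD[ArchiveRecord]): DataFrame =
    narrow(records).webgraph()
}
```

Applying the RDD filters first and the DataFrame extractor last keeps the extraction step working on the smallest possible set of records.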

Deprecated Value Members

  1. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] ) @Deprecated @deprecated
    Deprecated

    (Since version ) see corresponding Javadoc for more information.
