Class

io.archivesunleashed

WARecordRDD

implicit class WARecordRDD extends Serializable

A wrapper class around RDD that allows RDDs of type ARCRecord and WARCRecord to be queried via a fluent API.

To load such an RDD, please see RecordLoader.
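
As a rough sketch of what the fluent API can look like in practice, the snippet below chains a few of the filters documented on this page. It assumes the toolkit and Spark are on the classpath and that `RecordLoader.loadArchives(path, sc)` is the loader entry point (it is not described on this page, so check the RecordLoader documentation). The archive path, the object name `FluentApiSketch`, and the chosen filter arguments are placeholders.

```scala
import io.archivesunleashed._
import org.apache.spark.{SparkConf, SparkContext}

object FluentApiSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("warecordrdd-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Assumption: loadArchives(path, sc) returns an RDD[ArchiveRecord]; the path is a placeholder.
    val records = RecordLoader.loadArchives("/path/to/archives/*.warc.gz", sc)

    // Importing io.archivesunleashed._ brings the implicit WARecordRDD wrapper into scope,
    // so the keep*/discard* methods can be chained directly on the RDD.
    val pages = records
      .keepValidPages()
      .keepHttpStatus(Set("200"))
      .discardUrlPatterns(Set(""".*\.css$""".r))

    println(s"Records kept: ${pages.count()}")

    // DataFrame extraction goes through the same wrapper.
    records.webpages().printSchema()

    sc.stop()
  }
}
```

Each filter returns an `RDD[ArchiveRecord]`, so the result is implicitly wrapped again and the next filter can be applied to it.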

Linear Supertypes
Serializable, AnyRef, Any

Instance Constructors

  1. new WARecordRDD(rdd: RDD[ArchiveRecord])


Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. def all(): DataFrame

  5. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  6. def audio(): DataFrame

  7. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  8. def discardContent(contentREs: Set[Regex]): RDD[ArchiveRecord]

    Discards records whose content matches any of the given regular expressions.

    contentREs

    a set of regular expressions

  9. def discardDate(date: String): RDD[ArchiveRecord]

    Discards records with the given date.

    date

    the date to discard

  10. def discardDomains(urls: Set[String]): RDD[ArchiveRecord]

    Discards records from the given source domains.

    urls

    a set of URLs for the source domains

  11. def discardHttpStatus(statusCodes: Set[String]): RDD[ArchiveRecord]

    Discards records with any of the given HTTP status codes.

    statusCodes

    a set of HTTP status codes

  12. def discardLanguages(lang: Set[String]): RDD[ArchiveRecord]

    Discards records whose detected language is in the given set.

    lang

    a set of ISO 639-2 codes

  13. def discardMimeTypes(mimeTypes: Set[String]): RDD[ArchiveRecord]

    Discards records whose MIME type, as reported by the web server, is in the given set.

    mimeTypes

    a set of MIME types

  14. def discardMimeTypesTika(mimeTypes: Set[String]): RDD[ArchiveRecord]

    Discards records whose MIME type, as detected by Tika, is in the given set.

    mimeTypes

    a set of MIME types

  15. def discardUrlPatterns(urlREs: Set[Regex]): RDD[ArchiveRecord]

    Discards records whose URL matches any of the given regular expressions.

    urlREs

    a set of regular expressions

  16. def discardUrls(urls: Set[String]): RDD[ArchiveRecord]

    Discards records with any of the given URLs.

    urls

    a set of URLs

  17. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  18. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  19. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  20. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  21. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  22. def imageLinks(): DataFrame

  23. def images(): DataFrame

  24. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  25. def keepContent(contentREs: Set[Regex]): RDD[ArchiveRecord]

    Keeps only records whose content matches at least one of the given regular expressions.

    contentREs

    a set of regular expressions to keep

  26. def keepDate(dates: List[String], component: DateComponent = DateComponent.YYYYMMDD): RDD[ArchiveRecord]

    Keeps only records whose date matches one of the given dates.

    dates

    a list of dates

    component

    the selected DateComponent enum value (defaults to DateComponent.YYYYMMDD)

  27. def keepDomains(urls: Set[String]): RDD[ArchiveRecord]

    Keeps only records from the given source domains.

    urls

    a set of URLs for the source domains

  28. def keepHttpStatus(statusCodes: Set[String]): RDD[ArchiveRecord]

    Keeps only records with one of the given HTTP status codes.

    statusCodes

    a set of HTTP status codes

  29. def keepImages(): RDD[ArchiveRecord]

    Removes all data except images.

  30. def keepLanguages(lang: Set[String]): RDD[ArchiveRecord]

    Keeps only records whose detected language is in the given set.

    lang

    a set of ISO 639-2 codes

  31. def keepMimeTypes(mimeTypes: Set[String]): RDD[ArchiveRecord]

    Keeps only records whose MIME type is in the given set.

    mimeTypes

    a set of MIME types

  32. def keepMimeTypesTika(mimeTypes: Set[String]): RDD[ArchiveRecord]

    Keeps only records whose MIME type, as detected by Tika, is in the given set.

    mimeTypes

    a set of MIME types

  33. def keepUrlPatterns(urlREs: Set[Regex]): RDD[ArchiveRecord]

    Keeps only records whose URL matches one of the given regular expressions.

    urlREs

    a set of regular expressions

  34. def keepUrls(urls: Set[String]): RDD[ArchiveRecord]

    Keeps only records whose URL exactly matches one of the given URLs.

    urls

    a set of URLs to keep

  35. def keepValidPages(): RDD[ArchiveRecord]

    Removes all non-HTML data (images, executables, etc.), keeping only HTML pages.

  36. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  37. final def notify(): Unit

    Definition Classes
    AnyRef
  38. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  39. def pdfs(): DataFrame

  40. def presentationProgramFiles(): DataFrame

  41. def spreadsheets(): DataFrame

  42. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  43. def textFiles(): DataFrame

  44. def toString(): String

    Definition Classes
    AnyRef → Any
  45. def videos(): DataFrame

  46. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  47. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  48. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  49. def webgraph(): DataFrame

    Extracts a webgraph with columns for crawl date, source url, destination url, and anchor text.

  50. def webpages(): DataFrame

    Extracts webpages with columns for crawl date, url, MIME type, and content.

  51. def wordProcessorFiles(): DataFrame

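
The keep* and discard* members listed above come in complementary pairs, and each returns a collection of records so that calls can be chained. The following is a minimal, self-contained sketch of that pattern over a plain `Seq` rather than a Spark RDD. `Rec`, `RecOps`, and the sample data are hypothetical stand-ins, and the sketch assumes substring matching (`findFirstIn`) for regular expressions; the toolkit's own matching rules may differ in detail.

```scala
import scala.util.matching.Regex

object KeepDiscardSketch {
  // Hypothetical stand-in for ArchiveRecord; not part of the toolkit.
  final case class Rec(url: String, httpStatus: String, content: String)

  // Same idea as WARecordRDD: an implicit class adds filter methods to a collection.
  implicit class RecOps(recs: Seq[Rec]) {
    // Keep records whose status code is in the set.
    def keepHttpStatus(codes: Set[String]): Seq[Rec] =
      recs.filter(r => codes.contains(r.httpStatus))

    // Complement of keepHttpStatus for the same set.
    def discardHttpStatus(codes: Set[String]): Seq[Rec] =
      recs.filterNot(r => codes.contains(r.httpStatus))

    // Keep records whose URL matches at least one pattern (assumed substring match).
    def keepUrlPatterns(urlREs: Set[Regex]): Seq[Rec] =
      recs.filter(r => urlREs.exists(_.findFirstIn(r.url).isDefined))

    // Drop records whose content matches any pattern (assumed substring match).
    def discardContent(contentREs: Set[Regex]): Seq[Rec] =
      recs.filterNot(r => contentREs.exists(_.findFirstIn(r.content).isDefined))
  }

  val sample: Seq[Rec] = Seq(
    Rec("http://example.com/a.html", "200", "hello archive"),
    Rec("http://example.com/b.html", "404", "not found"),
    Rec("http://example.org/c.html", "200", "spam spam spam")
  )

  def main(args: Array[String]): Unit = {
    val kept = sample
      .keepHttpStatus(Set("200"))
      .keepUrlPatterns(Set("""example\.(com|org)""".r))
      .discardContent(Set("spam".r))
    kept.foreach(r => println(r.url)) // prints http://example.com/a.html
  }
}
```

In this sketch, `keepHttpStatus` and `discardHttpStatus` called with the same set split the input into two disjoint parts, which is the relationship the paired member names suggest.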
