Package

io.archivesunleashed.spark

matchbox

package matchbox

Type Members

  1. class NERCombinedJson extends Serializable

    Classifies records using NER and stores results as JSON

Value Members

  1. object ComputeImageSize

    Created by youngbinkim on 7/7/16.

  2. object ComputeMD5

    Computes the MD5 checksum.

  3. object DetectLanguage

  4. object DetectMimeTypeTika

    A UDF to detect MIME types

  5. object ExtractAtMentions

  6. object ExtractBoilerpipeText

    UDF for extracting raw text content from an HTML page, minus "boilerplate" content (using boilerpipe).

  7. object ExtractDate

    Simple wrapper for getting different parts of a date

  8. object ExtractDomain

  9. object ExtractEntities

    Extracts entities

  10. object ExtractGraph

    e.g. when done:

    $ cat nodes.partjson/part-* > nodes.json && cat links.partjson/part-* > links.json
    $ jq -c -n --slurpfile nodes nodes.json --slurpfile links links.json '{nodes: $nodes, links: $links}' > graph.json

  11. object ExtractHashtags

  12. object ExtractImageLinks

    UDF for extracting image links from a webpage given the HTML content (using Jsoup).

  13. object ExtractLinks

    UDF for extracting links from a webpage given the HTML content (using Jsoup).

  14. object ExtractPopularImages

    Extract most popular images

    limit: number of most popular images in the output
    timeoutVal: time allowed to connect to each image

  15. object ExtractTextFromPDFs

  16. object ExtractUrls

  17. object NER3Classifier

    UDF which reads in a text string, and returns entities identified by the configured Stanford NER classifier

  18. object RecordLoader

  19. object RemoveHTML

  20. object RemoveHttpHeader

    Created by youngbinkim on 7/9/16.

  21. object StringUtils

  22. object TupleFormatter

  23. object TweetUtils

  24. object WriteGEXF

    UDF for exporting an RDD representing a collection of links to a GEXF file.

  25. object WriteGraphML

    UDF for exporting an RDD representing a collection of links to a GraphML file.
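
As an illustration of the kind of helper listed above as ComputeMD5, the following is a minimal sketch of computing an MD5 checksum on the JVM. It is not the library's implementation: the object name Md5Sketch and the method md5Hex are hypothetical, and the actual signature of ComputeMD5 is not shown on this page. The sketch assumes the checksum is reported as a lowercase hexadecimal string and that text input is encoded as UTF-8.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

// Hypothetical helper for illustration only; not the matchbox ComputeMD5 object.
object Md5Sketch {
  /** Returns the MD5 digest of `bytes` as a 32-character lowercase hex string. */
  def md5Hex(bytes: Array[Byte]): String =
    MessageDigest.getInstance("MD5")
      .digest(bytes)
      // Mask to an unsigned value so negative bytes format as two hex digits.
      .map(b => f"${b & 0xff}%02x")
      .mkString

  /** Convenience overload: hashes the UTF-8 encoding of `text` (an assumption). */
  def md5Hex(text: String): String =
    md5Hex(text.getBytes(StandardCharsets.UTF_8))
}
```

A checksum of this kind is typically used to detect byte-identical content, for example duplicate images, without comparing the full payloads.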
