Classifies records using NER and stores results as JSON
Created by youngbinkim on 7/7/16.
Computes the MD5 checksum.
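A minimal sketch of the checksum step, using Python's standard `hashlib`. The function name `compute_md5` and the choice to return a lowercase hex string are assumptions for illustration; the original comment does not state the input type or output encoding.

```python
import hashlib


def compute_md5(data: bytes) -> str:
    """Return the hex-encoded MD5 digest of the given bytes (hypothetical helper)."""
    return hashlib.md5(data).hexdigest()


if __name__ == "__main__":
    print(compute_md5(b"hello world"))
```

MD5 is suitable here as a content fingerprint (for example, to detect identical payloads), not as a security measure.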
A UDF to detect MIME types
UDF for extracting raw text content from an HTML page, minus "boilerplate" content (using boilerpipe).
Simple wrapper for getting different parts of a date
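As an illustration of what such a wrapper might look like, the sketch below slices a timestamp string into its parts. It assumes dates arrive as 14-digit `yyyyMMddHHmmss` strings; that format, the function name `extract_date`, and the component names are all assumptions, not taken from the original code.

```python
# Hypothetical component names mapped to (start, end) offsets
# in an assumed yyyyMMddHHmmss timestamp string.
DATE_SLICES = {
    "YYYY": (0, 4),
    "MM": (4, 6),
    "DD": (6, 8),
    "YYYYMM": (0, 6),
    "YYYYMMDD": (0, 8),
}


def extract_date(full_date: str, component: str) -> str:
    """Return the requested part of a timestamp string."""
    start, end = DATE_SLICES[component]
    return full_date[start:end]


if __name__ == "__main__":
    print(extract_date("20160707123456", "YYYYMM"))
```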
Extracts entities
e.g.
UDF for extracting image links from a webpage given the HTML content (using Jsoup).
UDF for extracting links from a webpage given the HTML content (using Jsoup).
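The two extraction UDFs above rely on Jsoup. As a rough, library-free sketch of the same idea, the example below uses Python's standard `html.parser` as a stand-in and resolves relative URLs against a base URL. The class and function names, and the decision to resolve against a base URL, are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects <a href> targets and <img src> targets from an HTML string."""

    def __init__(self, base_url: str = "") -> None:
        super().__init__()
        self.base_url = base_url
        self.links = []   # absolute URLs found in <a href="...">
        self.images = []  # absolute URLs found in <img src="...">

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a" and attr_map.get("href"):
            self.links.append(urljoin(self.base_url, attr_map["href"]))
        elif tag == "img" and attr_map.get("src"):
            self.images.append(urljoin(self.base_url, attr_map["src"]))


def extract_links(html: str, base_url: str = ""):
    """Return (links, images) found in the given HTML content."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links, collector.images


if __name__ == "__main__":
    page = '<a href="/about">About</a><img src="logo.png">'
    print(extract_links(page, "http://example.com/index.html"))
```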
Extract most popular images
limit: number of most popular images in the output
timeoutVal: time allowed to connect to each image
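A minimal sketch of how a `limit` parameter could bound the result, assuming "most popular" means "occurs most often" across the collection. That definition and the function name `most_popular_images` are assumptions; the connection timeout (`timeoutVal`) is not modeled because this sketch performs no network access.

```python
from collections import Counter


def most_popular_images(image_urls, limit: int):
    """Return up to `limit` (url, count) pairs, most frequent first."""
    counts = Counter(image_urls)
    return counts.most_common(limit)


if __name__ == "__main__":
    seen = ["a.png", "b.png", "a.png", "c.png", "a.png", "b.png"]
    print(most_popular_images(seen, 2))
```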
UDF that reads a text string and returns the entities identified by the configured Stanford NER classifier
Created by youngbinkim on 7/9/16.
UDF for exporting an RDD representing a collection of links to a GDF file.
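For orientation, the sketch below renders a small collection of links in the GDF layout as it is commonly read by graph tools such as Gephi: a `nodedef>` section followed by an `edgedef>` section. The column set, the `(source, target, count)` tuple shape, and the function name `links_to_gdf` are assumptions; the columns the UDF actually writes are not stated here.

```python
def links_to_gdf(links) -> str:
    """Render (source, target, count) tuples as GDF text (illustrative only)."""
    links = list(links)  # allow generators, since the data is traversed twice
    nodes = sorted({node for src, dst, _ in links for node in (src, dst)})

    lines = ["nodedef>name VARCHAR"]
    lines.extend(nodes)
    lines.append("edgedef>node1 VARCHAR,node2 VARCHAR,weight DOUBLE")
    lines.extend(f"{src},{dst},{count}" for src, dst, count in links)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(links_to_gdf([("a.com", "b.com", 3), ("b.com", "c.com", 1)]), end="")
```

Node names containing commas would need quoting, which this sketch does not handle.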
e.g. when done:
  $ cat nodes.partjson/part-* > nodes.json && cat links.partjson/part-* > links.json
  $ jq -c -n --slurpfile nodes nodes.json --slurpfile links links.json '{nodes: $nodes, links: $links}' > graph.json
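Where `jq` is not available, the same merge can be approximated in Python. This sketch assumes each concatenated file holds one JSON value per line (an assumption about the part files, not something the comment states); the function names are hypothetical.

```python
import json


def parse_json_lines(text: str):
    """Parse one JSON value per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def build_graph_json(nodes_text: str, links_text: str) -> str:
    """Combine node and link records into a single compact JSON object."""
    graph = {
        "nodes": parse_json_lines(nodes_text),
        "links": parse_json_lines(links_text),
    }
    # Compact separators mimic the single-line output of `jq -c`.
    return json.dumps(graph, separators=(",", ":"))


if __name__ == "__main__":
    nodes = '{"id":"a"}\n{"id":"b"}\n'
    links = '{"source":"a","target":"b"}\n'
    print(build_graph_json(nodes, links))
```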