Notes on Spark

2014.12.21

Like my notes on Python dates, this is more for me than for you :).

Parsing JSON Delimited Files into SQL

Parsing newline-separated JSON records, one record per line:

{"order_id": 1, "customer_id": 1, "item": "A"}
{"order_id": 2, "customer_id": 1, "item": "B"}

Step 1: Create a SQLContext from the regular SparkContext.

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

Step 2: Use the jsonFile method on the SQLContext.

val orders = sqlContext.jsonFile("./fake-orders.json")

You can then see the schema of the newly created RDD.

orders.printSchema()

Outputs:

root
  |-- customer_id: integer (nullable = true)
  |-- item: string (nullable = true)
  |-- order_id: integer (nullable = true)

Step 3: Register the RDD as a temporary table with the SQLContext.

orders.registerTempTable("orders")

Step 4: Query the table with SQL.

val orders_x = sqlContext.sql("SELECT * FROM orders WHERE item = 'x'")
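
For my own reference, here is a rough sketch of the four steps strung together as a standalone program rather than shell commands. It assumes a Spark 1.x build on the classpath (where jsonFile and registerTempTable are available; later releases expose read.json instead) and a local master. The object name OrdersExample, the method matchingOrderIds, and the temp file are made up for illustration. The sample data only has items A and B, so the query filters on 'A' instead of 'x'.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical wrapper object; not part of Spark.
object OrdersExample {

  // Writes the two sample records to a temp file, loads them through
  // Spark SQL, and returns the order ids for item 'A' as strings.
  def matchingOrderIds(): Array[String] = {
    // Assumption: run locally, outside the shell, so we build our own context.
    val conf = new SparkConf().setAppName("orders-example").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Step 1: create a SQLContext from the regular SparkContext.
      val sqlContext = new SQLContext(sc)

      // One JSON record per line, same as the sample at the top of these notes.
      val records = Seq(
        """{"order_id": 1, "customer_id": 1, "item": "A"}""",
        """{"order_id": 2, "customer_id": 1, "item": "B"}"""
      ).mkString("\n")
      val path = Files.createTempFile("fake-orders", ".json")
      Files.write(path, records.getBytes(StandardCharsets.UTF_8))

      // Step 2: load the file; the schema is inferred from the records.
      val orders = sqlContext.jsonFile(path.toUri.toString)

      // Step 3: register it as a temporary table so SQL can refer to it by name.
      orders.registerTempTable("orders")

      // Step 4: query with SQL, then pull the rows back to the driver.
      val ordersA = sqlContext.sql("SELECT order_id FROM orders WHERE item = 'A'")

      // row(0) returns Any, which avoids depending on whether the id was
      // inferred as an integer or a long.
      ordersA.collect().map(row => row(0).toString)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    matchingOrderIds().foreach(println)
}
```

Nothing actually runs until collect() is called; the sql call on its own only describes the query. In the shell, the same thing is just orders_x.collect().foreach(println).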