Machine Learning: What is Auto Loader?
Auto Loader is a Databricks feature that automatically ingests pitch and play data. Because this data is continuously saved to cloud storage, Auto Loader scans the storage location for new files and loads them into Databricks, where data teams can begin transforming them.
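Conceptually, Auto Loader remembers which files it has already processed so that each scan only ingests new arrivals. Here is a minimal pure-Python sketch of that idea; the `discover_new_files` helper and the in-memory `seen` set are illustrative only (Auto Loader itself persists this state in a checkpoint, not in memory):

```python
# Illustrative sketch of incremental file discovery, the core idea
# behind Auto Loader: remember which files were already ingested and
# only return the new arrivals on each scan.

def discover_new_files(listing, seen):
    """Return the files in `listing` not yet in `seen`, and mark them seen."""
    new_files = [f for f in listing if f not in seen]
    seen.update(new_files)
    return new_files

seen = set()
# The first scan of the storage location finds two files.
print(discover_new_files(["pitch_001.json", "pitch_002.json"], seen))
# -> ['pitch_001.json', 'pitch_002.json']
# A later scan sees one new arrival; already-ingested files are skipped.
print(discover_new_files(["pitch_001.json", "pitch_002.json", "pitch_003.json"], seen))
# -> ['pitch_003.json']
```

The real feature does this at cloud scale, using storage listings or event notifications instead of an in-memory set.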
How to use Auto Loader?
Auto Loader works well for both small and large datasets, mainly because it can reliably ingest large streams of data. The Python skeleton below shows how to stream data with Auto Loader; `schema`, `input_path`, `checkpoint_path` and `output_path` are placeholders you supply, and the trigger shown here simply processes all currently available files:
df = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .schema(schema) \
    .load(input_path)
df.writeStream.format("delta") \
    .option("checkpointLocation", checkpoint_path) \
    .trigger(availableNow=True) \
    .start(output_path)
What are the challenges?
In this pipeline, Auto Loader processes data in the JSON file format (the format in which Statcast data is saved). JSON organises data into nested objects and arrays, which differs from how we typically see tabular data; a CSV file, for example, is organised into rows and columns. JSON can of course still be interpreted and is widely used; however, it can be more difficult to work with at large data volumes.
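To illustrate the difference, here is a small standard-library sketch that flattens a nested JSON record into the flat column/value shape a CSV row would have. The sample record and the `flatten` helper are illustrative, not part of any Databricks API:

```python
import json

def flatten(obj, prefix=""):
    """Flatten nested dicts into a single level of dotted column names."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat

# A nested JSON record: one top-level "tags" object with nested fields.
record = json.loads(
    '{"tags": {"page": {"name": "home", "id": "7"}, "eventType": "pitch"}}'
)
print(flatten(record))
# -> {'tags.page.name': 'home', 'tags.page.id': '7', 'tags.eventType': 'pitch'}
```

Each dotted key corresponds to a column a CSV file would hold directly; with JSON, that structure has to be extracted first.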
How to use Databricks’ semi-structured data support with Auto Loader
You can transform the nested JSON data into a structured, tabular format by using Databricks' semi-structured data support to extract the fields you need.
As the data is loaded by Auto Loader, save it to a Delta table. Delta Lake is an open-format storage layer that brings reliability, security and performance to a data lake. Semi-structured support with Delta lets you retain as much of the nested data as you need: you can keep nested objects as a column within the Delta table without flattening out all of the JSON data.
spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaLocation", "") \
    .load("") \
    .selectExpr(
        "*",
        "tags:page.name",     # extracts {"tags":{"page":{"name":...}}}
        "tags:page.id::int",  # extracts {"tags":{"page":{"id":...}}} and casts to int
        "tags:eventType"      # extracts {"tags":{"eventType":...}}
    )
How does Auto Loader benefit you?
Data teams can clearly interpret new data and build analytics quickly, which gives their team a competitive advantage.
Auto Loader continuously streams in new data after each pitch, semi-structured data support transforms it into a consumable format, and Delta Lake organises it for effective use.
Example of Auto Loader writing data to a Delta table as a stream:
# Define the schema and the input, checkpoint, and output paths.
read_schema = ("id int, " +
    "firstName string, " +
    "middleName string, " +
    "lastName string, " +
    "gender string, " +
    "birthDate timestamp, " +
    "ssn string, " +
    "salary int")

json_read_path = '/FileStore/streaming-uploads/people-10m'
checkpoint_path = '/mnt/delta/people-10m/checkpoints'
save_path = '/mnt/delta/people-10m'

# Read the JSON files incrementally with Auto Loader (cloudFiles).
people_stream = (spark.readStream
    .format('cloudFiles')
    .option('cloudFiles.format', 'json')
    .option('cloudFiles.maxFilesPerTrigger', 1)
    .option('multiLine', True)
    .schema(read_schema)
    .load(json_read_path))

# Write the stream to a Delta table, tracking progress in the checkpoint.
(people_stream.writeStream
    .format('delta')
    .outputMode('append')
    .option('checkpointLocation', checkpoint_path)
    .start(save_path))
If you are a Machine Learning specialist looking for your next exciting opportunity, Noah partners with some of the most exciting technology businesses in emerging markets.