This is the implementation of the Databricks data handler for MindsDB.

Databricks is a data lakehouse that unifies the best of data warehouses and data lakes in one simple platform to handle all your data, analytics, and AI use cases. It’s built on an open and reliable data foundation that efficiently handles all data types and applies one common security and governance approach across all of your data and cloud platforms.

Prerequisites

Before proceeding, ensure the following prerequisites are met:

  1. Install MindsDB locally via Docker or use MindsDB Cloud.
  2. To connect Databricks to MindsDB, install the required dependencies following this instruction.
  3. Install or ensure access to Databricks.

Implementation

This handler is implemented using databricks-sql-connector, a Python library that allows you to use Python code to run SQL commands on Databricks clusters and Databricks SQL warehouses.

The required arguments to establish a connection are as follows:

  • server_hostname is the server hostname for the cluster or SQL warehouse.
  • http_path is the HTTP path of the cluster or SQL warehouse.
  • access_token is a Databricks personal access token for the workspace.

There are several optional arguments as follows:

  • session_configuration is a dictionary of Spark session configuration parameters.
  • http_headers stores additional (key, value) pairs to set in HTTP headers on every RPC request the client makes.
  • catalog is the catalog to use for the connection. Typically, defaults to hive_metastore if not provided.
  • schema is the schema (database) to use for the connection. Defaults to default if not provided.

Usage

In order to make use of this handler and connect to the Databricks database in MindsDB, the following syntax can be used:

CREATE DATABASE databricks_datasource
WITH
  engine = 'databricks',
  parameters = {
      "server_hostname": "adb-1234567890123456.7.azuredatabricks.net",
      "http_path": "sql/protocolv1/o/1234567890123456/1234-567890-test123",
      "access_token": "dapi1234567890ab1cde2f3ab456c7d89efa",
      "schema": "example_db"
  };

You can use this established connection to query your table as follows:

SELECT *
FROM databricks_datasource.example_table;