Data manipulation using Databricks
When preparing data for a bulk insert into Azure Storage Tables using ADF, it is easier to first get the data into a format that ADF can consume directly.
A recent example: I needed to load a large number of GUIDs into a storage table, and I had chosen to partition the entries on the first two characters of each GUID.
I was given a CSV with the list of GUIDs, and I used something like the following code in a notebook within Databricks to add a new column to the CSV for my PartitionKey.
import pandas as pd
# Adjust the path to where your file is located in the Workspace
csv_path = '/Workspace/Users/dwaineharding/origin.csv'
# Read the CSV file into a Pandas DataFrame
df = pd.read_csv(csv_path)
# Show the first few rows
df.head()
# Create a new column called 'partition' containing the first two characters of the first column (this will become the PartitionKey)
df['partition'] = df[df.columns[0]].astype(str).str[:2]
# Display the result
df.head()
output_path = '/Workspace/Users/dwaineharding/output.csv'  # Workspace files path
# Save the DataFrame to CSV
df.to_csv(output_path, index=False)
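Before exporting, it can be worth a quick sanity check that the new column looks right, for example by counting how many rows fall into each partition value. A minimal sketch using the same DataFrame:
# Optional sanity check: number of rows per partition value
df['partition'].value_counts().head(10)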
The next step was to export the CSV from Databricks. As it was too large to simply download, I needed to upload it to a Storage Account and access it from there.
from azure.storage.blob import BlobServiceClient
# Placeholder values: replace with your storage account connection string, container name and target blob name
connect_str = "conn-string-from-blob-account"
container_name = "container-name"
blob_name = "new-file-name.csv"
blob_service_client = BlobServiceClient.from_connection_string(connect_str)
blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)
with open("/Workspace/Users/dwaineharding/output.csv", "rb") as data:
    blob_client.upload_blob(data, overwrite=True)
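To confirm the upload landed where expected, you can list the blobs in the container using the client created above; a quick check:
# Optional: verify the file arrived in the container
container_client = blob_service_client.get_container_client(container_name)
for blob in container_client.list_blobs():
    print(blob.name, blob.size)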
Following the above, I was then able to use an ADF Copy Data activity to take the data from the Storage Account and import it into the Storage Table en masse.
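The ADF side is configured in the portal, but for context (or for small batches where ADF is overkill) the equivalent operation can be done directly from the notebook with the azure-data-tables package. This is only a sketch, not what the ADF pipeline does internally: the table name "guids" is an assumption, and it assumes the same storage account connection string also exposes a table endpoint.
from azure.data.tables import TableServiceClient  # requires the azure-data-tables package
# Assumed table name; every entity needs PartitionKey and RowKey properties
table_service = TableServiceClient.from_connection_string(connect_str)
table_client = table_service.create_table_if_not_exists(table_name="guids")
# Row-by-row upsert is fine for small batches; any other CSV columns could be added as extra entity properties
for _, row in df.iterrows():
    entity = {
        "PartitionKey": row["partition"],       # first two characters of the GUID
        "RowKey": str(row[df.columns[0]]),      # the GUID itself
    }
    table_client.upsert_entity(entity)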
If you want to manually add some rows, you can do something like this:
new_rows = [
    # Column names should match the existing DataFrame; 'partition' holds the PartitionKey value
    {'external_id': 'guid-in-here', 'another_field': 1, 'partition': 'partition-key-here'}
]
new_df = pd.DataFrame(new_rows)
df = pd.concat([df, new_df], ignore_index=True)
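If you do append rows this way, remember to write the DataFrame back out (and re-upload it) before running the ADF import, so the extra rows are included, for example:
# Re-save so the appended rows make it into the exported file
df.to_csv(output_path, index=False)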